In an article recently submitted to the arXiv* server, researchers introduced CHEW (CHanging Events in Wikipedia), a novel dataset for evaluating the ability of large language models (LLMs) to understand and generate timelines of entities and events based on Wikipedia revisions. The goal was to test whether LLMs can accurately track and reflect changes in information over time, thereby addressing the temporal misalignment often observed in these models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
LLMs are powerful neural models that generate natural language text after being trained on large amounts of web text. However, they often suffer from temporal misalignment, meaning they are unaware of changes that occur over time. For example, they may not recognize a change in a person’s occupation, location, or status, or a new outcome of an event. This can lead to inaccurate or outdated information.
To address this, researchers have proposed methods to align LLMs to temporal information, such as in-domain pretraining, neologism-focused pretraining, knowledge editing, continual learning, and model refinement. However, there is a lack of benchmarks to test the effectiveness of these methods and the temporal capabilities of LLMs.
About the Research
In this paper, the authors proposed CHEW, a dataset of changing events in Wikipedia presented in naturally occurring text. CHEW is derived from the TAQA dataset, which pairs Wikipedia articles with temporal question-answer pairs. The researchers extracted pairs of Wikipedia revisions that reflect changes in the answers to these questions, labeling them as positive (indicating a change) or negative (no change). They ensured that the two revisions in each pair were sufficiently similar and that the evidence for a change was contained within a single sentence.
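As a rough illustration of this construction step, the following Python sketch pairs two revisions of the same article and labels the pair. The similarity threshold and helper structure are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of CHEW-style revision pairing (not the authors' code).
# Assumes TAQA-style gold answers are available for each timestamp.
from difflib import SequenceMatcher

def similar_enough(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Keep only revision pairs that are mostly identical, so any change
    is localized rather than a full rewrite. Threshold is a placeholder."""
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

def label_pair(rev_t1: str, rev_t2: str, answer_t1: str, answer_t2: str):
    """Label a revision pair as positive (answer changed) or negative."""
    if not similar_enough(rev_t1, rev_t2):
        return None  # discard pairs that differ too much overall
    return {
        "revision_t1": rev_t1,
        "revision_t2": rev_t2,
        "label": "positive" if answer_t1 != answer_t2 else "negative",
    }
```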
The resulting dataset includes 7,021 pairs of Wikipedia revisions covering a diverse range of topics and domains. The authors also introduced four distinct data splits that vary factors such as entity overlap, temporal order, and temporal distance, in order to evaluate different aspects of temporal knowledge in LLMs.
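For instance, an entity-overlap split can be approximated by holding out entities entirely, as in this hypothetical sketch; the field names and split semantics are assumptions, and the sketch presumes each pair records its entity.

```python
# Hypothetical entity-disjoint split: test entities never appear in
# training, probing generalization to unseen entities.
import random

def entity_disjoint_split(pairs, test_fraction=0.2, seed=0):
    entities = sorted({p["entity"] for p in pairs})
    random.Random(seed).shuffle(entities)
    cut = int(len(entities) * test_fraction)
    test_entities = set(entities[:cut])
    train = [p for p in pairs if p["entity"] not in test_entities]
    test = [p for p in pairs if p["entity"] in test_entities]
    return train, test
```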
Research Findings
The authors conducted a series of experiments using CHEW to probe the temporal knowledge and alignment of four popular open-source models: Llama2-7B, Llama2-13B, Llama3-8B, and Mistral-7B. They evaluated these models on two main tasks: generating and detecting changes.
For the generation task, the models were given a Wikipedia revision from timestamp t1 and another from timestamp t2 and were asked to generate the changes that occurred between t1 and t2 for the given entity. The accuracy of the models’ responses was measured by computing the cosine similarity between the generated responses and the content of the revision at t2. The results indicated that Llama2-13B excelled at this task, with some models reproducing the content of the revision at t2 verbatim.
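A minimal sketch of this scoring step, assuming a sentence-embedding model (all-MiniLM-L6-v2 here is an illustrative choice, not necessarily the paper's):

```python
# Sketch of the generation-task metric: cosine similarity between the
# model's generated change description and the revision content at t2.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def generation_score(generated: str, revision_t2: str) -> float:
    emb = encoder.encode([generated, revision_t2], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```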
For the detection task, the models were given pairs of Wikipedia revisions and asked to output a binary label indicating whether a change had occurred. The models' labels were then scored against the ground-truth labels. The results showed that Mistral-7B was the best performer on this task, though all models struggled more with newer entities.
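Conceptually, the evaluation loop looks something like the following sketch, where `llm` stands in for any prompted model; the prompt wording is a placeholder, not the paper's.

```python
# Sketch of the detection task: prompt an LLM for a binary change label
# and score against gold labels (matching the keys in the pairing sketch).
def detect_change(llm, rev_t1: str, rev_t2: str) -> str:
    prompt = (
        "Here are two Wikipedia revisions of the same article.\n"
        f"Revision 1: {rev_t1}\nRevision 2: {rev_t2}\n"
        "Did the factual content change? Answer 'positive' or 'negative'."
    )
    return llm(prompt).strip().lower()

def accuracy(llm, pairs) -> float:
    hits = sum(
        detect_change(llm, p["revision_t1"], p["revision_t2"]) == p["label"]
        for p in pairs
    )
    return hits / len(pairs)
```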
The researchers also fine-tuned the models using standard techniques such as low-rank adaptation (LoRA) and supervised fine-tuning (SFT). Mistral-7B showed the most significant improvement after fine-tuning, outperforming the Llama models on three of the four data splits. Additionally, giving the models more context about the task improved their performance.
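For readers unfamiliar with LoRA, here is a minimal setup using the Hugging Face PEFT library; the hyperparameters are placeholders rather than the paper's reported settings.

```python
# Minimal LoRA setup with Hugging Face PEFT, illustrating the kind of
# parameter-efficient fine-tuning described above.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=8,                                  # rank of the low-rank updates
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```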
Furthermore, the usefulness of CHEW was demonstrated on a downstream word-in-context temporal classification task using the temporal word-in-context (TempoWiC) benchmark. Embeddings from CHEW-finetuned LLMs improved performance on this task, making them comparable to encoder-only baselines such as RoBERTa (a robustly optimized BERT pretraining approach).
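The downstream comparison can be pictured with a sketch like this, where `embed` is a hypothetical helper that extracts the target word's contextual embedding from a CHEW-finetuned model; the threshold is illustrative.

```python
# TempoWiC-style check: compare embeddings of the same target word in two
# time-stamped contexts; low similarity suggests a shift in meaning or use.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def meaning_changed(embed, word, context_t1, context_t2,
                    threshold: float = 0.5) -> bool:
    # `embed(context, word)` is a hypothetical target-token embedder.
    return cosine(embed(context_t1, word), embed(context_t2, word)) < threshold
```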
Applications
This paper has several implications for natural language processing and beyond. First, CHEW is a valuable resource for studying temporal knowledge and alignment in LLMs and developing new methods to enhance their capabilities. Second, CHEW can help create more realistic and dynamic text generation systems, such as chatbots, summarizers, or storytellers, which adapt to changes in the world or language. Third, CHEW can be used to create more accurate and trustworthy text analysis systems, such as fact-checkers, sentiment analyzers, or topic classifiers, which consider the temporal dimension of language and knowledge.
Conclusion
In summary, CHEW proved to be a significant contribution to the field of natural language processing, providing a means to probe and enhance the temporal knowledge and alignment of LLMs.
The findings highlighted the potential for improving LLM performance through fine-tuning and contextual enhancements. Moving forward, the researchers acknowledged the limitations and suggested extending the dataset to other languages, exploring other types of changes, and applying CHEW to other temporal tasks.
Journal reference:
- Preliminary scientific report. Borkakoty, H., & Espinosa-Anke, L. (2024). CHEW: A Dataset of CHanging Events in Wikipedia. arXiv:2406.19116. DOI: 10.48550/arXiv.2406.19116, https://arxiv.org/abs/2406.19116