In an article recently submitted to the arXiv* server, researchers introduced CHEW (CHanging Events in Wikipedia), a novel dataset for evaluating the ability of large language models (LLMs) to understand and generate timelines of entities and events based on Wikipedia revisions. The goal was to test whether LLMs can accurately track and reflect changes in information over time, thereby addressing the temporal misalignment often observed in these models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
LLMs are powerful neural models that generate natural language text after being trained on large amounts of web text. However, they often suffer from temporal misalignment, meaning they are unaware of changes that occur over time. For example, they may not recognize a change in a person’s occupation, location, or status, or a new outcome of an event. This can lead to inaccurate or outdated information.
To address this, researchers have proposed methods to align LLMs to temporal information, such as in-domain pretraining, neologism-focused pretraining, knowledge editing, continual learning, and model refinement. However, there is a lack of benchmarks to test the effectiveness of these methods and the temporal capabilities of LLMs.
About the Research
In this paper, the authors proposed CHEW, a dataset of changing events in Wikipedia presented in naturally occurring text. CHEW is derived from the TAQA dataset, which pairs Wikipedia articles with temporal question-answer pairs. The researchers extracted pairs of Wikipedia revisions that reflect changes in the answers to these questions, labeling them as positive (indicating a change) or negative (no change). They ensured that the two revisions in each pair were sufficiently similar and that the evidence for a change was contained within a single sentence.
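As a rough illustration of this construction step, the following Python sketch pairs two revisions of the same article and labels the pair. The similarity threshold and helper structure are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch of CHEW-style revision pairing (not the authors' code).
# Assumes TAQA-style gold answers are available for each timestamp.
from difflib import SequenceMatcher

def similar_enough(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Keep only revision pairs that are mostly identical, so any change
    is localized rather than a full rewrite. Threshold is a placeholder."""
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

def label_pair(rev_t1: str, rev_t2: str, answer_t1: str, answer_t2: str):
    """Label a revision pair as positive (answer changed) or negative."""
    if not similar_enough(rev_t1, rev_t2):
        return None  # discard pairs that differ too much overall
    return {
        "revision_t1": rev_t1,
        "revision_t2": rev_t2,
        "label": "positive" if answer_t1 != answer_t2 else "negative",
    }
```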
The resulting dataset includes 7,021 pairs of Wikipedia revisions covering a diverse range of topics and domains. The authors also introduced four distinct data splits that vary factors such as entity overlap, temporal order, and temporal distance, in order to evaluate different aspects of temporal knowledge in LLMs.
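For instance, an entity-overlap split can be approximated by holding out entities entirely, as in this hypothetical sketch; the field names and split semantics are assumptions, and the sketch presumes each pair records its entity.

```python
# Hypothetical entity-disjoint split: test entities never appear in
# training, probing generalization to unseen entities.
import random

def entity_disjoint_split(pairs, test_fraction=0.2, seed=0):
    entities = sorted({p["entity"] for p in pairs})
    random.Random(seed).shuffle(entities)
    cut = int(len(entities) * test_fraction)
    test_entities = set(entities[:cut])
    train = [p for p in pairs if p["entity"] not in test_entities]
    test = [p for p in pairs if p["entity"] in test_entities]
    return train, test
```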
Research Findings
The authors conducted a series of experiments using CHEW to probe the temporal knowledge and alignment of four popular open-source models: Llama2-7B, Llama2-13B, Llama3-8B, and Mistral-7B. They evaluated these models on two main tasks: generating and detecting changes.
For the generation task, the models were given a Wikipedia revision from timestamp t1 and another from timestamp t2 and were asked to generate the changes that occurred between t1 and t2 for the given entity. The accuracy of the models’ responses was measured by computing the cosine similarity between the generated responses and the content of the revision at t2. The results indicated that Llama2-13B excelled at this task, with some models reproducing the content of the revision at t2 verbatim.
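A minimal sketch of this scoring step, assuming a sentence-embedding model (all-MiniLM-L6-v2 here is an illustrative choice, not necessarily the paper's):

```python
# Sketch of the generation-task metric: cosine similarity between the
# model's generated change description and the revision content at t2.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def generation_score(generated: str, revision_t2: str) -> float:
    emb = encoder.encode([generated, revision_t2], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```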
For the detection task, the models were given pairs of Wikipedia revisions and asked to output a binary label indicating whether a change had occurred. The models' labels were then scored against the ground-truth labels. The results showed that Mistral-7B was the best performer on this task, though all models struggled more with newer entities.
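Conceptually, the evaluation loop looks something like the following sketch, where `llm` stands in for any prompted model; the prompt wording is a placeholder, not the paper's.

```python
# Sketch of the detection task: prompt an LLM for a binary change label
# and score against gold labels (matching the keys in the pairing sketch).
def detect_change(llm, rev_t1: str, rev_t2: str) -> str:
    prompt = (
        "Here are two Wikipedia revisions of the same article.\n"
        f"Revision 1: {rev_t1}\nRevision 2: {rev_t2}\n"
        "Did the factual content change? Answer 'positive' or 'negative'."
    )
    return llm(prompt).strip().lower()

def accuracy(llm, pairs) -> float:
    hits = sum(
        detect_change(llm, p["revision_t1"], p["revision_t2"]) == p["label"]
        for p in pairs
    )
    return hits / len(pairs)
```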
The researchers also fine-tuned the models using standard techniques such as low-rank adaptation (LoRA) and supervised fine-tuning (SFT). Mistral-7B showed the most significant improvement after fine-tuning, outperforming the Llama models on three of the four data splits. Additionally, giving the models more context about the task improved their performance.
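For readers unfamiliar with LoRA, here is a minimal setup using the Hugging Face PEFT library; the hyperparameters are placeholders rather than the paper's reported settings.

```python
# Minimal LoRA setup with Hugging Face PEFT, illustrating the kind of
# parameter-efficient fine-tuning described above.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
config = LoraConfig(
    r=8,                                  # rank of the low-rank updates
    lora_alpha=16,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```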
Furthermore, the usefulness of CHEW was demonstrated on a downstream word-in-context temporal classification task using the temporal word-in-context (TempoWiC) benchmark. Embeddings from CHEW-finetuned LLMs improved performance on this task, making them comparable to encoder-only baselines such as RoBERTa (a robustly optimized BERT pretraining approach).
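The downstream comparison can be pictured with a sketch like this, where `embed` is a hypothetical helper that extracts the target word's contextual embedding from a CHEW-finetuned model; the threshold is illustrative.

```python
# TempoWiC-style check: compare embeddings of the same target word in two
# time-stamped contexts; low similarity suggests a shift in meaning or use.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def meaning_changed(embed, word, context_t1, context_t2,
                    threshold: float = 0.5) -> bool:
    # `embed(context, word)` is a hypothetical target-token embedder.
    return cosine(embed(context_t1, word), embed(context_t2, word)) < threshold
```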
Applications
This paper has several implications for natural language processing and beyond. First, CHEW is a valuable resource for studying temporal knowledge and alignment in LLMs and developing new methods to enhance their capabilities. Second, CHEW can help create more realistic and dynamic text generation systems, such as chatbots, summarizers, or storytellers, which adapt to changes in the world or language. Third, CHEW can be used to create more accurate and trustworthy text analysis systems, such as fact-checkers, sentiment analyzers, or topic classifiers, which consider the temporal dimension of language and knowledge.
Conclusion
In summary, CHEW proved to be a significant contribution to the field of natural language processing, providing a means to probe and enhance the temporal knowledge and alignment of LLMs.
The findings highlighted the potential for improving LLM performance through fine-tuning and contextual enhancements. Moving forward, the researchers acknowledged the limitations and suggested extending the dataset to other languages, exploring other types of changes, and applying CHEW to other temporal tasks.
Journal reference:
- Preliminary scientific report. Borkakoty, H., & Espinosa-Anke, L. (2024). CHEW: A Dataset of CHanging Events in Wikipedia. arXiv:2406.19116. DOI: 10.48550/arXiv.2406.19116, https://arxiv.org/abs/2406.19116