CHEW: A Dataset for Enhancing Temporal Awareness in LLMs

In an article recently submitted to the arXiv* server, researchers introduced a novel dataset called changing events in Wikipedia (CHEW) to evaluate the ability of large language models (LLMs) to understand and generate timelines of entities and events based on Wikipedia revisions. They aimed to check whether LLMs can accurately track and reflect changes in information over time, thereby addressing issues of temporal misalignment often observed in these models.

Study: CHEW: A Dataset for Enhancing Temporal Awareness in LLMs. Image Credit: Owlie Productions/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

LLMs are powerful neural models, trained on large amounts of web text, that can generate natural language. However, they often suffer from temporal misalignment: their knowledge is fixed at training time, so they are unaware of changes that happen afterward. For example, they may not recognize changes in a person’s occupation, location, or status, or new outcomes of events. This can lead to inaccurate or outdated output.

To address this, researchers have proposed methods to align LLMs to temporal information, such as in-domain pretraining, neologism-focused pretraining, knowledge editing, continual learning, and model refinement. However, there is a lack of benchmarks to test the effectiveness of these methods and the temporal capabilities of LLMs.

About the Research

In this paper, the authors proposed CHEW, a dataset of changing events in Wikipedia expressed in naturally occurring text. CHEW is derived from the TAQA dataset, which contains Wikipedia articles paired with temporal question-answer pairs. The researchers extracted pairs of Wikipedia revisions and labeled each pair as positive (the answer to the question changed between the two revisions) or negative (no change). They ensured that the two revisions in a pair were sufficiently similar and that the evidence for a change was contained within a single sentence.
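The pair-construction step described above can be pictured with a short sketch. It is only an illustration under assumptions, not the authors' pipeline: the `RevisionPair` record, the helper names, and the 0.5 similarity threshold are hypothetical, and `difflib` string similarity stands in for whatever similarity measure the authors actually used.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class RevisionPair:
    """Hypothetical record for two revisions of the same Wikipedia article."""
    old_text: str
    new_text: str
    old_answer: str  # answer to the temporal question at the older timestamp
    new_answer: str  # answer to the same question at the newer timestamp


def is_similar_enough(pair: RevisionPair, threshold: float = 0.5) -> bool:
    """Keep only pairs whose texts largely overlap (the threshold is an assumption)."""
    ratio = SequenceMatcher(None, pair.old_text, pair.new_text).ratio()
    return ratio >= threshold


def label_pair(pair: RevisionPair) -> str:
    """Label a pair 'positive' if the answer changed, otherwise 'negative'."""
    changed = pair.old_answer.strip().lower() != pair.new_answer.strip().lower()
    return "positive" if changed else "negative"


# Made-up example data, not taken from CHEW.
pair = RevisionPair(
    old_text="Jane Doe is the head coach of Example FC.",
    new_text="Jane Doe is the head coach of Sample United.",
    old_answer="Example FC",
    new_answer="Sample United",
)
if is_similar_enough(pair):
    print(label_pair(pair))  # prints "positive"
```

Filtering on similarity first keeps pairs in which the two revisions differ mainly in the changed fact, which mirrors the article's point that the evidence for a change should sit in a single sentence.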

The resulting dataset includes 7,021 pairs of Wikipedia revisions, covering a diverse range of topics and domains. The authors also introduced four distinct data splits to evaluate different aspects of temporal knowledge in LLMs, such as entity overlap, temporal order, and temporal distance.

Research Findings

The authors conducted a series of experiments using CHEW to probe the temporal knowledge and alignment of four popular open-source models: Llama2-7B, Llama2-13B, Llama3-8B, and Mistral-7B. They evaluated these models on two main tasks: generating and detecting changes.

For the generation task, the models were provided with a Wikipedia revision at a timestamp t1 and another at t2 and were asked to generate the changes that occurred between t1 and t2 for the given entity. The accuracy of the models’ responses was measured by computing the cosine similarity between the generated responses and the content in the revision at t2. The outcomes indicated that Llama2-13B excelled in this task, with some models capable of reproducing the content of the revision at t2 verbatim.
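As a rough illustration of the scoring idea, the sketch below computes cosine similarity between a generated response and the text of the revision at t2. It assumes simple bag-of-words count vectors; the article does not say which text representation the authors used, and the function names and example strings here are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the token-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up example strings, not taken from CHEW.
generated = "Jane Doe became head coach of Sample United."
revision_t2 = "Jane Doe is the head coach of Sample United."
print(round(cosine_similarity(generated, revision_t2), 3))
```

A score near 1 means the response shares most of its wording with the t2 revision, which is also why a model that copies the revision verbatim scores highest under this kind of metric.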

For the detection task, the models were given pairs of Wikipedia revisions and asked to provide a binary label indicating whether there was a change. The models' labels were then scored against the ground-truth labels. The results showed that Mistral-7B was the best performer in this task, though all models struggled more with newer entities.
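Scoring the detection task amounts to comparing predicted binary labels with the gold labels. The sketch below assumes the model answers in free text beginning with "yes" or "no"; that response format and the `parse_label` helper are hypothetical and are not described in the article.

```python
from typing import Sequence


def parse_label(response: str) -> int:
    """Map a free-text response to 1 (change) or 0 (no change).

    Assumes the response starts with "yes" when a change is detected.
    """
    return 1 if response.strip().lower().startswith("yes") else 0


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up responses and labels, for illustration only.
responses = ["Yes, the answer changed.", "No.", "yes", "No change."]
gold_labels = [1, 0, 0, 0]
predicted = [parse_label(r) for r in responses]
print(accuracy(predicted, gold_labels))  # prints 0.75
```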

The researchers also fine-tuned the models using standard techniques such as low-rank adaptation (LoRA) and supervised fine-tuning (SFT). Mistral-7B showed the largest improvement after fine-tuning, outperforming the Llama models in three out of four data splits. Additionally, giving the models more context about the task improved their performance.
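For readers unfamiliar with LoRA, the technique freezes each adapted weight matrix and learns a low-rank update made of two small matrices, so far fewer parameters are trained. The sketch below only counts parameters for one matrix; the 4096 x 4096 shape and the rank of 8 are illustrative assumptions, not settings reported in the article.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r update B @ A.

    B has shape d_out x rank and A has shape rank x d_in.
    """
    return rank * (d_out + d_in)


# Illustrative values only (assumptions, not the paper's configuration).
d_out, d_in, rank = 4096, 4096, 8
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")  # prints 16777216 65536 0.39%
```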

Furthermore, the usefulness of CHEW was demonstrated in a downstream task of word-in-context temporal classification using the temporal word-in-context (TempoWiC) benchmark dataset. Embeddings from CHEW-finetuned LLMs performed better on this task, making them comparable to encoder-only baselines such as RoBERTa (a robustly optimized BERT pretraining approach).

Applications

This paper has several implications for natural language processing and beyond. First, CHEW is a valuable resource for studying temporal knowledge and alignment in LLMs and developing new methods to enhance their capabilities. Second, CHEW can help create more realistic and dynamic text generation systems, such as chatbots, summarizers, or storytellers, which adapt to changes in the world or language. Third, CHEW can be used to create more accurate and trustworthy text analysis systems, such as fact-checkers, sentiment analyzers, or topic classifiers, which consider the temporal dimension of language and knowledge.

Conclusion

In summary, CHEW proved to be a significant contribution to the field of natural language processing, providing a means to probe and enhance the temporal knowledge and alignment of LLMs.

The findings highlighted the potential for improving LLM performance through fine-tuning and contextual enhancements. Moving forward, the researchers acknowledged the limitations and suggested extending the dataset to other languages, exploring other types of changes, and applying CHEW to other temporal tasks.


Journal reference:
  • Preliminary scientific report. Borkakoty, H., & Espinosa-Anke, L. (2024). CHEW: A Dataset of CHanging Events in Wikipedia. arXiv:2406.19116. DOI: 10.48550/arXiv.2406.19116, https://arxiv.org/abs/2406.19116

Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, July 12). CHEW: A Dataset for Enhancing Temporal Awareness in LLMs. AZoAi. Retrieved on January 09, 2025 from https://www.azoai.com/news/20240712/CHEW-A-Dataset-for-Enhancing-Temporal-Awareness-in-LLMs.aspx.

  • MLA

    Osama, Muhammad. "CHEW: A Dataset for Enhancing Temporal Awareness in LLMs". AZoAi. 09 January 2025. <https://www.azoai.com/news/20240712/CHEW-A-Dataset-for-Enhancing-Temporal-Awareness-in-LLMs.aspx>.

  • Chicago

    Osama, Muhammad. "CHEW: A Dataset for Enhancing Temporal Awareness in LLMs". AZoAi. https://www.azoai.com/news/20240712/CHEW-A-Dataset-for-Enhancing-Temporal-Awareness-in-LLMs.aspx. (accessed January 09, 2025).

  • Harvard

    Osama, Muhammad. 2024. CHEW: A Dataset for Enhancing Temporal Awareness in LLMs. AZoAi, viewed 09 January 2025, https://www.azoai.com/news/20240712/CHEW-A-Dataset-for-Enhancing-Temporal-Awareness-in-LLMs.aspx.
