Researchers Develop LongPPL and LongCE to Enhance Long-Context Language Model Evaluation

The new LongPPL metric and LongCE training loss outperform standard perplexity at predicting and improving long-context language model performance, reshaping how AI models are fine-tuned for complex tasks.

Study: What is Wrong with Perplexity for Long-context Language Modeling? Image Credit: NicoElNino / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers explained why perplexity (PPL) is unreliable for evaluating long-context capabilities in large language models (LLMs). They found that PPL overlooks key tokens by averaging across all tokens, thus masking models' true long-context performance.

To address this, the researchers proposed LongPPL, a metric that prioritizes key tokens, and introduced long-context cross-entropy (LongCE), a re-weighted training loss for model fine-tuning. Both approaches proved superior on long-context benchmarks: LongPPL better predicted performance, while LongCE improved it through fine-tuning.

Related Work

Past work highlighted the importance of long-context processing in LLMs for tasks like extended conversations, document summarization, and many-shot in-context learning. While methods have significantly extended context windows, PPL remains the standard metric for long-context evaluation. However, PPL has shown poor correlation with actual performance in long-context tasks, as it averages across all tokens, masking key ones crucial for understanding.

Refining Long-Context Model Evaluation

This paper critiques the conventional reliance on perplexity as a metric for evaluating language models' performance on long-context tasks, noting that it fails to account for specific token dependencies inherent in extended contexts. Perplexity typically measures a language model’s predictive certainty across all tokens, but this approach does not distinguish between tokens that depend on long contexts and those that do not. The authors illustrate that many tokens in long-context tasks are "long-context-agnostic," thus skewing the perplexity metric away from reflecting actual long-context comprehension.
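For context, standard perplexity is simply the exponential of the average negative log-likelihood over every token in the sequence, which is why long-context-agnostic tokens can dominate it. A minimal illustrative sketch in Python (the function and values are hypothetical, not from the paper):

    import math

    def perplexity(token_log_probs):
        # Standard perplexity: exponentiated mean negative log-likelihood over
        # *all* tokens, regardless of whether they depend on the long context.
        avg_nll = -sum(token_log_probs) / len(token_log_probs)
        return math.exp(avg_nll)

    # Example with three per-token log-probabilities from a language model:
    print(perplexity([-0.2, -1.5, -0.7]))  # ~2.23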

To address this, they propose identifying and weighting "key tokens," which depend heavily on the long context, using two measures: log probability gain (LPG) and log probability value (LPV). LPG measures how much a token's predicted log-probability improves when the long context is available, while LPV separates answer tokens from non-answer tokens by evaluating how predictable they are given the long context. Experimental results on the LongEval benchmark show that restricting attention to key tokens (those with high LPG and LPV) substantially realigns perplexity with long-context performance.
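In rough terms, the log probability gain compares each token's log-probability under the full long context with its log-probability under a truncated short context, and key tokens are those where this gain, together with the long-context log-probability itself, is high. The sketch below illustrates that idea under those assumptions; the function names and thresholds are illustrative placeholders, not the authors' implementation:

    def log_prob_gain(logp_long, logp_short):
        # LPG: per-token improvement in log-probability when the long context
        # is available versus a truncated short context.
        return [l - s for l, s in zip(logp_long, logp_short)]

    def select_key_tokens(logp_long, logp_short, gain_threshold=2.0, value_threshold=-2.0):
        # Mark tokens as "key" when the long context substantially helps (high LPG)
        # and the token is still well predicted given that context (high LPV).
        # Thresholds are illustrative placeholders, not values from the paper.
        gains = log_prob_gain(logp_long, logp_short)
        return [i for i, (g, v) in enumerate(zip(gains, logp_long))
                if g > gain_threshold and v > value_threshold]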

Building on these insights, the authors introduce a novel metric, LongPPL, which incorporates only key tokens into the perplexity calculation, thus offering a more accurate reflection of long-context understanding. Additionally, they propose a training objective, LongCE, that emphasizes these key tokens during fine-tuning, enhancing models’ ability to capture long-context dependencies without requiring separate evaluator models. This approach not only improves model performance on long-context benchmarks but also makes training more computationally efficient.
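Conceptually, LongPPL is perplexity restricted to the selected key tokens, while LongCE up-weights the cross-entropy contribution of tokens that benefit most from the long context. The following simplified sketch reflects that reading rather than the authors' exact formulation; in particular, the capped exponential weight is an assumption made for illustration:

    import math

    def long_ppl(logp_long, key_indices):
        # LongPPL: perplexity computed only over the identified key tokens.
        key_logps = [logp_long[i] for i in key_indices]
        return math.exp(-sum(key_logps) / max(len(key_logps), 1))

    def long_ce(logp_long, logp_short, max_weight=5.0):
        # LongCE-style loss: up-weight tokens whose prediction improves most when
        # the long context is available. The weighting scheme here is a simplified
        # proxy; the paper derives its weights from the model being trained itself,
        # without a separate evaluator model.
        total = 0.0
        for l, s in zip(logp_long, logp_short):
            gain = max(l - s, 0.0)
            weight = min(math.exp(gain), max_weight)  # cap to keep training stable
            total += -weight * l
        return total / len(logp_long)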

Enhanced Long-Context Metrics Evaluation

The researchers conducted extensive experiments to evaluate the effectiveness of their proposed metrics, LongPPL and LongCE, across various real-world and synthetic long-context tasks. For this, they utilized benchmark datasets including LongBench, which features tasks like multi-document question answering (QA) and summarization, as well as LongEval and RULER, focusing on synthetic challenges like key-value retrieval. They measured average performance on LongBench, accuracy on LongEval’s “lines” task, and scores on RULER. For consistency, the prompt length was standardized to 32k tokens for LongBench and RULER, and 1350 lines (approximately 32k tokens) for LongEval.

In implementing LongPPL and LongCE, the authors employed a sliding window technique for efficient token prediction within long contexts, handling discrepancies in tokenization when different models were used. For LongPPL, the researchers used the GovReport dataset, with government documents extending up to 32k tokens, to assess the correlation with long-context performance.
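The sliding-window evaluation described here can be pictured as advancing a fixed-size window over the long document and scoring each new chunk of tokens using only the preceding tokens that fit in the window. A hypothetical sketch, where score_fn stands in for a model call and the window and stride sizes are arbitrary:

    def sliding_window_log_probs(score_fn, tokens, window=4096, stride=2048):
        # Score a long sequence with a bounded context window. For each chunk of
        # `stride` new tokens, the preceding tokens that still fit in the window
        # serve as context; `score_fn(context, targets)` is assumed to return the
        # per-token log-probabilities of `targets` conditioned on `context`.
        log_probs = []
        for start in range(0, len(tokens), stride):
            targets = tokens[start:start + stride]
            context_start = max(0, start + len(targets) - window)
            context = tokens[context_start:start]
            log_probs.extend(score_fn(context, targets))
        return log_probs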

By evaluating LongPPL across different long-context LLMs, they discovered a strong negative correlation between LongPPL values and long-context benchmark scores, with Pearson correlation coefficients stronger than -0.8 (greater than 0.8 in magnitude) across LongBench, LongEval, and RULER tasks. This suggested that LongPPL effectively measures long-context capabilities, in contrast to the standard perplexity metric, which showed little to no correlation.

Further, the LongPPL metric maintained high compatibility with different evaluator models, including Llama-3.1-8B and Mistral Large 2, suggesting robust adaptability across models regardless of parameter size. The experiments also revealed that a hard (binary) key-token selection criterion yielded better correlations than a soft re-weighting approach. The fine-tuning experiments for LongCE involved models such as Llama-2-7B, Mistral-7B-v0.1, and Llama-2-13B.

Training datasets included PG-19, a collection of book excerpts, and a corpus of research papers, both processed with a maximum context length of 32k tokens. LongCE's re-weighting strategy based on key tokens demonstrated significant gains over the standard CE loss across almost all settings. For instance, models fine-tuned with LongCE on the PG-19 dataset consistently outperformed CE-fine-tuned models, achieving notable gains on long-context benchmarks such as LongBench, LongEval, and RULER.
LongCE’s effectiveness across models and training data confirmed its potential as a versatile module for enhancing long-context comprehension in LLMs.

The LongCE approach incurred additional overhead from the extra forward pass needed to obtain short-context probabilities, adding up to roughly 80% to standard cross-entropy training costs. Despite this, LongCE consistently outperformed standard cross-entropy, delivering faster and more effective improvements in long-context performance for large language models.

Conclusion

To sum up, the article provided a comprehensive explanation of why perplexity was ineffective in reflecting the long-context capabilities of LLMs. It introduced LongPPL, a novel metric focusing on key tokens crucial for long-context understanding, and demonstrated its strong correlation with long-context performance.

The authors also proposed LongCE, which reweighted the cross-entropy loss for fine-tuning, achieving up to 22% gains in LongEval accuracy. The analysis aimed to offer insights into enhancing long-context generation.


Journal reference:
  • Preliminary scientific report. Fang, L., et al. (2024). What is Wrong with Perplexity for Long-context Language Modeling? arXiv. DOI: 10.48550/arXiv.2410.23771, https://arxiv.org/abs/2410.23771

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

