Modern large language models (LLMs) rely on attention mechanisms and are trained with fixed context lengths, which limits the length of input sequences they can handle at evaluation time. To address this limitation, researchers have explored context-length extrapolation methods that modify the positional encodings in the attention mechanism. In a recent submission to the arXiv* server, researchers surveyed these methods, introduced novel strategies, and tested them using Meta AI's LLaMA models across three evaluation tasks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Transformers have gained prominence in recent years due to their adaptability and capacity to handle massive datasets. These models, referred to as LLMs, offer impressive capabilities but, because attention itself is order-agnostic, struggle with tasks that depend on the order of the input sequence. Positional encodings address this by injecting positional information. However, the aspiration is to enable LLMs to process input sequences longer than those seen during training.
Investigation of extended context length capability
The current study investigates the challenges of extending context-length capability in LLMs by proposing and evaluating various techniques. Rotary Position Embedding (RoPE), used by LLaMA, rotates two-dimensional slices of the query and key vectors by position-dependent angles, so that attention scores depend on the relative positions of tokens. RoPE's poor extrapolation behavior motivated the exploration of other methods such as attention with linear biases (ALiBi), xPos, and randomized positional encodings. The study also introduces novel techniques, including power scaling and a truncated basis.
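To make the mechanics concrete, below is a minimal NumPy sketch of RoPE with linear position interpolation (the "linear scaling" described above). The base of 10,000 and the scale factor shown are conventional defaults, not values confirmed by this article.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0, scale=1.0):
    """Apply rotary position embedding to vectors x of shape (seq_len, d).

    Linear position interpolation ("linear scaling") divides positions by
    `scale`, squeezing a longer sequence into the positional range seen
    during training.
    """
    seq_len, d = x.shape
    # One frequency per 2-D slice of the embedding: theta_j = base^(-2j/d)
    freqs = base ** (-np.arange(0, d, 2) / d)          # shape (d/2,)
    angles = np.outer(positions / scale, freqs)        # shape (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # paired dimensions
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# With scale=4, positions 0..16383 map onto the range 0..4095 that a
# 4k-context model was trained on.
q = np.random.randn(16384, 128)
q_rot = rope_rotate(q, np.arange(16384), scale=4.0)
```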
To assess whether extended context is actually used, the study introduces verifiable tasks that require the model to draw on information spread across a long input. These tasks, key-value retrieval and question answering, allow a comprehensive evaluation of context utilization. The novel LongChat-Lines task expands upon existing work, while the WikiQA dataset introduces the Free-form QA (FFQA) and Altered Numeric QA (AltQA) tasks, which vary where answers and questions appear within the context.
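The article does not reproduce the exact prompt format, but a key-value retrieval sample of the kind described can be sketched as follows; the line format, value ranges, and function name are hypothetical.

```python
import random

def make_kv_retrieval_sample(num_lines, seed=0):
    """Build a hypothetical key-value retrieval prompt: many lines of the
    form 'line <key>: REGISTER_CONTENT is <value>', plus a question about
    one randomly chosen line. More lines means a longer effective context."""
    rng = random.Random(seed)
    keys = rng.sample(range(100000), num_lines)
    values = [rng.randint(0, 99999) for _ in range(num_lines)]
    lines = [f"line {k}: REGISTER_CONTENT is <{v}>" for k, v in zip(keys, values)]
    target = rng.randrange(num_lines)
    prompt = "\n".join(lines) + (
        f"\nWhat is the REGISTER_CONTENT in line {keys[target]}?"
    )
    return prompt, values[target]

# Grading is exact-match against the returned value.
prompt, answer = make_kv_retrieval_sample(num_lines=500)
```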
Giraffe models and their performance
The research team introduces three novel long-context models, collectively called "Giraffe," each with 13B parameters. These comprise a 4k-context model and a 16k-context model, both trained from the base LLaMA-13B, and a 32k-context model trained from the base LLaMA2-13B. The analysis primarily focuses on the outcomes of finetuning the base LLaMA-13B model on a modified RedPajama dataset in which each sample contains exactly 4,096 tokens, comparing the positional encoding techniques described above.
Furthermore, the study applies instruction finetuning (IFT) to the base model with low-rank adaptation (LoRA), using the Vicuna dataset. Surprisingly, IFT improves accuracy on LongChat-Lines but does not significantly expand the model's usable context range; the WikiQA variants, in contrast, benefit from it. Consequently, non-IFT models are used for LongChat-Lines, while additional IFT is performed for WikiQA.
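As an illustration of this IFT setup, the sketch below wires LoRA into a causal language model with the Hugging Face peft library; the model identifier, rank, and target modules are placeholder choices, not the paper's reported configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load a base model (identifier shown is illustrative).
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b")
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")

# LoRA injects trainable low-rank matrices into selected projection
# layers while the base weights stay frozen. Values here are placeholders.
lora_config = LoraConfig(
    r=16,                               # rank of the low-rank update
    lora_alpha=32,                      # scaling applied to the update
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # only a small fraction is trainable
# ...then train on an instruction dataset (e.g., Vicuna-style conversations)...
```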
Evaluation of context length extrapolation
The study evaluates the different context-length extrapolation techniques on LongChat-Lines. The results show that xPos struggles with the task because its basis diverges from RoPE's. Linear scaling extrapolates successfully, but with a scaling factor of 16, accuracy deteriorates rapidly beyond a context length of 17,500. The power basis performs well at short contexts but declines rapidly beyond a context length of roughly 4,200. Randomized positional encoding shows some apparent extrapolation, attributable in part to how it is evaluated, and its performance declines when the upper bound on sampled positions is reduced. In contrast, the truncated basis demonstrates genuine context-length extrapolation, maintaining non-zero accuracy at longer contexts. While its performance diminishes with length, it shows promise for improved extrapolation with further investigation.
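The article does not spell out the truncated-basis formula. One plausible reading, sketched below, keeps the high-frequency components of the standard RoPE basis, clamps an intermediate band to a small fixed value, and zeroes the lowest frequencies so those dimensions carry no position signal to mis-extrapolate; the cutoff values are hypothetical.

```python
import numpy as np

def truncated_basis(d, base=10000.0, a=1e-3, b=1e-2, rho=1e-3):
    """Sketch of a truncated RoPE frequency basis. The cutoffs a, b and the
    fixed value rho are hypothetical, not taken from the paper.

    Frequencies >= b are kept as-is; those in (a, b) are clamped to rho;
    those <= a are zeroed, so the corresponding dimensions become
    position-independent and trivially extrapolate to any length.
    """
    freqs = base ** (-np.arange(0, d, 2) / d)
    return np.where(freqs >= b, freqs, np.where(freqs > a, rho, 0.0))
```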
Linear scaling and the truncated basis are also evaluated on the WikiQA variants, where the models struggle without IFT. The results mirror those on LongChat-Lines: linear scaling with a scale factor of 4 performs well up to a context length of about 7,500, and the truncated basis behaves similarly but struggles to exceed a context length of 8k.
Exploration of scaling factors and post-finetuning techniques
The study explores the impact of evaluating with a different scaling factor than the one used in training, revealing that models can be evaluated zero-shot at up to twice the training scale factor; beyond 2x, performance declines significantly. Zero-shot linear scaling is also successfully applied after finetuning with the truncated basis. Interestingly, this combination extends the effective context range and improves accuracy, diverging from the behavior of linear scaling alone.
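In code terms, this zero-shot adjustment changes only the divisor applied to token positions at evaluation time; nothing is retrained. A self-contained sketch, with illustrative factors:

```python
import numpy as np

# The frequency basis is unchanged between training and evaluation; only
# the divisor applied to positions differs. A model finetuned at scale 4
# can be probed zero-shot at scale 8 by swapping one constant.
d, base = 128, 10000.0
freqs = base ** (-np.arange(0, d, 2) / d)

train_scale, eval_scale = 4.0, 8.0
positions = np.arange(32768)
angles_eval = np.outer(positions / eval_scale, freqs)  # rotates q and k
```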
Perplexity, commonly used to measure long-context performance, is compared against the task-based evaluations. A sharp rise in perplexity coincides with complete failure beyond certain context lengths, but perplexity fails to capture the gradual accuracy degradation within the effective range. The truncated basis excels at shorter contexts despite its perplexity scores, emphasizing the value of supplementary tasks for a comprehensive evaluation of LLM capabilities.
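For reference, perplexity is the exponentiated mean negative log-likelihood per token. A minimal sketch with a Hugging Face causal LM follows; the model name and single-window protocol are illustrative, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text, model_name="huggyllama/llama-13b", max_len=4096):
    """Perplexity = exp(mean negative log-likelihood per next token),
    computed here in a single forward pass over one window."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tok(text, return_tensors="pt").input_ids[:, :max_len]
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # over its next-token predictions.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```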
Lastly, the analysis explores the effects of answer and question positioning in the WikiQA variants. The findings reveal task-specific variability in how well the models use their context: minor differences in task design lead to significant variations in the observed trends, underscoring the importance of meticulous, task-specific evaluation.
Conclusion
In summary, the current study comprehensively explores diverse strategies for enhancing the context-length extrapolation capability of pre-trained LLaMA and LLaMA2 LLMs. Custom tasks and perplexity provide complementary insights into long-context capabilities. Linear scaling (position interpolation) proves to be the most effective technique for context extrapolation, with the truncated basis also showing potential.
The researchers highlight that accuracy declines as context length increases, even while perplexity remains reasonable and outputs remain coherent. Future research should address this accuracy degradation, replicate the perplexity analysis across diverse datasets, and investigate alternative positional encoding methods and models.
Journal reference:
- Preliminary scientific report. Pal, A., Karkhanis, D., Roberts, M., Dooley, S., Sundararajan, A., and Naidu, S. (2023). Giraffe: Adventures in Expanding Context Lengths in LLMs. arXiv. DOI: https://doi.org/10.48550/arXiv.2308.10882, https://arxiv.org/abs/2308.10882