New findings reveal that longer training runs require smaller learning rates, offering a rule of thumb for transferring this key hyperparameter across token horizons and improving the overall efficiency of large language model training.
Research: Scaling Optimal LR Across Token Horizons. Image Credit: Jamie Jin / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article submitted to the arXiv preprint* server, researchers at Microsoft conducted a large-scale empirical study investigating how the optimal learning rate (LR) changed with the token horizon, that is, the total number of training tokens, during large language model (LLM) training. The study revealed that longer training necessitated smaller LRs and that the optimal LR followed a scaling law, enabling it to be estimated for longer horizons from experiments on shorter ones.
The analysis provided a rule of thumb for transferring the LR across token horizons without added overhead, meaning the process does not increase computational demands beyond current practice. Finally, the authors argued that Meta AI's LLaMA-1 model used too high an LR, and they highlighted the importance of hyperparameter transfer across data size in LLM training.
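As a rough illustration of what such a rule of thumb looks like in practice, the sketch below transfers an LR tuned at a short token horizon to a longer one under a power-law assumption. The exponent and learning-rate values are placeholders for illustration, not the paper's fitted constants.

```python
def transfer_lr(lr_short: float, tokens_short: float, tokens_long: float,
                exponent: float = 0.3) -> float:
    """Rescale an optimal LR found at a short token horizon to a longer horizon,
    assuming the optimal LR decays as a power law in the number of training tokens.
    The default exponent is purely illustrative, not the paper's fitted value."""
    return lr_short * (tokens_long / tokens_short) ** (-exponent)

# Example: an LR tuned on a cheap 10B-token sweep, transferred to a 1T-token run.
lr_10b = 3e-3                             # hypothetical sweep result
lr_1t = transfer_lr(lr_10b, 10e9, 1e12)   # extrapolated LR for the long run
print(f"Transferred LR for 1T tokens: {lr_1t:.2e}")
```

Because a transfer like this only reuses the short-horizon sweeps practitioners already run, it adds no extra training cost, which is what "without added overhead" refers to here.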
Background
Scaling laws for LLMs have been actively researched across areas such as model architecture, multi-modality, inference, and data. LR selection has been a key focus, with larger models requiring smaller LRs and methods being developed to select the optimal LR during scaling.
Recent studies have shown that the LR can be transferred across model sizes, but they assumed a fixed training horizon. A key challenge in scaling LLMs is accurately selecting the optimal LR across varying token horizons, as traditional methods often assume a fixed training duration, which can lead to overestimating the optimal LR, especially in compute-limited settings.
Optimizing LR in LLMs
Scaling laws for LLMs have been actively explored in various areas, such as model architecture, multi-modality, inference, and data. These laws provide insights into how different factors, like model size and dataset size, influence model performance, offering guidelines for efficient scaling and optimization of LLMs as these models grow in complexity.
LR selection has emerged as a critical focus within this research. Larger models typically require smaller LRs to maintain stability during training, and several methods have been developed to determine the optimal LR when scaling models. Proper LR selection ensures efficient training and prevents divergence or slow convergence.
Studies have demonstrated that LR transfer across different model sizes is feasible, allowing analysts to estimate the optimal LR for larger models based on results from smaller ones. However, many of these studies rely on the assumption of a fixed training horizon, which limits the applicability of the findings when the length of training or token horizon varies.
A key challenge in scaling LLMs is accurately selecting the optimal LR across varying token horizons, as traditional methods often assume a constant training duration. This limitation can lead to overestimating the optimal LR, particularly in compute-constrained settings, underscoring the need for more adaptable approaches to handle changes in training duration and dataset size.
Evaluating the Impact of LR
The team explores the scaling law with respect to both model size and token horizon; due to computational constraints, fully determining the joint scaling properties of the two factors is beyond the scope of the study, but preliminary insights are provided. Plotting the optimal LR as a function of model size and token horizon on logarithmic scales shows that it decreases as both model size and token horizon increase. The authors fitted these relationships to data for various model sizes, achieving strong fits, with R² values as high as 0.99 for larger models.
The study demonstrates that the scaling law holds well for larger models. However, smaller models exhibit different behavior, indicating that the derived formula may not apply reliably in that range. The scaling law nonetheless provides a robust framework for predicting the optimal LR across different model sizes and token horizons, especially for large-scale models.
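A minimal sketch of how such a log-log fit can be reproduced is shown below, assuming one already has a handful of (token horizon, empirically best LR) pairs from short sweeps. The data points, and therefore the fitted exponent and R², are invented for illustration only.

```python
import numpy as np

# Hypothetical (token horizon, best LR from a short sweep) pairs for one model size.
horizons = np.array([2e9, 5e9, 1e10, 2e10, 5e10])
best_lrs = np.array([6.0e-3, 4.5e-3, 3.5e-3, 2.8e-3, 2.0e-3])

# A power law lr_opt = c * tokens^b is a straight line in log-log space:
# log(lr_opt) = log(c) + b * log(tokens).
b, log_c = np.polyfit(np.log(horizons), np.log(best_lrs), deg=1)

# Goodness of fit (R^2), analogous to the fit quality the authors report.
pred = log_c + b * np.log(horizons)
residuals = np.log(best_lrs) - pred
r2 = 1.0 - residuals.var() / np.log(best_lrs).var()
print(f"fitted exponent b = {b:.3f}, R^2 = {r2:.3f}")

# Extrapolate the fitted law to a much longer horizon.
lr_at_1t = np.exp(log_c + b * np.log(1e12))
print(f"predicted optimal LR at 1T tokens: {lr_at_1t:.2e}")
```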
A case study on LLaMA-1 was conducted to evaluate whether the LRs used in that model aligned with the derived scaling law. Through small-scale experiments at different token horizons, the authors found that the optimal LR for LLaMA-1 was significantly lower than the one actually used, suggesting that LLaMA-1's LR was overestimated and that its training may have been less efficient as a result.
Using a higher-than-optimal LR may have hurt LLaMA-1's overall performance, with an estimated upper bound of 0.027 on the resulting validation-loss penalty; the gap in validation loss between the optimal LR and the one used in LLaMA-1 could therefore have been meaningful.
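A hedged sketch of what such a check looks like: extrapolate an LR tuned at a short horizon to a production-scale token budget and compare it with the LR actually used. The short-horizon LR and the exponent below are placeholders, and the LLaMA-1 7B figures (roughly 1T training tokens and a peak LR of 3e-4) are taken from the LLaMA-1 report rather than from this paper.

```python
# Placeholder sweep result and exponent; only the LLaMA-1 7B figures are from its report.
lr_tuned_at_20b = 6.0e-4      # hypothetical best LR from a 20B-token sweep
exponent = 0.3                # illustrative power-law exponent, as above

production_tokens = 1.0e12    # approximate LLaMA-1 7B training budget
used_lr = 3.0e-4              # approximate peak LR used for LLaMA-1 7B

predicted_lr = lr_tuned_at_20b * (production_tokens / 20e9) ** (-exponent)
print(f"extrapolated optimal LR: {predicted_lr:.2e} vs LR actually used: {used_lr:.2e}")
```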
Conclusion
To sum up, the study focused on scaling token horizons and adjusting LR within fixed LLM configurations, acknowledging its limitations in scope and computational demands. It extended scaling laws to around 800 billion tokens but noted that many state-of-the-art models are trained for longer durations. The findings indicated that optimal LR decreased with increasing token horizons, allowing for hyperparameter transfer across these horizons.
Practitioners were advised to use the derived formula for estimating the optimal LR, which adds no overhead and allows efficient transfer across token horizons during training. The case study on LLaMA-1 demonstrated that it was trained with a significantly larger LR than optimal. The authors concluded that hyperparameter transfer across token horizons remains an under-explored aspect of LLM training.
Journal reference:
- Preliminary scientific report.
Bjorck, J., Benhaim, A., Chaudhary, V., Wei, F., & Song, X. (2024). Scaling Optimal LR Across Token Horizons. arXiv. https://arxiv.org/abs/2409.19913