New findings reveal that longer training runs require smaller learning rates, offering a rule of thumb for transferring this key hyperparameter across token horizons and improving the overall efficiency of large language model training.
Research: Scaling Optimal LR Across Token Horizons. Image Credit: Jamie Jin / Shutterstock
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In an article submitted to the arXiv preprint* server, researchers at Microsoft conducted a large-scale empirical study investigating how the optimal learning rate (LR) changed with the token horizon, that is, the total number of training tokens, during large language model (LLM) training. The study revealed that longer training necessitated smaller LRs and that the optimal LR followed a scaling law, enabling it to be estimated for longer horizons from experiments on shorter ones.
The analysis provided a rule of thumb for transferring the LR across token horizons without added overhead, meaning the process does not increase computational demands beyond current practice. Finally, the authors argued that Meta AI's LLaMA-1 model used too high an LR, and they highlighted the importance of hyperparameter transfer across data size in LLM training.
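As a rough illustration of what such a rule of thumb looks like in practice, the sketch below transfers an LR tuned at a short token horizon to a longer one under a power-law assumption. The exponent and learning-rate values are placeholders for illustration, not the paper's fitted constants.

```python
def transfer_lr(lr_short: float, tokens_short: float, tokens_long: float,
                exponent: float = 0.3) -> float:
    """Rescale an optimal LR found at a short token horizon to a longer horizon,
    assuming the optimal LR decays as a power law in the number of training tokens.
    The default exponent is purely illustrative, not the paper's fitted value."""
    return lr_short * (tokens_long / tokens_short) ** (-exponent)

# Example: an LR tuned on a cheap 10B-token sweep, transferred to a 1T-token run.
lr_10b = 3e-3                             # hypothetical sweep result
lr_1t = transfer_lr(lr_10b, 10e9, 1e12)   # extrapolated LR for the long run
print(f"Transferred LR for 1T tokens: {lr_1t:.2e}")
```

Because a transfer like this only reuses the short-horizon sweeps practitioners already run, it adds no extra training cost, which is what "without added overhead" refers to here.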
Background
Scaling laws for LLMs have been actively researched across areas such as model architecture, multi-modality, inference, and data. LR selection has been a key focus, with larger models requiring smaller LRs and methods being developed to select the optimal LR during scaling.
Recent studies have shown that the LR can be transferred across model sizes, but they assumed a fixed training horizon. A key challenge in scaling LLMs is accurately selecting the optimal LR across varying token horizons, as traditional methods often assume a fixed training duration, which can lead to overestimating the optimal LR, especially in compute-limited settings.
Optimizing LR in LLMs
Scaling laws for LLMs have been actively explored in various areas, such as model architecture, multi-modality, inference, and data. These laws provide insights into how different factors, like model size and dataset size, influence model performance, offering guidelines for efficient scaling and optimization of LLMs as these models grow in complexity.
LR selection has emerged as a critical focus within this research. Larger models typically require smaller LRs to maintain stability during training, and several methods have been developed to determine the optimal LR when scaling models. Proper LR selection ensures efficient training and prevents divergence or slow convergence.
Studies have demonstrated that LR transfer across different model sizes is feasible, allowing analysts to estimate the optimal LR for larger models based on results from smaller ones. However, many of these studies rely on the assumption of a fixed training horizon, which limits the applicability of the findings when the length of training or token horizon varies.
A key challenge in scaling LLMs is accurately selecting the optimal LR across varying token horizons, as traditional methods often assume a constant training duration. This limitation can lead to overestimating the optimal LR, particularly in compute-constrained settings, underscoring the need for more adaptable approaches to handle changes in training duration and dataset size.
Evaluating the Impact of LR
The team explores the scaling law with respect to both model size and token horizon; due to computational constraints, fully determining the joint scaling properties of the two factors is beyond the scope of the study, but preliminary insights are provided. Plotting the optimal LR as a function of model size and token horizon on logarithmic scales shows that it decreases as both model size and token horizon increase. The authors fitted these relationships to data for various model sizes, achieving strong fits, with R² values as high as 0.99 for larger models.
The study demonstrates that the scaling law holds well for larger models. However, smaller models exhibit different behavior, indicating that the derived formula may not apply reliably in that range. The scaling law nonetheless provides a robust framework for predicting the optimal LR across different model sizes and token horizons, especially for large-scale models.
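A minimal sketch of how such a log-log fit can be reproduced is shown below, assuming one already has a handful of (token horizon, empirically best LR) pairs from short sweeps. The data points, and therefore the fitted exponent and R², are invented for illustration only.

```python
import numpy as np

# Hypothetical (token horizon, best LR from a short sweep) pairs for one model size.
horizons = np.array([2e9, 5e9, 1e10, 2e10, 5e10])
best_lrs = np.array([6.0e-3, 4.5e-3, 3.5e-3, 2.8e-3, 2.0e-3])

# A power law lr_opt = c * tokens^b is a straight line in log-log space:
# log(lr_opt) = log(c) + b * log(tokens).
b, log_c = np.polyfit(np.log(horizons), np.log(best_lrs), deg=1)

# Goodness of fit (R^2), analogous to the fit quality the authors report.
pred = log_c + b * np.log(horizons)
residuals = np.log(best_lrs) - pred
r2 = 1.0 - residuals.var() / np.log(best_lrs).var()
print(f"fitted exponent b = {b:.3f}, R^2 = {r2:.3f}")

# Extrapolate the fitted law to a much longer horizon.
lr_at_1t = np.exp(log_c + b * np.log(1e12))
print(f"predicted optimal LR at 1T tokens: {lr_at_1t:.2e}")
```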
A case study on LLaMA-1 was conducted to evaluate whether the LRs used in that model aligned with the derived scaling law. Through small-scale experiments at different token horizons, the authors found that the optimal LR for LLaMA-1 was significantly lower than the one actually used, suggesting that LLaMA-1's LR was overestimated and that its training may have been less efficient as a result.
Using a higher-than-optimal LR may have hurt LLaMA-1's overall performance, with an estimated upper bound of 0.027 on the resulting validation-loss penalty; the gap in validation loss between the optimal LR and the one used in LLaMA-1 could therefore have been meaningful.
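A hedged sketch of what such a check looks like: extrapolate an LR tuned at a short horizon to a production-scale token budget and compare it with the LR actually used. The short-horizon LR and the exponent below are placeholders, and the LLaMA-1 7B figures (roughly 1T training tokens and a peak LR of 3e-4) are taken from the LLaMA-1 report rather than from this paper.

```python
# Placeholder sweep result and exponent; only the LLaMA-1 7B figures are from its report.
lr_tuned_at_20b = 6.0e-4      # hypothetical best LR from a 20B-token sweep
exponent = 0.3                # illustrative power-law exponent, as above

production_tokens = 1.0e12    # approximate LLaMA-1 7B training budget
used_lr = 3.0e-4              # approximate peak LR used for LLaMA-1 7B

predicted_lr = lr_tuned_at_20b * (production_tokens / 20e9) ** (-exponent)
print(f"extrapolated optimal LR: {predicted_lr:.2e} vs LR actually used: {used_lr:.2e}")
```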
Conclusion
To sum up, the study focused on scaling token horizons and adjusting LR within fixed LLM configurations, acknowledging its limitations in scope and computational demands. It extended scaling laws to around 800 billion tokens but noted that many state-of-the-art models are trained for longer durations. The findings indicated that optimal LR decreased with increasing token horizons, allowing for hyperparameter transfer across these horizons.
Practitioners were advised to use the derived formula for estimating the optimal LR, which adds no overhead and allows efficient transfer across token horizons during training. The case study on LLaMA-1 demonstrated that it was trained with a significantly larger LR than optimal. The authors concluded that hyperparameter transfer across token horizons remains an under-explored aspect of LLM training.
Journal reference:
- Preliminary scientific report.
Bjorck, J., Benhaim, A., Chaudhary, V., Wei, F., & Song, X. (2024). Scaling Optimal LR Across Token Horizons. arXiv. https://arxiv.org/abs/2409.19913