Scaling Laws Refined: Learning Rate Optimization for Large Language Models

New findings reveal how smaller learning rates are key to efficient training for large language models, offering a rule-of-thumb for transferring hyperparameters and improving overall performance.

Research: Scaling Optimal LR Across Token Horizons. Image Credit: Jamie Jin / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article submitted to the arXiv preprint* server, researchers at Microsoft conducted a large-scale empirical study investigating how the optimal learning rate (LR) changed with token horizon during large language model (LLM) training. The study revealed that longer training necessitated smaller LRs and that the optimal LR followed a scaling law, enabling estimation for longer horizons from shorter ones.

The analysis provided a rule of thumb for transferring the LR across token horizons with no computational overhead beyond current practice. The authors also presented evidence that Meta's LLama-1 model was trained with too high an LR, and they highlighted the importance of hyperparameter transfer across data size in LLM training.
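The article does not state the functional form of the law. As a rough sketch only, assuming the power-law form common in scaling-law studies, the relationship can be written as follows. The symbols $\eta^{*}$ (optimal LR), $D$ (token horizon), and the fitted constants $B$ and $\beta$ are notation chosen here for illustration, not taken from the article.

```latex
\eta^{*}(D) = B \, D^{-\beta}, \quad \beta > 0
\qquad\Longleftrightarrow\qquad
\log \eta^{*}(D) = \log B - \beta \log D

\frac{\eta^{*}(D_{2})}{\eta^{*}(D_{1})} = \left( \frac{D_{2}}{D_{1}} \right)^{-\beta}
```

Under this assumption, the constant $B$ cancels in the ratio, so an LR tuned at a short horizon $D_{1}$ could be rescaled to a longer horizon $D_{2}$ using only the exponent $\beta$.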

Background

Scaling laws for LLMs have been studied across areas such as model architecture, multi-modality, inference, and data. LR selection has been a key focus: larger models require smaller LRs, and methods have been developed for choosing the optimal LR as models scale.

Recent studies showed that the LR transfers across model sizes but assumed a fixed training horizon. A key challenge in scaling LLMs is selecting the optimal LR across varying token horizons, because traditional methods often assume a fixed training duration. This assumption can lead to overestimating the optimal LR, especially in compute-limited settings.

Optimizing LR in LLMs

Scaling laws for LLMs have been actively explored in various areas, such as model architecture, multi-modality, inference, and data. These laws provide insights into how different factors, like model size and dataset size, influence model performance, offering guidelines for efficient scaling and optimization of LLMs as these models grow in complexity.

LR selection has emerged as a critical focus within this research. Larger models typically require smaller LRs to maintain stability during training, and several methods have been developed to determine the optimal LR when scaling models. Proper LR selection ensures efficient training and prevents divergence or slow convergence.

Studies have demonstrated that LR transfer across model sizes is feasible, allowing practitioners to estimate the optimal LR for larger models from results on smaller ones. However, many of these studies assume a fixed training horizon, which limits the applicability of their findings when the training length or token horizon varies.

A key challenge in scaling LLMs is accurately selecting the optimal LR across varying token horizons, as traditional methods often assume a constant training duration. This limitation can lead to overestimating the optimal LR, particularly in compute-constrained settings, underscoring the need for more adaptable approaches to handle changes in training duration and dataset size.

Evaluating the Impact of LR

The team explored how the optimal LR scales with both model size and token horizon. Because of computational constraints, fully determining the joint scaling behavior of the two factors was beyond the scope of the study, but the authors provided preliminary insights. Plotting the optimal LR against model size and token horizon on logarithmic scales showed that it decreases as either one increases. The authors fitted these relationships for various model sizes and validated the fits, obtaining R² values as high as 0.99 for larger models.
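A fit of this kind can be sketched as an ordinary least-squares line in log-log space. The example below is only an illustration under the assumption of a pure power law in the token horizon; the function name, the data points, and the constants 3.0 and 0.3 are hypothetical and are not taken from the study.

```python
import math


def fit_power_law(horizons, optimal_lrs):
    """Fit lr = B * D**(-beta) by least squares on (log D, log lr).

    Returns (B, beta, r_squared).
    """
    xs = [math.log(d) for d in horizons]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Closed-form slope and intercept of the best-fit line.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Coefficient of determination, computed in log space.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot

    return math.exp(intercept), -slope, r_squared


# Hypothetical data generated from an exact power law, for illustration only.
horizons = [25e9, 50e9, 100e9, 200e9, 400e9]
optimal_lrs = [3.0 * d ** -0.3 for d in horizons]

B, beta, r2 = fit_power_law(horizons, optimal_lrs)
print(f"B={B:.3f}, beta={beta:.3f}, R^2={r2:.4f}")
```

With real sweep results, the optimal LR at each horizon would first have to be found by a search over LRs, and the R² would fall below 1 by an amount that reflects how well a power law describes the data.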

The scaling law held well for larger models, but smaller models behaved differently, indicating that the derived formula may not apply reliably in that range. For large-scale models, the law provides a framework for predicting the optimal LR across model sizes and token horizons.

A case study evaluated whether the LR used to train LLama-1 was aligned with the derived scaling law. Through small-scale experiments with different token horizons, the authors found that the optimal LR for LLama-1 was significantly lower than the one used. This suggests that LLama-1's LR was set too high, leading to potential inefficiencies in the model's training.

Using a higher-than-optimal LR may have hurt LLama-1's overall performance: the authors estimated an upper bound of 0.027 on the resulting validation loss penalty.

Conclusion

To sum up, the study focused on scaling token horizons and adjusting the LR within fixed LLM configurations, and it acknowledged that computational demands limited its scope. It extended scaling laws to around 800 billion tokens but noted that many state-of-the-art models are trained for longer durations. The findings indicated that the optimal LR decreased with increasing token horizons, allowing for hyperparameter transfer across these horizons.

The authors recommended that practitioners use the derived formula to estimate the optimal LR, since it adds no overhead and allows efficient transfer across token horizons. The case study on LLama-1 indicated that the model was trained with a significantly larger LR than optimal. The authors concluded that hyperparameter transfer across token horizons remains underexplored in LLM training.
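How such a transfer might look in practice can be sketched as below, assuming the optimal LR is proportional to the token horizon raised to a negative exponent. The function `transfer_lr`, the exponent value 0.25, and the example LR and horizons are hypothetical placeholders chosen to keep the arithmetic simple; they are not the constants fitted in the study.

```python
def transfer_lr(lr_short, tokens_short, tokens_long, beta):
    """Rescale an LR tuned at a short horizon to a longer one.

    Assumes optimal LR is proportional to tokens ** (-beta); beta must
    come from a fit and is not supplied by this sketch.
    """
    if tokens_short <= 0 or tokens_long <= 0:
        raise ValueError("token horizons must be positive")
    return lr_short * (tokens_long / tokens_short) ** (-beta)


# Hypothetical example: LR tuned at 50B tokens, target run of 800B tokens.
# beta=0.25 is a placeholder exponent, not a value from the study.
lr_long = transfer_lr(lr_short=1e-3, tokens_short=50e9, tokens_long=800e9, beta=0.25)
print(lr_long)  # horizon ratio is 16 and 16 ** -0.25 == 0.5, so this is 5e-4
```

The only cost is a sweep at the short horizon, which is consistent with the article's point that the transfer adds no overhead to the long run itself.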

Journal reference:
  • Preliminary scientific report. Bjorck, J., Benhaim, A., Chaudhary, V., Wei, F., & Song, X. (2024). Scaling Optimal LR Across Token Horizons. ArXiv. https://arxiv.org/abs/2409.19913

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

Citations

Chandrasekar, Silpaja. (2024, October 07). Scaling Laws Refined: Learning Rate Optimization for Large Language Models. AZoAi. Retrieved on January 15, 2025 from https://www.azoai.com/news/20241007/Scaling-Laws-Refined-Learning-Rate-Optimization-for-Large-Language-Models.aspx.
