Combining Large Models Unlocks New Levels Of Performance In AI Research

By merging instruction-tuned models with up to 64 billion parameters, researchers have discovered a scalable method that enhances model performance, making merging a strong alternative to multitask training.

Research: What Matters for Model Merging at Scale? Image Credit: NicoElNino / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv preprint* server, researchers at Google, the University of North Carolina at Chapel Hill, and Virginia Tech explored large-scale model merging, which combines multiple expert models into a single, more capable model. They evaluated how factors such as model size, base model quality, the number of expert models, and the merging method (Averaging, Task Arithmetic, Dare-TIES, or TIES-Merging) affected merging performance.

Through experiments on models with up to 64 billion parameters, the research provided insights into generalization, scalability, and merging methods, showing that merging built on larger, stronger base models improved generalization and performance across tasks. The study also found that different merging methods performed similarly at large scales, so simpler methods such as Averaging were often sufficient for large models.
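
To make the method names above concrete, the sketch below shows the two simplest approaches, plain parameter Averaging and Task Arithmetic, applied to PyTorch-style state dictionaries. It is a minimal illustration only: the function names and the scaling factor are assumptions, not the authors' implementation.

```python
# Minimal sketch of parameter Averaging and Task Arithmetic over PyTorch-style
# state dicts. Illustrative only; not the authors' implementation.
from typing import Dict, List

import torch


def average_merge(expert_states: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Element-wise mean of each parameter across the expert models."""
    return {
        name: torch.stack([expert[name] for expert in expert_states]).mean(dim=0)
        for name in expert_states[0]
    }


def task_arithmetic_merge(
    base_state: Dict[str, torch.Tensor],
    expert_states: List[Dict[str, torch.Tensor]],
    scale: float = 1.0,  # scaling factor is an assumed hyperparameter
) -> Dict[str, torch.Tensor]:
    """Add the scaled sum of task vectors (expert minus base) back onto the base model."""
    merged = {}
    for name, base_param in base_state.items():
        task_vectors = torch.stack([expert[name] - base_param for expert in expert_states])
        merged[name] = base_param + scale * task_vectors.sum(dim=0)
    return merged
```

In both cases the expert models must share the same architecture and start from the same base model, which is part of why the paper emphasizes base-model quality.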

Background

Model merging has emerged as a promising method for creating more efficient and powerful models by combining the strengths of multiple expert models. This technique not only reduces storage and serving costs but also enhances model generalization by leveraging the complementary knowledge of different expert models.

Early work in this area primarily focused on small models (typically less than seven billion parameters), using methods like parameter averaging and task arithmetic to merge two or three models. However, these studies were limited in scope. They often focused on improving performance on tasks that the expert models were trained on (held-in tasks), with little investigation into zero-shot generalization to unseen tasks (held-out tasks).

In contrast, this paper explored merging larger models—up to 64 billion parameters—and evaluated the effects of merging up to eight expert models. It also compared the performance of pre-trained versus instruction-tuned base models and examined how model size influenced the ease of merging.

The findings provided valuable insights into how factors like model initialization, size, and the number of merged models affected both held-in and held-out performance, offering practical recommendations for applying model merging at scale. This work filled a critical gap in understanding the scalability and generalization potential of model merging.

Evaluating Model Merging

This research presented a large-scale evaluation of model merging, focusing on factors such as model size, base model quality, merging method, and the number of models being merged. The authors used the T0 experimental setting, featuring eight held-in task categories (such as multiple-choice question answering (QA), summarization, and sentiment analysis) and four held-out categories (such as sentence completion, coreference resolution, and natural language inference). Two datasets from each task category were selected for evaluation, balancing cost and diversity.

The authors employed Pathways Language Model 2 (PaLM-2) models, ranging from one billion to 64 billion parameters, in both non-instruction-tuned (non-IT) and instruction-tuned (IT) variants. A total of 64 expert models were created by fully fine-tuning the base models on the held-in tasks. The researchers conducted 384 merging experiments, varying model type, size, merging method, and the number of constituent models.
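
As a rough picture of how such an experiment grid comes together, the snippet below crosses base-model type, model size, merging method, and the number of constituent experts. The specific factor levels (other than the 1B and 64B endpoints and the four merging methods named in the paper) are assumptions, and this simplified grid does not reproduce the paper's exact count of 384 runs.

```python
# Hypothetical enumeration of a merging-experiment grid; the intermediate model
# sizes and expert counts are assumptions for illustration, and the study's 384
# experiments imply additional factors (e.g., repeated expert subsets).
from itertools import product

base_types = ["PaLM-2", "PaLM-2-IT"]       # non-instruction-tuned vs instruction-tuned
model_sizes = ["1B", "8B", "24B", "64B"]   # only 1B and 64B are confirmed in the article
merge_methods = ["Averaging", "Task Arithmetic", "Dare-TIES", "TIES-Merging"]
num_experts = [2, 4, 6, 8]                 # number of constituent expert models

grid = list(product(base_types, model_sizes, merge_methods, num_experts))
print(len(grid))  # 128 combinations in this simplified grid
```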

The evaluation focused on both held-in tasks (training tasks) and held-out tasks, with normalized performance metrics used for comparison. Instruction-tuned models (PaLM-2-IT) consistently outperformed non-IT models across all configurations, indicating that stronger base models improve merged model performance. Interestingly, the study found that instruction-tuned models facilitated easier merging, allowing the merged models to retain task-specific expertise while enhancing zero-shot generalization to unseen tasks.
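
The normalized comparison can be pictured as scoring each merged model relative to a per-task reference and then averaging across tasks. The choice of reference below (the corresponding fine-tuned expert) and the task names and numbers are assumptions for illustration; the paper's exact normalization may differ.

```python
# Hypothetical normalized-performance calculation: divide each per-task score by
# a reference score (here, the corresponding expert's), then average. The choice
# of reference is an assumption and may differ from the paper's normalization.
from statistics import mean
from typing import Dict


def normalized_performance(
    merged_scores: Dict[str, float],     # task name -> merged model's score
    reference_scores: Dict[str, float],  # task name -> reference score (e.g., task expert)
) -> float:
    return mean(merged_scores[task] / reference_scores[task] for task in merged_scores)


# A value near (or above) 1.0 means the merged model matches (or beats) the
# reference on average across tasks; the numbers below are made up.
print(normalized_performance({"qa": 0.62, "summarization": 0.41},
                             {"qa": 0.65, "summarization": 0.40}))
```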

Model Size and Tuning in Merging

The experiments examined how factors such as model size, base model quality, merging method, and the number of expert models influenced both held-in (training-task) performance and zero-shot (held-out) generalization. The study found that as model size increased, merging became easier and more effective, particularly with larger instruction-tuned models. Larger models, such as the 64-billion-parameter PaLM-2-IT, demonstrated improved performance on both held-in and held-out tasks. In fact, when merging eight expert models, the merged models often outperformed multitask-trained models, suggesting that model merging can be a viable alternative to multitask training for large models.

Another significant finding was that merged models often performed better on unseen tasks compared to their base models, showing improved generalization. For example, the merged 64B PaLM-2-IT models outperformed their base models in zero-shot generalization, indicating the potential of large-scale model merging to generalize better than multitask training. In the case of weaker base models, increasing the model size significantly boosted the merged model's performance on these tasks. Stronger base models like PaLM-2-IT, however, demonstrated a more consistent improvement in generalization as more expert models were added.

Larger models could also absorb more expert models without losing performance, particularly in the instruction-tuned case. This suggests that model merging can accommodate more models as they scale, with minimal performance degradation. Interestingly, when merging large models, different merging methods produced similar results, suggesting that simpler methods, such as Averaging, were sufficient at large scale.
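
To show what one of the more involved methods looks like next to plain Averaging, here is a simplified sketch of a TIES-style merge as the procedure is commonly described (trim low-magnitude updates, elect a per-parameter sign, then average only the agreeing updates). It is not the authors' implementation, and the trim fraction and scaling factor are assumed values.

```python
# Simplified TIES-style merge: trim, elect sign, disjoint mean. Illustrative
# only; thresholds and details are assumptions, not the authors' implementation.
from typing import Dict, List

import torch


def ties_merge(
    base_state: Dict[str, torch.Tensor],
    expert_states: List[Dict[str, torch.Tensor]],
    keep_fraction: float = 0.2,  # fraction of largest-magnitude entries kept (assumed value)
    scale: float = 1.0,          # scaling on the merged task vector (assumed value)
) -> Dict[str, torch.Tensor]:
    merged = {}
    for name, base_param in base_state.items():
        # Task vectors: each expert's update relative to the shared base model.
        tvs = torch.stack([expert[name] - base_param for expert in expert_states])
        flat = tvs.reshape(tvs.shape[0], -1)

        # Trim: zero out all but the top-k largest-magnitude entries per expert.
        k = max(1, int(keep_fraction * flat.shape[1]))
        thresh = flat.abs().kthvalue(flat.shape[1] - k + 1, dim=1, keepdim=True).values
        trimmed = torch.where(flat.abs() >= thresh, flat, torch.zeros_like(flat))

        # Elect sign: per parameter, keep the sign with the larger total magnitude.
        elected = torch.sign(trimmed.sum(dim=0))

        # Disjoint mean: average only the entries that agree with the elected sign.
        agree = (torch.sign(trimmed) == elected) & (trimmed != 0)
        summed = torch.where(agree, trimmed, torch.zeros_like(trimmed)).sum(dim=0)
        counts = agree.sum(dim=0).clamp(min=1)
        merged[name] = base_param + scale * (summed / counts).reshape(base_param.shape)
    return merged
```

The study's observation that such methods and plain Averaging land in roughly the same place at large scale is what makes the simpler option attractive in practice.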

Conclusion

In conclusion, the study on large-scale model merging revealed that combining expert models can significantly enhance both efficiency and performance. By evaluating models ranging from one billion to 64 billion parameters, the authors highlighted the positive impact of model size and instruction tuning on merging effectiveness. The findings indicated that model merging, particularly with instruction-tuned models, can improve generalization beyond multitask training, offering a scalable approach for creating powerful, modular models.

Larger and well-tuned models not only simplified the merging process but also improved generalization to unseen tasks. The findings suggested that model merging was a viable alternative to traditional multitask training, allowing for the creation of powerful, modular models that leveraged diverse expert knowledge while maintaining robust performance across various tasks.

Journal reference:
  • Preliminary scientific report. Yadav, P., Vu, T., Lai, J., Chronopoulou, A., Faruqui, M., Bansal, M., & Munkhdalai, T. (2024). What Matters for Model Merging at Scale? arXiv. https://arxiv.org/abs/2410.03617

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.
