By merging instruction-tuned expert models with up to 64 billion parameters, researchers show that model merging scales effectively, enhancing performance and making it a strong alternative to multitask training.
Research: What Matters for Model Merging at Scale?
In an article recently submitted to the arXiv preprint* server, researchers at Google, the University of North Carolina at Chapel Hill, and Virginia Tech explored large-scale model merging, focusing on combining multiple expert models into a single, more capable model. They evaluated how factors like model size, base model quality, merging methods such as Averaging, Task Arithmetic, Dare-TIES, and TIES-Merging, and the number of expert models affected merging performance.
Through experiments on models with up to 64 billion parameters, the research provided insights into generalization, scalability, and merging methods, highlighting that merging larger, stronger base models improved generalization and performance across tasks. The study also found that different merging methods performed similarly at large scales, so simpler methods like Averaging were often sufficient for large models.
Background
Model merging has emerged as a promising method for creating more efficient and powerful models by combining the strengths of multiple expert models. This technique not only reduces storage and serving costs but also enhances model generalization by leveraging the complementary knowledge of different expert models.
Early work in this area primarily focused on small models (typically less than seven billion parameters), using methods like parameter averaging and task arithmetic to merge two or three models. However, these studies were limited in scope. They often focused on improving performance on tasks that the expert models were trained on (held-in tasks), with little investigation into zero-shot generalization to unseen tasks (held-out tasks).
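As a rough illustration of those two early recipes, the sketch below merges hypothetical PyTorch-style state dictionaries (parameter name to tensor). Both functions assume every expert was fine-tuned from the same base checkpoint so the parameters stay aligned, and the `scale` value is an illustrative choice rather than a setting from the paper.

```python
import torch

def average_merge(experts):
    """Parameter averaging: element-wise mean of the experts' weights."""
    return {
        name: torch.stack([expert[name] for expert in experts]).mean(dim=0)
        for name in experts[0]
    }

def task_arithmetic_merge(base, experts, scale=0.3):
    """Task arithmetic: add scaled, averaged task vectors (expert - base) to the base."""
    merged = {}
    for name in base:
        # Each task vector captures how one expert drifted from the shared base model.
        task_vectors = torch.stack([expert[name] - base[name] for expert in experts])
        merged[name] = base[name] + scale * task_vectors.mean(dim=0)
    return merged
```

Because both recipes operate element-wise on aligned parameters, they add no inference cost: the result is a single model with the same architecture as the base.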
In contrast, this paper explored merging larger models—up to 64 billion parameters—and evaluated the effects of merging up to eight expert models. It also compared the performance of pre-trained versus instruction-tuned base models and examined how model size influenced the ease of merging.
The findings provided valuable insights into how factors like model initialization, size, and the number of merged models affected both held-in and held-out performance, offering practical recommendations for applying model merging at scale. This work filled a critical gap in understanding the scalability and generalization potential of model merging.
Evaluating Model Merging
This research presented a large-scale evaluation of model merging, focusing on factors like model size, base model quality, merging method, and the number of models being merged. The authors used the T0 experimental setting, featuring eight held-in task categories (such as multiple-choice question answering (QA), summarization, and sentiment analysis) and four held-out categories (such as sentence completion, coreference resolution, and natural language inference). Two datasets from each task category were selected for evaluation, balancing cost and diversity.
The authors employed Pathways Language Model 2 (PaLM-2) models ranging from one billion to 64 billion parameters, in both non-instruction-tuned (non-IT) and instruction-tuned (IT) variants. A total of 64 expert models were created by fully fine-tuning the base models on the held-in tasks. The researchers conducted 384 merging experiments, varying model type, size, merging method, and the number of constituent models.
The evaluation focused on both held-in tasks (training tasks) and held-out tasks, with normalized performance metrics used for comparison. Instruction-tuned models (PaLM-2-IT) consistently outperformed non-IT models across all configurations, indicating that stronger base models improve merged model performance. Interestingly, the study found that instruction-tuned models facilitated easier merging, allowing the merged models to retain task-specific expertise while enhancing zero-shot generalization to unseen tasks.
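The normalized scores mentioned above can be pictured with a small, hypothetical helper; here each task score is divided by the score of the expert fine-tuned on that task, though the paper's exact normalization scheme may differ.

```python
def normalized_performance(model_scores, expert_scores):
    """Mean of per-task scores, each normalized by the matching expert's score.

    Both arguments are hypothetical dicts mapping task name -> accuracy-like
    metric: `model_scores` for the merged model, `expert_scores` for the
    expert fine-tuned on that task.
    """
    ratios = [model_scores[task] / expert_scores[task] for task in model_scores]
    return sum(ratios) / len(ratios)
```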
Model Size and Tuning in Merging
The experiments examined how model size, base model quality, the merging method, and the number of expert models influenced both held-in (training-task) performance and zero-shot (held-out) generalization. The study found that as model size increased, merging became easier and more effective, particularly with larger instruction-tuned models. Larger models, such as the 64-billion-parameter PaLM-2-IT, demonstrated improved performance across both held-in and held-out tasks. In fact, when merging eight expert models, the merged models often outperformed multitask-trained models, suggesting that model merging can be a viable alternative to multitask training for large models.
Another significant finding was that merged models often performed better on unseen tasks compared to their base models, showing improved generalization. For example, the merged 64B PaLM-2-IT models outperformed their base models in zero-shot generalization, indicating the potential of large-scale model merging to generalize better than multitask training. In the case of weaker base models, increasing the model size significantly boosted the merged model's performance on these tasks. Stronger base models like PaLM-2-IT, however, demonstrated a more consistent improvement in generalization as more expert models were added.
Larger base models also accommodated more expert models without losing performance, particularly in the instruction-tuned case, suggesting that merging scales gracefully: as models grow, more experts can be combined with minimal degradation. Interestingly, when merging large models, the different merging methods produced similar results, suggesting that simpler methods such as averaging are often sufficient at scale.
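To make that comparison concrete, here is a minimal sketch of TIES-Merging, one of the more elaborate methods compared against plain averaging. It follows the published trim / elect-sign / disjoint-merge recipe, but the `density` and `scale` values, the PyTorch-style state dicts, and the per-tensor trimming are illustrative assumptions, not the authors' implementation.

```python
import torch

def ties_merge(base, experts, density=0.2, scale=1.0):
    """TIES-Merging sketch: trim small task-vector entries, elect a majority sign
    per parameter, then average only the entries that agree with that sign."""
    merged = {}
    for name in base:
        # Task vectors: how each expert differs from the shared base model.
        tvs = torch.stack([expert[name] - base[name] for expert in experts])
        # Trim: keep only the top `density` fraction of entries by magnitude per expert.
        flat = tvs.reshape(tvs.shape[0], -1).abs()
        k = max(1, int(density * flat.shape[1]))
        threshold = flat.topk(k, dim=1).values[:, -1]
        threshold = threshold.reshape(-1, *([1] * (tvs.dim() - 1)))
        tvs = torch.where(tvs.abs() >= threshold, tvs, torch.zeros_like(tvs))
        # Elect sign: the dominant sign of the summed, trimmed task vectors.
        elected_sign = torch.sign(tvs.sum(dim=0))
        # Disjoint merge: average only entries whose sign matches the elected sign.
        agree = (torch.sign(tvs) == elected_sign) & (tvs != 0)
        counts = agree.sum(dim=0).clamp(min=1)
        merged[name] = base[name] + scale * (tvs * agree).sum(dim=0) / counts
    return merged
```

DARE-style variants additionally drop task-vector entries at random and rescale the survivors before a merge of this kind, which is why the paper treats Dare-TIES and TIES-Merging as related but distinct baselines.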
Conclusion
In conclusion, the study on large-scale model merging revealed that combining expert models could significantly enhance both efficiency and performance. By evaluating models ranging from one billion to 64 billion parameters, the authors highlighted the positive impact of model size and instruction tuning on merging effectiveness. The findings indicated that model merging, particularly with instruction-tuned models, can improve generalization beyond multitask training, offering a scalable approach to creating powerful, modular models.
Larger and well-tuned models not only simplified the merging process but also improved generalization to unseen tasks. The findings suggested that model merging was a viable alternative to traditional multitask training, allowing for the creation of powerful, modular models that leveraged diverse expert knowledge while maintaining robust performance across various tasks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Yadav, P., Vu, T., Lai, J., Chronopoulou, A., Faruqui, M., Bansal, M., & Munkhdalai, T. (2024). What Matters for Model Merging at Scale? arXiv. https://arxiv.org/abs/2410.03617