By merging instruction-tuned expert models with up to 64 billion parameters, researchers show that model merging scales effectively, enhancing performance and making it a strong alternative to multitask training.
Research: What Matters for Model Merging at Scale?
In an article recently submitted to the arXiv preprint* server, researchers at Google, the University of North Carolina at Chapel Hill, and Virginia Tech explored large-scale model merging, focusing on combining multiple expert models into a single, more capable model. They evaluated how factors like model size, base model quality, merging methods such as Averaging, Task Arithmetic, Dare-TIES, and TIES-Merging, and the number of expert models affected merging performance.
Through experiments on models with up to 64 billion parameters, the research provided insights into generalization, scalability, and merging methods, highlighting that merging larger, stronger base models improved generalization and performance across tasks. The study also found that different merging methods performed similarly at large scales, so simpler methods like Averaging were often sufficient for large models.
Background
Model merging has emerged as a promising method for creating more efficient and powerful models by combining the strengths of multiple expert models. This technique not only reduces storage and serving costs but also enhances model generalization by leveraging the complementary knowledge of different expert models.
Early work in this area primarily focused on small models (typically less than seven billion parameters), using methods like parameter averaging and task arithmetic to merge two or three models. However, these studies were limited in scope. They often focused on improving performance on tasks that the expert models were trained on (held-in tasks), with little investigation into zero-shot generalization to unseen tasks (held-out tasks).
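As a rough illustration of those two early recipes, the sketch below merges hypothetical PyTorch-style state dictionaries (parameter name to tensor). Both functions assume every expert was fine-tuned from the same base checkpoint so the parameters stay aligned, and the `scale` value is an illustrative choice rather than a setting from the paper.

```python
import torch

def average_merge(experts):
    """Parameter averaging: element-wise mean of the experts' weights."""
    return {
        name: torch.stack([expert[name] for expert in experts]).mean(dim=0)
        for name in experts[0]
    }

def task_arithmetic_merge(base, experts, scale=0.3):
    """Task arithmetic: add scaled, averaged task vectors (expert - base) to the base."""
    merged = {}
    for name in base:
        # Each task vector captures how one expert drifted from the shared base model.
        task_vectors = torch.stack([expert[name] - base[name] for expert in experts])
        merged[name] = base[name] + scale * task_vectors.mean(dim=0)
    return merged
```

Because both recipes operate element-wise on aligned parameters, they add no inference cost: the result is a single model with the same architecture as the base.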
In contrast, this paper explored merging larger models—up to 64 billion parameters—and evaluated the effects of merging up to eight expert models. It also compared the performance of pre-trained versus instruction-tuned base models and examined how model size influenced the ease of merging.
The findings provided valuable insights into how factors like model initialization, size, and the number of merged models affected both held-in and held-out performance, offering practical recommendations for applying model merging at scale. This work filled a critical gap in understanding the scalability and generalization potential of model merging.
Evaluating Model Merging
This research presented a large-scale evaluation of model merging, focusing on factors like model size, base model quality, merging method, and the number of models being merged. The authors used the T0 experimental setting, featuring eight held-in task categories (such as multiple-choice question answering (QA), summarization, and sentiment analysis) and four held-out categories (such as sentence completion, coreference resolution, and natural language inference). Two datasets from each task category were selected for evaluation, balancing cost and diversity.
The authors employed Pathways Language Model 2 (PaLM-2) models ranging from one billion to 64 billion parameters, in both non-instruction-tuned (non-IT) and instruction-tuned (IT) variants. A total of 64 expert models were created by fully fine-tuning the base models on the held-in tasks. The researchers conducted 384 merging experiments, varying model type, size, merging method, and the number of constituent models.
The evaluation focused on both held-in tasks (training tasks) and held-out tasks, with normalized performance metrics used for comparison. Instruction-tuned models (PaLM-2-IT) consistently outperformed non-IT models across all configurations, indicating that stronger base models improve merged model performance. Interestingly, the study found that instruction-tuned models facilitated easier merging, allowing the merged models to retain task-specific expertise while enhancing zero-shot generalization to unseen tasks.
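The normalized scores mentioned above can be pictured with a small, hypothetical helper; here each task score is divided by the score of the expert fine-tuned on that task, though the paper's exact normalization scheme may differ.

```python
def normalized_performance(model_scores, expert_scores):
    """Mean of per-task scores, each normalized by the matching expert's score.

    Both arguments are hypothetical dicts mapping task name -> accuracy-like
    metric: `model_scores` for the merged model, `expert_scores` for the
    expert fine-tuned on that task.
    """
    ratios = [model_scores[task] / expert_scores[task] for task in model_scores]
    return sum(ratios) / len(ratios)
```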
Model Size and Tuning in Merging
The experiments examined how model size, base model quality, the merging method, and the number of expert models influenced both held-in (training-task) performance and zero-shot (held-out) generalization. The study found that as model size increased, merging became easier and more effective, particularly with larger instruction-tuned models. Larger models, such as the 64-billion-parameter PaLM-2-IT, demonstrated improved performance across both held-in and held-out tasks. In fact, when merging eight expert models, the merged models often outperformed multitask-trained models, suggesting that model merging can be a viable alternative to multitask training for large models.
Another significant finding was that merged models often performed better on unseen tasks compared to their base models, showing improved generalization. For example, the merged 64B PaLM-2-IT models outperformed their base models in zero-shot generalization, indicating the potential of large-scale model merging to generalize better than multitask training. In the case of weaker base models, increasing the model size significantly boosted the merged model's performance on these tasks. Stronger base models like PaLM-2-IT, however, demonstrated a more consistent improvement in generalization as more expert models were added.
Larger base models also accommodated more expert models without losing performance, particularly in the instruction-tuned case, suggesting that merging scales gracefully: as models grow, more experts can be combined with minimal degradation. Interestingly, when merging large models, the different merging methods produced similar results, suggesting that simpler methods such as averaging are often sufficient at scale.
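To make that comparison concrete, here is a minimal sketch of TIES-Merging, one of the more elaborate methods compared against plain averaging. It follows the published trim / elect-sign / disjoint-merge recipe, but the `density` and `scale` values, the PyTorch-style state dicts, and the per-tensor trimming are illustrative assumptions, not the authors' implementation.

```python
import torch

def ties_merge(base, experts, density=0.2, scale=1.0):
    """TIES-Merging sketch: trim small task-vector entries, elect a majority sign
    per parameter, then average only the entries that agree with that sign."""
    merged = {}
    for name in base:
        # Task vectors: how each expert differs from the shared base model.
        tvs = torch.stack([expert[name] - base[name] for expert in experts])
        # Trim: keep only the top `density` fraction of entries by magnitude per expert.
        flat = tvs.reshape(tvs.shape[0], -1).abs()
        k = max(1, int(density * flat.shape[1]))
        threshold = flat.topk(k, dim=1).values[:, -1]
        threshold = threshold.reshape(-1, *([1] * (tvs.dim() - 1)))
        tvs = torch.where(tvs.abs() >= threshold, tvs, torch.zeros_like(tvs))
        # Elect sign: the dominant sign of the summed, trimmed task vectors.
        elected_sign = torch.sign(tvs.sum(dim=0))
        # Disjoint merge: average only entries whose sign matches the elected sign.
        agree = (torch.sign(tvs) == elected_sign) & (tvs != 0)
        counts = agree.sum(dim=0).clamp(min=1)
        merged[name] = base[name] + scale * (tvs * agree).sum(dim=0) / counts
    return merged
```

DARE-style variants additionally drop task-vector entries at random and rescale the survivors before a merge of this kind, which is why the paper treats Dare-TIES and TIES-Merging as related but distinct baselines.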
Conclusion
In conclusion, the study on large-scale model merging revealed that combining expert models could significantly enhance both efficiency and performance. By evaluating models ranging from one billion to 64 billion parameters, the authors highlighted the positive impact of model size and instruction tuning on merging effectiveness. The findings indicated that model merging, particularly with instruction-tuned models, can improve generalization beyond multitask training, offering a scalable approach to creating powerful, modular models.
Larger and well-tuned models not only simplified the merging process but also improved generalization to unseen tasks. The findings suggested that model merging was a viable alternative to traditional multitask training, allowing for the creation of powerful, modular models that leveraged diverse expert knowledge while maintaining robust performance across various tasks.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Yadav, P., Vu, T., Lai, J., Chronopoulou, A., Faruqui, M., Bansal, M., & Munkhdalai, T. (2024). What Matters for Model Merging at Scale? arXiv. https://arxiv.org/abs/2410.03617