Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch
Abstract
In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that, with a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zero without affecting the capabilities of SFT LMs, and that larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify the delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa, and also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) the delta parameter value ranges of SFT models are typically small (often within 0.005), and DARE can effortlessly eliminate 99% of them; however, once the models are continually pre-trained, the value ranges grow to around 0.03, making DARE impractical. We also try removing fine-tuned instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original performance of 64.2. Code is available at https://github.com/yule-BUAA/MergeLM.
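As a rough illustration of the drop-and-rescale idea described in the abstract, below is a minimal PyTorch sketch of DARE-style merging. The function names and the simple delta-averaging step are illustrative assumptions, not the authors' exact code; for the official implementation, see the linked repository.

```python
import torch

def dare(delta: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Drop And REscale (DARE): zero out each delta parameter with probability
    `drop_rate`, then rescale the survivors by 1 / (1 - drop_rate) so the
    expected value of the delta is preserved."""
    keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
    return delta * keep_mask / (1.0 - drop_rate)

def dare_merge(base_state, finetuned_states, drop_rate=0.9):
    """Merge several SFT models sharing one backbone (illustrative sketch):
    sparsify each model's delta parameters with DARE, average the sparsified
    deltas, and add the result back onto the pre-trained backbone weights."""
    merged = {}
    for name, base_param in base_state.items():
        deltas = [dare(ft[name] - base_param, drop_rate) for ft in finetuned_states]
        merged[name] = base_param + torch.stack(deltas).mean(dim=0)
    return merged
```

The 1 / (1 − drop_rate) rescaling keeps the expected value of each delta unchanged, which is what makes the aggressive dropping reported in the abstract (up to 99% of delta parameters) possible without degrading the merged model.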
Community
I would have really liked a citation for this:
Human beings have always expressed their ambition to acquire additional abilities in various ways, such as movies and games. For example, in X-Men: Apocalypse, the character can absorb the powers of other mutants to strengthen himself. Likewise, the protagonist in the Super Mario games can gain superpowers like throwing fireballs by absorbing in-game items.
This is a really cool idea. I'd be interested to see some similar work on merging LoRAs (or quantized variants), as this lowers the barrier to entry of fine-tuning by a lot.
Definitely!
@julien-c are there any efforts to integrate this amazing DARE operation with transformers or perhaps TRL?
Has anyone investigated why the authors chose 13B models for the merge studies? In a preliminary study, trying DARE (p=0.2, 0.8) with average merging for WizardLM-7B and WizardMath-7B resulted in incoherent garbage.
Hi,
I am the author of this paper.
The premise for model merging is that all the models are fine-tuned based on the same backbone. In your case, WizardLM-7B is fine-tuned from llama-7b-hf while WizardMath-7B is fine-tuned from Llama-2-7b-hf. Hence, these two models cannot be merged due to their different backbones.
I hope this answer addresses your concerns.