RishabhBhardwaj
posted an update Jun 24, 2024
🎉 We are thrilled to share our work on model merging. We proposed a new approach, Della-merging, which combines expert models from various domains into a single, versatile model. Della employs a magnitude-based sampling approach to eliminate redundant delta parameters, reducing interference when merging homologous models (those fine-tuned from the same backbone).

Della outperforms existing homologous model merging techniques such as DARE and TIES. Across three expert models (LM, Math, Code) and their corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), Della achieves an improvement of 3.6 points over TIES and 1.2 points over DARE.
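
As a rough illustration of what "magnitude-based sampling" of delta parameters can look like, here is a minimal pure-Python sketch. It is not the code from the repository: the function name `magnitude_based_drop`, the hyperparameters `p` and `eps`, the linear rank-to-probability mapping, and the rescaling of kept values are all assumptions made for this example. The idea shown is that small-magnitude deltas are dropped with higher probability than large-magnitude ones.

```python
import random


def magnitude_based_drop(deltas, p=0.5, eps=0.2, seed=0):
    """Stochastically drop delta parameters, favouring small magnitudes.

    deltas: list of floats (fine-tuned weight minus base weight).
    p:      average drop probability (illustrative hyperparameter).
    eps:    spread of drop probabilities around p (illustrative).
    """
    if not (0.0 <= p - eps / 2 and p + eps / 2 < 1.0):
        raise ValueError("drop probabilities must stay within [0, 1)")

    n = len(deltas)
    rng = random.Random(seed)
    # Rank indices by magnitude: rank 0 is the smallest |delta|.
    order = sorted(range(n), key=lambda i: abs(deltas[i]))

    out = [0.0] * n
    for rank, i in enumerate(order):
        # Linear mapping (an assumption of this sketch):
        # smallest magnitude -> p + eps/2, largest -> p - eps/2.
        frac = rank / (n - 1) if n > 1 else 0.5
        p_i = (p + eps / 2) - eps * frac
        if rng.random() >= p_i:
            # Kept: rescale by 1 / (1 - p_i) so the expected value
            # of the output matches the original delta.
            out[i] = deltas[i] / (1.0 - p_i)
    return out


if __name__ == "__main__":
    toy_deltas = [0.01, -0.5, 0.2, -0.03, 0.9, 0.0]
    print(magnitude_based_drop(toy_deltas, p=0.5, eps=0.2, seed=1))
```

The surviving deltas from each expert would then be combined and added back to the shared base weights; that fusion step is left out here to keep the sketch short.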

Paper: DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling (2406.11617)
GitHub: https://github.com/declare-lab/della

@soujanyaporia @Tej3

"(those fine-tuned from the same backbone)." Sorry, can you elaborate on "backbone"?

Does this refer to the LoRA config? Does it make a difference when merging?

If both models are fine-tuned with the same LoRA settings, are they more likely to keep their traits?

When I fine-tune, I vary the LoRA config depending on the depth of training. For the Bible, for example, when I tried a (16, 16) or (8, 16) LoRA config (15-20 million parameters), the model stayed far from the data and the loss took a long time just to come down. With the larger settings of 128/256 (a much higher number of parameters), the task was easier to train. Later, when I trained the same data as part of a multilingual dataset, I used a basic 4/16 setup and it trained very easily! So the depth of the training had an effect.

After I chose AlphaMonarch and OmniBeagle to merge with this model, the result was not great, despite using the TIES merge and then the linear soft-max merge afterwards. In fact, the original base model was totally lost! I re-merged with various strategies in an attempt to recover the model, and ended up using a merge of merges (genetic technique) to recenter the model on the original: basically treating the merge of merges as a LoRA on the base model and absorbing only very low density/weights and deltas (no more foreign merges).

So what does "backbone" refer to, please?


The backbone refers to the pretrained model used as the base model for fine-tuning the expert model.

For example, in the case of Wizard Models:

  • WizardLM-13B and WizardMath-13B are both fine-tuned from the Llama-2-13B model. Therefore, they can be effectively merged using Della, DARE, or TIES because they share the same backbone.

  • On the other hand, WizardCoder-13B is fine-tuned from the CodeLlama-13B-Python model. Since WizardCoder uses a different base model (backbone) from WizardLM-13B and WizardMath-13B, merging these three models effectively using Della, DARE, or TIES is not feasible.

To summarize, the backbone is the underlying pretrained model that serves as the starting point for fine-tuning. It is crucial in the merging process because models fine-tuned from different backbones may not merge effectively due to the differences in their initial pretrained weights and configurations.
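
To make the role of the backbone concrete, here is a small sketch with toy "models" represented as dicts of parameter name to a flat list of floats. The helper names `task_vector` and `merge_on_backbone` are hypothetical and the merge is a plain average of deltas, not Della, DARE, or TIES. The point it illustrates is that delta parameters are defined relative to one shared base, so every expert must have been fine-tuned from that same base.

```python
def task_vector(finetuned, base):
    """Return delta parameters (fine-tuned minus base) per named tensor."""
    if finetuned.keys() != base.keys():
        raise ValueError("parameter names differ: not the same architecture")
    delta = {}
    for name, w_base in base.items():
        w_ft = finetuned[name]
        if len(w_ft) != len(w_base):
            raise ValueError(f"shape mismatch for {name}")
        delta[name] = [f - b for f, b in zip(w_ft, w_base)]
    return delta


def merge_on_backbone(base, experts, weight=1.0):
    """Add the averaged expert deltas back onto the shared base."""
    deltas = [task_vector(expert, base) for expert in experts]
    merged = {}
    for name, w_base in base.items():
        merged[name] = [
            b + weight * sum(d[name][i] for d in deltas) / len(deltas)
            for i, b in enumerate(w_base)
        ]
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}       # toy stand-in for the backbone
    expert_a = {"w": [1.5, 2.0]}   # toy expert fine-tuned from base
    expert_b = {"w": [1.0, 3.0]}   # another toy expert from the same base
    print(merge_on_backbone(base, [expert_a, expert_b]))
```

Note that the name and shape checks only catch architecture mismatches. Two models with identical shapes but different pretrained weights would pass them, yet their deltas against the wrong base would no longer be small fine-tuning updates, which is why the shared backbone matters.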