Interesting methods and results
It's really interesting to watch this model family evolve. You got a strong result with Wernicke for all but ifeval, which is inching upward on the subsequent merges! How to get strong ifeval without compromising other metrics? I've been working to merge Wernicke with select layers of tanliboy/lambda-qwen2.5-14b-dpo-test. Feel free to use any of that which helps you!
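For reference, the kind of layer-selective blend I mean would look roughly like this - just a sketch, with the gradient values and the 48-layer count as assumptions rather than my actual config:

```yaml
# Rough sketch only: blend lambda-qwen into Wernicke mostly in the middle layers.
# The t gradient and the 48-layer range are illustrative assumptions.
merge_method: slerp
base_model: CultriX/Qwen2.5-14B-Wernicke
slices:
  - sources:
      - model: CultriX/Qwen2.5-14B-Wernicke
        layer_range: [ 0, 48 ]
      - model: tanliboy/lambda-qwen2.5-14b-dpo-test
        layer_range: [ 0, 48 ]
parameters:
  t: [ 0.00, 0.30, 0.50, 0.30, 0.00 ]   # donor influence peaks mid-stack
dtype: bfloat16
```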
I've been trying to use evolution merging, but with datasets that are not used in the official benchmarks for the leaderboard, so as to uphold the model's integrity.
One approach, I guess, would be to find datasets that test for similar things to the ifeval benchmarks and tell it to focus on improving its score on those!
Haven't really tried that yet though.
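For what it's worth, the shape of that idea as a mergekit-evolve config would be something like this - the task name is hypothetical and would have to be a custom lm-eval-harness task built from a held-out, non-leaderboard dataset:

```yaml
# Hypothetical sketch: evolve merge parameters against an ifeval-like custom task.
# "ifeval_like" does not exist; it stands in for a custom lm-eval task definition.
genome:
  models:
    - CultriX/Qwen2.5-14B-Wernicke
    - CultriX/SeQwence-14Bv1
  merge_method: dare_ties
  base_model: Qwen/Qwen2.5-14B
  layer_granularity: 8          # evolve parameters per 8-layer block
tasks:
  - name: ifeval_like           # hypothetical held-out instruction-following task
    weight: 1.0
```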
Edit: by the way, the best performer so far, by far, has been the first attempt: CultriX/SeQwence-14Bv1
Interesting! CultriX/SeQwence-14Bv1's ifeval performance likely comes from a fresh injection of v000000/Qwen2.5-Lumen-14B. That's one of three models I favor for the 14B ifeval. The other two? tanliboy/lambda-qwen2.5-14b-dpo-test trained on ultrafeedback_binarized (same as Lumen!) and sthenno-com/miscii-14b-1028 trained on HelpSteer2.
If overfitting for the ifeval benchmark is a top concern, I can see good reason to choose Lumen, because it starts with a merge from so many models.
You've hit some great high notes with your dare_ties evolutionary approach. If I'm right in my ideas about how to capture and merge the best features, there should be some benefit to using AgoraMix's recipe, but sticking to Lumen and CultriX/SeQwence-14Bv1 to get ifeval, and CultriX/SeQwence-14B-EvolMerge + Wernicke for reasoning. I'll get started.
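To make the plan concrete before staging it, the flat version would look something like this - a dare_ties sketch with placeholder weights and densities, not the staged AgoraMix recipe itself:

```yaml
# Flat approximation of the plan: Lumen + SeQwence-14Bv1 for ifeval,
# EvolMerge + Wernicke for reasoning. Parameters are placeholders, not tuned values.
merge_method: dare_ties
base_model: Qwen/Qwen2.5-14B
models:
  - model: v000000/Qwen2.5-Lumen-14B          # ifeval
    parameters: { density: 0.6, weight: 0.5 }
  - model: CultriX/SeQwence-14Bv1             # ifeval
    parameters: { density: 0.6, weight: 0.5 }
  - model: CultriX/SeQwence-14B-EvolMerge     # reasoning
    parameters: { density: 0.5, weight: 0.4 }
  - model: CultriX/Qwen2.5-14B-Wernicke       # reasoning
    parameters: { density: 0.5, weight: 0.4 }
dtype: bfloat16
```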
Lamarck-14B-v0.1-experimental, AgoraMix's recipe used to merge Lumen and your models, has passed its initial checks. See what you think!
I've some more experiments I want to try with Lamarck while restricting it to models you're using, to assess the value of the DELLA, SLERP, and other merge methods to come.
I'll be keeping an eye on the results you get with those! Great work so far! :)
Just an idea I came up with if you want to try it out:

```yaml
# Final Hybrid Model: Lamarck-14B (Balanced)
name: lamarck-14b-hybrid
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: Qwen/Qwen2.5-14B-Instruct
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.40
  weight: 0.60
  epsilon: 0.08
  lambda: 0.92
models:
  - model: merges/lamarck-14b-if-della
    parameters:
      density: 0.60
      weight: 0.80
  - model: merges/lamarck-14b-reason-della
    parameters:
      density: 0.70
      weight: 1.00
dtype: bfloat16
out_dtype: bfloat16
---
# Base Model Preparation: Lamarck-14B Base (Hybrid of Qwen Variants)
name: lamarck-14b-base
merge_method: slerp
base_model: Qwen/Qwen2.5-14B
tokenizer_source: base
parameters:
  t: [ 0.00, 0.40, 0.60, 0.80, 0.90 ]
slices:
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 0, 8 ]
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 0, 8 ]
    parameters:
      t: [ 0.40 ]
  - sources:
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 8, 16 ]
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 8, 16 ]
    parameters:
      t: [ 0.60 ]
  - sources:
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 16, 24 ]
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 16, 24 ]
    parameters:
      t: [ 0.80 ]
  - sources:
      - model: Qwen/Qwen2.5-14B
        layer_range: [ 24, 32 ]
      - model: Qwen/Qwen2.5-14B-Instruct
        layer_range: [ 24, 32 ]
    parameters:
      t: [ 0.90 ]
dtype: bfloat16
out_dtype: bfloat16
---
# Instruction Following Module: Lamarck-14B-IF
name: lamarck-14b-if-della
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.30
  weight: 0.50
  epsilon: 0.09
  lambda: 0.95
models:
  - model: CultriX/SeQwence-14Bv1
    parameters:
      density: 0.80
      weight: 1.00
  - model: CultriX/SeQwence-14B-v5
    parameters:
      density: 0.50
      weight: [ 0.20, 0.40, 0.50, 0.60, 0.70 ]
dtype: bfloat16
out_dtype: bfloat16
---
# Reasoning Module: Lamarck-14B-Reason
name: lamarck-14b-reason-della
merge_method: della
base_model: merges/lamarck-14b-base
tokenizer_source: base
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 0.30
  weight: 0.50
  epsilon: 0.08
  lambda: 0.92
models:
  - model: CultriX/Qwen2.5-14B-Wernicke
    parameters:
      density: 0.90
      weight: 1.00
  - model: CultriX/SeQwence-14B-EvolMerge
    parameters:
      density: 0.70
      weight: 0.80
dtype: bfloat16
out_dtype: bfloat16
---
# Final Refinement: Lamarck-14B-Finalize
name: lamarck-14b-finalize
merge_method: ties
base_model: merges/lamarck-14b-hybrid
tokenizer_source: Qwen/Qwen2.5-14B-Instruct
parameters:
  int8_mask: false
  normalize: true
  rescale: false
  density: 1.00
  weight: 1.00
models:
  - model: merges/lamarck-14b-hybrid
dtype: bfloat16
out_dtype: bfloat16
```
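If I'm reading the staging right, this should run either as one multi-document file with mergekit-mega, or stage by stage with mergekit-yaml (base first, then the IF and reason modules, then the hybrid and finalize steps), since each later merge reads the merges/ outputs of the earlier ones.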
Interesting! I hadn't thought to try a merge for the base model. I have to test it, to see if it behaves as I'd expect with re-instructs at the TIES merge in the end. After all, I want to try Coder merges later on.
I had tried using DELLA for the merge of the IF and reason modules, but DELLA can be less forgiving than SLERP with gradients on weight and density. I do want to re-emphasize high-ranked weights though. Very interesting feedback, thank you!
I don't know if it could help you in your research, but I ran a ton of benchmarks on several of these models to find their strengths and weaknesses, and you can find those here:
https://privatebin.net/?fc8f0f093a7fadc3#6EAp57vYDeKjeFS6BVX7pZmj3vqETMnDvg27tPwqn4hj
password is: benchmarks-hf
Got it. This is fascinating! I'm gratified by your inclusion of Lamarck along with its ancestors and other models of yours - and wow. Lamarck is closest to SeQwence-14B-EvolMerge: versus that, it has a 2.7% boost to Winogrande, and 0.3% drops on MMLU and TruthfulQA. Otherwise, they score identically!
However, you have some interesting models I didn't include in the merge - SeQwence-14B-EvolMergev1 has a 0.6% advantage over Lamarck on Winogrande, which is Lamarck's strongest gain, and only loses to Lamarck very slightly on MMLU and TruthfulQA.
I have to hand it to you, CultriX/SeQwence-14Bv1 is a fine model. Lamarck's approach didn't capture its top-row performance on Hellaswag or Wernicke's on Arc - but overall, this is gratifying, and there's plenty of takeaway. I hope it's helped you!
This is looking better the more I look at it. Lamarck lands very near EvolMerge except in the one area in which it beats all evolution-merged ancestors. That almost has to be because it successfully captured Lumen's advantages without regressions. I believe that advantage can be retained while tuning other parts of the model, and feeding back into the tree.
If you wish to retain this, I suggest splitting the first 2-4 layers from the evolutionary process, while continuing to evolve the others.
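Concretely, that could be a passthrough graft applied after each evolution round - a sketch only, where the donor model, the merges/evolved-candidate path, and the 48-layer count are all assumptions:

```yaml
# Hypothetical graft: keep the first 4 layers from the model whose early layers
# you want to protect, and take the rest from the evolved candidate.
# Adjust [ 4, 48 ] to the checkpoint's actual layer count.
merge_method: passthrough
slices:
  - sources:
      - model: sometimesanotion/Lamarck-14B-v0.1-experimental
        layer_range: [ 0, 4 ]
  - sources:
      - model: merges/evolved-candidate     # hypothetical path to the evolved merge
        layer_range: [ 4, 48 ]
dtype: bfloat16
```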
Based on these encouraging results, I have started another round of Lamarck's merge process, with these selections:
- IF module: Lamarck-14B-v0.1-experimental dominant, SeQwence-14B-EvolMergev1 background
- Reason module: SeQwence-14Bv1 dominant, SeQwence-14B-EvolMergev1 background
Next up, finding a way to get Wernicke's exceptional Arc and GPQA merged with its peers.
You are starting to talk in ways that are quite hard for my amateur/hobbyist brain to follow now (I don't do anything in the field of AI or ML actually haha, also I suck at math 😉), but wouldn't it be an idea to just try and run a not-too-invasive finetune after the merging?
So instead of just trying to squeeze out the absolute highest benchmark scores purely by merging (which reeks of overfitting on the benchmark tasks instead of actually becoming a way better model):
- Create a very strong merge (like yours or my SeQwence-14Bv1)
- Run some RLHF finetuning on it, for example some light human-preference finetuning (with axolotl or LlamaFactory, for instance)
- Use DPO or ORPO to create a LoRA adapter that you can then easily benchmark: if it improved your desired scores, merge it back into the base model; if not, try again (mess around with LoRA alpha, LoRA rank, dataset size, and learning rate to see what works without deteriorating the strong points of the base model). You could even freeze certain strong layers or target only LoRA layers that are relatively "weak" in that area - see the sketch after this list
- (Optionally) after merging the LoRA adapter back into the base model, run evolution merging again, making sure the benchmarks the pre-finetune model did well on are still tested for, but weighted less than your desired new improvements.
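Something like this axolotl config is what I mean - very much a sketch; the dataset, its type key, and all the hyperparameters are just starting points to mess with, not values I've tested on this model:

```yaml
# Hypothetical axolotl QLoRA + DPO sketch; double-check the keys against your
# axolotl version before running.
base_model: CultriX/SeQwence-14Bv1
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj

rl: dpo
datasets:
  - path: Intel/orca_dpo_pairs    # stand-in preference dataset
    split: train
    type: chatml.intel

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 1
learning_rate: 5.0e-6
optimizer: adamw_torch
output_dir: ./seqwence-dpo-lora
```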
With my very limited knowledge, this seems like an idea that is feasible for a normal person (compute- and time-wise), largely automatable, and pretty fast at showing whether results are promising or not.
I mean, yeah, it's more work than simply merging, but I feel it's within reach of what's doable and might actually be easier than looking for the perfect configuration for the perfect merge. Besides, no matter how perfect the merge, it might just need some additional data to truly improve.
I'm a software dev but also a hobbyist in this field, no worries! Quite right that merging only goes so far, and finetuning an adapter is the way to go after merging plateaus. It takes a lot more compute, though, and I think we've got some things left to try yet.
Time to see how Arcee's new Virtuoso Small does as part of the merge!
This is proving to be an interesting merge: sometimesanotion/Lamarck-14B-v0.2-experimental
Like I said, by no means an expert, but I've done some basic finetuning (for example my Wernicke-DPO model, finetuned on a small subset of the up-to-date database I generated) on a single A40 GPU (48 GB of VRAM) using QLoRA/LoRA, which only took about 1-2 hours, so that's less than a dollar on RunPod instances. The reason I mention it here is that I think it could be a good method to target just those layers that are lacking right now (you seem to have identified those pretty well imo) and then try to turn that into an overall solid model using the methods you are already using :)!
(Edit: if I'm talking crap, well, I'm not even a software dev haha - I don't do anything in the IT world in my professional life, so this is all just copying what I see others do and a lot of trial and error! With a few successes so far though, also thanks to @mlabonne )
Labonne's work is definitely inspiring! And thank you for showing what the evolutionary toolkits can do. I'm learning a lot here.
It might surprise you - I'm mostly watching the first and last layers, inspired by work on models with differential attention. That inspired the emphasis on Lumen. I'm still catching up on what's going on in the middle layers. You have really strong entries here - Wernicke alone is worth more study. For that, I'm considering a sequence of model_stock, then DELLA, then breadcrumbs - and at some point refining @jeffmeloy's ner_merging into something capable of processing the assortment of reasoning models while using less memory.
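The first stage of that sequence would look something like this - a sketch only, with the pool membership as my assumption:

```yaml
# Hypothetical stage 1: model_stock averages a pool of reasoning models into a
# stable platform; a DELLA pass would then re-emphasize the strongest members,
# and breadcrumbs would prune low-magnitude and outlier deltas.
name: reason-pool-stock
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B
models:
  - model: CultriX/Qwen2.5-14B-Wernicke
  - model: CultriX/SeQwence-14B-EvolMerge
  - model: VAGOsolutions/SauerkrautLM-v2-14b-DPO
dtype: bfloat16
```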
Earlier, you had a proposed merge recipe based on Lamarck that I haven't followed up on, because there's been one feature of it I'm not sure I understood. That's the SLERP to make a new base model, on a gradient towards instruct.
I've wondered what that'd accomplish compared to using the existing base, but it occurs to me that once one iteratively evolve-merges, this might help TIES select the right signs. Is this why? Or is this to help prevent the need for a re-instruct TIES at the end?
FYI, Lamarck 0.3's results are in! It snagged #1 for 14B BBH and #2 for 14B GPQA, and I think your models played a large part in this.
The BBH result, I attribute to the first part of the YAML recipe, in which a DELLA merge re-emphasized your SeQwence-14B-EvolMerge and Wernicke over a model stock which included them and VAGOsolutions/SauerkrautLM-v2-14b-DPO.
The #2 GPQA result is just a bit below Wernicke's #1, but it clearly inherited more of Wernicke where it counts than other merges which attempted that. This might be useful to you!
However, instruction following fell off a cliff, and I think some changes since the v0.1 you evaluated will explain that.
Coming back to this - I'd like to thank you for the ideas in your proposal! I created a Virtuoso "base" model similar to what you described, and used a LoRA from it to hopefully solve instruction-following issues in Lamarck's v0.5 release. EvolMerge has kept a large role in its reasoning branch, but interestingly, my prose model_stock merge posted some very high GPQA numbers, nearly tied with Wernicke. I hadn't even done any DELLA tweaks or adapters or anything. Just, heads up in case it helps you!
You'll find the size of your contribution in the YAML. I'm hoping this is the release that nails instruction following, reason, and prose altogether. If it does, your base model trick will be part of the reason why.