Can I pick your brain on this merge?
I find this merge methodology intriguing; I hope you don't mind answering a few questions.
What was your intention behind the almost binary merge weightings of those specific tensors? (v,o,up,gate,down)
Also o_proj specifically is weighted with a 'blip/flip' in the second value position. Can you speak to the purpose of that?
Is there a specific reasoning behind merging this way vs. merging via LoRA extraction of both models, or just negative llama?
I appreciate any insight you care to give.
Uhh, I guess I'm going to disappoint you here.
So, the initial config I came up with was
dtype: bfloat16
tokenizer_source: base
merge_method: della_linear
parameters:
  density: 0.5
base_model: SchisandraVA2
models:
  - model: unsloth/Mistral-Small-Instruct-2409
    parameters:
      weight:
        - filter: v_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: o_proj
          value: [1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
        - filter: up_proj
          value: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        - filter: gate_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: down_proj
          value: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        - value: 0
  - model: SchisandraVA2
    parameters:
      weight:
        - filter: v_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: o_proj
          value: [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]
        - filter: up_proj
          value: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        - filter: gate_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: down_proj
          value: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        - value: 1
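(Since you asked about the near-binary weights: as far as I understand mergekit, a list under value is treated as a gradient and interpolated linearly across the layer stack, so those eleven 0s and 1s just carve the layers into bands where one model or the other dominates. A fragment like the one below, with hypothetical numbers purely to illustrate, would give softer transitions between the bands instead of hard switches.)

        - filter: v_proj
          value: [0, 0.25, 1, 1, 1, 1, 1, 1, 1, 0.25, 0]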
I used this config as the final merge step for this model, and it's based on the config of this model. The intention was to get all the smarts of Mistral-Small while retaining the fine-tuning flavor. What a model knows is spread fairly evenly across all of its layers, so it's mostly hopeless to try to pinpoint what any particular weight does. Still, the common belief is that the first and last layers define style, while the middle layers hold knowledge. I have also seen this model, where only the down_proj layers were selected for applying LoRA because "we were trying to influence the model's language generation style without significantly changing its underlying knowledge and reasoning capabilities."
So I just edited the initial merge config to whatever felt right.
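If you wanted to test the down_proj idea in isolation, a stripped-down config might look like the sketch below. The model names are placeholders and I haven't actually run this; it's just the smallest version of "take style from the fine-tune's down_proj, take everything else from the base."

dtype: bfloat16
merge_method: della_linear
parameters:
  density: 0.5
base_model: SomeBaseInstruct      # placeholder name
models:
  - model: SomeStyleFinetune      # placeholder name
    parameters:
      weight:
        - filter: down_proj
          value: 1
        - value: 0
  - model: SomeBaseInstruct
    parameters:
      weight:
        - filter: down_proj
          value: 0
        - value: 1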
Did it work?
Sure it did. In fact, it somehow got more points than the original MS-Instruct on the OpenLLM leaderboard.
Then comes this thing
dtype: bfloat16
tokenizer_source: base
merge_method: nuslerp
parameters:
  nuslerp_row_wise: true
models:
  - model: unsloth/Llama-3.3-70B-Instruct
    parameters:
      weight:
        - filter: v_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: o_proj
          value: [1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1]
        - filter: up_proj
          value: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        - filter: gate_proj
          value: [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]
        - filter: down_proj
          value: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        - value: 0
  - model: Step1
    parameters:
      weight:
        - filter: v_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: o_proj
          value: [0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0]
        - filter: up_proj
          value: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        - filter: gate_proj
          value: [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]
        - filter: down_proj
          value: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
        - value: 1
Different architecture, different merge method. Did it work? Well, from some quick tests, the resulting model felt like a usable version of Sushi-1.4, which is what I was aiming for. So I guess it did work.
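For what it's worth, the per-tensor filters are the noisy part here; stripped of them, nuslerp (as I understand it) is just a spherical interpolation between two models. A bare-bones 50/50 blend with placeholder names would look something like this:

dtype: bfloat16
merge_method: nuslerp
parameters:
  nuslerp_row_wise: true
models:
  - model: ModelA      # placeholder name
    parameters:
      weight: 0.5
  - model: ModelB      # placeholder name
    parameters:
      weight: 0.5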
The previous version of this series of models was created using the aforementioned thing as a base. I really liked how the dialogue felt, but the amount of slop in the narrative was sad. So this config was created by changing the numbers to something that also felt right for my purpose. I don't know if it turned out the way I hoped; the quants just appeared and I haven't tried it yet. I'll make a proper model card as soon as I do.
In summary:
To merge an LLM, you've got to think like an LLM.
I just read a bunch of mergekit configs from more sophisticated authors and generated something similar.
Blind luck, but I'm mostly happy with what I get with this approach.
No disappointment at all; I greatly appreciate the time you took to respond. Thank you. I find this all fascinating, and you've given me a good amount to chew on, especially the model links that served to guide you.