Merging advice?

#2
by sophosympatheia - opened

Hey, Grimulkan. Thanks again for working on this excellent model! I am eager to incorporate it into my merging process, but my first experiment with it last night produced a failed mess of a model. I suspect that's because you've trained Aurelian on top of a modified base version of Llama2 which unfortunately doesn't seem to play nicely with the other models that were trained on the stock Llama2 base model, at least when merged together naively.
Do you have any advice for a mergekit monkey like me who wants to merge your model with stock Llama2 models? I appreciate any you can spare.

That's a good point. When merging models descended from longLORA, you probably want to treat the embed and norm layers differently. I am not sure if mergekit supports that directly.

I can think of two options:

  • Replace the embed/norm layers of all your non-longLORA participants with those from base longLORA (Aurelian itself will not need the replacement), then merge using mergekit normally; OR
  • Merge as normal, but don't retain the blended embed & norm layers. Instead, just replace them with Aurelian's layers after merging. This is probably the easiest option, but I'm unsure if it works.

EDIT: Actually, I'm not sure what mergekit does with embed & norm layers: if it blends them or just keeps them from the first model. I'm guessing it blends them.
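
One way to check empirically is to compare the merged model's embedding weights against each input model. A rough sketch (this assumes safetensors-sharded checkpoints; the paths are placeholders):

import json, os
from safetensors import safe_open

def load_embed(model_dir):
    # Look up which shard holds the embedding, then read just that tensor
    with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
        shard = json.load(f)["weight_map"]["model.embed_tokens.weight"]
    with safe_open(os.path.join(model_dir, shard), framework="pt") as f:
        return f.get_tensor("model.embed_tokens.weight")

merged = load_embed("path/to/merged_model")
parent = load_embed("path/to/first_input_model")
print((merged - parent).abs().max())  # ~0 means mergekit kept this parent's embeddings verbatim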

Some useful code snippets (these assume the model is already loaded via .from_pretrained()):
To extract the embed & norm layers from any model and save them to a file called trainable_params.bin in the folder given by trainable_params_dir:

import os
import torch

# Hack to save trainable params
modules_to_save = ["embed", "norm"]
state_dict = model.state_dict()  # use trainer.model if it is a newly trained model coming out of the trainer class
to_save = {}
for key, value in state_dict.items():
    if any(module_name in key for module_name in modules_to_save):
        to_save[key.replace("base_model.model.", "")] = value
torch.save(to_save, os.path.join(trainable_params_dir, "trainable_params.bin"))

To load (and replace) those layers onto any other model:

# trainable_params_dir is wherever trainable_params.bin was saved
state_dict = torch.load(os.path.join(trainable_params_dir, "trainable_params.bin"), map_location=model.device)
model.load_state_dict(state_dict, strict=False)

Note that for the first bullet, if you use the original LongLORA base, it has the pre-extracted trainable_params.bin, but the vocab may not match. My modified version of the base will have the same vocab as base Llama (and Aurelian is derived from this modified base), but you need to extract trainable_params.bin yourself using the above code. But maybe mergekit already knows what to do with a vocab count mismatch (e.g., Nous Hermes 70B also has added tokens).

If you extract trainable_params.bin from Aurelian for the method in the first bullet above, it should be identical to the approach in the second bullet (only slower). So first bullet makes sense if you want to combine the embed & norm from original longLORA with those from Aurelian. Second bullet is more if you want to just use Aurelian's. I don't know the effect.

I haven't tested all this, just my thoughts. Let me know how it goes! If it is still broken, maybe these longLORA models need a small amount of fine-tuning to glue everything, which I could do, but it would make it quite annoying to merge and experiment. So I hope the above workarounds work.

If you don't want to write some code, let me know and I could give you a python script to do the above things (basically same code as above with the proper scaffolding and imports).

My friend, you have provided so much more than I expected in your response. Thank you for taking the time to explain everything in such depth.

Do you think the easy way out here might be to use mergekit's tokenizer source settings? When I did my merge, I didn't specify a strategy so I think it just copied everything over from Llama2. I can try the other two approaches, either pulling from Aurelian or having mergekit try to make a union, and report back what happens.

If that doesn't work, I think you gave me enough information (and example code!) to try the method you described. Thank you so much!

The issue of tokenizer vocab mismatch is separate from the issue of merging embed/norm layers. Yes, you could try union and see if that helps.

I only mentioned tokenizer vocab size because original longLORA has an extra added pad token (Aurelian does not). But other non-longLORA models also have this, so yeah, union seems like it was designed for that case.

But if you were already merging models with the same vocab, like base Llama and Aurelian, or Aurelian and lzlv, then the issue is something else (it could be the embed/norm layers like I mentioned).

I was testing lzlv and found a third option to prepare any model for merging with Aurelian (or any other future longLORA models):

  • Extract the embed/norm weights from the modified longLORA base and replace the corresponding ones in lzlv (or whichever model you want to merge with Aurelian).
  • Merge the original LongLORA LORA onto lzlv (adapter_model.bin).
    • You can also directly use the trainable_params.bin from that repo to merge along with the LORA, but the tokenizer vocab size may not match (they have an extra row).

With this method, I was able to get 32K versions of lzlv and Euryale to work, with no fine-tuning. They worked 'fine' with rope scaling 8 up to 32K (well, at least they weren't garbage). That tells me it is totally possible to merge longLORA into an existing model to give it 32K capabilities to an extent, and it will probably make it better for merging with Aurelian as well.

EDIT: I missed your previous message where you mentioned merging the original LongLORA adapter_model.bin onto the model after replacing the embed/norm weights. If the script below looks like it's doing what it should for those weights, I can add the remaining code to load the LongLORA PEFT adapter and merge that in before saving the resultant model. (EDIT3: I updated the script with the PEFT code.) What's your take on applying this approach after merging Aurelian with some other models? Do you think it would be better to prep all the Llama2 models before the merge rather than applying it once to the resultant blend after the merge with Aurelian? (EDIT4: I intend to test it both ways, but I'm curious what you think about it.)

EDIT2: A thought just occurred to me. Do you want to make the trainable_params.bin file containing the embed/norm weights extracted from your modified longLORA base available in your HF repo for it? It would save other people having to download the full weights to extract it themselves, and everyone could rest assured that they're working with a version that was extracted properly.

I wrote up a script that implements the code you suggested. I'm including a sample of its output to show you what it is saving to the trainable_params.bin file.

2024-01-21 11:00:08,421 - __main__ - DEBUG - Saving key  model.embed_tokens.weight  with value  tensor([[-0.0004, -0.0012, -0.0011,  ...,  0.0002, -0.0013,  0.0008],
        [-0.0003, -0.0013,  0.0017,  ..., -0.0019,  0.0032,  0.0033],
        [ 0.0050,  0.0029, -0.0038,  ...,  0.0005, -0.0082,  0.0110],
        ...,
        [ 0.0057,  0.0189,  0.0099,  ..., -0.0154,  0.0043, -0.0136],
        [-0.0028, -0.0086,  0.0019,  ...,  0.0176, -0.0158,  0.0172],
        [ 0.0102, -0.0007,  0.0031,  ...,  0.0094, -0.0045,  0.0045]],
       dtype=torch.float16). Saved key name is: model.embed_tokens.weight

Then for each layer it's saving input_layernorm.weight and post_attention_layernorm.weight.

2024-01-21 11:00:08,596 - __main__ - DEBUG - Saving key  model.layers.79.input_layernorm.weight  with value  tensor([0.2903, 0.2993, 0.2812,  ..., 0.2815, 0.1703, 0.1708],
       dtype=torch.float16). Saved key name is: model.layers.79.input_layernorm.weight
2024-01-21 11:00:08,596 - __main__ - DEBUG - Saving key  model.layers.79.post_attention_layernorm.weight  with value  tensor([0.3440, 0.3298, 0.3386,  ..., 0.3645, 0.1852, 0.2449],
       dtype=torch.float16). Saved key name is: model.layers.79.post_attention_layernorm.weight

Finally, it saves the norm.weight.

2024-01-21 11:00:08,596 - __main__ - DEBUG - Saving key  model.norm.weight  with value  tensor([1.1504, 1.0186, 1.1230,  ..., 1.1084, 1.4932, 1.3701],
       dtype=torch.float16). Saved key name is: model.norm.weight

It doesn't appear that the key.replace("base_model.model.", "") code is doing anything. Is that a problem?

Here is the full script. EDIT5: Sorry for so many edits haha. I had to make some fixes.

import os
import argparse
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModelForCausalLM
import torch
import logging
from colorama import Fore

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def get_layers_to_save(model):
    modules_to_save = ["embed", "norm"]
    state_dict = model.state_dict()
    to_save = {}
    for key, value in state_dict.items():
        if any(module_name in key for module_name in modules_to_save):
            key_name = key.replace("base_model.model.", "")
            logger.debug(f"{Fore.GREEN}Saving key  {key}  with value  {value}. Saved key name is: {key_name}{Fore.RESET}")
            to_save[key_name] = value
        else:
            logger.debug(f"Skipping key  {key}  because it is not in the list of modules to save")
    return to_save

def main(args):
   
    if args.dtype == "float16":
        dtype = torch.float16
    elif args.dtype == "float32":
        dtype = torch.float32
    else:
        raise ValueError(f"Please provide an appropriate value for dtype. Value: {args.dtype}")
    
    source_model_dir = os.path.abspath(args.source)
    destination_dir = os.path.abspath(args.destination)
    trainable_params_path = os.path.join(source_model_dir, "trainable_params/trainable_params.bin")

    logger.debug(f"Source dir: {source_model_dir}\nDestination dir: {destination_dir}\nTrainable Params dir: {trainable_params_path}")

    logger.info(f"Trainable param path: {trainable_params_path}")

    # Check if trainable_params.bin exists in the source model directory
    if os.path.exists(trainable_params_path):
        logger.info("Found existing trainable_params.bin. Using this file.")
    else:
        if not os.path.exists(args.source):
            raise FileNotFoundError(f"Source model not found at {args.source}")

        # Load source model
        logger.info(f"Loading source model from {args.source}")
        source_model = AutoModelForCausalLM.from_pretrained(args.source, torch_dtype=dtype)

        # Save the embed and norm layers from the source model
        to_save = get_layers_to_save(source_model)
        
        # Make sure the trainable_params subdirectory exists before saving
        os.makedirs(os.path.dirname(trainable_params_path), exist_ok=True)
        torch.save(to_save, trainable_params_path)
        logger.info(f"trainable_params.bin saved to {trainable_params_path}")
        del source_model # free up resources

    # Load target model and tokenizer
    if not os.path.exists(args.target):
        raise FileNotFoundError(f"Target model not found at {args.target}")
    if not os.path.exists(args.lora):
        raise FileNotFoundError(f"LoRA not found at {args.lora}")
    logger.info(f"Loading target model from {args.target}")
    target_model = AutoModelForCausalLM.from_pretrained(args.target, torch_dtype=dtype)
    tokenizer = AutoTokenizer.from_pretrained(args.target)

    # Load the saved layers onto the target model
    logger.info("Loading state dict...")
    state_dict = torch.load(trainable_params_path, map_location=target_model.device)
    target_model.load_state_dict(state_dict, strict=False)

    # Merge in the LoRA
    logger.info(f"Loading LoRA adapter from {args.lora} and merging with the target model before saving")
    target_model = PeftModelForCausalLM.from_pretrained(target_model, args.lora, dtype=dtype)
    target_model = target_model.merge_and_unload(progressbar=True) 

    # Save the updated model to the destination path, including its tokenizer settings
    logger.info(f"Saving resultant blend to {destination_dir}. This could take a while...")
    target_model.save_pretrained(save_directory=destination_dir, safe_serialization=True, max_shard_size=f"{args.shard_size}MiB")
    tokenizer.save_pretrained(destination_dir)
    logger.info(f"Updated model saved successfully at {destination_dir}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Transfer trainable parameters (embed and norm layers) from a source model to a target model using the Transformers library.")
    parser.add_argument("-s", "--source", required=True, help="Path to or identifier of the source model.")
    parser.add_argument("-t", "--target", required=True, help="Path to or identifier of the target model.")
    parser.add_argument("-d", "--destination", required=True, help="Directory to save the updated model.")
    parser.add_argument('-l', '--lora', required=True, type=str, help="Path to a local directory containing the original LongLORA adapter files (adapter_config.json and adapter_model.bin), e.g. a local download of Yukang/Llama-2-70b-longlora-32k")
    parser.add_argument("--dtype", type=str, default="float16", choices=['float16', 'float32'], help="The torch data type to use for loading the models. Defaults to float16.")
    parser.add_argument("--shard_size", type=int, default=8000, help="Size of shards for saving the model tensors in MiB. Defaults to 8000.")

    args = parser.parse_args()
    main(args)
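
In case it's useful, a hypothetical invocation would look something like this (the script name and paths are placeholders; --lora points at a local copy of the original LongLORA adapter directory):

python replace_and_merge.py \
    --source /models/llama-2-70b-longlora-32k-base \
    --target /models/lzlv-70b \
    --destination /models/lzlv-70b-longlora-32k \
    --lora /models/Llama-2-70b-longlora-32k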

What's your take on applying this approach after merging Aurelian with some other models? Do you think it would be better to prep all the Llama2 models before the merge rather than applying it once to the resultant blend after the merge with Aurelian?

I only tested merging the LORA and embed/norm directly onto lzlv and Euryale, without anything to do with Aurelian. I'd imagine you'd want to do that before merging with Aurelian, as a prep step. But who knows... experiment. Applying the LORA to the resultant merge doesn't seem right, as it would undo part of what Aurelian has done (I'm thinking you don't want it to affect Aurelian).

Code looks right. I don't remember if Aurelian and/or longLORA were saved as bfloat16 or float16. Your debug dump says float16 though.

I'm trying to remember why I stuck that key.replace in there. It's probably left over from my code that saves the model out of the trainer class, which prefixes some of the layer names, and it does no harm when that prefix isn't present. You want the key to look like model.embed_tokens.weight, and it does.

EDIT:

A thought just occurred to me. Do you want to make the trainable_params.bin file containing the embed/norm weights extracted from your modified longLORA base available in your HF repo for it? It would save other people having to download the full weights to extract it themselves, and everyone could rest assured that they're working with a version that was extracted properly.

Yeah, probably a good idea. It's small, I'll just add it to the fp16 repo.

EDIT2: You can also easily just modify trainable_params.bin from the original LongLORA repo as follows, to strip out the extra row without downloading the weights:

state_dict = torch.load(trainable_params_path, map_location=model.device)
#Remove extra [PAD] token (last row)
if state_dict['model.embed_tokens.weight'].shape[0] == 32001: #Check that we have the single extra row
    state_dict['model.embed_tokens.weight'] = state_dict['model.embed_tokens.weight'][:-1, :]
#Do whatever with modified state_dict...

I basically did this + change config.json to rope 8 + remove references to the extra token in the other jsons, for my modified version.
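
For reference, the config.json part of that is roughly the following (a sketch, not the exact edit I made; the path, the max_position_embeddings value, and the vocab_size line are assumptions based on the description above):

import json

config_path = "Llama-2-70b-longlora-32k/config.json"  # placeholder path
with open(config_path) as f:
    config = json.load(f)

config["rope_scaling"] = {"type": "linear", "factor": 8.0}  # 8x linear scaling: 4096 -> 32768
config["max_position_embeddings"] = 32768
config["vocab_size"] = 32000  # match the embedding after stripping the extra [PAD] row

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)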

Thanks for sharing all these tips and tricks. I feel like I'm getting an advanced crash course in Llama2 hacking!

I'll keep you posted on my merge results. I'm trying to blend Midnight Rose with your Aurelian model because I have a good feeling about what that merge could be like.

I uploaded LongLORA merged versions of 70B 32K lzlv, Euryale 1.3 and Aetheria if that's useful. No idea how well these work, but they seem to not be broken at least.

In the past I also tried merging models with LongLORA.

[image: benchmark results comparing LongLORA merge attempts]

I found that merging chat-longlora had the least impact on "creativity" (SP), but it still severely damaged the model.
I used this script; I'm not sure if I did it right, or whether I should redo the tests.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

import os
import argparse

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_model_name_or_path", type=str)
    parser.add_argument("--peft_model_path", type=str)
    parser.add_argument("--output_dir", type=str)
    parser.add_argument("--device", type=str, default="auto")
    parser.add_argument("--push_to_hub", action="store_true")

    return parser.parse_args()

def main():
    args = get_args()

    if args.device == 'auto':
        device_arg = { 'device_map': 'auto' }
    else:
        device_arg = { 'device_map': { "": args.device} }

    print(f"Loading base model: {args.base_model_name_or_path}")
    base_model = AutoModelForCausalLM.from_pretrained(
        args.base_model_name_or_path,
        return_dict=True,
        torch_dtype=torch.float16,
        **device_arg
    )

    print(f"Loading PEFT: {args.peft_model_path}")
    model = PeftModel.from_pretrained(base_model, args.peft_model_path, offload_folder = "offload/", **device_arg)
    print(f"Running merge_and_unload")
    model = model.merge_and_unload()

    tokenizer = AutoTokenizer.from_pretrained(args.base_model_name_or_path)

    if args.push_to_hub:
        print(f"Saving to hub ...")
        model.push_to_hub(f"{args.output_dir}", use_temp_dir=False)
        tokenizer.push_to_hub(f"{args.output_dir}", use_temp_dir=False)
    else:
        model.save_pretrained(f"{args.output_dir}")
        tokenizer.save_pretrained(f"{args.output_dir}")
        print(f"Model saved to {args.output_dir}")

if __name__ == "__main__" :
    main()

@grimulkan
I tried a ties merge between two longLorafied versions of Midnight Rose 70b and Aurelian, using your modified longLora Llama2 model as the base model to receive the deltas, and the end result was a hot mess. Tonight I'm going to quantize the longLora version of Midnight Rose 70b so I can verify that modified version works. (I need to rule out the possibility that I made a mistake with the longLorafication process, although I doubt that's it.) If that model works, then something about the merge process is what toasted the resulting model.
I think I'll try a ties merge / task_arithmetic merge again but using regular Llama2 as the base model. If that doesn't work, I'll explore other merge methods. Maybe a straight linear merge or a slerp merge would be better for this particular use case.
I'll keep you posted.

EDIT: I confirmed this morning that the longLora version of Midnight Rose 70b works! Even if that is all that comes of these experiments, I'll be happy. It's great to see it producing good results out to 12K context. (Just to clarify, 12K is all the context I can fit into my available memory at 4.85bpw. Presumably it can go longer and still hold up.) My initial impression is the quality of the output is holding up too. I'll merge it with Aurelian now and we'll see what I get.

@ChuckMcSneed That script doesn't seem to insert the embed and norm layers, which is what I did for the lzlv, etc., uploads in the prior post (and in the code earlier in this topic). The resulting models at least don't seem completely broken (I checked up to the full 32K context). So perhaps that was the missing piece for merging LongLORA into other base models.

EDIT: @sophosympatheia also seems to have replicated that with Midnight Rose! So it's a start. Hopefully it also unlocks other merging methods with models like Aurelian.

EDIT2: @sophosympatheia Just to confirm which method worked: you merged longLORA LORA and embed+norm layers into Midnight Rose, and that worked? That would be the same I did for lzlv, Euryale, etc.

@grimulkan Yes, I merged the embed+norm layers into Midnight Rose, then applied the longLORA LoRA. It seems to be working, but I still can't get a viable merge with Aurelian.

This time the merge got closer, in the sense that exllamav2's quantization process reported a perplexity score of around 9 at the end, whereas last time it was in the thousands.
However, I can't get the resultant merge to load in Textgen WebUI using the Exllamav2_HF loader or the non-HF Exllamav2 loader. I get a 'safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer' error.

Traceback (most recent call last):
  File "/home/llm/text-generation-webui/modules/ui_model_menu.py", line 213, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/home/llm/text-generation-webui/modules/models.py", line 87, in load_model
    output = load_func_map[loader](model_name)
  File "/home/llm/text-generation-webui/modules/models.py", line 389, in ExLlamav2_HF_loader
    return Exllamav2HF.from_pretrained(model_name)
  File "/home/llm/text-generation-webui/modules/exllamav2_hf.py", line 170, in from_pretrained
    return Exllamav2HF(config)
  File "/home/llm/text-generation-webui/modules/exllamav2_hf.py", line 44, in __init__
    self.ex_model.load(split)
  File "/home/llm/mergequant/exllamav2/exllamav2/model.py", line 248, in load
    for item in f: return item
  File "/home/llm/mergequant/exllamav2/exllamav2/model.py", line 266, in load_gen
    module.load()
  File "/home/llm/mergequant/exllamav2/exllamav2/mlp.py", line 77, in load
    self.down_proj.load()
  File "/home/llm/mergequant/exllamav2/exllamav2/linear.py", line 45, in load
    if w is None: w = self.load_weight()
  File "/home/llm/mergequant/exllamav2/exllamav2/module.py", line 96, in load_weight
    qtensors = self.load_multi(["q_weight", "q_invperm", "q_scale", "q_scale_max", "q_groups", "q_perm"], override_key = override_key)
  File "/home/llm/mergequant/exllamav2/exllamav2/module.py", line 77, in load_multi
    tensors[k] = stfile.get_tensor(key + "." + k, device = self.device())
  File "/home/llm/mergequant/exllamav2/exllamav2/fasttensors.py", line 116, in get_tensor
    f = self.get_cm(device)
  File "/home/llm/mergequant/exllamav2/exllamav2/fasttensors.py", line 107, in get_cm
    f = safe_open(self.filename, framework = "pt", device = device)
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

What's interesting is the error occurs in Exllamav2's fasttensors.py. If I set the "no_use_fast" option with the Exllamav2_HF loader, it produces the exact same error on the same line in fasttensors.py.
The only reason I find that interesting is because I also have issues with the fast tokenizer setting when loading my longLORA version of Midnight Rose, but in that case I can get it to load by using the "no_use_fast" option with the Exllamav2_HF loader. Did you encounter that with your longLORA versions of lzlv and Euryale?

Here is the error I get when I try to load my longLORA Midnight Rose version using the fast tokenizer setting:

Traceback (most recent call last):
  File "/home/llm/text-generation-webui/modules/ui_model_menu.py", line 213, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
  File "/home/llm/text-generation-webui/modules/models.py", line 95, in load_model
    tokenizer = load_tokenizer(model_name, model)
  File "/home/llm/text-generation-webui/modules/models.py", line 119, in load_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 814, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
  File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 124, in __init__
    super().__init__(
  File "/home/llm/.miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)

I'll dig more into this when I have time. I just thought I'd share the results so far.

UPDATE: Copying tokenizer files from a previous version of Midnight Rose fixed the fast tokenizer issue with the longLORA Midnight Rose version. The same trick didn't help with the MR-Aurelian hybrid model.

I didn’t encounter this error with the merges I made. For MR-Aurelian, is it now the same tokenizer error, or does the model produce nonsense?

Can you see if there are any added tokens in MR alone (you can check the accompanying jsons)? The longLORA merge code may not have accounted for that. I could not merge with Nous Hermes directly, for instance, because it had added special tokens: there is no embedding row for such a token on the long-context side. If that's the case with MR as well, then it can get a bit hacky, where we'd need to reconstruct that row somehow without fine-tuning.
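
(One untested hack if that turns out to be the case: give the longLORA embed tensor a matching extra row, e.g. mean-initialized, before doing the replacement, so at least the shapes line up. A sketch, with state_dict being the loaded trainable_params as in the earlier snippets:)

import torch

embed = state_dict["model.embed_tokens.weight"]   # [32000, hidden] on the longLORA side
extra_row = embed.mean(dim=0, keepdim=True)       # crude initialization for the added token
state_dict["model.embed_tokens.weight"] = torch.cat([embed, extra_row], dim=0)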

Update on the benchmarks: I tried merging xwin the right way, and the results are much better than before:
[image: updated benchmark results after re-merging Xwin]
It seems that this type of extension degrades the SP score by ~30%. Not bad, if you consider 4x the context.

Actually, 8x the context?

Good to see. I will look into doing a small amount of 32K fine-tuning to try and glue the model better, and see if the degradation improves.

Edit: Also, uploaded Xwin and Goliath 120b merged with this method. It seems to work.

@grimulkan Nope, no added tokens, but there are some differences in the tokenizer configs that I suspect are causing all the problems. I'll keep you posted if I figure it out. Right now I'm stuck haha.
