Finetuning using LoRA
Hello,
I'm trying to finetune the model using LoRA with the sample code snippet in the model card. However, only the embeddings and in_proj layers are getting updated, despite also supplying x_proj and out_proj as target_modules.
Library versions
- transformers==4.39.0
- torch==2.2.0+cu121
- peft==0.13.2
- causal-conv1d==1.2.0.post2
- mamba-ssm==1.2.0.post1
The code used is as follows:
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = AutoModelForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")
dataset = load_dataset("Abirate/english_quotes", split="train")
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    logging_dir='./logs',
    logging_steps=10,
    learning_rate=2e-3
)
lora_config = LoraConfig(
    r=8,
    target_modules=["x_proj", "embeddings", "in_proj", "out_proj"],
    task_type="CAUSAL_LM",
    bias="none"
)
peft_model = get_peft_model(model, peft_config=lora_config)
# Single-sample forward/backward; mask all but the last token so the loss is
# computed on one position only.
sample_inp = tokenizer(dataset['quote'][0:1], return_tensors="pt")
sample_inp = {k: v.to("cuda") for k, v in sample_inp.items()}
sample_inp["labels"] = sample_inp["input_ids"].clone()
sample_inp["labels"][:, :-1] = -100
peft_model = peft_model.to("cuda")
peft_model.train()
op = peft_model(**sample_inp)
op.loss.backward()
# Print which LoRA parameters received gradients; stop once we reach layer 2.
for name, param in peft_model.named_parameters():
    if 'lora' in name.lower():
        if param.grad is not None:
            print(f'{str(param.grad.shape):<30}{name}')
        else:
            print(f'{str(param.grad):<30}{name}')
    if 'layers.2' in name:
        break
The output is as follows:
torch.Size([8, 50280]) base_model.model.backbone.embeddings.lora_embedding_A.default
torch.Size([768, 8]) base_model.model.backbone.embeddings.lora_embedding_B.default
torch.Size([8, 768]) base_model.model.backbone.layers.0.mixer.in_proj.lora_A.default.weight
torch.Size([3072, 8]) base_model.model.backbone.layers.0.mixer.in_proj.lora_B.default.weight
None base_model.model.backbone.layers.0.mixer.x_proj.lora_A.default.weight
None base_model.model.backbone.layers.0.mixer.x_proj.lora_B.default.weight
None base_model.model.backbone.layers.0.mixer.out_proj.lora_A.default.weight
None base_model.model.backbone.layers.0.mixer.out_proj.lora_B.default.weight
torch.Size([8, 768]) base_model.model.backbone.layers.1.mixer.in_proj.lora_A.default.weight
torch.Size([3072, 8]) base_model.model.backbone.layers.1.mixer.in_proj.lora_B.default.weight
None base_model.model.backbone.layers.1.mixer.x_proj.lora_A.default.weight
None base_model.model.backbone.layers.1.mixer.x_proj.lora_B.default.weight
None base_model.model.backbone.layers.1.mixer.out_proj.lora_A.default.weight
None base_model.model.backbone.layers.1.mixer.out_proj.lora_B.default.weight
I think this is because, when the mamba_ssm and causal_conv1d packages are available, the forward pass of the MambaMixer module uses mamba_inner_fn from the mamba_ssm package. That function takes only the weight tensors of the linear layers and never calls the forward method of torch.nn.Linear. So even though the selected linear layers are wrapped by peft's LoRA, the LoRA matrices never accumulate gradients; I assume they are never added to the computational graph, since mamba_inner_fn only reads the wrapped layers' weight attribute. I think embeddings and in_proj do receive gradients because they are still computed through the standard forward passes of torch.nn.Embedding and torch.nn.Linear rather than through mamba_inner_fn.
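To check this, I registered plain torch forward hooks on the wrapped projections of the first block. If the hypothesis is right, the hooks on x_proj and out_proj should never fire on the CUDA fast path, while the one on in_proj should. A minimal sketch, reusing peft_model and sample_inp from the snippet above:

import torch

# Record which of the LoRA-wrapped projections have their forward() called.
fired = set()

def make_hook(name):
    def hook(module, inputs, output):
        fired.add(name)
    return hook

mixer = peft_model.base_model.model.backbone.layers[0].mixer
handles = [
    mixer.in_proj.register_forward_hook(make_hook("in_proj")),
    mixer.x_proj.register_forward_hook(make_hook("x_proj")),
    mixer.out_proj.register_forward_hook(make_hook("out_proj")),
]

with torch.no_grad():
    peft_model(**sample_inp)

print(fired)  # expected to contain only 'in_proj' if the fast path bypasses the other two

for h in handles:
    h.remove()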
Does anyone have a solution to this that doesn't involve disabling the mamba_inner_fn fast path?
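For reference, the workaround I'd like to avoid looks roughly like the sketch below: it forces MambaMixer onto its slow path so that x_proj and out_proj are called through their regular nn.Linear forward. (I'm assuming the module-level flag is named is_fast_path_available; the name may differ across transformers versions.)

import transformers.models.mamba.modeling_mamba as modeling_mamba

# Force the slow path: MambaMixer.forward checks this module-level flag before
# dispatching to the mamba_inner_fn-based kernel (flag name assumed, see above).
modeling_mamba.is_fast_path_available = False

op = peft_model(**sample_inp)
op.loss.backward()
# With the slow path, the x_proj and out_proj LoRA weights should now receive gradients.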