metadata

library_name: transformers
license: cc-by-nc-4.0
tags:
  - creative-writing
  - creative-writer
  - multiplicative-lora

An experimental model, fine-tuned using the "multiplicative-LoRA" method on c4ai-command-r-v01.

Other experimental models, based on creative-writer-v0.1-alfa-35b, that attempt to encourage more diverse/creative text generation:

creative-writer-v0.1-bravo-35b - Scaled the pre-softmax logits by 1.1 during training (and then reset after training).
[CURRENTLY UPLOADING...] creative-writer-v0.1-charlie-35b - Scaled the pre-softmax logits by 0.9 during training (and didn't reset after training).
[CURRENTLY TRAINING...] creative-writer-v0.1-delta-35b - Trained using Focal Loss with gamma=2 (instead of stock Cross Entropy Loss).
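To make the two training tweaks above concrete, here is a minimal pure-Python sketch of what scaling the pre-softmax logits and swapping Cross Entropy for Focal Loss do to a single toy token distribution. The logits are made up for illustration and the helper names (`softmax`, `cross_entropy`, `focal_loss`) are hypothetical; this is not the training code used for these models.

```python
import math

def softmax(logits, scale=1.0):
    """Softmax over `scale * logits`, computed in a numerically stable way."""
    scaled = [scale * z for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Stock cross-entropy loss for one target token."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0):
    """Focal loss: cross entropy down-weighted by (1 - p)^gamma."""
    p = probs[target]
    return -((1.0 - p) ** gamma) * math.log(p)

toy_logits = [2.0, 1.0, 0.1]  # made-up values, not from the model

# A scale above 1 sharpens the distribution, a scale below 1 flattens it.
for scale in (0.9, 1.0, 1.1):
    print(scale, [round(p, 3) for p in softmax(toy_logits, scale)])

probs = softmax(toy_logits)
print("cross entropy:", cross_entropy(probs, 0))
print("focal (gamma=2):", focal_loss(probs, 0))
```

With gamma=2 the focal term shrinks the loss most on tokens the model already predicts confidently, so those tokens contribute less to the gradient than under plain cross entropy.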

Usage

Use the normal command-r chat template: '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>prompt<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>reply...'.
I suggest using no system prompt with this (and all other Cohere models!), as it writes much better without one IMO...
You must use some small value of min-p with this (and the original c4ai-command-r-v01 model!), or the model will output gibberish!
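As a rough illustration of the two points above, the sketch below builds the prompt string from the template and shows what a min-p filter does to a toy probability distribution. The function names are hypothetical, and `min_p=0.05` is only an assumed example of "some small value"; most inference front-ends expose min-p as a sampler setting, so you would not normally implement it yourself.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the command-r chat template (no system prompt)."""
    return (
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        + user_message
        + "<|END_OF_TURN_TOKEN|>"
        + "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

def min_p_filter(probs, min_p=0.05):
    """Drop tokens whose probability is below min_p * max(probs), then renormalise."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

print(build_prompt("Write the opening paragraph of a gothic novel."))

# Toy distribution: the 0.01 tail token falls below 0.05 * 0.5 and is removed.
print(min_p_filter([0.5, 0.3, 0.19, 0.01]))
```

Whether a BOS token also needs to be prepended depends on the tokenizer or backend in use; that is not shown here.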

The "multiplicative-LoRA" method

Uses:

h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x

or equivalently:

h = tensor @ x

h' = h + lora_B @ lora_A @ h

instead of the normal "additive-LoRA" method of:

h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x

I only apply this to the down_proj matrices, and skip the last layer's down_proj matrix in the same way as creative-writing-control-vectors-v3.0.
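The formulas above can be checked numerically with a small NumPy sketch. The sizes are toy values chosen for illustration, `W` stands in for `tensor`, and `additive_A` is a hypothetical separate matrix used only for the additive comparison; the key point is the shape of `lora_A`, which is `(r, d_out)` in the multiplicative case and `(r, d_in)` in the additive case.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2  # toy sizes, not the real model dimensions

W = rng.normal(size=(d_out, d_in))    # stands in for `tensor` (the down_proj weight)
lora_A = rng.normal(size=(r, d_out))  # multiplicative: reads the layer's *output*
lora_B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

# Two-step form: h = tensor @ x, then h' = h + lora_B @ lora_A @ h
h = W @ x
h_two_step = h + lora_B @ (lora_A @ h)

# One-step form: h' = (I + lora_B @ lora_A) @ tensor @ x
h_one_step = (np.eye(d_out) + lora_B @ lora_A) @ W @ x

print(np.allclose(h_two_step, h_one_step))

# Additive LoRA for comparison: here lora_A reads the layer's *input*.
additive_A = rng.normal(size=(r, d_in))
h_additive = (W + lora_B @ additive_A) @ x
print(h_additive.shape)
```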

This currently requires hacking PEFT's layer.py like so:

# lora_A now reads the layer's output, so its input size is out_features (not in_features).
#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)

and:

# Feed the base layer's output (result) through the adapter instead of its input (x).
#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)

if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling

Then to merge you need to hack qlora-pipe's merge_lora.py to use:

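# Merge in float32 to limit rounding error, then cast back to the stored dtype.
old_type = tensor.dtype
tensor = tensor.to(torch.float32)
# W' = W + scale * (lora_B @ lora_A) @ W, i.e. the update is applied on the output side.
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)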
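As a sanity check on the merge rule above, the following NumPy sketch confirms that a merged weight reproduces the unmerged adapter output. It is an illustrative stand-in, not qlora-pipe's code: `merge_multiplicative` is a hypothetical helper, and float16 is used only to show the dtype round-trip because NumPy has no bfloat16. With PEFT's usual scaling of lora_alpha / lora_rank, a config with both set to 64 gives `scale = 1.0`.

```python
import numpy as np

def merge_multiplicative(weight, lora_A, lora_B, scale):
    """Return W + scale * lora_B @ lora_A @ W, computed in float32, in weight's dtype."""
    w32 = weight.astype(np.float32)
    # Multiplying lora_A @ W first avoids building a (d_out x d_out) matrix.
    delta = scale * (lora_B.astype(np.float32) @ (lora_A.astype(np.float32) @ w32))
    return (w32 + delta).astype(weight.dtype)

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 4, 2  # toy sizes
scale = 64 / 64           # lora_alpha / lora_rank

W = rng.normal(size=(d_out, d_in)).astype(np.float32)
lora_A = rng.normal(size=(r, d_out)).astype(np.float32)
lora_B = rng.normal(size=(d_out, r)).astype(np.float32)
x = rng.normal(size=d_in).astype(np.float32)

# Unmerged forward pass: h' = h + scale * lora_B @ lora_A @ h
h = W @ x
unmerged = h + scale * (lora_B @ (lora_A @ h))

merged_W = merge_multiplicative(W, lora_A, lora_B, scale)
print(np.allclose(merged_W @ x, unmerged, atol=1e-5))

# The stored dtype is preserved after the float32 round-trip.
print(merge_multiplicative(W.astype(np.float16), lora_A, lora_B, scale).dtype)
```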

The rationale behind the "multiplicative-LoRA" and the link to control-vectors

There are actually 3 existing "multiplicative-LoRA" methods in PEFT/tuners:

but as explained in this conceptual guide:

all 3 methods deliberately maintain orthogonality, and thus are more restrictive in the types of transformations they can perform (i.e. Rotations and/or Improper Rotations only; with no scaling and/or shear possible...).

For example, these can't perform the orthogonal projection needed for "abliteration":

h' = h - v @ v^T @ h

whereas the general (non-orthogonal) "multiplicative-LoRA" method can do this by choosing to set u = -v like so:

h' = h + u @ v^T @ h
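The projection can be reproduced with a rank-1 multiplicative-LoRA in a few lines of NumPy. This is a toy sketch with a random unit vector `v` (an assumption for illustration, not a real refusal direction); it also checks that the resulting matrix is not orthogonal, which is why the orthogonality-preserving methods cannot represent it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # toy hidden size

v = rng.normal(size=d)
v /= np.linalg.norm(v)  # unit-length direction to remove
h = rng.normal(size=d)

# Direct orthogonal projection: h' = h - v @ v^T @ h
h_proj = h - v * (v @ h)

# Same thing as a rank-1 multiplicative-LoRA with u = -v:
lora_A = v[None, :]   # shape (1, d): detects the direction
lora_B = -v[:, None]  # shape (d, 1): subtracts it back out
h_lora = h + lora_B @ (lora_A @ h)

print(np.allclose(h_proj, h_lora))  # both forms agree
print(np.isclose(v @ h_lora, 0.0))  # no component along v remains

# The combined matrix I - v v^T is a projection, so it is not orthogonal.
M = np.eye(d) + lora_B @ lora_A
is_orthogonal = np.allclose(M.T @ M, np.eye(d))
print(is_orthogonal)
```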

In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":

Each vector in lora_A looks for a certain direction, and via the dot-product it generates a (signed) weighting factor that measures the similarity between that direction and the output of the down_proj transformation.
Each corresponding vector in lora_B then gets added to the hidden state / residual stream based on the corresponding weighting factor.

So instead of having just a single vector that we add (in essence we add a bias term and create an affine transformation), we now have many different control vectors that can be added (stored in lora_B), based on how well they match another set of "directional detection vectors" (stored in lora_A).
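The "conditional control-vector" reading can be written out explicitly: the update is a sum of the columns of lora_B, each weighted by the dot-product of the matching row of lora_A with the hidden state. A small NumPy sketch with toy sizes and random values (all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3  # toy hidden size and rank

lora_A = rng.normal(size=(r, d))  # rows: "directional detection vectors"
lora_B = rng.normal(size=(d, r))  # columns: the control vectors to add
h = rng.normal(size=d)            # output of down_proj

# One signed weighting factor per detector.
weights = lora_A @ h

# Add each control vector in proportion to its detector's weight.
update = sum(w * lora_B[:, i] for i, w in enumerate(weights))

print(np.allclose(update, lora_B @ lora_A @ h))

# A plain control vector, by contrast, is the same offset for every h.
control_vector = rng.normal(size=d)
h_plain = h + control_vector
h_conditional = h + update
```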

NOTE: LoRA+ uses a similar way of viewing the purpose of lora_A and lora_B, but there lora_A looks at the input to the down_proj transformation (for "additive-LoRAs"), instead of its output as the "multiplicative-LoRA" method does...

Training

Took just under 4 days using dual-A6000 GPUs connected via NVLink, using qlora-pipe.
The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same dataset_combination_mode = 'concatenate' as Llama-3-70B-Instruct-Storywriter.
I used the same sequence_len = 8192 and batch_size_tokens = 8192 as Llama-3-70B-Instruct-Storywriter.

`config_creative_writer.toml`

# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'

# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38'  # skip last layer

# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192

# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Resume a prior run
resume_from_checkpoint = false

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1

[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01

[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.01

`ds_creative_writer.json`

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
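For a rough sense of scale, the two config files above imply the following back-of-the-envelope numbers. This assumes each micro-batch holds about batch_size_tokens tokens and that, with two GPUs and two pipeline stages, there is a single data-parallel replica; both are assumptions about how the trainer schedules work, and the ~180M token count is itself approximate.

```python
# Values taken from the configs and text above.
batch_size_tokens = 8192
gradient_accumulation_steps = 16
num_gpus = 2
pipeline_stages = 2
dataset_tokens = 180_000_000  # "~180M tokens", approximate
eval_size = 0.01

# Assumption: GPUs not used for pipeline stages act as data-parallel replicas.
data_parallel = num_gpus // pipeline_stages

tokens_per_step = batch_size_tokens * gradient_accumulation_steps * data_parallel
approx_train_tokens = dataset_tokens * (1 - eval_size)
approx_steps = approx_train_tokens / tokens_per_step

print("tokens per optimizer step:", tokens_per_step)
print("approximate optimizer steps for 1 epoch:", round(approx_steps))
```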