---
library_name: transformers
license: cc-by-nc-4.0
tags:
- creative-writing
- creative-writer
- multiplicative-lora
---
An experimental model, fine-tuned using the ["multiplicative-LoRA" method](#the-multiplicative-lora-method) on [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).

Other experimental models, based on `creative-writer-v0.1-alfa-35b`, that attempt to encourage more diverse/creative text generation:

- [creative-writer-v0.1-bravo-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-bravo-35b) - Scaled the pre-softmax logits by `1.1` during training (and then reset after training).
- **[CURRENTLY UPLOADING...]** [creative-writer-v0.1-charlie-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-charlie-35b) - Scaled the pre-softmax logits by `0.9` during training (and didn't reset after training).
- **[CURRENTLY TRAINING...]** [creative-writer-v0.1-delta-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-delta-35b) - Trained using [Focal Loss](https://arxiv.org/abs/1708.02002) with `gamma=2` (instead of stock [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
|
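To make the two training tweaks listed above more concrete, here is a minimal, dependency-free sketch. It is not the training code used for these models: the logits are made up, and `softmax`, `entropy`, `cross_entropy` and `focal_loss` are illustrative helpers. It only shows that scaling the pre-softmax logits by more than `1` sharpens the distribution the loss sees (and less than `1` flattens it), and that Focal Loss with `gamma=2` down-weights tokens the model already predicts confidently.

```python
import math

def softmax(logits, scale=1.0):
    """Softmax over `logits` after multiplying them by `scale`."""
    scaled = [scale * z for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats (higher means a flatter distribution)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cross_entropy(p_target):
    """Stock cross-entropy for one token, given the probability of the target."""
    return -math.log(p_target)

def focal_loss(p_target, gamma=2.0):
    """Focal loss: cross-entropy multiplied by (1 - p_target) ** gamma."""
    return (1.0 - p_target) ** gamma * cross_entropy(p_target)

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values, for illustration only

for scale in (0.9, 1.0, 1.1):
    print(f"scale={scale}: entropy={entropy(softmax(logits, scale)):.4f}")

for p in (0.1, 0.5, 0.9):
    print(f"p={p}: CE={cross_entropy(p):.4f}  focal={focal_loss(p):.4f}")
```

The entropy falls as the scale rises, and the focal loss is always below the cross-entropy, with the gap widest for high-probability targets.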
|
|
---

# Usage

- Use the normal `command-r` chat template: `'<|START_OF_TURN_TOKEN|><|USER_TOKEN|>prompt<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>reply...'`.
- I suggest using **no system prompt** with this (and all other `Cohere` models!), as it writes *much* better without it IMO...
- You ***MUST*** **use some (small) value of min-p** with this, such as `0.01` (and with the original `c4ai-command-r-v01` model), **or else the model will output gibberish!**
|
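As a rough illustration of the points above, the sketch below builds a single-turn prompt with no system prompt and shows what a min-p filter does to a next-token distribution. Both `build_prompt` and `min_p_filter` are hypothetical helpers written for this example, the probabilities are made up, and BOS-token handling is assumed to be left to your inference backend. In practice you would just set the min-p option your backend exposes.

```python
def build_prompt(user_message: str) -> str:
    """Format a single-turn command-r prompt with no system prompt."""
    return (
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        + user_message
        + "<|END_OF_TURN_TOKEN|>"
        + "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

def min_p_filter(probs: dict, min_p: float = 0.01) -> dict:
    """Keep tokens with probability >= min_p * (top probability), then renormalise."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

prompt = build_prompt("Write the opening paragraph of a ghost story.")
print(prompt)

# Made-up next-token distribution with a small tail of junk tokens.
probs = {"the": 0.60, "a": 0.30, "its": 0.095, "@@": 0.004, "##": 0.001}
filtered = min_p_filter(probs, min_p=0.01)
print(filtered)  # only "the", "a" and "its" survive the 0.006 threshold
```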
|
|
---

# The "multiplicative-LoRA" method

Uses:

`h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x`

or equivalently:

`h = tensor @ x`

`h' = h + lora_B @ lora_A @ h`

instead of the normal "additive-LoRA" method of:

`h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x`
|
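The formulas above can be checked numerically with a toy example. This is only a sketch in plain Python lists (so it runs without any dependencies); the sizes and values are made up, and `matmul` and `add` are small helpers defined here. It confirms that the one-shot and two-step multiplicative forms agree, and highlights that `lora_A` has `out_features` columns in the multiplicative case but `in_features` columns in the additive case.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Toy sizes: in_features = 3, out_features = 2, rank r = 1 (all values made up).
tensor = [[1, 2, 0],
          [0, 1, 3]]            # out x in
x = [[1], [2], [1]]             # in x 1 column vector
identity = [[1, 0],
            [0, 1]]             # out x out

# Multiplicative-LoRA: lora_A is r x out, because it reads the *output* h.
lora_B = [[1], [2]]             # out x r
lora_A_mul = [[3, -1]]          # r x out
h = matmul(tensor, x)
one_shot = matmul(matmul(add(identity, matmul(lora_B, lora_A_mul)), tensor), x)
two_step = add(h, matmul(matmul(lora_B, lora_A_mul), h))

# Additive-LoRA: lora_A is r x in, because it reads the *input* x.
lora_A_add = [[1, 1, 0]]        # r x in
additive = matmul(add(tensor, matmul(lora_B, lora_A_add)), x)

print(one_shot, two_step, additive)
```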
|
|
I only applied this to the `down_proj` matrices, and skipped the last layer's `down_proj` matrix in the same way as [creative-writing-control-vectors-v3.0](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0).

This currently requires hacking [PEFT's layer.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py) like so:

```python
# lora_A now reads the layer's output, so its input width is out_features (not in_features).
#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
```

and:

```python
# Feed the base layer's output (result), rather than its input (x), through the adapter.
#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)

if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling
```

Then to merge you need to hack [qlora-pipe's merge_lora.py](https://github.com/tdrussell/qlora-pipe/blob/main/merge_lora.py) to use:

```python
# Merge in float32 to limit rounding error, then cast back to the original dtype.
old_type = tensor.dtype
tensor = tensor.to(torch.float32)
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)
```
|
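As a sanity check on the merge rule above, the following sketch compares the unmerged adapter (`h' = h + scale * lora_B @ lora_A @ h`) against a weight merged as `tensor + scale * lora_B @ lora_A @ tensor`. It uses plain Python lists and made-up values rather than real checkpoints; `matmul`, `scaled_add` and the `scale = 0.5` value are assumptions chosen for the example, not taken from `merge_lora.py`.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def scaled_add(Y, scale, X):
    """Return Y + scale * X for equally-shaped matrices."""
    return [[y + scale * v for y, v in zip(ry, rx)] for ry, rx in zip(Y, X)]

scale = 0.5                       # hypothetical value, exact in binary floating point
tensor = [[1.0, 2.0, 0.0],
          [0.0, 1.0, 3.0]]        # down_proj-like weight, out x in
lora_B = [[1.0], [2.0]]           # out x r
lora_A = [[3.0, -1.0]]            # r x out
x = [[1.0], [2.0], [1.0]]

delta = matmul(lora_B, lora_A)    # out x out

# Unmerged adapter: h' = h + scale * lora_B @ lora_A @ h
h = matmul(tensor, x)
unmerged = scaled_add(h, scale, matmul(delta, h))

# Merged weight: tensor' = tensor + scale * lora_B @ lora_A @ tensor
merged_tensor = scaled_add(tensor, scale, matmul(delta, tensor))
merged = matmul(merged_tensor, x)

print(merged_tensor)
print(unmerged, merged)  # both outputs should be identical
```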
|
|
---

# The "multiplicative-LoRA" method's link to control-vectors (and "abliteration")

There are actually 3 existing "multiplicative-LoRA" methods in [PEFT/tuners](https://github.com/huggingface/peft/tree/main/src/peft/tuners):

- https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft (https://arxiv.org/abs/2306.07280)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/boft (https://arxiv.org/abs/2311.06243)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra (https://arxiv.org/abs/2405.17484)

but as explained in [this conceptual guide](https://github.com/huggingface/peft/blob/main/docs/source/conceptual_guides/oft.md):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/AQ_m88vjvYXZwesZxrJDj.png)

all 3 methods *deliberately* maintain [orthogonality](https://en.wikipedia.org/wiki/Orthogonal_matrix), and thus are more restrictive in the types of transformations they can perform (i.e., [Rotations](https://en.wikipedia.org/wiki/Rotation) and/or [Improper Rotations](https://en.wikipedia.org/wiki/Improper_rotation) only, with no scaling or shear transformations possible...).

For example, these can't perform the orthogonal projection needed for ["abliteration"](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction) (with `v` a unit vector):

`h' = h - v @ v^T @ h`

whereas the general (non-orthogonal) "multiplicative-LoRA" method can (in theory) do this by choosing to set `u = -v` like so:

`h' = h + u @ v^T @ h`
|
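A tiny numeric sketch of the `u = -v` case follows. It assumes `v` has unit length and uses made-up vectors; in rank-1 terms, `v` plays the role of the single row of `lora_A` and `u` the single column of `lora_B`. After the update, the hidden state has (numerically) no component left along `v`.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

v = [0.6, 0.8, 0.0]        # unit-length direction to remove (made-up values)
u = [-vi for vi in v]      # u = -v
h = [1.0, 2.0, 3.0]        # made-up hidden state

weight = dot(v, h)         # v^T @ h: how much of direction v is present in h
h_new = [hi + ui * weight for hi, ui in zip(h, u)]   # h' = h + u @ v^T @ h

print(h_new)
print(dot(v, h_new))       # approximately zero: the v component is projected out
```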
|
|
In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":

- Each vector in `lora_A` looks for a certain direction, and via the dot-product it generates a (signed) weighting factor that measures the similarity between the output of the `down_proj` transformation and the specific vector in `lora_A`.
- Each corresponding vector in `lora_B` then gets added to the hidden state / residual stream, scaled by the corresponding (signed) weighting factor.

So instead of having just a single vector that we add (in essence, adding a `'.bias'` weight to create an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation)), we now have many different control vectors that can be added (stored in `lora_B`), each weighted by how well the `down_proj` output matches its paired "direction detection vector" (stored in `lora_A`).
|
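The "conditional control-vector" reading can be written out directly as a sum of detector/control pairs. The sketch below is illustrative only: `apply_multiplicative_lora` is a hypothetical helper, and the rank-2 `detectors` (rows of `lora_A`) and `controls` (columns of `lora_B`) are made up so the arithmetic is easy to follow.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def apply_multiplicative_lora(h, detectors, controls):
    """Return (weights, h') where h' = h + sum_i (detectors[i] . h) * controls[i]."""
    weights = [dot(a, h) for a in detectors]      # one signed weight per rank
    h_new = list(h)
    for w, b in zip(weights, controls):
        h_new = [hi + w * bi for hi, bi in zip(h_new, b)]
    return weights, h_new

detectors = [[1, 0, 0], [0, 1, 0]]   # rows of lora_A (made-up)
controls = [[0, 0, 2], [5, 0, 0]]    # columns of lora_B (made-up)

# Only the first detector fires, so only the first control vector is added.
print(apply_multiplicative_lora([3, 0, 1], detectors, controls))
# → ([3, 0], [3, 0, 7])

# A negative weight subtracts the first control vector; the second is added once.
print(apply_multiplicative_lora([-2, 1, 0], detectors, controls))
# → ([-2, 1], [3, 1, -4])
```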
|
|
**NOTE**: The [LoRA+](https://arxiv.org/abs/2402.12354) paper uses a similar way of viewing the purpose of `lora_A` and `lora_B`:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/vZ2Gys3huKAWIVe0wz2-q.png)

but whereas `lora_A` looks at the ***input*** to the transformation for "additive-LoRAs", these new (non-orthogonal) "multiplicative-LoRAs" instead use `lora_A` to look at the ***output*** of the (`down_proj`) transformation...

---

# Training

- Took just over 4 days using dual-A6000 GPUs connected via NVLink, using [qlora-pipe](https://github.com/tdrussell/qlora-pipe).
- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` and `dataset_type = 'textfile'` as tdrussell's [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
- I used the same `sequence_len = 8192` and `batch_size_tokens = 8192` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3), but since I only target `down_proj` in a very specific way, I doubt this will affect the usable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
- I used `pipeline_stages = 2` and `"gradient_accumulation_steps": 16` to roughly match the "tokens-per-step" that [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
- I used a much lower learning-rate of `5e-6`, as the `5e-5` value used by [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) dropped the evaluation loss *far* too quickly (likely due to adapting `down_proj` only being "almost convex").
- I set `lora_dropout = 0.0` as it doesn't really make sense to use with `epochs = 1`.
- I left `weight_decay = 0.01`, but I'm not convinced this is really doing anything useful, and it may actually even be harming the adaptation of the early `down_proj` matrices, where the gradient signal is likely to be much weaker.
- I found via experimentation that setting `lora_rank` and `lora_alpha` to a very low value (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)) can cause the training to get stuck at [saddle-points](https://en.wikipedia.org/wiki/Saddle_point), as explained in [this](https://arxiv.org/abs/2402.11867) paper, particularly if using stock SGD instead of Adam.
- In general, I relied mainly on early stopping for regularization and deliberately set out to *undertrain* the model (we can always increase the size of the dataset at a later time...).
|
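The "tokens-per-step" figure mentioned above can be worked out from the settings in the configs. This is back-of-the-envelope arithmetic only: it assumes that two GPUs with `pipeline_stages = 2` means a single data-parallel replica, that each micro-batch is one full 8192-token sequence, and that the "~180M tokens" figure is exact enough for a rough step count.

```python
sequence_len = 8192
micro_batch_size = 1                  # train_micro_batch_size_per_gpu
gradient_accumulation_steps = 16
num_gpus = 2
pipeline_stages = 2

# Assumption: GPUs not used for pipeline stages would form data-parallel replicas.
data_parallel = num_gpus // pipeline_stages

tokens_per_step = (
    sequence_len * micro_batch_size * gradient_accumulation_steps * data_parallel
)
print(tokens_per_step)                # → 131072

dataset_tokens = 180_000_000          # "~180M tokens", approximate
eval_size = 0.01                      # fraction held out for evaluation
approx_steps = dataset_tokens * (1 - eval_size) / tokens_per_step
print(round(approx_steps))            # rough number of optimizer steps in one epoch

lora_rank, lora_alpha = 64, 64
lora_scale = lora_alpha / lora_rank   # standard LoRA scaling factor alpha / r
print(lora_scale)                     # → 1.0
```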
|
|
## `config_creative_writer.toml`

```toml
# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'

# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38' # skip last layer

# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192

# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Resume a prior run
resume_from_checkpoint = false

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1

[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01

[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.01
```

## `ds_creative_writer.json`

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/DcGilkmIa7wBQJIhCWbHP.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/TnsnTqtAd9S3JE8VacxN6.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/Ly3Y4TK1S2TsTCLEslzZ2.png)