---
library_name: transformers
license: cc-by-nc-4.0
tags:
- creative-writing
- creative-writer
- multiplicative-lora
---
An experimental model, fine-tuned using the ["multiplicative-LoRA" method](#the-multiplicative-lora-method) on [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).
Other experimental models, based on `creative-writer-v0.1-alfa-35b`, that attempt to encourage more diverse/creative text generation:
- [creative-writer-v0.1-bravo-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-bravo-35b) - Scaled the pre-softmax logits by `1.1` during training (and then reset after training).
- **[CURRENTLY UPLOADING...]** [creative-writer-v0.1-charlie-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-charlie-35b) - Scaled the pre-softmax logits by `0.9` during training (and didn't reset after training).
- **[CURRENTLY TRAINING...]** [creative-writer-v0.1-delta-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-delta-35b) - Trained using [Focal Loss](https://arxiv.org/abs/1708.02002) with `gamma=2` (instead of stock [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).
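To make the two training tweaks above more concrete, here is a toy, pure-Python sketch of what scaling the pre-softmax logits and swapping Cross Entropy for Focal Loss do to a single token prediction. The helper names (`softmax`, `cross_entropy`, `focal_loss`) and the example logits are made up for illustration; how the scaling and the loss were actually wired into the training code is not shown here, so treat this as an assumption-laden illustration only.

```python
import math

def softmax(logits, scale=1.0):
    """Softmax over `scale * logits` (scale > 1 sharpens, scale < 1 flattens)."""
    scaled = [scale * z for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Stock per-token Cross Entropy: -log(p_t)."""
    return -math.log(probs[target])

def focal_loss(probs, target, gamma=2.0):
    """Per-token Focal Loss: -(1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

logits = [2.0, 1.0, 0.1]                  # hypothetical logits for a 3-token vocabulary
p = softmax(logits)
p_sharp = softmax(logits, scale=1.1)      # the `1.1` scaling
p_flat = softmax(logits, scale=0.9)       # the `0.9` scaling

ce = cross_entropy(p, target=0)
fl = focal_loss(p, target=0, gamma=2.0)   # already-confident tokens are down-weighted
```

Scaling the logits by `s` is the same as sampling at temperature `1/s`, and Focal Loss with `gamma=0` reduces to plain Cross Entropy.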
---
# Usage
- Use the normal `command-r` chat template: `'<|START_OF_TURN_TOKEN|><|USER_TOKEN|>prompt<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>reply...'`.
- I suggest using **no system prompt** with this (and all other `Cohere` models!), as it writes *much* better without it IMO...
- You ***MUST*** **use some (small) value of min-p**, such as `0.01`, with this (and with the original `c4ai-command-r-v01` model), **or else the model will output gibberish!**
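As a rough illustration of the points above, the sketch below builds a single-turn prompt with no system prompt and shows what a min-p filter does to a toy token distribution. `build_prompt` and `min_p_filter` are hypothetical helpers written for this example, not part of any library; the filter follows the usual definition of min-p (drop tokens whose probability is below `min_p` times the top token's probability), and any BOS token is assumed to be handled by your tokenizer or back-end.

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user turn in the command-r chat template, with no system prompt."""
    return (
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>"
        + user_message
        + "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

def min_p_filter(probs, min_p=0.01):
    """Zero out tokens below min_p * max(probs), then renormalise the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

prompt = build_prompt("Write me a short story about a lighthouse keeper.")

# Toy distribution: the two low-probability tail tokens fall below 0.01 * 0.90.
filtered = min_p_filter([0.90, 0.0951, 0.004, 0.0009], min_p=0.01)
```

In practice you would just set the min-p option of whatever sampler you already use; the function above is only meant to show which part of the distribution gets cut.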
---
# The "multiplicative-LoRA" method
Uses:
`h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x`
or equivalently:
`h = tensor @ x`
`h' = h + lora_B @ lora_A @ h`
instead of the normal "additive-LoRA" method of:
`h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x`
I only applied this to the `down_proj` matrices, and skipped the last layer's `down_proj` matrix in the same way as [creative-writing-control-vectors-v3.0](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0).
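The difference between the two forms is easy to check numerically. The NumPy sketch below uses made-up toy sizes and random matrices (nothing here comes from the real model) to confirm that the one-step and two-step multiplicative forms agree, and to show that `lora_A` has a different shape in each method because it reads the output rather than the input of `tensor`.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r = 12, 8, 2   # toy sizes, chosen only for illustration

W = rng.standard_normal((out_features, in_features))   # stands in for `tensor`
x = rng.standard_normal(in_features)

# Multiplicative: lora_A reads the *output* of W, so it is (r, out_features).
A_mul = rng.standard_normal((r, out_features))
B_mul = rng.standard_normal((out_features, r))

h = W @ x
h_two_step = h + B_mul @ (A_mul @ h)                          # h' = h + B A h
h_one_step = (np.eye(out_features) + B_mul @ A_mul) @ W @ x   # (I + B A) W x

# Additive: lora_A reads the *input* of W, so it is (r, in_features).
A_add = rng.standard_normal((r, in_features))
B_add = rng.standard_normal((out_features, r))
h_additive = (W + B_add @ A_add) @ x                          # (W + B A) x

print(np.allclose(h_two_step, h_one_step))
```

The two-step form is the cheaper one to run during training, since it never materialises the `out_features x out_features` product `lora_B @ lora_A`.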
This currently requires hacking [PEFT's layer.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py) like so:
```python
# lora_A now reads the layer's *output*, so its input width is out_features (was in_features).
#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
```
and:
```python
# Feed the base layer's output (`result`) through the adapter instead of its input `x`.
#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)
if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling
```
Then to merge you need to hack [qlora-pipe's merge_lora.py](https://github.com/tdrussell/qlora-pipe/blob/main/merge_lora.py) to use:
```python
old_type = tensor.dtype
tensor = tensor.to(torch.float32)
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)
```
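As a sanity check on the merge rule above, this NumPy sketch compares the unmerged forward pass (adapter applied to the layer's output) with the forward pass through the merged weight. The sizes and random matrices are toy values, and `scale` is assumed to be `lora_alpha / lora_rank` as in standard LoRA.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r = 12, 8, 2   # toy sizes, for illustration only
lora_alpha, lora_rank = 64, 64
scale = lora_alpha / lora_rank            # assumption: standard LoRA scaling

W = rng.standard_normal((out_features, in_features))   # stands in for `tensor`
lora_A = rng.standard_normal((r, out_features))
lora_B = rng.standard_normal((out_features, r))
x = rng.standard_normal(in_features)

# Unmerged: run the base layer, then add the adapter applied to its output.
result = W @ x
unmerged = result + scale * (lora_B @ (lora_A @ result))

# Merged: fold the adapter into the weight, tensor += scale * B @ A @ tensor.
W_merged = W + scale * (lora_B @ lora_A @ W)
merged = W_merged @ x

print(np.allclose(unmerged, merged))
```

Doing the real merge in `float32`, as in the snippet above, avoids accumulating `bfloat16` rounding error in the `lora_B @ lora_A @ tensor` product.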
---
# The "multiplicative-LoRA" method's link to control-vectors (and "abliteration")
There are actually 3 existing "multiplicative-LoRA" methods in [PEFT/tuners](https://github.com/huggingface/peft/tree/main/src/peft/tuners):
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft (https://arxiv.org/abs/2306.07280)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/boft (https://arxiv.org/abs/2311.06243)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra (https://arxiv.org/abs/2405.17484)
but as explained in [this conceptual guide](https://github.com/huggingface/peft/blob/main/docs/source/conceptual_guides/oft.md):
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/AQ_m88vjvYXZwesZxrJDj.png)
all 3 methods *deliberately* maintain [orthogonality](https://en.wikipedia.org/wiki/Orthogonal_matrix), and thus are more restrictive in the types of transformations they can perform (i.e. [Rotations](https://en.wikipedia.org/wiki/Rotation) and/or [Improper Rotations](https://en.wikipedia.org/wiki/Improper_rotation) only; with no scaling or shear transformations possible...).
For example, these can't perform the orthogonal projection needed for ["abliteration"](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction):
`h' = h - v @ v^T @ h`
whereas the general (non-orthogonal) "multiplicative-LoRA" method can (in theory) do this by choosing to set `u = -v` like so:
`h' = h + u @ v^T @ h`
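A small NumPy check of this claim, assuming `v` is a unit vector (the projection formula only removes the whole component along `v` in that case). The rank-1 `lora_A` and `lora_B` below are toy values built by hand rather than trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                  # toy hidden size
v = rng.standard_normal(d)
v /= np.linalg.norm(v)                 # assumption: v has unit length
h = rng.standard_normal(d)

u = -v
lora_A = v[None, :]                    # (1, d): detects the direction v
lora_B = u[:, None]                    # (d, 1): subtracts it again

h_prime = h + lora_B @ (lora_A @ h)    # h' = h + u v^T h = h - v v^T h

M = np.eye(d) + lora_B @ lora_A        # the full transformation I - v v^T

print(abs(v @ h_prime))                # component along v after the projection
print(np.allclose(M.T @ M, np.eye(d))) # orthogonality check
```

The matrix `M` is a projection: applying it twice changes nothing, and it is singular, so no orthogonal (rotation-only) method can represent it.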
In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":
- Each vector in `lora_A` looks for a certain direction, and via the dot-product it generates a (signed) weighting factor that measures the similarity between the output of the `down_proj` transformation and the specific vector in `lora_A`.
- Each corresponding vector in `lora_B` then gets added to the hidden state / residual stream, scaled by the corresponding (signed) weighting factor.
So instead of having just a single vector that we add (in essence, adding a `'.bias'` weight to create an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation)), we now have many different control vectors that can be added (stored in `lora_B`), based on how well they match another set of "direction detection vectors" (stored in `lora_A`).
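The "conditional control-vector" reading can be spelled out as an explicit loop over detector/control-vector pairs. This is a toy NumPy sketch with random matrices and made-up sizes; it only shows that the loop and the matrix form `h + lora_B @ lora_A @ h` are the same computation, and how that differs from a plain control vector that is added unconditionally.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 3                               # toy hidden size and rank
lora_A = rng.standard_normal((r, d))      # rows: "direction detection vectors"
lora_B = rng.standard_normal((d, r))      # columns: control vectors
h = rng.standard_normal(d)                # output of the down_proj transformation

# One signed weighting factor per detector, via the dot-product with h.
weights = lora_A @ h

# Add each control vector, scaled by its detector's weighting factor.
h_prime = h.copy()
for i in range(r):
    h_prime += weights[i] * lora_B[:, i]

# A plain control vector, by contrast, is added regardless of what h looks like.
control_vector = rng.standard_normal(d)
h_biased = h + control_vector
```

Because the weights depend on `h`, flipping the sign of `h` flips the sign of everything that gets added, whereas the plain control vector would be added unchanged.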
**NOTE**: The [LoRA+](https://arxiv.org/abs/2402.12354) paper uses a similar way of viewing the purpose of `lora_A` and `lora_B`:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/vZ2Gys3huKAWIVe0wz2-q.png)
but whereas `lora_A` looks at the ***input*** to the transformation for "additive-LoRAs", these new (non-orthogonal) "multiplicative-LoRAs" instead use `lora_A` to look at the ***output*** of the (`down_proj`) transformation...
---
# Training
- Took just over 4 days using dual-A6000 GPUs connected via NVLink, using [qlora-pipe](https://github.com/tdrussell/qlora-pipe).
- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` and `dataset_type = 'textfile'` as tdrussell's [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
- I used the same `sequence_len = 8192` and `batch_size_tokens = 8192` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3), but since I only target `down_proj` in a very specific way, I doubt this will affect the usable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
- I used `pipeline_stages = 2` and `"gradient_accumulation_steps": 16` to roughly match the "tokens-per-step" that [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) used.
- I used a much lower learning-rate of `5e-6`, as the `5e-5` value used by [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3) dropped the evaluation loss *far* too quickly (likely due to adapting `down_proj` only being "almost convex").
- I set `lora_dropout = 0.0` as it doesn't really make sense to use with `epochs = 1`.
- I left `weight_decay = 0.01`, but I'm not convinced this is really doing anything useful, and it may actually even be harming the adaptation of the early `down_proj` matrices where the gradient signal is likely to be much weaker.
- I found via experimentation that setting `lora_rank` and `lora_alpha` to very low values (as a form of [Spectral Regularization](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter/discussions/2#66524e7eb47c060e536889a3)) can cause the training to get stuck at [saddle-points](https://en.wikipedia.org/wiki/Saddle_point) as explained in [this](https://arxiv.org/abs/2402.11867) paper; particularly if using stock SGD instead of Adam.
- In general, I relied mainly on early stopping for regularization and deliberately set out to *undertrain* the model (we can always increase the size of the dataset at a later time...).
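For a rough sense of scale, the settings above can be turned into a back-of-the-envelope "tokens-per-step" figure. This sketch assumes that `batch_size_tokens` is the token budget of one micro-batch, that one optimizer step consumes `gradient_accumulation_steps` micro-batches per data-parallel replica, and that `eval_size` is the fraction of the dataset held out; those are my reading of the settings rather than something stated here, and the ~180M token count is itself approximate.

```python
# Values taken from the notes above and the configs.
sequence_len = 8192
batch_size_tokens = 8192
gradient_accumulation_steps = 16
num_gpus = 2
pipeline_stages = 2
dataset_tokens = 180_000_000      # "~180M tokens", approximate
eval_size = 0.01

# With both GPUs used as pipeline stages there is a single data-parallel replica.
data_parallel_replicas = num_gpus // pipeline_stages

# Assumption: one micro-batch holds batch_size_tokens worth of sequences.
sequences_per_micro_batch = batch_size_tokens // sequence_len

tokens_per_step = batch_size_tokens * gradient_accumulation_steps * data_parallel_replicas

# Assumption: eval_size is the held-out fraction of the dataset.
train_tokens = round(dataset_tokens * (1 - eval_size))
approx_steps = train_tokens // tokens_per_step

print(tokens_per_step, approx_steps)
```

Under these assumptions a full epoch is on the order of 1,300 to 1,400 optimizer steps of about 131k tokens each, which is an upper bound given the reliance on early stopping.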
## `config_creative_writer.toml`
```toml
# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'
# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38' # skip last layer
# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192
# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'
# Resume a prior run
resume_from_checkpoint = false
# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1
[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01
[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.01
```
## `ds_creative_writer.json`
```json
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 16,
"gradient_clipping": 1.0,
"steps_per_print": 1
}
```
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/DcGilkmIa7wBQJIhCWbHP.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/TnsnTqtAd9S3JE8VacxN6.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/Ly3Y4TK1S2TsTCLEslzZ2.png)