---
library_name: transformers
license: cc-by-nc-4.0
tags:
- creative-writing
- creative-writer
- multiplicative-lora
---

An experimental model, fine-tuned using the ["multiplicative-LoRA" method](#the-multiplicative-lora-method) on [c4ai-command-r-v01](https://huggingface.co/CohereForAI/c4ai-command-r-v01).

Other experimental models, based on `creative-writer-v0.1-alfa-35b`, that attempt to encourage more diverse/creative text generation:

- [creative-writer-v0.1-bravo-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-bravo-35b) - Scaled the pre-softmax logits by `1.1` during training (and then reset after training).
- **[CURRENTLY UPLOADING...]** [creative-writer-v0.1-charlie-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-charlie-35b) - Scaled the pre-softmax logits by `0.9` during training (and didn't reset after training).
- **[CURRENTLY TRAINING...]** [creative-writer-v0.1-delta-35b](https://huggingface.co/jukofyork/creative-writer-v0.1-delta-35b) - Trained using [Focal Loss](https://arxiv.org/abs/1708.02002) with `gamma=2` (instead of stock [Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)).

---

# Usage

- Use the normal `command-r` chat template: `'<|START_OF_TURN_TOKEN|><|USER_TOKEN|>prompt<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>reply...'`.
- I suggest using **no system prompt** with this (and all other `Cohere` models!), as it writes *much* better without one, IMO...
- You **must use some small value of min-p** with this (and the original `c4ai-command-r-v01` model!), or the model will output gibberish! (see the sketch below)
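
For example, a minimal (untested) `transformers` sketch along these lines; whether `min_p` is accepted directly by `generate()` depends on your `transformers` version / inference backend, and the sampling values below are just placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jukofyork/creative-writer-v0.1-alfa-35b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# No system prompt: a single user turn, formatted with the stock command-r chat template
messages = [{"role": "user", "content": "Write the opening chapter of a gothic mystery novel."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample with a small min-p value (placeholder: 0.05) to avoid gibberish
output = model.generate(input_ids, max_new_tokens=512, do_sample=True, min_p=0.05)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```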

---

# The "multiplicative-LoRA" method

Uses:

`h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x`

or equivalently:

`h = tensor @ x`

`h' = h + lora_B @ lora_A @ h`

instead of the normal "additive-LoRA" method of:

`h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x`
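
To make the difference concrete, here is a minimal PyTorch sketch of the two update rules (toy dimensions; the variable names are just for illustration and don't correspond to PEFT internals):

```python
import torch

d_in, d_out, r = 32, 8, 4              # toy dimensions
x = torch.randn(d_in)                  # input to down_proj
W = torch.randn(d_out, d_in)           # frozen down_proj weight ("tensor" above)

# "additive-LoRA": the low-rank update sees the *input* x, so lora_A has shape (r, d_in)
A_add, B_add = torch.randn(r, d_in), torch.randn(d_out, r)
h_additive = W @ x + B_add @ (A_add @ x)

# "multiplicative-LoRA": the update sees the *output* h = W @ x, so lora_A has shape (r, d_out)
A_mul, B_mul = torch.randn(r, d_out), torch.randn(d_out, r)
h = W @ x
h_multiplicative = h + B_mul @ (A_mul @ h)   # == (I + B_mul @ A_mul) @ W @ x
```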

I only apply this to the `down_proj` matrices, and skip the last layer's `down_proj` matrix in the same way as [creative-writing-control-vectors-v3.0](https://huggingface.co/jukofyork/creative-writing-control-vectors-v3.0).

This currently requires hacking [PEFT's layer.py](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py) like so:

```python
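# In LoraLayer.update_layer(): make lora_A act on the layer's *output* dimension instead of its input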
#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)
```

and:

```python
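# In Linear.forward(): feed the base layer's output (`result`) through the LoRA path instead of the input `x`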
#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)

if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling
```

Then to merge you need to hack [qlora-pipe's merge_lora.py](https://github.com/tdrussell/qlora-pipe/blob/main/merge_lora.py) to use:

```python
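# Multiplicative merge: W' = (I + scale * B @ A) @ W, instead of the usual additive W' = W + scale * B @ A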
old_type = tensor.dtype
tensor = tensor.to(torch.float32)
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)
```

---

# The rationale behind the "multiplicative-LoRA" and the link to control-vectors

There are actually 3 existing "multiplicative-LoRA" methods in [PEFT/tuners](https://github.com/huggingface/peft/tree/main/src/peft/tuners):

- https://github.com/huggingface/peft/tree/main/src/peft/tuners/oft (https://arxiv.org/abs/2306.07280)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/boft (https://arxiv.org/abs/2311.06243)
- https://github.com/huggingface/peft/tree/main/src/peft/tuners/hra (https://arxiv.org/abs/2405.17484)

but as explained in [this conceptual guide](https://github.com/huggingface/peft/blob/main/docs/source/conceptual_guides/oft.md):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/AQ_m88vjvYXZwesZxrJDj.png)

all 3 methods *deliberately* maintain [orthogonality](https://en.wikipedia.org/wiki/Orthogonal_matrix), and thus are more restrictive in the types of transformations they can perform (i.e. [Rotations](https://en.wikipedia.org/wiki/Rotation) and/or [Improper Rotations](https://en.wikipedia.org/wiki/Improper_rotation) only; with no scaling and/or shear possible...).

For example, these can't perform the orthogonal projection needed for ["abliteration"](https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction):

`h' = h - v @ v^T @ h`

whereas the general (non-orthogonal) "multiplicative-LoRA" method can do this by choosing to set `u = -v` like so:

`h' = h + u @ v^T @ h`
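
A quick sanity-check sketch (hypothetical values) showing that a rank-1 multiplicative update with `u = -v` reproduces that projection:

```python
import torch

d = 16
h = torch.randn(d)                      # hidden state (output of down_proj)
v = torch.randn(d)
v = v / v.norm()                        # unit-length "refusal direction"

# Abliteration: remove the component of h along v
h_abliterated = h - v * (v @ h)

# The same operation as a rank-1 "multiplicative-LoRA" with u = -v
u = -v
h_multiplicative = h + u * (v @ h)

assert torch.allclose(h_abliterated, h_multiplicative)
```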

In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":

- Each vector in `lora_A` looks for a certain direction, and via the dot-product it generates a (signed) weighting factor that measures the similarity between that direction and the output of the `down_proj` transformation.
- Each corresponding vector in `lora_B` then gets added to the hidden state / residual stream, scaled by that weighting factor.

So instead of having just a single vector that we add (in essence we add a bias term and create an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation)), we now have many different control vectors that can be added (stored in `lora_B`), based on how well they match another set of "directional detection vectors" (stored in `lora_A`).
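
A small sketch of this view (arbitrary toy values): the rank-`r` update decomposes into `r` independent "detect a direction, then add a control vector" pairs:

```python
import torch

d, r = 16, 4
h = torch.randn(d)                 # output of down_proj for one token
A = torch.randn(r, d)              # r "directional detection" vectors (rows of lora_A)
B = torch.randn(d, r)              # r "control vectors" (columns of lora_B)

# Full multiplicative-LoRA update
h_full = h + B @ (A @ h)

# Equivalent sum of r "conditional control-vectors"
h_sum = h.clone()
for i in range(r):
    weight = A[i] @ h              # signed similarity between h and detection vector i
    h_sum = h_sum + weight * B[:, i]

assert torch.allclose(h_full, h_sum, atol=1e-5)
```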

**NOTE**: The [LoRA+](https://arxiv.org/abs/2402.12354) paper uses a similar way of viewing the purpose of `lora_A` and `lora_B`, but there `lora_A` looks at the ***input*** to the `down_proj` transformation (as in "additive-LoRAs"), instead of its ***output*** like the "multiplicative-LoRA" method does...

---

# Training

- Training took just under 4 days on dual A6000 GPUs connected via NVLink, using [qlora-pipe](https://github.com/tdrussell/qlora-pipe).
- The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same `dataset_combination_mode = 'concatenate'` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter).
- I used the same `sequence_len = 8192` and `batch_size_tokens = 8192` as [Llama-3-70B-Instruct-Storywriter](https://huggingface.co/tdrussell/Llama-3-70B-Instruct-Storywriter).

## `config_creative_writer.toml`

```toml
# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-alfa-35b'

# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38'  # skip last layer

# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192

# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Resume a prior run
resume_from_checkpoint = false

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1

[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01

[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/datasets/ebooks/*.txt'
sequence_len = 8192
eval_size = 0.01
```

## `ds_creative_writer.json`

```json
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/DcGilkmIa7wBQJIhCWbHP.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/TnsnTqtAd9S3JE8VacxN6.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/Ly3Y4TK1S2TsTCLEslzZ2.png)