jimmycarter committed · Commit 8f3eb58 · verified · 1 Parent(s): 4f27b38

Upload 3 files

Files changed (3)
  1. README.md +5 -3
  2. lycoris_config.3090.json +56 -0
  3. lycoris_config.h100.json +164 -0
README.md CHANGED
@@ -119,7 +119,7 @@ This part is actually really easy. You just train it on the normal flow-matching
119
 
120
  FLUX models use a text model called T5-XXL to get most of its conditioning for the text-to-image task. Importantly, they pad the text out to either 256 (schnell) or 512 (dev) tokens. 512 tokens is the maximum trained length for the model. By padding, I mean they repeat the last token until the sequence is this length.
121
 
122
- This results in the model using these padding tokens to [store information](https://arxiv.org/abs/2309.16588). When you [visualize the attention maps of the tokens in the padding segment of the text encoder](https://github.com/kaibioinfo/FluxAttentionMap/blob/main/attentionmap.ipynb), you can see that about 10-40 tokens shortly after the last token of the text and about 10-40 tokens at the end of the padding contain information which the model uses to make images. Because these are normally used to store information, it means that any prompt long enough to not have some of these padding tokens will end up with degraded performance.
123
 
124
  It's easy to prevent this by masking out these padding tokens during attention. BFL and their engineers know this, but they probably decided against it because it works as is and most fast implementations of attention only work with causal (LLM) types of padding and so would let them train faster.
125
 
@@ -171,6 +171,8 @@ One problem with diffusion models is that they are [multi-task](https://arxiv.or
171
  2. Implement multi-rank stratified sampling so that during each step the model trained timesteps were selected per batch based on regions, which normalizes the gradients significantly like using a higher batch size would.
172
 
173
  ```py
 
 
174
  alpha = 2.0
175
  beta = 1.6
176
  num_processes = self.accelerator.num_processes
@@ -192,7 +194,7 @@ No one talks about what datasets they train anymore, but I used open ones from t
192
 
193
  ## Training
194
 
195
- I started training for over a month on a 5x 3090s and about 500,000 images. I used a 600m LoKr for this. The model looked okay after. Then, I [unexpectedly gained access to 7x H100s for compute resources](https://rundiffusion.com), so I merged my PEFT model in and began training on a new LoKr with 3.2b parameters.
196
 
197
  ## Post-hoc "EMA"
198
 
@@ -268,7 +270,7 @@ As far as what I think of the FLUX "open source", many models being trained and
268
 
269
  <img src="https://huggingface.co/jimmycarter/LibreFLUX/resolve/main/assets/opensource.png" style="max-width: 100%;">
270
 
271
- I would like to thank [RunDiffusion](https://rundiffusion.com) for the H100 access.
272
 
273
  ## Contacting me and grants
274
 
 
119
 
120
  FLUX models use a text model called T5-XXL to get most of its conditioning for the text-to-image task. Importantly, they pad the text out to either 256 (schnell) or 512 (dev) tokens. 512 tokens is the maximum trained length for the model. By padding, I mean they repeat the last token until the sequence is this length.
121
 
122
+ This results in the model using these padding tokens to [store information](https://arxiv.org/abs/2309.16588). When you [visualize the attention maps of the tokens in the padding segment of the text encoder](https://github.com/kaibioinfo/FluxAttentionMap/resolve/main/attentionmap.ipynb), you can see that about 10-40 tokens shortly after the last token of the text and about 10-40 tokens at the end of the padding contain information which the model uses to make images. Because these are normally used to store information, it means that any prompt long enough to not have some of these padding tokens will end up with degraded performance.
123
 
124
  It's easy to prevent this by masking out these padding tokens during attention. BFL and their engineers know this, but they probably decided against it because it works as is and most fast implementations of attention only work with causal (LLM) types of padding and so would let them train faster.
125
 
 
171
  2. Implement multi-rank stratified sampling so that during each step the model trained timesteps were selected per batch based on regions, which normalizes the gradients significantly like using a higher batch size would.
172
 
173
  ```py
174
+ from scipy.stats import beta as sp_beta
175
+
176
  alpha = 2.0
177
  beta = 1.6
178
  num_processes = self.accelerator.num_processes
 
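Note on the hunk above: the sampling code is cut off by the diff context, so what follows is only a minimal sketch of what multi-rank stratified sampling could look like, not the author's implementation. It assumes each rank owns `batch_size` consecutive strata of the unit interval and draws one uniform point per stratum. The names `stratified_quantiles`, `sample_timesteps`, and the `ppf` parameter are hypothetical.

```python
import random


def stratified_quantiles(process_index, num_processes, batch_size, rng=random):
    """Return one uniform draw per sample, each from its own global stratum."""
    total = num_processes * batch_size   # effective batch size across all ranks
    start = process_index * batch_size   # first stratum owned by this rank
    # Stratum k covers [k / total, (k + 1) / total); jitter inside it.
    return [(start + i + rng.random()) / total for i in range(batch_size)]


def sample_timesteps(process_index, num_processes, batch_size,
                     ppf=lambda p: p, rng=random):
    """Map stratified quantiles through an inverse CDF (identity by default)."""
    quantiles = stratified_quantiles(process_index, num_processes, batch_size, rng)
    return [ppf(p) for p in quantiles]


# 7 ranks x batch size 6 = 42 strata, one sample in each.
rng = random.Random(0)
all_q = sorted(
    q for rank in range(7) for q in stratified_quantiles(rank, 7, 6, rng)
)
```

With `alpha = 2.0` and `beta = 1.6` as in the README, passing `ppf=lambda p: sp_beta.ppf(p, alpha, beta)` would map the quantiles through the Beta inverse CDF. That is presumably why the commit imports `scipy.stats.beta` under the alias `sp_beta`: the local variable `beta = 1.6` would otherwise shadow the import.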
194
 
195
  ## Training
196
 
197
+ I started training for over a month on a 5x 3090s and about 500,000 images. I used a [600m LoKr](https://huggingface.co/jimmycarter/LibreFLUX/blob/main/lycoris_config.3090.json) for this at batch size 1 (effective batch size 5 via DDP). The model looked okay after. Then, I [unexpectedly gained access to 7x H100s for compute resources](https://runware.ai), so I merged my PEFT model in and began training on a new LoKr with [3.2b parameters](https://huggingface.co/jimmycarter/LibreFLUX/blob/main/lycoris_config.h100.json). For the 7x H100 run I ran a batch size of 6 (effective batch size 42 via DDP).
198
 
199
  ## Post-hoc "EMA"
200
 
 
270
 
271
  <img src="https://huggingface.co/jimmycarter/LibreFLUX/resolve/main/assets/opensource.png" style="max-width: 100%;">
272
 
273
+ I would like to thank [RunWare](https://runware.ai) for the H100 access.
274
 
275
  ## Contacting me and grants
276
 
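Note on the first README hunk: the text says the padding problem can be prevented by masking padding tokens out during attention. As a rough illustration of that idea only (plain Python, a single query, no batching), the sketch below sets the score of every padded key to negative infinity before the softmax so those keys receive zero weight. The helper names `padding_mask` and `masked_attention` and all of the numbers are made up for the example; this is not the FLUX or LibreFLUX code.

```python
import math


def padding_mask(num_real_tokens, padded_length):
    """True for real tokens, False for padding positions."""
    return [i < num_real_tokens for i in range(padded_length)]


def masked_attention(query, keys, values, keep):
    """Single-query scaled dot-product attention that ignores masked keys."""
    if not any(keep):
        raise ValueError("at least one key must be unmasked")
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) * scale if kept else float("-inf")
        for key, kept in zip(keys, keep)
    ]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) is 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


# 3 real tokens padded to length 8; padding keys would dominate if unmasked.
keep = padding_mask(3, 8)
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]] + [[5.0, 5.0]] * 5
values = [[1.0], [2.0], [3.0]] + [[100.0]] * 5
out, weights = masked_attention([1.0, 1.0], keys, values, keep)
```

Because the padded positions get exactly zero weight, the output is a mixture of the three real values only, however large the padding keys are.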
lycoris_config.3090.json ADDED
@@ -0,0 +1,56 @@
+ {
+     "algo": "lokr",
+     "multiplier": 1.0,
+     "linear_dim": 1000000,
+     "linear_alpha": 1,
+     "factor": 2,
+     "full_matrix": true,
+     "apply_preset": {
+         "name_algo_map": {
+             "transformer_blocks.[0-7]*": {
+                 "algo": "lokr",
+                 "factor": 4,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[8-15]*": {
+                 "algo": "lokr",
+                 "factor": 5,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[16-18]*": {
+                 "algo": "lokr",
+                 "factor": 10,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[0-15]*": {
+                 "algo": "lokr",
+                 "factor": 8,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[16-23]*": {
+                 "algo": "lokr",
+                 "factor": 5,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[24-37]*": {
+                 "algo": "lokr",
+                 "factor": 4,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "use_scalar": true,
+                 "full_matrix": true
+             }
+         },
+         "use_fnmatch": true
+     }
+ }
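Note on this config: it sets `"use_fnmatch": true`, so the keys of `name_algo_map` are shell-style patterns rather than regular expressions. How LyCORIS applies them internally is not shown in this commit. The sketch below only illustrates how Python's standard `fnmatch` module treats the first pattern, using hypothetical module names.

```python
from fnmatch import fnmatchcase

# First key of name_algo_map in the config above.
PATTERN = "transformer_blocks.[0-7]*"

# Hypothetical module names, chosen only to exercise the pattern.
names = [
    "transformer_blocks.3.attn.to_q",
    "transformer_blocks.8.attn.to_q",
    "transformer_blocks.12.ff.net.0.proj",
    "single_transformer_blocks.3.proj_mlp",
]

# [0-7] matches exactly one character; * then matches everything after it.
matches = {name: fnmatchcase(name, PATTERN) for name in names}
for name, hit in matches.items():
    print(f"{hit!s:5}  {name}")
```

Under plain `fnmatch` rules a bracket expression stands for a single character, so `[0-7]*` also matches block 12 (its first digit is `1`), while block 8 does not match. Whether a two-digit block ends up with the intended entry therefore depends on how the matcher resolves several matching patterns, which this commit does not show.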
lycoris_config.h100.json ADDED
@@ -0,0 +1,164 @@
+ {
+     "algo": "lokr",
+     "multiplier": 1.0,
+     "linear_dim": 1000000,
+     "linear_alpha": 1,
+     "factor": 2,
+     "full_matrix": true,
+     "apply_preset": {
+         "target_module": [
+             "FluxTransformerBlock",
+             "FluxSingleTransformerBlock"
+         ],
+         "name_algo_map": {
+             "transformer_blocks.[0-7].ff*": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[0-7].norm1": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[0-7].norm1_context": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[0-7]*": {
+                 "algo": "lokr",
+                 "factor": 2,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[8-15].ff*": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[8-15].norm1": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[8-15].norm1_context": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[8-15]*": {
+                 "algo": "lokr",
+                 "factor": 2,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[16-18].ff*": {
+                 "algo": "lokr",
+                 "factor": 4,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[16-18].norm1": {
+                 "algo": "lokr",
+                 "factor": 4,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[16-18].norm1_context": {
+                 "algo": "lokr",
+                 "factor": 4,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "transformer_blocks.[16-18]*": {
+                 "algo": "lokr",
+                 "factor": 8,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[0-15].ff*": {
+                 "algo": "lokr",
+                 "factor": 3,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[0-15].norm": {
+                 "algo": "lokr",
+                 "factor": 3,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[0-15]*": {
+                 "algo": "lokr",
+                 "factor": 6,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[16-23].ff*": {
+                 "algo": "lokr",
+                 "factor": 2,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[16-23].norm": {
+                 "algo": "lokr",
+                 "factor": 2,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[16-23]*": {
+                 "algo": "lokr",
+                 "factor": 4,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[24-37].ff*": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[24-37].norm": {
+                 "algo": "lokr",
+                 "factor": 1,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             },
+             "single_transformer_blocks.[24-37]*": {
+                 "algo": "lokr",
+                 "factor": 2,
+                 "linear_dim": 1000000,
+                 "linear_alpha": 1,
+                 "full_matrix": true
+             }
+         },
+         "use_fnmatch": true
+     }
+ }
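Note on the two configs: the README hunk above describes the 3090 run as a 600m LoKr and the H100 run as 3.2b parameters, and the `factor` values in this file are lower than in `lycoris_config.3090.json`. A back-of-the-envelope sketch of why a smaller factor means more trainable parameters follows. It assumes the LoKr update is a Kronecker product of a `factor x factor` matrix and an `(out / factor) x (in / factor)` matrix, both stored in full (as `"full_matrix": true` suggests), and that `factor` divides both dimensions. The real LyCORIS factorization may differ, and the 3072 x 3072 layer shape is just an illustrative choice.

```python
def lokr_param_count(out_features, in_features, factor):
    """Parameters of a Kronecker-factored update under the stated assumptions."""
    if out_features % factor or in_features % factor:
        raise ValueError("this sketch assumes factor divides both dimensions")
    small = factor * factor                                    # first Kronecker factor
    large = (out_features // factor) * (in_features // factor)  # second factor
    return small + large


# Hypothetical 3072 x 3072 linear layer at several factors.
counts = {f: lokr_param_count(3072, 3072, f) for f in (1, 2, 4, 8)}
for factor, count in counts.items():
    print(f"factor={factor}: {count:,} parameters")
```

Under these assumptions, halving the factor roughly quadruples the parameter count of a layer, and `factor: 1` is about as large as the layer's own weight matrix.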