jimmycarter
committed on
Upload 3 files

- README.md +5 -3
- lycoris_config.3090.json +56 -0
- lycoris_config.h100.json +164 -0

README.md
CHANGED
@@ -119,7 +119,7 @@ This part is actually really easy. You just train it on the normal flow-matching

FLUX models use a text model called T5-XXL to get most of their conditioning for the text-to-image task. Importantly, they pad the text out to either 256 (schnell) or 512 (dev) tokens. 512 tokens is the maximum trained length for the model. By padding, I mean they repeat the last token until the sequence is this length.

-This results in the model using these padding tokens to [store information](https://arxiv.org/abs/2309.16588). When you [visualize the attention maps of the tokens in the padding segment of the text encoder](https://github.com/kaibioinfo/FluxAttentionMap/
+This results in the model using these padding tokens to [store information](https://arxiv.org/abs/2309.16588). When you [visualize the attention maps of the tokens in the padding segment of the text encoder](https://github.com/kaibioinfo/FluxAttentionMap/resolve/main/attentionmap.ipynb), you can see that about 10-40 tokens shortly after the last token of the text and about 10-40 tokens at the end of the padding contain information which the model uses to make images. Because these tokens are normally used to store information, any prompt long enough to leave none of them free will end up with degraded performance.

It's easy to prevent this by masking out these padding tokens during attention. BFL and their engineers know this, but they probably decided against it because it works as-is, and most fast attention implementations only support causal (LLM-style) padding, so skipping the mask let them train faster.

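To make the fix concrete, here is a minimal sketch of attention that drops T5 padding tokens from the keys. It assumes a joint [text tokens | image tokens] sequence and PyTorch's `scaled_dot_product_attention`; it is illustrative and not LibreFLUX's actual attention code.

```py
# Minimal sketch, not the model's actual implementation. Assumes q/k/v cover a
# joint [text tokens | image tokens] sequence and that txt_attention_mask comes
# from the T5 tokenizer (1 = real token, 0 = padding).
import torch
import torch.nn.functional as F

def attention_with_padding_mask(q, k, v, txt_attention_mask, num_img_tokens):
    # q, k, v: (batch, heads, txt_len + num_img_tokens, head_dim)
    batch = txt_attention_mask.shape[0]
    keep = torch.cat(
        [
            txt_attention_mask.bool(),
            torch.ones(batch, num_img_tokens, dtype=torch.bool, device=q.device),
        ],
        dim=1,
    )
    # Broadcastable (batch, 1, 1, seq) boolean mask; True means the key at that
    # position may be attended to, so padding positions are simply dropped.
    attn_mask = keep[:, None, None, :]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```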
@@ -171,6 +171,8 @@ One problem with diffusion models is that they are [multi-task](https://arxiv.or
2. Implement multi-rank stratified sampling so that the timesteps trained at each step are selected per batch from stratified regions, which normalizes the gradients significantly, much like a higher batch size would.

```py
+from scipy.stats import beta as sp_beta
+
alpha = 2.0
beta = 1.6
num_processes = self.accelerator.num_processes
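# A possible completion of the snippet above, sketched for illustration only;
# it is not the project's training code, and process_index / batch_size /
# the torch usage are assumptions.
import torch
from scipy.stats import beta as sp_beta

alpha = 2.0
beta = 1.6
num_processes = 7   # e.g. self.accelerator.num_processes
process_index = 0   # e.g. self.accelerator.process_index
batch_size = 6

# One stratum per sample across all ranks, so every optimizer step covers the
# whole timestep range instead of clumping by chance.
total = num_processes * batch_size
edges = torch.linspace(0, 1, total + 1)
u = edges[:-1] + torch.rand(total) * (edges[1:] - edges[:-1])

# Map the stratified uniforms through the Beta(alpha, beta) inverse CDF so the
# sampled timesteps follow the desired density.
timesteps = torch.from_numpy(sp_beta.ppf(u.numpy(), alpha, beta)).float()

# Each rank trains on its own contiguous slice of the stratified draw.
local_timesteps = timesteps[process_index * batch_size:(process_index + 1) * batch_size]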
@@ -192,7 +194,7 @@ No one talks about what datasets they train anymore, but I used open ones from t

## Training

-I started training for over a month on a 5x 3090s and about 500,000 images. I used a 600m LoKr for this. The model looked okay after. Then, I [unexpectedly gained access to 7x H100s for compute resources](https://
+I started training for over a month on 5x 3090s and about 500,000 images. I used a [600m LoKr](https://huggingface.co/jimmycarter/LibreFLUX/blob/main/lycoris_config.3090.json) for this at batch size 1 (effective batch size 5 via DDP). The model looked okay after. Then, I [unexpectedly gained access to 7x H100s for compute resources](https://runware.ai), so I merged my PEFT model in and began training a new LoKr with [3.2b parameters](https://huggingface.co/jimmycarter/LibreFLUX/blob/main/lycoris_config.h100.json). For the 7x H100 run I used a batch size of 6 (effective batch size 42 via DDP).

## Post-hoc "EMA"

@@ -268,7 +270,7 @@ As far as what I think of the FLUX "open source", many models being trained and

<img src="https://huggingface.co/jimmycarter/LibreFLUX/resolve/main/assets/opensource.png" style="max-width: 100%;">

-I would like to thank [
+I would like to thank [RunWare](https://runware.ai) for the H100 access.

## Contacting me and grants

lycoris_config.3090.json
ADDED
@@ -0,0 +1,56 @@
{
  "algo": "lokr",
  "multiplier": 1.0,
  "linear_dim": 1000000,
  "linear_alpha": 1,
  "factor": 2,
  "full_matrix": true,
  "apply_preset": {
    "name_algo_map": {
      "transformer_blocks.[0-7]*": {
        "algo": "lokr",
        "factor": 4,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[8-15]*": {
        "algo": "lokr",
        "factor": 5,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[16-18]*": {
        "algo": "lokr",
        "factor": 10,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[0-15]*": {
        "algo": "lokr",
        "factor": 8,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[16-23]*": {
        "algo": "lokr",
        "factor": 5,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[24-37]*": {
        "algo": "lokr",
        "factor": 4,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "use_scalar": true,
        "full_matrix": true
      }
    },
    "use_fnmatch": true
  }
}
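For context, here is a rough sketch of how a preset like the one above can be applied with the LyCORIS library to wrap the FLUX transformer. It assumes LyCORIS's `create_lycoris` / `LycorisNetwork.apply_preset` interface and the `FluxTransformer2DModel` layout from diffusers (whose `transformer_blocks` / `single_transformer_blocks` names the fnmatch patterns target); it is not the exact training setup used for this run.

```py
import json

import torch
from diffusers import FluxTransformer2DModel
from lycoris import LycorisNetwork, create_lycoris

# Assumption: the repo exposes the transformer under the usual diffusers subfolder.
transformer = FluxTransformer2DModel.from_pretrained(
    "jimmycarter/LibreFLUX", subfolder="transformer", torch_dtype=torch.bfloat16
)

with open("lycoris_config.3090.json") as f:
    config = json.load(f)

# Register the per-block preset (name_algo_map, use_fnmatch, ...) before building
# the wrapper, then create the LoKr network with the top-level defaults.
LycorisNetwork.apply_preset(config["apply_preset"])
lycoris_net = create_lycoris(
    transformer,
    config["multiplier"],
    linear_dim=config["linear_dim"],
    linear_alpha=config["linear_alpha"],
    algo=config["algo"],
    factor=config["factor"],
    full_matrix=config["full_matrix"],
)
lycoris_net.apply_to()                        # swap the LoKr-wrapped layers into the model
trainable_params = lycoris_net.parameters()   # what the optimizer actually updates
```

Roughly speaking, a smaller `factor` leaves a larger dense Kronecker block per layer, so the h100 preset below (mostly factors 1-4) carries far more trainable parameters than this 3090 one (factors 4-10), which is how the two runs end up at roughly 3.2b and 600m parameters respectively.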
lycoris_config.h100.json
ADDED
@@ -0,0 +1,164 @@
{
  "algo": "lokr",
  "multiplier": 1.0,
  "linear_dim": 1000000,
  "linear_alpha": 1,
  "factor": 2,
  "full_matrix": true,
  "apply_preset": {
    "target_module": [
      "FluxTransformerBlock",
      "FluxSingleTransformerBlock"
    ],
    "name_algo_map": {
      "transformer_blocks.[0-7].ff*": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[0-7].norm1": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[0-7].norm1_context": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[0-7]*": {
        "algo": "lokr",
        "factor": 2,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[8-15].ff*": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[8-15].norm1": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[8-15].norm1_context": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[8-15]*": {
        "algo": "lokr",
        "factor": 2,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[16-18].ff*": {
        "algo": "lokr",
        "factor": 4,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[16-18].norm1": {
        "algo": "lokr",
        "factor": 4,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[16-18].norm1_context": {
        "algo": "lokr",
        "factor": 4,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "transformer_blocks.[16-18]*": {
        "algo": "lokr",
        "factor": 8,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[0-15].ff*": {
        "algo": "lokr",
        "factor": 3,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[0-15].norm": {
        "algo": "lokr",
        "factor": 3,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[0-15]*": {
        "algo": "lokr",
        "factor": 6,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[16-23].ff*": {
        "algo": "lokr",
        "factor": 2,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[16-23].norm": {
        "algo": "lokr",
        "factor": 2,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[16-23]*": {
        "algo": "lokr",
        "factor": 4,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[24-37].ff*": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[24-37].norm": {
        "algo": "lokr",
        "factor": 1,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      },
      "single_transformer_blocks.[24-37]*": {
        "algo": "lokr",
        "factor": 2,
        "linear_dim": 1000000,
        "linear_alpha": 1,
        "full_matrix": true
      }
    },
    "use_fnmatch": true
  }
}