Merge pull request #3 from khalidsaifullaah/patch-1
README.md
CHANGED
@@ -1,5 +1,22 @@
## DALL-E Mini - Generate image from text

## Tentative Training Strategy (proposed by Luke and Suraj)

### Data:
* [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m) dataset (already loaded and preprocessed on the TPU VM by Luke).
* [YFCC100M Subset](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md)
* [Conceptual Captions 3M](https://github.com/google-research-datasets/conceptual-captions)

### Architecture:
* Use the Taming Transformers VQ-GAN (with a codebook of 16384 tokens).
* Use a seq2seq (language encoder --> image decoder) model with a pretrained non-autoregressive encoder (e.g. BERT) and an autoregressive decoder (e.g. GPT).

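A rough sketch of how the two architecture pieces could fit together, written in plain Python for readability rather than Flax/JAX. Only the codebook size of 16384 comes from the plan above; the 16x16 token grid, the `BOS` id, and every function name here are assumptions made for illustration.

```python
CODEBOOK_SIZE = 16384  # from the plan: VQ-GAN with 16384 tokens
GRID = 16              # assumption: 16x16 latent grid per image
BOS = CODEBOOK_SIZE    # assumption: one extra id marks the start of the image sequence


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, entry in enumerate(codebook):
        dist = sum((v - e) ** 2 for v, e in zip(vector, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents, codebook):
    """Map a flat list of latent vectors to discrete image tokens."""
    return [nearest_code(vec, codebook) for vec in latents]


def make_decoder_pair(image_tokens):
    """Teacher forcing: the decoder reads BOS + tokens[:-1] and predicts tokens."""
    decoder_input = [BOS] + image_tokens[:-1]
    return decoder_input, image_tokens


# Toy codebook with 4 entries instead of 16384, just to exercise the functions.
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_latents = [[0.1, 0.1], [0.9, 0.8], [0.9, 0.1]]

image_tokens = quantize(toy_latents, toy_codebook)
decoder_input, targets = make_decoder_pair(image_tokens)
tokens_per_image = GRID * GRID  # image sequence length the decoder would generate
```

In this reading, the text encoder (e.g. BERT) would consume the caption tokens, and the autoregressive decoder would attend to its output while predicting `targets` one image token at a time.
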
### Remaining Architecture Questions:
* Should the text encoder be frozen?
* Should the VQ-GAN be finetuned?
* Which text encoder to use (e.g. BERT, RoBERTa)?
* Hyperparameter choices for the decoder (e.g. positional embedding, initialization)

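To make the freezing and finetuning questions concrete, here is a minimal sketch of one possible mechanism: a plain SGD step that skips any parameter whose name starts with a frozen prefix. The parameter names, the prefix scheme, and `sgd_step` itself are hypothetical; a real Flax/JAX setup would likely express this through the optimizer instead.

```python
def sgd_step(params, grads, frozen_prefixes, lr=0.1):
    """Return updated params; names matching a frozen prefix are left unchanged."""
    updated = {}
    for name, value in params.items():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            updated[name] = value  # frozen: ignore the gradient
        else:
            updated[name] = value - lr * grads[name]
    return updated


# Hypothetical scalar "parameters", one per model component.
params = {"text_encoder/w": 1.0, "image_decoder/w": 1.0, "vqgan/w": 1.0}
grads = {"text_encoder/w": 0.5, "image_decoder/w": 0.5, "vqgan/w": 0.5}

# Option A: freeze the text encoder and the VQ-GAN, train only the decoder.
frozen_run = sgd_step(params, grads, frozen_prefixes=("text_encoder/", "vqgan/"))

# Option B: finetune everything.
full_run = sgd_step(params, grads, frozen_prefixes=())
```

Comparing `frozen_run` and `full_run` shows which components move under each choice, so either option can be tried without touching the model code.
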
## TODO

* Experiment with Flax/JAX and set up the TPU instance that we should get shortly