Original text prompts for val_encs.npy
Hello Alexandru, I am loving your diffusion transformer GitHub code and learning a ton from it!
I was wondering what the original text prompts for the CLIP encodings in val_encs.npy were. I've been generating images from them as I run your code, and it would be super helpful to know the original text strings for reference.
Thank you!
Hey @BigBrane - good question - unfortunately I can't locate the original strings. I would recommend regenerating some that are relevant to your use case - you can use the encode_text function here: https://github.com/apapiu/transformer_latent_diffusion/blob/5448c8afabdd3384612c43085740d1079439fa7e/tld/data.py#L28.
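For anyone else landing here, regenerating encodings could look roughly like the sketch below, using the Hugging Face transformers CLIP text encoder. This is a minimal standalone example, not the repo's encode_text helper: the CLIP checkpoint, prompts, and output path are placeholders, and the exact embedding shape the training code expects may differ, so check tld/data.py first.

```python
import numpy as np
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Placeholder CLIP checkpoint - check tld/data.py for the variant the repo actually uses.
model_name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModelWithProjection.from_pretrained(model_name).eval()

# Replace with prompts relevant to your use case.
prompts = [
    "a watercolor painting of a fox in a snowy forest",
    "a photorealistic portrait of an astronaut on Mars",
]

inputs = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = text_encoder(**inputs).text_embeds  # (num_prompts, embed_dim)

np.save("my_val_encs.npy", text_embeds.cpu().numpy())
```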
Awesome, thanks for the tip! On a related note, I was wondering where you downloaded the MJ dataset/prompts you have uploaded here, or whether you scraped them yourself. I thought I might ask before I attempt to use your scripts to convert a dataset I found, in case it is the same one: https://bridges.monash.edu/articles/dataset/Midjourney_2023_Dataset/25038404
In addition, you mentioned in a Reddit comment of yours, as well as in your GitHub README, that you trained on an additional 500k photos on top of the Midjourney images. I would be extremely grateful if you could share the dataset/latents for those here as well. I have been modifying your code while trying to reproduce results similar to the checkpoint model you uploaded, using only the Midjourney dataset you've provided here. While my models seem to pick up the MJ style, they lack a bit of photorealism, and I suspect the missing piece may lie in the data.
Hey @BigBrane, I am fairly sure I used this one for MJ - https://huggingface.co/datasets/wanng/midjourney-v5-202304-clean - only the upscaled ones. The non-upscaled ones would be interesting to use too, but you'd need to split each grid up into 4 separate images (see the sketch below).
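In case it saves someone a step, here is a rough Pillow sketch for cutting a non-upscaled 2x2 preview grid into four tiles, assuming the grid is laid out evenly; the file names are placeholders.

```python
from PIL import Image

def split_grid(path: str) -> list[Image.Image]:
    """Split a 2x2 Midjourney preview grid into four equally sized tiles."""
    grid = Image.open(path)
    w, h = grid.size
    hw, hh = w // 2, h // 2
    boxes = [
        (0, 0, hw, hh),   # top-left
        (hw, 0, w, hh),   # top-right
        (0, hh, hw, h),   # bottom-left
        (hw, hh, w, h),   # bottom-right
    ]
    return [grid.crop(box) for box in boxes]

# Example usage with a placeholder filename:
for i, tile in enumerate(split_grid("mj_preview_grid.png")):
    tile.save(f"mj_preview_grid_{i}.png")
```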
And good question - I don't remember the exact process for the 500k real images - it is mostly various datasets I found on Hugging Face, filtered by aesthetic score.
Here are two datasets:
https://huggingface.co/datasets/zzliang/GRIT
https://huggingface.co/datasets/kakaobrain/coyo-700m
For the COYO one you can filter on aesthetic_score_laion_v2 - this makes a difference, since a lot of the images are of pretty poor quality and would otherwise drag down the model's generations.
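To illustrate that kind of filtering, here is a sketch that streams the COYO-700M metadata with the datasets library and keeps only rows above an aesthetic-score cutoff. The 6.0 threshold is just an example value, not the one used for training, and the column names are taken from the dataset card, so double-check them; downloading the actual images from the URLs is a separate step.

```python
from datasets import load_dataset

# Stream the metadata so we don't have to download the full 700M-row dataset.
ds = load_dataset("kakaobrain/coyo-700m", split="train", streaming=True)

# Keep only rows with a LAION v2 aesthetic score above an example cutoff of 6.0.
high_quality = ds.filter(
    lambda row: row["aesthetic_score_laion_v2"] is not None
    and row["aesthetic_score_laion_v2"] > 6.0
)

# Peek at a few surviving rows (image URL + caption); fetching the images
# themselves would be a separate step (e.g. with img2dataset).
for row in high_quality.take(5):
    print(row["aesthetic_score_laion_v2"], row["url"], row["text"])
```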