---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) fine-tuned on images with varied aspect ratios.

## TLDR:

### There are 2 models in this repo:

- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), fine-tuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), fine-tuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:

1. Generated images look like they are cropped from a larger image.
2. Generating non-square images produces odd results, because the base model was trained on square images.

Examples:

| resolution      | model   | stable diffusion             | flex diffusion                |
|:---------------:|:-------:|:----------------------------:|:-----------------------------:|
| 576x1024 (9:16) | v2-1    | ![img](imgs/21-576-1024.png) | ![img](imgs/21f-576-1024.png) |
| 576x1024 (9:16) | v2-base | ![img](imgs/2b-576-1024.png) | ![img](imgs/2bf-576-1024.png) |
| 1024x576 (16:9) | v2-1    | ![img](imgs/21-1024-576.png) | ![img](imgs/21f-1024-576.png) |
| 1024x576 (16:9) | v2-base | ![img](imgs/2b-1024-576.png) | ![img](imgs/2bf-1024-576.png) |

### Limitations:

1. It was trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it was trained at only one fixed resolution, so it may not generate good images at other resolutions.
   For the 1:1 aspect ratio, it was fine-tuned at 512x512, although stable-diffusion-2-1, the parent of flex-diffusion-2-1, was last fine-tuned at 768x768.

### Potential improvements:

1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.

# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) fine-tuned for dynamic aspect ratios.

Fine-tuned resolutions:

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** creativeml-openrail-m
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

# Uses

- See https://huggingface.co/stabilityai/stable-diffusion-2-1

# Training Details

## Training Data

- LAION aesthetic dataset, the subset with a 6+ rating
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of it, see [Preprocessing](#preprocessing)

- Most common aspect ratios in the dataset (before preprocessing):

|    | aspect_ratio   |   counts |
|---:|:---------------|---------:|
|  0 | 1:1            |   154727 |
|  1 | 3:2            |   119615 |
|  2 | 2:3            |    61197 |
|  3 | 4:3            |    52276 |
|  4 | 16:9           |    38862 |
|  5 | 400:267        |    21893 |
|  6 | 3:4            |    16893 |
|  7 | 8:5            |    16258 |
|  8 | 4:5            |    15684 |
|  9 | 6:5            |    12228 |
| 10 | 1000:667       |    12097 |
| 11 | 2:1            |    11006 |
| 12 | 800:533        |    10259 |
| 13 | 5:4            |     9753 |
| 14 | 500:333        |     9700 |
| 15 | 250:167        |     9114 |
| 16 | 5:3            |     8460 |
| 17 | 200:133        |     7832 |
| 18 | 1024:683       |     7176 |
| 19 | 11:10          |     6470 |

- Predefined aspect ratios:

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

## Training Procedure

### Preprocessing

1. Download the files with URLs and captions from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
   - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`.
2. Use img2dataset to convert them to webdataset format.
   - https://github.com/rom1504/img2dataset
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`.
   - The output folder is `/mnt/aesthetics6plus`; change this to your own folder.

```bash
# Assign the folders as shell variables (a plain `echo` would only print the text)
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus

img2dataset --url_list "$INPUT_FOLDER" --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder "$OUTPUT_FOLDER" --processes_count 3 --thread_count 6 \
    --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
    --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

3. The data-loading code does the preprocessing on the fly, so nothing else is needed. It is not optimized for speed (GPU utilization fluctuates between 80% and 100%), and it is not written for multi-GPU training, so use it with caution. The code does the following:
   - use webdataset to load the data
   - calculate the aspect ratio of each image
   - find the closest aspect ratio and its associated resolution among the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
   - keeping the aspect ratio, resize the image so that each side is at least as large as the associated resolution. E.g. resize to 512x(512*3) = 512x1536.
   - randomly crop the image to the associated resolution. E.g. crop to 512x1024.
   - if more than 10% of the image is lost in the cropping, discard the example.
   - batch examples by aspect ratio, so that all examples in a batch have the same aspect ratio
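
The bucketing steps above can be sketched in plain Python. This is a minimal illustration, not the actual data-loading code: the function names are hypothetical, the aspect ratio is assumed to be width / height, and "lost in the cropping" is assumed to mean the fraction of image area removed.

```python
from collections import defaultdict

# (width, height) pairs copied from the predefined-resolution table
RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512),
    (704, 640), (768, 640), (640, 512), (768, 576), (896, 640),
    (768, 512), (1024, 640), (960, 576), (1024, 576), (1024, 512),
]


def closest_resolution(width, height):
    """argmin of |aspect_ratio - predefined_aspect_ratio|, with ratio = width / height."""
    ratio = width / height
    return min(RESOLUTIONS, key=lambda wh: abs(ratio - wh[0] / wh[1]))


def resized_size(width, height, target):
    """Smallest size with the original aspect ratio that covers `target` on both sides."""
    tw, th = target
    scale = max(tw / width, th / height)
    return max(tw, round(width * scale)), max(th, round(height * scale))


def crop_loss(resized, target):
    """Fraction of the resized image's area removed by cropping to `target` (assumed metric)."""
    (rw, rh), (tw, th) = resized, target
    return 1 - (tw * th) / (rw * rh)


def plan(width, height, max_loss=0.10):
    """Return (target, resized) for an image, or None if the crop would lose too much."""
    target = closest_resolution(width, height)
    resized = resized_size(width, height, target)
    if crop_loss(resized, target) > max_loss:
        return None
    return target, resized


def bucket_by_resolution(sizes):
    """Group image indices by target resolution so every batch shares one shape."""
    buckets = defaultdict(list)
    for index, (width, height) in enumerate(sizes):
        planned = plan(width, height)
        if planned is not None:
            buckets[planned[0]].append(index)
    return dict(buckets)
```

Under the area-based reading of the 10% rule, a 1:3 image (resized to 512x1536, cropped to 512x1024) loses about a third of its area and would be discarded, while a 1920x1080 image maps to the 1024x576 bucket with no loss.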

### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to download. I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is a lot bigger.

- Hardware: 1 RTX 3090 GPU

- Optimizer: 8-bit Adam

- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32

- Learning rate: 2e-6, warmed up over the first 500 steps and then kept constant

- Training steps: 6k
  - Number of epochs (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen 1.92 times on average.

- Training time: approximately 1 day
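
The batch-size and epoch arithmetic above, together with the learning-rate schedule, can be written out as a short sketch. The variable and function names are illustrative, and the linear shape of the warmup is an assumption; the text only says the rate warms up to 2e-6 over 500 steps and then stays constant.

```python
def learning_rate(step, peak_lr=2e-6, warmup_steps=500):
    """Warm up to `peak_lr`, then hold it constant (linear warmup is an assumption)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


per_pass_batch = 2         # "actual batch size": examples per forward/backward pass
accumulation_steps = 16    # gradient_accumulation_steps
effective_batch = per_pass_batch * accumulation_steps

training_steps = 6_000
dataset_size = 100_000     # image-caption pairs before filtering
examples_seen = effective_batch * training_steps
epochs = examples_seen / dataset_size

print(effective_batch)  # 32
print(epochs)           # 1.92
```

Because filtering removes some examples, the true number of passes over the surviving data is somewhat higher than 1.92.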

## Results

More information needed

# Model Card Authors

Jonathan Chang

# How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel


def use_DPM_solver(pipe):
    # Swap the pipeline's default scheduler for DPMSolverMultistepScheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe


# Load the base pipeline, replacing its UNet with the fine-tuned one
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16)
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
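
Since the model was fine-tuned at one fixed resolution per aspect ratio, it may help to request exactly those sizes when generating non-square images. The sketch below is only an illustration: `FINETUNED_SIZES` and `size_kwargs` are hypothetical names, and the sizes are copied from the fine-tuned resolutions table. The resulting dictionary is meant to be passed as the `width` and `height` arguments of the pipeline call.

```python
# Fine-tuned (width, height) per aspect ratio, copied from the resolutions table
FINETUNED_SIZES = {
    "1:2": (512, 1024), "9:16": (576, 1024), "3:5": (576, 960),
    "5:8": (640, 1024), "2:3": (512, 768), "5:7": (640, 896),
    "3:4": (576, 768), "4:5": (512, 640), "5:6": (640, 768),
    "10:11": (640, 704), "1:1": (512, 512), "11:10": (704, 640),
    "6:5": (768, 640), "5:4": (640, 512), "4:3": (768, 576),
    "7:5": (896, 640), "3:2": (768, 512), "8:5": (1024, 640),
    "5:3": (960, 576), "16:9": (1024, 576), "2:1": (1024, 512),
}


def size_kwargs(aspect_ratio):
    """Return `width`/`height` keyword arguments for one of the fine-tuned aspect ratios."""
    width, height = FINETUNED_SIZES[aspect_ratio]
    return {"width": width, "height": height}


# Hypothetical usage with the `pipe` and `prompt` from the example above:
# image = pipe(prompt, **size_kwargs("9:16")).images[0]
```

A lookup table is used here rather than arbitrary sizes because, as noted under Limitations, other resolutions for the same aspect ratio were not part of the fine-tuning.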