---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---
# Model Card for flex-diffusion-2-1
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.
## TLDR:
### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base) finetuned for 6k steps, on the same dataset.
For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).
### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
2. Generating non-square images creates weird results, due to the model being trained on square images.
Examples:
| resolution | model | stable diffusion | flex diffusion |
|:---------------:|:-------:|:----------------------------:|:-----------------------------:|
| 576x1024 (9:16) | v2-1 | ![img](imgs/21-576-1024.png) | ![img](imgs/21f-576-1024.png) |
| 576x1024 (9:16) | v2-base | ![img](imgs/2b-576-1024.png) | ![img](imgs/2bf-576-1024.png) |
| 1024x576 (16:9) | v2-1 | ![img](imgs/21-1024-576.png) | ![img](imgs/21f-1024-576.png) |
| 1024x576 (16:9) | v2-base | ![img](imgs/2b-1024-576.png) | ![img](imgs/2bf-1024-576.png) |
### Limitations:
1. It's trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it's trained on only one fixed resolution, so it may not be able to generate images at other resolutions.
For the 1:1 aspect ratio, it's fine-tuned at 512x512, although stable-diffusion-2-1 was last finetuned at 768x768.
### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.
# Table of Contents
- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
# Model Details
## Model Description
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

Finetuned resolutions (width and height in pixels):

| | width | height | aspect ratio |
|---:|--------:|---------:|:---------------|
| 0 | 512 | 1024 | 1:2 |
| 1 | 576 | 1024 | 9:16 |
| 2 | 576 | 960 | 3:5 |
| 3 | 640 | 1024 | 5:8 |
| 4 | 512 | 768 | 2:3 |
| 5 | 640 | 896 | 5:7 |
| 6 | 576 | 768 | 3:4 |
| 7 | 512 | 640 | 4:5 |
| 8 | 640 | 768 | 5:6 |
| 9 | 640 | 704 | 10:11 |
| 10 | 512 | 512 | 1:1 |
| 11 | 704 | 640 | 11:10 |
| 12 | 768 | 640 | 6:5 |
| 13 | 640 | 512 | 5:4 |
| 14 | 768 | 576 | 4:3 |
| 15 | 896 | 640 | 7:5 |
| 16 | 768 | 512 | 3:2 |
| 17 | 1024 | 640 | 8:5 |
| 18 | 960 | 576 | 5:3 |
| 19 | 1024 | 576 | 16:9 |
| 20 | 1024 | 512 | 2:1 |
- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** creativeml-openrail-m
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed
# Uses
- see the parent model card: https://huggingface.co/stabilityai/stable-diffusion-2-1
# Training Details
## Training Data
- LAION aesthetic dataset, the subset of it with a 6+ rating
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of that, see [Preprocessing](#preprocessing)
- most common aspect ratios in the dataset (before preprocessing)

| | aspect_ratio | counts |
|---:|:---------------|---------:|
| 0 | 1:1 | 154727 |
| 1 | 3:2 | 119615 |
| 2 | 2:3 | 61197 |
| 3 | 4:3 | 52276 |
| 4 | 16:9 | 38862 |
| 5 | 400:267 | 21893 |
| 6 | 3:4 | 16893 |
| 7 | 8:5 | 16258 |
| 8 | 4:5 | 15684 |
| 9 | 6:5 | 12228 |
| 10 | 1000:667 | 12097 |
| 11 | 2:1 | 11006 |
| 12 | 800:533 | 10259 |
| 13 | 5:4 | 9753 |
| 14 | 500:333 | 9700 |
| 15 | 250:167 | 9114 |
| 16 | 5:3 | 8460 |
| 17 | 200:133 | 7832 |
| 18 | 1024:683 | 7176 |
| 19 | 11:10 | 6470 |
- predefined aspect ratios
| | width | height | aspect ratio |
|---:|--------:|---------:|:---------------|
| 0 | 512 | 1024 | 1:2 |
| 1 | 576 | 1024 | 9:16 |
| 2 | 576 | 960 | 3:5 |
| 3 | 640 | 1024 | 5:8 |
| 4 | 512 | 768 | 2:3 |
| 5 | 640 | 896 | 5:7 |
| 6 | 576 | 768 | 3:4 |
| 7 | 512 | 640 | 4:5 |
| 8 | 640 | 768 | 5:6 |
| 9 | 640 | 704 | 10:11 |
| 10 | 512 | 512 | 1:1 |
| 11 | 704 | 640 | 11:10 |
| 12 | 768 | 640 | 6:5 |
| 13 | 640 | 512 | 5:4 |
| 14 | 768 | 576 | 4:3 |
| 15 | 896 | 640 | 7:5 |
| 16 | 768 | 512 | 3:2 |
| 17 | 1024 | 640 | 8:5 |
| 18 | 960 | 576 | 5:3 |
| 19 | 1024 | 576 | 16:9 |
| 20 | 1024 | 512 | 2:1 |
## Training Procedure
### Preprocessing
1. download files with url & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
   - I only used the first file `train-00000-of-00007-29aec9150af50f9f.parquet`
2. use img2dataset to convert to webdataset
   - https://github.com/rom1504/img2dataset
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
   - the output folder is `/mnt/aesthetics6plus`, change this to your own folder
```bash
# folder containing the downloaded parquet file
INPUT_FOLDER=first-file
# where the webdataset shards are written; change this to your own folder
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list "$INPUT_FOLDER" --input_format "parquet" \
    --url_col "URL" --caption_col "TEXT" --output_format webdataset \
    --output_folder "$OUTPUT_FOLDER" --processes_count 3 --thread_count 6 \
    --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
    --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```
3. The data-loading code does the remaining preprocessing on the fly, so there's no need to do anything else. It's not optimized for speed (the GPU utilization fluctuates between 80% and 100%), and it's not written for multi-GPU training, so use it with caution. The code does the following:
   - use webdataset to load the data
   - calculate the aspect ratio of each image
   - find the closest aspect ratio and its associated resolution from the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest aspect ratio is 1:2, and its associated resolution is 512x1024.
   - keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
   - random crop the image to the associated resolution. E.g. crop to 512x1024
   - if more than 10% of the image is lost in the cropping, discard this example.
   - batch examples by aspect ratio, so all examples in a batch have the same aspect ratio
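
The bucket selection and the 10% discard rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the actual data-loading code: the names `BUCKETS`, `closest_bucket`, `resized_size` and `plan_crop` are made up here, the aspect ratio is assumed to be width / height compared linearly, and "lost" is assumed to mean the fraction of the resized image's area that falls outside the crop.

```python
import math
from fractions import Fraction

# (width, height) pairs copied from the predefined resolutions table.
BUCKETS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

# Assumption: the 10% threshold applies to the lost area fraction.
MAX_LOST_FRACTION = 0.10


def closest_bucket(width, height):
    """Return the predefined (width, height) whose aspect ratio is closest."""
    ratio = width / height
    return min(BUCKETS, key=lambda wh: abs(ratio - wh[0] / wh[1]))


def resized_size(width, height, bucket):
    """Scale the image, keeping its aspect ratio, so it covers the bucket."""
    bucket_w, bucket_h = bucket
    # Exact rational arithmetic avoids off-by-one pixels from float rounding.
    scale = max(Fraction(bucket_w, width), Fraction(bucket_h, height))
    return math.ceil(width * scale), math.ceil(height * scale)


def plan_crop(width, height):
    """Return (bucket, resized size, keep?) for an image of the given size."""
    bucket = closest_bucket(width, height)
    resized_w, resized_h = resized_size(width, height, bucket)
    lost = 1 - (bucket[0] * bucket[1]) / (resized_w * resized_h)
    return bucket, (resized_w, resized_h), lost <= MAX_LOST_FRACTION


print(plan_crop(500, 1500))   # the 1:3 example from the list above
print(plan_crop(1920, 1080))  # a 16:9 image
```

Under these assumptions, the 1:3 example is resized to 512x1536 and then dropped, because cropping to 512x1024 would lose a third of the image, while a 1920x1080 image maps to the 1024x576 bucket with nothing lost.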
### Speeds, Sizes, Times
- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to be downloaded. I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is a lot bigger.
- Hardware: 1 RTX 3090 GPU
- Optimizer: 8bit Adam
- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 2 * 16 = 32
- Learning rate: 2e-6, warmed up over the first 500 steps and then kept constant
- Training steps: 6k
- Epoch size (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen 1.92 times on average.
- Training time: approximately 1 day
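
The arithmetic above, and the shape of the learning-rate schedule, can be cross-checked with a few lines of Python. This is a sketch, not the training script: the constant and function names are made up for illustration, and the warmup is assumed to be linear, which the list above does not state.

```python
PER_STEP_BATCH = 2        # actual batch size on the GPU
GRAD_ACCUM_STEPS = 16     # gradient_accumulation_steps
TRAINING_STEPS = 6_000    # optimizer steps
DATASET_SIZE = 100_000    # image-caption pairs before filtering
PEAK_LR = 2e-6
WARMUP_STEPS = 500

effective_batch = PER_STEP_BATCH * GRAD_ACCUM_STEPS
epochs = effective_batch * TRAINING_STEPS / DATASET_SIZE


def learning_rate(step):
    """Assumed linear warmup to PEAK_LR over WARMUP_STEPS, then constant."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)


print(effective_batch)       # examples per optimizer step
print(epochs)                # average number of times each example is seen
print(learning_rate(250))    # halfway through warmup
print(learning_rate(6_000))  # after warmup
```

Because filtering discards some examples, the true number of passes over the remaining data is somewhat higher than this estimate.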
## Results
More information needed
# Model Card Authors
Jonathan Chang
# How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel


def use_DPM_solver(pipe):
    # replace the pipeline's scheduler with the DPM-Solver multistep scheduler
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe


# load the base pipeline, swapping in the finetuned UNet from this repo
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16)
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
```
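
To generate a non-square image, the pipeline call accepts `width` and `height` keyword arguments. The helper below is a minimal sketch that assumes `pipe` was built as in the snippet above. The names `generate` and `FINETUNED_RESOLUTIONS` are illustrative only and not part of this repo. Rejecting resolutions outside the finetuned table is a design choice made here, motivated by the limitation that each aspect ratio was trained at a single resolution.

```python
# (width, height) pairs copied from the finetuned resolutions table.
FINETUNED_RESOLUTIONS = {
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
}


def generate(pipe, prompt, width, height):
    """Generate one image at a resolution the model was finetuned on."""
    if (width, height) not in FINETUNED_RESOLUTIONS:
        raise ValueError(f"{width}x{height} is not a finetuned resolution")
    # width and height are forwarded to the pipeline call unchanged
    return pipe(prompt, width=width, height=height).images[0]


# Example usage (requires the `pipe` object from the snippet above):
# image = generate(pipe, "a professional photograph of an astronaut riding a horse", 1024, 576)
# image.save("astronaut_rides_horse_16_9.png")
```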