---
pipeline_tag: text-to-image
license: other
license_name: faipl-1.0-sd
license_link: LICENSE
decoder:
- Disty0/sotediffusion-wuerstchen3-alpha1-decoder
---
# SoteDiffusion Wuerstchen3
Anime finetune of Würstchen V3.
Currently in an early stage of training.
No commercial use thanks to StabilityAI.
# Release Notes
- Ran LLaVa on the images that have the "english text" tag.
This adds a `The text says "text"` tag.
If LLaVa cannot read the text, it describes the image instead.
<style>
.image {
float: left;
margin-left: 10px;
}
</style>
<table>
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/pFev-xGFut0o3qlZQJpwb.png" width="320">
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/adhnnnmRBkTULFNl9AfT2.png" width="320">
</table>
# UI Guide
## SD.Next
URL: https://github.com/vladmandic/automatic/
Go to Models -> Huggingface, type `Disty0/sotediffusion-wuerstchen3-alpha2-decoder` into the model name field, and press download.
Load `Disty0/sotediffusion-wuerstchen3-alpha2-decoder` after the download process is complete.
Prompt:
```
very aesthetic, best quality, newest,
```
Negative Prompt:
```
very displeasing, worst quality, oldest, monochrome, sketch, realistic,
```
Parameters:
- Sampler: Default
- Steps: 30 or 40
- Refiner Steps: 10
- CFG: 8
- Secondary CFG: 1 or 1.2
- Resolution: 1024x1536, 2048x1152, or anything else where both width and height are multiples of 128.
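
The resolution rule above can be applied programmatically. The following is a minimal sketch, not part of SD.Next or diffusers: the helper names `snap_to_multiple` and `snap_resolution` are hypothetical, and rounding to the nearest multiple is an assumption (you may prefer to always round down).

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round value to the nearest multiple, never going below one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_resolution(width: int, height: int, multiple: int = 128) -> tuple:
    """Snap both sides of a resolution to multiples of 128 (hypothetical helper)."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


# 1024x1536 is already valid; 1920x1080 is not, because 1080 is not a multiple of 128.
print(snap_resolution(1024, 1536))
print(snap_resolution(1920, 1080))
```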
## ComfyUI
Please refer to CivitAI: https://civitai.com/models/353284
# Code Example
```shell
pip install diffusers transformers accelerate
```
```python
import torch
from diffusers import StableCascadeCombinedPipeline
device = "cuda"
dtype = torch.bfloat16
model = "Disty0/sotediffusion-wuerstchen3-alpha2-decoder"
pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)
# send everything to the gpu:
pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)
# or enable model offload to save vram:
# pipe.enable_model_cpu_offload()
prompt = "1girl, solo, cowboy shot, straight hair, looking at viewer, hoodie, indoors, slight smile, casual, furniture, doorway, very aesthetic, best quality, newest,"
negative_prompt = "very displeasing, worst quality, oldest, monochrome, sketch, realistic,"
output = pipe(
    width=1024,
    height=1536,
    prompt=prompt,
    negative_prompt=negative_prompt,
    decoder_guidance_scale=1.0,
    prior_guidance_scale=8.0,
    prior_num_inference_steps=40,
    output_type="pil",
    num_inference_steps=10,
).images[0]
# do something with the output image, for example save it (the file name is arbitrary):
output.save("output.png")
```
## Training Status:
**GPU used for training**: 1x AMD RX 7900 XTX 24GB
**GPU Hours**: 250 (cumulative, starting from alpha1)
| dataset name | training done | remaining |
|---|---|---|
| **newest** | 010 | 221 |
| **recent** | 010 | 162 |
| **mid** | 010 | 114 |
| **early** | 010 | 060 |
| **oldest** | 010 | 010 |
| **pixiv** | 010 | 032 |
| **visual novel cg** | 010 | 018 |
| **anime wallpaper** | 010 | 003 |
| **Total** | 88 | 620 |
**Note**: Chunk numbering starts from 0 and there are 8000 images per chunk, so `010` in the training done column means 11 chunks.
## Dataset:
**GPU used for captioning**: 1x Intel ARC A770 16GB
**GPU Hours**: 350
**Model used for captioning**: SmilingWolf/wd-swinv2-tagger-v3
**Command:**
```
python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
```
| dataset name | total images | total chunk |
|---|---|---|
| **newest** | 1.848.331 | 232 |
| **recent** | 1.380.630 | 173 |
| **mid** | 993.227 | 125 |
| **early** | 566.152 | 071 |
| **oldest** | 160.397 | 021 |
| **pixiv** | 343.614 | 043 |
| **visual novel cg** | 231.358 | 029 |
| **anime wallpaper** | 104.790 | 014 |
| **Total** | 5.628.499 | 708 |
**Note**:
- Smallest size is 1280x600 | 768.000 pixels
- Deduped based on image similarity using czkawka-cli
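
The chunk counts in the table above, and the remaining counts in the Training Status table, can be reproduced from the image counts. This is only a sketch: it assumes the last chunk of each dataset is a partial one (ceiling division) and that `010` under training done means chunk indices 0 to 10 are finished. Both assumptions agree with the published totals but are not confirmed by the training scripts.

```python
import math

IMAGES_PER_CHUNK = 8000  # from the note in the Training Status section
LAST_TRAINED_INDEX = 10  # "010" in the training done column

# Image counts copied from the dataset table.
image_counts = {
    "newest": 1_848_331,
    "recent": 1_380_630,
    "mid": 993_227,
    "early": 566_152,
    "oldest": 160_397,
    "pixiv": 343_614,
    "visual novel cg": 231_358,
    "anime wallpaper": 104_790,
}

# Assumption: the final chunk of each dataset is allowed to be partially filled.
chunks = {name: math.ceil(n / IMAGES_PER_CHUNK) for name, n in image_counts.items()}

# Chunk indices start at 0, so index 10 means 11 chunks are done.
remaining = {name: total - (LAST_TRAINED_INDEX + 1) for name, total in chunks.items()}

for name in image_counts:
    print(f"{name}: {chunks[name]} chunks, {remaining[name]} remaining")
print("total:", sum(chunks.values()), "chunks,", sum(remaining.values()), "remaining")
```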
## Tags:
The model is trained with a random tag order, but the tags are stored in the dataset in this order:
```
aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
```
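
As an illustration of the ordering above, the sketch below builds a caption in dataset order and then shuffles it using the `", "` separator from the training command. The function names and the group keys are hypothetical, and the shuffle is a simplified stand-in for what `--shuffle_caption` does in sd-scripts, not its actual implementation.

```python
import random

# Dataset order from this README; the key names are made up for this sketch.
TAG_ORDER = ["aesthetic", "quality", "date", "custom", "rating",
             "character", "series", "general"]


def build_caption(groups: dict, separator: str = ", ") -> str:
    """Join tag groups in dataset order, skipping groups that are missing."""
    ordered = [tag for key in TAG_ORDER for tag in groups.get(key, [])]
    return separator.join(ordered)


def shuffle_caption(caption: str, separator: str = ", ", seed=None) -> str:
    """Return the same tags in random order (simplified model of tag shuffling)."""
    tags = caption.split(separator)
    random.Random(seed).shuffle(tags)
    return separator.join(tags)


groups = {
    "aesthetic": ["very aesthetic"],
    "quality": ["best quality"],
    "date": ["newest"],
    "rating": ["general"],
    "general": ["1girl", "solo"],
}
caption = build_caption(groups)
print(caption)
print(shuffle_caption(caption, seed=0))
```

Because training sees shuffled captions, the position of a tag in your prompt should not matter much.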
### Date:
| tag | date |
|---|---|
| **newest** | 2022 to 2024 |
| **recent** | 2019 to 2021 |
| **mid** | 2015 to 2018 |
| **early** | 2011 to 2014 |
| **oldest** | 2005 to 2010 |
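
The date table above amounts to a simple year lookup. A minimal sketch follows; the `date_tag` helper is hypothetical, and returning `None` for years outside the table is an assumption about how such images would be handled.

```python
from typing import Optional

# (first year, last year, tag), copied from the date table.
DATE_TAGS = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year: int) -> Optional[str]:
    """Return the date tag for a year, or None if the year is not in the table."""
    for first, last, tag in DATE_TAGS:
        if first <= year <= last:
            return tag
    return None


print(date_tag(2023))
print(date_tag(2012))
```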
### Aesthetic Tags:
**Model used**: shadowlilac/aesthetic-shadow-v2
| score greater than | tag | count |
|---|---|---|
| **0.90** | extremely aesthetic | 125.451 |
| **0.80** | very aesthetic | 887.382 |
| **0.70** | aesthetic | 1.049.857 |
| **0.50** | slightly aesthetic | 1.643.091 |
| **0.40** | not displeasing | 569.543 |
| **0.30** | not aesthetic | 445.188 |
| **0.20** | slightly displeasing | 341.424 |
| **0.10** | displeasing | 237.660 |
| **rest of them** | very displeasing | 328.712 |
### Quality Tags:
**Model used**: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth
| score greater than | tag | count |
|---|---|---|
| **0.980** | best quality | 1.270.447 |
| **0.900** | high quality | 498.244 |
| **0.750** | great quality | 351.006 |
| **0.500** | medium quality | 366.448 |
| **0.250** | normal quality | 368.380 |
| **0.125** | bad quality | 279.050 |
| **0.025** | low quality | 538.958 |
| **rest of them** | worst quality | 1.955.966 |
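
Both the aesthetic and the quality tables map a model score to a tag through descending thresholds. The sketch below shows that mapping only; it does not run the scoring models, the `score_to_tag` helper is hypothetical, and the strict "greater than" comparison is taken from the table headers.

```python
# (threshold, tag) pairs copied from the two tables, highest threshold first.
AESTHETIC_TAGS = [
    (0.90, "extremely aesthetic"),
    (0.80, "very aesthetic"),
    (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"),
    (0.40, "not displeasing"),
    (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"),
    (0.10, "displeasing"),
]
QUALITY_TAGS = [
    (0.980, "best quality"),
    (0.900, "high quality"),
    (0.750, "great quality"),
    (0.500, "medium quality"),
    (0.250, "normal quality"),
    (0.125, "bad quality"),
    (0.025, "low quality"),
]


def score_to_tag(score: float, thresholds: list, fallback: str) -> str:
    """Return the tag of the first threshold the score exceeds, else the fallback."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback


print(score_to_tag(0.85, AESTHETIC_TAGS, "very displeasing"))
print(score_to_tag(0.99, QUALITY_TAGS, "worst quality"))
```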
### Rating Tags:
| tag | count |
|---|---|
| **general** | 1.416.451 |
| **sensitive** | 3.447.664 |
| **nsfw** | 427.459 |
| **explicit nsfw** | 336.925 |
### Custom Tags:
| dataset name | custom tag |
|---|---|
| **image boards** | date, |
| **pixiv** | art by Display_Name, |
| **visual novel cg** | Full_VN_Name (short_3_letter_name), visual novel cg, |
| **anime wallpaper** | date, anime wallpaper, |
## Training Parameters:
**Software used**: Kohya SD-Scripts with Stable Cascade branch
https://github.com/kohya-ss/sd-scripts/tree/stable-cascade
**Base model**: Disty0/sote-diffusion-cascade-alpha0
### Command:
```shell
#!/bin/sh
CURRENT=$1
CURRENT_SUB=$2
PAST=$3
PAST_SUB=$4
LD_PRELOAD=/usr/lib/libtcmalloc.so.4 accelerate launch --mixed_precision fp16 --num_cpu_threads_per_process 1 stable_cascade_train_stage_c.py \
--mixed_precision fp16 \
--save_precision fp16 \
--full_fp16 \
--sdpa \
--gradient_checkpointing \
--train_text_encoder \
--resolution "1024,1024" \
--train_batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 1e-5 \
--learning_rate_te1 1e-5 \
--lr_scheduler constant_with_warmup \
--lr_warmup_steps 100 \
--optimizer_type adafactor \
--optimizer_args "scale_parameter=False" "relative_step=False" "warmup_init=False" \
--max_grad_norm 0 \
--token_warmup_min 1 \
--token_warmup_step 0 \
--shuffle_caption \
--caption_separator ", " \
--caption_dropout_rate 0 \
--caption_tag_dropout_rate 0 \
--caption_dropout_every_n_epochs 0 \
--dataset_repeats 1 \
--save_state \
--save_every_n_steps 256 \
--sample_every_n_steps 64 \
--max_token_length 225 \
--max_train_epochs 1 \
--caption_extension ".txt" \
--max_data_loader_n_workers 2 \
--persistent_data_loader_workers \
--enable_bucket \
--min_bucket_reso 256 \
--max_bucket_reso 4096 \
--bucket_reso_steps 64 \
--bucket_no_upscale \
--log_with tensorboard \
--output_name sotediffusion-wr3_3b \
--train_data_dir /mnt/DataSSD/AI/anime_image_dataset/combined/combined-$CURRENT/$CURRENT_SUB \
--in_json /mnt/DataSSD/AI/anime_image_dataset/combined/combined-$CURRENT/$CURRENT_SUB.json \
--output_dir /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$CURRENT/$CURRENT_SUB \
--logging_dir /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$CURRENT/$CURRENT_SUB/logs \
--resume /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$PAST/$PAST_SUB/sotediffusion-wr3_3b-state \
--stage_c_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$PAST/$PAST_SUB/sotediffusion-wr3_3b.safetensors \
--text_model_checkpoint_path /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/sotediffusion-wr3_3b-$PAST/$PAST_SUB/sotediffusion-wr3_3b_text_model.safetensors \
--effnet_checkpoint_path /mnt/DataSSD/AI/models/wuerstchen3/effnet_encoder.safetensors \
--previewer_checkpoint_path /mnt/DataSSD/AI/models/wuerstchen3/previewer.safetensors \
--sample_prompts /mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/config/sotediffusion-prompt.txt
```
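
For orientation, the batch settings in the command above imply the following rough numbers. This is back-of-the-envelope arithmetic, not output from the trainer: it assumes each run covers one 8000-image chunk on the single GPU listed under Training Status, and it ignores aspect-ratio bucketing, which can add a few partial batches.

```python
# Values taken from the training command and the Training Status section.
train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1
images_per_chunk = 8000  # assumption: one run trains exactly one full chunk

# Images that contribute to a single optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_gpus

# Approximate optimizer updates for one pass over a chunk (--max_train_epochs 1).
optimizer_steps_per_chunk = images_per_chunk // effective_batch_size

print("effective batch size:", effective_batch_size)
print("approx. optimizer steps per chunk:", optimizer_steps_per_chunk)
```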
## Limitations and Bias
### Bias
- This model is intended for anime illustrations.
Realistic capabilities are not tested at all.
### Limitations
- Can fall back to a realistic style.
Add the "realistic" tag to the negative prompt when this happens.
- Eyes in far shots can be bad.
- Anatomy and hands can be bad.
- Still in active training.
## License
SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:
1. **Modification Sharing:** If you modify SoteDiffusion models, you must share both your changes and the original license.
2. **Source Code Accessibility:** If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
3. **Distribution Terms:** Any distribution must be under this license or another with similar rules.
4. **Compliance:** Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.
**Notes**: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is included as LICENSE_INHERIT. This means there is still no commercial use of any kind.