---
pipeline_tag: text-to-image
license: other
license_name: faipl-1.0-sd
license_link: LICENSE
decoder:
- Disty0/sotediffusion-wuerstchen3-alpha1-decoder
---


# SoteDiffusion Wuerstchen3

Anime finetune of Würstchen V3.  
Currently in an early stage of training.  
No commercial use.  

# Release Notes

- Switched to OneTrainer
- Trained more.  
- Currently trained on 2.6M images.  

<style>
.image {
    float: left;
    margin-left: 10px;
}
</style>

<table>
<img class="image" src="https://cdn-uploads.huggingface.co/production/uploads/6456af6195082f722d178522/lfxdg68qO5jrNg20AMuTf.png">
</table>

# UI Guide

## SD.Next
URL: https://github.com/vladmandic/automatic/

Switch to the dev branch:
```
git checkout dev
```
Go to Models -> Huggingface, type `Disty0/sotediffusion-wuerstchen3-alpha3-decoder` into the model name field, and press download.  
Load `Disty0/sotediffusion-wuerstchen3-alpha3-decoder` once the download is complete.  

Prompt:  
```
very aesthetic, best quality, newest,
```

Negative Prompt:  
```
very displeasing, worst quality, oldest, monochrome, sketch, realistic,
```

Parameters:  
Sampler: Default  

Steps: 30 or 40  
Refiner Steps: 10  

CFG: 6-8  
Secondary CFG: 2 or 1  

Resolution: 1024x1536, 2048x1152  
Any resolution works as long as width and height are multiples of 128.
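
The multiple-of-128 rule can be applied to an arbitrary target size with a small helper. This is only a sketch: `snap_to_multiple` and `snap_resolution` are hypothetical names, not part of SD.Next or diffusers, and rounding to the nearest multiple is an assumption about what you want.

```python
def snap_to_multiple(value: int, multiple: int = 128) -> int:
    """Round value to the nearest multiple (half rounds up), at least one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_resolution(width: int, height: int, multiple: int = 128) -> tuple[int, int]:
    """Snap both dimensions independently; the aspect ratio may shift slightly."""
    return snap_to_multiple(width, multiple), snap_to_multiple(height, multiple)


print(snap_resolution(1000, 1500))  # (1024, 1536)
print(snap_resolution(1920, 1080))  # (1920, 1024)
```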


## ComfyUI

Please refer to CivitAI: https://civitai.com/models/353284  


# Code Example

```shell
pip install diffusers
```

```python
import torch
from diffusers import StableCascadeCombinedPipeline

device = "cuda"
dtype = torch.bfloat16 # or torch.float16
model = "Disty0/sotediffusion-wuerstchen3-alpha3-decoder"

pipe = StableCascadeCombinedPipeline.from_pretrained(model, torch_dtype=dtype)

# send everything to the gpu:
pipe = pipe.to(device, dtype=dtype)
pipe.prior_pipe = pipe.prior_pipe.to(device, dtype=dtype)

# or enable model offload to save vram:
# pipe.enable_model_cpu_offload()



prompt = "1girl, solo, cowboy shot, straight hair, looking at viewer, hoodie, indoors, slight smile, casual, furniture, doorway, very aesthetic, best quality, newest,"
negative_prompt = "very displeasing, worst quality, oldest, monochrome, sketch, realistic,"

output = pipe(
    width=1024,
    height=1536,
    prompt=prompt,
    negative_prompt=negative_prompt,
    decoder_guidance_scale=1.0,
    prior_guidance_scale=8.0,
    prior_num_inference_steps=40,
    output_type="pil",
    num_inference_steps=10
).images[0]

# Do something with the output image, for example:
# output.save("output.png")
```
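
The SD.Next parameters listed earlier appear to line up with the pipeline arguments used above (Steps with `prior_num_inference_steps`, Refiner Steps with `num_inference_steps`, CFG with `prior_guidance_scale`, Secondary CFG with `decoder_guidance_scale`). That correspondence is inferred from the matching values in this card, not stated by the author. A minimal sketch that builds the keyword arguments under that assumption, with `build_pipeline_kwargs` as a hypothetical helper name:

```python
RECOMMENDED_POSITIVE = "very aesthetic, best quality, newest,"
RECOMMENDED_NEGATIVE = "very displeasing, worst quality, oldest, monochrome, sketch, realistic,"


def build_pipeline_kwargs(subject_tags, width=1024, height=1536,
                          steps=40, refiner_steps=10, cfg=8.0, secondary_cfg=1.0):
    """Build keyword arguments for the combined pipeline call (assumed mapping)."""
    if width % 128 != 0 or height % 128 != 0:
        raise ValueError("width and height must be multiples of 128")
    return {
        "prompt": f"{subject_tags.strip()} {RECOMMENDED_POSITIVE}",
        "negative_prompt": RECOMMENDED_NEGATIVE,
        "width": width,
        "height": height,
        "prior_num_inference_steps": steps,       # UI: Steps
        "num_inference_steps": refiner_steps,     # UI: Refiner Steps
        "prior_guidance_scale": cfg,              # UI: CFG
        "decoder_guidance_scale": secondary_cfg,  # UI: Secondary CFG
        "output_type": "pil",
    }


kwargs = build_pipeline_kwargs("1girl, solo, cowboy shot,")
print(kwargs["prompt"])
# 1girl, solo, cowboy shot, very aesthetic, best quality, newest,
```

With a loaded pipeline, the result would be passed as `pipe(**kwargs).images[0]`.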


## Training Status:

**GPU used for training**: 1x AMD RX 7900 XTX 24GB  
**GPU Hours**: 500 (cumulative since alpha1)  

| dataset name | training done | remaining |
|---|---|---|
| **newest** | 100 | 131 |
| **recent** | 040 | 132 |
| **mid** | 040 | 084 |
| **early** | 040 | 030 |
| **oldest** | done | done |
| **pixiv** | done | done |
| **visual novel cg** | done | done |
| **anime wallpaper** | done | done |
| **Total** | 333 | 375 |

**Note**: Chunks are numbered from 0, and each chunk contains 8000 images.  
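
As a rough worked example, the chunk counts in the Total row can be converted into an image count and a progress figure. The numbers are taken from the table as-is; treating every chunk as a full 8000 images is an assumption, so the image count is an upper bound.

```python
IMAGES_PER_CHUNK = 8000

chunks_done = 333       # "training done" in the Total row
chunks_remaining = 375  # "remaining" in the Total row

images_seen = chunks_done * IMAGES_PER_CHUNK
progress = chunks_done / (chunks_done + chunks_remaining)

print(images_seen)        # 2664000
print(f"{progress:.1%}")  # 47.0%
```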


## Dataset:

**GPU used for captioning**: 1x Intel ARC A770 16GB  
**GPU Hours**: 350  

**Model used for captioning**: SmilingWolf/wd-swinv2-tagger-v3  
**Command:**  
```
python /mnt/DataSSD/AI/Apps/kohya_ss/sd-scripts/finetune/tag_images_by_wd14_tagger.py --model_dir "/mnt/DataSSD/AI/models/wd14_tagger_model" --repo_id "SmilingWolf/wd-swinv2-tagger-v3" --recursive --remove_underscore --use_rating_tags --character_tags_first --character_tag_expand --append_tags --onnx --caption_separator ", " --general_threshold 0.35 --character_threshold 0.50 --batch_size 4 --caption_extension ".txt" ./
```


| dataset name | total images | total chunk |
|---|---|---|
| **newest** | 1.848.331 | 232 |
| **recent** | 1.380.630 | 173 |
| **mid** | 993.227 | 125 |
| **early** | 566.152 | 071 |
| **oldest** | 160.397 | 021 |
| **pixiv** | 343.614 | 043 |
| **visual novel cg** | 231.358 | 029 |
| **anime wallpaper** | 104.790 | 014 |
| **Total** | 5.628.499 | 708 |

**Note**:  
 - Smallest size is 1280x600 | 768.000 pixels
 - Deduped based on image similarity using czkawka-cli
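
The chunk counts in the table above are consistent with splitting each dataset into chunks of 8000 images and rounding up. A short check, assuming that rule (the card does not describe the actual chunking script):

```python
import math

IMAGES_PER_CHUNK = 8000

# Image totals copied from the dataset table.
total_images = {
    "newest": 1_848_331,
    "recent": 1_380_630,
    "mid": 993_227,
    "early": 566_152,
    "oldest": 160_397,
    "pixiv": 343_614,
    "visual novel cg": 231_358,
    "anime wallpaper": 104_790,
}

# The last chunk of each dataset is usually partial, hence the ceiling.
chunks = {name: math.ceil(count / IMAGES_PER_CHUNK) for name, count in total_images.items()}

print(chunks["newest"])            # 232
print(sum(chunks.values()))        # 708
print(sum(total_images.values()))  # 5628499
```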


## Tags:

The model is trained with a random tag order, but for reference this is the order used in the dataset:  
```
aesthetic tags, quality tags, date tags, custom tags, rating tags, character, series, rest of the tags
```
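
A minimal sketch of how a caption in this order could be assembled and then shuffled. The shuffle follows the `enable_tag_shuffling`, `tag_delimiter` and `keep_tags_count: 1` settings from the training config further down; how OneTrainer applies them internally is an assumption here, and both function names are hypothetical.

```python
import random


def build_caption(aesthetic, quality, date, custom, rating, character, series, rest):
    """Join tag groups in the dataset order; empty groups are skipped."""
    groups = [[aesthetic], [quality], [date], custom, [rating], character, series, rest]
    tags = [tag for group in groups for tag in group if tag]
    return ", ".join(tags)


def shuffle_caption(caption, keep_tags=1, seed=None):
    """Shuffle all tags except the first keep_tags ones."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    head, tail = tags[:keep_tags], tags[keep_tags:]
    random.Random(seed).shuffle(tail)
    return ", ".join(head + tail)


caption = build_caption(
    "very aesthetic", "best quality", "newest",
    custom=[], rating="general", character=[], series=[],
    rest=["1girl", "solo"],
)
print(caption)
# very aesthetic, best quality, newest, general, 1girl, solo
print(shuffle_caption(caption, keep_tags=1, seed=0))
```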

### Date:

| tag | date |
|---|---|
| **newest** | 2022 to 2024 |
| **recent** | 2019 to 2021 |
| **mid** | 2015 to 2018 |
| **early** | 2011 to 2014 |
| **oldest** | 2005 to 2010 |
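
The date table can be expressed as a simple lookup. This is an illustrative sketch; `date_tag` is a hypothetical helper, and returning `None` for years outside the table is an assumption.

```python
DATE_RANGES = [
    (2022, 2024, "newest"),
    (2019, 2021, "recent"),
    (2015, 2018, "mid"),
    (2011, 2014, "early"),
    (2005, 2010, "oldest"),
]


def date_tag(year):
    """Return the date tag for a year, or None if the year is not covered."""
    for start, end, tag in DATE_RANGES:
        if start <= year <= end:
            return tag
    return None


print(date_tag(2023))  # newest
print(date_tag(2012))  # early
print(date_tag(1999))  # None
```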

### Aesthetic Tags:
**Model used**: shadowlilac/aesthetic-shadow-v2

| score greater than | tag | count |
|---|---|---|
| **0.90** | extremely aesthetic | 125.451 |
| **0.80** | very aesthetic | 887.382 |
| **0.70** | aesthetic | 1.049.857 |
| **0.50** | slightly aesthetic | 1.643.091 |
| **0.40** | not displeasing | 569.543 |
| **0.30** | not aesthetic | 445.188 |
| **0.20** | slightly displeasing | 341.424 |
| **0.10** | displeasing | 237.660 |
| **rest of them** | very displeasing | 328.712 |

### Quality Tags:
**Model used**: https://huggingface.co/hakurei/waifu-diffusion-v1-4/blob/main/models/aes-B32-v0.pth

| score greater than | tag | count |
|---|---|---|
| **0.980** | best quality | 1.270.447 |
| **0.900** | high quality | 498.244 |
| **0.750** | great quality | 351.006 |
| **0.500** | medium quality | 366.448 |
| **0.250** | normal quality | 368.380 |
| **0.125** | bad quality | 279.050 |
| **0.025** | low quality | 538.958 |
| **rest of them** | worst quality | 1.955.966 |
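
Both the aesthetic and quality tables work the same way: the first threshold that the score exceeds decides the tag, and everything else falls into the last row. A sketch of that lookup, assuming a strict "greater than" comparison as the column header suggests; `score_to_tag` is a hypothetical name, and the scoring models themselves are not called here.

```python
AESTHETIC_THRESHOLDS = [
    (0.90, "extremely aesthetic"), (0.80, "very aesthetic"), (0.70, "aesthetic"),
    (0.50, "slightly aesthetic"), (0.40, "not displeasing"), (0.30, "not aesthetic"),
    (0.20, "slightly displeasing"), (0.10, "displeasing"),
]
QUALITY_THRESHOLDS = [
    (0.980, "best quality"), (0.900, "high quality"), (0.750, "great quality"),
    (0.500, "medium quality"), (0.250, "normal quality"), (0.125, "bad quality"),
    (0.025, "low quality"),
]


def score_to_tag(score, thresholds, fallback):
    """Return the tag of the first threshold the score exceeds, else the fallback."""
    for threshold, tag in thresholds:
        if score > threshold:
            return tag
    return fallback


print(score_to_tag(0.85, AESTHETIC_THRESHOLDS, "very displeasing"))  # very aesthetic
print(score_to_tag(0.05, AESTHETIC_THRESHOLDS, "very displeasing"))  # very displeasing
print(score_to_tag(0.99, QUALITY_THRESHOLDS, "worst quality"))       # best quality
print(score_to_tag(0.01, QUALITY_THRESHOLDS, "worst quality"))       # worst quality
```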

### Rating Tags:

| tag | count |
|---|---|
| **general** | 1.416.451 |
| **sensitive** | 3.447.664 |
| **nsfw** | 427.459 |
| **explicit nsfw** | 336.925 |

### Custom Tags:

| dataset name | custom tag |
|---|---|
| **image boards** | date, |
| **pixiv** | art by Display_Name, |
| **visual novel cg** | Full_VN_Name (short_3_letter_name), visual novel cg, |
| **anime wallpaper** | date, anime wallpaper, |

## Training Parameters:
**Software used**: OneTrainer  
https://github.com/Nerogar/OneTrainer/  

**Base model**: Disty0/sote-diffusion-cascade-alpha2  
### Config:
```
{
    "__version": 3,
    "training_method": "FINE_TUNE",
    "model_type": "STABLE_CASCADE_1",
    "debug_mode": false,
    "debug_dir": "debug",
    "workspace_dir": "/mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/OneTrainer/run",
    "cache_dir": "/mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/OneTrainer/workspace-cache/run",
    "tensorboard": true,
    "tensorboard_expose": false,
    "continue_last_backup": true,
    "include_train_config": "NONE",
    "base_model_name": "Disty0/sotediffusion-wuerstchen3-alpha2",
    "weight_dtype": "BFLOAT_16",
    "output_dtype": "BFLOAT_16",
    "output_model_format": "SAFETENSORS",
    "output_model_destination": "/mnt/DataSSD/AI/SoteDiffusion/Wuerstchen3/OneTrainer/workspace-cache/models",
    "gradient_checkpointing": true,
    "force_circular_padding": false,
    "concept_file_name": "training_concepts/concepts.json",
    "concepts": [
        {
            "__version": 1,
            "image": {
                "__version": 0,
                "enable_crop_jitter": false,
                "enable_random_flip": false,
                "enable_fixed_flip": false,
                "enable_random_rotate": false,
                "enable_fixed_rotate": false,
                "random_rotate_max_angle": 0.0,
                "enable_random_brightness": false,
                "enable_fixed_brightness": false,
                "random_brightness_max_strength": 0.0,
                "enable_random_contrast": false,
                "enable_fixed_contrast": false,
                "random_contrast_max_strength": 0.0,
                "enable_random_saturation": false,
                "enable_fixed_saturation": false,
                "random_saturation_max_strength": 0.0,
                "enable_random_hue": false,
                "enable_fixed_hue": false,
                "random_hue_max_strength": 0.0,
                "enable_resolution_override": false,
                "resolution_override": "1024"
            },
            "text": {
                "__version": 0,
                "prompt_source": "sample",
                "prompt_path": "",
                "enable_tag_shuffling": true,
                "tag_delimiter": ", ",
                "keep_tags_count": 1
            },
            "name": "",
            "path": "/mnt/DataSSD/AI/anime_image_dataset/best/newest_best",
            "seed": -209204630,
            "enabled": true,
            "include_subdirectories": true,
            "image_variations": 1,
            "text_variations": 1,
            "balancing": 1.0,
            "balancing_strategy": "REPEATS",
            "loss_weight": 1.0
        }
    ],
    "circular_mask_generation": false,
    "random_rotate_and_crop": false,
    "aspect_ratio_bucketing": true,
    "latent_caching": true,
    "clear_cache_before_training": false,
    "learning_rate_scheduler": "CONSTANT",
    "learning_rate": 1e-05,
    "learning_rate_warmup_steps": 200,
    "learning_rate_cycles": 1,
    "epochs": 1,
    "batch_size": 16,
    "gradient_accumulation_steps": 1,
    "ema": "OFF",
    "ema_decay": 0.999,
    "ema_update_step_interval": 5,
    "dataloader_threads": 8,
    "train_device": "cuda",
    "temp_device": "cpu",
    "train_dtype": "FLOAT_16",
    "fallback_train_dtype": "BFLOAT_16",
    "enable_autocast_cache": true,
    "only_cache": false,
    "resolution": "1024",
    "attention_mechanism": "SDP",
    "align_prop": false,
    "align_prop_probability": 0.1,
    "align_prop_loss": "AESTHETIC",
    "align_prop_weight": 0.01,
    "align_prop_steps": 20,
    "align_prop_truncate_steps": 0.5,
    "align_prop_cfg_scale": 7.0,
    "mse_strength": 1.0,
    "mae_strength": 0.0,
    "vb_loss_strength": 1.0,
    "loss_weight_fn": "P2",
    "loss_weight_strength": 1.0,
    "dropout_probability": 0.0,
    "loss_scaler": "NONE",
    "learning_rate_scaler": "NONE",
    "offset_noise_weight": 0.0,
    "perturbation_noise_weight": 0.0,
    "rescale_noise_scheduler_to_zero_terminal_snr": false,
    "force_v_prediction": false,
    "force_epsilon_prediction": false,
    "min_noising_strength": 0.0,
    "max_noising_strength": 1.0,
    "noising_weight": 0.0,
    "noising_bias": 0.5,
    "unet": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": 0,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "NONE"
    },
    "prior": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": 0,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "NONE"
    },
    "text_encoder": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": 0,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "NONE"
    },
    "text_encoder_layer_skip": 0,
    "text_encoder_2": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": 30,
        "stop_training_after_unit": "EPOCH",
        "learning_rate": null,
        "weight_dtype": "NONE"
    },
    "text_encoder_2_layer_skip": 0,
    "vae": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": null,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "FLOAT_32"
    },
    "effnet_encoder": {
        "__version": 0,
        "model_name": "/mnt/DataSSD/AI/models/wuerstchen3/effnet_encoder.safetensors",
        "train": true,
        "stop_training_after": null,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "FLOAT_16"
    },
    "decoder": {
        "__version": 0,
        "model_name": "Disty0/sotediffusion-wuerstchen3-alpha2-decoder",
        "train": true,
        "stop_training_after": null,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "FLOAT_16"
    },
    "decoder_text_encoder": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": null,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "NONE"
    },
    "decoder_vqgan": {
        "__version": 0,
        "model_name": "",
        "train": true,
        "stop_training_after": null,
        "stop_training_after_unit": "NEVER",
        "learning_rate": null,
        "weight_dtype": "FLOAT_16"
    },
    "masked_training": false,
    "unmasked_probability": 0.1,
    "unmasked_weight": 0.1,
    "normalize_masked_area_loss": false,
    "embedding_learning_rate": null,
    "preserve_embedding_norm": false,
    "embedding": {
        "__version": 0,
        "uuid": "bf3a36b4-bd01-4b46-b818-3c6414887497",
        "model_name": "",
        "placeholder": "<embedding>",
        "train": true,
        "stop_training_after": null,
        "stop_training_after_unit": "NEVER",
        "token_count": 1,
        "initial_embedding_text": "*"
    },
    "additional_embeddings": [],
    "embedding_weight_dtype": "FLOAT_32",
    "lora_model_name": "",
    "lora_rank": 16,
    "lora_alpha": 1.0,
    "lora_weight_dtype": "FLOAT_32",
    "optimizer": {
        "__version": 0,
        "optimizer": "ADAFACTOR",
        "adam_w_mode": false,
        "alpha": null,
        "amsgrad": false,
        "beta1": null,
        "beta2": null,
        "beta3": null,
        "bias_correction": false,
        "block_wise": false,
        "capturable": false,
        "centered": false,
        "clip_threshold": 1.0,
        "d0": null,
        "d_coef": null,
        "dampening": null,
        "decay_rate": -0.8,
        "decouple": false,
        "differentiable": false,
        "eps": 1e-30,
        "eps2": 0.001,
        "foreach": false,
        "fsdp_in_use": false,
        "fused": false,
        "fused_back_pass": true,
        "growth_rate": null,
        "initial_accumulator_value": null,
        "is_paged": false,
        "log_every": null,
        "lr_decay": null,
        "max_unorm": null,
        "maximize": false,
        "min_8bit_size": null,
        "momentum": null,
        "nesterov": false,
        "no_prox": false,
        "optim_bits": null,
        "percentile_clipping": null,
        "r": null,
        "relative_step": false,
        "safeguard_warmup": false,
        "scale_parameter": false,
        "stochastic_rounding": true,
        "use_bias_correction": false,
        "use_triton": false,
        "warmup_init": false,
        "weight_decay": 0.0,
        "weight_lr_power": null
    },
    "optimizer_defaults": {
        "ADAFACTOR": {
            "__version": 0,
            "optimizer": "ADAFACTOR",
            "adam_w_mode": false,
            "alpha": null,
            "amsgrad": false,
            "beta1": null,
            "beta2": null,
            "beta3": null,
            "bias_correction": false,
            "block_wise": false,
            "capturable": false,
            "centered": false,
            "clip_threshold": 1.0,
            "d0": null,
            "d_coef": null,
            "dampening": null,
            "decay_rate": -0.8,
            "decouple": false,
            "differentiable": false,
            "eps": 1e-30,
            "eps2": 0.001,
            "foreach": false,
            "fsdp_in_use": false,
            "fused": false,
            "fused_back_pass": true,
            "growth_rate": null,
            "initial_accumulator_value": null,
            "is_paged": false,
            "log_every": null,
            "lr_decay": null,
            "max_unorm": null,
            "maximize": false,
            "min_8bit_size": null,
            "momentum": null,
            "nesterov": false,
            "no_prox": false,
            "optim_bits": null,
            "percentile_clipping": null,
            "r": null,
            "relative_step": false,
            "safeguard_warmup": false,
            "scale_parameter": false,
            "stochastic_rounding": true,
            "use_bias_correction": false,
            "use_triton": false,
            "warmup_init": false,
            "weight_decay": 0.0,
            "weight_lr_power": null
        }
    },
    "sample_definition_file_name": "training_samples/samples.json",
    "samples": [
        {
            "__version": 0,
            "enabled": true,
            "prompt": "very aesthetic, best quality, newest, sensitive, 1girl, solo, upper body,",
            "negative_prompt": "monochrome, sketch, fat,",
            "height": 1024,
            "width": 1024,
            "seed": 42,
            "random_seed": true,
            "diffusion_steps": 30,
            "cfg_scale": 7.0,
            "noise_scheduler": "EULER_A"
        }
    ],
    "sample_after": 30,
    "sample_after_unit": "MINUTE",
    "sample_image_format": "JPG",
    "samples_to_tensorboard": true,
    "non_ema_sampling": true,
    "backup_after": 30,
    "backup_after_unit": "MINUTE",
    "rolling_backup": true,
    "rolling_backup_count": 10,
    "backup_before_save": true,
    "save_after": 30,
    "save_after_unit": "MINUTE",
    "save_filename_prefix": ""
}
```
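
To put a few of these values in context, the sketch below parses a small fragment of the config and derives the effective batch size and the optimizer steps needed for one chunk. It assumes a run covers one full 8000-image chunk for one epoch; aspect ratio bucketing can change the exact step count, so treat the result as an estimate.

```python
import json

# Fragment copied from the config above; the full file has many more keys.
config_fragment = """
{
    "learning_rate": 1e-05,
    "learning_rate_warmup_steps": 200,
    "epochs": 1,
    "batch_size": 16,
    "gradient_accumulation_steps": 1
}
"""

IMAGES_PER_CHUNK = 8000  # from the training status note

cfg = json.loads(config_fragment)
effective_batch = cfg["batch_size"] * cfg["gradient_accumulation_steps"]
steps_per_chunk = IMAGES_PER_CHUNK // effective_batch * cfg["epochs"]

print(effective_batch)  # 16
print(steps_per_chunk)  # 500
```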


## Limitations and Bias

### Bias

- This model is intended for anime illustrations.  
  Realistic capabilities are not tested at all.  

### Limitations

- Can fall back to a realistic style.  
  Add the "realistic" tag to the negative prompt when this happens.  
- Eyes can be poorly rendered in far shots.  
- Anatomy and hands can be bad.  
- Still in active training.  


## License

SoteDiffusion models fall under the [Fair AI Public License 1.0-SD](https://freedevproject.org/faipl-1.0-sd/), which is compatible with the Stable Diffusion models’ license. Key points:

1. **Modification Sharing:** If you modify SoteDiffusion models, you must share both your changes and the original license.
2. **Source Code Accessibility:** If your modified version is network-accessible, provide a way (like a download link) for others to get the source code. This applies to derived models too.
3. **Distribution Terms:** Any distribution must be under this license or another with similar rules.
4. **Compliance:** Non-compliance must be fixed within 30 days to avoid license termination, emphasizing transparency and adherence to open-source values.

**Notes**: Anything not covered by the Fair AI license is inherited from the Stability AI Non-Commercial license, which is provided as LICENSE_INHERIT. This means commercial use of any kind is still not allowed.