Update README.md
README.md
  - stable-diffusion-xl
base_model: cagliostrolab/animagine-xl-3.0
widget:
- text: 1girl, green hair, sweater, looking at viewer, upper body, beanie, outdoors, night, turtleneck, masterpiece, best quality, very aesthetic, absurdres
  parameters:
    negative_prompt: nsfw, lowres, (bad), text, error, fewer, extra, missing, worst quality, jpeg artifacts, low quality, watermark, unfinished, displeasing, oldest, early, chromatic aberration, signature, extra digits, artistic error, username, scan, [abstract]
  example_title: 1girl
- text: 1boy, male focus, green hair, sweater, looking at viewer, upper body, beanie, outdoors, night, turtleneck, masterpiece, best quality, very aesthetic, absurdres
  parameters:
    negative_prompt: nsfw, lowres, (bad), text, error, fewer, extra, missing, worst quality, jpeg artifacts, low quality, watermark, unfinished, displeasing, oldest, early, chromatic aberration, signature, extra digits, artistic error, username, scan, [abstract]
  example_title: 1boy
---
<style>
## Anime-focused Dataset Additions

On Animagine XL 3.0, we mostly added characters from popular gacha games. Based on user feedback, we have added many popular anime franchises to the dataset for this model. We will publish the full list of characters this iteration can generate on our HuggingFace page soon, so be sure to check it out when it's up!

## Model Details

- **Developed by**: [Cagliostro Research Lab](https://huggingface.co/cagliostrolab)
- **Model type**: Diffusion-based text-to-image generative model
To use Animagine XL 3.1, install the required libraries as follows:

```bash
pip install diffusers transformers accelerate safetensors --upgrade
```
Example script for generating images with Animagine XL 3.1:

```python
import torch
from diffusers import DiffusionPipeline

# Load the pipeline in half precision with safetensors weights
pipe = DiffusionPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-3.1",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.to('cuda')

prompt = "1girl, souryuu asuka langley, neon genesis evangelion, solo, upper body, v, smile, looking at viewer, outdoors, night"
negative_prompt = "nsfw, lowres, (bad), text, error, fewer, extra, missing, worst quality, jpeg artifacts, low quality, watermark, unfinished, displeasing, oldest, early, chromatic aberration, signature, extra digits, artistic error, username, scan, [abstract]"

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    width=832,
    height=1216,
    guidance_scale=7,
    num_inference_steps=28
).images[0]

image.save("./asuka_test.png")
```
## Usage Guidelines

### Tag Ordering

For optimal results, we recommend following this structured prompt template, since it reflects how the model was trained:

```
1girl/1boy, character name, from what series, everything else in any order.
```
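As a minimal sketch, assembling a prompt in this order might look like the following; the character, series, and detail tags below are illustrative examples, not requirements:

```python
# Illustrative sketch: composing a prompt in the trained tag order.
# All tag values below are examples; any character/series the model knows works.
subject = "1girl"
character = "souryuu asuka langley"   # character name
series = "neon genesis evangelion"    # from what series
details = ["solo", "upper body", "smile", "outdoors", "night"]
quality = ["masterpiece", "best quality", "very aesthetic", "absurdres"]

prompt = ", ".join([subject, character, series, *details, *quality])
print(prompt)
# -> 1girl, souryuu asuka langley, neon genesis evangelion, solo, upper body,
#    smile, outdoors, night, masterpiece, best quality, very aesthetic, absurdres
```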
### Quality Modifiers

Quality tags now consider both scores and post ratings to ensure a balanced quality distribution. We've refined labels for greater clarity, such as changing 'high quality' to 'great quality'.

| Quality Modifier | Score Criterion |
|------------------|-----------------|
| `masterpiece`    | > 95%           |
| `best quality`   | > 85% & ≤ 95%   |
| `great quality`  | > 75% & ≤ 85%   |
| `good quality`   | > 50% & ≤ 75%   |
| `normal quality` | > 25% & ≤ 50%   |
| `low quality`    | > 10% & ≤ 25%   |
| `worst quality`  | ≤ 10%           |
### Rating Modifiers

We've also streamlined our rating tags for simplicity and clarity, aiming to establish global rules that can be applied across different models. For example, the tag 'rating: general' is now simply 'general', and 'rating: sensitive' has been condensed to 'sensitive'.

| Rating Modifier  | Rating Criterion |
|------------------|------------------|
| `general`        | General          |
| `sensitive`      | Sensitive        |
| `nsfw`           | Questionable     |
| `explicit, nsfw` | Explicit         |
### Year Modifier
|
289 |
|
290 |
+
We've also redefined the year range to steer results towards specific modern or vintage anime art styles more accurately. This update simplifies the range, focusing on relevance to current and past eras.
|
291 |
|
292 |
| Year Tag | Year Range |
|
293 |
+
|----------|------------------|
|
294 |
+
| `newest` | 2021 to 2024 |
|
295 |
+
| `recent` | 2018 to 2020 |
|
296 |
+
| `mid` | 2015 to 2017 |
|
297 |
| `early` | 2011 to 2014 |
|
298 |
| `oldest` | 2005 to 2010 |
|
299 |
|
300 |
### Aesthetic Tags
|
301 |
|
302 |
+
We've enhanced our tagging system with aesthetic tags to refine content categorization based on visual appeal. These tags—`very aesthetic`, `aesthetic`, `displeasing`, and `very displeasing`—are derived from evaluations made by a specialized ViT (Vision Transformer) image classification model, specifically trained on anime data. For this purpose, we utilized the model [shadowlilac/aesthetic-shadow-v2](https://huggingface.co/shadowlilac/aesthetic-shadow-v2), which assesses the aesthetic value of content before it undergoes training. This ensures that each piece of content is not only relevant and accurate but also visually appealing.
|
303 |
|
304 |
+
| Aesthetic Tag | Score Range |
|
305 |
+
|-------------------|-------------------|
|
306 |
+
| `very aesthetic` | > 0.71 |
|
307 |
+
| `aesthetic` | > 0.45 & < 0.71 |
|
308 |
+
| `displeasing` | > 0.27 & < 0.45 |
|
309 |
+
| `very displeasing`| ≤ 0.27 |
|
310 |
|
311 |
## Recommended settings
|
312 |
|
313 |
To guide the model towards generating high-aesthetic images, use negative prompts like:
|
314 |
|
315 |
```
|
316 |
+
nsfw, lowres, (bad), text, error, fewer, extra, missing, worst quality, jpeg artifacts, low quality, watermark, unfinished, displeasing, oldest, early, chromatic aberration, signature, extra digits, artistic error, username, scan, [abstract]
|
317 |
```
|
318 |
|
319 |
For higher quality outcomes, prepend prompts with:
|
320 |
|
321 |
```
|
322 |
+
masterpiece, best quality, very aesthetic, absurdres
|
323 |
```
|
324 |
|
325 |
+
it’s recommended to use a lower classifier-free guidance (CFG Scale) of around 5-7, sampling steps below 30, and to use Euler Ancestral (Euler a) as a sampler.
|
326 |
|
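Applied to the diffusers pipeline from the example script above, these settings look roughly like this (the prompt values are shortened placeholders):

```python
# Sketch: recommended sampler and settings on the Animagine XL 3.1 pipeline.
import torch
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = DiffusionPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-3.1",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

# Swap in Euler Ancestral (Euler a), reusing the pipeline's scheduler config
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "1girl, masterpiece, best quality, very aesthetic, absurdres",      # placeholder prompt
    negative_prompt="nsfw, lowres, (bad), worst quality, displeasing",  # shortened
    guidance_scale=6,        # recommended CFG range: 5-7
    num_inference_steps=28,  # keep below 30
).images[0]
```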
### Multi Aspect Resolution

This model supports generating images at the following dimensions:
## Training and Hyperparameters

- **Animagine XL 3.1** was trained on 2x A100 80GB GPUs for roughly 15 days, over 350 GPU hours in the pretraining stage. The training process encompassed three stages:
  - Continual Pretraining:
    - **Pretraining Stage**: uses a data-rich collection of roughly 870k ordered, tagged images to expand the knowledge of the Animagine XL 3.0 model.
  - Finetuning:
    - **First Stage**: uses labeled and curated aesthetic datasets to repair the U-Net after pretraining.
    - **Second Stage**: uses labeled and curated aesthetic datasets to refine the model's art style and fix bad hands and anatomy.
### Hyperparameters
|
353 |
|
354 |
+
| Stage | Epochs | UNet lr | Train Text Encoder | Batch Size | Noise Offset | Optimizer | LR Scheduler | Grad Acc Steps | GPUs |
|
355 |
+
|-----------------------|--------|---------|--------------------|------------|--------------|------------|-------------------------------|----------------|------|
|
356 |
+
| **Pretraining Stage** | 10 | 1e-5 | True | 16 | N/A | AdamW | Cosine Annealing Warm Restart | 3 | 2 |
|
357 |
+
| **First Stage** | 10 | 2e-6 | False | 48 | 0.0357 | Adafactor | Constant with Warmup | 1 | 1 |
|
358 |
+
| **Second Stage** | 15 | 1e-6 | False | 48 | 0.0357 | Adafactor | Constant with Warmup | 1 | 1 |
|
359 |
|
360 |
+
## Model Comparison (Pretraining only)
|
361 |
|
362 |
### Training Config
|
363 |
|
364 |
+
| Configuration Item | Animagine XL 3.0 | Animagine XL 3.1 |
|
365 |
+
|---------------------------------|------------------------------------------|------------------------------------------------|
|
366 |
+
| **GPU** | 2 x A100 80G | 2 x A100 80G |
|
367 |
+
| **Dataset** | 1,271,990 | 873,504 |
|
368 |
+
| **Shuffle Separator** | True | True |
|
369 |
+
| **Num Epochs** | 10 | 10 |
|
370 |
+
| **Learning Rate** | 7.5e-6 | 1e-5 |
|
371 |
+
| **Text Encoder Learning Rate** | 3.75e-6 | 1e-5 |
|
372 |
+
| **Effective Batch Size** | 48 x 1 x 2 | 16 x 3 x 2 |
|
373 |
+
| **Optimizer** | Adafactor | AdamW |
|
374 |
+
| **Optimizer Args** | Scale Parameter: False, Relative Step: False, Warmup Init: False | Weight Decay: 0.1, Betas: (0.9, 0.99) |
|
375 |
+
| **LR Scheduler** | Constant with Warmup | Cosine Annealing Warm Restart |
|
376 |
+
| **LR Scheduler Args** | Warmup Steps: 100 | Num Cycles: 10, Min LR: 1e-6, LR Decay: 0.9, First Cycle Steps: 9,099 |
|
377 |
|
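As an illustrative sketch only (the actual training code lives in the sd-scripts repository linked below), the Animagine XL 3.1 optimizer and scheduler settings correspond roughly to these PyTorch objects; note that PyTorch's built-in `CosineAnnealingWarmRestarts` has no per-cycle LR decay, so the `LR Decay: 0.9` entry reflects sd-scripts' own scheduler variant:

```python
import torch

model = torch.nn.Linear(16, 16)  # hypothetical stand-in for the SDXL UNet

# AdamW with the 3.1 optimizer args from the table above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # UNet learning rate (pretraining stage)
    betas=(0.9, 0.99),
    weight_decay=0.1,
)

# First cycle of 9,099 steps with a minimum LR of 1e-6
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=9099, eta_min=1e-6
)
```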
Source code and training config are available here: https://github.com/cagliostrolab/sd-scripts/tree/main/notebook