How I train a LoRA: m3lt style training overview

Community Article · Published July 1, 2024

This article takes a step-by-step approach to outlining the method I used to train the 'm3lt' LoRA model. I'll provide the input images (synthetically generated) and rely on automatically generated captions, to show the importance of good images and good parameters. To train, I used LoRA Ease by multimodalart.

Thank you to multimodalart and the HF team!

Getting a training environment set up that was approachable, free to access for those with local GPUs, and met my ideal settings was a near impossible task. I have to really thank the Hugging Face team, in particular multimodalart (apolinario), for creating LoRA Ease and making the adjustments needed so that I could identify the perfect parameters. Open source AI is going to continue to change the world!

General notes

I personally do not think you can quickly learn LoRA training from a tutorial - I think the ideal situation is to learn it from another person, because it requires a mix of artistry and pattern recognition. That is why I rarely write "here is how to do X" style workflows for LoRA training.

If this is your first time trying a LoRA - it would be wise to use a few existing LoRAs with a model first, so that when you test your own model, you know what good results look like.

If this is your first time finetuning a model - expect that you might need to retrain it a couple times to get what you want. I have trained probably thousands of LoRAs and I still retrain some 2-3 times to get the right output.

Model Info

Notes

  • My goal with artistic LoRAs is to create a new style, and thus I do sometimes blend styles such as in this workflow.
  • It is possible to achieve even more nuance and accuracy in creating a particular style.
  • By leveraging captions using some of the suggestions I've made in past articles, you can be more accurate with your stylization.
  • Different styles require adjustments to parameters, in my experience. I recommend you use this as a jumping off point, and encourage you to make your own discoveries. I plan to do other workflows in the future to exemplify the difference.
  • If you would prefer to run a script locally, you can use this link to set it up.

Assembling Training Data

Possibly the most important step in creating your model is dataset curation. In the case of this model, I chose 54 images. You'll note that there is a range of style elements, with some common themes. This was intentional - I wanted to craft a style capable of a wide range of color, without overfitting on the more muted tones of the portraits, while keeping a sketchy quality in the linework.

[Image: the 54-image training dataset]

Typically, I keep my training datasets between 20-30 images. Because this style was intentionally aimed at having a wider array of style concepts present, I pushed the dataset to a larger size.
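
If you're assembling the dataset locally, a quick sanity check on image count and resolution can save a retrain. A minimal sketch (the `dataset/` folder name is a placeholder), assuming you'll train at 1024px as described later in this workflow:

```python
from pathlib import Path
from PIL import Image  # pip install pillow

DATASET_DIR = Path("dataset")  # hypothetical folder of training images
TARGET_RES = 1024              # training resolution used later in this workflow

paths = sorted(p for p in DATASET_DIR.iterdir() if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
print(f"{len(paths)} images found")

for p in paths:
    with Image.open(p) as im:
        w, h = im.size
        if min(w, h) < TARGET_RES:
            # Upscaling small images tends to soften the style, so flag them.
            print(f"{p.name}: {w}x{h} is below {TARGET_RES}px on its short side")
```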

Captioning

In the case of this model, I wanted to show the strength of a good dataset and parameters, so I left the captions to the autocaption output on LoRA Ease. I could also leverage captions to engineer certain results, which I may show in a future article.
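
LoRA Ease generates these captions for you. If you're scripting locally and want something comparable, a BLIP captioner from transformers is one common choice - this is an illustrative sketch, not necessarily the captioner LoRA Ease itself uses:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Any off-the-shelf captioner works; BLIP is a common default.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

image = Image.open("dataset/img_001.png").convert("RGB")  # hypothetical file
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```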

Unique Token

I used the unique token provided by LoRA Ease. That said, the particular parameters I'm using here don't always require a unique token.

The style certainly comes through without the token here, using the prompt "woman."

[Image: output for the prompt "woman", without the unique token]

However, there's no doubt that the style shifts and solidifies slightly with the inclusion of "style of TOK" (same seed). We also see more depth of field and detail. This could be manipulated further with more advanced captioning techniques during training.

[Image: output for "woman, style of TOK", same seed]
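
To reproduce this kind of with/without comparison yourself, fix the seed and change only the prompt. A minimal sketch with diffusers, assuming an SDXL base model and a hypothetical LoRA repo id:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("your-username/your-lora")  # hypothetical repo id

def generate(prompt: str, seed: int = 42):
    # Re-create the generator each call so both prompts start from the same noise.
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]

without_token = generate("woman")
with_token = generate("woman, style of TOK")
```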

Training Parameters

For LoRA Ease I used the following training parameters. I am also testing an alternative "slow" training on a lower learning rate, which is important for nuanced styles such as photography. I'll share that preset in a different article.

[Image: LoRA Ease training parameter screen]

When you use LoRA Ease, feel free to select the Style preset for this workflow.

Advanced Options

  • Optimizer: AdamW
  • Use SNR Gamma: Yes
  • SNR Gamma Weight: 5
  • Mixed Precision: bf16
  • UNet LR: 0.0005
  • Max Training Steps: 3000[^1]
  • LoRA Rank: 32
  • Repeats: 12
  • Prior Pres Loss: Off
  • TI Embedding Training: Off[^2]
  • TE %: 0.2
  • TE Learning Rate: 0.00005
  • Resolution: 1024
  • Seed: 42 is fine; use whatever works best for you.
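
If you would rather drive a training script directly (see the local setup link above), the same recipe can be collected into a plain config. This is an illustrative sketch - the key names roughly follow diffusers-style script arguments and won't map one-to-one onto every script:

```python
# Illustrative config mirroring the LoRA Ease settings above; key names are
# approximations of diffusers-style script arguments, not exact flags.
training_config = {
    "optimizer": "AdamW",
    "snr_gamma": 5.0,                # Min-SNR weighting ("Use SNR Gamma: Yes")
    "mixed_precision": "bf16",
    "learning_rate": 5e-4,           # UNet LR
    "max_train_steps": 3000,
    "rank": 32,                      # LoRA rank
    "repeats": 12,
    "with_prior_preservation": False,
    "train_text_encoder_ti": False,  # TI embedding training off
    "train_text_encoder_frac": 0.2,  # TE %: train the text encoder for 20% of steps
    "text_encoder_lr": 5e-5,
    "resolution": 1024,
    "seed": 42,
}
```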

[^1]: I calculate max training steps by multiplying the number of training images by steps per image. Typically, I want between 50-75 steps per image when using a UNet LR of 1e-4 to 5e-4. Here, 54 images × ~55 steps per image ≈ 3000 steps. I increase the steps when I train at a lower overall LR.
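
Footnote 1 is easy to encode as a helper if you're scripting your runs:

```python
def max_train_steps(num_images: int, steps_per_image: int = 55) -> int:
    """Rule of thumb: 50-75 steps per image at a UNet LR of 1e-4 to 5e-4;
    increase steps when training at a lower overall LR."""
    return num_images * steps_per_image

print(max_train_steps(54))  # 2970 - rounded up to 3000 for this model
```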

[^2]: I find that TI trains the style well but overfits the concepts. I'll explore this in more depth with some concept specific training runs.

Even More Advanced Options

  • gradient_accumulation_steps: 1
  • Train Batch Size: 4
  • num_train_epochs: 1
  • prior_loss_weight: 1
  • gradient_checkpointing: On
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • Scale Learning Rate: Off
  • lr_scheduler: cosine_with_restarts
  • lr_power: 1
  • lr_warmup_steps: 0
  • Dataloader num workers: 0
  • local_rank: -1
  • Use Prodigy Beta 3: No
  • Adam Weight Decay: 0.0001
  • Use Adam Weight Decay Text Encoder: No
  • Adam Weight Decay Text Encoder: 0
  • Adam Epsilon: 1e-8
  • Prodigy Use Bias Correction: No
  • Prodigy Safeguard Warmup: No
  • Max Grad Norm: 1
  • enable_xformers_memory_efficient_attention: Off

With these settings you should be able to train styles fairly quickly. If you find that you are overfitting, I tend to recommend reducing the UNet training steps by 100-200. At a 1e-4-scale learning rate, small step changes can make a big impact, as opposed to slower 1e-5-scale LRs.

If you find the style is training well but the concepts are overfitting, reduce the TE percentage to 0.15 or 0.1. In this case you may also need to increase the UNet training steps by 100-200. If the issue persists, I recommend revisiting your dataset and ensuring that your images aren't repeats, or simply too consistent.

If you are underfitting, try to fix it first by incrementally increasing the overall step count, 100 steps at a time. Typically it is better to overfit a LoRA training than underfit, because you can balance it out by reducing the LoRA strength during inference, as in the sketch below.
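
Reducing LoRA strength at inference is a one-line change in diffusers. A minimal sketch, again assuming an SDXL base and a hypothetical LoRA repo id:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("your-username/your-lora")  # hypothetical repo id

# Dial the LoRA back from full strength to rein in an overfit style.
# (With PEFT-backed LoRAs, pipe.set_adapters() with adapter_weights is an alternative.)
image = pipe(
    "a woman, style of TOK",
    cross_attention_kwargs={"scale": 0.8},
).images[0]
```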

Results

I wanted to share some examples of the final outputs for this model, to show you the range these parameters create. For inference, I always use the base model the style was trained on (crucial for good results, in my opinion) and the following inference parameters:

[Image: inference parameters]


a woman, style of TOK

[Image: output for the prompt above]

a robot flying through the sky with smoke billowing behind it

[Image: output for the prompt above]

a woman with blonde-brown hair, white t shirt, black linen shorts

[Image: output for the prompt above]

a woman with blonde-brown hair, white t shirt, black linen shorts, anime style illustration, big round glasses, style of TOK

[Image: output for the prompt above]

a meatball sub, style of TOK

[Image: output for the prompt above]

Final Observations

  • Different prompts relate to the unique token in different ways. This could be addressed with a mixed caption style, which would put less weight on prompt syntax and on the relationship between the unique token and the concepts in each image.
  • The most variety is certainly in the women and girls created with the prompts. This is no surprise given the range in the dataset, and they respond well to different stylized prompts.
  • Concepts not present in the original dataset at all often require the unique token. This could be adjusted by making changes to the captioning style.