license: creativeml-openrail-m
tags:
- computer vision
- stable-diffusion
- stable-diffusion-2-1
- photography
- photoreal
Capabilities
This model is capable of producing photorealistic images of people.
It retains much of the knowledge of the base 2.1-v model, as its text encoder is only minimally tuned.
Limitations
This model does not produce perfect results every time.
This model cannot reproduce most real people. Instead, it produces "Derp-a-Like" approximations of them, which I prefer.
This model is not great at abstract imagery or digital art, though it certainly can produce a variety of amazing art styles.
Dataset
- Cushman (8,000 Kodachrome slides from 1939 to 1969)
- Midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
- National Geographic (about 3,000-4,000 images larger than 1024x768, covering animals, wildlife, landscapes, and history)
- a small dataset of stock images of people vaping / smoking
Training parameters
- polynomial learning rate scheduler shared between the text encoder (TE) and the U-Net, starting at 4e-8 and decaying to 1e-8
- batch size 15 with 10 gradient accumulation steps, for an effective batch size of 150
- target is 30,000 steps, but training will likely stop sooner
- noise schedule betas rescaled to enforce zero terminal SNR
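
As a rough illustration of the training parameters above, the sketch below computes a polynomial learning rate decay from 4e-8 to 1e-8 over 30,000 steps and rescales a beta schedule so that the final timestep has zero SNR. It is not the actual training code. The decay power of 1.0, the scaled-linear base schedule (1000 timesteps, betas from 0.00085 to 0.012, which are typical Stable Diffusion settings), and all function names are assumptions made for the example.

```python
import math

def polynomial_lr(step, total_steps=30_000, lr_init=4e-8, lr_end=1e-8, power=1.0):
    """Polynomial decay from lr_init to lr_end (power=1.0 is an assumption)."""
    step = min(step, total_steps)
    remaining = 1.0 - step / total_steps
    return (lr_init - lr_end) * remaining ** power + lr_end

def scaled_linear_betas(n=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed base schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]

def enforce_zero_terminal_snr(betas):
    """Rescale betas so the cumulative alpha product reaches 0 at the last step."""
    # sqrt of the cumulative product of alphas (alpha = 1 - beta)
    abar_sqrt, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar_sqrt.append(math.sqrt(prod))
    a0, a_last = abar_sqrt[0], abar_sqrt[-1]
    # shift so the last value is 0, then scale so the first value is unchanged
    shifted = [(a - a_last) * a0 / (a0 - a_last) for a in abar_sqrt]
    abar = [a * a for a in shifted]
    # recover per-step alphas from the cumulative product, then betas
    alphas = [abar[0]] + [abar[i] / abar[i - 1] for i in range(1, len(abar))]
    return [1.0 - a for a in alphas]

effective_batch_size = 15 * 10  # batch size x gradient accumulation steps
betas = enforce_zero_terminal_snr(scaled_linear_betas())
```

After rescaling, the last beta equals 1, meaning the final timestep is pure noise. The unmodified schedule leaves a small amount of signal at that step.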
Training goals
- explore the effects of terminal SNR scheduling
- improve faces, especially "at a distance"
- improve composition, e.g. completeness of the resulting image
- improve prompt comprehension, e.g. "do what I want, even if it is weird"
- retain / introduce a slightly colourful flavour due to the midjourney data
- enhance understanding of the past, through the Cushman collection
- retain the ability to produce natural landscapes and animals via National Geographic
Observations
- at 1650 steps, we still haven't cracked the code on faces.
- at 250 steps, we had amazing photoreal Mars landscapes, which have mostly carried forward to 1650 steps
- lighting and composition are at their best
Future work
This model inspired the search for a solution to the proliferation issue. That search led me to ttj/flex-diffusion-2-1, which in turn led to the creation of ptx0/pseudo-flex-base, another photoreal model with support for multiple aspect ratios.
This model was trained purely on 768x768 square images, which were randomly resized and cropped. It can produce some higher resolution landscapes, but it cannot reliably do higher resolution subjects without deformities.
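
The exact augmentation used is not documented here, but a random resize followed by a 768x768 crop could look roughly like the sketch below. The function name, the `max_scale` range, and the choice to scale by the shorter side are all assumptions made for illustration. It only computes the geometry; with Pillow, the returned size and box could be passed to `Image.resize` and `Image.crop`.

```python
import random

def random_resize_crop_box(width, height, target=768, max_scale=1.5, rng=random):
    """Pick a random resize and a random target x target crop box.

    Returns ((new_w, new_h), (left, top, right, bottom)).
    max_scale is a hypothetical parameter, not a documented training setting.
    """
    # Resize so the shorter side lands between target and target * max_scale.
    new_short = rng.uniform(target, target * max_scale)
    scale = new_short / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    # Choose the top-left corner so the crop stays inside the resized image.
    left = rng.randint(0, new_w - target)
    top = rng.randint(0, new_h - target)
    return (new_w, new_h), (left, top, left + target, top + target)

size, box = random_resize_crop_box(1024, 768, rng=random.Random(0))
```

Because every training sample ends up square, the model never sees wide or tall framings during training, which is consistent with the deformities seen on higher resolution subjects.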