metadata

license: creativeml-openrail-m
tags:
  - computer vision
  - stable-diffusion
  - stable-diffusion-2-1
  - photography
  - photoreal

Capabilities

This model can produce photorealistic images of people.

It retains much of the knowledge of the base Stable Diffusion 2.1-v model, because its text encoder was only minimally tuned.

Limitations

This model does not produce perfect results every time.

This model cannot reproduce most real people. Instead, it makes "Derp-a-Like" equivalents to real people, which I prefer.

This model is not great at abstract imagery or digital art, though it certainly can produce a variety of amazing art styles.

Dataset

Cushman (8,000 Kodachrome slides from 1939 to 1969)
Midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
National Geographic (about 3,000-4,000 images larger than 1024x768 of animals, wildlife, landscapes, history)
a small dataset of stock images of people vaping / smoking

Training parameters

polynomial learning rate scheduler, shared between the text encoder (TE) and the U-Net, starting at 4e-8 and decaying to 1e-8
batch size 15 with 10 gradient accumulation steps => effective batch size of 150
target is 30,000 steps, but training will likely stop sooner
terminal-SNR-enforced betas
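
As a rough illustration of the schedule and batch arithmetic listed above, here is a minimal Python sketch. The start and end learning rates, the 30,000-step target, the batch size and the accumulation count come from the list. The polynomial power, the absence of warmup, and the function names are assumptions made for illustration, since the card does not state them.

```python
# Values taken from the training parameters listed above.
LR_START = 4e-8
LR_END = 1e-8
TOTAL_STEPS = 30_000
BATCH_SIZE = 15
GRAD_ACCUM_STEPS = 10

# Assumption: the card does not give the polynomial power.
# A power of 1.0 makes the decay linear; larger values decay faster early on.
POWER = 1.0


def polynomial_lr(step, lr_start=LR_START, lr_end=LR_END,
                  total_steps=TOTAL_STEPS, power=POWER):
    """Learning rate at a given step, decaying from lr_start to lr_end.

    Assumes no warmup phase (hypothetical; not stated in the card).
    """
    if step >= total_steps:
        return lr_end
    remaining = 1.0 - step / total_steps
    return (lr_start - lr_end) * remaining ** power + lr_end


def effective_batch_size(batch_size=BATCH_SIZE,
                         grad_accum_steps=GRAD_ACCUM_STEPS):
    """Samples contributing to each optimizer update."""
    return batch_size * grad_accum_steps


if __name__ == "__main__":
    print(effective_batch_size())  # 150
    for step in (0, 1650, 15_000, 30_000):
        print(step, polynomial_lr(step))
```

Because training was expected to stop before the 30,000-step target, the learning rate actually reached at the final checkpoint would sit somewhere between the two endpoints rather than at 1e-8.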

Training goals

explore the effects of terminal SNR scheduling
improve faces, especially "at a distance"
improve composition, e.g. completeness of resulting image
improve prompt comprehension, e.g. "do what I want, even if it is weird"
retain / introduce a slightly colourful flavour due to the midjourney data
enhance understanding of the past, through the Cushman collection
retain the ability to produce natural landscapes and animals via National Geographic
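
The card does not say how the terminal SNR betas were produced. One widely used approach is the rescaling proposed by Lin et al. in "Common Diffusion Noise Schedules and Sample Steps are Flawed", which shifts and scales the schedule so that the final timestep carries no signal. The sketch below follows that approach in plain Python. The base schedule (scaled-linear, 1000 steps, betas from 0.00085 to 0.012) is an assumption based on the usual Stable Diffusion defaults, and all function names are illustrative.

```python
import math


def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed base schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2
            for i in range(num_steps)]


def enforce_zero_terminal_snr(betas):
    """Rescale betas so the last timestep has zero signal-to-noise ratio."""
    # sqrt of the cumulative product of alphas (alpha = 1 - beta).
    alphas_bar_sqrt = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas_bar_sqrt.append(math.sqrt(running))

    first, last = alphas_bar_sqrt[0], alphas_bar_sqrt[-1]
    # Shift so the last value becomes zero, then scale so the first is kept.
    scale = first / (first - last)
    rescaled = [(value - last) * scale for value in alphas_bar_sqrt]

    # Convert the rescaled cumulative products back into per-step betas.
    alphas_bar = [value ** 2 for value in rescaled]
    alphas = [alphas_bar[0]] + [
        alphas_bar[i] / alphas_bar[i - 1] for i in range(1, len(alphas_bar))
    ]
    return [1.0 - alpha for alpha in alphas]


def terminal_snr(betas):
    """SNR at the final timestep: alpha_bar / (1 - alpha_bar)."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return alpha_bar / (1.0 - alpha_bar)


if __name__ == "__main__":
    base = scaled_linear_betas()
    fixed = enforce_zero_terminal_snr(base)
    print("terminal SNR before:", terminal_snr(base))
    print("terminal SNR after: ", terminal_snr(fixed))
```

With the assumed base schedule, the final timestep still contains a small amount of signal before rescaling and none after it, which is the property the first training goal sets out to explore.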

Observations

at 1650 steps, we still haven't cracked the code on faces
at 250 steps, we had amazing photoreal Mars landscapes that have mostly carried forward to 1650 steps
lighting and composition are at their best

Future work

This model inspired the search for a solution to the proliferation issue. That search led me to ttj/flex-diffusion-2-1, which in turn led to the creation of ptx0/pseudo-flex-base, another photoreal model with support for multiple aspect ratios.

This model was trained purely on 768x768 square images, which were randomly resized and cropped. It can produce some higher-resolution landscapes, but it cannot reliably render higher-resolution subjects without deformities.
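
The random resize and crop step described above could look roughly like the following sketch, which only computes the geometry for one 768x768 training crop. The 768 target comes from the card. The `max_scale` range and the function name are hypothetical, since the card does not describe how the resize factor was chosen.

```python
import random

TARGET = 768  # square training resolution stated in the card


def random_resize_and_crop(width, height, max_scale=1.5, rng=random):
    """Return (new_width, new_height, crop_box) for one square crop.

    max_scale is a hypothetical knob, not a value from the model card.
    crop_box is (left, top, right, bottom) in the resized image.
    """
    # Smallest scale that keeps the shorter side at least TARGET pixels.
    min_scale = TARGET / min(width, height)
    scale = min_scale * rng.uniform(1.0, max_scale)

    # Guard against rounding leaving a side one pixel short.
    new_width = max(TARGET, round(width * scale))
    new_height = max(TARGET, round(height * scale))

    # Pick a random top-left corner so the crop stays inside the image.
    left = rng.randint(0, new_width - TARGET)
    top = rng.randint(0, new_height - TARGET)
    return new_width, new_height, (left, top, left + TARGET, top + TARGET)


if __name__ == "__main__":
    print(random_resize_and_crop(1024, 768))
```

The returned box uses the (left, top, right, bottom) order, so it can be passed to an image library's crop call after resizing to the returned dimensions.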