PseudoTerminal X committed b8fe228 (1 parent: 8847ac0): Add a model card

README.md
---
license: creativeml-openrail-m
tags:
- computer vision
- stable-diffusion
- stable-diffusion-2-1
- photography
- photoreal
---
# Capabilities

This model is capable of producing photorealistic images of people.

It retains much of the base 2.1-v model's knowledge, as its text encoder is minimally tuned.

# Limitations

This model does not produce perfect results every time.

This model cannot reproduce most real people. Instead, it makes "Derp-a-Like" equivalents of real people, which I prefer.

This model is not great at abstract imagery or digital art, though it can certainly produce a variety of impressive art styles.
# Dataset

* Cushman (8,000 Kodachrome slides from 1939 to 1969)
* Midjourney v5.1-filtered (about 22,000 upscaled v5.1 images)
* National Geographic (about 3,000-4,000 images larger than 1024x768: animals, wildlife, landscapes, history)
* a small dataset of stock images of people vaping / smoking
# Training parameters

* polynomial learning rate scheduler, shared between the text encoder and U-Net, starting at 4e-8 and decaying to 1e-8
* batch size 15 with 10 gradient accumulation steps, for an effective batch size of 150
* target of 30,000 steps, though training will likely stop sooner
* betas adjusted to enforce terminal SNR
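The schedule and batch arithmetic above can be sketched in a few lines of Python. This is a hypothetical illustration rather than the actual training code: the polynomial power (assumed 1.0 here, i.e. linear decay) is not stated in the card, and `MAX_STEPS` uses the stated 30,000-step target.

```python
# Hypothetical sketch of the card's learning rate schedule; the polynomial
# power is an assumption (1.0), as the card does not state it.
LR_START = 4e-8     # shared TE/U-Net starting learning rate
LR_END = 1e-8       # floor the schedule decays to
MAX_STEPS = 30_000  # the stated step target

def polynomial_lr(step: int, power: float = 1.0) -> float:
    """Polynomial decay from LR_START to LR_END over MAX_STEPS, then hold."""
    if step >= MAX_STEPS:
        return LR_END
    remaining = (1.0 - step / MAX_STEPS) ** power
    return LR_END + (LR_START - LR_END) * remaining

# Effective batch size = per-step batch size x gradient accumulation steps.
BATCH_SIZE = 15
GRAD_ACCUM = 10
EFFECTIVE_BS = BATCH_SIZE * GRAD_ACCUM  # 150
```

Gradient accumulation trades wall-clock time for a larger effective batch: optimizer updates happen only after 10 micro-batches of 15, so the gradient statistics match a batch of 150 without the memory cost.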
# Training goals

* explore the effects of terminal SNR scheduling
* improve faces, especially "at a distance"
* improve composition, e.g. completeness of the resulting image
* improve prompt comprehension, e.g. "do what I want, even if it is weird"
* retain / introduce a slightly colourful flavour from the Midjourney data
* enhance understanding of the past through the Cushman collection
* retain the ability to produce natural landscapes and animals via the National Geographic data
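"Terminal SNR" scheduling commonly refers to the zero-terminal-SNR beta rescaling from "Common Diffusion Noise Schedules and Sample Steps Are Flawed" (Lin et al.). The card does not show its implementation, so the sketch below is the commonly published rescaling, written in plain Python as an assumption about what was used: it shifts and scales the square-rooted cumulative alphas so the final timestep reaches zero SNR, then converts back to betas.

```python
import math

def rescale_zero_terminal_snr(betas: list[float]) -> list[float]:
    """Rescale a beta schedule so the final timestep has zero SNR.

    Sketch of the commonly used zero-terminal-SNR rescaling; the card
    does not confirm this exact method was used.
    """
    # sqrt of the cumulative product of alphas (sqrt(alpha_bar)).
    abar_sqrt = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        abar_sqrt.append(math.sqrt(prod))

    s0, sT = abar_sqrt[0], abar_sqrt[-1]
    # Shift so the last value hits exactly 0, then rescale so the
    # first value is unchanged.
    abar_sqrt = [(s - sT) * s0 / (s0 - sT) for s in abar_sqrt]

    # Convert the rescaled alpha_bar back into betas.
    abar = [s * s for s in abar_sqrt]
    new_betas = [1.0 - abar[0]]
    for i in range(1, len(abar)):
        new_betas.append(1.0 - abar[i] / abar[i - 1])
    return new_betas
```

With the rescaled schedule, the final beta becomes 1.0, so the model is trained on pure noise at the last timestep rather than noise with residual signal leaking through.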
# Observations

* at 1650 steps, we still haven't cracked the code on faces
* at 250 steps, we had amazing photoreal Mars landscapes, which have mostly carried forward to 1650 steps
* lighting and composition are at their best
# Future work

This model inspired the search for a solution to the proliferation issue, which led me to ttj/flex-diffusion-2-1 and, in turn, to the creation of ptx0/pseudo-flex-base, another photoreal model with multiple-aspect-ratio support.

This model was trained **purely** on 768x768 square images, which were randomly resized and cropped. It can produce some higher-resolution landscapes, but it cannot reliably render higher-resolution subjects without deformities.
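The "randomly resized and cropped" step can be illustrated with a small geometry sketch. This is a hypothetical helper: the card does not name the augmentation library or its exact parameters, only the 768x768 output size.

```python
import random

TARGET = 768  # the card's fixed square training resolution

def random_resize_and_crop(width: int, height: int, rng: random.Random):
    """Resize so the shorter side equals TARGET, then pick a random
    TARGET x TARGET crop box.

    Hypothetical helper illustrating the described preprocessing.
    Returns ((new_w, new_h), (left, top, right, bottom)).
    """
    scale = TARGET / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = rng.randint(0, new_w - TARGET)
    top = rng.randint(0, new_h - TARGET)
    return (new_w, new_h), (left, top, left + TARGET, top + TARGET)
```

Because every training sample is a fixed 768x768 crop, the model never sees other aspect ratios during this run, which is consistent with the card's note that higher-resolution subjects come out deformed.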