happyme531 committed on
Commit
1df274b
1 Parent(s): 0d329b2

Upload 38 files

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model/text_encoder/model.rknn filter=lfs diff=lfs merge=lfs -text
+ model/unet/model.onnx_data filter=lfs diff=lfs merge=lfs -text
+ model/unet/model.rknn filter=lfs diff=lfs merge=lfs -text
+ model/vae_decoder/model.rknn filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,122 @@
- ---
- license: agpl-3.0
- ---
+ # Stable Diffusion 1.5 Latent Consistency Model for RKNN2
+
+ ## (English README see below)
+
+ Run the Stable Diffusion 1.5 LCM image generation model with RKNPU2!
+
+ - Inference speed (RK3588): single NPU core, 384x384 resolution, 4 iterations, about 13.8 seconds on average per image
+ - Memory usage: about 5.2GB
+
+ ## Usage
+
+ ### 1. Clone or download this repository
+
+ ### 2. Install dependencies
+
+ ```bash
+ pip install diffusers pillow "numpy<2"
+ ```
+
+ You also need to install rknn-toolkit2-lite2.
+
+ ### 3. Run
+
+ ```bash
+ python ./run_rknn-lcm.py -i ./model -o ./images --num-inference-steps 4 -s 384x384 --prompt "Majestic mountain landscape with snow-capped peaks, autumn foliage in vibrant reds and oranges, a turquoise river winding through a valley, crisp and serene atmosphere, ultra-realistic style."
+ ```
+
+ ## Model Conversion
+
+ ### 1. Download the model
+
+ Download a Stable Diffusion 1.5 LCM model in ONNX format and place it in the `./model` directory.
+
+ ```bash
+ huggingface-cli download TheyCallMeHex/LCM-Dreamshaper-V7-ONNX
+ cp -r -L ~/.cache/huggingface/hub/models--TheyCallMeHex--LCM-Dreamshaper-V7-ONNX/snapshots/4029a217f9cdc0437f395738d3ab686bb910ceea ./model
+ ```
+
+ In theory, you could also merge the LCM LoRA into a regular Stable Diffusion 1.5 model and convert the result to ONNX to get LCM inference. I don't know how to do that myself; if you do, please open a PR.
+
+ ### 2. Convert the model
+
+ ```bash
+ # Convert the model at 384x384 resolution
+ python ./convert-onnx-to-rknn.py -m ./model -r 384x384
+ ```
+
+ Note that the higher the resolution, the larger the model and the longer the conversion takes. Very high resolutions are not recommended.
+
+ ## Known Issues
+
+ 1. As of now, models converted with the latest rknn-toolkit2 (version 2.2.0) still suffer from severe precision loss, even with the fp16 data type. In the comparison image, the top is the ONNX inference result and the bottom is the RKNN inference result, with all parameters identical; the higher the resolution, the worse the loss. This is a bug in rknn-toolkit2.
+
+ 2. The conversion script can in principle accept multiple resolutions (e.g., "384x384,256x256"), but doing so makes the conversion fail. This is also a bug in rknn-toolkit2.
+
+ ## References
+
+ - [TheyCallMeHex/LCM-Dreamshaper-V7-ONNX](https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX)
+ - [Optimum's LatentConsistencyPipeline](https://github.com/huggingface/optimum/blob/main/optimum/pipelines/diffusers/pipeline_latent_consistency.py)
+ - [happyme531/RK3588-stable-diffusion-GPU](https://github.com/happyme531/RK3588-stable-diffusion-GPU)
+
+ ## English README
+
+ # Stable Diffusion 1.5 Latent Consistency Model for RKNN2
+
+ Run the Stable Diffusion 1.5 LCM image generation model using RKNPU2!
+
+ - Inference speed (RK3588): single NPU core, 384x384 resolution, 4 iterations, about 13.8 seconds on average per image
+ - Memory usage: about 5.2GB
+
+ ## Usage
+
+ ### 1. Clone or download this repository to your local machine
+
+ ### 2. Install dependencies
+
+ ```bash
+ pip install diffusers pillow "numpy<2"
+ ```
+
+ You also need to install rknn-toolkit2-lite2.
+
+ ### 3. Run
+
+ ```bash
+ python ./run_rknn-lcm.py -i ./model -o ./images --num-inference-steps 4 -s 384x384 --prompt "Majestic mountain landscape with snow-capped peaks, autumn foliage in vibrant reds and oranges, a turquoise river winding through a valley, crisp and serene atmosphere, ultra-realistic style."
+ ```
+
+ ## Model Conversion
+
+ ### 1. Download the model
+
+ Download a Stable Diffusion 1.5 LCM model in ONNX format and place it in the `./model` directory.
+
+ ```bash
+ huggingface-cli download TheyCallMeHex/LCM-Dreamshaper-V7-ONNX
+ cp -r -L ~/.cache/huggingface/hub/models--TheyCallMeHex--LCM-Dreamshaper-V7-ONNX/snapshots/4029a217f9cdc0437f395738d3ab686bb910ceea ./model
+ ```
+
+ In theory, you could also achieve LCM inference by merging the LCM LoRA into a regular Stable Diffusion 1.5 model and then converting it to ONNX. I have not done this myself; if you know how, please submit a PR.
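+
+ For reference, here is an untested sketch of how that merge might look with current `diffusers` and `optimum` tooling. The `latent-consistency/lcm-lora-sdv1-5` LoRA id, the fused output directory name, and the `optimum-cli` invocation are assumptions, not something verified in this repository:
+
+ ```python
+ # Untested sketch: fuse the LCM LoRA into a plain SD 1.5 checkpoint, then export to ONNX.
+ from diffusers import StableDiffusionPipeline, LCMScheduler
+
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # assumed LoRA repo id
+ pipe.fuse_lora()                        # bake the LoRA weights into the base model
+ pipe.save_pretrained("./sd15-lcm-fused")
+
+ # Then export the fused pipeline to ONNX, e.g. with optimum:
+ #   optimum-cli export onnx --model ./sd15-lcm-fused ./model
+ ```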
+
+ ### 2. Convert the model
+
+ ```bash
+ # Convert the model, 384x384 resolution
+ python ./convert-onnx-to-rknn.py -m ./model -r 384x384
+ ```
+
+ Note that the higher the resolution, the larger the model and the longer the conversion time. It's not recommended to use very high resolutions.
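+
+ For context, the per-submodel conversion presumably boils down to the standard rknn-toolkit2 flow. The sketch below is illustrative only (it is not the repo's `convert-onnx-to-rknn.py`), and the UNet input names and shapes are assumptions based on the Optimum-style ONNX export at 384x384:
+
+ ```python
+ # Illustrative sketch: convert one ONNX submodel (the UNet) to RKNN for RK3588, fp16, no quantization.
+ from rknn.api import RKNN
+
+ rknn = RKNN(verbose=True)
+ rknn.config(target_platform="rk3588")
+ # 384x384 images -> 48x48 latents (the VAE downsamples by 8).
+ rknn.load_onnx(
+     model="./model/unet/model.onnx",
+     inputs=["sample", "timestep", "encoder_hidden_states", "timestep_cond"],
+     input_size_list=[[1, 4, 48, 48], [1], [1, 77, 768], [1, 256]],
+ )
+ rknn.build(do_quantization=False)  # keep floating point (fp16 on the NPU)
+ rknn.export_rknn("./model/unet/model.rknn")
+ rknn.release()
+ ```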
+
+ ## Known Issues
+
+ 1. As of now, models converted using the latest version of rknn-toolkit2 (version 2.2.0) still suffer from severe precision loss, even when using the fp16 data type. In the comparison image, the top is the result of ONNX inference and the bottom is the result of RKNN inference, with all parameters identical. The higher the resolution, the more severe the precision loss. This is a bug in rknn-toolkit2 (a sketch for measuring the gap follows this list).
+
+ 2. The model conversion script can in principle accept multiple resolutions (e.g., "384x384,256x256"), but doing so causes the conversion to fail. This is also a bug in rknn-toolkit2.
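+
+ The precision gap can be checked numerically by feeding identical inputs to the ONNX and RKNN UNet and comparing the outputs. A minimal sketch, assuming the input names and shapes of the 384x384 conversion above:
+
+ ```python
+ # Minimal precision check: same random inputs through onnxruntime and rknnlite, then cosine similarity.
+ import numpy as np
+ import onnxruntime as ort
+ from rknnlite.api import RKNNLite
+
+ sample = np.random.randn(1, 4, 48, 48).astype(np.float32)
+ timestep = np.array([999], dtype=np.int64)
+ hidden = np.random.randn(1, 77, 768).astype(np.float32)
+ tcond = np.random.randn(1, 256).astype(np.float32)
+
+ ref = ort.InferenceSession("./model/unet/model.onnx").run(
+     None, {"sample": sample, "timestep": timestep,
+            "encoder_hidden_states": hidden, "timestep_cond": tcond})[0]
+
+ lite = RKNNLite()
+ lite.load_rknn("./model/unet/model.rknn")
+ lite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO)
+ out = lite.inference(inputs=[sample, timestep, hidden, tcond], data_format="nchw")[0]
+
+ cos = np.dot(ref.ravel(), out.ravel()) / (np.linalg.norm(ref) * np.linalg.norm(out))
+ print("cosine similarity:", cos)  # values well below 1.0 indicate the precision loss described above
+ ```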
+
+ ## References
+
+ - [TheyCallMeHex/LCM-Dreamshaper-V7-ONNX](https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX)
+ - [Optimum's LatentConsistencyPipeline](https://github.com/huggingface/optimum/blob/main/optimum/pipelines/diffusers/pipeline_latent_consistency.py)
+ - [happyme531/RK3588-stable-diffusion-GPU](https://github.com/happyme531/RK3588-stable-diffusion-GPU)
model/.gitattributes ADDED
@@ -0,0 +1,36 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.onnx_data filter=lfs diff=lfs merge=lfs -text
model/Assets/Icon.png ADDED
model/Assets/LCM-Dreamshaper-V7-ONNX.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "Name": "Dreamshaper v7(LCM)",
3
+ "Description": "DreamShaper started as a model to have an alternative to MidJourney in the open source world.",
4
+ "Author": "TheyCallMeHex",
5
+ "Repository": "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX",
6
+ "ImageIcon": "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Icon.png",
7
+ "Status": "Active",
8
+ "PadTokenId": 49407,
9
+ "BlankTokenId": 49407,
10
+ "TokenizerLimit": 77,
11
+ "EmbeddingsLength": 768,
12
+ "ScaleFactor": 0.18215,
13
+ "PipelineType": "LatentConsistency",
14
+ "Diffusers": [
15
+ "TextToImage",
16
+ "ImageToImage",
17
+ "ImageInpaintLegacy"
18
+ ],
19
+ "ModelFiles": [
20
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/tokenizer/model.onnx",
21
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/unet/model.onnx",
22
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/unet/model.onnx_data",
23
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/text_encoder/model.onnx",
24
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/vae_decoder/model.onnx",
25
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/vae_encoder/model.onnx"
26
+ ],
27
+ "Images": [
28
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview1.png",
29
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview2.png",
30
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview3.png",
31
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview4.png",
32
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview5.png",
33
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview6.png"
34
+ ]
35
+ }
model/Assets/OnnxStack - 640x320.png ADDED
model/Assets/Preview1.png ADDED
model/Assets/Preview2.png ADDED
model/Assets/Preview3.png ADDED
model/Assets/Preview4.png ADDED
model/Assets/Preview5.png ADDED
model/Assets/Preview6.png ADDED
model/Assets/lcm_angel_30_7.5_2092464983.png ADDED
model/Assets/lcm_car_30_7.5_2092464983.png ADDED
model/Assets/lcm_demon_30_7.5_2092464983.png ADDED
model/Assets/lcm_ninja_30_7.5_2092464983.png ADDED
model/README.md ADDED
@@ -0,0 +1,56 @@
1
+ ---
2
+
3
+ language:
4
+ - en
5
+ license: mit
6
+ tags:
7
+ - stable-diffusion
8
+ - stable-diffusion-diffusers
9
+ - text-to-image
10
+ - diffusers
11
+ inference: true
12
+ ---
13
+
14
+ <p align="center" width="100%">
15
+ <img width="80%" src="Assets/OnnxStack - 640x320.png">
16
+ </p>
17
+
18
+ ### OnnxStack
19
+ This model has been converted to ONNX and tested with OnnxStack
20
+
21
+ - [OnnxStack](https://github.com/saddam213/OnnxStack)
22
+
23
+ ### LCM Dreamshaper V7 Diffusion
24
+ This model was converted to ONNX from LCM Dreamshaper V7
25
+
26
+ - [LCM-Dreamshaper-V7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7)
27
+
28
+ ### Sample Images
29
+ *A demon*
30
+
31
+ <img src="Assets/lcm_demon_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
32
+
33
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
34
+
35
+ __________________________
36
+ *An angel*
37
+
38
+ <img src="Assets/lcm_angel_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
39
+
40
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
41
+
42
+ __________________________
43
+ *A ninja*
44
+
45
+ <img src="Assets/lcm_ninja_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
46
+
47
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
48
+
49
+ __________________________
50
+ *a japanese domestic market sports car sitting in a showroom*
51
+
52
+ <img src="Assets/lcm_car_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
53
+
54
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
55
+
56
+ __________________________
model/feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "crop_size": {
3
+ "height": 224,
4
+ "width": 224
5
+ },
6
+ "do_center_crop": true,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "feature_extractor_type": "CLIPFeatureExtractor",
12
+ "image_mean": [
13
+ 0.48145466,
14
+ 0.4578275,
15
+ 0.40821073
16
+ ],
17
+ "image_processor_type": "CLIPImageProcessor",
18
+ "image_std": [
19
+ 0.26862954,
20
+ 0.26130258,
21
+ 0.27577711
22
+ ],
23
+ "resample": 3,
24
+ "rescale_factor": 0.00392156862745098,
25
+ "size": {
26
+ "shortest_edge": 224
27
+ }
28
+ }
model/model_index.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "_class_name": "StableDiffusionPipeline",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7",
5
+ "feature_extractor": [
6
+ "transformers",
7
+ "CLIPImageProcessor"
8
+ ],
9
+ "requires_safety_checker": true,
10
+ "safety_checker": [
11
+ "stable_diffusion",
12
+ "StableDiffusionSafetyChecker"
13
+ ],
14
+ "scheduler": [
15
+ "diffusers",
16
+ "LCMScheduler"
17
+ ],
18
+ "text_encoder": [
19
+ "transformers",
20
+ "CLIPTextModel"
21
+ ],
22
+ "tokenizer": [
23
+ "transformers",
24
+ "CLIPTokenizer"
25
+ ],
26
+ "unet": [
27
+ "diffusers",
28
+ "UNet2DConditionModel"
29
+ ],
30
+ "vae": [
31
+ "diffusers",
32
+ "AutoencoderKL"
33
+ ]
34
+ }
model/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "_class_name": "LCMScheduler",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "clip_sample_range": 1.0,
9
+ "dynamic_thresholding_ratio": 0.995,
10
+ "num_train_timesteps": 1000,
11
+ "original_inference_steps": 50,
12
+ "prediction_type": "epsilon",
13
+ "rescale_betas_zero_snr": false,
14
+ "sample_max_value": 1.0,
15
+ "set_alpha_to_one": true,
16
+ "steps_offset": 1,
17
+ "thresholding": false,
18
+ "timestep_spacing": "leading",
19
+ "trained_betas": null
20
+ }
model/text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "LCM_Dreamshaper_v7\\text_encoder",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "quick_gelu",
11
+ "hidden_size": 768,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 3072,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 12,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 768,
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.34.1",
24
+ "vocab_size": 49408
25
+ }
model/text_encoder/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fefe95eab6542e5fb7642b3f592489176836cc3fd49196b924a63760602c8c4a
3
+ size 492588002
model/text_encoder/model.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:36d87985d8b86a47ce8c6668fe3049c699f1633405e016c4f6c227a3dffaf638
3
+ size 249037093
model/tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model/tokenizer/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52af50d264d702c351484aabf62c64abe61f59d6a6d2c508a3e797e23dc1e008
3
+ size 1683168
model/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|endoftext|>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
model/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "49406": {
5
+ "content": "<|startoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "49407": {
13
+ "content": "<|endoftext|>",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ }
20
+ },
21
+ "additional_special_tokens": [],
22
+ "bos_token": "<|startoftext|>",
23
+ "clean_up_tokenization_spaces": true,
24
+ "do_lower_case": true,
25
+ "eos_token": "<|endoftext|>",
26
+ "errors": "replace",
27
+ "model_max_length": 77,
28
+ "pad_token": "<|endoftext|>",
29
+ "tokenizer_class": "CLIPTokenizer",
30
+ "unk_token": "<|endoftext|>"
31
+ }
model/tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model/unet/config.json ADDED
@@ -0,0 +1,68 @@
1
+ {
2
+ "_class_name": "UNet2DConditionModel",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7\\unet",
5
+ "act_fn": "silu",
6
+ "addition_embed_type": null,
7
+ "addition_embed_type_num_heads": 64,
8
+ "addition_time_embed_dim": null,
9
+ "attention_head_dim": 8,
10
+ "attention_type": "default",
11
+ "block_out_channels": [
12
+ 320,
13
+ 640,
14
+ 1280,
15
+ 1280
16
+ ],
17
+ "center_input_sample": false,
18
+ "class_embed_type": null,
19
+ "class_embeddings_concat": false,
20
+ "conv_in_kernel": 3,
21
+ "conv_out_kernel": 3,
22
+ "cross_attention_dim": 768,
23
+ "cross_attention_norm": null,
24
+ "down_block_types": [
25
+ "CrossAttnDownBlock2D",
26
+ "CrossAttnDownBlock2D",
27
+ "CrossAttnDownBlock2D",
28
+ "DownBlock2D"
29
+ ],
30
+ "downsample_padding": 1,
31
+ "dropout": 0.0,
32
+ "dual_cross_attention": false,
33
+ "encoder_hid_dim": null,
34
+ "encoder_hid_dim_type": null,
35
+ "flip_sin_to_cos": true,
36
+ "freq_shift": 0,
37
+ "in_channels": 4,
38
+ "layers_per_block": 2,
39
+ "mid_block_only_cross_attention": null,
40
+ "mid_block_scale_factor": 1,
41
+ "mid_block_type": "UNetMidBlock2DCrossAttn",
42
+ "norm_eps": 1e-05,
43
+ "norm_num_groups": 32,
44
+ "num_attention_heads": null,
45
+ "num_class_embeds": null,
46
+ "only_cross_attention": false,
47
+ "out_channels": 4,
48
+ "projection_class_embeddings_input_dim": null,
49
+ "resnet_out_scale_factor": 1.0,
50
+ "resnet_skip_time_act": false,
51
+ "resnet_time_scale_shift": "default",
52
+ "reverse_transformer_layers_per_block": null,
53
+ "sample_size": 96,
54
+ "time_cond_proj_dim": 256,
55
+ "time_embedding_act_fn": null,
56
+ "time_embedding_dim": null,
57
+ "time_embedding_type": "positional",
58
+ "timestep_post_act": null,
59
+ "transformer_layers_per_block": 1,
60
+ "up_block_types": [
61
+ "UpBlock2D",
62
+ "CrossAttnUpBlock2D",
63
+ "CrossAttnUpBlock2D",
64
+ "CrossAttnUpBlock2D"
65
+ ],
66
+ "upcast_attention": null,
67
+ "use_linear_projection": false
68
+ }
model/unet/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3e9a08a3e5b943046bf90a513c492cf4c6e31e26229062af8eb4ad2ddf172b5
3
+ size 1948508
model/unet/model.onnx_data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ef99ccc336de0e79f247fcb3d1398b3f3d1a02796916b88a351d7a83f570a31a
3
+ size 3438411520
model/unet/model.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:622e350654e2a2e90ff863f51aea1a30f4d148ed3bf8645f355a7c37df13420c
3
+ size 1758599217
model/vae_decoder/config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7\\vae",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "force_upcast": true,
19
+ "in_channels": 3,
20
+ "latent_channels": 4,
21
+ "layers_per_block": 2,
22
+ "norm_num_groups": 32,
23
+ "out_channels": 3,
24
+ "sample_size": 768,
25
+ "scaling_factor": 0.18215,
26
+ "up_block_types": [
27
+ "UpDecoderBlock2D",
28
+ "UpDecoderBlock2D",
29
+ "UpDecoderBlock2D",
30
+ "UpDecoderBlock2D"
31
+ ]
32
+ }
model/vae_decoder/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec5298d7bfa592d492d36b42d17f794fcdb9175e2aac366956d40f3f38d13ca1
3
+ size 198078038
model/vae_decoder/model.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0d6fc059da6ca88930ef04148f6a4311512cce547a663102cdfedc50f22b8c3
3
+ size 159530108
model/vae_encoder/config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7\\vae",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "force_upcast": true,
19
+ "in_channels": 3,
20
+ "latent_channels": 4,
21
+ "layers_per_block": 2,
22
+ "norm_num_groups": 32,
23
+ "out_channels": 3,
24
+ "sample_size": 768,
25
+ "scaling_factor": 0.18215,
26
+ "up_block_types": [
27
+ "UpDecoderBlock2D",
28
+ "UpDecoderBlock2D",
29
+ "UpDecoderBlock2D",
30
+ "UpDecoderBlock2D"
31
+ ]
32
+ }
model/vae_encoder/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:268d4398021d7bc91e91c94e4835cc5ffa471155db1b722d0a43f6d1a4f822fd
3
+ size 136760154
run_onnx-lcm.py ADDED
@@ -0,0 +1,665 @@
1
+
2
+ import argparse
3
+ import json
4
+ import time
5
+
6
+ import PIL
7
+ from diffusers import StableDiffusionPipeline
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
9
+ from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
10
+ from diffusers.schedulers import (
11
+ LCMScheduler
12
+ )
13
+ from diffusers.schedulers.scheduling_utils import SchedulerMixin
14
+
15
+ import gc
16
+ import inspect
17
+
18
+ import logging
19
+
20
+ logging.basicConfig()
21
+ logger = logging.getLogger(__name__)
22
+ logger.setLevel(logging.INFO)
23
+
24
+ import numpy as np
25
+ import os
26
+
27
+ import torch # Only used for `torch.from_numpy` in `pipe.scheduler.step()`
28
+ from transformers import CLIPFeatureExtractor, CLIPTokenizer
29
+ from typing import Callable, List, Optional, Union, Tuple
30
+ from PIL import Image
31
+
32
+ # from rknnlite.api import RKNNLite
33
+
34
+ # class RKNN2Model:
35
+ # """ Wrapper for running RKNPU2 models """
36
+
37
+ # def __init__(self, model_path):
38
+
39
+ # logger.info(f"Loading {model_path}")
40
+
41
+ # start = time.time()
42
+ # assert os.path.exists(model_path) and model_path.endswith(".rknn")
43
+ # self.rknnlite = RKNNLite()
44
+ # self.rknnlite.load_rknn(model_path)
45
+ # self.rknnlite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO) # Multi-core will cause kernel crash
46
+ # load_time = time.time() - start
47
+ # logger.info(f"Done. Took {load_time:.1f} seconds.")
48
+ # self.modelname = model_path.split("/")[-1]
49
+ # self.inference_time = 0
50
+
51
+ # def __call__(self, **kwargs) -> List[np.ndarray]:
52
+ # np.savez(f"{self.modelname}_input_{self.inference_time}.npz", **kwargs)
53
+ # #print(kwargs)
54
+ # input_list = [value for key, value in kwargs.items()]
55
+ # for i, input in enumerate(input_list):
56
+ # if isinstance(input, np.ndarray):
57
+ # print(f"input {i} shape: {input.shape}")
58
+ # results = self.rknnlite.inference(inputs=input_list)
59
+ # for res in results:
60
+ # print(f"output shape: {res.shape}")
61
+ # return results
62
+
63
+ import onnxruntime as ort
64
+
65
+ class RKNN2Model:
66
+ """ Wrapper for running ONNX models """
67
+
68
+ def __init__(self, model_dir):
69
+ logger.info(f"Loading {model_dir}")
70
+ start = time.time()
71
+ self.config = json.load(open(os.path.join(model_dir, "config.json")))
72
+ assert os.path.exists(model_dir) and os.path.exists(os.path.join(model_dir, "model.onnx"))
73
+ self.session = ort.InferenceSession(os.path.join(model_dir, "model.onnx"))
74
+ load_time = time.time() - start
75
+ logger.info(f"Done. Took {load_time:.1f} seconds.")
76
+ self.modelname = model_dir.split("/")[-1]
77
+ self.inference_time = 0
78
+
79
+ def __call__(self, **kwargs) -> List[np.ndarray]:
80
+ # np.savez(f"onnx_out/{self.modelname}_input_{self.inference_time}.npz", **kwargs)
81
+ self.inference_time += 1
82
+ results = self.session.run(None, kwargs)
83
+ results_list = []
84
+ for res in results:
85
+ results_list.append(res)
86
+ return results
87
+
88
+ class RKNN2StableDiffusionPipeline(DiffusionPipeline):
89
+ """ RKNN2 version of
90
+ `diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline`
91
+ """
92
+
93
+ def __init__(
94
+ self,
95
+ text_encoder: RKNN2Model,
96
+ unet: RKNN2Model,
97
+ vae_decoder: RKNN2Model,
98
+ scheduler: LCMScheduler,
99
+ tokenizer: CLIPTokenizer,
100
+ force_zeros_for_empty_prompt: Optional[bool] = True,
101
+ feature_extractor: Optional[CLIPFeatureExtractor] = None,
102
+ text_encoder_2: Optional[RKNN2Model] = None,
103
+ tokenizer_2: Optional[CLIPTokenizer] = None
104
+
105
+ ):
106
+ super().__init__()
107
+
108
+ # Register non-Core ML components of the pipeline similar to the original pipeline
109
+ self.register_modules(
110
+ tokenizer=tokenizer,
111
+ scheduler=scheduler,
112
+ feature_extractor=feature_extractor,
113
+ )
114
+ self.force_zeros_for_empty_prompt = force_zeros_for_empty_prompt
115
+ self.safety_checker = None
116
+
117
+ # Register Core ML components of the pipeline
118
+ self.text_encoder = text_encoder
119
+ self.text_encoder_2 = text_encoder_2
120
+ self.tokenizer_2 = tokenizer_2
121
+ self.unet = unet
122
+ self.vae_decoder = vae_decoder
123
+
124
+ VAE_DECODER_UPSAMPLE_FACTOR = 8
125
+
126
+ # In PyTorch, users can determine the tensor shapes dynamically by default
127
+ # In CoreML, tensors have static shapes unless flexible shapes were used during export
128
+ # See https://coremltools.readme.io/docs/flexible-inputs
129
+ latent_h, latent_w = 32, 32 # hallo1: FIXME: hardcoded value
130
+ self.height = latent_h * VAE_DECODER_UPSAMPLE_FACTOR
131
+ self.width = latent_w * VAE_DECODER_UPSAMPLE_FACTOR
132
+ self.vae_scale_factor = VAE_DECODER_UPSAMPLE_FACTOR
133
+ logger.info(
134
+ f"Stable Diffusion configured to generate {self.height}x{self.width} images"
135
+ )
136
+
137
+ @staticmethod
138
+ def postprocess(
139
+ image: np.ndarray,
140
+ output_type: str = "pil",
141
+ do_denormalize: Optional[List[bool]] = None,
142
+ ):
143
+ def numpy_to_pil(images: np.ndarray):
144
+ """
145
+ Convert a numpy image or a batch of images to a PIL image.
146
+ """
147
+ if images.ndim == 3:
148
+ images = images[None, ...]
149
+ images = (images * 255).round().astype("uint8")
150
+ if images.shape[-1] == 1:
151
+ # special case for grayscale (single channel) images
152
+ pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
153
+ else:
154
+ pil_images = [Image.fromarray(image) for image in images]
155
+
156
+ return pil_images
157
+
158
+ def denormalize(images: np.ndarray):
159
+ """
160
+ Denormalize an image array to [0,1].
161
+ """
162
+ return np.clip(images / 2 + 0.5, 0, 1)
163
+
164
+ if not isinstance(image, np.ndarray):
165
+ raise ValueError(
166
+ f"Input for postprocessing is in incorrect format: {type(image)}. We only support np array"
167
+ )
168
+ if output_type not in ["latent", "np", "pil"]:
169
+ deprecation_message = (
170
+ f"the output_type {output_type} is outdated and has been set to `np`. Please make sure to set it to one of these instead: "
171
+ "`pil`, `np`, `pt`, `latent`"
172
+ )
173
+ logger.warning(deprecation_message)
174
+ output_type = "np"
175
+
176
+ if output_type == "latent":
177
+ return image
178
+
179
+ if do_denormalize is None:
180
+ raise ValueError("do_denormalize is required for postprocessing")
181
+
182
+ image = np.stack(
183
+ [denormalize(image[i]) if do_denormalize[i] else image[i] for i in range(image.shape[0])], axis=0
184
+ )
185
+ image = image.transpose((0, 2, 3, 1))
186
+
187
+ if output_type == "pil":
188
+ image = numpy_to_pil(image)
189
+
190
+ return image
191
+
192
+ def _encode_prompt(
193
+ self,
194
+ prompt: Union[str, List[str]],
195
+ num_images_per_prompt: int,
196
+ do_classifier_free_guidance: bool,
197
+ negative_prompt: Optional[Union[str, list]],
198
+ prompt_embeds: Optional[np.ndarray] = None,
199
+ negative_prompt_embeds: Optional[np.ndarray] = None,
200
+ ):
201
+ r"""
202
+ Encodes the prompt into text encoder hidden states.
203
+
204
+ Args:
205
+ prompt (`Union[str, List[str]]`):
206
+ prompt to be encoded
207
+ num_images_per_prompt (`int`):
208
+ number of images that should be generated per prompt
209
+ do_classifier_free_guidance (`bool`):
210
+ whether to use classifier free guidance or not
211
+ negative_prompt (`Optional[Union[str, list]]`):
212
+ The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
213
+ if `guidance_scale` is less than `1`).
214
+ prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
215
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
216
+ provided, text embeddings will be generated from `prompt` input argument.
217
+ negative_prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
218
+ Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
219
+ weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
220
+ argument.
221
+ """
222
+ if isinstance(prompt, str):
223
+ batch_size = 1
224
+ elif isinstance(prompt, list):
225
+ batch_size = len(prompt)
226
+ else:
227
+ batch_size = prompt_embeds.shape[0]
228
+
229
+ if prompt_embeds is None:
230
+ # get prompt text embeddings
231
+ text_inputs = self.tokenizer(
232
+ prompt,
233
+ padding="max_length",
234
+ max_length=self.tokenizer.model_max_length,
235
+ truncation=True,
236
+ return_tensors="np",
237
+ )
238
+ text_input_ids = text_inputs.input_ids
239
+ untruncated_ids = self.tokenizer(prompt, padding="max_length", return_tensors="np").input_ids
240
+
241
+ if not np.array_equal(text_input_ids, untruncated_ids):
242
+ removed_text = self.tokenizer.batch_decode(
243
+ untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
244
+ )
245
+ logger.warning(
246
+ "The following part of your input was truncated because CLIP can only handle sequences up to"
247
+ f" {self.tokenizer.model_max_length} tokens: {removed_text}"
248
+ )
249
+
250
+ prompt_embeds = self.text_encoder(input_ids=text_input_ids.astype(np.int32))[0]
251
+
252
+ prompt_embeds = np.repeat(prompt_embeds, num_images_per_prompt, axis=0)
253
+
254
+ # get unconditional embeddings for classifier free guidance
255
+ if do_classifier_free_guidance and negative_prompt_embeds is None:
256
+ uncond_tokens: List[str]
257
+ if negative_prompt is None:
258
+ uncond_tokens = [""] * batch_size
259
+ elif type(prompt) is not type(negative_prompt):
260
+ raise TypeError(
261
+ f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
262
+ f" {type(prompt)}."
263
+ )
264
+ elif isinstance(negative_prompt, str):
265
+ uncond_tokens = [negative_prompt] * batch_size
266
+ elif batch_size != len(negative_prompt):
267
+ raise ValueError(
268
+ f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
269
+ f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
270
+ " the batch size of `prompt`."
271
+ )
272
+ else:
273
+ uncond_tokens = negative_prompt
274
+
275
+ max_length = prompt_embeds.shape[1]
276
+ uncond_input = self.tokenizer(
277
+ uncond_tokens,
278
+ padding="max_length",
279
+ max_length=max_length,
280
+ truncation=True,
281
+ return_tensors="np",
282
+ )
283
+ negative_prompt_embeds = self.text_encoder(input_ids=uncond_input.input_ids.astype(np.int32))[0]
284
+
285
+ if do_classifier_free_guidance:
286
+ negative_prompt_embeds = np.repeat(negative_prompt_embeds, num_images_per_prompt, axis=0)
287
+
288
+ # For classifier free guidance, we need to do two forward passes.
289
+ # Here we concatenate the unconditional and text embeddings into a single batch
290
+ # to avoid doing two forward passes
291
+ prompt_embeds = np.concatenate([negative_prompt_embeds, prompt_embeds])
292
+
293
+ return prompt_embeds
294
+
295
+ # Copied from https://github.com/huggingface/diffusers/blob/v0.17.1/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion.py#L217
296
+ def check_inputs(
297
+ self,
298
+ prompt: Union[str, List[str]],
299
+ height: Optional[int],
300
+ width: Optional[int],
301
+ callback_steps: int,
302
+ negative_prompt: Optional[str] = None,
303
+ prompt_embeds: Optional[np.ndarray] = None,
304
+ negative_prompt_embeds: Optional[np.ndarray] = None,
305
+ ):
306
+ if height % 8 != 0 or width % 8 != 0:
307
+ raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
308
+
309
+ if (callback_steps is None) or (
310
+ callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
311
+ ):
312
+ raise ValueError(
313
+ f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
314
+ f" {type(callback_steps)}."
315
+ )
316
+
317
+ if prompt is not None and prompt_embeds is not None:
318
+ raise ValueError(
319
+ f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
320
+ " only forward one of the two."
321
+ )
322
+ elif prompt is None and prompt_embeds is None:
323
+ raise ValueError(
324
+ "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
325
+ )
326
+ elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
327
+ raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
328
+
329
+ if negative_prompt is not None and negative_prompt_embeds is not None:
330
+ raise ValueError(
331
+ f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
332
+ f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
333
+ )
334
+
335
+ if prompt_embeds is not None and negative_prompt_embeds is not None:
336
+ if prompt_embeds.shape != negative_prompt_embeds.shape:
337
+ raise ValueError(
338
+ "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
339
+ f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
340
+ f" {negative_prompt_embeds.shape}."
341
+ )
342
+
343
+ # Adapted from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
344
+ def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, generator, latents=None):
345
+ shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
346
+ if isinstance(generator, list) and len(generator) != batch_size:
347
+ raise ValueError(
348
+ f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
349
+ f" size of {batch_size}. Make sure the batch size matches the length of the generators."
350
+ )
351
+
352
+ if latents is None:
353
+ if isinstance(generator, np.random.RandomState):
354
+ latents = generator.randn(*shape).astype(dtype)
355
+ elif isinstance(generator, torch.Generator):
356
+ latents = torch.randn(*shape, generator=generator).numpy().astype(dtype)
357
+ else:
358
+ raise ValueError(
359
+ f"Expected `generator` to be of type `np.random.RandomState` or `torch.Generator`, but got"
360
+ f" {type(generator)}."
361
+ )
362
+ elif latents.shape != shape:
363
+ raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
364
+
365
+ # scale the initial noise by the standard deviation required by the scheduler
366
+ latents = latents * np.float64(self.scheduler.init_noise_sigma)
367
+
368
+ return latents
369
+
370
+ # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
371
+ def __call__(
372
+ self,
373
+ prompt: Union[str, List[str]] = "",
374
+ height: Optional[int] = None,
375
+ width: Optional[int] = None,
376
+ num_inference_steps: int = 4,
377
+ original_inference_steps: int = None,
378
+ guidance_scale: float = 8.5,
379
+ num_images_per_prompt: int = 1,
380
+ generator: Optional[Union[np.random.RandomState, torch.Generator]] = None,
381
+ latents: Optional[np.ndarray] = None,
382
+ prompt_embeds: Optional[np.ndarray] = None,
383
+ output_type: str = "pil",
384
+ return_dict: bool = True,
385
+ callback: Optional[Callable[[int, int, np.ndarray], None]] = None,
386
+ callback_steps: int = 1,
387
+ ):
388
+ r"""
389
+ Function invoked when calling the pipeline for generation.
390
+
391
+ Args:
392
+ prompt (`Optional[Union[str, List[str]]]`, defaults to None):
393
+ The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
394
+ instead.
395
+ height (`Optional[int]`, defaults to None):
396
+ The height in pixels of the generated image.
397
+ width (`Optional[int]`, defaults to None):
398
+ The width in pixels of the generated image.
399
+ num_inference_steps (`int`, defaults to 50):
400
+ The number of denoising steps. More denoising steps usually lead to a higher quality image at the
401
+ expense of slower inference.
402
+ guidance_scale (`float`, defaults to 7.5):
403
+ Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
404
+ `guidance_scale` is defined as `w` of equation 2. of [Imagen
405
+ Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
406
+ 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
407
+ usually at the expense of lower image quality.
408
+ num_images_per_prompt (`int`, defaults to 1):
409
+ The number of images to generate per prompt.
410
+ generator (`Optional[Union[np.random.RandomState, torch.Generator]]`, defaults to `None`):
411
+ A np.random.RandomState to make generation deterministic.
412
+ latents (`Optional[np.ndarray]`, defaults to `None`):
413
+ Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
414
+ generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
415
+ tensor will ge generated by sampling using the supplied random `generator`.
416
+ prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
417
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
418
+ provided, text embeddings will be generated from `prompt` input argument.
419
+ output_type (`str`, defaults to `"pil"`):
420
+ The output format of the generate image. Choose between
421
+ [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
422
+ return_dict (`bool`, defaults to `True`):
423
+ Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
424
+ plain tuple.
425
+ callback (Optional[Callable], defaults to `None`):
426
+ A function that will be called every `callback_steps` steps during inference. The function will be
427
+ called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
428
+ callback_steps (`int`, defaults to 1):
429
+ The frequency at which the `callback` function will be called. If not specified, the callback will be
430
+ called at every step.
431
+ guidance_rescale (`float`, defaults to 0.0):
432
+ Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
433
+ Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of
434
+ [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).
435
+ Guidance rescale factor should fix overexposure when using zero terminal SNR.
436
+
437
+ Returns:
438
+ [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
439
+ [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple.
440
+ When returning a tuple, the first element is a list with the generated images, and the second element is a
441
+ list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
442
+ (nsfw) content, according to the `safety_checker`.
443
+ """
444
+ height = height or self.unet.config["sample_size"] * self.vae_scale_factor
445
+ width = width or self.unet.config["sample_size"] * self.vae_scale_factor
446
+
447
+ # Don't need to get negative prompts due to LCM guided distillation
448
+ negative_prompt = None
449
+ negative_prompt_embeds = None
450
+
451
+ # check inputs. Raise error if not correct
452
+ self.check_inputs(
453
+ prompt, height, width, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds
454
+ )
455
+
456
+ # define call parameters
457
+ if isinstance(prompt, str):
458
+ batch_size = 1
459
+ elif isinstance(prompt, list):
460
+ batch_size = len(prompt)
461
+ else:
462
+ batch_size = prompt_embeds.shape[0]
463
+
464
+ if generator is None:
465
+ generator = np.random.RandomState()
466
+
467
+ prompt_embeds = self._encode_prompt(
468
+ prompt,
469
+ num_images_per_prompt,
470
+ False,
471
+ negative_prompt,
472
+ prompt_embeds=prompt_embeds,
473
+ negative_prompt_embeds=negative_prompt_embeds,
474
+ )
475
+
476
+ # set timesteps
477
+ self.scheduler.set_timesteps(num_inference_steps, original_inference_steps=original_inference_steps)
478
+ timesteps = self.scheduler.timesteps
479
+
480
+ latents = self.prepare_latents(
481
+ batch_size * num_images_per_prompt,
482
+ self.unet.config["in_channels"],
483
+ height,
484
+ width,
485
+ prompt_embeds.dtype,
486
+ generator,
487
+ latents,
488
+ )
489
+
490
+ bs = batch_size * num_images_per_prompt
491
+ # get Guidance Scale Embedding
492
+ w = np.full(bs, guidance_scale - 1, dtype=prompt_embeds.dtype)
493
+ w_embedding = self.get_guidance_scale_embedding(
494
+ w, embedding_dim=self.unet.config["time_cond_proj_dim"], dtype=prompt_embeds.dtype
495
+ )
496
+
497
+ # Adapted from diffusers to extend it for other runtimes than ORT
498
+ timestep_dtype = np.int64
499
+
500
+ num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
501
+ for i, t in enumerate(self.progress_bar(timesteps)):
502
+ timestep = np.array([t], dtype=timestep_dtype)
503
+ noise_pred = self.unet(
504
+ sample=latents,
505
+ timestep=timestep,
506
+ encoder_hidden_states=prompt_embeds,
507
+ timestep_cond=w_embedding,
508
+ )[0]
509
+
510
+ # compute the previous noisy sample x_t -> x_t-1
511
+ latents, denoised = self.scheduler.step(
512
+ torch.from_numpy(noise_pred), t, torch.from_numpy(latents), return_dict=False
513
+ )
514
+ latents, denoised = latents.numpy(), denoised.numpy()
515
+
516
+ # call the callback, if provided
517
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
518
+ if callback is not None and i % callback_steps == 0:
519
+ callback(i, t, latents)
520
+
521
+ if output_type == "latent":
522
+ image = denoised
523
+ has_nsfw_concept = None
524
+ else:
525
+ denoised /= self.vae_decoder.config["scaling_factor"]
526
+ # it seems likes there is a strange result for using half-precision vae decoder if batchsize>1
527
+ image = np.concatenate(
528
+ [self.vae_decoder(latent_sample=denoised[i : i + 1])[0] for i in range(denoised.shape[0])]
529
+ )
530
+ # image, has_nsfw_concept = self.run_safety_checker(image)
531
+ has_nsfw_concept = None # skip safety checker
532
+
533
+ if has_nsfw_concept is None:
534
+ do_denormalize = [True] * image.shape[0]
535
+ else:
536
+ do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
537
+
538
+ image = self.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
539
+
540
+ if not return_dict:
541
+ return (image, has_nsfw_concept)
542
+
543
+ return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
544
+
545
+
546
+ # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
547
+ def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=None):
548
+ """
549
+ See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
550
+
551
+ Args:
552
+ timesteps (`torch.Tensor`):
553
+ generate embedding vectors at these timesteps
554
+ embedding_dim (`int`, *optional*, defaults to 512):
555
+ dimension of the embeddings to generate
556
+ dtype:
557
+ data type of the generated embeddings
558
+
559
+ Returns:
560
+ `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
561
+ """
562
+ w = w * 1000
563
+ half_dim = embedding_dim // 2
564
+ emb = np.log(10000.0) / (half_dim - 1)
565
+ emb = np.exp(np.arange(half_dim, dtype=dtype) * -emb)
566
+ emb = w[:, None] * emb[None, :]
567
+ emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
568
+
569
+ if embedding_dim % 2 == 1: # zero pad
570
+ emb = np.pad(emb, [(0, 0), (0, 1)])
571
+
572
+ assert emb.shape == (w.shape[0], embedding_dim)
573
+ return emb
574
+
575
+ def get_image_path(args, **override_kwargs):
576
+ """ mkdir output folder and encode metadata in the filename
577
+ """
578
+ out_folder = os.path.join(args.o, "_".join(args.prompt.replace("/", "_").rsplit(" ")))
579
+ os.makedirs(out_folder, exist_ok=True)
580
+
581
+ out_fname = f"randomSeed_{override_kwargs.get('seed', None) or args.seed}"
582
+
583
+ out_fname += f"_LCM_"
584
+ out_fname += f"_numInferenceSteps{override_kwargs.get('num_inference_steps', None) or args.num_inference_steps}"
585
+ out_fname += "_onnx_"
586
+
587
+ return os.path.join(out_folder, out_fname + ".png")
588
+
589
+
590
+ def prepare_controlnet_cond(image_path, height, width):
591
+ image = Image.open(image_path).convert("RGB")
592
+ image = image.resize((height, width), resample=Image.LANCZOS)
593
+ image = np.array(image).transpose(2, 0, 1) / 255.0
594
+ return image
595
+
596
+
597
+ def main(args):
598
+ logger.info(f"Setting random seed to {args.seed}")
599
+
600
+ # load scheduler from /scheduler/scheduler_config.json
601
+ scheduler_config_path = os.path.join(args.i, "scheduler/scheduler_config.json")
602
+ with open(scheduler_config_path, "r") as f:
603
+ scheduler_config = json.load(f)
604
+ user_specified_scheduler = LCMScheduler.from_config(scheduler_config)
605
+
606
+ print("user_specified_scheduler", user_specified_scheduler)
607
+
608
+ pipe = RKNN2StableDiffusionPipeline(
609
+ text_encoder=RKNN2Model(os.path.join(args.i, "text_encoder")),
610
+ unet=RKNN2Model(os.path.join(args.i, "unet")),
611
+ vae_decoder=RKNN2Model(os.path.join(args.i, "vae_decoder")),
612
+ scheduler=user_specified_scheduler,
613
+ tokenizer=CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16"),
614
+ )
615
+
616
+ logger.info("Beginning image generation.")
617
+ image = pipe(
618
+ prompt=args.prompt,
619
+ height=int(args.size.split("x")[0]),
620
+ width=int(args.size.split("x")[1]),
621
+ num_inference_steps=args.num_inference_steps,
622
+ guidance_scale=args.guidance_scale,
623
+ generator=np.random.RandomState(args.seed),
624
+ )
625
+
626
+ out_path = get_image_path(args)
627
+ logger.info(f"Saving generated image to {out_path}")
628
+ image["images"][0].save(out_path)
629
+
630
+
631
+ if __name__ == "__main__":
632
+ parser = argparse.ArgumentParser()
633
+
634
+ parser.add_argument(
635
+ "--prompt",
636
+ required=True,
637
+ help="The text prompt to be used for text-to-image generation.")
638
+ parser.add_argument(
639
+ "-i",
640
+ required=True,
641
+ help=("Path to model directory"))
642
+ parser.add_argument("-o", required=True)
643
+ parser.add_argument("--seed",
644
+ default=93,
645
+ type=int,
646
+ help="Random seed to be able to reproduce results")
647
+ parser.add_argument(
648
+ "-s",
649
+ "--size",
650
+ default="256x256",
651
+ type=str,
652
+ help="Image size")
653
+ parser.add_argument(
654
+ "--num-inference-steps",
655
+ default=4,
656
+ type=int,
657
+ help="The number of iterations the unet model will be executed throughout the reverse diffusion process")
658
+ parser.add_argument(
659
+ "--guidance-scale",
660
+ default=7.5,
661
+ type=float,
662
+ help="Controls the influence of the text prompt on sampling process (0=random images)")
663
+
664
+ args = parser.parse_args()
665
+ main(args)
run_rknn-lcm.py ADDED
@@ -0,0 +1,632 @@
1
+
2
+ import argparse
3
+ import json
4
+ import time
5
+
6
+ import PIL
7
+ from diffusers import StableDiffusionPipeline
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
9
+ from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
10
+ from diffusers.schedulers import (
11
+ LCMScheduler
12
+ )
13
+
14
+ import logging
15
+
16
+ logging.basicConfig()
17
+ logger = logging.getLogger(__name__)
18
+ logger.setLevel(logging.INFO)
19
+
20
+ import numpy as np
21
+ import os
22
+
23
+ import torch # Only used for `torch.from_numpy` in `pipe.scheduler.step()`
24
+ from transformers import CLIPFeatureExtractor, CLIPTokenizer
25
+ from typing import Callable, List, Optional, Union, Tuple
26
+ from PIL import Image
27
+
28
+ from rknnlite.api import RKNNLite
29
+
30
+ class RKNN2Model:
31
+ """ Wrapper for running RKNPU2 models """
32
+
33
+ def __init__(self, model_dir):
34
+ logger.info(f"Loading {model_dir}")
35
+ start = time.time()
36
+ self.config = json.load(open(os.path.join(model_dir, "config.json")))
37
+ assert os.path.exists(model_dir) and os.path.exists(os.path.join(model_dir, "model.rknn"))
38
+ self.rknnlite = RKNNLite()
39
+ self.rknnlite.load_rknn(os.path.join(model_dir, "model.rknn"))
40
+ self.rknnlite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO) # Multi-core will cause kernel crash
41
+ load_time = time.time() - start
42
+ logger.info(f"Done. Took {load_time:.1f} seconds.")
43
+ self.modelname = model_dir.split("/")[-1]
44
+ self.inference_time = 0
45
+
46
+ def __call__(self, **kwargs) -> List[np.ndarray]:
47
+ # np.savez(f"rknn_out/{self.modelname}_input_{self.inference_time}.npz", **kwargs)
48
+ # self.inference_time += 1
49
+ #print(kwargs)
50
+ input_list = [value for key, value in kwargs.items()]
51
+ for i, input in enumerate(input_list):
52
+ if isinstance(input, np.ndarray):
53
+ print(f"input {i} shape: {input.shape}")
54
+
55
+ results = self.rknnlite.inference(inputs=input_list, data_format='nchw')
56
+ for res in results:
57
+ print(f"output shape: {res.shape}")
58
+ return results
59
+
60
+ class RKNN2LatentConsistencyPipeline(DiffusionPipeline):
+
+     def __init__(
+         self,
+         text_encoder: RKNN2Model,
+         unet: RKNN2Model,
+         vae_decoder: RKNN2Model,
+         scheduler: LCMScheduler,
+         tokenizer: CLIPTokenizer,
+         force_zeros_for_empty_prompt: Optional[bool] = True,
+         feature_extractor: Optional[CLIPFeatureExtractor] = None,
+         text_encoder_2: Optional[RKNN2Model] = None,
+         tokenizer_2: Optional[CLIPTokenizer] = None,
+     ):
+         super().__init__()
+
+         self.register_modules(
+             tokenizer=tokenizer,
+             scheduler=scheduler,
+             feature_extractor=feature_extractor,
+         )
+         self.force_zeros_for_empty_prompt = force_zeros_for_empty_prompt
+         self.safety_checker = None
+
+         # The RKNN models are plain wrappers, so they are kept as attributes instead of registered modules.
+         self.text_encoder = text_encoder
+         self.text_encoder_2 = text_encoder_2
+         self.tokenizer_2 = tokenizer_2
+         self.unet = unet
+         self.vae_decoder = vae_decoder
+
+         VAE_DECODER_UPSAMPLE_FACTOR = 8
+         self.vae_scale_factor = VAE_DECODER_UPSAMPLE_FACTOR
+
+     @staticmethod
+     def postprocess(
+         image: np.ndarray,
+         output_type: str = "pil",
+         do_denormalize: Optional[List[bool]] = None,
+     ):
+         def numpy_to_pil(images: np.ndarray):
+             """Convert a numpy image or a batch of images to a PIL image."""
+             if images.ndim == 3:
+                 images = images[None, ...]
+             images = (images * 255).round().astype("uint8")
+             if images.shape[-1] == 1:
+                 # special case for grayscale (single channel) images
+                 pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
+             else:
+                 pil_images = [Image.fromarray(image) for image in images]
+
+             return pil_images
+
+         def denormalize(images: np.ndarray):
+             """Denormalize an image array from [-1, 1] to [0, 1]."""
+             return np.clip(images / 2 + 0.5, 0, 1)
+
+         if not isinstance(image, np.ndarray):
+             raise ValueError(
+                 f"Input for postprocessing is in incorrect format: {type(image)}. We only support np array"
+             )
+         if output_type not in ["latent", "np", "pil"]:
+             deprecation_message = (
+                 f"the output_type {output_type} is not supported and has been set to `np`. Please make sure to set"
+                 " it to one of these instead: `latent`, `np`, `pil`"
+             )
+             logger.warning(deprecation_message)
+             output_type = "np"
+
+         if output_type == "latent":
+             return image
+
+         if do_denormalize is None:
+             raise ValueError("do_denormalize is required for postprocessing")
+
+         image = np.stack(
+             [denormalize(image[i]) if do_denormalize[i] else image[i] for i in range(image.shape[0])], axis=0
+         )
+         image = image.transpose((0, 2, 3, 1))
+
+         if output_type == "pil":
+             image = numpy_to_pil(image)
+
+         return image
+
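As a quick sanity check of `denormalize` above: the VAE decoder outputs pixel values in `[-1, 1]`, and the clip-and-shift maps them onto `[0, 1]` before `numpy_to_pil` scales them to `uint8`:

```python
import numpy as np

x = np.array([-1.2, -1.0, 0.0, 1.0, 1.3])   # out-of-range values are clipped
print(np.clip(x / 2 + 0.5, 0, 1))            # [0.  0.  0.5 1.  1. ]
```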
148
+     def _encode_prompt(
+         self,
+         prompt: Union[str, List[str]],
+         num_images_per_prompt: int,
+         do_classifier_free_guidance: bool,
+         negative_prompt: Optional[Union[str, list]],
+         prompt_embeds: Optional[np.ndarray] = None,
+         negative_prompt_embeds: Optional[np.ndarray] = None,
+     ):
+         r"""
+         Encodes the prompt into text encoder hidden states.
+
+         Args:
+             prompt (`Union[str, List[str]]`):
+                 prompt to be encoded
+             num_images_per_prompt (`int`):
+                 number of images that should be generated per prompt
+             do_classifier_free_guidance (`bool`):
+                 whether to use classifier-free guidance or not
+             negative_prompt (`Optional[Union[str, list]]`):
+                 The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e.,
+                 ignored if `guidance_scale` is less than `1`).
+             prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If
+                 not provided, text embeddings will be generated from the `prompt` input argument.
+             negative_prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
+                 weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt`
+                 input argument.
+         """
+         if isinstance(prompt, str):
+             batch_size = 1
+         elif isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             batch_size = prompt_embeds.shape[0]
+
+         if prompt_embeds is None:
+             # get prompt text embeddings
+             text_inputs = self.tokenizer(
+                 prompt,
+                 padding="max_length",
+                 max_length=self.tokenizer.model_max_length,
+                 truncation=True,
+                 return_tensors="np",
+             )
+             text_input_ids = text_inputs.input_ids
+             untruncated_ids = self.tokenizer(prompt, padding="max_length", return_tensors="np").input_ids
+
+             if not np.array_equal(text_input_ids, untruncated_ids):
+                 removed_text = self.tokenizer.batch_decode(
+                     untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
+                 )
+                 logger.warning(
+                     "The following part of your input was truncated because CLIP can only handle sequences up to"
+                     f" {self.tokenizer.model_max_length} tokens: {removed_text}"
+                 )
+
+             prompt_embeds = self.text_encoder(input_ids=text_input_ids.astype(np.int32))[0]
+
+         prompt_embeds = np.repeat(prompt_embeds, num_images_per_prompt, axis=0)
+
+         # get unconditional embeddings for classifier-free guidance
+         if do_classifier_free_guidance and negative_prompt_embeds is None:
+             uncond_tokens: List[str]
+             if negative_prompt is None:
+                 uncond_tokens = [""] * batch_size
+             elif type(prompt) is not type(negative_prompt):
+                 raise TypeError(
+                     f"`negative_prompt` should be the same type as `prompt`, but got {type(negative_prompt)} !="
+                     f" {type(prompt)}."
+                 )
+             elif isinstance(negative_prompt, str):
+                 uncond_tokens = [negative_prompt] * batch_size
+             elif batch_size != len(negative_prompt):
+                 raise ValueError(
+                     f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
+                     f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
+                     " the batch size of `prompt`."
+                 )
+             else:
+                 uncond_tokens = negative_prompt
+
+             max_length = prompt_embeds.shape[1]
+             uncond_input = self.tokenizer(
+                 uncond_tokens,
+                 padding="max_length",
+                 max_length=max_length,
+                 truncation=True,
+                 return_tensors="np",
+             )
+             negative_prompt_embeds = self.text_encoder(input_ids=uncond_input.input_ids.astype(np.int32))[0]
+
+         if do_classifier_free_guidance:
+             negative_prompt_embeds = np.repeat(negative_prompt_embeds, num_images_per_prompt, axis=0)
+
+             # For classifier-free guidance, we need to do two forward passes.
+             # Here we concatenate the unconditional and text embeddings into a single batch
+             # to avoid doing two forward passes.
+             prompt_embeds = np.concatenate([negative_prompt_embeds, prompt_embeds])
+
+         return prompt_embeds
+
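The tokenizer call in `_encode_prompt` pads or truncates every prompt to CLIP's fixed context length, so the embeddings fed to the UNet always have a `(batch, 77, hidden)` shape regardless of prompt length. A standalone check, using the same tokenizer name that `main()` below loads:

```python
import numpy as np
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
enc = tokenizer(
    "Majestic mountain landscape with snow-capped peaks",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="np",
)
print(enc.input_ids.shape)                   # (1, 77)
print(enc.input_ids.astype(np.int32).dtype)  # int32, the dtype fed to the RKNN text encoder
```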
251
+     # Copied from https://github.com/huggingface/diffusers/blob/v0.17.1/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion.py#L217
+     def check_inputs(
+         self,
+         prompt: Union[str, List[str]],
+         height: Optional[int],
+         width: Optional[int],
+         callback_steps: int,
+         negative_prompt: Optional[str] = None,
+         prompt_embeds: Optional[np.ndarray] = None,
+         negative_prompt_embeds: Optional[np.ndarray] = None,
+     ):
+         if height % 8 != 0 or width % 8 != 0:
+             raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
+
+         if (callback_steps is None) or (
+             callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
+         ):
+             raise ValueError(
+                 f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
+                 f" {type(callback_steps)}."
+             )
+
+         if prompt is not None and prompt_embeds is not None:
+             raise ValueError(
+                 f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
+                 " only forward one of the two."
+             )
+         elif prompt is None and prompt_embeds is None:
+             raise ValueError(
+                 "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
+             )
+         elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
+             raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
+
+         if negative_prompt is not None and negative_prompt_embeds is not None:
+             raise ValueError(
+                 f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
+                 f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
+             )
+
+         if prompt_embeds is not None and negative_prompt_embeds is not None:
+             if prompt_embeds.shape != negative_prompt_embeds.shape:
+                 raise ValueError(
+                     "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
+                     f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
+                     f" {negative_prompt_embeds.shape}."
+                 )
+
+     # Adapted from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
+     def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, generator, latents=None):
+         shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
+         if isinstance(generator, list) and len(generator) != batch_size:
+             raise ValueError(
+                 f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
+                 f" size of {batch_size}. Make sure the batch size matches the length of the generators."
+             )
+
+         if latents is None:
+             if isinstance(generator, np.random.RandomState):
+                 latents = generator.randn(*shape).astype(dtype)
+             elif isinstance(generator, torch.Generator):
+                 latents = torch.randn(*shape, generator=generator).numpy().astype(dtype)
+             else:
+                 raise ValueError(
+                     f"Expected `generator` to be of type `np.random.RandomState` or `torch.Generator`, but got"
+                     f" {type(generator)}."
+                 )
+         elif latents.shape != shape:
+             raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
+
+         # scale the initial noise by the standard deviation required by the scheduler
+         latents = latents * np.float64(self.scheduler.init_noise_sigma)
+
+         return latents
+
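For the default 384x384 run from the README, `prepare_latents` produces a `(1, 4, 48, 48)` tensor: the SD 1.5 UNet works on 4 latent channels and the VAE downsamples by a factor of 8. A minimal sketch of that arithmetic (assuming `init_noise_sigma` is 1.0, as it is for `LCMScheduler`, so the final scaling is effectively a no-op):

```python
import numpy as np

batch_size, num_channels_latents = 1, 4   # SD 1.5 UNet latent channels
height = width = 384
vae_scale_factor = 8
shape = (batch_size, num_channels_latents, height // vae_scale_factor, width // vae_scale_factor)
print(shape)                               # (1, 4, 48, 48)

latents = np.random.RandomState(93).randn(*shape).astype(np.float32)
latents = latents * np.float64(1.0)        # init_noise_sigma for LCMScheduler
```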
326
+     # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
+     def __call__(
+         self,
+         prompt: Union[str, List[str]] = "",
+         height: Optional[int] = None,
+         width: Optional[int] = None,
+         num_inference_steps: int = 4,
+         original_inference_steps: int = None,
+         guidance_scale: float = 8.5,
+         num_images_per_prompt: int = 1,
+         generator: Optional[Union[np.random.RandomState, torch.Generator]] = None,
+         latents: Optional[np.ndarray] = None,
+         prompt_embeds: Optional[np.ndarray] = None,
+         output_type: str = "pil",
+         return_dict: bool = True,
+         callback: Optional[Callable[[int, int, np.ndarray], None]] = None,
+         callback_steps: int = 1,
+     ):
+         r"""
+         Function invoked when calling the pipeline for generation.
+
+         Args:
+             prompt (`Optional[Union[str, List[str]]]`, defaults to `""`):
+                 The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
+                 instead.
+             height (`Optional[int]`, defaults to `None`):
+                 The height in pixels of the generated image.
+             width (`Optional[int]`, defaults to `None`):
+                 The width in pixels of the generated image.
+             num_inference_steps (`int`, defaults to 4):
+                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+                 expense of slower inference.
+             guidance_scale (`float`, defaults to 8.5):
+                 Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
+                 `guidance_scale` is defined as `w` of equation 2. of [Imagen
+                 Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
+                 1`. A higher guidance scale encourages images that are closely linked to the text `prompt`, usually
+                 at the expense of lower image quality.
+             num_images_per_prompt (`int`, defaults to 1):
+                 The number of images to generate per prompt.
+             generator (`Optional[Union[np.random.RandomState, torch.Generator]]`, defaults to `None`):
+                 A random number generator used to make generation deterministic.
+             latents (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+                 generation. Can be used to tweak the same generation with different prompts. If not provided, a
+                 latents tensor will be generated by sampling using the supplied random `generator`.
+             prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If
+                 not provided, text embeddings will be generated from the `prompt` input argument.
+             output_type (`str`, defaults to `"pil"`):
+                 The output format of the generated image. Choose between
+                 [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+             return_dict (`bool`, defaults to `True`):
+                 Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
+                 plain tuple.
+             callback (`Optional[Callable]`, defaults to `None`):
+                 A function that will be called every `callback_steps` steps during inference. The function will be
+                 called with the following arguments: `callback(step: int, timestep: int, latents: np.ndarray)`.
+             callback_steps (`int`, defaults to 1):
+                 The frequency at which the `callback` function will be called. If not specified, the callback will
+                 be called at every step.
+
+         Returns:
+             [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
+                 [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a
+                 plain `tuple`. When returning a tuple, the first element is a list with the generated images, and
+                 the second element is a list of `bool`s denoting whether the corresponding generated image likely
+                 represents "not-safe-for-work" (nsfw) content, according to the `safety_checker`.
+         """
400
+         height = height or self.unet.config["sample_size"] * self.vae_scale_factor
+         width = width or self.unet.config["sample_size"] * self.vae_scale_factor
+
+         # Don't need to get negative prompts due to LCM guided distillation
+         negative_prompt = None
+         negative_prompt_embeds = None
+
+         # check inputs. Raise error if not correct
+         self.check_inputs(
+             prompt, height, width, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds
+         )
+
+         # define call parameters
+         if isinstance(prompt, str):
+             batch_size = 1
+         elif isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             batch_size = prompt_embeds.shape[0]
+
+         if generator is None:
+             generator = np.random.RandomState()
+
+         start_time = time.time()
+         prompt_embeds = self._encode_prompt(
+             prompt,
+             num_images_per_prompt,
+             False,
+             negative_prompt,
+             prompt_embeds=prompt_embeds,
+             negative_prompt_embeds=negative_prompt_embeds,
+         )
+         encode_prompt_time = time.time() - start_time
+         print(f"Prompt encoding time: {encode_prompt_time:.2f}s")
+
+         # set timesteps
+         self.scheduler.set_timesteps(num_inference_steps, original_inference_steps=original_inference_steps)
+         timesteps = self.scheduler.timesteps
+
+         latents = self.prepare_latents(
+             batch_size * num_images_per_prompt,
+             self.unet.config["in_channels"],
+             height,
+             width,
+             prompt_embeds.dtype,
+             generator,
+             latents,
+         )
+
+         bs = batch_size * num_images_per_prompt
+         # get Guidance Scale Embedding
+         w = np.full(bs, guidance_scale - 1, dtype=prompt_embeds.dtype)
+         w_embedding = self.get_guidance_scale_embedding(
+             w, embedding_dim=self.unet.config["time_cond_proj_dim"], dtype=prompt_embeds.dtype
+         )
+
+         # Adapted from diffusers to extend it for other runtimes than ORT
+         timestep_dtype = np.int64
+
+         num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
+         inference_start = time.time()
+         for i, t in enumerate(self.progress_bar(timesteps)):
+             timestep = np.array([t], dtype=timestep_dtype)
+             noise_pred = self.unet(
+                 sample=latents,
+                 timestep=timestep,
+                 encoder_hidden_states=prompt_embeds,
+                 timestep_cond=w_embedding,
+             )[0]
+
+             # compute the previous noisy sample x_t -> x_t-1
+             latents, denoised = self.scheduler.step(
+                 torch.from_numpy(noise_pred), t, torch.from_numpy(latents), return_dict=False
+             )
+             latents, denoised = latents.numpy(), denoised.numpy()
+
+             # call the callback, if provided
+             if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
+                 if callback is not None and i % callback_steps == 0:
+                     callback(i, t, latents)
+         inference_time = time.time() - inference_start
+         print(f"Inference time: {inference_time:.2f}s")
+
+         decode_start = time.time()
+         if output_type == "latent":
+             image = denoised
+             has_nsfw_concept = None
+         else:
+             denoised /= self.vae_decoder.config["scaling_factor"]
+             # The half-precision VAE decoder produces strange results for batch sizes > 1, so decode one latent at a time
+             image = np.concatenate(
+                 [self.vae_decoder(latent_sample=denoised[i : i + 1])[0] for i in range(denoised.shape[0])]
+             )
+             # image, has_nsfw_concept = self.run_safety_checker(image)
+             has_nsfw_concept = None  # skip safety checker
+
+         if has_nsfw_concept is None:
+             do_denormalize = [True] * image.shape[0]
+         else:
+             do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
+
+         image = self.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
+         decode_time = time.time() - decode_start
+         print(f"Decode time: {decode_time:.2f}s")
+
+         total_time = encode_prompt_time + inference_time + decode_time
+         print(f"Total time: {total_time:.2f}s")
+
+         if not return_dict:
+             return (image, has_nsfw_concept)
+
+         return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
+
+
514
+     # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
+     def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=None):
+         """
+         See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
+
+         Args:
+             w (`np.ndarray`):
+                 Guidance scale conditioning values to generate embedding vectors for.
+             embedding_dim (`int`, *optional*, defaults to 512):
+                 dimension of the embeddings to generate
+             dtype:
+                 data type of the generated embeddings
+
+         Returns:
+             `np.ndarray`: Embedding vectors with shape `(len(w), embedding_dim)`
+         """
+         w = w * 1000
+         half_dim = embedding_dim // 2
+         emb = np.log(10000.0) / (half_dim - 1)
+         emb = np.exp(np.arange(half_dim, dtype=dtype) * -emb)
+         emb = w[:, None] * emb[None, :]
+         emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
+
+         if embedding_dim % 2 == 1:  # zero pad
+             emb = np.pad(emb, [(0, 0), (0, 1)])
+
+         assert emb.shape == (w.shape[0], embedding_dim)
+         return emb
+
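As a worked example of the embedding above: with the default `guidance_scale` of 8.5, the conditioning scalar is `w = 7.5`. LCM UNets typically use a `time_cond_proj_dim` of 256 (an assumption; the actual value comes from the converted UNet's `config.json`), which gives a `(1, 256)` embedding whose entries stay inside `[-1, 1]`. A minimal standalone sketch:

```python
import numpy as np

def guidance_embedding(w, embedding_dim=256, dtype=np.float32):
    # Standalone re-computation of the sinusoidal guidance embedding, for a quick shape check.
    w = np.asarray(w, dtype=dtype) * 1000
    half_dim = embedding_dim // 2
    emb = np.log(10000.0) / (half_dim - 1)
    emb = np.exp(np.arange(half_dim, dtype=dtype) * -emb)
    emb = w[:, None] * emb[None, :]
    emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
    if embedding_dim % 2 == 1:
        emb = np.pad(emb, [(0, 0), (0, 1)])
    return emb

emb = guidance_embedding(np.array([8.5 - 1]))  # w = guidance_scale - 1
print(emb.shape)                                # (1, 256)
print(float(emb.min()), float(emb.max()))       # both within [-1, 1]
```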
543
+ def get_image_path(args, **override_kwargs):
+     """Make the output folder and encode metadata in the filename."""
+     out_folder = os.path.join(args.o, "_".join(args.prompt.replace("/", "_").rsplit(" ")))
+     os.makedirs(out_folder, exist_ok=True)
+
+     out_fname = f"randomSeed_{override_kwargs.get('seed', None) or args.seed}"
+     out_fname += "_LCM"
+     out_fname += f"_numInferenceSteps{override_kwargs.get('num_inference_steps', None) or args.num_inference_steps}"
+
+     return os.path.join(out_folder, out_fname + ".png")
+
+
+ def prepare_controlnet_cond(image_path, height, width):
+     image = Image.open(image_path).convert("RGB")
+     # PIL's resize expects (width, height)
+     image = image.resize((width, height), resample=Image.LANCZOS)
+     image = np.array(image).transpose(2, 0, 1) / 255.0
+     return image
+
+
564
+ def main(args):
+     logger.info(f"Setting random seed to {args.seed}")
+
+     # load the scheduler from <model_dir>/scheduler/scheduler_config.json
+     scheduler_config_path = os.path.join(args.i, "scheduler/scheduler_config.json")
+     with open(scheduler_config_path, "r") as f:
+         scheduler_config = json.load(f)
+     user_specified_scheduler = LCMScheduler.from_config(scheduler_config)
+
+     print("user_specified_scheduler", user_specified_scheduler)
+
+     pipe = RKNN2LatentConsistencyPipeline(
+         text_encoder=RKNN2Model(os.path.join(args.i, "text_encoder")),
+         unet=RKNN2Model(os.path.join(args.i, "unet")),
+         vae_decoder=RKNN2Model(os.path.join(args.i, "vae_decoder")),
+         scheduler=user_specified_scheduler,
+         tokenizer=CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16"),
+     )
+
+     logger.info("Beginning image generation.")
+     image = pipe(
+         prompt=args.prompt,
+         height=int(args.size.split("x")[0]),
+         width=int(args.size.split("x")[1]),
+         num_inference_steps=args.num_inference_steps,
+         guidance_scale=args.guidance_scale,
+         generator=np.random.RandomState(args.seed),
+     )
+
+     out_path = get_image_path(args)
+     logger.info(f"Saving generated image to {out_path}")
+     image.images[0].save(out_path)
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument(
+         "--prompt",
+         required=True,
+         help="The text prompt to be used for text-to-image generation.")
+     parser.add_argument(
+         "-i",
+         required=True,
+         help="Path to the model directory")
+     parser.add_argument(
+         "-o",
+         required=True,
+         help="Path to the output directory for generated images")
+     parser.add_argument(
+         "--seed",
+         default=93,
+         type=int,
+         help="Random seed to be able to reproduce results")
+     parser.add_argument(
+         "-s",
+         "--size",
+         default="256x256",
+         type=str,
+         help="Output image size as HEIGHTxWIDTH, e.g. 384x384")
+     parser.add_argument(
+         "--num-inference-steps",
+         default=4,
+         type=int,
+         help="The number of iterations the unet model will be executed throughout the reverse diffusion process")
+     parser.add_argument(
+         "--guidance-scale",
+         default=7.5,
+         type=float,
+         help="Controls the influence of the text prompt on the sampling process (0 = random images)")
+
+     args = parser.parse_args()
+     main(args)