happyme531 committed on
Commit
1df274b
1 Parent(s): 0d329b2

Upload 38 files

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model/text_encoder/model.rknn filter=lfs diff=lfs merge=lfs -text
+ model/unet/model.onnx_data filter=lfs diff=lfs merge=lfs -text
+ model/unet/model.rknn filter=lfs diff=lfs merge=lfs -text
+ model/vae_decoder/model.rknn filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,122 @@
- ---
- license: agpl-3.0
- ---
+ # Stable Diffusion 1.5 Latent Consistency Model for RKNN2
+
+ ## (English README see below)
+
+ Run the Stable Diffusion 1.5 LCM image generation model with RKNPU2!
+
+ - Inference speed (RK3588): single NPU core, 384x384 resolution, 4 iterations, about 13.8 seconds on average per image
+ - Memory usage: about 5.2GB
+
+ ## Usage
+
+ ### 1. Clone or download this repository
+
+ ### 2. Install dependencies
+
+ ```bash
+ pip install diffusers pillow "numpy<2"
+ ```
+
+ You also need to install rknn-toolkit2-lite2.
+
+ ### 3. Run
+
+ ```bash
+ python ./run_rknn-lcm.py -i ./model -o ./images --num-inference-steps 4 -s 384x384 --prompt "Majestic mountain landscape with snow-capped peaks, autumn foliage in vibrant reds and oranges, a turquoise river winding through a valley, crisp and serene atmosphere, ultra-realistic style."
+ ```
+
+ ## Model Conversion
+
+ ### 1. Download the model
+
+ Download a Stable Diffusion 1.5 LCM model in ONNX format and place it in the `./model` directory.
+
+ ```bash
+ huggingface-cli download TheyCallMeHex/LCM-Dreamshaper-V7-ONNX
+ cp -r -L ~/.cache/huggingface/hub/models--TheyCallMeHex--LCM-Dreamshaper-V7-ONNX/snapshots/4029a217f9cdc0437f395738d3ab686bb910ceea ./model
+ ```
+
+ In theory, you could also merge the LCM LoRA into a regular Stable Diffusion 1.5 model and convert the result to ONNX to get LCM inference. I don't know how to do that myself; if you do, please open a PR.
+
+ ### 2. Convert the model
+
+ ```bash
+ # Convert the model at 384x384 resolution
+ python ./convert-onnx-to-rknn.py -m ./model -r 384x384
+ ```
+
+ Note that the higher the resolution, the larger the model and the longer the conversion takes. Very high resolutions are not recommended.
+
+ ## Known Issues
+
+ 1. As of now, models converted with the latest rknn-toolkit2 (version 2.2.0) still suffer from severe precision loss, even with the fp16 data type. In the comparison image, the top is the ONNX inference result and the bottom is the RKNN inference result, with all parameters identical; the higher the resolution, the worse the loss. This is a bug in rknn-toolkit2.
+
+ 2. The conversion script can in principle accept multiple resolutions (e.g., "384x384,256x256"), but doing so makes the conversion fail. This is also a bug in rknn-toolkit2.
+
+ ## References
+
+ - [TheyCallMeHex/LCM-Dreamshaper-V7-ONNX](https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX)
+ - [Optimum's LatentConsistencyPipeline](https://github.com/huggingface/optimum/blob/main/optimum/pipelines/diffusers/pipeline_latent_consistency.py)
+ - [happyme531/RK3588-stable-diffusion-GPU](https://github.com/happyme531/RK3588-stable-diffusion-GPU)
+
+ ## English README
+
+ # Stable Diffusion 1.5 Latent Consistency Model for RKNN2
+
+ Run the Stable Diffusion 1.5 LCM image generation model using RKNPU2!
+
+ - Inference speed (RK3588): single NPU core, 384x384 resolution, 4 iterations, about 13.8 seconds on average per image
+ - Memory usage: about 5.2GB
+
+ ## Usage
+
+ ### 1. Clone or download this repository to your local machine
+
+ ### 2. Install dependencies
+
+ ```bash
+ pip install diffusers pillow "numpy<2"
+ ```
+
+ You also need to install rknn-toolkit2-lite2.
+
+ ### 3. Run
+
+ ```bash
+ python ./run_rknn-lcm.py -i ./model -o ./images --num-inference-steps 4 -s 384x384 --prompt "Majestic mountain landscape with snow-capped peaks, autumn foliage in vibrant reds and oranges, a turquoise river winding through a valley, crisp and serene atmosphere, ultra-realistic style."
+ ```
+
+ ## Model Conversion
+
+ ### 1. Download the model
+
+ Download a Stable Diffusion 1.5 LCM model in ONNX format and place it in the `./model` directory.
+
+ ```bash
+ huggingface-cli download TheyCallMeHex/LCM-Dreamshaper-V7-ONNX
+ cp -r -L ~/.cache/huggingface/hub/models--TheyCallMeHex--LCM-Dreamshaper-V7-ONNX/snapshots/4029a217f9cdc0437f395738d3ab686bb910ceea ./model
+ ```
+
+ In theory, you could also achieve LCM inference by merging the LCM LoRA into a regular Stable Diffusion 1.5 model and then converting it to ONNX. I have not done this myself; if you know how, please submit a PR.
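+
+ For reference, here is an untested sketch of how that merge might look with current `diffusers` and `optimum` tooling. The `latent-consistency/lcm-lora-sdv1-5` LoRA id, the fused output directory name, and the `optimum-cli` invocation are assumptions, not something verified in this repository:
+
+ ```python
+ # Untested sketch: fuse the LCM LoRA into a plain SD 1.5 checkpoint, then export to ONNX.
+ from diffusers import StableDiffusionPipeline, LCMScheduler
+
+ pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+ pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+ pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # assumed LoRA repo id
+ pipe.fuse_lora()                        # bake the LoRA weights into the base model
+ pipe.save_pretrained("./sd15-lcm-fused")
+
+ # Then export the fused pipeline to ONNX, e.g. with optimum:
+ #   optimum-cli export onnx --model ./sd15-lcm-fused ./model
+ ```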
+
+ ### 2. Convert the model
+
+ ```bash
+ # Convert the model, 384x384 resolution
+ python ./convert-onnx-to-rknn.py -m ./model -r 384x384
+ ```
+
+ Note that the higher the resolution, the larger the model and the longer the conversion time. It's not recommended to use very high resolutions.
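+
+ For context, the per-submodel conversion presumably boils down to the standard rknn-toolkit2 flow. The sketch below is illustrative only (it is not the repo's `convert-onnx-to-rknn.py`), and the UNet input names and shapes are assumptions based on the Optimum-style ONNX export at 384x384:
+
+ ```python
+ # Illustrative sketch: convert one ONNX submodel (the UNet) to RKNN for RK3588, fp16, no quantization.
+ from rknn.api import RKNN
+
+ rknn = RKNN(verbose=True)
+ rknn.config(target_platform="rk3588")
+ # 384x384 images -> 48x48 latents (the VAE downsamples by 8).
+ rknn.load_onnx(
+     model="./model/unet/model.onnx",
+     inputs=["sample", "timestep", "encoder_hidden_states", "timestep_cond"],
+     input_size_list=[[1, 4, 48, 48], [1], [1, 77, 768], [1, 256]],
+ )
+ rknn.build(do_quantization=False)  # keep floating point (fp16 on the NPU)
+ rknn.export_rknn("./model/unet/model.rknn")
+ rknn.release()
+ ```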
+
+ ## Known Issues
+
+ 1. As of now, models converted using the latest version of rknn-toolkit2 (version 2.2.0) still suffer from severe precision loss, even when using the fp16 data type. In the comparison image, the top is the result of ONNX inference and the bottom is the result of RKNN inference, with all parameters identical. The higher the resolution, the more severe the precision loss. This is a bug in rknn-toolkit2 (a sketch for measuring the gap follows this list).
+
+ 2. The model conversion script can in principle accept multiple resolutions (e.g., "384x384,256x256"), but doing so causes the conversion to fail. This is also a bug in rknn-toolkit2.
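+
+ The precision gap can be checked numerically by feeding identical inputs to the ONNX and RKNN UNet and comparing the outputs. A minimal sketch, assuming the input names and shapes of the 384x384 conversion above:
+
+ ```python
+ # Minimal precision check: same random inputs through onnxruntime and rknnlite, then cosine similarity.
+ import numpy as np
+ import onnxruntime as ort
+ from rknnlite.api import RKNNLite
+
+ sample = np.random.randn(1, 4, 48, 48).astype(np.float32)
+ timestep = np.array([999], dtype=np.int64)
+ hidden = np.random.randn(1, 77, 768).astype(np.float32)
+ tcond = np.random.randn(1, 256).astype(np.float32)
+
+ ref = ort.InferenceSession("./model/unet/model.onnx").run(
+     None, {"sample": sample, "timestep": timestep,
+            "encoder_hidden_states": hidden, "timestep_cond": tcond})[0]
+
+ lite = RKNNLite()
+ lite.load_rknn("./model/unet/model.rknn")
+ lite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO)
+ out = lite.inference(inputs=[sample, timestep, hidden, tcond], data_format="nchw")[0]
+
+ cos = np.dot(ref.ravel(), out.ravel()) / (np.linalg.norm(ref) * np.linalg.norm(out))
+ print("cosine similarity:", cos)  # values well below 1.0 indicate the precision loss described above
+ ```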
+
+ ## References
+
+ - [TheyCallMeHex/LCM-Dreamshaper-V7-ONNX](https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX)
+ - [Optimum's LatentConsistencyPipeline](https://github.com/huggingface/optimum/blob/main/optimum/pipelines/diffusers/pipeline_latent_consistency.py)
+ - [happyme531/RK3588-stable-diffusion-GPU](https://github.com/happyme531/RK3588-stable-diffusion-GPU)
model/.gitattributes ADDED
@@ -0,0 +1,36 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.onnx_data filter=lfs diff=lfs merge=lfs -text
model/Assets/Icon.png ADDED
model/Assets/LCM-Dreamshaper-V7-ONNX.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "Name": "Dreamshaper v7(LCM)",
3
+ "Description": "DreamShaper started as a model to have an alternative to MidJourney in the open source world.",
4
+ "Author": "TheyCallMeHex",
5
+ "Repository": "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX",
6
+ "ImageIcon": "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Icon.png",
7
+ "Status": "Active",
8
+ "PadTokenId": 49407,
9
+ "BlankTokenId": 49407,
10
+ "TokenizerLimit": 77,
11
+ "EmbeddingsLength": 768,
12
+ "ScaleFactor": 0.18215,
13
+ "PipelineType": "LatentConsistency",
14
+ "Diffusers": [
15
+ "TextToImage",
16
+ "ImageToImage",
17
+ "ImageInpaintLegacy"
18
+ ],
19
+ "ModelFiles": [
20
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/tokenizer/model.onnx",
21
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/unet/model.onnx",
22
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/unet/model.onnx_data",
23
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/text_encoder/model.onnx",
24
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/vae_decoder/model.onnx",
25
+ "https://huggingface.co/TheyCallMeHex/LCM-Dreamshaper-V7-ONNX/resolve/main/vae_encoder/model.onnx"
26
+ ],
27
+ "Images": [
28
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview1.png",
29
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview2.png",
30
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview3.png",
31
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview4.png",
32
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview5.png",
33
+ "https://raw.githubusercontent.com/saddam213/OnnxStack/master/Assets/Templates/LCM-Dreamshaper-V7/Preview6.png"
34
+ ]
35
+ }
model/Assets/OnnxStack - 640x320.png ADDED
model/Assets/Preview1.png ADDED
model/Assets/Preview2.png ADDED
model/Assets/Preview3.png ADDED
model/Assets/Preview4.png ADDED
model/Assets/Preview5.png ADDED
model/Assets/Preview6.png ADDED
model/Assets/lcm_angel_30_7.5_2092464983.png ADDED
model/Assets/lcm_car_30_7.5_2092464983.png ADDED
model/Assets/lcm_demon_30_7.5_2092464983.png ADDED
model/Assets/lcm_ninja_30_7.5_2092464983.png ADDED
model/README.md ADDED
@@ -0,0 +1,56 @@
1
+ ---
2
+
3
+ language:
4
+ - en
5
+ license: mit
6
+ tags:
7
+ - stable-diffusion
8
+ - stable-diffusion-diffusers
9
+ - text-to-image
10
+ - diffusers
11
+ inference: true
12
+ ---
13
+
14
+ <p align="center" width="100%">
15
+ <img width="80%" src="Assets/OnnxStack - 640x320.png">
16
+ </p>
17
+
18
+ ### OnnxStack
19
+ This model has been converted to ONNX and tested with OnnxStack
20
+
21
+ - [OnnxStack](https://github.com/saddam213/OnnxStack)
22
+
23
+ ### LCM Dreamshaper V7 Diffusion
24
+ This model was converted to ONNX from LCM Dreamshaper V7
25
+
26
+ - [LCM-Dreamshaper-V7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7)
27
+
28
+ ### Sample Images
29
+ *A demon*
30
+
31
+ <img src="Assets/lcm_demon_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
32
+
33
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
34
+
35
+ __________________________
36
+ *An angel*
37
+
38
+ <img src="Assets/lcm_angel_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
39
+
40
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
41
+
42
+ __________________________
43
+ *A ninja*
44
+
45
+ <img src="Assets/lcm_ninja_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
46
+
47
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
48
+
49
+ __________________________
50
+ *a japanese domestic market sports car sitting in a showroom*
51
+
52
+ <img src="Assets/lcm_car_30_7.5_2092464983.png" width="256" alt="Image of browser inferencing on sample images."/>
53
+
54
+ Seed: 207582124 GuidanceScale: 7.5 NumInferenceSteps: 30
55
+
56
+ __________________________
model/feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "crop_size": {
3
+ "height": 224,
4
+ "width": 224
5
+ },
6
+ "do_center_crop": true,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "feature_extractor_type": "CLIPFeatureExtractor",
12
+ "image_mean": [
13
+ 0.48145466,
14
+ 0.4578275,
15
+ 0.40821073
16
+ ],
17
+ "image_processor_type": "CLIPImageProcessor",
18
+ "image_std": [
19
+ 0.26862954,
20
+ 0.26130258,
21
+ 0.27577711
22
+ ],
23
+ "resample": 3,
24
+ "rescale_factor": 0.00392156862745098,
25
+ "size": {
26
+ "shortest_edge": 224
27
+ }
28
+ }
model/model_index.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "_class_name": "StableDiffusionPipeline",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7",
5
+ "feature_extractor": [
6
+ "transformers",
7
+ "CLIPImageProcessor"
8
+ ],
9
+ "requires_safety_checker": true,
10
+ "safety_checker": [
11
+ "stable_diffusion",
12
+ "StableDiffusionSafetyChecker"
13
+ ],
14
+ "scheduler": [
15
+ "diffusers",
16
+ "LCMScheduler"
17
+ ],
18
+ "text_encoder": [
19
+ "transformers",
20
+ "CLIPTextModel"
21
+ ],
22
+ "tokenizer": [
23
+ "transformers",
24
+ "CLIPTokenizer"
25
+ ],
26
+ "unet": [
27
+ "diffusers",
28
+ "UNet2DConditionModel"
29
+ ],
30
+ "vae": [
31
+ "diffusers",
32
+ "AutoencoderKL"
33
+ ]
34
+ }
model/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "_class_name": "LCMScheduler",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "clip_sample_range": 1.0,
9
+ "dynamic_thresholding_ratio": 0.995,
10
+ "num_train_timesteps": 1000,
11
+ "original_inference_steps": 50,
12
+ "prediction_type": "epsilon",
13
+ "rescale_betas_zero_snr": false,
14
+ "sample_max_value": 1.0,
15
+ "set_alpha_to_one": true,
16
+ "steps_offset": 1,
17
+ "thresholding": false,
18
+ "timestep_spacing": "leading",
19
+ "trained_betas": null
20
+ }
model/text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "LCM_Dreamshaper_v7\\text_encoder",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "quick_gelu",
11
+ "hidden_size": 768,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 3072,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 12,
19
+ "num_hidden_layers": 12,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 768,
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.34.1",
24
+ "vocab_size": 49408
25
+ }
model/text_encoder/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fefe95eab6542e5fb7642b3f592489176836cc3fd49196b924a63760602c8c4a
3
+ size 492588002
model/text_encoder/model.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:36d87985d8b86a47ce8c6668fe3049c699f1633405e016c4f6c227a3dffaf638
3
+ size 249037093
model/tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model/tokenizer/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52af50d264d702c351484aabf62c64abe61f59d6a6d2c508a3e797e23dc1e008
3
+ size 1683168
model/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<|endoftext|>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
model/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "49406": {
5
+ "content": "<|startoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "49407": {
13
+ "content": "<|endoftext|>",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ }
20
+ },
21
+ "additional_special_tokens": [],
22
+ "bos_token": "<|startoftext|>",
23
+ "clean_up_tokenization_spaces": true,
24
+ "do_lower_case": true,
25
+ "eos_token": "<|endoftext|>",
26
+ "errors": "replace",
27
+ "model_max_length": 77,
28
+ "pad_token": "<|endoftext|>",
29
+ "tokenizer_class": "CLIPTokenizer",
30
+ "unk_token": "<|endoftext|>"
31
+ }
model/tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
model/unet/config.json ADDED
@@ -0,0 +1,68 @@
1
+ {
2
+ "_class_name": "UNet2DConditionModel",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7\\unet",
5
+ "act_fn": "silu",
6
+ "addition_embed_type": null,
7
+ "addition_embed_type_num_heads": 64,
8
+ "addition_time_embed_dim": null,
9
+ "attention_head_dim": 8,
10
+ "attention_type": "default",
11
+ "block_out_channels": [
12
+ 320,
13
+ 640,
14
+ 1280,
15
+ 1280
16
+ ],
17
+ "center_input_sample": false,
18
+ "class_embed_type": null,
19
+ "class_embeddings_concat": false,
20
+ "conv_in_kernel": 3,
21
+ "conv_out_kernel": 3,
22
+ "cross_attention_dim": 768,
23
+ "cross_attention_norm": null,
24
+ "down_block_types": [
25
+ "CrossAttnDownBlock2D",
26
+ "CrossAttnDownBlock2D",
27
+ "CrossAttnDownBlock2D",
28
+ "DownBlock2D"
29
+ ],
30
+ "downsample_padding": 1,
31
+ "dropout": 0.0,
32
+ "dual_cross_attention": false,
33
+ "encoder_hid_dim": null,
34
+ "encoder_hid_dim_type": null,
35
+ "flip_sin_to_cos": true,
36
+ "freq_shift": 0,
37
+ "in_channels": 4,
38
+ "layers_per_block": 2,
39
+ "mid_block_only_cross_attention": null,
40
+ "mid_block_scale_factor": 1,
41
+ "mid_block_type": "UNetMidBlock2DCrossAttn",
42
+ "norm_eps": 1e-05,
43
+ "norm_num_groups": 32,
44
+ "num_attention_heads": null,
45
+ "num_class_embeds": null,
46
+ "only_cross_attention": false,
47
+ "out_channels": 4,
48
+ "projection_class_embeddings_input_dim": null,
49
+ "resnet_out_scale_factor": 1.0,
50
+ "resnet_skip_time_act": false,
51
+ "resnet_time_scale_shift": "default",
52
+ "reverse_transformer_layers_per_block": null,
53
+ "sample_size": 96,
54
+ "time_cond_proj_dim": 256,
55
+ "time_embedding_act_fn": null,
56
+ "time_embedding_dim": null,
57
+ "time_embedding_type": "positional",
58
+ "timestep_post_act": null,
59
+ "transformer_layers_per_block": 1,
60
+ "up_block_types": [
61
+ "UpBlock2D",
62
+ "CrossAttnUpBlock2D",
63
+ "CrossAttnUpBlock2D",
64
+ "CrossAttnUpBlock2D"
65
+ ],
66
+ "upcast_attention": null,
67
+ "use_linear_projection": false
68
+ }
model/unet/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f3e9a08a3e5b943046bf90a513c492cf4c6e31e26229062af8eb4ad2ddf172b5
3
+ size 1948508
model/unet/model.onnx_data ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ef99ccc336de0e79f247fcb3d1398b3f3d1a02796916b88a351d7a83f570a31a
3
+ size 3438411520
model/unet/model.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:622e350654e2a2e90ff863f51aea1a30f4d148ed3bf8645f355a7c37df13420c
3
+ size 1758599217
model/vae_decoder/config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7\\vae",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "force_upcast": true,
19
+ "in_channels": 3,
20
+ "latent_channels": 4,
21
+ "layers_per_block": 2,
22
+ "norm_num_groups": 32,
23
+ "out_channels": 3,
24
+ "sample_size": 768,
25
+ "scaling_factor": 0.18215,
26
+ "up_block_types": [
27
+ "UpDecoderBlock2D",
28
+ "UpDecoderBlock2D",
29
+ "UpDecoderBlock2D",
30
+ "UpDecoderBlock2D"
31
+ ]
32
+ }
model/vae_decoder/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ec5298d7bfa592d492d36b42d17f794fcdb9175e2aac366956d40f3f38d13ca1
3
+ size 198078038
model/vae_decoder/model.rknn ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0d6fc059da6ca88930ef04148f6a4311512cce547a663102cdfedc50f22b8c3
3
+ size 159530108
model/vae_encoder/config.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.22.0.dev0",
4
+ "_name_or_path": "LCM_Dreamshaper_v7\\vae",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "force_upcast": true,
19
+ "in_channels": 3,
20
+ "latent_channels": 4,
21
+ "layers_per_block": 2,
22
+ "norm_num_groups": 32,
23
+ "out_channels": 3,
24
+ "sample_size": 768,
25
+ "scaling_factor": 0.18215,
26
+ "up_block_types": [
27
+ "UpDecoderBlock2D",
28
+ "UpDecoderBlock2D",
29
+ "UpDecoderBlock2D",
30
+ "UpDecoderBlock2D"
31
+ ]
32
+ }
model/vae_encoder/model.onnx ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:268d4398021d7bc91e91c94e4835cc5ffa471155db1b722d0a43f6d1a4f822fd
3
+ size 136760154
run_onnx-lcm.py ADDED
@@ -0,0 +1,665 @@
1
+
2
+ import argparse
3
+ import json
4
+ import time
5
+
6
+ import PIL
7
+ from diffusers import StableDiffusionPipeline
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
9
+ from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
10
+ from diffusers.schedulers import (
11
+ LCMScheduler
12
+ )
13
+ from diffusers.schedulers.scheduling_utils import SchedulerMixin
14
+
15
+ import gc
16
+ import inspect
17
+
18
+ import logging
19
+
20
+ logging.basicConfig()
21
+ logger = logging.getLogger(__name__)
22
+ logger.setLevel(logging.INFO)
23
+
24
+ import numpy as np
25
+ import os
26
+
27
+ import torch # Only used for `torch.from_numpy` in `pipe.scheduler.step()`
28
+ from transformers import CLIPFeatureExtractor, CLIPTokenizer
29
+ from typing import Callable, List, Optional, Union, Tuple
30
+ from PIL import Image
31
+
32
+ # from rknnlite.api import RKNNLite
33
+
34
+ # class RKNN2Model:
35
+ # """ Wrapper for running RKNPU2 models """
36
+
37
+ # def __init__(self, model_path):
38
+
39
+ # logger.info(f"Loading {model_path}")
40
+
41
+ # start = time.time()
42
+ # assert os.path.exists(model_path) and model_path.endswith(".rknn")
43
+ # self.rknnlite = RKNNLite()
44
+ # self.rknnlite.load_rknn(model_path)
45
+ # self.rknnlite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO) # Multi-core will cause kernel crash
46
+ # load_time = time.time() - start
47
+ # logger.info(f"Done. Took {load_time:.1f} seconds.")
48
+ # self.modelname = model_path.split("/")[-1]
49
+ # self.inference_time = 0
50
+
51
+ # def __call__(self, **kwargs) -> List[np.ndarray]:
52
+ # np.savez(f"{self.modelname}_input_{self.inference_time}.npz", **kwargs)
53
+ # #print(kwargs)
54
+ # input_list = [value for key, value in kwargs.items()]
55
+ # for i, input in enumerate(input_list):
56
+ # if isinstance(input, np.ndarray):
57
+ # print(f"input {i} shape: {input.shape}")
58
+ # results = self.rknnlite.inference(inputs=input_list)
59
+ # for res in results:
60
+ # print(f"output shape: {res.shape}")
61
+ # return results
62
+
63
+ import onnxruntime as ort
64
+
65
+ class RKNN2Model:
66
+ """ Wrapper for running ONNX models """
67
+
68
+ def __init__(self, model_dir):
69
+ logger.info(f"Loading {model_dir}")
70
+ start = time.time()
71
+ self.config = json.load(open(os.path.join(model_dir, "config.json")))
72
+ assert os.path.exists(model_dir) and os.path.exists(os.path.join(model_dir, "model.onnx"))
73
+ self.session = ort.InferenceSession(os.path.join(model_dir, "model.onnx"))
74
+ load_time = time.time() - start
75
+ logger.info(f"Done. Took {load_time:.1f} seconds.")
76
+ self.modelname = model_dir.split("/")[-1]
77
+ self.inference_time = 0
78
+
79
+ def __call__(self, **kwargs) -> List[np.ndarray]:
80
+ # np.savez(f"onnx_out/{self.modelname}_input_{self.inference_time}.npz", **kwargs)
81
+ self.inference_time += 1
82
+ results = self.session.run(None, kwargs)
83
+ results_list = []
84
+ for res in results:
85
+ results_list.append(res)
86
+ return results
87
+
88
+ class RKNN2StableDiffusionPipeline(DiffusionPipeline):
89
+ """ RKNN2 version of
90
+ `diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline`
91
+ """
92
+
93
+ def __init__(
94
+ self,
95
+ text_encoder: RKNN2Model,
96
+ unet: RKNN2Model,
97
+ vae_decoder: RKNN2Model,
98
+ scheduler: LCMScheduler,
99
+ tokenizer: CLIPTokenizer,
100
+ force_zeros_for_empty_prompt: Optional[bool] = True,
101
+ feature_extractor: Optional[CLIPFeatureExtractor] = None,
102
+ text_encoder_2: Optional[RKNN2Model] = None,
103
+ tokenizer_2: Optional[CLIPTokenizer] = None
104
+
105
+ ):
106
+ super().__init__()
107
+
108
+ # Register non-Core ML components of the pipeline similar to the original pipeline
109
+ self.register_modules(
110
+ tokenizer=tokenizer,
111
+ scheduler=scheduler,
112
+ feature_extractor=feature_extractor,
113
+ )
114
+ self.force_zeros_for_empty_prompt = force_zeros_for_empty_prompt
115
+ self.safety_checker = None
116
+
117
+ # Register Core ML components of the pipeline
118
+ self.text_encoder = text_encoder
119
+ self.text_encoder_2 = text_encoder_2
120
+ self.tokenizer_2 = tokenizer_2
121
+ self.unet = unet
122
+ self.vae_decoder = vae_decoder
123
+
124
+ VAE_DECODER_UPSAMPLE_FACTOR = 8
125
+
126
+ # In PyTorch, users can determine the tensor shapes dynamically by default
127
+ # In CoreML, tensors have static shapes unless flexible shapes were used during export
128
+ # See https://coremltools.readme.io/docs/flexible-inputs
129
+ latent_h, latent_w = 32, 32 # hallo1: FIXME: hardcoded value
130
+ self.height = latent_h * VAE_DECODER_UPSAMPLE_FACTOR
131
+ self.width = latent_w * VAE_DECODER_UPSAMPLE_FACTOR
132
+ self.vae_scale_factor = VAE_DECODER_UPSAMPLE_FACTOR
133
+ logger.info(
134
+ f"Stable Diffusion configured to generate {self.height}x{self.width} images"
135
+ )
136
+
137
+ @staticmethod
138
+ def postprocess(
139
+ image: np.ndarray,
140
+ output_type: str = "pil",
141
+ do_denormalize: Optional[List[bool]] = None,
142
+ ):
143
+ def numpy_to_pil(images: np.ndarray):
144
+ """
145
+ Convert a numpy image or a batch of images to a PIL image.
146
+ """
147
+ if images.ndim == 3:
148
+ images = images[None, ...]
149
+ images = (images * 255).round().astype("uint8")
150
+ if images.shape[-1] == 1:
151
+ # special case for grayscale (single channel) images
152
+ pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
153
+ else:
154
+ pil_images = [Image.fromarray(image) for image in images]
155
+
156
+ return pil_images
157
+
158
+ def denormalize(images: np.ndarray):
159
+ """
160
+ Denormalize an image array to [0,1].
161
+ """
162
+ return np.clip(images / 2 + 0.5, 0, 1)
163
+
164
+ if not isinstance(image, np.ndarray):
165
+ raise ValueError(
166
+ f"Input for postprocessing is in incorrect format: {type(image)}. We only support np array"
167
+ )
168
+ if output_type not in ["latent", "np", "pil"]:
169
+ deprecation_message = (
170
+ f"the output_type {output_type} is outdated and has been set to `np`. Please make sure to set it to one of these instead: "
171
+ "`pil`, `np`, `pt`, `latent`"
172
+ )
173
+ logger.warning(deprecation_message)
174
+ output_type = "np"
175
+
176
+ if output_type == "latent":
177
+ return image
178
+
179
+ if do_denormalize is None:
180
+ raise ValueError("do_denormalize is required for postprocessing")
181
+
182
+ image = np.stack(
183
+ [denormalize(image[i]) if do_denormalize[i] else image[i] for i in range(image.shape[0])], axis=0
184
+ )
185
+ image = image.transpose((0, 2, 3, 1))
186
+
187
+ if output_type == "pil":
188
+ image = numpy_to_pil(image)
189
+
190
+ return image
191
+
192
+ def _encode_prompt(
193
+ self,
194
+ prompt: Union[str, List[str]],
195
+ num_images_per_prompt: int,
196
+ do_classifier_free_guidance: bool,
197
+ negative_prompt: Optional[Union[str, list]],
198
+ prompt_embeds: Optional[np.ndarray] = None,
199
+ negative_prompt_embeds: Optional[np.ndarray] = None,
200
+ ):
201
+ r"""
202
+ Encodes the prompt into text encoder hidden states.
203
+
204
+ Args:
205
+ prompt (`Union[str, List[str]]`):
206
+ prompt to be encoded
207
+ num_images_per_prompt (`int`):
208
+ number of images that should be generated per prompt
209
+ do_classifier_free_guidance (`bool`):
210
+ whether to use classifier free guidance or not
211
+ negative_prompt (`Optional[Union[str, list]]`):
212
+ The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored
213
+ if `guidance_scale` is less than `1`).
214
+ prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
215
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
216
+ provided, text embeddings will be generated from `prompt` input argument.
217
+ negative_prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
218
+ Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
219
+ weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
220
+ argument.
221
+ """
222
+ if isinstance(prompt, str):
223
+ batch_size = 1
224
+ elif isinstance(prompt, list):
225
+ batch_size = len(prompt)
226
+ else:
227
+ batch_size = prompt_embeds.shape[0]
228
+
229
+ if prompt_embeds is None:
230
+ # get prompt text embeddings
231
+ text_inputs = self.tokenizer(
232
+ prompt,
233
+ padding="max_length",
234
+ max_length=self.tokenizer.model_max_length,
235
+ truncation=True,
236
+ return_tensors="np",
237
+ )
238
+ text_input_ids = text_inputs.input_ids
239
+ untruncated_ids = self.tokenizer(prompt, padding="max_length", return_tensors="np").input_ids
240
+
241
+ if not np.array_equal(text_input_ids, untruncated_ids):
242
+ removed_text = self.tokenizer.batch_decode(
243
+ untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
244
+ )
245
+ logger.warning(
246
+ "The following part of your input was truncated because CLIP can only handle sequences up to"
247
+ f" {self.tokenizer.model_max_length} tokens: {removed_text}"
248
+ )
249
+
250
+ prompt_embeds = self.text_encoder(input_ids=text_input_ids.astype(np.int32))[0]
251
+
252
+ prompt_embeds = np.repeat(prompt_embeds, num_images_per_prompt, axis=0)
253
+
254
+ # get unconditional embeddings for classifier free guidance
255
+ if do_classifier_free_guidance and negative_prompt_embeds is None:
256
+ uncond_tokens: List[str]
257
+ if negative_prompt is None:
258
+ uncond_tokens = [""] * batch_size
259
+ elif type(prompt) is not type(negative_prompt):
260
+ raise TypeError(
261
+ f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
262
+ f" {type(prompt)}."
263
+ )
264
+ elif isinstance(negative_prompt, str):
265
+ uncond_tokens = [negative_prompt] * batch_size
266
+ elif batch_size != len(negative_prompt):
267
+ raise ValueError(
268
+ f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
269
+ f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
270
+ " the batch size of `prompt`."
271
+ )
272
+ else:
273
+ uncond_tokens = negative_prompt
274
+
275
+ max_length = prompt_embeds.shape[1]
276
+ uncond_input = self.tokenizer(
277
+ uncond_tokens,
278
+ padding="max_length",
279
+ max_length=max_length,
280
+ truncation=True,
281
+ return_tensors="np",
282
+ )
283
+ negative_prompt_embeds = self.text_encoder(input_ids=uncond_input.input_ids.astype(np.int32))[0]
284
+
285
+ if do_classifier_free_guidance:
286
+ negative_prompt_embeds = np.repeat(negative_prompt_embeds, num_images_per_prompt, axis=0)
287
+
288
+ # For classifier free guidance, we need to do two forward passes.
289
+ # Here we concatenate the unconditional and text embeddings into a single batch
290
+ # to avoid doing two forward passes
291
+ prompt_embeds = np.concatenate([negative_prompt_embeds, prompt_embeds])
292
+
293
+ return prompt_embeds
294
+
295
+ # Copied from https://github.com/huggingface/diffusers/blob/v0.17.1/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion.py#L217
296
+ def check_inputs(
297
+ self,
298
+ prompt: Union[str, List[str]],
299
+ height: Optional[int],
300
+ width: Optional[int],
301
+ callback_steps: int,
302
+ negative_prompt: Optional[str] = None,
303
+ prompt_embeds: Optional[np.ndarray] = None,
304
+ negative_prompt_embeds: Optional[np.ndarray] = None,
305
+ ):
306
+ if height % 8 != 0 or width % 8 != 0:
307
+ raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
308
+
309
+ if (callback_steps is None) or (
310
+ callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
311
+ ):
312
+ raise ValueError(
313
+ f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
314
+ f" {type(callback_steps)}."
315
+ )
316
+
317
+ if prompt is not None and prompt_embeds is not None:
318
+ raise ValueError(
319
+ f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
320
+ " only forward one of the two."
321
+ )
322
+ elif prompt is None and prompt_embeds is None:
323
+ raise ValueError(
324
+ "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
325
+ )
326
+ elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
327
+ raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
328
+
329
+ if negative_prompt is not None and negative_prompt_embeds is not None:
330
+ raise ValueError(
331
+ f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
332
+ f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
333
+ )
334
+
335
+ if prompt_embeds is not None and negative_prompt_embeds is not None:
336
+ if prompt_embeds.shape != negative_prompt_embeds.shape:
337
+ raise ValueError(
338
+ "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
339
+ f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
340
+ f" {negative_prompt_embeds.shape}."
341
+ )
342
+
343
+ # Adapted from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
344
+ def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, generator, latents=None):
345
+ shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
346
+ if isinstance(generator, list) and len(generator) != batch_size:
347
+ raise ValueError(
348
+ f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
349
+ f" size of {batch_size}. Make sure the batch size matches the length of the generators."
350
+ )
351
+
352
+ if latents is None:
353
+ if isinstance(generator, np.random.RandomState):
354
+ latents = generator.randn(*shape).astype(dtype)
355
+ elif isinstance(generator, torch.Generator):
356
+ latents = torch.randn(*shape, generator=generator).numpy().astype(dtype)
357
+ else:
358
+ raise ValueError(
359
+ f"Expected `generator` to be of type `np.random.RandomState` or `torch.Generator`, but got"
360
+ f" {type(generator)}."
361
+ )
362
+ elif latents.shape != shape:
363
+ raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
364
+
365
+ # scale the initial noise by the standard deviation required by the scheduler
366
+ latents = latents * np.float64(self.scheduler.init_noise_sigma)
367
+
368
+ return latents
369
+
370
+ # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
371
+ def __call__(
372
+ self,
373
+ prompt: Union[str, List[str]] = "",
374
+ height: Optional[int] = None,
375
+ width: Optional[int] = None,
376
+ num_inference_steps: int = 4,
377
+ original_inference_steps: int = None,
378
+ guidance_scale: float = 8.5,
379
+ num_images_per_prompt: int = 1,
380
+ generator: Optional[Union[np.random.RandomState, torch.Generator]] = None,
381
+ latents: Optional[np.ndarray] = None,
382
+ prompt_embeds: Optional[np.ndarray] = None,
383
+ output_type: str = "pil",
384
+ return_dict: bool = True,
385
+ callback: Optional[Callable[[int, int, np.ndarray], None]] = None,
386
+ callback_steps: int = 1,
387
+ ):
388
+ r"""
389
+ Function invoked when calling the pipeline for generation.
390
+
391
+ Args:
392
+ prompt (`Optional[Union[str, List[str]]]`, defaults to None):
393
+ The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
394
+ instead.
395
+ height (`Optional[int]`, defaults to None):
396
+ The height in pixels of the generated image.
397
+ width (`Optional[int]`, defaults to None):
398
+ The width in pixels of the generated image.
399
+ num_inference_steps (`int`, defaults to 50):
400
+ The number of denoising steps. More denoising steps usually lead to a higher quality image at the
401
+ expense of slower inference.
402
+ guidance_scale (`float`, defaults to 7.5):
403
+ Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
404
+ `guidance_scale` is defined as `w` of equation 2. of [Imagen
405
+ Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
406
+ 1`. Higher guidance scale encourages to generate images that are closely linked to the text `prompt`,
407
+ usually at the expense of lower image quality.
408
+ num_images_per_prompt (`int`, defaults to 1):
409
+ The number of images to generate per prompt.
410
+ generator (`Optional[Union[np.random.RandomState, torch.Generator]]`, defaults to `None`):
411
+ A np.random.RandomState to make generation deterministic.
412
+ latents (`Optional[np.ndarray]`, defaults to `None`):
413
+ Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
414
+ generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
415
+ tensor will ge generated by sampling using the supplied random `generator`.
416
+ prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
417
+ Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
418
+ provided, text embeddings will be generated from `prompt` input argument.
419
+ output_type (`str`, defaults to `"pil"`):
420
+ The output format of the generate image. Choose between
421
+ [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
422
+ return_dict (`bool`, defaults to `True`):
423
+ Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
424
+ plain tuple.
425
+ callback (Optional[Callable], defaults to `None`):
426
+ A function that will be called every `callback_steps` steps during inference. The function will be
427
+ called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
428
+ callback_steps (`int`, defaults to 1):
429
+ The frequency at which the `callback` function will be called. If not specified, the callback will be
430
+ called at every step.
431
+ guidance_rescale (`float`, defaults to 0.0):
432
+ Guidance rescale factor proposed by [Common Diffusion Noise Schedules and Sample Steps are
433
+ Flawed](https://arxiv.org/pdf/2305.08891.pdf) `guidance_scale` is defined as `φ` in equation 16. of
434
+ [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf).
435
+ Guidance rescale factor should fix overexposure when using zero terminal SNR.
436
+
437
+ Returns:
438
+ [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
439
+ [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a `tuple.
440
+ When returning a tuple, the first element is a list with the generated images, and the second element is a
441
+ list of `bool`s denoting whether the corresponding generated image likely represents "not-safe-for-work"
442
+ (nsfw) content, according to the `safety_checker`.
443
+ """
444
+ height = height or self.unet.config["sample_size"] * self.vae_scale_factor
445
+ width = width or self.unet.config["sample_size"] * self.vae_scale_factor
446
+
447
+ # Don't need to get negative prompts due to LCM guided distillation
448
+ negative_prompt = None
449
+ negative_prompt_embeds = None
450
+
451
+ # check inputs. Raise error if not correct
452
+ self.check_inputs(
453
+ prompt, height, width, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds
454
+ )
455
+
456
+ # define call parameters
457
+ if isinstance(prompt, str):
458
+ batch_size = 1
459
+ elif isinstance(prompt, list):
460
+ batch_size = len(prompt)
461
+ else:
462
+ batch_size = prompt_embeds.shape[0]
463
+
464
+ if generator is None:
465
+ generator = np.random.RandomState()
466
+
467
+ prompt_embeds = self._encode_prompt(
468
+ prompt,
469
+ num_images_per_prompt,
470
+ False,
471
+ negative_prompt,
472
+ prompt_embeds=prompt_embeds,
473
+ negative_prompt_embeds=negative_prompt_embeds,
474
+ )
475
+
476
+ # set timesteps
477
+ self.scheduler.set_timesteps(num_inference_steps, original_inference_steps=original_inference_steps)
478
+ timesteps = self.scheduler.timesteps
479
+
480
+ latents = self.prepare_latents(
481
+ batch_size * num_images_per_prompt,
482
+ self.unet.config["in_channels"],
483
+ height,
484
+ width,
485
+ prompt_embeds.dtype,
486
+ generator,
487
+ latents,
488
+ )
489
+
490
+ bs = batch_size * num_images_per_prompt
491
+ # get Guidance Scale Embedding
492
+ w = np.full(bs, guidance_scale - 1, dtype=prompt_embeds.dtype)
493
+ w_embedding = self.get_guidance_scale_embedding(
494
+ w, embedding_dim=self.unet.config["time_cond_proj_dim"], dtype=prompt_embeds.dtype
495
+ )
496
+
497
+ # Adapted from diffusers to extend it for other runtimes than ORT
498
+ timestep_dtype = np.int64
499
+
500
+ num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
501
+ for i, t in enumerate(self.progress_bar(timesteps)):
502
+ timestep = np.array([t], dtype=timestep_dtype)
503
+ noise_pred = self.unet(
504
+ sample=latents,
505
+ timestep=timestep,
506
+ encoder_hidden_states=prompt_embeds,
507
+ timestep_cond=w_embedding,
508
+ )[0]
509
+
510
+ # compute the previous noisy sample x_t -> x_t-1
511
+ latents, denoised = self.scheduler.step(
512
+ torch.from_numpy(noise_pred), t, torch.from_numpy(latents), return_dict=False
513
+ )
514
+ latents, denoised = latents.numpy(), denoised.numpy()
515
+
516
+ # call the callback, if provided
517
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
518
+ if callback is not None and i % callback_steps == 0:
519
+ callback(i, t, latents)
520
+
521
+ if output_type == "latent":
522
+ image = denoised
523
+ has_nsfw_concept = None
524
+ else:
525
+ denoised /= self.vae_decoder.config["scaling_factor"]
526
+ # it seems likes there is a strange result for using half-precision vae decoder if batchsize>1
527
+ image = np.concatenate(
528
+ [self.vae_decoder(latent_sample=denoised[i : i + 1])[0] for i in range(denoised.shape[0])]
529
+ )
530
+ # image, has_nsfw_concept = self.run_safety_checker(image)
531
+ has_nsfw_concept = None # skip safety checker
532
+
533
+ if has_nsfw_concept is None:
534
+ do_denormalize = [True] * image.shape[0]
535
+ else:
536
+ do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
537
+
538
+ image = self.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
539
+
540
+ if not return_dict:
541
+ return (image, has_nsfw_concept)
542
+
543
+ return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
544
+
545
+
546
+ # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
547
+ def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=None):
548
+ """
549
+ See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
550
+
551
+ Args:
552
+ timesteps (`torch.Tensor`):
553
+ generate embedding vectors at these timesteps
554
+ embedding_dim (`int`, *optional*, defaults to 512):
555
+ dimension of the embeddings to generate
556
+ dtype:
557
+ data type of the generated embeddings
558
+
559
+ Returns:
560
+ `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)`
561
+ """
562
+ w = w * 1000
563
+ half_dim = embedding_dim // 2
564
+ emb = np.log(10000.0) / (half_dim - 1)
565
+ emb = np.exp(np.arange(half_dim, dtype=dtype) * -emb)
566
+ emb = w[:, None] * emb[None, :]
567
+ emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
568
+
569
+ if embedding_dim % 2 == 1: # zero pad
570
+ emb = np.pad(emb, [(0, 0), (0, 1)])
571
+
572
+ assert emb.shape == (w.shape[0], embedding_dim)
573
+ return emb
574
+
575
+ def get_image_path(args, **override_kwargs):
576
+ """ mkdir output folder and encode metadata in the filename
577
+ """
578
+ out_folder = os.path.join(args.o, "_".join(args.prompt.replace("/", "_").rsplit(" ")))
579
+ os.makedirs(out_folder, exist_ok=True)
580
+
581
+ out_fname = f"randomSeed_{override_kwargs.get('seed', None) or args.seed}"
582
+
583
+ out_fname += f"_LCM_"
584
+ out_fname += f"_numInferenceSteps{override_kwargs.get('num_inference_steps', None) or args.num_inference_steps}"
585
+ out_fname += "_onnx_"
586
+
587
+ return os.path.join(out_folder, out_fname + ".png")
588
+
589
+
590
+ def prepare_controlnet_cond(image_path, height, width):
591
+ image = Image.open(image_path).convert("RGB")
592
+ image = image.resize((height, width), resample=Image.LANCZOS)
593
+ image = np.array(image).transpose(2, 0, 1) / 255.0
594
+ return image
595
+
596
+
597
+ def main(args):
598
+ logger.info(f"Setting random seed to {args.seed}")
599
+
600
+ # load scheduler from /scheduler/scheduler_config.json
601
+ scheduler_config_path = os.path.join(args.i, "scheduler/scheduler_config.json")
602
+ with open(scheduler_config_path, "r") as f:
603
+ scheduler_config = json.load(f)
604
+ user_specified_scheduler = LCMScheduler.from_config(scheduler_config)
605
+
606
+ print("user_specified_scheduler", user_specified_scheduler)
607
+
608
+ pipe = RKNN2StableDiffusionPipeline(
609
+ text_encoder=RKNN2Model(os.path.join(args.i, "text_encoder")),
610
+ unet=RKNN2Model(os.path.join(args.i, "unet")),
611
+ vae_decoder=RKNN2Model(os.path.join(args.i, "vae_decoder")),
612
+ scheduler=user_specified_scheduler,
613
+ tokenizer=CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16"),
614
+ )
615
+
616
+ logger.info("Beginning image generation.")
617
+ image = pipe(
618
+ prompt=args.prompt,
619
+ height=int(args.size.split("x")[0]),
620
+ width=int(args.size.split("x")[1]),
621
+ num_inference_steps=args.num_inference_steps,
622
+ guidance_scale=args.guidance_scale,
623
+ generator=np.random.RandomState(args.seed),
624
+ )
625
+
626
+ out_path = get_image_path(args)
627
+ logger.info(f"Saving generated image to {out_path}")
628
+ image["images"][0].save(out_path)
629
+
630
+
631
+ if __name__ == "__main__":
632
+ parser = argparse.ArgumentParser()
633
+
634
+ parser.add_argument(
635
+ "--prompt",
636
+ required=True,
637
+ help="The text prompt to be used for text-to-image generation.")
638
+ parser.add_argument(
639
+ "-i",
640
+ required=True,
641
+ help=("Path to model directory"))
642
+ parser.add_argument("-o", required=True)
643
+ parser.add_argument("--seed",
644
+ default=93,
645
+ type=int,
646
+ help="Random seed to be able to reproduce results")
647
+ parser.add_argument(
648
+ "-s",
649
+ "--size",
650
+ default="256x256",
651
+ type=str,
652
+ help="Image size")
653
+ parser.add_argument(
654
+ "--num-inference-steps",
655
+ default=4,
656
+ type=int,
657
+ help="The number of iterations the unet model will be executed throughout the reverse diffusion process")
658
+ parser.add_argument(
659
+ "--guidance-scale",
660
+ default=7.5,
661
+ type=float,
662
+ help="Controls the influence of the text prompt on sampling process (0=random images)")
663
+
664
+ args = parser.parse_args()
665
+ main(args)
run_rknn-lcm.py ADDED
@@ -0,0 +1,632 @@
1
+
2
+ import argparse
3
+ import json
4
+ import time
5
+
6
+ import PIL
7
+ from diffusers import StableDiffusionPipeline
8
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline
9
+ from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
10
+ from diffusers.schedulers import (
11
+ LCMScheduler
12
+ )
13
+
14
+ import logging
15
+
16
+ logging.basicConfig()
17
+ logger = logging.getLogger(__name__)
18
+ logger.setLevel(logging.INFO)
19
+
20
+ import numpy as np
21
+ import os
22
+
23
+ import torch # Only used for `torch.from_numpy` in `pipe.scheduler.step()`
24
+ from transformers import CLIPFeatureExtractor, CLIPTokenizer
25
+ from typing import Callable, List, Optional, Union, Tuple
26
+ from PIL import Image
27
+
28
+ from rknnlite.api import RKNNLite
29
+
30
+ class RKNN2Model:
31
+ """ Wrapper for running RKNPU2 models """
32
+
33
+ def __init__(self, model_dir):
34
+ logger.info(f"Loading {model_dir}")
35
+ start = time.time()
36
+ self.config = json.load(open(os.path.join(model_dir, "config.json")))
37
+ assert os.path.exists(model_dir) and os.path.exists(os.path.join(model_dir, "model.rknn"))
38
+ self.rknnlite = RKNNLite()
39
+ self.rknnlite.load_rknn(os.path.join(model_dir, "model.rknn"))
40
+ self.rknnlite.init_runtime(core_mask=RKNNLite.NPU_CORE_AUTO) # Multi-core will cause kernel crash
41
+ load_time = time.time() - start
42
+ logger.info(f"Done. Took {load_time:.1f} seconds.")
43
+ self.modelname = model_dir.split("/")[-1]
44
+ self.inference_time = 0
45
+
46
+ def __call__(self, **kwargs) -> List[np.ndarray]:
47
+ # np.savez(f"rknn_out/{self.modelname}_input_{self.inference_time}.npz", **kwargs)
48
+ # self.inference_time += 1
49
+ #print(kwargs)
50
+ input_list = [value for key, value in kwargs.items()]
51
+ for i, input in enumerate(input_list):
52
+ if isinstance(input, np.ndarray):
53
+ print(f"input {i} shape: {input.shape}")
54
+
55
+ results = self.rknnlite.inference(inputs=input_list, data_format='nchw')
56
+ for res in results:
57
+ print(f"output shape: {res.shape}")
58
+ return results
59
+
60
+ class RKNN2LatentConsistencyPipeline(DiffusionPipeline):
+
+     def __init__(
+         self,
+         text_encoder: RKNN2Model,
+         unet: RKNN2Model,
+         vae_decoder: RKNN2Model,
+         scheduler: LCMScheduler,
+         tokenizer: CLIPTokenizer,
+         force_zeros_for_empty_prompt: Optional[bool] = True,
+         feature_extractor: Optional[CLIPFeatureExtractor] = None,
+         text_encoder_2: Optional[RKNN2Model] = None,
+         tokenizer_2: Optional[CLIPTokenizer] = None,
+     ):
+         super().__init__()
+
+         self.register_modules(
+             tokenizer=tokenizer,
+             scheduler=scheduler,
+             feature_extractor=feature_extractor,
+         )
+         self.force_zeros_for_empty_prompt = force_zeros_for_empty_prompt
+         self.safety_checker = None
+
+         # The RKNN models are plain wrappers, so they are kept as attributes instead of registered modules.
+         self.text_encoder = text_encoder
+         self.text_encoder_2 = text_encoder_2
+         self.tokenizer_2 = tokenizer_2
+         self.unet = unet
+         self.vae_decoder = vae_decoder
+
+         VAE_DECODER_UPSAMPLE_FACTOR = 8
+         self.vae_scale_factor = VAE_DECODER_UPSAMPLE_FACTOR
+
+     @staticmethod
+     def postprocess(
+         image: np.ndarray,
+         output_type: str = "pil",
+         do_denormalize: Optional[List[bool]] = None,
+     ):
+         def numpy_to_pil(images: np.ndarray):
+             """Convert a numpy image or a batch of images to a PIL image."""
+             if images.ndim == 3:
+                 images = images[None, ...]
+             images = (images * 255).round().astype("uint8")
+             if images.shape[-1] == 1:
+                 # special case for grayscale (single channel) images
+                 pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
+             else:
+                 pil_images = [Image.fromarray(image) for image in images]
+
+             return pil_images
+
+         def denormalize(images: np.ndarray):
+             """Denormalize an image array from [-1, 1] to [0, 1]."""
+             return np.clip(images / 2 + 0.5, 0, 1)
+
+         if not isinstance(image, np.ndarray):
+             raise ValueError(
+                 f"Input for postprocessing is in incorrect format: {type(image)}. We only support np array"
+             )
+         if output_type not in ["latent", "np", "pil"]:
+             deprecation_message = (
+                 f"the output_type {output_type} is not supported and has been set to `np`. Please make sure to set"
+                 " it to one of these instead: `latent`, `np`, `pil`"
+             )
+             logger.warning(deprecation_message)
+             output_type = "np"
+
+         if output_type == "latent":
+             return image
+
+         if do_denormalize is None:
+             raise ValueError("do_denormalize is required for postprocessing")
+
+         image = np.stack(
+             [denormalize(image[i]) if do_denormalize[i] else image[i] for i in range(image.shape[0])], axis=0
+         )
+         image = image.transpose((0, 2, 3, 1))
+
+         if output_type == "pil":
+             image = numpy_to_pil(image)
+
+         return image
+
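As a quick sanity check of `denormalize` above: the VAE decoder outputs pixel values in `[-1, 1]`, and the clip-and-shift maps them onto `[0, 1]` before `numpy_to_pil` scales them to `uint8`:

```python
import numpy as np

x = np.array([-1.2, -1.0, 0.0, 1.0, 1.3])   # out-of-range values are clipped
print(np.clip(x / 2 + 0.5, 0, 1))            # [0.  0.  0.5 1.  1. ]
```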
148
+     def _encode_prompt(
+         self,
+         prompt: Union[str, List[str]],
+         num_images_per_prompt: int,
+         do_classifier_free_guidance: bool,
+         negative_prompt: Optional[Union[str, list]],
+         prompt_embeds: Optional[np.ndarray] = None,
+         negative_prompt_embeds: Optional[np.ndarray] = None,
+     ):
+         r"""
+         Encodes the prompt into text encoder hidden states.
+
+         Args:
+             prompt (`Union[str, List[str]]`):
+                 prompt to be encoded
+             num_images_per_prompt (`int`):
+                 number of images that should be generated per prompt
+             do_classifier_free_guidance (`bool`):
+                 whether to use classifier-free guidance or not
+             negative_prompt (`Optional[Union[str, list]]`):
+                 The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e.,
+                 ignored if `guidance_scale` is less than `1`).
+             prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If
+                 not provided, text embeddings will be generated from the `prompt` input argument.
+             negative_prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
+                 weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt`
+                 input argument.
+         """
+         if isinstance(prompt, str):
+             batch_size = 1
+         elif isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             batch_size = prompt_embeds.shape[0]
+
+         if prompt_embeds is None:
+             # get prompt text embeddings
+             text_inputs = self.tokenizer(
+                 prompt,
+                 padding="max_length",
+                 max_length=self.tokenizer.model_max_length,
+                 truncation=True,
+                 return_tensors="np",
+             )
+             text_input_ids = text_inputs.input_ids
+             untruncated_ids = self.tokenizer(prompt, padding="max_length", return_tensors="np").input_ids
+
+             if not np.array_equal(text_input_ids, untruncated_ids):
+                 removed_text = self.tokenizer.batch_decode(
+                     untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
+                 )
+                 logger.warning(
+                     "The following part of your input was truncated because CLIP can only handle sequences up to"
+                     f" {self.tokenizer.model_max_length} tokens: {removed_text}"
+                 )
+
+             prompt_embeds = self.text_encoder(input_ids=text_input_ids.astype(np.int32))[0]
+
+         prompt_embeds = np.repeat(prompt_embeds, num_images_per_prompt, axis=0)
+
+         # get unconditional embeddings for classifier-free guidance
+         if do_classifier_free_guidance and negative_prompt_embeds is None:
+             uncond_tokens: List[str]
+             if negative_prompt is None:
+                 uncond_tokens = [""] * batch_size
+             elif type(prompt) is not type(negative_prompt):
+                 raise TypeError(
+                     f"`negative_prompt` should be the same type as `prompt`, but got {type(negative_prompt)} !="
+                     f" {type(prompt)}."
+                 )
+             elif isinstance(negative_prompt, str):
+                 uncond_tokens = [negative_prompt] * batch_size
+             elif batch_size != len(negative_prompt):
+                 raise ValueError(
+                     f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
+                     f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
+                     " the batch size of `prompt`."
+                 )
+             else:
+                 uncond_tokens = negative_prompt
+
+             max_length = prompt_embeds.shape[1]
+             uncond_input = self.tokenizer(
+                 uncond_tokens,
+                 padding="max_length",
+                 max_length=max_length,
+                 truncation=True,
+                 return_tensors="np",
+             )
+             negative_prompt_embeds = self.text_encoder(input_ids=uncond_input.input_ids.astype(np.int32))[0]
+
+         if do_classifier_free_guidance:
+             negative_prompt_embeds = np.repeat(negative_prompt_embeds, num_images_per_prompt, axis=0)
+
+             # For classifier-free guidance, we need to do two forward passes.
+             # Here we concatenate the unconditional and text embeddings into a single batch
+             # to avoid doing two forward passes.
+             prompt_embeds = np.concatenate([negative_prompt_embeds, prompt_embeds])
+
+         return prompt_embeds
+
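The tokenizer call in `_encode_prompt` pads or truncates every prompt to CLIP's fixed context length, so the embeddings fed to the UNet always have a `(batch, 77, hidden)` shape regardless of prompt length. A standalone check, using the same tokenizer name that `main()` below loads:

```python
import numpy as np
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
enc = tokenizer(
    "Majestic mountain landscape with snow-capped peaks",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="np",
)
print(enc.input_ids.shape)                   # (1, 77)
print(enc.input_ids.astype(np.int32).dtype)  # int32, the dtype fed to the RKNN text encoder
```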
251
+     # Copied from https://github.com/huggingface/diffusers/blob/v0.17.1/src/diffusers/pipelines/stable_diffusion/pipeline_onnx_stable_diffusion.py#L217
+     def check_inputs(
+         self,
+         prompt: Union[str, List[str]],
+         height: Optional[int],
+         width: Optional[int],
+         callback_steps: int,
+         negative_prompt: Optional[str] = None,
+         prompt_embeds: Optional[np.ndarray] = None,
+         negative_prompt_embeds: Optional[np.ndarray] = None,
+     ):
+         if height % 8 != 0 or width % 8 != 0:
+             raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
+
+         if (callback_steps is None) or (
+             callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
+         ):
+             raise ValueError(
+                 f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
+                 f" {type(callback_steps)}."
+             )
+
+         if prompt is not None and prompt_embeds is not None:
+             raise ValueError(
+                 f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
+                 " only forward one of the two."
+             )
+         elif prompt is None and prompt_embeds is None:
+             raise ValueError(
+                 "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
+             )
+         elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
+             raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")
+
+         if negative_prompt is not None and negative_prompt_embeds is not None:
+             raise ValueError(
+                 f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
+                 f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
+             )
+
+         if prompt_embeds is not None and negative_prompt_embeds is not None:
+             if prompt_embeds.shape != negative_prompt_embeds.shape:
+                 raise ValueError(
+                     "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
+                     f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
+                     f" {negative_prompt_embeds.shape}."
+                 )
+
+     # Adapted from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents
+     def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, generator, latents=None):
+         shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
+         if isinstance(generator, list) and len(generator) != batch_size:
+             raise ValueError(
+                 f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
+                 f" size of {batch_size}. Make sure the batch size matches the length of the generators."
+             )
+
+         if latents is None:
+             if isinstance(generator, np.random.RandomState):
+                 latents = generator.randn(*shape).astype(dtype)
+             elif isinstance(generator, torch.Generator):
+                 latents = torch.randn(*shape, generator=generator).numpy().astype(dtype)
+             else:
+                 raise ValueError(
+                     f"Expected `generator` to be of type `np.random.RandomState` or `torch.Generator`, but got"
+                     f" {type(generator)}."
+                 )
+         elif latents.shape != shape:
+             raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {shape}")
+
+         # scale the initial noise by the standard deviation required by the scheduler
+         latents = latents * np.float64(self.scheduler.init_noise_sigma)
+
+         return latents
+
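For the default 384x384 run from the README, `prepare_latents` produces a `(1, 4, 48, 48)` tensor: the SD 1.5 UNet works on 4 latent channels and the VAE downsamples by a factor of 8. A minimal sketch of that arithmetic (assuming `init_noise_sigma` is 1.0, as it is for `LCMScheduler`, so the final scaling is effectively a no-op):

```python
import numpy as np

batch_size, num_channels_latents = 1, 4   # SD 1.5 UNet latent channels
height = width = 384
vae_scale_factor = 8
shape = (batch_size, num_channels_latents, height // vae_scale_factor, width // vae_scale_factor)
print(shape)                               # (1, 4, 48, 48)

latents = np.random.RandomState(93).randn(*shape).astype(np.float32)
latents = latents * np.float64(1.0)        # init_noise_sigma for LCMScheduler
```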
326
+     # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
+     def __call__(
+         self,
+         prompt: Union[str, List[str]] = "",
+         height: Optional[int] = None,
+         width: Optional[int] = None,
+         num_inference_steps: int = 4,
+         original_inference_steps: int = None,
+         guidance_scale: float = 8.5,
+         num_images_per_prompt: int = 1,
+         generator: Optional[Union[np.random.RandomState, torch.Generator]] = None,
+         latents: Optional[np.ndarray] = None,
+         prompt_embeds: Optional[np.ndarray] = None,
+         output_type: str = "pil",
+         return_dict: bool = True,
+         callback: Optional[Callable[[int, int, np.ndarray], None]] = None,
+         callback_steps: int = 1,
+     ):
+         r"""
+         Function invoked when calling the pipeline for generation.
+
+         Args:
+             prompt (`Optional[Union[str, List[str]]]`, defaults to `""`):
+                 The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
+                 instead.
+             height (`Optional[int]`, defaults to `None`):
+                 The height in pixels of the generated image.
+             width (`Optional[int]`, defaults to `None`):
+                 The width in pixels of the generated image.
+             num_inference_steps (`int`, defaults to 4):
+                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+                 expense of slower inference.
+             guidance_scale (`float`, defaults to 8.5):
+                 Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
+                 `guidance_scale` is defined as `w` of equation 2. of [Imagen
+                 Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
+                 1`. A higher guidance scale encourages images that are closely linked to the text `prompt`, usually
+                 at the expense of lower image quality.
+             num_images_per_prompt (`int`, defaults to 1):
+                 The number of images to generate per prompt.
+             generator (`Optional[Union[np.random.RandomState, torch.Generator]]`, defaults to `None`):
+                 A random number generator used to make generation deterministic.
+             latents (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+                 generation. Can be used to tweak the same generation with different prompts. If not provided, a
+                 latents tensor will be generated by sampling using the supplied random `generator`.
+             prompt_embeds (`Optional[np.ndarray]`, defaults to `None`):
+                 Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If
+                 not provided, text embeddings will be generated from the `prompt` input argument.
+             output_type (`str`, defaults to `"pil"`):
+                 The output format of the generated image. Choose between
+                 [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+             return_dict (`bool`, defaults to `True`):
+                 Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
+                 plain tuple.
+             callback (`Optional[Callable]`, defaults to `None`):
+                 A function that will be called every `callback_steps` steps during inference. The function will be
+                 called with the following arguments: `callback(step: int, timestep: int, latents: np.ndarray)`.
+             callback_steps (`int`, defaults to 1):
+                 The frequency at which the `callback` function will be called. If not specified, the callback will
+                 be called at every step.
+
+         Returns:
+             [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
+                 [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a
+                 plain `tuple`. When returning a tuple, the first element is a list with the generated images, and
+                 the second element is a list of `bool`s denoting whether the corresponding generated image likely
+                 represents "not-safe-for-work" (nsfw) content, according to the `safety_checker`.
+         """
400
+         height = height or self.unet.config["sample_size"] * self.vae_scale_factor
+         width = width or self.unet.config["sample_size"] * self.vae_scale_factor
+
+         # Don't need to get negative prompts due to LCM guided distillation
+         negative_prompt = None
+         negative_prompt_embeds = None
+
+         # check inputs. Raise error if not correct
+         self.check_inputs(
+             prompt, height, width, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds
+         )
+
+         # define call parameters
+         if isinstance(prompt, str):
+             batch_size = 1
+         elif isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             batch_size = prompt_embeds.shape[0]
+
+         if generator is None:
+             generator = np.random.RandomState()
+
+         start_time = time.time()
+         prompt_embeds = self._encode_prompt(
+             prompt,
+             num_images_per_prompt,
+             False,
+             negative_prompt,
+             prompt_embeds=prompt_embeds,
+             negative_prompt_embeds=negative_prompt_embeds,
+         )
+         encode_prompt_time = time.time() - start_time
+         print(f"Prompt encoding time: {encode_prompt_time:.2f}s")
+
+         # set timesteps
+         self.scheduler.set_timesteps(num_inference_steps, original_inference_steps=original_inference_steps)
+         timesteps = self.scheduler.timesteps
+
+         latents = self.prepare_latents(
+             batch_size * num_images_per_prompt,
+             self.unet.config["in_channels"],
+             height,
+             width,
+             prompt_embeds.dtype,
+             generator,
+             latents,
+         )
+
+         bs = batch_size * num_images_per_prompt
+         # get Guidance Scale Embedding
+         w = np.full(bs, guidance_scale - 1, dtype=prompt_embeds.dtype)
+         w_embedding = self.get_guidance_scale_embedding(
+             w, embedding_dim=self.unet.config["time_cond_proj_dim"], dtype=prompt_embeds.dtype
+         )
+
+         # Adapted from diffusers to extend it for other runtimes than ORT
+         timestep_dtype = np.int64
+
+         num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
+         inference_start = time.time()
+         for i, t in enumerate(self.progress_bar(timesteps)):
+             timestep = np.array([t], dtype=timestep_dtype)
+             noise_pred = self.unet(
+                 sample=latents,
+                 timestep=timestep,
+                 encoder_hidden_states=prompt_embeds,
+                 timestep_cond=w_embedding,
+             )[0]
+
+             # compute the previous noisy sample x_t -> x_t-1
+             latents, denoised = self.scheduler.step(
+                 torch.from_numpy(noise_pred), t, torch.from_numpy(latents), return_dict=False
+             )
+             latents, denoised = latents.numpy(), denoised.numpy()
+
+             # call the callback, if provided
+             if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
+                 if callback is not None and i % callback_steps == 0:
+                     callback(i, t, latents)
+         inference_time = time.time() - inference_start
+         print(f"Inference time: {inference_time:.2f}s")
+
+         decode_start = time.time()
+         if output_type == "latent":
+             image = denoised
+             has_nsfw_concept = None
+         else:
+             denoised /= self.vae_decoder.config["scaling_factor"]
+             # The half-precision VAE decoder produces strange results for batch sizes > 1, so decode one latent at a time
+             image = np.concatenate(
+                 [self.vae_decoder(latent_sample=denoised[i : i + 1])[0] for i in range(denoised.shape[0])]
+             )
+             # image, has_nsfw_concept = self.run_safety_checker(image)
+             has_nsfw_concept = None  # skip safety checker
+
+         if has_nsfw_concept is None:
+             do_denormalize = [True] * image.shape[0]
+         else:
+             do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
+
+         image = self.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
+         decode_time = time.time() - decode_start
+         print(f"Decode time: {decode_time:.2f}s")
+
+         total_time = encode_prompt_time + inference_time + decode_time
+         print(f"Total time: {total_time:.2f}s")
+
+         if not return_dict:
+             return (image, has_nsfw_concept)
+
+         return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
+
+
514
+     # Adapted from https://github.com/huggingface/diffusers/blob/v0.22.0/src/diffusers/pipelines/latent_consistency/pipeline_latent_consistency.py#L264
+     def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=None):
+         """
+         See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298
+
+         Args:
+             w (`np.ndarray`):
+                 Guidance scale conditioning values to generate embedding vectors for.
+             embedding_dim (`int`, *optional*, defaults to 512):
+                 dimension of the embeddings to generate
+             dtype:
+                 data type of the generated embeddings
+
+         Returns:
+             `np.ndarray`: Embedding vectors with shape `(len(w), embedding_dim)`
+         """
+         w = w * 1000
+         half_dim = embedding_dim // 2
+         emb = np.log(10000.0) / (half_dim - 1)
+         emb = np.exp(np.arange(half_dim, dtype=dtype) * -emb)
+         emb = w[:, None] * emb[None, :]
+         emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
+
+         if embedding_dim % 2 == 1:  # zero pad
+             emb = np.pad(emb, [(0, 0), (0, 1)])
+
+         assert emb.shape == (w.shape[0], embedding_dim)
+         return emb
+
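As a worked example of the embedding above: with the default `guidance_scale` of 8.5, the conditioning scalar is `w = 7.5`. LCM UNets typically use a `time_cond_proj_dim` of 256 (an assumption; the actual value comes from the converted UNet's `config.json`), which gives a `(1, 256)` embedding whose entries stay inside `[-1, 1]`. A minimal standalone sketch:

```python
import numpy as np

def guidance_embedding(w, embedding_dim=256, dtype=np.float32):
    # Standalone re-computation of the sinusoidal guidance embedding, for a quick shape check.
    w = np.asarray(w, dtype=dtype) * 1000
    half_dim = embedding_dim // 2
    emb = np.log(10000.0) / (half_dim - 1)
    emb = np.exp(np.arange(half_dim, dtype=dtype) * -emb)
    emb = w[:, None] * emb[None, :]
    emb = np.concatenate([np.sin(emb), np.cos(emb)], axis=1)
    if embedding_dim % 2 == 1:
        emb = np.pad(emb, [(0, 0), (0, 1)])
    return emb

emb = guidance_embedding(np.array([8.5 - 1]))  # w = guidance_scale - 1
print(emb.shape)                                # (1, 256)
print(float(emb.min()), float(emb.max()))       # both within [-1, 1]
```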
543
+ def get_image_path(args, **override_kwargs):
+     """Make the output folder and encode metadata in the filename."""
+     out_folder = os.path.join(args.o, "_".join(args.prompt.replace("/", "_").rsplit(" ")))
+     os.makedirs(out_folder, exist_ok=True)
+
+     out_fname = f"randomSeed_{override_kwargs.get('seed', None) or args.seed}"
+     out_fname += "_LCM"
+     out_fname += f"_numInferenceSteps{override_kwargs.get('num_inference_steps', None) or args.num_inference_steps}"
+
+     return os.path.join(out_folder, out_fname + ".png")
+
+
+ def prepare_controlnet_cond(image_path, height, width):
+     image = Image.open(image_path).convert("RGB")
+     # PIL's resize expects (width, height)
+     image = image.resize((width, height), resample=Image.LANCZOS)
+     image = np.array(image).transpose(2, 0, 1) / 255.0
+     return image
+
+
564
+ def main(args):
+     logger.info(f"Setting random seed to {args.seed}")
+
+     # load the scheduler from <model_dir>/scheduler/scheduler_config.json
+     scheduler_config_path = os.path.join(args.i, "scheduler/scheduler_config.json")
+     with open(scheduler_config_path, "r") as f:
+         scheduler_config = json.load(f)
+     user_specified_scheduler = LCMScheduler.from_config(scheduler_config)
+
+     print("user_specified_scheduler", user_specified_scheduler)
+
+     pipe = RKNN2LatentConsistencyPipeline(
+         text_encoder=RKNN2Model(os.path.join(args.i, "text_encoder")),
+         unet=RKNN2Model(os.path.join(args.i, "unet")),
+         vae_decoder=RKNN2Model(os.path.join(args.i, "vae_decoder")),
+         scheduler=user_specified_scheduler,
+         tokenizer=CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16"),
+     )
+
+     logger.info("Beginning image generation.")
+     image = pipe(
+         prompt=args.prompt,
+         height=int(args.size.split("x")[0]),
+         width=int(args.size.split("x")[1]),
+         num_inference_steps=args.num_inference_steps,
+         guidance_scale=args.guidance_scale,
+         generator=np.random.RandomState(args.seed),
+     )
+
+     out_path = get_image_path(args)
+     logger.info(f"Saving generated image to {out_path}")
+     image.images[0].save(out_path)
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+
+     parser.add_argument(
+         "--prompt",
+         required=True,
+         help="The text prompt to be used for text-to-image generation.")
+     parser.add_argument(
+         "-i",
+         required=True,
+         help="Path to the model directory")
+     parser.add_argument(
+         "-o",
+         required=True,
+         help="Path to the output directory for generated images")
+     parser.add_argument(
+         "--seed",
+         default=93,
+         type=int,
+         help="Random seed to be able to reproduce results")
+     parser.add_argument(
+         "-s",
+         "--size",
+         default="256x256",
+         type=str,
+         help="Output image size as HEIGHTxWIDTH, e.g. 384x384")
+     parser.add_argument(
+         "--num-inference-steps",
+         default=4,
+         type=int,
+         help="The number of iterations the unet model will be executed throughout the reverse diffusion process")
+     parser.add_argument(
+         "--guidance-scale",
+         default=7.5,
+         type=float,
+         help="Controls the influence of the text prompt on the sampling process (0 = random images)")
+
+     args = parser.parse_args()
+     main(args)