license: mit
pipeline_tag: text-to-image
tags:
- diffusion
- efficient
- quantization
- StableDiffusionXLPipeline
- Diffusers
base_model:
- stabilityai/sdxl-turbo
MixDQ Model Card
Model Description
MixDQ is a mixed-precision quantization method that reduces the memory and computational cost of text-to-image diffusion models while preserving generation quality. It supports few-step diffusion models (e.g., SDXL-turbo, LCM-LoRA), making it possible to build diffusion models that are both fast and small. An efficient CUDA kernel implementation is provided for practical resource savings.
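To give a feel for why bit-width matters, here is a minimal sketch of generic symmetric uniform quantization in plain Python. It is not MixDQ's actual algorithm or kernel; the function names and the example weights are made up for illustration. It only shows that rounding error grows as the bit-width shrinks, which is the usual motivation for giving sensitive layers more bits in a mixed-precision scheme.

```python
def fake_quantize(values, n_bits):
    """Quantize to signed n_bits integers, then map back to floats.

    Generic symmetric uniform quantization (illustrative only,
    not the MixDQ implementation).
    """
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]


def max_abs_error(values, n_bits):
    """Largest absolute difference between original and quantized values."""
    approx = fake_quantize(values, n_bits)
    return max(abs(a - b) for a, b in zip(values, approx))


# Hypothetical weights, chosen only for the demonstration.
weights = [0.82, -0.31, 0.05, -0.77, 0.12, 0.44]

for bits in (8, 5, 4):
    print(f"{bits}-bit max abs error: {max_abs_error(weights, bits):.5f}")
```

The error is bounded by half the quantization step, so halving the number of levels roughly doubles the worst-case error.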
Model Sources
For more information, please refer to:
- Project Page: https://a-suozhang.xyz/mixdq.github.io/
- arXiv paper: https://arxiv.org/abs/2405.17873
- GitHub Repository: https://github.com/A-suozhang/MixDQ
Evaluation
We evaluate the MixDQ model using several metrics: FID (fidelity), CLIPScore (image-text alignment), and ImageReward (human preference). At W8A8, MixDQ shows almost no loss on these metrics relative to FP16: FID is slightly better, while CLIPScore and ImageReward drop slightly. The differences between images generated by MixDQ and those generated by FP16 models are negligible.
Method | FID (↓) | CLIPScore (↑) | ImageReward (↑) |
---|---|---|---|
FP16 | 17.15 | 0.2722 | 0.8631 |
MixDQ-W8A8 | 17.03 | 0.2703 | 0.8415 |
MixDQ-W5A8 | 17.23 | 0.2697 | 0.8307 |
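As a quick way to read the table, the following sketch computes each quantized variant's relative change against the FP16 baseline. The numbers are copied from the table above; the dictionary layout and the `relative_change` helper are only illustrative.

```python
# Metric values copied from the evaluation table above.
results = {
    "FP16":       {"FID": 17.15, "CLIPScore": 0.2722, "ImageReward": 0.8631},
    "MixDQ-W8A8": {"FID": 17.03, "CLIPScore": 0.2703, "ImageReward": 0.8415},
    "MixDQ-W5A8": {"FID": 17.23, "CLIPScore": 0.2697, "ImageReward": 0.8307},
}


def relative_change(method, metric, baseline="FP16"):
    """Percentage change of `method` relative to the baseline for one metric."""
    base = results[baseline][metric]
    return (results[method][metric] - base) / base * 100.0


for method in ("MixDQ-W8A8", "MixDQ-W5A8"):
    for metric in ("FID", "CLIPScore", "ImageReward"):
        print(f"{method} {metric}: {relative_change(method, metric):+.2f}%")
```

Keep in mind that lower is better for FID, while higher is better for CLIPScore and ImageReward, so a negative change in FID is an improvement.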
Usage
Install the prerequisite for MixDQ:
# Python versions supported by mixdq: 3.8, 3.9, 3.10
pip install -i https://pypi.org/simple/ mixdq-extension
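Because the extension targets specific Python versions, it can help to check the interpreter before installing. The snippet below is a small sketch using only the standard library; the `is_supported` helper is a hypothetical name, and the version list is taken from the comment above.

```python
import sys

# Versions listed in the install note above.
SUPPORTED_VERSIONS = {(3, 8), (3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version is in the supported set."""
    return (version_info[0], version_info[1]) in SUPPORTED_VERSIONS


if is_supported():
    print("This Python version is listed as supported.")
else:
    print(
        f"Python {sys.version_info[0]}.{sys.version_info[1]} is not in the "
        "supported list (3.8, 3.9, 3.10)."
    )
```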
Run the pipeline:
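import torch
from diffusers import DiffusionPipeline

# Load SDXL-turbo in FP16 with the MixDQ custom pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/sdxl-turbo", custom_pipeline="nics-efc/MixDQ",
    torch_dtype=torch.float16, variant="fp16"
)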
# Quantize the UNet
pipe.quantize_unet(
    w_bit=8,  # weight bit-width
    a_bit=8,  # activation bit-width
    bos=True,
)
# set_cuda_graph is optional and is used for acceleration
pipe.set_cuda_graph(
    run_pipeline=True,
)
# Test the memory usage and latency of the pipeline or the UNet
pipe.run_for_test(
    device="cuda",
    output_type="pil",
    run_pipeline=True,
    path="pipeline_test.png",
    profile=True
)
'''
After execution finishes, a report in JSON format is written under the log/sdxl folder.
The report can be opened with TensorBoard to examine the profiling results:
tensorboard --logdir=./log
'''
# Run the pipeline
pipe = pipe.to("cuda")
prompts = "A black Honda motorcycle parked in front of a garage."
image = pipe(prompts, num_inference_steps=1, guidance_scale=0.0).images[0]
image.save("mixdq_pipeline.png")
Performance measured on an NVIDIA 4080:
UNet Latency (ms) | No CUDA Graph | With CUDA Graph |
---|---|---|
FP16 version | 44.6 | 36.1 |
Quantized version | 59.1 | 24.9 |
Speedup (FP16 / quantized) | 0.75× | 1.45× |
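The speedup row can be reproduced from the latencies above, assuming speedup is defined as FP16 latency divided by quantized latency. The short sketch below does that arithmetic; the dictionary keys and the `speedup` helper are illustrative names, and the numbers are copied from the table.

```python
# UNet latencies in milliseconds, copied from the table above.
latency_ms = {
    "no_cuda_graph": {"fp16": 44.6, "quantized": 59.1},
    "cuda_graph":    {"fp16": 36.1, "quantized": 24.9},
}


def speedup(fp16_ms, quantized_ms):
    """Speedup of the quantized UNet over FP16 (values above 1 mean faster)."""
    return fp16_ms / quantized_ms


for setting, t in latency_ms.items():
    print(f"{setting}: {speedup(t['fp16'], t['quantized']):.2f}x")
```

A value below 1 means the quantized UNet is slower than FP16 in that setting, which is the case here without CUDA Graph; with CUDA Graph enabled, the quantized version is the faster one.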