---
license: mit
pipeline_tag: text-to-image
tags:
- diffusion
- efficient
- quantization
- StableDiffusionXLPipeline
- Diffusers
base_model:
- stabilityai/sdxl-turbo
---

# MixDQ Model Card

## Model Description

MixDQ is a mixed-precision quantization method that compresses the memory and computational cost of text-to-image diffusion models while preserving generation quality.
It supports few-step diffusion models (e.g., SDXL-Turbo, LCM-LoRA), making it possible to build diffusion models that are both fast and small. An efficient CUDA kernel implementation is provided for practical resource savings.

<img src="https://github.com/A-suozhang/MyPicBed/raw/master/img/mixdq_model_card_0.jpg" width="600">
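
For intuition, "W8A8" means both weights (W) and activations (A) are quantized to 8 bits. The snippet below is a minimal, generic sketch of symmetric per-tensor int8 quantization in PyTorch, shown only to illustrate the idea; it is not MixDQ's actual implementation, which additionally assigns mixed bit-widths across layers.

```python
import torch

def quantize_int8(x: torch.Tensor):
    # symmetric per-tensor quantization: x ≈ scale * q, with q stored as int8
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(1024, 1024)  # e.g., a UNet weight matrix (fp32 for the sketch)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())  # reconstruction error stays small
```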


## Model Sources

For more information, please refer to:

- Project Page: [https://a-suozhang.xyz/mixdq.github.io/](https://a-suozhang.xyz/mixdq.github.io/)
- arXiv Paper: [https://arxiv.org/abs/2405.17873](https://arxiv.org/abs/2405.17873)
- GitHub Repository: [https://github.com/A-suozhang/MixDQ](https://github.com/A-suozhang/MixDQ)

## Evaluation

We evaluate MixDQ with multiple metrics: FID (image fidelity), CLIPScore (image-text alignment), and ImageReward (human preference). MixDQ achieves W8A8 quantization without performance loss; the differences between images generated by MixDQ and those generated by the FP16 model are negligible.

| Method     | FID (↓) | CLIPScore (↑) | ImageReward (↑) |
|------------|---------|---------------|-----------------|
| FP16       | 17.15   | 0.2722    | 0.8631      |
| MixDQ-W8A8 | 17.03   | 0.2703    | 0.8415      |
| MixDQ-W5A8 | 17.23   | 0.2697    | 0.8307      |
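
To run a CLIPScore-style check on your own generations, one option is the `torchmetrics` implementation. This is a hedged sketch, not the authors' exact evaluation pipeline; the CLIP backbone and image sizes here are arbitrary choices.

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

# CLIPScore rates image-text alignment with a pretrained CLIP model
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# stand-in images: a list of uint8 (C, H, W) tensors; replace with real generations
images = [torch.randint(0, 255, (3, 512, 512), dtype=torch.uint8) for _ in range(2)]
prompts = ["a black motorcycle parked by a garage", "a red sports car on a road"]

print(metric(images, prompts))  # higher is better
```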

## Usage


Install the prerequisites for MixDQ:
```shell
# Python versions required to run mixdq: 3.8, 3.9, 3.10
pip install -i https://pypi.org/simple/ mixdq-extension
```

Run the pipeline:
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/sdxl-turbo", custom_pipeline="nics-efc/MixDQ",
    torch_dtype=torch.float16, variant="fp16"
)

# quantize the UNet (8-bit weights, 8-bit activations)
pipe.quantize_unet(
    w_bit=8,
    a_bit=8,
    bos=True,
)

# set_cuda_graph is optional and enables CUDA-graph acceleration
pipe.set_cuda_graph(
    run_pipeline=True,
)

# measure the memory usage and latency of the pipeline or the UNet
pipe.run_for_test(
    device="cuda",
    output_type="pil",
    run_pipeline=True,
    path="pipeline_test.png",
    profile=True,
)
# After execution finishes, a JSON report is written under the log/sdxl folder.
# Open it with TensorBoard to examine the profiling results:
#   tensorboard --logdir=./log

# run the pipeline
pipe = pipe.to("cuda")
prompt = "A black Honda motorcycle parked in front of a garage."
image = pipe(prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
image.save("mixdq_pipeline.png")
```
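
The latency numbers below can be approximated with CUDA event timing. The following is a generic sketch rather than the card's profiling tool; it assumes the quantized `pipe` and `prompt` from the snippet above, and it times the full pipeline, while the table reports UNet-only latency, so absolute numbers will differ.

```python
import torch

def time_pipeline(pipe, prompt, n_warmup=3, n_runs=10):
    # warm-up iterations so one-time setup does not skew the measurement
    for _ in range(n_warmup):
        pipe(prompt, num_inference_steps=1, guidance_scale=0.0)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(n_runs):
        pipe(prompt, num_inference_steps=1, guidance_scale=0.0)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / n_runs  # average milliseconds per call

print(f"average latency: {time_pipeline(pipe, prompt):.1f} ms")
```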



Performance tested on an NVIDIA RTX 4080:

| UNet latency           | No CUDA Graph | With CUDA Graph |
|------------------------|---------------|-----------------|
| FP16 version (ms)      | 44.6          | 36.1            |
| Quantized version (ms) | 59.1          | 24.9            |
| Speedup (×)            | 0.75          | 1.45            |