[Wenxuan Wang](https://scholar.google.com/citations?user=75OyC-oAAAAJ&hl=zh-CN)<sup>1,2,3*</sup>, [Quan Sun](https://scholar.google.cz/citations?user=pVKiHdEAAAAJ&hl=zh-CN&oi=ao)<sup>3*</sup>, [Fan Zhang](https://scholar.google.cz/citations?hl=zh-CN&user=VsJ39HMAAAAJ&view_op=list_works&sortby=pubdate)<sup>3</sup>, [Yepeng Tang](https://scholar.google.cz/citations?user=CAC_4OUAAAAJ&hl=zh-CN&oi=ao)<sup>4</sup>, [Jing Liu](https://scholar.google.com/citations?user=sOI-S7oAAAAJ&hl=zh-CN)<sup>1,2</sup>, [Xinlong Wang](https://scholar.google.com/citations?hl=zh-CN&user=DPz0DjYAAAAJ&view_op=list_works&sortby=pubdate/)<sup>3</sup>

<sup>1</sup>[CASIA](http://english.ia.cas.cn/), <sup>2</sup>[UCAS](https://english.ucas.ac.cn/), <sup>3</sup>[BAAI](https://www.baai.ac.cn/english.html), <sup>4</sup>[BJTU](https://en.bjtu.edu.cn/)

<sup>*</sup> Equal Contribution

| [Paper](https://arxiv.org/abs/2407.20171) | [Code](https://github.com/baaivision/DIVA) |

In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., 3-7% ↑), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.

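
The idea of generative feedback can be illustrated with a minimal PyTorch sketch. It is not the DIVA implementation: `ToyVisualEncoder` and `ToyDenoiser` are hypothetical stand-ins for the CLIP image encoder and a pretrained diffusion model, the noise schedule is a simplified assumption, and the loss is the standard noise-prediction objective. What the sketch shows is the training signal only: the denoiser is frozen, it is conditioned on the encoder's image features, and the denoising loss on unlabeled images is backpropagated into the encoder alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class ToyVisualEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP image encoder (the trainable part)."""

    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, dim))

    def forward(self, images):
        return self.net(images)


class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for a frozen, pretrained conditional denoiser."""

    def __init__(self, dim=16):
        super().__init__()
        self.cond = nn.Linear(dim + 1, 3 * 8 * 8)
        self.conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy, t, condition):
        # Inject the visual condition and the normalized timestep additively.
        c = self.cond(torch.cat([condition, t[:, None]], dim=1)).view_as(noisy)
        return self.conv(noisy + c)


def diffusion_feedback_loss(encoder, denoiser, images, num_steps=1000):
    """Noise-prediction loss conditioned on the encoder's image features."""
    b = images.shape[0]
    t = torch.randint(0, num_steps, (b,))
    # Simplified schedule (an assumption): alpha_bar decreases linearly in t.
    alpha_bar = (1.0 - (t.float() + 1) / (num_steps + 1)).view(b, 1, 1, 1)
    noise = torch.randn_like(images)
    noisy = alpha_bar.sqrt() * images + (1 - alpha_bar).sqrt() * noise
    condition = encoder(images)  # gradients flow back through this call
    pred = denoiser(noisy, t.float() / num_steps, condition)
    return F.mse_loss(pred, noise)


encoder = ToyVisualEncoder()
denoiser = ToyDenoiser()
for p in denoiser.parameters():
    p.requires_grad_(False)  # the diffusion model stays frozen

optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-2)
images = torch.rand(4, 3, 8, 8)  # unlabeled images, no paired text

# Snapshots used only to check which weights an update touches.
denoiser_before = [p.detach().clone() for p in denoiser.parameters()]
encoder_before = [p.detach().clone() for p in encoder.parameters()]

loss = diffusion_feedback_loss(encoder, denoiser, images)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After one step, only the encoder's weights have changed. In a real setup the toy modules would be replaced by an actual CLIP vision tower and a pretrained text-to-image diffusion model, and how the visual features are fed in as the condition is a design choice the paper describes.
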
## Model Zoo
| Method | Image Size | Params (M) | Average Score |
|----------------------|------------|------------|---------------|
| [OpenAI ViT-L-14]() | 224² | 427.6 | 25.9 (+6.6) |
| [OpenAI ViT-L-14]() | 336² | 427.9 | 25.2 (+5.2) |
| [MetaCLIP ViT-L-14]() | 224² | 427.6 | 27.4 (+3.7) |
| [MetaCLIP ViT-H-14]() | 224² | 986.1 | 31.9 (+6.7) |
| [SigLIP ViT-SO-14]() | 224² | 877.4 | 40.7 (+2.9) |
| [SigLIP ViT-SO-14]() | 384² | 878.0 | 38.5 (+1.5) |
| [DFN ViT-H-14]() | 224² | 986.1 | 43.7 (+4.4) |
| [DFN ViT-H-14]() | 378² | 986.7 | 37.8 (+3.0) |
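
As a reading aid for the table, the short Python sketch below splits each `score (+gain)` cell and derives the score before DIVA post-training. It assumes the parenthesized value is the absolute gain over the original checkpoint, which the table does not state explicitly. The helper `split_score` is a hypothetical name introduced for this example.

```python
import re

# Rows copied from the Model Zoo table: (method, image size, "score (+gain)").
rows = [
    ("OpenAI ViT-L-14", 224, "25.9 (+6.6)"),
    ("OpenAI ViT-L-14", 336, "25.2 (+5.2)"),
    ("MetaCLIP ViT-L-14", 224, "27.4 (+3.7)"),
    ("MetaCLIP ViT-H-14", 224, "31.9 (+6.7)"),
    ("SigLIP ViT-SO-14", 224, "40.7 (+2.9)"),
    ("SigLIP ViT-SO-14", 384, "38.5 (+1.5)"),
    ("DFN ViT-H-14", 224, "43.7 (+4.4)"),
    ("DFN ViT-H-14", 378, "37.8 (+3.0)"),
]


def split_score(cell):
    """Return (score, gain, baseline) for a cell such as '25.9 (+6.6)'.

    Assumption: the gain is absolute, so baseline = score - gain.
    """
    match = re.fullmatch(r"\s*([\d.]+)\s*\(\+([\d.]+)\)\s*", cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    score, gain = float(match.group(1)), float(match.group(2))
    # Round to one decimal to avoid floating-point noise in the difference.
    return score, gain, round(score - gain, 1)


for method, size, cell in rows:
    score, gain, baseline = split_score(cell)
    print(f"{method:<18} {size}px  baseline {baseline:5.1f} -> DIVA {score:5.1f}")
```
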
## 📝 Citation
If you find **DIVA** helpful for your research, please consider ***citing*** 📝 our paper and giving us a GitHub ***star*** ⭐:
```bib
@article{wang2024diffusion,
title={Diffusion Feedback Helps CLIP See Better},
author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
journal={arXiv preprint arXiv:2407.20171},
year={2024}
}
```