license: apache-2.0

In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., by 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.
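
The idea of generative feedback can be illustrated with a toy training step. The sketch below is not the DIVA implementation: `ToyEncoder`, `ToyDenoiser`, `diffusion_feedback_step`, the noise schedule, and all sizes are hypothetical stand-ins chosen for illustration. It assumes the image encoder's features condition a frozen denoiser, and that the denoising loss is backpropagated only into the encoder, using images alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP image encoder (the trainable part)."""

    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, dim))

    def forward(self, images):
        return self.net(images)


class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for a frozen denoiser conditioned on encoder features."""

    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(3 * 8 * 8 + dim + 1, 3 * 8 * 8)

    def forward(self, noisy, t, cond):
        x = torch.cat([noisy.flatten(1), cond, t[:, None]], dim=1)
        return self.net(x).view_as(noisy)


def diffusion_feedback_step(encoder, denoiser, optimizer, images):
    """One update: the noise-prediction loss flows back into the encoder only."""
    noise = torch.randn_like(images)
    t = torch.rand(images.shape[0])  # toy continuous timestep in [0, 1)
    alpha = (1.0 - t).view(-1, 1, 1, 1)  # toy schedule, an assumption
    noisy = alpha.sqrt() * images + (1.0 - alpha).sqrt() * noise
    cond = encoder(images)  # image features act as the condition
    pred = denoiser(noisy, t, cond)
    loss = nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


encoder = ToyEncoder()
denoiser = ToyDenoiser()
for p in denoiser.parameters():
    p.requires_grad_(False)  # the generative model stays frozen

optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-2)
images = torch.randn(4, 3, 8, 8)  # images only, no paired text
loss = diffusion_feedback_step(encoder, denoiser, optimizer, images)
print(f"denoising loss: {loss:.4f}")
```

Because the denoiser's weights never change, the only way to reduce the loss is for the encoder to produce features that carry more of the visual detail needed to reconstruct the image, which is the intuition behind using a diffusion model as feedback.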

Model Zoo

| Method | Image Size | Params (M) | Average Score |
|---|---|---|---|
| OpenAI ViT-L-14 | 224² | 427.6 | 25.9 (+6.6) |
| OpenAI ViT-L-14 | 336² | 427.9 | 25.2 (+5.2) |
| MetaCLIP ViT-L-14 | 224² | 427.6 | 27.4 (+3.7) |
| MetaCLIP ViT-H-14 | 224² | 986.1 | 31.9 (+6.7) |
| SigLIP ViT-SO-14 | 224² | 877.4 | 40.7 (+2.9) |
| SigLIP ViT-SO-14 | 384² | 878.0 | 38.5 (+1.5) |
| DFN ViT-H-14 | 224² | 986.1 | 43.7 (+4.4) |
| DFN ViT-H-14 | 378² | 986.7 | 37.8 (+3.0) |
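
To read the table, it can help to recover the score each model had before DIVA. The short sketch below assumes that the value in parentheses is the absolute gain over the corresponding original model, so the baseline is the score minus the gain. The `MODEL_ZOO` list and the `baseline` helper are illustrative names, not part of the DIVA code.

```python
# (method, image size, average score with DIVA, reported gain)
MODEL_ZOO = [
    ("OpenAI ViT-L-14", 224, 25.9, 6.6),
    ("OpenAI ViT-L-14", 336, 25.2, 5.2),
    ("MetaCLIP ViT-L-14", 224, 27.4, 3.7),
    ("MetaCLIP ViT-H-14", 224, 31.9, 6.7),
    ("SigLIP ViT-SO-14", 224, 40.7, 2.9),
    ("SigLIP ViT-SO-14", 384, 38.5, 1.5),
    ("DFN ViT-H-14", 224, 43.7, 4.4),
    ("DFN ViT-H-14", 378, 37.8, 3.0),
]


def baseline(score, gain):
    """Score before DIVA, assuming the gain is an absolute difference."""
    return round(score - gain, 1)


baselines = [baseline(score, gain) for _, _, score, gain in MODEL_ZOO]
mean_gain = sum(gain for _, _, _, gain in MODEL_ZOO) / len(MODEL_ZOO)

for (method, size, score, gain), base in zip(MODEL_ZOO, baselines):
    print(f"{method} @ {size}: {base} -> {score} (+{gain})")
print(f"mean gain: {mean_gain:.2f}")
```

Under that assumption, the gains listed here range from 1.5 to 6.7 points and average about 4.25 points across the eight checkpoints.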

📝 Citation

If you find DIVA helpful for your research, please consider citing 📝 our paper and giving us a GitHub star ⭐:

@article{wang2024diffusion,
      title={Diffusion Feedback Helps CLIP See Better},
      author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
      journal={arXiv preprint arXiv:2407.20171},
      year={2024}
}