arxiv:2406.08478

What If We Recaption Billions of Web Images with LLaMA-3?

Published on Jun 12 · Submitted by akhaliq on Jun 13

Abstract

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching the textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this gap with a community effort, leveraging the powerful and open-source LLaMA-3, a GPT-4-level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B-powered LLaVA-1.5, and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially when following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/.
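To make the pipeline concrete, here is a minimal recaptioning sketch. It uses the public LLaVA-1.5 checkpoint on the Hugging Face Hub (`llava-hf/llava-1.5-7b-hf`) as a stand-in and a placeholder image path; the paper's actual model is a fine-tuned LLaMA-3-8B-powered LLaVA-1.5, so the real checkpoint and prompt format may differ.

```python
# Sketch of per-image recaptioning with a LLaVA-style captioner.
# Assumptions: llava-hf/llava-1.5-7b-hf as a stand-in model, "example.jpg"
# as a placeholder image; not the paper's released checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # stand-in caption model (assumption)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)

# LLaVA-1.5 chat format: the <image> token marks where visual features go.
prompt = "USER: <image>\nPlease describe this image in detail.\nASSISTANT:"
image = Image.open("example.jpg")  # placeholder path

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode the full sequence and keep only the generated caption.
text = processor.decode(output_ids[0], skip_special_tokens=True)
caption = text.split("ASSISTANT:")[-1].strip()
print(caption)
```

At the scale reported in the paper (1.3 billion DataComp-1B images), this loop would of course be batched and sharded across many GPUs; the sketch only illustrates the per-image inference step.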

Community

Dear authors,

Thank you for your excellent work and the detailed analysis of the recaptioned datasets. I particularly appreciated your insights on the recaptioning pipeline and the training process for CLIP. I noticed that your work closely relates to the VeCLIP paper, which might be of interest to you. You can find our paper at https://arxiv.org/abs/2310.07699 and our code at https://github.pie.apple.com/aiml-oss/ml-veclip. Thanks!
