RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
Abstract
After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning. To fully leverage these unpaired documents, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. We then design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enrich fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability. Models pre-trained on RealSyn achieve state-of-the-art performance on multiple downstream tasks. To facilitate future research, the RealSyn dataset and pre-trained model weights are released at https://github.com/deepglint/RealSyn.
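The hierarchical retrieval step can be pictured as a coarse-to-fine nearest-neighbor search over text embeddings. Below is a minimal numpy-only sketch, assuming precomputed image and sentence embeddings and a pre-clustered text corpus; the function name `retrieve_texts` and the cluster-then-rank structure are illustrative assumptions, not the released RealSyn pipeline.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize embeddings so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_texts(image_emb, text_embs, cluster_centers, cluster_ids, k=3):
    """Coarse-to-fine retrieval: pick the text cluster nearest to the image,
    then return the top-k most similar sentences within that cluster."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    cluster_centers = l2_normalize(cluster_centers)

    # Coarse step: nearest cluster centroid.
    cluster = int(np.argmax(cluster_centers @ image_emb))

    # Fine step: rank sentences inside that cluster by cosine similarity.
    in_cluster = np.where(cluster_ids == cluster)[0]
    sims = text_embs[in_cluster] @ image_emb
    order = np.argsort(-sims)[:k]
    return in_cluster[order], sims[order]

# Toy example with random vectors standing in for image/text features.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(1000, 64))       # 1,000 candidate sentences
centers = rng.normal(size=(16, 64))           # 16 text clusters
cluster_ids = rng.integers(0, 16, size=1000)  # cluster assignment per sentence
image_emb = rng.normal(size=64)               # one image embedding
idx, scores = retrieve_texts(image_emb, text_embs, centers, cluster_ids)
print(idx, scores)
```

Restricting the exact similarity ranking to a single cluster is the usual way such a two-stage search stays tractable when the text corpus reaches web scale.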
Community
In this paper, we explore two fundamental questions: (1) how to utilize multimodal interleaved documents for vision-language representation learning, and (2) how to effectively leverage both realistic and synthetic texts to enhance representation performance. To this end, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts. We then design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant texts. To enhance fine-grained image understanding, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct the RealSyn dataset, which integrates realistic and synthetic texts and comes in three sizes: 15M, 30M, and 100M. Comprehensive experimental results show that RealSyn is effective for vision-language representation learning and exhibits excellent scalability.
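For intuition, semantic balance sampling can be approximated by drawing examples with probability inversely related to the size of their concept cluster, so that long-tail concepts appear more often during training. The weighting below and the `alpha` knob are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def balanced_sample(concept_ids, n_samples, alpha=0.5, seed=0):
    """Sample indices so items from rare concept clusters are upweighted.
    alpha in [0, 1] interpolates between uniform (0) and inverse-frequency (1)."""
    concept_ids = np.asarray(concept_ids)
    _, inverse, counts = np.unique(concept_ids, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse] ** alpha      # inverse-frequency weighting
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(concept_ids), size=n_samples, replace=False, p=probs)

# Toy example: concept 0 dominates; balanced sampling surfaces concepts 1-3 more often.
concepts = [0] * 900 + [1] * 50 + [2] * 30 + [3] * 20
picked = balanced_sample(concepts, n_samples=100)
print(np.bincount(np.asarray(concepts)[picked]))
```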
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (2025)
- GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (2024)
- Goku: Flow Based Video Generative Foundation Models (2025)
- Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens (2025)
- ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models (2025)
- SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning (2025)
- TSVC: Tripartite Learning with Semantic Variation Consistency for Robust Image-Text Retrieval (2025)