Rethinking Overlooked Aspects in Vision-Language Models
Abstract
Recent advancements in large vision-language models (LVLMs), such as GPT-4V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works focus mainly on introducing more pre-training and instruction tuning data to improve model performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets. Our research indicates that merely increasing the size of pre-training data does not guarantee improved performance and may, in fact, lead to degradation. Furthermore, we have established a pipeline to pinpoint the most efficient instruction tuning (SFT) datasets, implying that not all SFT data utilized in existing studies are necessary. The primary objective of this paper is not to introduce a state-of-the-art model, but rather to serve as a roadmap for future research, aiming to optimize data usage during pre-training and fine-tuning to enhance the performance of vision-language models.
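The abstract does not specify how the SFT-selection pipeline works, so the following is only a minimal, hypothetical sketch of one plausible form such a pipeline could take: greedy forward selection over candidate SFT datasets, keeping a dataset only if it improves a held-out benchmark score. All names here (`finetune`, `evaluate`, `candidate_sets`) are placeholders, not the paper's actual interface.

```python
# Hypothetical sketch of greedy SFT-dataset selection; NOT the paper's
# actual pipeline. `finetune` and `evaluate` are assumed user-supplied.

from typing import Callable, Dict, List


def select_sft_subsets(
    base_model,
    candidate_sets: Dict[str, List[dict]],  # dataset name -> instruction samples
    finetune: Callable,                     # (model, samples) -> fine-tuned model
    evaluate: Callable,                     # model -> scalar held-out score
) -> List[str]:
    """Greedily add SFT datasets only while they improve the held-out score."""
    selected: List[str] = []
    best_score = evaluate(base_model)       # baseline without any SFT data
    remaining = dict(candidate_sets)

    while remaining:
        # Score each remaining candidate when added on top of the selection.
        trials = {}
        for name, data in remaining.items():
            pooled = [s for n in selected for s in candidate_sets[n]] + data
            trials[name] = evaluate(finetune(base_model, pooled))

        name, score = max(trials.items(), key=lambda kv: kv[1])
        if score <= best_score:             # no candidate helps; stop early
            break
        selected.append(name)
        best_score = score
        del remaining[name]

    return selected
```

Such a loop directly operationalizes the abstract's claim that "not all SFT data utilized in existing studies are necessary": any dataset that fails to raise the held-out score is simply never added.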