Abstract
Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data semantics across different modalities, it is intuitively more challenging to implement multimodal slow-thinking systems. To address this issue, in this paper, we explore a straightforward approach by fine-tuning a capable MLLM with a small amount of textual long-form thought data, resulting in a multimodal slow-thinking system, Virgo (Visual reasoning with long thought). We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs. Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs. While this work is preliminary, it demonstrates that slow-thinking capacities are fundamentally associated with the language model component, which can be transferred across modalities or domains. This finding can be leveraged to guide the development of more powerful slow-thinking reasoning systems. We release our resources at https://github.com/RUCAIBox/Virgo.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision (2024)
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step (2024)
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (2024)
- A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method&Challenges (2024)
- Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning (2024)
- Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems (2024)
- Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 1
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper