MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which are predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method for constructing a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit chain-of-thought (CoT) reasoning. Using only open models, we create a dataset of 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). The model also shows notable improvements of up to 4% on non-reasoning benchmarks. Ablation studies further highlight the importance of key components of the dataset construction process, such as rewriting and self-filtering.
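To make the construction process concrete, below is a minimal sketch of a rewrite-then-self-filter loop of the kind described above. It is an illustration under stated assumptions, not the paper's actual pipeline: the `query_open_mllm` helper, the `Example` fields, and the prompt wording are hypothetical stand-ins for whatever open multimodal model and data schema one actually uses.

```python
# Illustrative sketch only: rewrite phrase-level QA pairs into CoT-style
# instruction data, then keep only examples that pass a self-filtering check.
# `query_open_mllm` is a hypothetical placeholder for any open multimodal
# model served locally or via an API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    image_path: str
    question: str
    short_answer: str                # original phrase-level label, e.g. "42"
    rationale: Optional[str] = None  # filled in by the rewriting step


def query_open_mllm(prompt: str, image_path: str) -> str:
    """Placeholder: plug in your own open-model inference call here."""
    raise NotImplementedError


def rewrite(example: Example) -> Example:
    # Ask the open model to expand the terse label into a step-by-step answer.
    prompt = (
        f"Question: {example.question}\n"
        f"Reference answer: {example.short_answer}\n"
        "Rewrite the answer as a detailed, step-by-step explanation that "
        "ends with the final answer."
    )
    example.rationale = query_open_mllm(prompt, example.image_path)
    return example


def self_filter(example: Example) -> bool:
    # Ask an open model to judge whether the rewritten rationale is faithful
    # to the image and consistent with the original label.
    prompt = (
        f"Question: {example.question}\n"
        f"Proposed answer: {example.rationale}\n"
        f"Ground-truth answer: {example.short_answer}\n"
        "Does the proposed answer reach the ground truth with a faithful "
        "rationale? Reply YES or NO."
    )
    verdict = query_open_mllm(prompt, example.image_path)
    return verdict.strip().upper().startswith("YES")


def build_dataset(raw: list[Example]) -> list[Example]:
    # Rewrite every example, then keep only those that pass self-filtering.
    rewritten = [rewrite(ex) for ex in raw]
    return [ex for ex in rewritten if self_filter(ex)]
```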
Community
We introduce a simple, scalable, and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset of 12M instruction-response pairs covering diverse, reasoning-intensive tasks with detailed rationales. Our model, MAmmoTH-VL-8B, achieves strong results across a range of benchmarks:
- MMMU (Val): 50.8
- MMMU-Pro (Vision): 25.3
- MMStar: 63.0
- MMBench: 83.4
- MMVet: 62.3
- MathVerse: 34.2
- MathVista: 67.6
- ChartQA: 86.2
- DocVQA: 93.7
- RealWorldQA: 69.9
- MuirBench: 55.1
- MEGA-Bench: 28.2
Check out more detailed results in our paper!
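For anyone who wants to try the model, here is a minimal inference sketch. It assumes, without verifying here, that the checkpoint is published on the Hugging Face Hub under the id MAmmoTH-VL/MAmmoTH-VL-8B and that it loads through the generic transformers vision-to-sequence auto classes; adjust the repo id and model class to match the actual release.

```python
# Hedged sketch: load the checkpoint and run one image question.
# The repo id and the use of AutoModelForVision2Seq are assumptions; check
# the official release for the exact loading recipe.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "MAmmoTH-VL/MAmmoTH-VL-8B"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example_chart.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What trend does this chart show? Think step by step."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0], skip_special_tokens=True))
```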
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning (2024)
- Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation (2024)
- LLaVA-CoT: Let Vision Language Models Reason Step-by-Step (2024)
- Piecing It All Together: Verifying Multi-Hop Multimodal Claims (2024)
- Perception Tokens Enhance Visual Reasoning in Multimodal Language Models (2024)
- Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models (2024)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization (2024)