arxiv:2408.16357

Law of Vision Representation in MLLMs

Published on Aug 29 · Submitted by chenfengx on Aug 30 · #1 Paper of the day

Abstract

We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between MLLM performance and the combination of two properties of the vision representation: cross-modal alignment and correspondence. We quantify the two factors with the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments spanning thirteen vision representation settings and eight benchmarks, we find that the AC score is linearly correlated with model performance. By leveraging this relationship, we can identify and train only the optimal vision representation, without finetuning the language model for every candidate, resulting in a 99.7% reduction in computational cost.
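The selection procedure implied by the abstract lends itself to a short illustration. Below is a minimal sketch, not the authors' code: it assumes AC scores (computed as defined in the paper) and measured benchmark scores are available for a few already-finetuned settings, fits the claimed linear relationship, and ranks untrained candidates by predicted performance. All numbers and representation names are placeholders, not results from the paper.

```python
import numpy as np

# AC scores and measured benchmark scores for settings that were
# already finetuned (hypothetical values for illustration only)
ac_trained = np.array([0.41, 0.55, 0.62, 0.70])
perf_trained = np.array([52.3, 58.1, 60.4, 64.0])

# Least-squares fit of the claimed linear relationship: perf ≈ a * AC + b
a, b = np.polyfit(ac_trained, perf_trained, deg=1)

# Score untrained candidates from their AC scores alone; only the
# top-ranked representation then needs a full finetuning run.
ac_candidates = {"repr_A": 0.48, "repr_B": 0.66, "repr_C": 0.59}  # hypothetical
predicted = {name: a * s + b for name, s in ac_candidates.items()}
best = max(predicted, key=predicted.get)
print(f"Finetune only: {best} (predicted score {predicted[best]:.1f})")
```

Because the AC score is computed without finetuning the language model, ranking candidates this way is what avoids the per-candidate training runs behind the reported 99.7% cost reduction.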

Community

chenfengx (paper author, paper submitter)

We study how to connect vision representations to MLLM performance, and propose an AC policy that suggests which vision model to use! 😉


Hi @chenfengx, congrats on this work!

It would be great to update pipeline_tag: text-generation to pipeline_tag: image-text-to-text in each of the model repositories, which is more appropriate for VLMs (models like LLaVA, Florence-2, PaliGemma, etc. are also using this tag); a sketch of doing this programmatically follows after this comment.

This way people can discover them from https://huggingface.co/models?pipeline_tag=image-text-to-text.

Cheers!
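For reference, here is a minimal, hypothetical sketch of how such a tag update could be done programmatically with huggingface_hub's metadata_update; the repo id is a placeholder, and write access via a logged-in token (huggingface-cli login) is assumed. Editing the YAML metadata at the top of each README.md by hand works just as well.

```python
from huggingface_hub import metadata_update

metadata_update(
    "your-username/your-mllm-checkpoint",   # placeholder: one of the model repos
    {"pipeline_tag": "image-text-to-text"},
    overwrite=True,  # needed because pipeline_tag is already set to text-generation
)
```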

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space.

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Models citing this paper: 26


Datasets citing this paper: 0

No datasets link to this paper yet.

Cite arxiv.org/abs/2408.16357 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Spaces link to this paper yet.

Cite arxiv.org/abs/2408.16357 in a Space README.md to link it from this page.

Collections including this paper: 23