arxiv:2412.04467

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Published on Dec 5
· Submitted by Senqiao on Dec 6
#1 Paper of the day

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
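The core idea of keeping only a subset of informative visual tokens can be sketched in a few lines. This is an illustration rather than VisionZip's implementation: it assumes each visual token already comes with a scalar importance score (for example, an attention weight), and the function name `select_informative_tokens` is hypothetical. How VisionZip actually scores and selects tokens is described in the paper and the linked code.

```python
from typing import List, Sequence


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[Sequence[float]]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    Hypothetical helper: the importance scores are assumed to be given.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")
    keep = min(keep, len(tokens))
    # Rank token indices by score, highest first (ties keep the earlier token).
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    kept_indices = sorted(ranked[:keep])
    return [tokens[i] for i in kept_indices]


# Toy example: four 2-dimensional "visual tokens" with made-up scores.
toy_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_scores = [0.05, 0.60, 0.10, 0.25]
selected = select_informative_tokens(toy_tokens, toy_scores, keep=2)
print(selected)
```

Sorting the kept indices back into positional order is a design choice of this sketch; it keeps the shortened sequence aligned with the original token layout.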

Community

Paper author Paper submitter
edited 9 days ago

🚀 Demo: http://202.104.135.156:7860/
🌟 Video: https://youtu.be/sytaAzmxxpo?si=IieArmQ7YNf2dVyM
🎯 Code: https://github.com/dvlab-research/VisionZip

Usage:

pip install visionzip
from visionzip import visionzip
model = visionzip(model)
Paper author Paper submitter

🔥Highlights:

VisionZip achieves state-of-the-art performance among efficient VLM methods. By retaining only 10% of visual tokens, it preserves nearly 95% of the original performance in training-free mode.

VisionZip can be applied during the inference stage (without incurring any additional training cost), the efficient tuning stage (to achieve better results), and the training stage (with almost no performance degradation, saving 2× memory and 2× training time).

VisionZip reduces the prefilling time by 8x and the total inference time by 2x (with KV cache enabled).
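To make the token budget in these highlights concrete, the sketch below works out how many visual tokens remain at a given retention percentage and how much shorter the prefill sequence becomes. The figure of 576 visual tokens per image and the 50-token text prompt are assumptions chosen for illustration, not numbers from this page. Sequence length is only a rough proxy for prefill cost, so the printed ratio is not expected to reproduce the reported 8x speedup.

```python
def retained_tokens(total_visual_tokens: int, keep_percent: int) -> int:
    """Number of visual tokens kept at the given retention percentage (rounded up)."""
    if not 0 <= keep_percent <= 100:
        raise ValueError("keep_percent must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_visual_tokens * keep_percent // 100)


def prefill_length(visual_tokens: int, text_tokens: int) -> int:
    """Total number of tokens the language model processes during prefill."""
    return visual_tokens + text_tokens


FULL_VISUAL_TOKENS = 576  # assumption: visual tokens per image before reduction
TEXT_TOKENS = 50          # hypothetical prompt length

kept = retained_tokens(FULL_VISUAL_TOKENS, 10)
full_len = prefill_length(FULL_VISUAL_TOKENS, TEXT_TOKENS)
short_len = prefill_length(kept, TEXT_TOKENS)

print(f"visual tokens kept: {kept} of {FULL_VISUAL_TOKENS}")
print(f"prefill length: {full_len} -> {short_len} "
      f"({full_len / short_len:.1f}x shorter)")
```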

Similar methods will destroy the model’s performance on OCR tasks, especially those with high text density.

Paper author

Thank you for your interest in our work. OCR capability was also a concern during the development of VisionZip. However, our results show that it does not cause a significant drop in performance. For example, when LLaVA-1.5 retains only 64 tokens, it still achieves 96.2% on the TextVQA benchmark.

We believe this is because local textual information is highly aggregated in the deeper layers of the vision encoder, so dropping a large number of tokens has minimal impact. You could also try inputting different visual tokens in our demo to explore this further.
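One way to probe this kind of aggregation is to measure how much of the attention mass falls on the highest-weighted tokens. The sketch below assumes a plain list of non-negative attention weights; the weights are made up and the function name `top_k_attention_share` is hypothetical, so it illustrates the measurement rather than any result from the paper.

```python
from typing import Sequence


def top_k_attention_share(weights: Sequence[float], k: int) -> float:
    """Fraction of total attention mass carried by the k largest weights."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if any(w < 0 for w in weights):
        raise ValueError("attention weights must be non-negative")
    total = sum(weights)
    if total == 0:
        return 0.0
    # Sum the k largest weights and normalise by the total mass.
    top = sorted(weights, reverse=True)[:k]
    return sum(top) / total


# Made-up weights over eight tokens, with two tokens dominating.
toy_weights = [0.40, 0.30, 0.10, 0.05, 0.05, 0.04, 0.03, 0.03]
share = top_k_attention_share(toy_weights, k=2)
print(f"top-2 tokens carry {share:.0%} of the attention mass")
```

A share close to 1 for a small k would be consistent with information being concentrated in a few tokens, which is the situation in which dropping the rest costs little.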

