arxiv:2407.15841

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Published on Jul 22
· Submitted by akhaliq on Jul 23
#1 Paper of the day

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that jointly captures detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs that aggregates features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance than state-of-the-art Video LLMs that are fine-tuned on video datasets.
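
For readers who want a concrete picture of the two-stream design, here is a minimal sketch of how the Slow and Fast pathways could aggregate per-frame features before they are handed to the LLM. The tensor shapes, temporal stride, and pooling kernel below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a SlowFast-style token aggregation (assumed shapes and
# hyperparameters, not the authors' implementation).
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    slow_stride: int = 8,   # temporal stride for the Slow pathway (assumed)
                    fast_pool: int = 6      # spatial pooling factor for the Fast pathway
                    ) -> torch.Tensor:
    """frame_features: (T, C, H, W) patch features for T sampled frames,
    e.g. a 24x24 token grid per frame from the vision encoder."""
    T, C, H, W = frame_features.shape

    # Slow pathway: low frame rate, keep (nearly) full spatial resolution.
    slow = frame_features[::slow_stride]                         # (T // slow_stride, C, H, W)

    # Fast pathway: every frame, but aggressively pooled spatially
    # so its tokens mostly carry motion cues.
    fast = F.avg_pool2d(frame_features, kernel_size=fast_pool)   # (T, C, H // fast_pool, W // fast_pool)

    # Flatten each stream into a token sequence and concatenate for the LLM.
    slow_tokens = slow.flatten(2).transpose(1, 2).reshape(-1, C)
    fast_tokens = fast.flatten(2).transpose(1, 2).reshape(-1, C)
    return torch.cat([slow_tokens, fast_tokens], dim=0)          # (num_tokens, C)
```

With, say, 48 sampled frames, a temporal stride of 8, and 6x spatial pooling, the Slow pathway would contribute 6 frames x 576 tokens and the Fast pathway 48 frames x 16 tokens, which illustrates how the two streams trade spatial detail against temporal coverage under a fixed token budget.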

Community

Paper submitter


Thanks to all the authors for such a comprehensive and insightful research paper.

I have a question that I hope you can clarify. In the section discussing the slow pathway, it is mentioned that it outputs 24×24 tokens. However, in the subsequent calculation, it is shown as 12×24 tokens. Could you please clarify this?

Additionally, is there a plan to release the models or code to facilitate the reproduction of the results and application to downstream tasks?

Paper author
edited Jul 23

Thank you for your interest in our paper. The idea of the Slow pathway is to use a low frame rate but keep more tokens per frame. There are multiple design choices to achieve this, and 24x24 tokens is the original output of the vision encoder. However, using 24x24 tokens runs into out-of-memory (OOM) issues, and we found that applying a proper pooling operation does not decrease performance. Thus, we use 12x24 tokens as the default for SlowFast-LLaVA to keep as many spatial details as possible. All numbers in the main results are based on 12x24 tokens. We will clarify this in the revision.

We are still working on the code release. Stay tuned!
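
For anyone puzzled by the same point: the 24x24 to 12x24 reduction described above amounts to pooling the per-frame token grid along one spatial axis. A minimal sketch follows; the tensor shape and pooling choice are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-frame patch features: (batch, channels, 24, 24) token grid.
tokens = torch.randn(1, 1024, 24, 24)

# Average-pool pairs of rows only, halving one axis: 24x24 -> 12x24.
pooled = F.avg_pool2d(tokens, kernel_size=(2, 1))
print(pooled.shape)  # torch.Size([1, 1024, 12, 24])
```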
