arxiv:2407.15841

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Published on Jul 22
· Submitted by akhaliq on Jul 23
#1 Paper of the day

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that jointly captures detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs that aggregates features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details throughout the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance than state-of-the-art Video LLMs that are fine-tuned on video datasets.
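
For readers who want a concrete picture of the two-stream design, here is a minimal sketch of how the Slow and Fast pathways could aggregate per-frame features before they are handed to the LLM. The tensor shapes, temporal stride, and pooling kernel below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a SlowFast-style token aggregation (assumed shapes and
# hyperparameters, not the authors' implementation).
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    slow_stride: int = 8,   # temporal stride for the Slow pathway (assumed)
                    fast_pool: int = 6      # spatial pooling factor for the Fast pathway
                    ) -> torch.Tensor:
    """frame_features: (T, C, H, W) patch features for T sampled frames,
    e.g. a 24x24 token grid per frame from the vision encoder."""
    T, C, H, W = frame_features.shape

    # Slow pathway: low frame rate, keep (nearly) full spatial resolution.
    slow = frame_features[::slow_stride]                         # (T // slow_stride, C, H, W)

    # Fast pathway: every frame, but aggressively pooled spatially
    # so its tokens mostly carry motion cues.
    fast = F.avg_pool2d(frame_features, kernel_size=fast_pool)   # (T, C, H // fast_pool, W // fast_pool)

    # Flatten each stream into a token sequence and concatenate for the LLM.
    slow_tokens = slow.flatten(2).transpose(1, 2).reshape(-1, C)
    fast_tokens = fast.flatten(2).transpose(1, 2).reshape(-1, C)
    return torch.cat([slow_tokens, fast_tokens], dim=0)          # (num_tokens, C)
```

With, say, 48 sampled frames, a temporal stride of 8, and 6x spatial pooling, the Slow pathway would contribute 6 frames x 576 tokens and the Fast pathway 48 frames x 16 tokens, which illustrates how the two streams trade spatial detail against temporal coverage under a fixed token budget.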

Community

Paper submitter


Thanks to all the authors for such a comprehensive and insightful research paper.

I have a question that I hope you can clarify. In the section discussing the slow pathway, it is mentioned that it outputs 24×24 tokens. However, in the subsequent calculation, it is shown as 12×24 tokens. Could you please clarify this?

Additionally, is there a plan to release the models or code to facilitate the reproduction of the results and application to downstream tasks?

Paper author
edited Jul 23

Thank you for your interest in our paper. The idea of the Slow pathway is to use a low frame rate but keep more tokens per frame. There are multiple design choices to achieve this, and 24x24 tokens is the original output of the vision encoder. However, using 24x24 tokens runs into out-of-memory (OOM) issues, and we found that applying a proper pooling operation does not decrease performance. Thus, we use 12x24 tokens as the default for SlowFast-LLaVA to keep as many spatial details as possible. All numbers in the main results are based on 12x24 tokens. We will clarify this in the revision.

We are still working on the code release. Stay tuned!
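
For anyone puzzled by the same point: the 24x24 to 12x24 reduction described above amounts to pooling the per-frame token grid along one spatial axis. A minimal sketch follows; the tensor shape and pooling choice are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-frame patch features: (batch, channels, 24, 24) token grid.
tokens = torch.randn(1, 1024, 24, 24)

# Average-pool pairs of rows only, halving one axis: 24x24 -> 12x24.
pooled = F.avg_pool2d(tokens, kernel_size=(2, 1))
print(pooled.shape)  # torch.Size([1, 1024, 12, 24])
```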
