---
base_model:
  - openbmb/MiniCPM-Llama3-V-2_5
datasets:
  - MBZUAI/VideoInstruct-100K
  - Share14/ShareGemini
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
  - MiniCPM-V
  - finetune
  - MLLM
---

Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

💻 GitHub   |    📑 Paper   

Model Summary

This model is part of the Sparrow project. It is a video-LLM fine-tuned from the image-LLM MiniCPM-Llama3-V-2_5.
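Since the checkpoint is fine-tuned from MiniCPM-Llama3-V-2_5, it can be loaded through `transformers` with `trust_remote_code=True` and queried via the MiniCPM-V `chat()` interface. The sketch below is a minimal, hedged example: the frame-sampling helper and the single-frame call are illustrative assumptions, and the exact multi-frame video input format used by the project is defined in the GitHub repo.

```python
# Minimal inference sketch (assumptions: the checkpoint keeps the MiniCPM-V
# trust_remote_code chat interface; decord is used here only to sample frames).
import torch
from PIL import Image
from decord import VideoReader, cpu          # pip install decord
from transformers import AutoModel, AutoTokenizer

model_path = "path/to/this/checkpoint"       # replace with this repo's model id
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

def sample_frames(video_path, num_frames=8):
    """Uniformly sample RGB frames from a video as PIL images."""
    vr = VideoReader(video_path, ctx=cpu(0))
    idx = [int(i * (len(vr) - 1) / (num_frames - 1)) for i in range(num_frames)]
    return [Image.fromarray(vr[i].asnumpy()) for i in idx]

frames = sample_frames("demo.mp4")
question = "Describe what happens in this video."

# MiniCPM-Llama3-V-2_5's chat() takes a single image per call; how the project
# packs multiple sampled frames into one query is shown in its GitHub repo.
msgs = [{"role": "user", "content": question}]
answer = model.chat(image=frames[0], msgs=msgs, tokenizer=tokenizer,
                    sampling=True, temperature=0.7)
print(answer)
```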

Paper

Title: T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Abstract:

The success of Multimodal Large Language Models (MLLMs) in the image domain has garnered wide attention from the research community. Drawing on previous successful experiences, researchers have recently explored extending this success to the video understanding realm. Apart from training from scratch, an efficient way is to utilize pre-trained image-LLMs, leading to two mainstream approaches, i.e., zero-shot inference and further fine-tuning with video data. In this work, our study of these approaches harvests an effective data augmentation method. We first make a deeper inspection of the zero-shot inference approach and identify two limitations, i.e., limited generalization and lack of temporal understanding capabilities. Thus, we further investigate the fine-tuning approach and find a low learning efficiency when simply using all the video data samples, which can be attributed to a lack of instruction diversity. Aiming at this issue, we develop a method called T2Vid to synthesize video-like samples to enrich the instruction diversity in the training corpus. Integrating these data enables a simple and efficient training scheme, which achieves performance comparable to or even superior to using full video datasets by training with just 15% of the sample size. Meanwhile, we find that the proposed scheme can boost the performance of long video understanding without training with long video samples. We hope our study will spark more thinking about using MLLMs for video understanding and the curation of high-quality data.
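To make the core idea of T2Vid concrete, the toy sketch below renders consecutive segments of a long text as images, so an existing text-only instruction sample becomes a "video-like" multi-image sample. It is only an illustration of the idea under simplified assumptions (plain-text rendering, fixed chunk size), not the authors' actual data pipeline; see the GitHub repo for that.

```python
# Illustrative sketch of the T2Vid idea: turn a long-text instruction sample into
# a "video-like" multi-image sample by rendering consecutive text segments as frames.
# This is a simplified toy version, not the authors' exact data pipeline.
import textwrap
from PIL import Image, ImageDraw

def render_text_frame(text, size=(448, 448), margin=16):
    """Render one chunk of text onto a blank image (one pseudo video frame)."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.multiline_text((margin, margin), textwrap.fill(text, width=40), fill="black")
    return img

def text_to_pseudo_video(long_context, instruction, answer, chars_per_frame=600):
    """Split a long context into segments and render each segment as a frame."""
    segments = [long_context[i:i + chars_per_frame]
                for i in range(0, len(long_context), chars_per_frame)]
    frames = [render_text_frame(seg) for seg in segments]
    # The resulting sample mimics a video instruction pair: frame sequence + QA.
    return {"frames": frames, "question": instruction, "answer": answer}

sample = text_to_pseudo_video(
    long_context="(a long document or reading-comprehension passage goes here) " * 20,
    instruction="Summarize the main idea of the passage.",
    answer="(reference answer)",
)
print(len(sample["frames"]), "pseudo-frames generated")
```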

License

Model License

  • The code in this repo is released under the Apache-2.0 License.
  • The usage of MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
  • The models and weights of MiniCPM are completely free for academic research. After filling out a "questionnaire" for registration, they are also available for free commercial use.

Statement

  • As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from a large amount of text, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
  • We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misguidance, misuse, dissemination, or abuse of the model.

Training dataset

  • 100K video instruction data from Video-ChatGPT
  • 100K video caption data from ShareGemini
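
A minimal sketch for pulling these two corpora with the Hugging Face `datasets` library follows; the split name is an assumption, so check each dataset card for the actual configuration and schema.

```python
# Hedged sketch: load the two training corpora from the Hugging Face Hub.
# Split names and record fields are assumptions; consult the dataset cards.
from datasets import load_dataset

video_instruct = load_dataset("MBZUAI/VideoInstruct-100K", split="train")
share_gemini = load_dataset("Share14/ShareGemini", split="train")

print(video_instruct)
print(share_gemini)
```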