Grounded-VideoLLM Model Card

Grounded-VideoLLM is a Video-LLM adept at fine-grained temporal grounding. It not only excels in grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.

Model details

Model date:

Grounded-VideoLLM-Phi3.5-Vision-Instruct-4B was trained in Oct. 2024.

Grounded-VideoLLM-LLaVA-Next-Llama3-8B was trained in Oct. 2024.

Paper or resources for more information: Paper (arXiv:2410.03290), Code
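
The checkpoints are distributed as raw weights rather than through a registered library, so a typical first step is to download them from the Hugging Face Hub and run them with the project's own inference code. Below is a minimal, hedged sketch using `huggingface_hub`; the repo id is a placeholder assumption (check the model page for the exact identifier), and inference itself should follow the instructions in the linked Code repository.

```python
# Minimal sketch: download a Grounded-VideoLLM checkpoint from the Hub.
# The repo id below is a placeholder assumption, not confirmed by this card;
# replace it with the identifier shown on the actual model page.
from huggingface_hub import snapshot_download

REPO_ID = "org-name/Grounded-VideoLLM-Phi3.5-Vision-Instruct-4B"  # placeholder

local_dir = snapshot_download(repo_id=REPO_ID)
print(f"Checkpoint files downloaded to: {local_dir}")
```

Since the model is not served via the HF Inference API, run the project's own inference scripts (see the Code link above) against the downloaded directory.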

Citation

If you find our project useful, we hope you will star our repo and cite our paper as follows:

```bibtex
@article{wang2024grounded,
  title={Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models},
  author={Wang, Haibo and Xu, Zhiyang and Cheng, Yu and Diao, Shizhe and Zhou, Yufan and Cao, Yixin and Wang, Qifan and Ge, Weifeng and Huang, Lifu},
  journal={arXiv preprint arXiv:2410.03290},
  year={2024}
}
```