---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# E.T. Chat

[arXiv](https://arxiv.org/abs/2409.18111) | [Project Page](https://polyu-chenlab.github.io/etbench) | [GitHub](https://github.com/PolyU-ChenLab/ETBench)

E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, serving as a strong baseline on E.T. Bench. E.T. Chat consists of a visual encoder, a frame compressor, and an LLM. A special token `<vid>` is introduced to trigger frame embedding matching for timestamp prediction.
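
The embedding-matching idea can be illustrated with a minimal sketch. This is not the released implementation: the function name, tensor shapes, cosine-similarity scoring, and fixed-fps frame-to-timestamp conversion below are all illustrative assumptions; see the GitHub repository for the actual code.

```python
import torch
import torch.nn.functional as F

def match_vid_token_to_frames(vid_hidden: torch.Tensor,
                              frame_embeds: torch.Tensor,
                              fps: float = 1.0) -> float:
    """Hypothetical sketch: turn the hidden state of a <vid> token into a
    timestamp by matching it against per-frame embeddings.

    vid_hidden:   (d,)   hidden state of the <vid> token produced by the LLM
    frame_embeds: (T, d) compressed frame embeddings, one per sampled frame
    fps:          assumed frame sampling rate, for converting index to seconds
    """
    # Similarity between the <vid> token and every frame embedding.
    sims = F.cosine_similarity(vid_hidden.unsqueeze(0), frame_embeds, dim=-1)  # (T,)
    # The best-matching frame index is read out as the predicted timestamp,
    # so the model never has to emit numeric timestamps as text.
    frame_idx = sims.argmax().item()
    return frame_idx / fps

# Toy usage with random tensors (hidden size d = 256, T = 64 sampled frames).
vid_hidden = torch.randn(256)
frame_embeds = torch.randn(64, 256)
print(f"Predicted timestamp: {match_vid_token_to_frames(vid_hidden, frame_embeds):.1f}s")
```

Framed this way, timestamp prediction becomes retrieval over frame embeddings rather than numeric text generation.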

## Model Details

### Model Description

- **Developed by:** Ye Liu
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

### Training Data

The stage-1 checkpoint of E.T. Chat was trained on the [WebVid](https://maxbain.com/webvid-dataset/) and [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) datasets.

### More Details

Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/ETBench) for more details about this model.

## Citation

Please cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}
```