PolyU-ChenLab/ETChat-Phi3-Mini-Stage-1

	---
	license: bsd-3-clause
	pipeline_tag: video-text-to-text
	---

	# E.T. Chat

	[arXiv](https://arxiv.org/abs/2409.18111) \| [Project Page](https://polyu-chenlab.github.io/etbench) \| [GitHub](https://github.com/PolyU-ChenLab/ETBench)

	E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem and serves as a strong baseline on E.T. Bench. It consists of a visual encoder, a frame compressor, and an LLM. A special token \<vid\> triggers frame embedding matching for timestamp prediction.
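The embedding matching idea can be illustrated with a small, dependency-free sketch. This is not the released implementation: the use of cosine similarity, the argmax selection, the `fps` conversion, and the names `cosine` and `match_timestamp` are all assumptions made for illustration. It only shows how a hidden state produced at a \<vid\> token could be matched against per-frame embeddings to pick a frame, which is then mapped to a timestamp.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_timestamp(vid_state, frame_embeds, fps):
    """Return (frame index, time in seconds) of the frame most similar to vid_state.

    vid_state    -- hypothetical hidden state emitted at the <vid> token
    frame_embeds -- hypothetical list of per-frame embeddings, in temporal order
    fps          -- assumed frame sampling rate, in frames per second
    """
    if not frame_embeds:
        raise ValueError("frame_embeds must not be empty")
    scores = [cosine(vid_state, frame) for frame in frame_embeds]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, best / fps


# Toy example: four 2-D "frame embeddings" sampled at 2 frames per second.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
vid = [0.1, 0.9]
index, seconds = match_timestamp(vid, frames, fps=2.0)
print(index, seconds)
```

Under these assumptions, the timestamp is read off from the best-matching frame rather than generated as numeric text by the language model.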

	## 🔖 Model Details

	### Model Description

	- Developed by: Ye Liu
	- Model type: Multi-modal Large Language Model
	- Language(s): English
	- License: BSD-3-Clause

	### Training Data

	The stage-1 checkpoint of E.T. Chat was trained on the [WebVid](https://maxbain.com/webvid-dataset/) and [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) datasets.

	### More Details

	Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/ETBench) for more details about this model.
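As a minimal sketch, the checkpoint files could be fetched with the `huggingface_hub` client before following the repository's instructions. The repository ID below is taken from this model page; the helper name `download_checkpoint` is hypothetical, and the snippet only downloads files. It does not load or run the model, for which the GitHub repository remains the reference.

```python
REPO_ID = "PolyU-ChenLab/ETChat-Phi3-Mini-Stage-1"


def download_checkpoint(local_dir=None):
    """Download all files of the stage-1 checkpoint and return the local path.

    Requires `pip install huggingface_hub` and network access.
    If local_dir is None, files go to the default Hugging Face cache.
    """
    # Imported lazily so the module can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print(download_checkpoint())
```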

	## 📖 Citation

	Please cite our paper if you find this project helpful.

	```bibtex
	@inproceedings{liu2024etbench,
	  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
	  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
	  booktitle={Neural Information Processing Systems (NeurIPS)},
	  year={2024}
	}
	```