ViViT (Video Vision Transformer)
The ViViT model was introduced in the paper ViViT: A Video Vision Transformer by Arnab et al. and first released in this repository.
Disclaimer: The team releasing ViViT did not write a model card for this model, so this model card has been written by the Hugging Face team.
Model description
ViViT is an extension of the Vision Transformer (ViT) to video: instead of 2D image patches, the input video is tokenized into spatio-temporal "tubelets", and the resulting tokens are processed by a Transformer encoder.
We refer to the paper for details.
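To make the tubelet idea concrete, here is a minimal NumPy sketch of the embedding step. The tubelet size, embedding dimension, and the random projection weights are illustrative assumptions, not values taken from the released checkpoints:

```python
import numpy as np

def tubelet_embed(video, tubelet=(2, 16, 16), dim=8):
    """Split a video into non-overlapping spatio-temporal tubelets
    and linearly project each one to an embedding vector.

    video: array of shape (T, H, W, C); tubelet/dim are illustrative.
    """
    t, h, w = tubelet
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    # carve the video into (T//t) x (H//h) x (W//w) tubelets ...
    v = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # ... group the tubelet axes together and flatten each tubelet
    v = v.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w * C)
    # random projection stands in for the learned embedding matrix
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((t * h * w * C, dim)) / np.sqrt(t * h * w * C)
    return v @ proj

# a 32-frame, 224x224 RGB clip yields (32/2) * (224/16)**2 = 3136 tokens
video = np.zeros((32, 224, 224, 3), dtype=np.float32)
tokens = tubelet_embed(video)
print(tokens.shape)  # (3136, 8)
```

In the actual model the projection is a learned layer and a classification token plus positional embeddings are added before the encoder; this sketch only shows the tokenization.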
Intended uses & limitations
The model is mostly intended to be fine-tuned on a downstream task, such as video classification. See the model hub to look for fine-tuned versions on a task that interests you.
How to use
For code examples, we refer to the documentation.
BibTeX entry and citation info
@misc{arnab2021vivit,
    title={ViViT: A Video Vision Transformer},
    author={Anurag Arnab and Mostafa Dehghani and Georg Heigold and Chen Sun and Mario Lučić and Cordelia Schmid},
    year={2021},
    eprint={2103.15691},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}