TVLT

Textless Vision-Language Transformer (TVLT) model, pre-trained only. TVLT was introduced in the paper TVLT: Textless Vision-Language Transformer by Tang et al. and first released in this repository.

Disclaimer: The team releasing TVLT did not write a model card for this model, so this model card has been written by the Hugging Face team.

Model description

TVLT is based on the MAE (masked autoencoder) model, but extends it to audio-visual pre-training: the model is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding), along with contrastive modeling to align video and audio, without using any text.

Intended uses & limitations

As this checkpoint is pre-trained only, it's recommended to fine-tune the model on a downstream task that involves audio and/or video, such as audio-visual classification (see the sketch below).
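
As a hedged illustration (not an official recipe), the sketch below sets up the model with a classification head using the TVLT classes from transformers. The checkpoint name ZinengTang/tvlt-base and the two-label task are assumptions for this example, and the random arrays only stand in for real data to show the expected shapes. Note that TVLT was deprecated in later transformers releases, so an older version may be required.

```python
# A minimal sketch, not an official example. Assumptions: the checkpoint name
# "ZinengTang/tvlt-base", a hypothetical 2-label task, and a transformers
# version that still ships the (since-deprecated) TVLT classes.
import numpy as np
from transformers import TvltProcessor, TvltForAudioVisualClassification

processor = TvltProcessor.from_pretrained("ZinengTang/tvlt-base")
model = TvltForAudioVisualClassification.from_pretrained(
    "ZinengTang/tvlt-base",
    num_labels=2,  # classification head is randomly initialized and needs fine-tuning
)

# Dummy inputs: 8 video frames (channels-first, 224x224) and a short raw waveform.
num_frames = 8
images = list(np.random.randn(num_frames, 3, 224, 224))
audio = list(np.random.randn(10000))

inputs = processor(images, audio, sampling_rate=44100, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # shape: (batch_size, num_labels)
```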

How to use

For code examples, we refer to the documentation.
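
As a starting point, here is a minimal feature-extraction sketch that closely follows the transformers documentation example. The checkpoint name ZinengTang/tvlt-base is an assumption, and a transformers version that still includes TVLT is required.

```python
# A minimal sketch of extracting joint audio-visual features, assuming the
# checkpoint name "ZinengTang/tvlt-base" and a transformers version that
# still ships the (since-deprecated) TVLT classes.
import numpy as np
from transformers import TvltProcessor, TvltModel

processor = TvltProcessor.from_pretrained("ZinengTang/tvlt-base")
model = TvltModel.from_pretrained("ZinengTang/tvlt-base")

# Dummy inputs: 8 video frames (channels-first, 224x224) and a raw audio clip.
num_frames = 8
images = list(np.random.randn(num_frames, 3, 224, 224))
audio = list(np.random.randn(10000))

inputs = processor(images, audio, sampling_rate=44100, return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state  # joint audio-visual representation
```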

BibTeX entry and citation info

@misc{https://doi.org/10.48550/arxiv.2209.14156,
  doi       = {10.48550/ARXIV.2209.14156},
  url       = {https://arxiv.org/abs/2209.14156},
  author    = {Tang, Zineng and Cho, Jaemin and Nie, Yixin and Bansal, Mohit},
  keywords  = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), FOS: Computer and information sciences},
  title     = {TVLT: Textless Vision-Language Transformer},
  publisher = {arXiv},
  year      = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}