metadata
license: apache-2.0
tags: null
Vision-and-Language Transformer (ViLT), pre-trained only
Vision-and-Language Transformer (ViLT) model pre-trained on GCC+SBU+COCO+VG (200k steps). It was introduced in the paper ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision by Kim et al. and first released in this repository.
Disclaimer: The team releasing ViLT did not write a model card for this model so this model card has been written by the Hugging Face team.
Model description
(to do)
Intended uses & limitations
You can use the raw model for visual question answering.
How to use
(to do)
Training data
(to do)
Training procedure
Preprocessing
(to do)
Pretraining
(to do)
Evaluation results
(to do)
BibTeX entry and citation info
@misc{kim2021vilt,
title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
author={Wonjae Kim and Bokyung Son and Ildoo Kim},
year={2021},
eprint={2102.03334},
archivePrefix={arXiv},
primaryClass={stat.ML}
}