---
license: apache-2.0
---

# Vision-and-Language Transformer (ViLT), fine-tuned on VSR zeroshot split

Vision-and-Language Transformer (ViLT) model fine-tuned on the zeroshot split of [Visual Spatial Reasoning (VSR)](https://arxiv.org/abs/2205.00363). ViLT was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).

## Intended uses & limitations

You can use the model to determine whether a sentence is true or false given an image.

### How to use

Here is how to use the model in PyTorch:

```python
from transformers import ViltProcessor, ViltForImagesAndTextClassification
import requests
from PIL import Image

# load an example image and a statement to verify against it
image = Image.open(requests.get("https://camo.githubusercontent.com/ffcbeada14077b8e6d4b16817c91f78ba50aace210a1e4754418f1413d99797f/687474703a2f2f696d616765732e636f636f646174617365742e6f72672f747261696e323031372f3030303030303038303333362e6a7067", stream=True).raw)
text = "The person is ahead of the cow."

processor = ViltProcessor.from_pretrained("juletxara/vilt-vsr-zeroshot")
model = ViltForImagesAndTextClassification.from_pretrained("juletxara/vilt-vsr-zeroshot")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# forward pass
# the model expects pixel_values of shape (batch_size, num_images, channels, height, width),
# so an extra dimension is added to the processor output
outputs = model(input_ids=encoding.input_ids, pixel_values=encoding.pixel_values.unsqueeze(0))
logits = outputs.logits

# map the highest-scoring class index to its label
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])
```

## Training data

(to do)

## Training procedure

### Preprocessing

(to do)

### Pretraining

(to do)

## Evaluation results

(to do)

### BibTeX entry and citation info

```bibtex
@misc{kim2021vilt,
      title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
      author={Wonjae Kim and Bokyung Son and Ildoo Kim},
      year={2021},
      eprint={2102.03334},
      archivePrefix={arXiv},
      primaryClass={stat.ML}
}

@article{liu2022visual,
      title={Visual Spatial Reasoning},
      author={Liu, Fangyu and Emerson, Guy and Collier, Nigel},
      journal={arXiv preprint arXiv:2205.00363},
      year={2022}
}
```