## Abstract 
This project focuses on Multilingual Visual Question Answering. Most existing datasets and models for this task work with English-only image-text pairs. Our intention here is to provide a proof of concept: a simple CLIP Vision + BERT model that can be trained starting from multilingual text checkpoints and pre-trained image encoders, and made to perform reasonably well.

Because good-quality multilingual data is scarce, we use the mBART-50 models to translate subsets of the Conceptual 12M dataset from English into French, German and Spanish, and keep the original English text as the fourth language. We achieved 0.49 accuracy on the multilingual validation set we created. With better captions and hyperparameter tuning, we expect to see higher performance.
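
The translation step can be pictured as a fan-out from each English text to four language versions. The following is a minimal sketch, not the project's actual pipeline: `build_multilingual_rows` and `stub_translate` are hypothetical names, and the translator is a stand-in for a real mBART-50 call. Only the language codes (`en_XX`, `fr_XX`, `de_DE`, `es_XX`) are the ones mBART-50 uses.

```python
from typing import Callable, Dict, Iterable, List

# mBART-50 language codes for the four languages named in the abstract.
MBART50_CODES: Dict[str, str] = {
    "en": "en_XX",
    "fr": "fr_XX",
    "de": "de_DE",
    "es": "es_XX",
}

# A translator takes (text, source_code, target_code) and returns a string.
Translator = Callable[[str, str, str], str]


def build_multilingual_rows(
    texts: Iterable[str],
    translate: Translator,
    source_lang: str = "en",
) -> List[Dict[str, str]]:
    """Return one row per (text, language) pair.

    The source language is passed through unchanged; every other
    language is produced by the supplied translator (an assumption:
    in practice this would wrap an mBART-50 model).
    """
    source_code = MBART50_CODES[source_lang]
    rows: List[Dict[str, str]] = []
    for text in texts:
        for lang, code in MBART50_CODES.items():
            if lang == source_lang:
                out = text  # already in the source language, no translation
            else:
                out = translate(text, source_code, code)
            rows.append({"lang": lang, "text": out})
    return rows


def stub_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical placeholder that only tags the text with the target code."""
    return f"[{tgt}] {text}"


demo_rows = build_multilingual_rows(["a dog on a beach"], stub_translate)
```

Keeping the translator as a plain callable means the same loop can be exercised with a stub, as here, or with a real model wrapper, without changing how the rows are assembled.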