Our novel contributions include:
- A multilingual variant of the Conceptual-12M dataset containing 2.5M image-text pairs in each of four languages (English, French, German and Spanish), translated using the mBART-50 model.
- Multilingual variants of the VQAv2 train and validation sets containing four times the original data in English, French, German and Spanish, translated using Marian models.
- A fusion of the CLIP Vision Transformer and a BERT model, in which BERT embeddings are concatenated with visual embeddings at the very beginning and passed through the BERT self-attention layers. This design is based on the VisualBERT model.
- A pre-trained checkpoint on our multilingual Conceptual-12M variant with 67.85% validation accuracy.
- A fine-tuned checkpoint on our multilingual variant of the VQAv2 dataset with 49.76% validation accuracy.
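
The fusion step described in the list above can be illustrated with a minimal sketch. It is not the project's implementation: random vectors stand in for the BERT and CLIP outputs, the attention uses a single head with identity projections rather than BERT's learned multi-head layers, and the names `fuse`, `self_attention` and `HIDDEN` are hypothetical. Placing the text tokens before the visual tokens is also an assumption, since the order is not stated here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the inputs themselves
    (identity projections); a real BERT layer learns these projections.
    """
    dim = len(seq[0])
    out = []
    for query in seq:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in seq
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * value[d] for w, value in zip(weights, seq)) for d in range(dim)]
        )
    return out


def fuse(text_embeddings, visual_embeddings):
    """Early fusion: join both modalities along the sequence axis,
    then let every token attend to every other token."""
    joint = text_embeddings + visual_embeddings  # text first (assumed order)
    return self_attention(joint)


random.seed(0)
HIDDEN = 8  # toy hidden size; both modalities must share it before concatenation

# Stand-ins for 5 text-token embeddings and 3 visual-patch embeddings.
text_embeddings = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(5)]
visual_embeddings = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(3)]

fused = fuse(text_embeddings, visual_embeddings)
print(len(fused), len(fused[0]))  # 8 8
```

Because the two sequences are joined before attention, each text token can attend to each image patch and vice versa from the first layer onward, which is the property that distinguishes this early-fusion design from combining the two encoders only at the output.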