gchhablani commited on
Commit
9dcb033
1 Parent(s): 4eb611e

Update intro

Browse files
Files changed (1) hide show
  1. sections/intro.md +2 -2
sections/intro.md CHANGED
@@ -23,5 +23,5 @@ Our novel contributions include:
23
  - A [multilingual variant of the Conceptual-12M dataset](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs each in four languages - English, French, German and Spanish, translated using mBART-50 model.
24
  - [Multilingual variants of the VQAv2 train and validation sets](https://huggingface.co/datasets/flax-community/multilingual-vqa) containing four times the original data in English, French, German and Spanish, translated using Marian models.
25
  - [A fusion of CLIP Vision Transformer and BERT model](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert) where BERT embeddings are concatenated with visual embeddings at the very beginning and passed through BERT self-attention layers. This is based on the [VisualBERT](https://arxiv.org/abs/1908.03557) model.
26
- - A [pre-trained checkpooint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-70k) on our multilingual with **0.69** validation accuracy.
27
- - A [fine-tuned checkpoint](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k) on our multilingual variant of the VQAv2 dataset with **0.49** validation accuracy.
 
23
  - A [multilingual variant of the Conceptual-12M dataset](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs each in four languages - English, French, German and Spanish, translated using mBART-50 model.
24
  - [Multilingual variants of the VQAv2 train and validation sets](https://huggingface.co/datasets/flax-community/multilingual-vqa) containing four times the original data in English, French, German and Spanish, translated using Marian models.
25
  - [A fusion of CLIP Vision Transformer and BERT model](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert) where BERT embeddings are concatenated with visual embeddings at the very beginning and passed through BERT self-attention layers. This is based on the [VisualBERT](https://arxiv.org/abs/1908.03557) model.
26
+ - A [pre-trained checkpooint](https://huggingface.co/flax-community/clip-vision-bert-cc12m-70k) on our multilingual with **67.85%** validation accuracy.
27
+ - A [fine-tuned checkpoint](https://huggingface.co/flax-community/clip-vision-bert-vqa-ft-6k) on our multilingual variant of the VQAv2 dataset with **49.76%** validation accuracy.