gchhablani committed
Commit 6e91dae
Parent: b29be09

Update intro

Files changed (1):
  sections/intro.md (+1 -1)
sections/intro.md CHANGED
@@ -12,7 +12,7 @@ A major **advantage that comes from using transformers is their simplicity and t
 
  While building a low-resource non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it will help create a generic approach/model that works decently well across several languages.
 
- **With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP Vision + BERT baseline, which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports four languages: **English, French, German and Spanish**.
+ **With the aim of democratizing such a challenging yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA)**. Our intention here is to provide a Proof-of-Concept with our simple CLIP-Vision-BERT baseline, which leverages a multilingual checkpoint with pre-trained image encoders. Our model currently supports four languages: **English, French, German and Spanish**.
 
  We follow a two-stage training approach, with text-only Masked Language Modeling (MLM) as the pre-training task. Our pre-training dataset comes from the Conceptual-12M dataset, whose captions we translate using mBART-50. Our fine-tuning dataset is taken from the VQAv2 dataset, translated using MarianMT models.
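
For context on the baseline the updated text names: below is a minimal PyTorch sketch of a CLIP-Vision-BERT-style model, assuming the common pattern of prepending projected CLIP patch features to a multilingual BERT's token embeddings. The class name `ClipVisionBert`, the checkpoints `openai/clip-vit-base-patch32` and `bert-base-multilingual-uncased`, and the wiring are illustrative assumptions, not the project's actual implementation.

```python
import torch
from transformers import BertModel, CLIPVisionModel

class ClipVisionBert(torch.nn.Module):
    """Illustrative sketch only: prepend CLIP patch features to multilingual
    BERT token embeddings. Not the project's actual implementation."""

    def __init__(self):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text = BertModel.from_pretrained("bert-base-multilingual-uncased")
        # Map CLIP's hidden size onto BERT's embedding size.
        self.proj = torch.nn.Linear(
            self.vision.config.hidden_size, self.text.config.hidden_size
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # (batch, num_patches + 1, clip_hidden) -> projected visual "tokens"
        visual = self.proj(self.vision(pixel_values=pixel_values).last_hidden_state)
        # Raw word embeddings; BERT adds position embeddings over the whole
        # concatenated sequence when it is given inputs_embeds.
        words = self.text.embeddings.word_embeddings(input_ids)
        embeds = torch.cat([visual, words], dim=1)
        # Extend the attention mask to cover the prepended visual tokens.
        visual_mask = torch.ones(
            visual.shape[:2], dtype=attention_mask.dtype, device=attention_mask.device
        )
        mask = torch.cat([visual_mask, attention_mask], dim=1)
        return self.text(inputs_embeds=embeds, attention_mask=mask)
```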
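Since the section cites mBART-50 and MarianMT for dataset translation, here is a rough sketch of the caption-translation step using the public `facebook/mbart-large-50-many-to-many-mmt` checkpoint. The checkpoint choice and the English-to-French pair are assumptions; VQAv2 questions would be translated analogously with per-language MarianMT models, and the real pipeline's batching and filtering are not shown.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Assumed public many-to-many mBART-50 checkpoint; the project may use another.
name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(name)
model = MBartForConditionalGeneration.from_pretrained(name)

tokenizer.src_lang = "en_XX"  # source: an English Conceptual-12M caption
caption = "A dog sitting on a wooden bench in the park."
inputs = tokenizer(caption, return_tensors="pt")

# Force the decoder to start in the target language (here: French).
generated = model.generate(
    **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"]
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```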