gchhablani committed on
Commit
3c8ae45
1 Parent(s): a4ce24c

Update abstract

Files changed (1)
  1. sections/abstract.md +21 -4
sections/abstract.md CHANGED
@@ -1,6 +1,23 @@
- ## Abstract
- This project is focused on Mutilingual Visual Question Answering. Most of the existing datasets and models on this task work with English-only image-text pairs. Our intention here is to provide a Proof-of-Concept with our simple CLIP Vision + BERT model which can be trained on multilingual text checkpoints with pre-trained image encoders and made to perform well enough.

- Due to lack of good-quality multilingual data, we translate subsets of the Conceptual 12M dataset into English (already in English), French, German and Spanish using the mBART-50 models. We get an eval accuracy of 0.69 on the MLM task.

- We achieved 0.49 accuracy on the multilingual validation set of VQAv2 we created using Marian models. With better captions, and hyperparameter-tuning, we expect to see higher performance.
+ Visual Question Answering (VQA) is a task where we expect an AI system to answer a question about a given image.

+ VQA has been an active area of research for the past 4-5 years, with most datasets using natural images found online. Two examples of such datasets are [VQAv2](https://visualqa.org/challenge.html) and [GQA](https://cs.stanford.edu/people/dorarad/gqa/about.html). VQA is a particularly interesting multi-modal machine learning challenge because it has several applications across domains, including healthcare chatbots, interactive agents, etc.

+ However, most VQA challenges and datasets deal with English-only captions and questions.
+ In addition, even recent approaches proposed for VQA remain relatively inaccessible, because the CNN-based object detectors they rely on are difficult to set up and add considerable complexity.
+
+ For example, a Faster-RCNN approach uses the following steps (a rough sketch of this pipeline follows the list):
+ - an FPN (Feature Pyramid Network) over a ResNet backbone extracts multi-scale features, and
+ - then an RPN (Region Proposal Network) layer detects proposals in those features, and
+ - then the ROI (Region of Interest) heads map the box proposals back onto the original image, and
+ - then the final boxes are selected using NMS (Non-Max Suppression),
+ - and then the features for the selected boxes are extracted.
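For a concrete sense of how many moving parts this involves, here is a minimal sketch of such a detection pipeline using torchvision's off-the-shelf Faster R-CNN (ResNet-50 + FPN). This is only an illustration of the detector-based route, not the feature extractor used in this project.

```python
# Minimal sketch of a detector-based pipeline (illustration only).
import torch
import torchvision

# Faster R-CNN bundles the ResNet-50 backbone, FPN, RPN, ROI heads and NMS.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy batch of one 3x224x224 image with values in [0, 1].
images = [torch.rand(3, 224, 224)]

with torch.no_grad():
    # Internally: backbone/FPN features -> RPN proposals -> ROI pooling ->
    # box regression/classification -> NMS.
    outputs = model(images)

# Each output holds the post-NMS boxes, labels and scores for one image;
# region features for VQA would then be pooled from these boxes.
print(outputs[0]["boxes"].shape, outputs[0]["scores"].shape)
```

Each of these stages has its own hyperparameters and failure modes, which is part of what makes detector-based VQA pipelines hard to reproduce.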
+
+ A major advantage of using transformers is their simplicity and accessibility, thanks to the HuggingFace team and the ViT and Transformers authors. For ViT models, for example, all one needs to do is pass the normalized images to the transformer.
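By contrast, here is a minimal sketch of the ViT route with the HuggingFace transformers API. The checkpoint name is the standard public one and is an assumption here, not necessarily the model used in this project.

```python
# Minimal sketch: images in, patch embeddings out.
import requests
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Resizing and normalization is the only preprocessing needed.
inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding per image patch (plus the [CLS] token) comes straight out.
print(outputs.last_hidden_state.shape)  # (1, 197, 768)
```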
+
+ While building a low-resource, non-English VQA approach has several benefits of its own, a multilingual VQA task is interesting because it helps create a generic model that works reasonably well across several languages.
+
+ With the aim of democratizing this relatively inaccessible yet interesting task, in this project we focus on Multilingual Visual Question Answering (MVQA). Our intention here is to provide a Proof-of-Concept with our simple CLIP Vision + BERT baseline, which leverages a multilingual text checkpoint together with a pre-trained image encoder. Our model currently supports four languages: English, French, German, and Spanish.
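The following is an illustrative PyTorch sketch of the CLIP Vision + BERT idea: encode the image with CLIP's vision tower, project the patch embeddings, and feed them to a multilingual BERT alongside the question tokens. The checkpoint names and the simple linear projection are assumptions for illustration, not the project's exact implementation.

```python
# Illustrative sketch of fusing CLIP vision features with multilingual BERT.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast, CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
bert = BertModel.from_pretrained("bert-base-multilingual-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")

# Simple linear projection from CLIP's hidden size to BERT's hidden size
# (an assumption for this sketch).
proj = nn.Linear(vision.config.hidden_size, bert.config.hidden_size)

pixel_values = torch.rand(1, 3, 224, 224)  # dummy preprocessed image
text = tokenizer("Was ist auf dem Bild zu sehen?", return_tensors="pt")

with torch.no_grad():
    patch_embeds = vision(pixel_values).last_hidden_state   # (1, 50, 768)
    visual_embeds = proj(patch_embeds)                       # (1, 50, 768)
    text_embeds = bert.embeddings.word_embeddings(text.input_ids)

    # Concatenate visual and textual embeddings and run them through BERT.
    inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(visual_embeds.shape[:2], dtype=torch.long), text.attention_mask],
        dim=1,
    )
    fused = bert(inputs_embeds=inputs_embeds, attention_mask=attention_mask)

# The fused sequence would feed an MLM head (pre-training) or an answer
# classification head (VQA fine-tuning).
print(fused.last_hidden_state.shape)
```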
+
+ We follow a two-stage training approach, with text-only Masked Language Modeling (MLM) as the pre-training task. Our pre-training dataset comes from the Conceptual-12M dataset, which we translate using mBART-50. Our fine-tuning dataset is taken from the VQAv2 dataset, and its translation is done using MarianMT models.
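As an example of the translation step, here is a minimal sketch using a public Helsinki-NLP MarianMT checkpoint. The exact checkpoints, language pairs, and batching used in the project may differ.

```python
# Minimal sketch of translating English questions/captions with MarianMT.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"  # English -> German; swap for fr/es
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

questions = ["What color is the cat?", "How many people are in the picture?"]

batch = tokenizer(questions, return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```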
+
+ We achieve an eval accuracy of 0.69 on the MLM task, while our fine-tuned model achieves an eval accuracy of 0.49 on our multilingual VQAv2 validation set. With better captions, hyperparameter tuning, and further training, we expect to see higher performance.