gchhablani committed on
Commit 1733625
1 Parent(s): 4d398ed

Update social impact and future work

sections/conclusion_future_work/future_work.md CHANGED
@@ -2,4 +2,7 @@ We hope to improve this project in the future by using:
  - Superior translation models: Translation quality has a huge impact on how the end model performs. Better translators (e.g., the Google Translate API) and language-specific seq2seq translation models can generate better data for both high-resource and low-resource languages.
  - Checking translation quality: Inspecting the quality of the translated data is as important as the translation model itself. For this, we will either need native speakers to manually inspect a sample of the translated data or devise unsupervised translation-quality metrics.
  - More data: Currently, we use only 2.5M images from Conceptual 12M for image captioning. We plan to include other datasets such as Conceptual Captions 3M, a subset of the YFCC100M dataset, and so on.
- - Low-resource languages: With better translation tools, we also wish to train our model on low-resource languages, which would further democratize the image-captioning solution and help people realise the potential of language systems.
+ - Low-resource languages: With better translation tools, we also wish to train our model on low-resource languages, which would further democratize the image-captioning solution and help people realise the potential of language systems.
+ - More models: Currently, we stick to CLIP-ViT and mBART-50. However, there are many multilingual models that could be used in place of mBART-50, and the ViT transformer has many official checkpoints that can be combined with them. We could also use any other auto-regressive model trained on multilingual data instead of a seq2seq model, in order to create a diverse set of models for this task (see the sketch after this diff).
+ - Better deployability: We intend to build several versions of the model to make it available for mobile-phone deployment.
+ - More domains: We want to go beyond natural images and cover medical, artistic, and satellite images, which have several downstream applications; such a model would be in high demand.
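
The "More models" item is easiest to picture with a concrete pairing. The sketch below uses the generic Hugging Face `VisionEncoderDecoderModel` API with hypothetical checkpoint choices; it is not the project's actual CLIP-ViT + mBART-50 implementation, only one way a ViT encoder could be combined with a different multilingual decoder.

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Hypothetical pairing: a ViT image encoder with an mBART-50 decoder.
# Any other ViT checkpoint or multilingual autoregressive decoder could be swapped in.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # vision encoder
    "facebook/mbart-large-50",            # multilingual decoder, loaded as a causal LM with cross-attention
)
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50")

# Minimal config plumbing needed before fine-tuning for captioning or VQA.
model.config.decoder_start_token_id = tokenizer.lang_code_to_id["en_XX"]
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```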
sections/conclusion_future_work/social_impact.md CHANGED
@@ -1 +1,10 @@
- Multilingual Visual Question Answering has not received a lot of attention. There are very few multilingual VQA datasets, and that is what we wanted to address here. Our initial plan was to include 4 high-resource and 4 low-resource languages in our training data. However, the existing translation models do not perform well enough, and we would have ended up with poor labels, not to mention a longer training time. We hope to improve this in the future by using better translators (e.g., the Google Translate API) to get more multilingual data, especially for low-resource languages. Regardless, our aim with this project was to provide a pipeline approach for multilingual visuo-linguistic pretraining and multilingual Visual Question Answering.
+ Our approach has a significant social impact, considering the sheer number of use cases for this model and dataset.
+ - Translating the data using existing MT models is a huge plus, since multilingual data is scarce (see the sketch at the end of this section).
+ - The model we created is easy to use and easy to train (hassle-free).
+ - A multilingual model that answers questions about an image has many use cases:
+     - Healthcare chatbots
+     - Personal assistants
+     - Devices for visually impaired people
+   and so on.
+
+ With more and better training, we should be able to produce models that work across several languages and help solve several real-life problems for the community.
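
As a companion to the first item above, the sketch below shows how an existing MT model could be used to translate English captions into other languages. It assumes the publicly available mBART-50 many-to-many checkpoint and is illustrative only, not the exact pipeline used for this project.

```python
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

# Assumed checkpoint: the public mBART-50 many-to-many translation model.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

def translate_caption(caption: str, tgt_lang: str = "fr_XX") -> str:
    """Translate an English caption into the target language (mBART-50 language code)."""
    tokenizer.src_lang = "en_XX"
    inputs = tokenizer(caption, return_tensors="pt")
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example: translate a caption into Hindi.
print(translate_caption("A dog is playing in the park.", tgt_lang="hi_IN"))
```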