
In this project, we presented a proof of concept with our CLIP Vision + BERT baseline model, which leverages a multilingual checkpoint together with pre-trained image encoders and covers four languages: English, French, German, and Spanish. Considering the limited training time available to us, the model performs very well, achieving an eval accuracy of 0.49 on our multilingual VQAv2 dataset.