Limitations

Our best fine-tuned model only achieves 0.49 accuracy on the multilingual validation data that we created. This could be due to lower-quality translations, sub-optimal hyperparameters, and insufficient training.

This makes a proper limitations and bias analysis difficult, since any behavior we observe could simply be a result of the subpar performance.

We experimented extensively in our Examples section to see where the model fails and where it works well:

  • The model answers color questions very well. It does get confused where a human would also get confused, however.

  • Counting works decently when the objects are large, of the same type, and not too many in number. However:

    • It fails to learn the difference between, for example, a giraffe and a zebra in a given image: when asked for the count of zebras and giraffes, it answers 0 and 2. See the Counting Questions subsection in Examples.

    • If there are too many objects present and it is asked about something very specific, it fails to perform well and returns large numbers.

    • When the objects are too small in the image, it is unable to focus on them, and when asked for a count, it returns 0.

  • The model performs okay on size and shape questions. We have seen many examples of this, and one is included in the Examples section.

  • Yes/No questions: The performance is similar to that on color questions. The model works very well on obvious questions, but when asked slightly challenging ones (for example, whether a giraffe's eyes are closed in an image taken from far away), it doesn't do as well. A human would also get confused answering such questions, so that is expected.

  • It doesn't handle negation very well. For example, given an image of a happy person, asking "Is the person happy?" leads to "Yes", but "Is the person not happy?" also leads to "Yes". This problem has been observed in BERT models as well and needs to be addressed.

  • It is almost always consistent across languages, giving the same answer. We tried this with counting questions, color questions, and a miscellaneous question (see the probe sketch after this list).

  • It works very well on questions where objects/places are the answers in all four languages.
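To make these spot checks easier to reproduce, here is a minimal sketch of the multilingual consistency probe. Everything in it is a placeholder: `answer()` stands in for the project's actual inference pipeline (the fine-tuned CLIP-ViT + mBART model), the image is synthetic, and the question translations are only examples.

```python
from PIL import Image

def answer(image: Image.Image, question: str) -> str:
    """Hypothetical stand-in for the real inference pipeline: encode the image
    and question with the fine-tuned model and return the top answer string."""
    return "yellow"  # placeholder prediction

# Dummy image so the sketch runs end to end; use a real photo in practice.
image = Image.new("RGB", (224, 224), color="yellow")

# The same question in several languages (example translations only).
questions = {
    "en": "What color is the bus?",
    "fr": "De quelle couleur est le bus ?",
    "de": "Welche Farbe hat der Bus?",
    "es": "¿De qué color es el autobús?",
}

predictions = {lang: answer(image, q) for lang, q in questions.items()}
print("answers per language:", predictions)
print("consistent across languages:", len(set(predictions.values())) == 1)
```

The same loop can be reused for counting or yes/no questions by swapping in different question sets.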

Biases

Our model, like any other model, is prone to biases present in the pre-trained models (CLIP-ViT and mBART). Some bias can also leak in from the mBART-50 and Marian models we used to translate the datasets. The datasets themselves carry some bias as well.

Because we haven't addressed these issues yet, the model can contain sexist, racist, and stereotypical biases, which may be hateful, harmful, or ignorant.

Conceptual-12M has all names and pronouns removed, which could have reduced bias to some extent. We checked for gender and racial/color bias in our examples, but we haven't reached a conclusion on how biased our VQA model is, as the answers are not consistently biased across languages or across the different ways questions are asked. It also depends heavily on performance. For example, given an image of a woman, asking "Is this a man?" might lead to "Yes" simply because of the poor performance itself.
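The kind of spot check described above can also be scripted. The sketch below reuses the same hypothetical `answer()` stand-in from the earlier sketch; it only flags contradictory "yes"/"yes" responses to paired questions and is a probing aid, not a bias metric.

```python
from PIL import Image

def answer(image: Image.Image, question: str) -> str:
    # Hypothetical stand-in for the real inference pipeline (see earlier sketch).
    return "yes"  # placeholder prediction

image = Image.new("RGB", (224, 224))  # replace with a real photo of a person

# Paired questions: a well-behaved model should not answer "yes" to both.
paired_questions = [
    ("Is this a man?", "Is this a woman?"),
    ("Is the person happy?", "Is the person not happy?"),
]

for q1, q2 in paired_questions:
    a1, a2 = answer(image, q1), answer(image, q2)
    contradictory = a1.lower() == "yes" and a2.lower() == "yes"
    print(f"{q1} -> {a1} | {q2} -> {a2} | contradictory: {contradictory}")
```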

We intend to fix these issues with cleaner, better, and more varied sources of training data. Only then can we mitigate the biases which affect users and society.