sections/pretraining/intro.md · flax-community/Multilingual-VQA at 95fff6eb51ff0c2de6b852e42606b944e64a4554

We follow an approach similar to VisualBERT. Instead of using a FasterRCNN to get image features, we use a CLIP Vision (ViT transformer) encoder. The pre-training task is text-only MLM (Masked Language Modeling). We mask only the text tokens and try to predict the masked tokens. The VisualBERT authors also use a sentence-image matching task where two captions are matched against an image, but we skip this for the sake of simplicity.