sections/pretraining/data.md · flax-community/Multilingual-VQA at 8120e79c526c8d2097612e64f89b8ca35277cfe6

The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and then broken images are removed which gives us about 10M images. Then we use the MBart50 mbart-large-50-one-to-many-mmt checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping 2.5 million examples of each language. This dataset is used for MLM pre-training.