sections/pretraining/dataset.md · flax-community/multilingual-image-captioning at b4ca4a1867d5fcbd96bc5844d8a276ef821a7563

The dataset we use for pre-training is a cleaned version of Conceptual 12M. The dataset is downloaded and then broken images are removed which gives us about 10M images. To save time, we use 2.5M of these image-text pairs. Then we use the MarianMT Helsinki-NLP/opus-mt-{src}-{tgt} checkpoint to translate the dataset into four different languages - English, French, German, and Spanish, keeping approximately 2.5M examples of each language.