Our novel contributions include:
- A [multilingual variant of the Conceptual-12M dataset (mBART50)](https://huggingface.co/datasets/flax-community/conceptual-12m-mbart-50-multilingual) containing 2.5M image-text pairs in each of four languages (English, French, German, and Spanish), translated using the mBART-50 model.
- A [multilingual variant of the Conceptual-12M dataset (MarianMT)](https://huggingface.co/datasets/flax-community/conceptual-12m-multilingual-marian) containing 2.5M image-text pairs in each of four languages (English, French, German, and Spanish), translated using the MarianMT model.
- [A fusion of the CLIP Vision Transformer and the mBART-50 model](https://github.com/gchhablani/multilingual-vqa/tree/main/models/flax_clip_vision_bert). It takes visual embeddings from the CLIP Vision Transformer and feeds them in as the `encoder_hidden_states` of an mBART-50 decoder, enabling deep cross-modal interaction via cross-attention between the two models (see the sketch after this list).
- A [pre-trained checkpoint](https://huggingface.co/flax-community/clip-vit-base-patch32_mbart-large-50) of this fusion model on our multilingual Conceptual-12M variant.
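
To make the fusion concrete, here is a minimal sketch of the idea in Flax/JAX. It is not the project's training code: only the two base checkpoints (`openai/clip-vit-base-patch32`, `facebook/mbart-large-50`) are real, while the random 768-to-1024 projection stands in for the learned mapping in the actual model, and the dummy image tensor stands in for a real preprocessed batch.

```python
import jax
import jax.numpy as jnp
from transformers import FlaxCLIPVisionModel, FlaxMBartForConditionalGeneration

# Public base checkpoints; the trained fusion model itself is at
# flax-community/clip-vit-base-patch32_mbart-large-50.
clip = FlaxCLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
mbart = FlaxMBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

# Stand-in for a preprocessed image batch: (batch, channels, height, width).
pixel_values = jnp.ones((1, 3, 224, 224))

# CLIP ViT-B/32 produces one 768-dim embedding per patch plus a CLS token,
# giving a (1, 50, 768) sequence of visual states.
visual_states = clip(pixel_values=pixel_values).last_hidden_state

# Hypothetical projection from CLIP's 768-dim space to mBART's 1024-dim
# hidden size; in the trained model this is a learned layer.
key = jax.random.PRNGKey(0)
proj = jax.random.normal(key, (768, 1024)) * 0.02
encoder_hidden_states = visual_states @ proj

# Feed the projected visual embeddings where the decoder expects the text
# encoder's output, so its cross-attention attends to image patches instead.
decoder_input_ids = jnp.array([[2]])  # mBART-50's decoder start token id
outputs = mbart.decode(
    decoder_input_ids=decoder_input_ids,
    encoder_outputs=(encoder_hidden_states,),
)
print(outputs.logits.shape)  # (1, 1, vocab_size)
```

Reusing a pretrained multilingual decoder this way means the cross-attention only has to learn to read image features, while generation in English, French, German, and Spanish comes from the decoder's existing weights.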