We follow an encoder-decoder approach for image captioning, where the image encoder is the CLIP Vision model (a ViT transformer). The pre-training task is image-to-text generation. To create the decoder inputs, we shift the input tokens one position to the right and prepend an <eos> token, while the original input tokens serve as labels. The model is trained on the dataset in an end-to-end fashion.
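
As a rough illustration of the shift-right step described above, the sketch below shows one common way to build decoder inputs from the label sequence. This is a minimal example assuming NumPy arrays of token ids; the function name, signature, and the `-100` label-padding convention are assumptions modeled on typical seq2seq utilities, not necessarily the exact code used here.

```python
import numpy as np

def shift_tokens_right(labels: np.ndarray, pad_token_id: int, decoder_start_token_id: int) -> np.ndarray:
    """Shift label token ids one position to the right, prepending the
    decoder start token (<eos> in this setup) to form the decoder inputs."""
    shifted = np.zeros_like(labels)
    shifted[:, 1:] = labels[:, :-1]   # each token moves one slot to the right
    shifted[:, 0] = decoder_start_token_id  # <eos> becomes the first decoder input
    # If labels use -100 to mask padding in the loss, map those back to pad ids
    shifted = np.where(shifted == -100, pad_token_id, shifted)
    return shifted

# Example: labels [[15, 27, 3, 2]] with <eos>=2 yield decoder inputs [[2, 15, 27, 3]]
```

At each position the decoder therefore sees only the tokens before it and is trained to predict the corresponding label token, which is what lets the encoder-decoder pair be optimized end to end on the image-to-text objective.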