New open Vision Language Model by @Google: PaliGemma
Comes in a 3B size, with pretrained, mix and fine-tuned checkpoints at 224, 448 and 896 pixel input resolutions
Combines the Gemma 2B LLM with the SigLIP image encoder
Supported in transformers
PaliGemma can do:
Image segmentation and detection!
Detailed document understanding and reasoning
Visual question answering, captioning and other VLM tasks!
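Since PaliGemma is supported in transformers, a captioning call might look roughly like the sketch below. The checkpoint naming pattern (`google/paligemma-3b-<variant>-<resolution>`), the `caption en` prompt, and the helper names `checkpoint_id` and `caption` are assumptions for illustration and not part of the announcement; check the Hub for which variant and resolution pairs are actually published, and note that access to the weights may require accepting the license first.

```python
from typing import Optional


def checkpoint_id(variant: str = "mix", resolution: int = 224) -> str:
    """Build a Hub id such as 'google/paligemma-3b-mix-224'.

    Assumed naming pattern; not every variant/resolution pair
    is necessarily published.
    """
    if variant not in {"pt", "mix"}:
        raise ValueError("variant must be 'pt' or 'mix'")
    if resolution not in {224, 448, 896}:
        raise ValueError("resolution must be 224, 448 or 896")
    return f"google/paligemma-3b-{variant}-{resolution}"


def caption(image_path: str, prompt: str = "caption en",
            model_id: Optional[str] = None) -> str:
    """Hypothetical helper: generate a short caption for one image."""
    # Heavy imports stay local so checkpoint_id() works without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = model_id or checkpoint_id()
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(**inputs, max_new_tokens=50, do_sample=False)

    # generate() returns prompt + continuation; keep only the new tokens.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Calling `caption("photo.jpg")` with a local image would download the weights on first use; swapping the prompt is how the other tasks are usually selected, but the exact prompt strings depend on the checkpoint.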