vision language models (VLM)
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 66
Note Code and models are available on GitHub and Hugging Face. PaliGemma is an open Vision-Language Model (VLM) built on the SigLIP-So400m vision encoder and the Gemma-2B language model. It shows strong performance on a wide variety of open-world tasks: the authors evaluate PaliGemma on almost 40 diverse tasks, including standard VLM benchmarks as well as more specialized tasks such as remote sensing and segmentation.
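A minimal loading sketch (not from the paper), assuming the transformers PaliGemma integration and the google/paligemma-3b-mix-224 checkpoint on Hugging Face; the image URL is just an example:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # released mix checkpoint (assumption: any released variant would work similarly)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# PaliGemma is prompted with short task prefixes such as "caption en" or "detect cat"
inputs = processor(text="caption en", images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(generated[0], skip_special_tokens=True))
```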
Vision language models are blind
Paper • 2407.06581 • Published • 82
Note VLMs struggle with tasks that require precise spatial information and counting (from 0 to 10); it sometimes feels as if the model is near-sighted, unable to see fine details and left to guess.
CosmoCLIP: Generalizing Large Vision-Language Models for Astronomical Imaging
Paper • 2407.07315 • Published • 6
Note A CLIP for astronomy, fine-tuned from a pretrained CLIP.
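A generic sketch of that recipe (not the authors' code): fine-tuning a pretrained Hugging Face CLIPModel with its built-in contrastive loss; the batch arguments are hypothetical stand-ins for astronomical image-caption pairs.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(images, captions):
    # images: list of PIL images, captions: list of strings (hypothetical batch)
    inputs = processor(text=captions, images=images, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model(**inputs, return_loss=True)  # symmetric image-text contrastive loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```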
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Paper • 2407.06189 • Published • 24
Note Instruction tuning? [Q] The first video self-training approach: Video-STaR allows any labeled video dataset to be used for video instruction tuning.
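A hedged sketch of the STaR-style self-training loop suggested by the abstract, not the paper's implementation; generate_answer, answer_contains_label, and finetune are hypothetical placeholders passed in as callables.

```python
def video_star_round(model, labeled_videos, generate_answer, answer_contains_label, finetune):
    """One self-training round: generate answers, verify them against existing labels, retrain."""
    kept = []
    for video, label in labeled_videos:
        answer = generate_answer(model, video)      # model proposes an instruction-following answer
        if answer_contains_label(answer, label):    # keep only answers consistent with the original label
            kept.append((video, answer))
    return finetune(model, kept)                    # fine-tune on the verified (video, answer) pairs
```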
Unveiling Encoder-Free Vision-Language Models
Paper • 2406.11832 • Published • 49
Note Develops a pure decoder-only architecture across modalities. Why can language models work without an encoder? Can vision models also work without one? How? Worth compiling the different Transformer architectures used by large models; [2R]
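A rough sketch of the encoder-free idea as I understand it (my assumption, not the paper's code): image patches bypass a pretrained vision encoder and are linearly projected straight into the decoder-only transformer's token stream, alongside the text embeddings.

```python
import torch
import torch.nn as nn

class PatchToTokens(nn.Module):
    """Map raw image patches to decoder tokens with a single projection, no ViT encoder."""
    def __init__(self, patch_size=14, d_model=2048):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(3 * patch_size * patch_size, d_model)

    def forward(self, images):  # images: (B, 3, H, W)
        p = self.patch_size
        patches = images.unfold(2, p, p).unfold(3, p, p)        # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)  # (B, H/p, W/p, 3*p*p)
        patches = patches.flatten(1, 2)                         # (B, N, 3*p*p)
        return self.proj(patches)                               # visual "tokens" for the decoder

# usage: concatenate with text token embeddings and feed the decoder-only LM
vision_tokens = PatchToTokens()(torch.randn(1, 3, 224, 224))
print(vision_tokens.shape)  # torch.Size([1, 256, 2048])
```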