Multimodal Language Model
What matters besides the data recipe when training a multimodal language model?
Paper • 2408.03326 • Published • 59
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 39
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 67
openbmb/MiniCPM-V-2_6
Image-Text-to-Text • Updated • 106k • 831
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 97
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124
OpenGVLab/InternViT-6B-448px-V1-2
Image Feature Extraction • Updated • 176 • 25
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 53
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 16
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 8
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 56
Note: 1. The intra-image bidirectional attention is important; replacing it with causal attention hurts text-to-image generation. 2. There is a clear advantage to using U-Net up and down blocks instead of a simple linear layer for modality mapping.
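A minimal sketch of the attention-mask idea from point 1: the sequence stays causal overall, but tokens inside the same image span attend to each other bidirectionally. The mask convention (True = may attend) and the `image_spans` argument are assumptions for illustration, not Transfusion's actual implementation.
```python
import torch

def transfusion_attn_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Causal mask with bidirectional attention inside each image span."""
    # True means "may attend".
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    for start, end in image_spans:  # end is exclusive
        mask[start:end, start:end] = True  # intra-image bidirectional attention
    return mask

# Example: 4 text tokens, then an 8-patch image, then 4 more text tokens.
mask = transfusion_attn_mask(16, image_spans=[(4, 12)])
```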
LISA: Reasoning Segmentation via Large Language Model
Paper • 2308.00692 • Published • 1
Note: 1. Extract the [SEG] token's feature from the last hidden layer of the LLM and project it to the SAM decoder. 2. Joint training with pixel-level understanding data often leads to decreased image-level capability.
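A rough sketch of the hook in point 1: pick out the [SEG] token's last-hidden-layer feature and project it into the space expected by the SAM mask decoder. The hidden sizes and the two-layer projector are placeholder choices, not LISA's exact architecture.
```python
import torch
import torch.nn as nn

llm_dim, sam_prompt_dim = 4096, 256  # assumed dimensions
seg_projector = nn.Sequential(
    nn.Linear(llm_dim, llm_dim), nn.ReLU(), nn.Linear(llm_dim, sam_prompt_dim)
)

def seg_embedding(last_hidden: torch.Tensor, seg_positions: torch.Tensor) -> torch.Tensor:
    # last_hidden: (batch, seq_len, llm_dim); seg_positions: (batch,) index of [SEG] per sample
    seg_feat = last_hidden[torch.arange(last_hidden.size(0)), seg_positions]
    return seg_projector(seg_feat)  # used as a prompt embedding for the SAM decoder
```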
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 51
Note: 1. Image encoder: a ConvNeXt-L-based CLIP model to reach high resolution. 2. Directly combining a frozen perception module with an LLM doesn't perform well. 3. Use a simple MLP to map the LLM's output hidden states of the [SEG] token to the visual space. 4. Proposes a good region-encoder design adapted from the pre-trained image encoder. 5. Responses take the form "Expression [SEG]."; since the expression is flexible and variable, the LLM is less likely to overfit to a fixed response.
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 39
Note: 1. A Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible. 2. A Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., 6x downsampling) to focus on motion cues. 3. Concatenate the two and the result is a good video representation even without any training.
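A training-free sketch of the two pathways in the note: the Slow path keeps a few frames at finer spatial resolution, the Fast path keeps all frames but pools much harder, and the two token sets are concatenated. Frame counts and pooling kernels below are illustrative, not the paper's exact settings.
```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (T, C, H, W) per-frame features from a frozen visual encoder
    T = frame_feats.size(0)
    slow = frame_feats[:: max(T // 8, 1)]              # low frame rate, keep spatial detail
    slow = F.avg_pool2d(slow, kernel_size=2)           # light spatial pooling
    fast = F.avg_pool2d(frame_feats, kernel_size=6)    # all frames, heavy pooling (e.g., 6x)
    flatten = lambda x: x.flatten(2).permute(0, 2, 1).reshape(-1, x.size(1))
    return torch.cat([flatten(slow), flatten(fast)], dim=0)  # (num_tokens, C)

tokens = slowfast_tokens(torch.randn(48, 1024, 24, 24))
```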
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Paper • 2311.05698 • Published • 9
Note: 1. Scales to 512 input video frames with the Token Turing Machine combiner. 2. 'Process' is implemented with a standard Transformer (layers of MHA and MLPs); 'Read', 'Write', and 'Output' are implemented with attention pooling.
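A sketch of the attention-pooling primitive the note refers to: a small set of learned queries cross-attends to a longer token sequence and returns a fixed, compressed set of tokens; the Read/Write/Output functions can all be built from this shape. Sizes are placeholders, not Mirasol3B's configuration.
```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int = 512, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # learned queries attend to the inputs
        return out

pooled = AttentionPool()(torch.randn(2, 4096, 512))  # compress a long video chunk
```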
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 119
Note: 1. Training method: progressively introduce higher-quality data, gradually increase the maximum image resolution, and unfreeze more model parts over the course of training. 2. Dataset: with image deduplication, it is possible to train on just half of the LAION dataset with only a minimal reduction in performance compared to using the full dataset.
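One simple way to do the image deduplication mentioned in point 2 is perceptual hashing with the `imagehash` package, dropping near-duplicate images before training. The hash choice and threshold are assumptions for illustration, not the paper's actual pipeline.
```python
from PIL import Image
import imagehash

def dedup(paths: list[str], max_hamming: int = 4) -> list[str]:
    """Keep only images whose perceptual hash differs enough from all kept ones."""
    kept, hashes = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # imagehash subtraction returns the Hamming distance between hashes
        if all(h - prev > max_hamming for prev in hashes):
            kept.append(path)
            hashes.append(h)
    return kept
```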
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 83
Note: 1. Unfreezing the CLIP encoder significantly helps when interpolating to a higher MLLM input resolution that differs from the pre-training resolution. 2. Introduces a pre-alignment training stage: train each pre-trained vision expert with its own projector on SFT data while keeping the language model frozen.
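A sketch of the freezing pattern for the pre-alignment stage in point 2: the language model stays frozen while each vision expert and its own projector are trainable. `experts`, `projectors`, and `llm` are placeholder modules, not Eagle's actual classes.
```python
import torch.nn as nn

def setup_pre_alignment(experts: nn.ModuleList, projectors: nn.ModuleList, llm: nn.Module):
    for p in llm.parameters():
        p.requires_grad = False           # language model frozen for this stage
    trainable = []
    for expert, proj in zip(experts, projectors):
        for p in expert.parameters():
            p.requires_grad = True        # tune the vision expert
        trainable += list(expert.parameters()) + list(proj.parameters())
    return trainable                      # hand these parameters to the optimizer
```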
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 72
allenai/Molmo-7B-D-0924
Image-Text-to-Text • Updated • 70.5k • 441
meta-llama/Llama-3.2-11B-Vision-Instruct
Image-Text-to-Text • Updated • 2.27M • 989
Video Instruction Tuning With Synthetic Data
Paper • 2410.02713 • Published • 37
Note: 1. Arrange slow and fast frames in an interleaving pattern, applying p × p pooling to slow frames and 2p × 2p pooling to fast frames. 2. Use a tagging model, InsTag (https://arxiv.org/pdf/2308.07074), to categorize the video content.
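A sketch of the interleaving in point 1: "slow" frames keep more spatial tokens (p × p pooling) while "fast" frames are pooled harder (2p × 2p), alternating along the time axis. The one-slow-frame-per-group schedule and the pooling sizes are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def interleave_slowfast(frames: torch.Tensor, p: int = 2, slow_every: int = 4) -> list[torch.Tensor]:
    # frames: (T, C, H, W) per-frame visual features; returns one token grid per frame
    out = []
    for t, feat in enumerate(frames):
        k = p if t % slow_every == 0 else 2 * p       # slow frames: p x p; fast frames: 2p x 2p
        pooled = F.avg_pool2d(feat.unsqueeze(0), k)   # (1, C, H/k, W/k)
        out.append(pooled.flatten(2).squeeze(0).T)    # (tokens, C)
    return out

tokens = interleave_slowfast(torch.randn(16, 1024, 24, 24))
```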
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Paper • 2410.16267 • Published • 15
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 75
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Paper • 2410.17434 • Published • 24
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 36