Multimodal Language Model
What matters besides the data recipe when training a multimodal language model?
Paper • 2408.03326 • Published • 59
VILA^2: VILA Augmented VILA
Paper • 2407.17453 • Published • 39
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 67
openbmb/MiniCPM-V-2_6
Image-Text-to-Text • Updated • 106k • 831
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published • 97
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124
OpenGVLab/InternViT-6B-448px-V1-2
Image Feature Extraction • Updated • 176 • 25
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821 • Published • 53
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper • 2403.11703 • Published • 16
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Paper • 2406.20076 • Published • 8
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Paper • 2408.11039 • Published • 56
Note: 1. The intra-image bidirectional attention is important; replacing it with causal attention hurts text-to-image generation. 2. There is a clear advantage to using U-Net up and down blocks instead of a simple linear layer for modality mapping.
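A minimal sketch of the attention-mask idea from point 1: the sequence stays causal overall, but tokens inside the same image span attend to each other bidirectionally. The mask convention (True = may attend) and the `image_spans` argument are assumptions for illustration, not Transfusion's actual implementation.
```python
import torch

def transfusion_attn_mask(seq_len: int, image_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Causal mask with bidirectional attention inside each image span."""
    # True means "may attend".
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal base
    for start, end in image_spans:  # end is exclusive
        mask[start:end, start:end] = True  # intra-image bidirectional attention
    return mask

# Example: 4 text tokens, then an 8-patch image, then 4 more text tokens.
mask = transfusion_attn_mask(16, image_spans=[(4, 12)])
```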
LISA: Reasoning Segmentation via Large Language Model
Paper • 2308.00692 • Published • 1
Note: 1. Extract the [SEG] token's feature from the last hidden layer of the LLM and project it to the SAM decoder. 2. Joint training with pixel-level understanding data often leads to decreased image-level capability.
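A rough sketch of the hook in point 1: pick out the [SEG] token's last-hidden-layer feature and project it into the space expected by the SAM mask decoder. The hidden sizes and the two-layer projector are placeholder choices, not LISA's exact architecture.
```python
import torch
import torch.nn as nn

llm_dim, sam_prompt_dim = 4096, 256  # assumed dimensions
seg_projector = nn.Sequential(
    nn.Linear(llm_dim, llm_dim), nn.ReLU(), nn.Linear(llm_dim, sam_prompt_dim)
)

def seg_embedding(last_hidden: torch.Tensor, seg_positions: torch.Tensor) -> torch.Tensor:
    # last_hidden: (batch, seq_len, llm_dim); seg_positions: (batch,) index of [SEG] per sample
    seg_feat = last_hidden[torch.arange(last_hidden.size(0)), seg_positions]
    return seg_projector(seg_feat)  # used as a prompt embedding for the SAM decoder
```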
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 51
Note: 1. Image encoder: a ConvNeXt-L-based CLIP model to reach high resolution. 2. Directly combining a frozen perception module with an LLM doesn't perform well. 3. Use a simple MLP to map the LLM's output hidden states of the [SEG] token to the visual space. 4. Proposes a good region-encoder design adapted from the pre-trained image encoder. 5. Responses take the form "Expression [SEG]."; since the expression is flexible and variable, the LLM is less likely to overfit to a fixed response.
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Paper • 2407.15841 • Published • 39
Note: 1. A Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible. 2. A Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., 6x downsampling) to focus on motion cues. 3. Concatenate the two and the result is a good video representation even without any training.
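A training-free sketch of the two pathways in the note: the Slow path keeps a few frames at finer spatial resolution, the Fast path keeps all frames but pools much harder, and the two token sets are concatenated. Frame counts and pooling kernels below are illustrative, not the paper's exact settings.
```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (T, C, H, W) per-frame features from a frozen visual encoder
    T = frame_feats.size(0)
    slow = frame_feats[:: max(T // 8, 1)]              # low frame rate, keep spatial detail
    slow = F.avg_pool2d(slow, kernel_size=2)           # light spatial pooling
    fast = F.avg_pool2d(frame_feats, kernel_size=6)    # all frames, heavy pooling (e.g., 6x)
    flatten = lambda x: x.flatten(2).permute(0, 2, 1).reshape(-1, x.size(1))
    return torch.cat([flatten(slow), flatten(fast)], dim=0)  # (num_tokens, C)

tokens = slowfast_tokens(torch.randn(48, 1024, 24, 24))
```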
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
Paper • 2311.05698 • Published • 9
Note: 1. Scales to 512 input video frames with the Token Turing Machine combiner. 2. 'Process' is implemented with a standard Transformer (layers of MHA and MLPs); 'Read', 'Write', and 'Output' are implemented with attention pooling.
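A sketch of the attention-pooling primitive the note refers to: a small set of learned queries cross-attends to a longer token sequence and returns a fixed, compressed set of tokens; the Read/Write/Output functions can all be built from this shape. Sizes are placeholders, not Mirasol3B's configuration.
```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim: int = 512, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_tokens, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # learned queries attend to the inputs
        return out

pooled = AttentionPool()(torch.randn(2, 4096, 512))  # compress a long video chunk
```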
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 119
Note: 1. Training method: progressively introduce higher-quality data, gradually increase the maximum image resolution, and unfreeze more model parts over the course of training. 2. Dataset: with image deduplication, it is possible to train on just half of the LAION dataset with only a minimal reduction in performance compared to using the full dataset.
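One simple way to do the image deduplication mentioned in point 2 is perceptual hashing with the `imagehash` package, dropping near-duplicate images before training. The hash choice and threshold are assumptions for illustration, not the paper's actual pipeline.
```python
from PIL import Image
import imagehash

def dedup(paths: list[str], max_hamming: int = 4) -> list[str]:
    """Keep only images whose perceptual hash differs enough from all kept ones."""
    kept, hashes = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))
        # imagehash subtraction returns the Hamming distance between hashes
        if all(h - prev > max_hamming for prev in hashes):
            kept.append(path)
            hashes.append(h)
    return kept
```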
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 83
Note: 1. Unfreezing the CLIP encoder significantly helps when interpolating to a higher MLLM input resolution that differs from the pre-training resolution. 2. Introduces a pre-alignment training stage: train each pre-trained vision expert with its own projector on SFT data while keeping the language model frozen.
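A sketch of the freezing pattern for the pre-alignment stage in point 2: the language model stays frozen while each vision expert and its own projector are trainable. `experts`, `projectors`, and `llm` are placeholder modules, not Eagle's actual classes.
```python
import torch.nn as nn

def setup_pre_alignment(experts: nn.ModuleList, projectors: nn.ModuleList, llm: nn.Module):
    for p in llm.parameters():
        p.requires_grad = False           # language model frozen for this stage
    trainable = []
    for expert, proj in zip(experts, projectors):
        for p in expert.parameters():
            p.requires_grad = True        # tune the vision expert
        trainable += list(expert.parameters()) + list(proj.parameters())
    return trainable                      # hand these parameters to the optimizer
```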
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 72
allenai/Molmo-7B-D-0924
Image-Text-to-Text • Updated • 70.5k • 441
meta-llama/Llama-3.2-11B-Vision-Instruct
Image-Text-to-Text • Updated • 2.27M • 989
Video Instruction Tuning With Synthetic Data
Paper • 2410.02713 • Published • 37
Note: 1. Arrange slow and fast frames in an interleaving pattern, applying p × p pooling to slow frames and 2p × 2p pooling to fast frames. 2. Use a tagging model, InsTag (https://arxiv.org/pdf/2308.07074), to categorize the video content.
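A sketch of the interleaving in point 1: "slow" frames keep more spatial tokens (p × p pooling) while "fast" frames are pooled harder (2p × 2p), alternating along the time axis. The one-slow-frame-per-group schedule and the pooling sizes are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def interleave_slowfast(frames: torch.Tensor, p: int = 2, slow_every: int = 4) -> list[torch.Tensor]:
    # frames: (T, C, H, W) per-frame visual features; returns one token grid per frame
    out = []
    for t, feat in enumerate(frames):
        k = p if t % slow_every == 0 else 2 * p       # slow frames: p x p; fast frames: 2p x 2p
        pooled = F.avg_pool2d(feat.unsqueeze(0), k)   # (1, C, H/k, W/k)
        out.append(pooled.flatten(2).squeeze(0).T)    # (tokens, C)
    return out

tokens = interleave_slowfast(torch.randn(16, 1024, 24, 24))
```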
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Paper • 2410.16267 • Published • 15
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 75
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Paper • 2410.17434 • Published • 24
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper • 2411.14402 • Published • 36