SliMM: A Simple LMM baseline with Dynamic Visual Resolution 🚀

[🌐 Project Page] [📚 Paper]

🔥 Latest Update

Introduction

  • Advanced Techniques: We incorporate native dynamic resolution, as used in Qwen2-VL, for high-resolution visual encoding, replacing the cumbersome Multi-Crop/AnyRes methods used previously. Building on DeepStack [1], we keep the same principle of inserting stacked visual tokens into multiple layers of the LLM. We propose two enhanced versions for native-resolution vision encoding: DeepStack-MidLayers, which improves performance with negligible additional FLOPs by stacking multi-level visual tokens from the middle layers of the vision encoder, and DeepStack-Efficient, which reduces visual token usage while maintaining high performance.

  • Seamless Integration: Easily use LLaVA-format training data in our codebase.

  • Training Efficiency: Fine-tuning on the 748K LLaVA-Next-DATA for one epoch takes only 4 hours for the 0.5B/2B Qwen2 models and 6 hours for the 7B model on 8xH100, which is more than 2x faster than the LLaVA-OV codebase.

  • Strong Baseline Model for Small LMMs: We establish a robust baseline using widely used, publicly available datasets, including LCS-758K (Stage 1), LLaVA-OV-MidStage (Stage 1.5), and LLaVA-OneVision SI (Stage 2).

    [1] DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
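
As a rough illustration of the DeepStack principle mentioned above (inserting stacked visual tokens into multiple LLM layers), the toy sketch below adds groups of extra visual tokens residually to the hidden states at the visual-token positions of successive layers, so the sequence length does not grow. It uses plain Python lists in place of tensors. The names `deepstack_forward`, `visual_slice`, and `stacked_tokens` are hypothetical and chosen for illustration only; this is not SliMM's actual implementation or API.

```python
# Toy sketch of the DeepStack idea with plain lists instead of tensors.
# All names are hypothetical and for illustration only.

def add_vectors(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def deepstack_forward(hidden, layers, visual_slice, stacked_tokens, start_layer=0):
    """Run `layers` over `hidden`, injecting one group of stacked visual
    tokens before each layer from `start_layer` on, until groups run out.

    hidden:         list of token vectors (text and visual tokens)
    layers:         list of callables mapping hidden states to hidden states
    visual_slice:   slice covering the visual-token positions in the sequence
    stacked_tokens: list of groups, each with one vector per visual position
    """
    positions = range(*visual_slice.indices(len(hidden)))
    for i, layer in enumerate(layers):
        group_idx = i - start_layer
        if 0 <= group_idx < len(stacked_tokens):
            hidden = list(hidden)  # shallow copy so the caller's list is untouched
            for pos, extra in zip(positions, stacked_tokens[group_idx]):
                # Residual add at visual positions only; no new tokens are appended.
                hidden[pos] = add_vectors(hidden[pos], extra)
        hidden = layer(hidden)
    return hidden


# Usage: identity "layers" make the effect of the injection easy to see.
identity = lambda h: h
hidden = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [2.0, 2.0]]  # text, vis, vis, text
stacked = [
    [[0.5, 0.5], [0.25, 0.25]],  # group injected before layer 0
    [[0.5, 0.5], [0.25, 0.25]],  # group injected before layer 1
]
out = deepstack_forward(hidden, [identity] * 3, slice(1, 3), stacked)
print(out)  # text tokens are unchanged; only positions 1 and 2 differ
```

In a real model the layers would be transformer blocks and the stacked groups would come from the vision encoder (for DeepStack-MidLayers, from its middle layers, per the description above).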

Quick Start

git clone https://github.com/MengLcool/SliMM.git
cd SliMM
pip install -e .

# Usage is very similar to Qwen2-VL
from slimm.model.processor import SliMMQwen2VLProcessor
from slimm.model.slimm import SliMMForConditionalGeneration
from slimm.model.utils_vl import process_vision_info

model_path = "menglc/SliMM-DeepStackM-Qwen2-0.5B"

model = SliMMForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

processor = SliMMQwen2VLProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to wherever device_map="auto" placed the model
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
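
With native dynamic resolution, the number of visual tokens the LLM sees depends on the input image size instead of a fixed crop grid. The sketch below estimates that count under assumptions modeled on Qwen2-VL-style preprocessing: 14-pixel patches, 2x2 patch merging, and a pixel budget that bounds the resized image. The constants and the helper names `fit_resolution` and `visual_token_count` are assumptions for illustration; the values SliMM actually uses may differ.

```python
import math

PATCH = 14               # assumed ViT patch size
MERGE = 2                # assumed 2x2 patch merging before tokens reach the LLM
FACTOR = PATCH * MERGE   # image sides are snapped to multiples of 28


def fit_resolution(height, width, min_pixels=56 * 56,
                   max_pixels=1280 * FACTOR * FACTOR):
    """Snap (height, width) to multiples of FACTOR, keeping the pixel count
    within [min_pixels, max_pixels] and roughly preserving the aspect ratio."""
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > max_pixels:
        # Too large: shrink both sides by a common scale, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / FACTOR) * FACTOR
        w = math.floor(width / scale / FACTOR) * FACTOR
    elif h * w < min_pixels:
        # Too small: enlarge both sides by a common scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return h, w


def visual_token_count(height, width, **kwargs):
    """Number of merged visual tokens for an image of the given size."""
    h, w = fit_resolution(height, width, **kwargs)
    return (h // FACTOR) * (w // FACTOR)


print(visual_token_count(448, 448))    # square image
print(visual_token_count(448, 672))    # wider image yields more tokens
print(visual_token_count(2800, 2800))  # large image is capped by max_pixels
```

Under these assumptions the token count scales with image area until the pixel budget is reached, which is the property that lets a single forward pass replace Multi-Crop/AnyRes tiling.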

Benchmarks

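| Benchmark                   | MMMU (Val) | ChartQA (Test) | AI2D (Test) | DocVQA (Val) |
|-----------------------------|------------|----------------|-------------|--------------|
| NanoLLaVA-Qwen1.5-0.5B      | 28.6       | NA             | NA          | NA           |
| OmniVLM v1                  | 39.9       | 59.2           | NA          | NA           |
| OmniVLM v2                  | 40.0       | 61.9           | NA          | NA           |
| LLaVA-OV-SI-Qwen2.5-0.5B    | 31.2       | 61.0           | 54.2        | 75.0         |
| LLaVA-OV-Qwen2.5-0.5B       | 31.4       | 61.4           | 57.1        | 73.7         |
| SliMM-Qwen2-0.5B            | 30.6       | 64.2           | 58.4        | 77.0         |
| SliMM-DeepStackM-Qwen2-0.5B | 31.4       | 65.2           | 60.3        | 77.7         |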

🔗 Citation

If you find our work helpful, please consider citing our paper 📎 and starring our repo 🌟:

@inproceedings{meng2024deepstack,
  title={DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs},
  author={Meng, Lingchen and Yang, Jianwei and Tian, Rui and Dai, Xiyang and Wu, Zuxuan and Gao, Jianfeng and Jiang, Yu-Gang},
  booktitle={NeurIPS},
  year={2024}
}