menglc committed on
Commit
202e5c0
·
1 Parent(s): 66f8920

update readme

Files changed (1)
  1. README.md +104 -0
README.md CHANGED
@@ -4,3 +4,107 @@ license_name: tongyi-qwen
  license_link: >-
  https://github.com/QwenLM/Qwen/blob/main/Tongyi%20Qianwen%20LICENSE%20AGREEMENT
  ---
+ # SliMM: A Simple LMM Baseline with Dynamic Visual Resolution 🚀
+
+ [[🌐 Project Page](https://deepstack-vl.github.io/)]
+ [[📚 Paper](https://arxiv.org/abs/2406.04334)]
+
+
+ ## 🔥 Latest Update
+ * [2024/12/12] Our [first version](https://huggingface.co/collections/menglc/slimm-675bd737c2965037a6b52d05) is out! We release a strong 0.5B baseline model, [SliMM-Qwen2-0.5B](https://huggingface.co/menglc/SliMM-Qwen2-0.5B), and an advanced baseline, [SliMM-DeepStackM-Qwen2-0.5B](https://huggingface.co/menglc/SliMM-DeepStackM-Qwen2-0.5B). We also release a strong 2B model, [SliMM-DeepStackE-Qwen2VL-2B](https://huggingface.co/menglc/SliMM-DeepStackE-Qwen2VL-2B), continually fine-tuned from Qwen2VL-2B, which passes 4x fewer visual tokens to the LLM. Training scripts are available [here]()!
+
+
+
+ ## Introduction
+
+ * **Advanced Techniques**: We incorporate native dynamic resolution, as used in Qwen2-VL, for high-resolution visual encoding, replacing the previous cumbersome Multi-Crop/AnyRes methods. Moreover, building on DeepStack [1], we keep the same principle of inserting stacked visual tokens into **multiple layers** of the LLM. We propose two enhanced versions for native-resolution vision encoding: DeepStack-MidLayers, which improves performance with negligible additional FLOPs by stacking multi-level visual tokens from the middle layers of the vision encoder, and DeepStack-Efficient, which reduces visual token usage while maintaining high performance.
+ * **Seamless Integration**: Easily use LLaVA-format training data in our codebase.
+ * **Training Efficiency**: Fine-tuning on the 748K LLaVA-Next-DATA for one epoch takes only 4 hours for the 0.5B/2B Qwen2 models and 6 hours for a 7B model on 8xH100, which is more than 2x faster than the LLaVA-OV codebase.
+ * **Strong Baseline Model for Small LMMs**: We establish a robust baseline using widely used, publicly available datasets, including LCS-758K (Stage 1), LLaVA-OV-MidStage (Stage 1.5), and LLaVA-OneVision SI (Stage 2).
+
+ [1] *DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs*
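The principle of inserting stacked visual tokens into multiple LLM layers can be illustrated with a toy sketch. This is not the SliMM implementation: the function name `forward_with_stacked_tokens`, the plain-list representation of hidden states, and the choice to add the extra tokens residually at the visual positions before selected layers are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
TokenSequence = List[Vector]


def forward_with_stacked_tokens(
    hidden: TokenSequence,
    layers: List[Callable[[TokenSequence], TokenSequence]],
    stacked_tokens: Dict[int, TokenSequence],
    visual_span: Tuple[int, int],
) -> TokenSequence:
    """Run `layers` in order, adding extra visual features before chosen layers.

    `stacked_tokens[i]` (hypothetical layout) holds one extra feature vector per
    visual position; it is added to the hidden states in `visual_span` right
    before layer `i`. The sequence length never grows.
    """
    start, stop = visual_span
    hidden = [list(vec) for vec in hidden]  # copy so the caller's input is untouched
    for index, layer in enumerate(layers):
        extra = stacked_tokens.get(index)
        if extra is not None:
            if len(extra) != stop - start:
                raise ValueError("stacked tokens must match the visual span length")
            for pos, vec in zip(range(start, stop), extra):
                # Residual addition at the visual token positions only.
                hidden[pos] = [h + e for h, e in zip(hidden[pos], vec)]
        hidden = layer(hidden)
    return hidden


if __name__ == "__main__":
    identity_layers = [lambda h: h] * 3  # stand-ins for transformer blocks
    tokens = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # positions 1..2 are "visual"
    extras = {0: [[1.0, 1.0], [1.0, 1.0]], 1: [[10.0, 10.0], [10.0, 10.0]]}
    print(forward_with_stacked_tokens(tokens, identity_layers, extras, (1, 3)))
```

Because the extra features are added in place rather than appended, the LLM's context length stays fixed no matter how many groups of visual tokens are stacked.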
+
+ ## Quick Start
+
+
+ ```bash
+ git clone https://github.com/MengLcool/SliMM.git
+ cd SliMM
+ pip install -e .
+ ```
+
+ ```python
+ # Usage closely follows Qwen2-VL.
+ from slimm.model.processor import SliMMQwen2VLProcessor
+ from slimm.model.slimm import SliMMForConditionalGeneration
+ from slimm.model.utils_vl import process_vision_info
+
+ model_path = "menglc/SliMM-DeepStackM-Qwen2-0.5B"
+
+ model = SliMMForConditionalGeneration.from_pretrained(
+     model_path, torch_dtype="auto", device_map="auto"
+ )
+
+ processor = SliMMQwen2VLProcessor.from_pretrained(model_path)
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+             },
+             {"type": "text", "text": "Describe this image."},
+         ],
+     }
+ ]
+
+ # Prepare the inputs: render the chat template, then load the vision inputs
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)  # follow wherever device_map="auto" placed the model
+
+ # Inference: generate, then drop the prompt tokens from each output sequence
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
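The trimming step in the snippet above is easy to misread, so here is a dependency-free sketch of what it does. It assumes, as is usual for decoder-only models in Hugging Face Transformers, that `generate` returns each prompt followed by the newly generated tokens; the helper name `trim_prompt` and the token ids are made up for illustration.

```python
from typing import List


def trim_prompt(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the prompt prefix from each generated sequence.

    Each row of `generated_ids` is assumed to start with the matching row of
    `input_ids` (including any padding), so slicing by the prompt length
    leaves only the new tokens.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


if __name__ == "__main__":
    prompts = [[0, 4, 5], [1, 2, 3]]  # 0 stands in for a padding token id
    outputs = [[0, 4, 5, 9, 9], [1, 2, 3, 7, 8]]
    print(trim_prompt(prompts, outputs))
```

Decoding only the trimmed ids keeps the prompt text (and the chat template markup) out of `output_text`.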
+
+ ## Benchmarks
+
+ | Model                       | MMMU (val) | ChartQA (test) | AI2D (test) | DocVQA (val) |
+ |-----------------------------|------------|----------------|-------------|--------------|
+ | NanoLLaVA-Qwen1.5-0.5B      | 28.6       | NA             | NA          | NA           |
+ | OmniVLM v1                  | 39.9       | 59.2           | NA          | NA           |
+ | OmniVLM v2                  | **40.0**   | 61.9           | NA          | NA           |
+ | LLaVA-OV-SI-Qwen2.5-0.5B    | 31.2       | 61.0           | 54.2        | 75.0         |
+ | LLaVA-OV-Qwen2.5-0.5B       | 31.4       | 61.4           | 57.1        | 73.7         |
+ | SliMM-Qwen2-0.5B            | 30.6       | 64.2           | 58.4        | 77.0         |
+ | SliMM-DeepStackM-Qwen2-0.5B | **31.4**   | **65.2**       | **60.3**    | **77.7**     |
+
+ ## 🔗 Citation
+ If you find our work helpful, please consider citing our paper :paperclip: and starring our repo :star2::
+
+ ```bibtex
+ @inproceedings{meng2024deepstack,
+   title={DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs},
+   author={Meng, Lingchen and Yang, Jianwei and Tian, Rui and Dai, Xiyang and Wu, Zuxuan and Gao, Jianfeng and Jiang, Yu-Gang},
+   booktitle={NeurIPS},
+   year={2024}
+ }
+ ```