prince-canuma and doubility123 committed
Commit 426138a (0 parents)

Duplicate from deepseek-ai/deepseek-vl2-tiny

Co-authored-by: Wen Liu <doubility123@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,142 @@
+ ---
+ license: other
+ license_name: deepseek
+ license_link: https://github.com/deepseek-ai/DeepSeek-LLM/blob/HEAD/LICENSE-MODEL
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ ## 1. Introduction
+
+ Introducing DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL. DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. The series comprises three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively.
+ DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters than existing open-source dense and MoE-based models.
+
+
+ [DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding](https://arxiv.org/abs/2412.10302)
+
+ [**GitHub Repository**](https://github.com/deepseek-ai/DeepSeek-VL2)
+
+
+ Zhiyu Wu*, Xiaokang Chen*, Zizheng Pan*, Xingchao Liu*, Wen Liu**, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan*** (* Equal Contribution, ** Project Lead, *** Corresponding author)
+
+ ![](https://github.com/deepseek-ai/DeepSeek-VL2/tree/main/images/vl2_teaser.jpeg)
+
+
+ ## 2. Model Summary
+
+ DeepSeek-VL2-Tiny is built on DeepSeekMoE-3B (1.0B activated parameters).
+
+
+ ## 3. Quick Start
+
+ ### Installation
+
+ In a `Python >= 3.8` environment, install the necessary dependencies by running the following command from a local clone of the [GitHub repository](https://github.com/deepseek-ai/DeepSeek-VL2):
+
+ ```shell
+ pip install -e .
+ ```
+
+ ### Notifications
+ 1. We suggest using a temperature T <= 0.7 when sampling; we observe that larger temperatures decrease generation quality (see the sampling sketch after this list).
+ 2. To keep the number of tokens manageable within the context window, we apply a dynamic tiling strategy when there are <= 2 images. When there are >= 3 images, we directly pad the images to 384x384 as inputs, without tiling.
+ 3. The main difference between DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2 is the base LLM.
+
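+ The snippet below is a minimal sketch of point 1: it reuses the generation call from the example that follows (so `vl_gpt`, `tokenizer`, `inputs_embeds`, and `prepare_inputs` are defined there) and simply switches on sampling with a temperature at or below 0.7. The `top_p` value is illustrative and not part of the original card.
+
+ ```python
+ # Sampling variant of the generation call; temperature kept <= 0.7 as suggested above.
+ outputs = vl_gpt.language_model.generate(
+     inputs_embeds=inputs_embeds,
+     attention_mask=prepare_inputs.attention_mask,
+     pad_token_id=tokenizer.eos_token_id,
+     bos_token_id=tokenizer.bos_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     max_new_tokens=512,
+     do_sample=True,    # enable sampling instead of greedy decoding
+     temperature=0.7,   # suggested upper bound
+     top_p=0.9,         # illustrative value, not from the original card
+     use_cache=True,
+ )
+ ```
+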
+ ### Simple Inference Example
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ from deepseek_vl.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
+ from deepseek_vl.utils.io import load_pil_images
+
+
+ # specify the path to the model (this card hosts the tiny variant)
+ model_path = "deepseek-ai/deepseek-vl2-tiny"
+ vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
+ tokenizer = vl_chat_processor.tokenizer
+
+ vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
+ vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
+
+ ## single image conversation example
+ conversation = [
+     {
+         "role": "<|User|>",
+         "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
+         "images": ["./images/visual_grounding.jpeg"],
+     },
+     {"role": "<|Assistant|>", "content": ""},
+ ]
+
+ ## multiple images (or in-context learning) conversation example
+ # conversation = [
+ #     {
+ #         "role": "<|User|>",
+ #         "content": "<image>A dog wearing nothing in the foreground, "
+ #                    "<image>a dog wearing a santa hat, "
+ #                    "<image>a dog wearing a wizard outfit, and "
+ #                    "<image>what's the dog wearing?",
+ #         "images": [
+ #             "images/dog_a.png",
+ #             "images/dog_b.png",
+ #             "images/dog_c.png",
+ #             "images/dog_d.png",
+ #         ],
+ #     },
+ #     {"role": "<|Assistant|>", "content": ""}
+ # ]
+
+ # load images and prepare for inputs
+ pil_images = load_pil_images(conversation)
+ prepare_inputs = vl_chat_processor(
+     conversations=conversation,
+     images=pil_images,
+     force_batchify=True,
+     system_prompt=""
+ ).to(vl_gpt.device)
+
+ # run image encoder to get the image embeddings
+ inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
+
+ # run the model to get the response
+ outputs = vl_gpt.language_model.generate(
+     inputs_embeds=inputs_embeds,
+     attention_mask=prepare_inputs.attention_mask,
+     pad_token_id=tokenizer.eos_token_id,
+     bos_token_id=tokenizer.bos_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     max_new_tokens=512,
+     do_sample=False,
+     use_cache=True
+ )
+
+ answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
+ print(f"{prepare_inputs['sft_format'][0]}", answer)
+ ```
+
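+ Because `vl_gpt.language_model.generate` accepts the usual Hugging Face Transformers generation arguments (as the example above shows), the answer can also be streamed token by token. A minimal sketch, reusing the variables defined above and assuming the wrapped language model supports the standard `streamer` argument:
+
+ ```python
+ from transformers import TextStreamer
+
+ # Print the decoded answer to stdout as it is generated.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ outputs = vl_gpt.language_model.generate(
+     inputs_embeds=inputs_embeds,
+     attention_mask=prepare_inputs.attention_mask,
+     pad_token_id=tokenizer.eos_token_id,
+     bos_token_id=tokenizer.bos_token_id,
+     eos_token_id=tokenizer.eos_token_id,
+     max_new_tokens=512,
+     do_sample=False,
+     use_cache=True,
+     streamer=streamer,
+ )
+ ```
+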
+ ### Gradio Demo (TODO)
+
+
+ ## 4. License
+
+ This code repository is licensed under the [MIT License](./LICENSE-CODE). The use of DeepSeek-VL2 models is subject to the [DeepSeek Model License](./LICENSE-MODEL). The DeepSeek-VL2 series supports commercial use.
+
+ ## 5. Citation
+
+ ```
+ @misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
+       title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding},
+       author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
+       year={2024},
+       eprint={2412.10302},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2412.10302},
+ }
+ ```
+
+ ## 6. Contact
+
+ If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).
config.json ADDED
@@ -0,0 +1,150 @@
+ {
+ "candidate_resolutions": [
+ [
+ 384,
+ 384
+ ],
+ [
+ 384,
+ 768
+ ],
+ [
+ 768,
+ 384
+ ],
+ [
+ 384,
+ 1152
+ ],
+ [
+ 1152,
+ 384
+ ],
+ [
+ 384,
+ 1536
+ ],
+ [
+ 1536,
+ 384
+ ],
+ [
+ 768,
+ 768
+ ],
+ [
+ 384,
+ 1920
+ ],
+ [
+ 1920,
+ 384
+ ],
+ [
+ 384,
+ 2304
+ ],
+ [
+ 2304,
+ 384
+ ],
+ [
+ 768,
+ 1152
+ ],
+ [
+ 1152,
+ 768
+ ],
+ [
+ 384,
+ 2688
+ ],
+ [
+ 2688,
+ 384
+ ],
+ [
+ 384,
+ 3072
+ ],
+ [
+ 3072,
+ 384
+ ],
+ [
+ 768,
+ 1536
+ ],
+ [
+ 1536,
+ 768
+ ],
+ [
+ 384,
+ 3456
+ ],
+ [
+ 3456,
+ 384
+ ],
+ [
+ 1152,
+ 1152
+ ]
+ ],
+ "global_view_pos": "head",
+ "language_config": {
+ "architectures": [
+ "DeepseekV2ForCausalLM"
+ ],
+ "auto_map": {
+ "AutoConfig": "configuration_deepseek.DeepseekV2Config",
+ "AutoModel": "modeling_deepseek.DeepseekV2Model",
+ "AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
+ },
+ "bos_token_id": 0,
+ "eos_token_id": 1,
+ "first_k_dense_replace": 1,
+ "hidden_size": 1280,
+ "intermediate_size": 6848,
+ "kv_lora_rank": null,
+ "lm_head": true,
+ "max_position_embeddings": 4096,
+ "model_type": "deepseek_v2",
+ "moe_intermediate_size": 896,
+ "n_group": 1,
+ "n_routed_experts": 64,
+ "n_shared_experts": 2,
+ "num_attention_heads": 10,
+ "num_experts_per_tok": 6,
+ "num_hidden_layers": 12,
+ "num_key_value_heads": 10,
+ "q_lora_rank": null,
+ "qk_nope_head_dim": 0,
+ "qk_rope_head_dim": 0,
+ "rm_head": false,
+ "topk_group": 1,
+ "topk_method": "greedy",
+ "torch_dtype": "bfloat16",
+ "use_mla": false,
+ "v_head_dim": 0,
+ "vocab_size": 129280
+ },
+ "model_type": "deepseek_vl_v2",
+ "projector_config": {
+ "model_type": "mlp_projector",
+ "n_embed": 1280
+ },
+ "tile_tag": "2D",
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.38.2",
+ "vision_config": {
+ "layers": 27,
+ "mlp_ratio": 3.7362,
+ "model_name": "siglip_so400m_patch14_384",
+ "model_type": "vision",
+ "patch_size": 14,
+ "width": 1152
+ }
+ }
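The `language_config` above describes the DeepSeekMoE backbone used by this checkpoint (64 routed experts, 6 experts per token, 12 hidden layers, hidden size 1280). A minimal sketch for inspecting the raw configuration without loading the model; it assumes `huggingface_hub` is installed and uses the source repo id `deepseek-ai/deepseek-vl2-tiny`:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch only config.json and print a few of the fields shown above.
config_path = hf_hub_download("deepseek-ai/deepseek-vl2-tiny", "config.json")
with open(config_path) as f:
    config = json.load(f)

lang = config["language_config"]
print(config["model_type"])            # deepseek_vl_v2
print(lang["n_routed_experts"])        # 64
print(lang["num_experts_per_tok"])     # 6
print(lang["num_hidden_layers"])       # 12
```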
model-00001-of-000001.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cc1e5047280253e224b299677bea16b7960c2436ec676d28911ebd9de3bb0074
+ size 6741334208
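This is a Git LFS pointer; the weights themselves (about 6.7 GB) are stored in LFS. A minimal sketch for checking a locally downloaded copy against the sha256 recorded above (the local path is an assumption; adjust it to wherever the file was saved):

```python
import hashlib

# SHA-256 oid taken from the LFS pointer above.
EXPECTED_SHA256 = "cc1e5047280253e224b299677bea16b7960c2436ec676d28911ebd9de3bb0074"
path = "model-00001-of-000001.safetensors"  # assumed local path

sha256 = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

print("checksum OK" if sha256.hexdigest() == EXPECTED_SHA256 else "checksum mismatch")
```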
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
processor_config.json ADDED
@@ -0,0 +1,116 @@
+ {
+ "add_special_token": false,
+ "candidate_resolutions": [
+ [
+ 384,
+ 384
+ ],
+ [
+ 384,
+ 768
+ ],
+ [
+ 768,
+ 384
+ ],
+ [
+ 384,
+ 1152
+ ],
+ [
+ 1152,
+ 384
+ ],
+ [
+ 384,
+ 1536
+ ],
+ [
+ 1536,
+ 384
+ ],
+ [
+ 768,
+ 768
+ ],
+ [
+ 384,
+ 1920
+ ],
+ [
+ 1920,
+ 384
+ ],
+ [
+ 384,
+ 2304
+ ],
+ [
+ 2304,
+ 384
+ ],
+ [
+ 768,
+ 1152
+ ],
+ [
+ 1152,
+ 768
+ ],
+ [
+ 384,
+ 2688
+ ],
+ [
+ 2688,
+ 384
+ ],
+ [
+ 384,
+ 3072
+ ],
+ [
+ 3072,
+ 384
+ ],
+ [
+ 768,
+ 1536
+ ],
+ [
+ 1536,
+ 768
+ ],
+ [
+ 384,
+ 3456
+ ],
+ [
+ 3456,
+ 384
+ ],
+ [
+ 1152,
+ 1152
+ ]
+ ],
+ "downsample_ratio": 2,
+ "ignore_id": -100,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_token": "<image>",
+ "mask_prompt": false,
+ "normalize": true,
+ "pad_token": "<\uff5c\u2581pad\u2581\uff5c>",
+ "patch_size": 14,
+ "processor_class": "DeepseekVLV2Processor",
+ "sft_format": "deepseek"
+ }
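With `normalize` set to true and per-channel mean and std of 0.5, pixel values in `[0, 1]` are mapped to `[-1, 1]`. The sketch below shows only that normalization step; the tiling, resizing to a candidate resolution, and image-token insertion are handled by `DeepseekVLV2Processor` and are not reproduced here:

```python
import numpy as np
from PIL import Image

IMAGE_MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)
IMAGE_STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)

def normalize(image: Image.Image) -> np.ndarray:
    """Scale pixels to [0, 1], then apply (x - mean) / std per channel."""
    x = np.asarray(image.convert("RGB"), dtype=np.float32) / 255.0
    return (x - IMAGE_MEAN) / IMAGE_STD  # values now lie in [-1, 1]
```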
special_tokens_map.json ADDED
@@ -0,0 +1,39 @@
+ {
+ "additional_special_tokens": [
+ {
+ "content": "<|User|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ {
+ "content": "<|Assistant|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ ],
+ "bos_token": {
+ "content": "<|begin▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "<|end▁of▁sentence|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<|▁pad▁|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff