zwt123home123 committed on
Commit
2c3c568
1 Parent(s): 1ccb765

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ examples/red-panda.mp4 filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,887 @@
1
+ ---
2
+ license: mit
3
+ pipeline_tag: image-text-to-text
4
+ library_name: transformers
5
+ base_model:
6
+ - OpenGVLab/InternViT-6B-448px-V1-5
7
+ - internlm/internlm2-chat-20b
8
+ base_model_relation: merge
9
+ language:
10
+ - multilingual
11
+ tags:
12
+ - internvl
13
+ - vision
14
+ - ocr
15
+ - multi-image
16
+ - video
17
+ - custom_code
18
+ ---
19
+
20
+ # InternVL2-26B
21
+
22
+ [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[📜 InternVL 1.0 Paper\]](https://arxiv.org/abs/2312.14238) [\[📜 InternVL 1.5 Report\]](https://arxiv.org/abs/2404.16821)
23
+
24
+ [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/706547971) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
25
+
26
+ [切换至中文版](#简介)
27
+
28
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/_mLpMwsav5eMeNcZdrIQl.png)
29
+
30
+ ## Introduction
31
+
32
+ We are excited to announce the release of InternVL 2.0, the latest addition to the InternVL series of multimodal large language models. InternVL 2.0 features a variety of **instruction-tuned models**, ranging from 1 billion to 108 billion parameters. This repository contains the instruction-tuned InternVL2-26B model.
33
+
34
+ InternVL 2.0 surpasses most state-of-the-art open-source multimodal large language models and demonstrates performance on par with proprietary commercial models across a range of capabilities, including document and chart comprehension, infographics QA, scene text understanding and OCR, scientific and mathematical problem solving, cultural understanding, and integrated multimodal capabilities.
35
+
36
+ InternVL 2.0 is trained with an 8k context window and utilizes training data consisting of long texts, multiple images, and videos, significantly improving its ability to handle these types of inputs compared to InternVL 1.5. For more details, please refer to our [blog](https://internvl.github.io/blog/2024-07-02-InternVL-2.0/) and [GitHub](https://github.com/OpenGVLab/InternVL).
37
+
38
+ | Model Name | Vision Part | Language Part | HF Link | MS Link |
39
+ | :------------------: | :---------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: | :--------------------------------------------------------------: | :--------------------------------------------------------------------: |
40
+ | InternVL2-1B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-1B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-1B) |
41
+ | InternVL2-2B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-2B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-2B) |
42
+ | InternVL2-4B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-4B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-4B) |
43
+ | InternVL2-8B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-8B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-8B) |
44
+ | InternVL2-26B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-26B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-26B) |
45
+ | InternVL2-40B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-40B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-40B) |
46
+ | InternVL2-Llama3-76B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [Hermes-2-Theta-Llama-3-70B](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-70B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B) |
47
+
48
+ ## Model Details
49
+
50
+ InternVL 2.0 is a multimodal large language model series, featuring models of various sizes. For each size, we release instruction-tuned models optimized for multimodal tasks. InternVL2-26B consists of [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5), an MLP projector, and [internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b).
51
+
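+ As a quick sanity check, the component dimensions can be read from the model configuration without downloading the weights. The snippet below is a minimal sketch; the printed values reflect the `config.json` shipped with this repository.
+
+ ```python
+ from transformers import AutoConfig
+
+ # Load only the configuration; trust_remote_code is needed because the
+ # InternVLChatConfig class is defined inside this repository.
+ config = AutoConfig.from_pretrained('OpenGVLab/InternVL2-26B', trust_remote_code=True)
+
+ print(config.vision_config.hidden_size)  # 3200 (InternViT-6B-448px-V1-5)
+ print(config.llm_config.hidden_size)     # 6144 (internlm2-chat-20b)
+ print(config.downsample_ratio)           # 0.5 (downsampling applied to visual tokens before the MLP projector)
+ ```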
52
+ ## Performance
53
+
54
+ ### Image Benchmarks
55
+
56
+ | Benchmark | GPT-4T-20240409 | Gemini-1.5-Pro | InternVL-Chat-V1-5 | InternVL2-26B |
57
+ | :--------------------------: | :-------------: | :------------: | :----------------: | :-----------: |
58
+ | Model Size | - | - | 25.5B | 25.5B |
59
+ | | | | | |
60
+ | DocVQA<sub>test</sub> | 87.2 | 86.5 | 90.9 | 92.9 |
61
+ | ChartQA<sub>test</sub> | 78.1 | 81.3 | 83.8 | 84.9 |
62
+ | InfoVQA<sub>test</sub> | - | 72.7 | 72.5 | 75.9 |
63
+ | TextVQA<sub>val</sub> | - | 73.5 | 80.6 | 82.3 |
64
+ | OCRBench | 678 | 754 | 724 | 825 |
65
+ | MME<sub>sum</sub> | 2070.2 | 2110.6 | 2187.8 | 2260.7 |
66
+ | RealWorldQA | 68.0 | 67.5 | 66.0 | 68.3 |
67
+ | AI2D<sub>test</sub> | 89.4 | 80.3 | 80.7 | 84.5 |
68
+ | MMMU<sub>val</sub> | 63.1 / 61.7 | 58.5 / 60.6 | 45.2 / 46.8 | 48.3 / 51.2 |
69
+ | MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 82.2 | 83.4 |
70
+ | MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 82.0 | 82.0 |
71
+ | CCBench<sub>dev</sub> | 57.3 | 28.4 | 69.8 | 73.5 |
72
+ | MMVet<sub>GPT-4-0613</sub> | - | - | 62.8 | 64.2 |
73
+ | MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 55.4 | 62.1 |
74
+ | SEED-Image | - | - | 76.0 | 76.8 |
75
+ | HallBench<sub>avg</sub> | 43.9 | 45.6 | 49.3 | 50.7 |
76
+ | MathVista<sub>testmini</sub> | 58.1 | 57.7 | 53.5 | 59.4 |
77
+ | OpenCompass<sub>avg</sub> | 63.5 | 64.4 | 61.7 | 66.4 |
78
+
79
+ - For more details and evaluation reproduction, please refer to our [Evaluation Guide](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html).
80
+
81
+ - We use both the [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained using the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated using VLMEvalKit.
82
+
83
+ - For MMMU, we report both the original scores (left side: evaluated using the InternVL codebase for InternVL series models, and sourced from technical reports or webpages for other models) and the VLMEvalKit scores (right side: collected from the OpenCompass leaderboard).
84
+
85
+ - Please note that evaluating the same model using different testing toolkits like [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) can result in slight differences, which is normal. Updates to code versions and variations in environment and hardware can also cause minor discrepancies in results.
86
+
87
+ ### Video Benchmarks
88
+
89
+ | Benchmark | GPT-4V | LLaVA-NeXT-Video | InternVL-Chat-V1-5 | InternVL2-26B |
90
+ | :-------------------------: | :----: | :--------------: | :----------------: | :-----------: |
91
+ | Model Size | - | 34B | 25.5B | 25.5B |
92
+ | | | | | |
93
+ | MVBench | - | - | 52.1 | 67.5 |
94
+ | MMBench-Video<sub>8f</sub> | 1.53 | - | 1.26 | 1.27 |
95
+ | MMBench-Video<sub>16f</sub> | 1.68 | - | 1.31 | 1.41 |
96
+ | Video-MME<br>w/o subs | 59.9 | 52.0 | 53.6 | 54.8 |
97
+ | Video-MME<br>w subs | 63.3 | 54.9 | 54.5 | 57.1 |
98
+
99
+ - We evaluate our models on MVBench and Video-MME by extracting 16 frames from each video, with each frame resized to a 448×448 image, as sketched below.
100
+
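+ The snippet below sketches how such an input could be assembled for a single video, reusing the `load_video` helper, `model`, `tokenizer`, and `generation_config` defined in the [Quick Start](#quick-start) section; it illustrates the frame-sampling setup rather than the full evaluation pipeline.
+
+ ```python
+ # Sample 16 frames and resize each to a single 448x448 tile (max_num=1 disables tiling).
+ # Assumes the Quick Start code (load_video, model, tokenizer, generation_config) has already been run.
+ pixel_values, num_patches_list = load_video('./examples/red-panda.mp4', num_segments=16, max_num=1)
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
+ question = video_prefix + 'Describe what happens in this video.'
+ response = model.chat(tokenizer, pixel_values, question, generation_config,
+                       num_patches_list=num_patches_list)
+ print(response)
+ ```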
101
+ ### Grounding Benchmarks
102
+
103
+ | Model | avg. | RefCOCO<br>(val) | RefCOCO<br>(testA) | RefCOCO<br>(testB) | RefCOCO+<br>(val) | RefCOCO+<br>(testA) | RefCOCO+<br>(testB) | RefCOCO‑g<br>(val) | RefCOCO‑g<br>(test) |
104
+ | :----------------------------: | :--: | :--------------: | :----------------: | :----------------: | :---------------: | :-----------------: | :-----------------: | :----------------: | :-----------------: |
105
+ | UNINEXT-H<br>(Specialist SOTA) | 88.9 | 92.6 | 94.3 | 91.5 | 85.2 | 89.6 | 79.8 | 88.7 | 89.4 |
106
+ | | | | | | | | | | |
107
+ | Mini-InternVL-<br>Chat-2B-V1-5 | 75.8 | 80.7 | 86.7 | 72.9 | 72.5 | 82.3 | 60.8 | 75.6 | 74.9 |
108
+ | Mini-InternVL-<br>Chat-4B-V1-5 | 84.4 | 88.0 | 91.4 | 83.5 | 81.5 | 87.4 | 73.8 | 84.7 | 84.6 |
109
+ | InternVL‑Chat‑V1‑5 | 88.8 | 91.4 | 93.7 | 87.1 | 87.0 | 92.3 | 80.9 | 88.5 | 89.3 |
110
+ | | | | | | | | | | |
111
+ | InternVL2‑1B | 79.9 | 83.6 | 88.7 | 79.8 | 76.0 | 83.6 | 67.7 | 80.2 | 79.9 |
112
+ | InternVL2‑2B | 77.7 | 82.3 | 88.2 | 75.9 | 73.5 | 82.8 | 63.3 | 77.6 | 78.3 |
113
+ | InternVL2‑4B | 84.4 | 88.5 | 91.2 | 83.9 | 81.2 | 87.2 | 73.8 | 84.6 | 84.6 |
114
+ | InternVL2‑8B | 82.9 | 87.1 | 91.1 | 80.7 | 79.8 | 87.9 | 71.4 | 82.7 | 82.7 |
115
+ | InternVL2‑26B | 88.5 | 91.2 | 93.3 | 87.4 | 86.8 | 91.0 | 81.2 | 88.5 | 88.6 |
116
+ | InternVL2‑40B | 90.3 | 93.0 | 94.7 | 89.2 | 88.5 | 92.8 | 83.6 | 90.3 | 90.6 |
117
+ | InternVL2-<br>Llama3‑76B | 90.0 | 92.2 | 94.8 | 88.4 | 88.8 | 93.1 | 82.8 | 89.5 | 90.3 |
118
+
119
+ - We use the following prompt to evaluate InternVL's grounding ability: `Please provide the bounding box coordinates of the region this sentence describes: <ref>{}</ref>`
120
+
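+ A minimal sketch of issuing this grounding prompt through the `model.chat` interface from the [Quick Start](#quick-start) section is shown below. The description filled into `<ref></ref>` is a placeholder, and the exact format of the returned coordinates (typically a `<box>...</box>` span) may vary, so treat the output parsing as an assumption.
+
+ ```python
+ # Assumes model, tokenizer, load_image, and generation_config from the Quick Start section.
+ pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
+
+ # Fill the <ref></ref> slot with the phrase describing the region to localize (placeholder here).
+ grounding_prompt = ('Please provide the bounding box coordinates of the region '
+                     'this sentence describes: <ref>the main subject of the image</ref>')
+ response = model.chat(tokenizer, pixel_values, grounding_prompt, generation_config)
+ print(response)  # expected to contain a <box>...</box> span with the predicted coordinates
+ ```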
121
+ Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
122
+
123
+ ### Invitation to Evaluate InternVL
124
+
125
+ We welcome MLLM benchmark developers to assess our InternVL1.5 and InternVL2 series models. If you need to add your evaluation results here, please contact me at [wztxy89@163.com](mailto:wztxy89@163.com).
126
+
127
+ ## Quick Start
128
+
129
+ We provide example code to run InternVL2-26B using `transformers`.
130
+
131
+ We also welcome you to experience the InternVL2 series models in our [online demo](https://internvl.opengvlab.com/).
132
+
133
+ > Please use `transformers==4.37.2` to ensure the model works as expected.
134
+
135
+ ### Model Loading
136
+
137
+ #### 16-bit (bf16 / fp16)
138
+
139
+ ```python
140
+ import torch
141
+ from transformers import AutoTokenizer, AutoModel
142
+ path = "OpenGVLab/InternVL2-26B"
143
+ model = AutoModel.from_pretrained(
144
+ path,
145
+ torch_dtype=torch.bfloat16,
146
+ low_cpu_mem_usage=True,
147
+ use_flash_attn=True,
148
+ trust_remote_code=True).eval().cuda()
149
+ ```
150
+
151
+ #### BNB 8-bit Quantization
152
+
153
+ ```python
154
+ import torch
155
+ from transformers import AutoTokenizer, AutoModel
156
+ path = "OpenGVLab/InternVL2-26B"
157
+ model = AutoModel.from_pretrained(
158
+ path,
159
+ torch_dtype=torch.bfloat16,
160
+ load_in_8bit=True,
161
+ low_cpu_mem_usage=True,
162
+ use_flash_attn=True,
163
+ trust_remote_code=True).eval()
164
+ ```
165
+
166
+ #### BNB 4-bit Quantization
167
+
168
+ > **⚠️ Warning:** Due to significant quantization errors with BNB 4-bit quantization on InternViT-6B, the model may produce nonsensical outputs and fail to understand images. Therefore, please avoid using BNB 4-bit quantization.
169
+
170
+ #### Multiple GPUs
171
+
172
+ The code below is written this way to avoid errors during multi-GPU inference caused by tensors residing on different devices. By ensuring that the first and last layers of the large language model (LLM) are placed on the same device, we prevent such errors.
173
+
174
+ ```python
175
+ import math
176
+ import torch
177
+ from transformers import AutoTokenizer, AutoModel
178
+
179
+ def split_model(model_name):
180
+ device_map = {}
181
+ world_size = torch.cuda.device_count()
182
+ num_layers = {
183
+ 'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
184
+ 'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
185
+ # Since the first GPU will be used for ViT, treat it as half a GPU.
186
+ num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
187
+ num_layers_per_gpu = [num_layers_per_gpu] * world_size
188
+ num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
189
+ layer_cnt = 0
190
+ for i, num_layer in enumerate(num_layers_per_gpu):
191
+ for j in range(num_layer):
192
+ device_map[f'language_model.model.layers.{layer_cnt}'] = i
193
+ layer_cnt += 1
194
+ device_map['vision_model'] = 0
195
+ device_map['mlp1'] = 0
196
+ device_map['language_model.model.tok_embeddings'] = 0
197
+ device_map['language_model.model.embed_tokens'] = 0
198
+ device_map['language_model.output'] = 0
199
+ device_map['language_model.model.norm'] = 0
200
+ device_map['language_model.lm_head'] = 0
201
+ device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
202
+
203
+ return device_map
204
+
205
+ path = "OpenGVLab/InternVL2-26B"
206
+ device_map = split_model('InternVL2-26B')
207
+ model = AutoModel.from_pretrained(
208
+ path,
209
+ torch_dtype=torch.bfloat16,
210
+ low_cpu_mem_usage=True,
211
+ use_flash_attn=True,
212
+ trust_remote_code=True,
213
+ device_map=device_map).eval()
214
+ ```
215
+
216
+ ### Inference with Transformers
217
+
218
+ ```python
219
+ import numpy as np
220
+ import torch
221
+ import torchvision.transforms as T
222
+ from decord import VideoReader, cpu
223
+ from PIL import Image
224
+ from torchvision.transforms.functional import InterpolationMode
225
+ from transformers import AutoModel, AutoTokenizer
226
+
227
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
228
+ IMAGENET_STD = (0.229, 0.224, 0.225)
229
+
230
+ def build_transform(input_size):
231
+ MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
232
+ transform = T.Compose([
233
+ T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
234
+ T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
235
+ T.ToTensor(),
236
+ T.Normalize(mean=MEAN, std=STD)
237
+ ])
238
+ return transform
239
+
240
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
241
+ best_ratio_diff = float('inf')
242
+ best_ratio = (1, 1)
243
+ area = width * height
244
+ for ratio in target_ratios:
245
+ target_aspect_ratio = ratio[0] / ratio[1]
246
+ ratio_diff = abs(aspect_ratio - target_aspect_ratio)
247
+ if ratio_diff < best_ratio_diff:
248
+ best_ratio_diff = ratio_diff
249
+ best_ratio = ratio
250
+ elif ratio_diff == best_ratio_diff:
251
+ if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
252
+ best_ratio = ratio
253
+ return best_ratio
254
+
255
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
256
+ orig_width, orig_height = image.size
257
+ aspect_ratio = orig_width / orig_height
258
+
259
+ # calculate the existing image aspect ratio
260
+ target_ratios = set(
261
+ (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
262
+ i * j <= max_num and i * j >= min_num)
263
+ target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
264
+
265
+ # find the closest aspect ratio to the target
266
+ target_aspect_ratio = find_closest_aspect_ratio(
267
+ aspect_ratio, target_ratios, orig_width, orig_height, image_size)
268
+
269
+ # calculate the target width and height
270
+ target_width = image_size * target_aspect_ratio[0]
271
+ target_height = image_size * target_aspect_ratio[1]
272
+ blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
273
+
274
+ # resize the image
275
+ resized_img = image.resize((target_width, target_height))
276
+ processed_images = []
277
+ for i in range(blocks):
278
+ box = (
279
+ (i % (target_width // image_size)) * image_size,
280
+ (i // (target_width // image_size)) * image_size,
281
+ ((i % (target_width // image_size)) + 1) * image_size,
282
+ ((i // (target_width // image_size)) + 1) * image_size
283
+ )
284
+ # split the image
285
+ split_img = resized_img.crop(box)
286
+ processed_images.append(split_img)
287
+ assert len(processed_images) == blocks
288
+ if use_thumbnail and len(processed_images) != 1:
289
+ thumbnail_img = image.resize((image_size, image_size))
290
+ processed_images.append(thumbnail_img)
291
+ return processed_images
292
+
293
+ def load_image(image_file, input_size=448, max_num=12):
294
+ image = Image.open(image_file).convert('RGB')
295
+ transform = build_transform(input_size=input_size)
296
+ images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
297
+ pixel_values = [transform(image) for image in images]
298
+ pixel_values = torch.stack(pixel_values)
299
+ return pixel_values
300
+
301
+ # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
302
+ # Otherwise, you need to load the model across multiple GPUs; please refer to the `Multiple GPUs` section.
303
+ path = 'OpenGVLab/InternVL2-26B'
304
+ model = AutoModel.from_pretrained(
305
+ path,
306
+ torch_dtype=torch.bfloat16,
307
+ low_cpu_mem_usage=True,
308
+ use_flash_attn=True,
309
+ trust_remote_code=True).eval().cuda()
310
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
311
+
312
+ # set the max number of tiles in `max_num`
313
+ pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
314
+ generation_config = dict(max_new_tokens=1024, do_sample=True)
315
+
316
+ # pure-text conversation (纯文本对话)
317
+ question = 'Hello, who are you?'
318
+ response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
319
+ print(f'User: {question}\nAssistant: {response}')
320
+
321
+ question = 'Can you tell me a story?'
322
+ response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
323
+ print(f'User: {question}\nAssistant: {response}')
324
+
325
+ # single-image single-round conversation (单图单轮对话)
326
+ question = '<image>\nPlease describe the image shortly.'
327
+ response = model.chat(tokenizer, pixel_values, question, generation_config)
328
+ print(f'User: {question}\nAssistant: {response}')
329
+
330
+ # single-image multi-round conversation (单图多轮对话)
331
+ question = '<image>\nPlease describe the image in detail.'
332
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
333
+ print(f'User: {question}\nAssistant: {response}')
334
+
335
+ question = 'Please write a poem according to the image.'
336
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
337
+ print(f'User: {question}\nAssistant: {response}')
338
+
339
+ # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
340
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
341
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
342
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
343
+
344
+ question = '<image>\nDescribe the two images in detail.'
345
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
346
+ history=None, return_history=True)
347
+ print(f'User: {question}\nAssistant: {response}')
348
+
349
+ question = 'What are the similarities and differences between these two images.'
350
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
351
+ history=history, return_history=True)
352
+ print(f'User: {question}\nAssistant: {response}')
353
+
354
+ # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
355
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
356
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
357
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
358
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
359
+
360
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
361
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
362
+ num_patches_list=num_patches_list,
363
+ history=None, return_history=True)
364
+ print(f'User: {question}\nAssistant: {response}')
365
+
366
+ question = 'What are the similarities and differences between these two images.'
367
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
368
+ num_patches_list=num_patches_list,
369
+ history=history, return_history=True)
370
+ print(f'User: {question}\nAssistant: {response}')
371
+
372
+ # batch inference, single image per sample (单图批处理)
373
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
374
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
375
+ num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
376
+ pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
377
+
378
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
379
+ responses = model.batch_chat(tokenizer, pixel_values,
380
+ num_patches_list=num_patches_list,
381
+ questions=questions,
382
+ generation_config=generation_config)
383
+ for question, response in zip(questions, responses):
384
+ print(f'User: {question}\nAssistant: {response}')
385
+
386
+ # video multi-round conversation (视频多轮对话)
387
+ def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
388
+ if bound:
389
+ start, end = bound[0], bound[1]
390
+ else:
391
+ start, end = -100000, 100000
392
+ start_idx = max(first_idx, round(start * fps))
393
+ end_idx = min(round(end * fps), max_frame)
394
+ seg_size = float(end_idx - start_idx) / num_segments
395
+ frame_indices = np.array([
396
+ int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
397
+ for idx in range(num_segments)
398
+ ])
399
+ return frame_indices
400
+
401
+ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
402
+ vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
403
+ max_frame = len(vr) - 1
404
+ fps = float(vr.get_avg_fps())
405
+
406
+ pixel_values_list, num_patches_list = [], []
407
+ transform = build_transform(input_size=input_size)
408
+ frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
409
+ for frame_index in frame_indices:
410
+ img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
411
+ img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
412
+ pixel_values = [transform(tile) for tile in img]
413
+ pixel_values = torch.stack(pixel_values)
414
+ num_patches_list.append(pixel_values.shape[0])
415
+ pixel_values_list.append(pixel_values)
416
+ pixel_values = torch.cat(pixel_values_list)
417
+ return pixel_values, num_patches_list
418
+
419
+ video_path = './examples/red-panda.mp4'
420
+ pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
421
+ pixel_values = pixel_values.to(torch.bfloat16).cuda()
422
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
423
+ question = video_prefix + 'What is the red panda doing?'
424
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
425
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
426
+ num_patches_list=num_patches_list, history=None, return_history=True)
427
+ print(f'User: {question}\nAssistant: {response}')
428
+
429
+ question = 'Describe this video in detail. Don\'t repeat.'
430
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config,
431
+ num_patches_list=num_patches_list, history=history, return_history=True)
432
+ print(f'User: {question}\nAssistant: {response}')
433
+ ```
434
+
435
+ #### Streaming output
436
+
437
+ Besides the method above, you can also use the following code to get streaming output.
438
+
439
+ ```python
440
+ from transformers import TextIteratorStreamer
441
+ from threading import Thread
442
+
443
+ # Initialize the streamer
444
+ streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
445
+ # Define the generation configuration
446
+ generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
447
+ # Start the model chat in a separate thread
448
+ thread = Thread(target=model.chat, kwargs=dict(
449
+ tokenizer=tokenizer, pixel_values=pixel_values, question=question,
450
+ history=None, return_history=False, generation_config=generation_config,
451
+ ))
452
+ thread.start()
453
+
454
+ # Initialize an empty string to store the generated text
455
+ generated_text = ''
456
+ # Loop through the streamer to get the new text as it is generated
457
+ for new_text in streamer:
458
+ if new_text == model.conv_template.sep:
459
+ break
460
+ generated_text += new_text
461
+ print(new_text, end='', flush=True) # Print each new chunk of generated text on the same line
462
+ ```
463
+
464
+ ## Finetune
465
+
466
+ Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTurner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
467
+
468
+ ## Deployment
469
+
470
+ ### LMDeploy
471
+
472
+ LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams.
473
+
474
+ ```sh
475
+ pip install lmdeploy==0.5.3
476
+ ```
477
+
478
+ LMDeploy abstracts the complex inference process of multimodal vision-language models (VLMs) into an easy-to-use pipeline, similar to the large language model (LLM) inference pipeline.
479
+
480
+ #### A 'Hello, world' example
481
+
482
+ ```python
483
+ from lmdeploy import pipeline, TurbomindEngineConfig
484
+ from lmdeploy.vl import load_image
485
+
486
+ model = 'OpenGVLab/InternVL2-26B'
487
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
488
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
489
+ response = pipe(('describe this image', image))
490
+ print(response.text)
491
+ ```
492
+
493
+ If an `ImportError` occurs while running this example, please install the required dependency packages as prompted.
494
+
495
+ #### Multi-images inference
496
+
497
+ When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
498
+
499
+ > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
500
+
501
+ ```python
502
+ from lmdeploy import pipeline, TurbomindEngineConfig
503
+ from lmdeploy.vl import load_image
504
+ from lmdeploy.vl.constants import IMAGE_TOKEN
505
+
506
+ model = 'OpenGVLab/InternVL2-26B'
507
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
508
+
509
+ image_urls=[
510
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
511
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
512
+ ]
513
+
514
+ images = [load_image(img_url) for img_url in image_urls]
515
+ # Numbering images improves multi-image conversations
516
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
517
+ print(response.text)
518
+ ```
519
+
520
+ #### Batch prompts inference
521
+
522
+ Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
523
+
524
+ ```python
525
+ from lmdeploy import pipeline, TurbomindEngineConfig
526
+ from lmdeploy.vl import load_image
527
+
528
+ model = 'OpenGVLab/InternVL2-26B'
529
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
530
+
531
+ image_urls=[
532
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
533
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
534
+ ]
535
+ prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
536
+ response = pipe(prompts)
537
+ print(response)
538
+ ```
539
+
540
+ #### Multi-turn conversation
541
+
542
+ There are two ways to do multi-turn conversations with the pipeline. One is to construct messages according to OpenAI's format and use the method introduced above; the other is to use the `pipeline.chat` interface.
543
+
544
+ ```python
545
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
546
+ from lmdeploy.vl import load_image
547
+
548
+ model = 'OpenGVLab/InternVL2-26B'
549
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
550
+
551
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
552
+ gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
553
+ sess = pipe.chat(('describe this image', image), gen_config=gen_config)
554
+ print(sess.response.text)
555
+ sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
556
+ print(sess.response.text)
557
+ ```
558
+
559
+ #### Service
560
+
561
+ LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
562
+
563
+ ```shell
564
+ lmdeploy serve api_server OpenGVLab/InternVL2-26B --backend turbomind --server-port 23333
565
+ ```
566
+
567
+ To use the OpenAI-style interface, you need to install the OpenAI Python package:
568
+
569
+ ```shell
570
+ pip install openai
571
+ ```
572
+
573
+ Then, use the code below to make the API call:
574
+
575
+ ```python
576
+ from openai import OpenAI
577
+
578
+ client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
579
+ model_name = client.models.list().data[0].id
580
+ response = client.chat.completions.create(
581
+ model=model_name,
582
+ messages=[{
583
+ 'role':
584
+ 'user',
585
+ 'content': [{
586
+ 'type': 'text',
587
+ 'text': 'describe this image',
588
+ }, {
589
+ 'type': 'image_url',
590
+ 'image_url': {
591
+ 'url':
592
+ 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
593
+ },
594
+ }],
595
+ }],
596
+ temperature=0.8,
597
+ top_p=0.8)
598
+ print(response)
599
+ ```
600
+
601
+ ## License
602
+
603
+ This project is released under the MIT license, while InternLM2 is licensed under the Apache-2.0 license.
604
+
605
+ ## Citation
606
+
607
+ If you find this project useful in your research, please consider citing:
608
+
609
+ ```BibTeX
610
+ @article{chen2023internvl,
611
+ title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
612
+ author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
613
+ journal={arXiv preprint arXiv:2312.14238},
614
+ year={2023}
615
+ }
616
+ @article{chen2024far,
617
+ title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
618
+ author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
619
+ journal={arXiv preprint arXiv:2404.16821},
620
+ year={2024}
621
+ }
622
+ ```
623
+
624
+ ## 简介
625
+
626
+ 我们很高兴宣布 InternVL 2.0 的发布,这是 InternVL 系列多模态大语言模型的最新版本。InternVL 2.0 提供了多种**指令微调**的模型,参数从 10 亿到 1080 亿不等。此仓库包含经过指令微调的 InternVL2-26B 模型。
627
+
628
+ 与最先进的开源多模态大语言模型相比,InternVL 2.0 超越了大多数开源模型。它在各种能力上表现出与闭源商业模型相媲美的竞争力,包括文档和图表理解、信息图表问答、场景文本理解和 OCR 任务、科学和数学问题解决,以及文化理解和综合多模态能力。
629
+
630
+ InternVL 2.0 使用 8k 上下文窗口进行训练,训练数据包含长文本、多图和视频数据,与 InternVL 1.5 相比,其处理这些类型输入的能力显著提高。更多详细信息,请参阅我们的博客和 GitHub。
631
+
632
+ | 模型名称 | 视觉部分 | 语言部分 | HF 链接 | MS 链接 |
633
+ | :------------------: | :---------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: | :--------------------------------------------------------------: | :--------------------------------------------------------------------: |
634
+ | InternVL2-1B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-1B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-1B) |
635
+ | InternVL2-2B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-2B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-2B) |
636
+ | InternVL2-4B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-4B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-4B) |
637
+ | InternVL2-8B | [InternViT-300M-448px](https://huggingface.co/OpenGVLab/InternViT-300M-448px) | [internlm2_5-7b-chat](https://huggingface.co/internlm/internlm2_5-7b-chat) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-8B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-8B) |
638
+ | InternVL2-26B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-26B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-26B) |
639
+ | InternVL2-40B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-40B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-40B) |
640
+ | InternVL2-Llama3-76B | [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) | [Hermes-2-Theta-Llama-3-70B](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-70B) | [🤗 link](https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B) | [🤖 link](https://modelscope.cn/models/OpenGVLab/InternVL2-Llama3-76B) |
641
+
642
+ ## 模型细节
643
+
644
+ InternVL 2.0 是一个多模态大语言模型系列,包含各种规模的模型。对于每个规模的模型,我们都会发布针对多模态任务优化的指令微调模型。InternVL2-26B 包含 [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)、一个 MLP 投影器和 [internlm2-chat-20b](https://huggingface.co/internlm/internlm2-chat-20b)。
645
+
646
+ ## 性能测试
647
+
648
+ ### 图像相关评测
649
+
650
+ | 评测数据集 | GPT-4T-20240409 | Gemini-1.5-Pro | InternVL-Chat-V1-5 | InternVL2-26B |
651
+ | :--------------------------: | :-------------: | :------------: | :----------------: | :-----------: |
652
+ | 模型大小 | - | - | 25.5B | 25.5B |
653
+ | | | | | |
654
+ | DocVQA<sub>test</sub> | 87.2 | 86.5 | 90.9 | 92.9 |
655
+ | ChartQA<sub>test</sub> | 78.1 | 81.3 | 83.8 | 84.9 |
656
+ | InfoVQA<sub>test</sub> | - | 72.7 | 72.5 | 75.9 |
657
+ | TextVQA<sub>val</sub> | - | 73.5 | 80.6 | 82.3 |
658
+ | OCRBench | 678 | 754 | 724 | 825 |
659
+ | MME<sub>sum</sub> | 2070.2 | 2110.6 | 2187.8 | 2260.7 |
660
+ | RealWorldQA | 68.0 | 67.5 | 66.0 | 68.3 |
661
+ | AI2D<sub>test</sub> | 89.4 | 80.3 | 80.7 | 84.5 |
662
+ | MMMU<sub>val</sub> | 63.1 / 61.7 | 58.5 / 60.6 | 45.2 / 46.8 | 48.3 / 51.2 |
663
+ | MMBench-EN<sub>test</sub> | 81.0 | 73.9 | 82.2 | 83.4 |
664
+ | MMBench-CN<sub>test</sub> | 80.2 | 73.8 | 82.0 | 82.0 |
665
+ | CCBench<sub>dev</sub> | 57.3 | 28.4 | 69.8 | 73.5 |
666
+ | MMVet<sub>GPT-4-0613</sub> | - | - | 62.8 | 64.2 |
667
+ | MMVet<sub>GPT-4-Turbo</sub> | 67.5 | 64.0 | 55.4 | 62.1 |
668
+ | SEED-Image | - | - | 76.0 | 76.8 |
669
+ | HallBench<sub>avg</sub> | 43.9 | 45.6 | 49.3 | 50.7 |
670
+ | MathVista<sub>testmini</sub> | 58.1 | 57.7 | 53.5 | 59.4 |
671
+ | OpenCompass<sub>avg</sub> | 63.5 | 64.4 | 61.7 | 66.4 |
672
+
673
+ - 关于更多的细节以及评测复现,请看我们的[评测指南](https://internvl.readthedocs.io/en/latest/internvl2.0/evaluation.html)。
674
+
675
+ - 我们同时使用 InternVL 和 VLMEvalKit 仓库进行模型评估。具体来说,DocVQA、ChartQA、InfoVQA、TextVQA、MME、AI2D、MMBench、CCBench、MMVet 和 SEED-Image 的结果是使用 InternVL 仓库测试的。OCRBench、RealWorldQA、HallBench 和 MathVista 是使用 VLMEvalKit 进行评估的。
676
+
677
+ - 对于MMMU,我们报告了原始分数(左侧:InternVL系列模型使用InternVL代码库评测,其他模型的分数来自其技术报告或网页)和VLMEvalKit分数(右侧:从OpenCompass排行榜收集)。
678
+
679
+ - 请注意,使用不同的测试工具包(如 InternVL 和 VLMEvalKit)评估同一模型可能会导致细微差异,这是正常的。代码版本的更新、环境和硬件的变化也可能导致结果的微小差异。
680
+
681
+ ### 视频相关评测
682
+
683
+ | 评测数据集 | GPT-4V | LLaVA-NeXT-Video | InternVL-Chat-V1-5 | InternVL2-26B |
684
+ | :-------------------------: | :----: | :--------------: | :----------------: | :-----------: |
685
+ | 模型大小 | - | 34B | 25.5B | 25.5B |
686
+ | | | | | |
687
+ | MVBench | - | - | 52.1 | 67.5 |
688
+ | MMBench-Video<sub>8f</sub> | 1.53 | - | 1.26 | 1.27 |
689
+ | MMBench-Video<sub>16f</sub> | 1.68 | - | 1.31 | 1.41 |
690
+ | Video-MME<br>w/o subs | 59.9 | 52.0 | 53.6 | 54.8 |
691
+ | Video-MME<br>w subs | 63.3 | 54.9 | 54.5 | 57.1 |
692
+
693
+ - 我们通过从每个视频中提取 16 帧来评估我们的模型在 MVBench 和 Video-MME 上的性能,每个视频帧被调整为 448x448 的图像。
694
+
695
+ ### 定位相关评测
696
+
697
+ | 模型 | avg. | RefCOCO<br>(val) | RefCOCO<br>(testA) | RefCOCO<br>(testB) | RefCOCO+<br>(val) | RefCOCO+<br>(testA) | RefCOCO+<br>(testB) | RefCOCO‑g<br>(val) | RefCOCO‑g<br>(test) |
698
+ | :----------------------------: | :--: | :--------------: | :----------------: | :----------------: | :---------------: | :-----------------: | :-----------------: | :----------------: | :-----------------: |
699
+ | UNINEXT-H<br>(Specialist SOTA) | 88.9 | 92.6 | 94.3 | 91.5 | 85.2 | 89.6 | 79.8 | 88.7 | 89.4 |
700
+ | | | | | | | | | | |
701
+ | Mini-InternVL-<br>Chat-2B-V1-5 | 75.8 | 80.7 | 86.7 | 72.9 | 72.5 | 82.3 | 60.8 | 75.6 | 74.9 |
702
+ | Mini-InternVL-<br>Chat-4B-V1-5 | 84.4 | 88.0 | 91.4 | 83.5 | 81.5 | 87.4 | 73.8 | 84.7 | 84.6 |
703
+ | InternVL‑Chat‑V1‑5 | 88.8 | 91.4 | 93.7 | 87.1 | 87.0 | 92.3 | 80.9 | 88.5 | 89.3 |
704
+ | | | | | | | | | | |
705
+ | InternVL2‑1B | 79.9 | 83.6 | 88.7 | 79.8 | 76.0 | 83.6 | 67.7 | 80.2 | 79.9 |
706
+ | InternVL2‑2B | 77.7 | 82.3 | 88.2 | 75.9 | 73.5 | 82.8 | 63.3 | 77.6 | 78.3 |
707
+ | InternVL2‑4B | 84.4 | 88.5 | 91.2 | 83.9 | 81.2 | 87.2 | 73.8 | 84.6 | 84.6 |
708
+ | InternVL2‑8B | 82.9 | 87.1 | 91.1 | 80.7 | 79.8 | 87.9 | 71.4 | 82.7 | 82.7 |
709
+ | InternVL2‑26B | 88.5 | 91.2 | 93.3 | 87.4 | 86.8 | 91.0 | 81.2 | 88.5 | 88.6 |
710
+ | InternVL2‑40B | 90.3 | 93.0 | 94.7 | 89.2 | 88.5 | 92.8 | 83.6 | 90.3 | 90.6 |
711
+ | InternVL2-<br>Llama3‑76B | 90.0 | 92.2 | 94.8 | 88.4 | 88.8 | 93.1 | 82.8 | 89.5 | 90.3 |
712
+
713
+ - 我们使用以下 Prompt 来评测 InternVL 的 Grounding 能力: `Please provide the bounding box coordinates of the region this sentence describes: <ref>{}</ref>`
714
+
715
+ 限制:尽管在训练过程中我们非常注重模型的安全性,尽力促使模型输出符合伦理和法律要求的文本,但受限于模型大小以及概率生成范式,模型可能会产生各种不符合预期的输出,例如回复内容包含偏见、歧视等有害内容,请勿传播这些内容。由于传播不良信息导致的任何后果,本项目不承担责任。
716
+
717
+ ### 邀请评测 InternVL
718
+
719
+ 我们欢迎各位 MLLM benchmark 的开发者对我们的 InternVL1.5 以及 InternVL2 系列模型进行评测。如果需要在此处添加评测结果,请与我联系([wztxy89@163.com](mailto:wztxy89@163.com))。
720
+
721
+ ## 快速启动
722
+
723
+ 我们提供了一个示例代码,用于使用 `transformers` 运行 InternVL2-26B。
724
+
725
+ 我们也欢迎你在我们的[在线demo](https://internvl.opengvlab.com/)中体验InternVL2的系列模型。
726
+
727
+ > 请使用 transformers==4.37.2 以确保模型正常运行。
728
+
729
+ 示例代码请[点击这里](#quick-start)。
730
+
731
+ ## 微调
732
+
733
+ 许多仓库现在都支持 InternVL 系列模型的微调,包括 [InternVL](https://github.com/OpenGVLab/InternVL)、[SWIFT](https://github.com/modelscope/ms-swift)、[XTurner](https://github.com/InternLM/xtuner) 等。请参阅它们的文档以获取更多微调细节。
734
+
735
+ ## 部署
736
+
737
+ ### LMDeploy
738
+
739
+ LMDeploy 是由 MMRazor 和 MMDeploy 团队开发的用于压缩、部署和服务大语言模型(LLM)的工具包。
740
+
741
+ ```sh
742
+ pip install lmdeploy==0.5.3
743
+ ```
744
+
745
+ LMDeploy 将多模态视觉-语言模型(VLM)的复杂推理过程抽象为一个易于使用的管道,类似于大语言模型(LLM)的推理管道。
746
+
747
+ #### 一个“你好,世界”示例
748
+
749
+ ```python
750
+ from lmdeploy import pipeline, TurbomindEngineConfig
751
+ from lmdeploy.vl import load_image
752
+
753
+ model = 'OpenGVLab/InternVL2-26B'
754
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
755
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
756
+ response = pipe(('describe this image', image))
757
+ print(response.text)
758
+ ```
759
+
760
+ 如果在执行此示例时出现 `ImportError`,请按照提示安装所需的依赖包。
761
+
762
+ #### 多图像推理
763
+
764
+ 在处理多张图像时,可以将它们全部放入一个列表中。请注意,多张图像会导致输入 token 数量增加,因此通常需要增加上下文窗口的大小。
765
+
766
+ ```python
767
+ from lmdeploy import pipeline, TurbomindEngineConfig
768
+ from lmdeploy.vl import load_image
769
+ from lmdeploy.vl.constants import IMAGE_TOKEN
770
+
771
+ model = 'OpenGVLab/InternVL2-26B'
772
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
773
+
774
+ image_urls=[
775
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
776
+ 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
777
+ ]
778
+
779
+ images = [load_image(img_url) for img_url in image_urls]
780
+ # Numbering images improves multi-image conversations
781
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
782
+ print(response.text)
783
+ ```
784
+
785
+ #### 批量Prompt推理
786
+
787
+ 使用批量Prompt进行推理非常简单;只需将它们放在一个列表结构中:
788
+
789
+ ```python
790
+ from lmdeploy import pipeline, TurbomindEngineConfig
791
+ from lmdeploy.vl import load_image
792
+
793
+ model = 'OpenGVLab/InternVL2-26B'
794
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
795
+
796
+ image_urls=[
797
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
798
+ "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
799
+ ]
800
+ prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
801
+ response = pipe(prompts)
802
+ print(response)
803
+ ```
804
+
805
+ #### 多轮对话
806
+
807
+ 使用管道进行多轮对话有两种方法。一种是根据 OpenAI 的格式构建消息并使用上述方法,另一种是使用 `pipeline.chat` 接口。
808
+
809
+ ```python
810
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
811
+ from lmdeploy.vl import load_image
812
+
813
+ model = 'OpenGVLab/InternVL2-26B'
814
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
815
+
816
+ image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
817
+ gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
818
+ sess = pipe.chat(('describe this image', image), gen_config=gen_config)
819
+ print(sess.response.text)
820
+ sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
821
+ print(sess.response.text)
822
+ ```
823
+
824
+ #### API部署
825
+
826
+ LMDeploy 的 `api_server` 使模型能够通过一个命令轻松打包成服务。提供的 RESTful API 与 OpenAI 的接口兼容。以下是服务启动的示例:
827
+
828
+ ```shell
829
+ lmdeploy serve api_server OpenGVLab/InternVL2-26B --backend turbomind --server-port 23333
830
+ ```
831
+
832
+ 为了使用OpenAI风格的API接口,您需要安装OpenAI:
833
+
834
+ ```shell
835
+ pip install openai
836
+ ```
837
+
838
+ 然后,使用下面的代码进行API调用:
839
+
840
+ ```python
841
+ from openai import OpenAI
842
+
843
+ client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
844
+ model_name = client.models.list().data[0].id
845
+ response = client.chat.completions.create(
846
+ model=model_name,
847
+ messages=[{
848
+ 'role':
849
+ 'user',
850
+ 'content': [{
851
+ 'type': 'text',
852
+ 'text': 'describe this image',
853
+ }, {
854
+ 'type': 'image_url',
855
+ 'image_url': {
856
+ 'url':
857
+ 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
858
+ },
859
+ }],
860
+ }],
861
+ temperature=0.8,
862
+ top_p=0.8)
863
+ print(response)
864
+ ```
865
+
866
+ ## 开源许可证
867
+
868
+ 该项目采用 MIT 许可证发布,而 InternLM2 则采用 Apache-2.0 许可证。
869
+
870
+ ## 引用
871
+
872
+ 如果您发现此项目对您的研究有用,可以考虑引用我们的论文:
873
+
874
+ ```BibTeX
875
+ @article{chen2023internvl,
876
+ title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
877
+ author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
878
+ journal={arXiv preprint arXiv:2312.14238},
879
+ year={2023}
880
+ }
881
+ @article{chen2024far,
882
+ title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
883
+ author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
884
+ journal={arXiv preprint arXiv:2404.16821},
885
+ year={2024}
886
+ }
887
+ ```
added_tokens.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "</box>": 92552,
3
+ "</img>": 92545,
4
+ "</quad>": 92548,
5
+ "</ref>": 92550,
6
+ "<IMG_CONTEXT>": 92546,
7
+ "<box>": 92551,
8
+ "<img>": 92544,
9
+ "<quad>": 92547,
10
+ "<ref>": 92549
11
+ }
config.json ADDED
@@ -0,0 +1,143 @@
1
+ {
2
+ "_commit_hash": null,
3
+ "architectures": [
4
+ "InternVLChatModel"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
8
+ "AutoModel": "modeling_internvl_chat.InternVLChatModel",
9
+ "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
10
+ },
11
+ "downsample_ratio": 0.5,
12
+ "dynamic_image_size": true,
13
+ "force_image_size": 448,
14
+ "llm_config": {
15
+ "_name_or_path": "internlm/internlm2-chat-20b",
16
+ "add_cross_attention": false,
17
+ "architectures": [
18
+ "InternLM2ForCausalLM"
19
+ ],
20
+ "attn_implementation": "flash_attention_2",
21
+ "auto_map": {
22
+ "AutoConfig": "configuration_internlm2.InternLM2Config",
23
+ "AutoModel": "modeling_internlm2.InternLM2ForCausalLM",
24
+ "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM"
25
+ },
26
+ "bad_words_ids": null,
27
+ "begin_suppress_tokens": null,
28
+ "bias": false,
29
+ "bos_token_id": 1,
30
+ "chunk_size_feed_forward": 0,
31
+ "cross_attention_hidden_size": null,
32
+ "decoder_start_token_id": null,
33
+ "diversity_penalty": 0.0,
34
+ "do_sample": false,
35
+ "early_stopping": false,
36
+ "encoder_no_repeat_ngram_size": 0,
37
+ "eos_token_id": 2,
38
+ "exponential_decay_length_penalty": null,
39
+ "finetuning_task": null,
40
+ "forced_bos_token_id": null,
41
+ "forced_eos_token_id": null,
42
+ "hidden_act": "silu",
43
+ "hidden_size": 6144,
44
+ "id2label": {
45
+ "0": "LABEL_0",
46
+ "1": "LABEL_1"
47
+ },
48
+ "initializer_range": 0.02,
49
+ "intermediate_size": 16384,
50
+ "is_decoder": false,
51
+ "is_encoder_decoder": false,
52
+ "label2id": {
53
+ "LABEL_0": 0,
54
+ "LABEL_1": 1
55
+ },
56
+ "length_penalty": 1.0,
57
+ "max_length": 20,
58
+ "max_position_embeddings": 32768,
59
+ "min_length": 0,
60
+ "model_type": "internlm2",
61
+ "no_repeat_ngram_size": 0,
62
+ "num_attention_heads": 48,
63
+ "num_beam_groups": 1,
64
+ "num_beams": 1,
65
+ "num_hidden_layers": 48,
66
+ "num_key_value_heads": 8,
67
+ "num_return_sequences": 1,
68
+ "output_attentions": false,
69
+ "output_hidden_states": false,
70
+ "output_scores": false,
71
+ "pad_token_id": 2,
72
+ "prefix": null,
73
+ "problem_type": null,
74
+ "pruned_heads": {},
75
+ "remove_invalid_values": false,
76
+ "repetition_penalty": 1.0,
77
+ "return_dict": true,
78
+ "return_dict_in_generate": false,
79
+ "rms_norm_eps": 1e-05,
80
+ "rope_scaling": {
81
+ "factor": 3.0,
82
+ "type": "dynamic"
83
+ },
84
+ "rope_theta": 1000000,
85
+ "sep_token_id": null,
86
+ "suppress_tokens": null,
87
+ "task_specific_params": null,
88
+ "temperature": 1.0,
89
+ "tf_legacy_loss": false,
90
+ "tie_encoder_decoder": false,
91
+ "tie_word_embeddings": false,
92
+ "tokenizer_class": null,
93
+ "top_k": 50,
94
+ "top_p": 1.0,
95
+ "torch_dtype": "bfloat16",
96
+ "torchscript": false,
97
+ "transformers_version": "4.37.2",
98
+ "typical_p": 1.0,
99
+ "use_bfloat16": true,
100
+ "use_cache": true,
101
+ "vocab_size": 92553
102
+ },
103
+ "max_dynamic_patch": 12,
104
+ "min_dynamic_patch": 1,
105
+ "model_type": "internvl_chat",
106
+ "ps_version": "v2",
107
+ "select_layer": -1,
108
+ "template": "internlm2-chat",
109
+ "torch_dtype": "bfloat16",
110
+ "use_backbone_lora": 0,
111
+ "use_llm_lora": 0,
112
+ "use_thumbnail": true,
113
+ "vision_config": {
114
+ "architectures": [
115
+ "InternVisionModel"
116
+ ],
117
+ "attention_dropout": 0.0,
118
+ "drop_path_rate": 0.0,
119
+ "dropout": 0.0,
120
+ "hidden_act": "gelu",
121
+ "hidden_size": 3200,
122
+ "image_size": 448,
123
+ "initializer_factor": 0.1,
124
+ "initializer_range": 1e-10,
125
+ "intermediate_size": 12800,
126
+ "layer_norm_eps": 1e-06,
127
+ "model_type": "intern_vit_6b",
128
+ "norm_type": "rms_norm",
129
+ "num_attention_heads": 25,
130
+ "num_channels": 3,
131
+ "num_hidden_layers": 45,
132
+ "output_attentions": false,
133
+ "output_hidden_states": false,
134
+ "patch_size": 14,
135
+ "qk_normalization": true,
136
+ "qkv_bias": false,
137
+ "return_dict": true,
138
+ "torch_dtype": "bfloat16",
139
+ "transformers_version": "4.37.2",
140
+ "use_bfloat16": true,
141
+ "use_flash_attn": true
142
+ }
143
+ }
configuration_intern_vit.py ADDED
@@ -0,0 +1,119 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+ import os
7
+ from typing import Union
8
+
9
+ from transformers.configuration_utils import PretrainedConfig
10
+ from transformers.utils import logging
11
+
12
+ logger = logging.get_logger(__name__)
13
+
14
+
15
+ class InternVisionConfig(PretrainedConfig):
16
+ r"""
17
+ This is the configuration class to store the configuration of a [`InternVisionModel`]. It is used to
18
+ instantiate a vision encoder according to the specified arguments, defining the model architecture.
19
+
20
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
21
+ documentation from [`PretrainedConfig`] for more information.
22
+
23
+ Args:
24
+ num_channels (`int`, *optional*, defaults to 3):
25
+ Number of color channels in the input images (e.g., 3 for RGB).
26
+ patch_size (`int`, *optional*, defaults to 14):
27
+ The size (resolution) of each patch.
28
+ image_size (`int`, *optional*, defaults to 224):
29
+ The size (resolution) of each image.
30
+ qkv_bias (`bool`, *optional*, defaults to `False`):
31
+ Whether to add a bias to the queries and values in the self-attention layers.
32
+ hidden_size (`int`, *optional*, defaults to 3200):
33
+ Dimensionality of the encoder layers and the pooler layer.
34
+ num_attention_heads (`int`, *optional*, defaults to 25):
35
+ Number of attention heads for each attention layer in the Transformer encoder.
36
+ intermediate_size (`int`, *optional*, defaults to 12800):
37
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
38
+ qk_normalization (`bool`, *optional*, defaults to `True`):
39
+ Whether to normalize the queries and keys in the self-attention layers.
40
+ num_hidden_layers (`int`, *optional*, defaults to 48):
41
+ Number of hidden layers in the Transformer encoder.
42
+ use_flash_attn (`bool`, *optional*, defaults to `True`):
43
+ Whether to use flash attention mechanism.
44
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
45
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
46
+ `"relu"`, `"selu"` and `"gelu_new"` ``"gelu"` are supported.
47
+ layer_norm_eps (`float`, *optional*, defaults to 1e-6):
48
+ The epsilon used by the layer normalization layers.
49
+ dropout (`float`, *optional*, defaults to 0.0):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ drop_path_rate (`float`, *optional*, defaults to 0.0):
52
+ Dropout rate for stochastic depth.
53
+ attention_dropout (`float`, *optional*, defaults to 0.0):
54
+ The dropout ratio for the attention probabilities.
55
+ initializer_range (`float`, *optional*, defaults to 0.02):
56
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
57
+ initializer_factor (`float`, *optional*, defaults to 0.1):
58
+ A factor for layer scale.
59
+ """
60
+
61
+ model_type = 'intern_vit_6b'
62
+
63
+ def __init__(
64
+ self,
65
+ num_channels=3,
66
+ patch_size=14,
67
+ image_size=224,
68
+ qkv_bias=False,
69
+ hidden_size=3200,
70
+ num_attention_heads=25,
71
+ intermediate_size=12800,
72
+ qk_normalization=True,
73
+ num_hidden_layers=48,
74
+ use_flash_attn=True,
75
+ hidden_act='gelu',
76
+ norm_type='rms_norm',
77
+ layer_norm_eps=1e-6,
78
+ dropout=0.0,
79
+ drop_path_rate=0.0,
80
+ attention_dropout=0.0,
81
+ initializer_range=0.02,
82
+ initializer_factor=0.1,
83
+ **kwargs,
84
+ ):
85
+ super().__init__(**kwargs)
86
+
87
+ self.hidden_size = hidden_size
88
+ self.intermediate_size = intermediate_size
89
+ self.dropout = dropout
90
+ self.drop_path_rate = drop_path_rate
91
+ self.num_hidden_layers = num_hidden_layers
92
+ self.num_attention_heads = num_attention_heads
93
+ self.num_channels = num_channels
94
+ self.patch_size = patch_size
95
+ self.image_size = image_size
96
+ self.initializer_range = initializer_range
97
+ self.initializer_factor = initializer_factor
98
+ self.attention_dropout = attention_dropout
99
+ self.layer_norm_eps = layer_norm_eps
100
+ self.hidden_act = hidden_act
101
+ self.norm_type = norm_type
102
+ self.qkv_bias = qkv_bias
103
+ self.qk_normalization = qk_normalization
104
+ self.use_flash_attn = use_flash_attn
105
+
106
+ @classmethod
107
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> 'PretrainedConfig':
108
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
109
+
110
+ if 'vision_config' in config_dict:
111
+ config_dict = config_dict['vision_config']
112
+
113
+ if 'model_type' in config_dict and hasattr(cls, 'model_type') and config_dict['model_type'] != cls.model_type:
114
+ logger.warning(
115
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
116
+ f'{cls.model_type}. This is not supported for all configurations of models and can yield errors.'
117
+ )
118
+
119
+ return cls.from_dict(config_dict, **kwargs)
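A hedged usage sketch (not part of the repository) for the class above: constructing `InternVisionConfig` directly with the values that appear in the `vision_config` block of `config.json`; arguments not passed keep the defaults defined in this file.

```python
# Minimal sketch; assumes a checkout of this repository so the module imports locally.
from configuration_intern_vit import InternVisionConfig

vision_config = InternVisionConfig(
    hidden_size=3200,
    intermediate_size=12800,
    num_attention_heads=25,
    num_hidden_layers=45,   # config.json overrides the 48-layer default
    patch_size=14,
    qkv_bias=False,
    qk_normalization=True,
    norm_type='rms_norm',
    layer_norm_eps=1e-6,
)
print(vision_config.model_type)         # intern_vit_6b
print(vision_config.num_hidden_layers)  # 45
```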
configuration_internlm2.py ADDED
@@ -0,0 +1,150 @@
1
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
2
+ #
3
+ # This code is based on transformers/src/transformers/models/llama/configuration_llama.py
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ InternLM2 model configuration"""
17
+
18
+ from transformers.configuration_utils import PretrainedConfig
19
+ from transformers.utils import logging
20
+
21
+ logger = logging.get_logger(__name__)
22
+
23
+ INTERNLM2_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
24
+
25
+
26
+ # Modified from transformers.model.llama.configuration_llama.LlamaConfig
27
+ class InternLM2Config(PretrainedConfig):
28
+ r"""
29
+ This is the configuration class to store the configuration of a [`InternLM2Model`]. It is used to instantiate
30
+ an InternLM2 model according to the specified arguments, defining the model architecture. Instantiating a
31
+ configuration with the defaults will yield a similar configuration to that of the InternLM2-7B.
32
+
33
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
34
+ documentation from [`PretrainedConfig`] for more information.
35
+
36
+
37
+ Args:
38
+ vocab_size (`int`, *optional*, defaults to 32000):
39
+ Vocabulary size of the InternLM2 model. Defines the number of different tokens that can be represented by the
40
+ `inputs_ids` passed when calling [`InternLM2Model`]
41
+ hidden_size (`int`, *optional*, defaults to 4096):
42
+ Dimension of the hidden representations.
43
+ intermediate_size (`int`, *optional*, defaults to 11008):
44
+ Dimension of the MLP representations.
45
+ num_hidden_layers (`int`, *optional*, defaults to 32):
46
+ Number of hidden layers in the Transformer encoder.
47
+ num_attention_heads (`int`, *optional*, defaults to 32):
48
+ Number of attention heads for each attention layer in the Transformer encoder.
49
+ num_key_value_heads (`int`, *optional*):
50
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
51
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
52
+ `num_key_value_heads=1 the model will use Multi Query Attention (MQA) otherwise GQA is used. When
53
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
54
+ by meanpooling all the original heads within that group. For more details checkout [this
55
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
56
+ `num_attention_heads`.
57
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
58
+ The non-linear activation function (function or string) in the decoder.
59
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
60
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
61
+ just in case (e.g., 512 or 1024 or 2048).
62
+ initializer_range (`float`, *optional*, defaults to 0.02):
63
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
64
+ rms_norm_eps (`float`, *optional*, defaults to 1e-6):
65
+ The epsilon used by the rms normalization layers.
66
+ use_cache (`bool`, *optional*, defaults to `True`):
67
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
68
+ relevant if `config.is_decoder=True`.
69
+ tie_word_embeddings(`bool`, *optional*, defaults to `False`):
70
+ Whether to tie weight embeddings
71
+ Example:
72
+
73
+ """
74
+ model_type = 'internlm2'
75
+ _auto_class = 'AutoConfig'
76
+
77
+ def __init__( # pylint: disable=W0102
78
+ self,
79
+ vocab_size=103168,
80
+ hidden_size=4096,
81
+ intermediate_size=11008,
82
+ num_hidden_layers=32,
83
+ num_attention_heads=32,
84
+ num_key_value_heads=None,
85
+ hidden_act='silu',
86
+ max_position_embeddings=2048,
87
+ initializer_range=0.02,
88
+ rms_norm_eps=1e-6,
89
+ use_cache=True,
90
+ pad_token_id=0,
91
+ bos_token_id=1,
92
+ eos_token_id=2,
93
+ tie_word_embeddings=False,
94
+ bias=True,
95
+ rope_theta=10000,
96
+ rope_scaling=None,
97
+ attn_implementation='eager',
98
+ **kwargs,
99
+ ):
100
+ self.vocab_size = vocab_size
101
+ self.max_position_embeddings = max_position_embeddings
102
+ self.hidden_size = hidden_size
103
+ self.intermediate_size = intermediate_size
104
+ self.num_hidden_layers = num_hidden_layers
105
+ self.num_attention_heads = num_attention_heads
106
+ self.bias = bias
107
+
108
+ if num_key_value_heads is None:
109
+ num_key_value_heads = num_attention_heads
110
+ self.num_key_value_heads = num_key_value_heads
111
+
112
+ self.hidden_act = hidden_act
113
+ self.initializer_range = initializer_range
114
+ self.rms_norm_eps = rms_norm_eps
115
+ self.use_cache = use_cache
116
+ self.rope_theta = rope_theta
117
+ self.rope_scaling = rope_scaling
118
+ self._rope_scaling_validation()
119
+
120
+ self.attn_implementation = attn_implementation
121
+ if self.attn_implementation is None:
122
+ self.attn_implementation = 'eager'
123
+ super().__init__(
124
+ pad_token_id=pad_token_id,
125
+ bos_token_id=bos_token_id,
126
+ eos_token_id=eos_token_id,
127
+ tie_word_embeddings=tie_word_embeddings,
128
+ **kwargs,
129
+ )
130
+
131
+ def _rope_scaling_validation(self):
132
+ """
133
+ Validate the `rope_scaling` configuration.
134
+ """
135
+ if self.rope_scaling is None:
136
+ return
137
+
138
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
139
+ raise ValueError(
140
+ '`rope_scaling` must be a dictionary with two fields, `type` and `factor`, '
141
+ f'got {self.rope_scaling}'
142
+ )
143
+ rope_scaling_type = self.rope_scaling.get('type', None)
144
+ rope_scaling_factor = self.rope_scaling.get('factor', None)
145
+ if rope_scaling_type is None or rope_scaling_type not in ['linear', 'dynamic']:
146
+ raise ValueError(
147
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
148
+ )
149
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor < 1.0:
150
+ raise ValueError(f"`rope_scaling`'s factor field must be a float >= 1, got {rope_scaling_factor}")
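A hedged sketch (not part of the repository) of the `rope_scaling` validation above: the dictionary is checked in `__init__`, so a malformed value fails at construction time rather than at model load.

```python
# Minimal sketch; assumes a checkout of this repository so the module imports locally.
from configuration_internlm2 import InternLM2Config

# A supported scaling type passes validation.
cfg = InternLM2Config(rope_scaling={'type': 'dynamic', 'factor': 2.0})
print(cfg.rope_scaling)   # {'type': 'dynamic', 'factor': 2.0}

# An unsupported type is rejected immediately.
try:
    InternLM2Config(rope_scaling={'type': 'yarn', 'factor': 2.0})
except ValueError as err:
    print(err)
```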
configuration_internvl_chat.py ADDED
@@ -0,0 +1,96 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+
7
+ import copy
8
+
9
+ from transformers import AutoConfig, LlamaConfig
10
+ from transformers.configuration_utils import PretrainedConfig
11
+ from transformers.utils import logging
12
+
13
+ from .configuration_intern_vit import InternVisionConfig
14
+ from .configuration_internlm2 import InternLM2Config
15
+
16
+ logger = logging.get_logger(__name__)
17
+
18
+
19
+ class InternVLChatConfig(PretrainedConfig):
20
+ model_type = 'internvl_chat'
21
+ is_composition = True
22
+
23
+ def __init__(
24
+ self,
25
+ vision_config=None,
26
+ llm_config=None,
27
+ use_backbone_lora=0,
28
+ use_llm_lora=0,
29
+ select_layer=-1,
30
+ force_image_size=None,
31
+ downsample_ratio=0.5,
32
+ template=None,
33
+ dynamic_image_size=False,
34
+ use_thumbnail=False,
35
+ ps_version='v1',
36
+ min_dynamic_patch=1,
37
+ max_dynamic_patch=6,
38
+ **kwargs):
39
+ super().__init__(**kwargs)
40
+
41
+ if vision_config is None:
42
+ vision_config = {}
43
+ logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
44
+
45
+ if llm_config is None:
46
+ llm_config = {}
47
+ logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
48
+
49
+ self.vision_config = InternVisionConfig(**vision_config)
50
+ if llm_config['architectures'][0] == 'LlamaForCausalLM':
51
+ self.llm_config = LlamaConfig(**llm_config)
52
+ elif llm_config['architectures'][0] == 'InternLM2ForCausalLM':
53
+ self.llm_config = InternLM2Config(**llm_config)
54
+ else:
55
+ raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
56
+ self.use_backbone_lora = use_backbone_lora
57
+ self.use_llm_lora = use_llm_lora
58
+ self.select_layer = select_layer
59
+ self.force_image_size = force_image_size
60
+ self.downsample_ratio = downsample_ratio
61
+ self.template = template
62
+ self.dynamic_image_size = dynamic_image_size
63
+ self.use_thumbnail = use_thumbnail
64
+ self.ps_version = ps_version # pixel shuffle version
65
+ self.min_dynamic_patch = min_dynamic_patch
66
+ self.max_dynamic_patch = max_dynamic_patch
67
+
68
+ logger.info(f'vision_select_layer: {self.select_layer}')
69
+ logger.info(f'ps_version: {self.ps_version}')
70
+ logger.info(f'min_dynamic_patch: {self.min_dynamic_patch}')
71
+ logger.info(f'max_dynamic_patch: {self.max_dynamic_patch}')
72
+
73
+ def to_dict(self):
74
+ """
75
+ Serializes this instance to a Python dictionary. Override the default [`~PretrainedConfig.to_dict`].
76
+
77
+ Returns:
78
+ `Dict[str, any]`: Dictionary of all the attributes that make up this configuration instance,
79
+ """
80
+ output = copy.deepcopy(self.__dict__)
81
+ output['vision_config'] = self.vision_config.to_dict()
82
+ output['llm_config'] = self.llm_config.to_dict()
83
+ output['model_type'] = self.__class__.model_type
84
+ output['use_backbone_lora'] = self.use_backbone_lora
85
+ output['use_llm_lora'] = self.use_llm_lora
86
+ output['select_layer'] = self.select_layer
87
+ output['force_image_size'] = self.force_image_size
88
+ output['downsample_ratio'] = self.downsample_ratio
89
+ output['template'] = self.template
90
+ output['dynamic_image_size'] = self.dynamic_image_size
91
+ output['use_thumbnail'] = self.use_thumbnail
92
+ output['ps_version'] = self.ps_version
93
+ output['min_dynamic_patch'] = self.min_dynamic_patch
94
+ output['max_dynamic_patch'] = self.max_dynamic_patch
95
+
96
+ return output
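A hedged sketch (not part of the repository) of composing `InternVLChatConfig` from the two sub-configs defined above. Because the constructor reads `llm_config['architectures'][0]`, the dict passed as `llm_config` must carry an `architectures` entry; all values below are illustrative.

```python
# Minimal sketch; assumes a checkout of this repository so the module imports locally.
from configuration_internvl_chat import InternVLChatConfig

config = InternVLChatConfig(
    vision_config={'num_hidden_layers': 45},                 # illustrative override
    llm_config={'architectures': ['InternLM2ForCausalLM']},  # selects InternLM2Config
    downsample_ratio=0.5,
    template='internlm2-chat',
)
print(type(config.llm_config).__name__)  # InternLM2Config
print(config.to_dict()['template'])      # internlm2-chat
```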
conversation.py ADDED
@@ -0,0 +1,393 @@
1
+ """
2
+ Conversation prompt templates.
3
+
4
+ We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
+ If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
+ """
7
+
8
+ import dataclasses
9
+ from enum import IntEnum, auto
10
+ from typing import Any, Dict, List, Tuple, Union
11
+
12
+
13
+ class SeparatorStyle(IntEnum):
14
+ """Separator styles."""
15
+
16
+ ADD_COLON_SINGLE = auto()
17
+ ADD_COLON_TWO = auto()
18
+ ADD_COLON_SPACE_SINGLE = auto()
19
+ NO_COLON_SINGLE = auto()
20
+ NO_COLON_TWO = auto()
21
+ ADD_NEW_LINE_SINGLE = auto()
22
+ LLAMA2 = auto()
23
+ CHATGLM = auto()
24
+ CHATML = auto()
25
+ CHATINTERN = auto()
26
+ DOLLY = auto()
27
+ RWKV = auto()
28
+ PHOENIX = auto()
29
+ ROBIN = auto()
30
+ FALCON_CHAT = auto()
31
+ CHATGLM3 = auto()
32
+ INTERNVL_ZH = auto()
33
+ MPT = auto()
34
+
35
+
36
+ @dataclasses.dataclass
37
+ class Conversation:
38
+ """A class that manages prompt templates and keeps all conversation history."""
39
+
40
+ # The name of this template
41
+ name: str
42
+ # The template of the system prompt
43
+ system_template: str = '{system_message}'
44
+ # The system message
45
+ system_message: str = ''
46
+ # The names of two roles
47
+ roles: Tuple[str] = ('USER', 'ASSISTANT')
48
+ # All messages. Each item is (role, message).
49
+ messages: List[List[str]] = ()
50
+ # The number of few shot examples
51
+ offset: int = 0
52
+ # The separator style and configurations
53
+ sep_style: SeparatorStyle = SeparatorStyle.ADD_COLON_SINGLE
54
+ sep: str = '\n'
55
+ sep2: str = None
56
+ # Stop criteria (the default one is EOS token)
57
+ stop_str: Union[str, List[str]] = None
58
+ # Stops generation if meeting any token in this list
59
+ stop_token_ids: List[int] = None
60
+
61
+ def get_prompt(self) -> str:
62
+ """Get the prompt for generation."""
63
+ system_prompt = self.system_template.format(system_message=self.system_message)
64
+ if self.sep_style == SeparatorStyle.ADD_COLON_SINGLE:
65
+ ret = system_prompt + self.sep
66
+ for role, message in self.messages:
67
+ if message:
68
+ ret += role + ': ' + message + self.sep
69
+ else:
70
+ ret += role + ':'
71
+ return ret
72
+ elif self.sep_style == SeparatorStyle.ADD_COLON_TWO:
73
+ seps = [self.sep, self.sep2]
74
+ ret = system_prompt + seps[0]
75
+ for i, (role, message) in enumerate(self.messages):
76
+ if message:
77
+ ret += role + ': ' + message + seps[i % 2]
78
+ else:
79
+ ret += role + ':'
80
+ return ret
81
+ elif self.sep_style == SeparatorStyle.ADD_COLON_SPACE_SINGLE:
82
+ ret = system_prompt + self.sep
83
+ for role, message in self.messages:
84
+ if message:
85
+ ret += role + ': ' + message + self.sep
86
+ else:
87
+ ret += role + ': ' # must be end with a space
88
+ return ret
89
+ elif self.sep_style == SeparatorStyle.ADD_NEW_LINE_SINGLE:
90
+ ret = '' if system_prompt == '' else system_prompt + self.sep
91
+ for role, message in self.messages:
92
+ if message:
93
+ ret += role + '\n' + message + self.sep
94
+ else:
95
+ ret += role + '\n'
96
+ return ret
97
+ elif self.sep_style == SeparatorStyle.NO_COLON_SINGLE:
98
+ ret = system_prompt
99
+ for role, message in self.messages:
100
+ if message:
101
+ ret += role + message + self.sep
102
+ else:
103
+ ret += role
104
+ return ret
105
+ elif self.sep_style == SeparatorStyle.NO_COLON_TWO:
106
+ seps = [self.sep, self.sep2]
107
+ ret = system_prompt
108
+ for i, (role, message) in enumerate(self.messages):
109
+ if message:
110
+ ret += role + message + seps[i % 2]
111
+ else:
112
+ ret += role
113
+ return ret
114
+ elif self.sep_style == SeparatorStyle.RWKV:
115
+ ret = system_prompt
116
+ for i, (role, message) in enumerate(self.messages):
117
+ if message:
118
+ ret += (
119
+ role
120
+ + ': '
121
+ + message.replace('\r\n', '\n').replace('\n\n', '\n')
122
+ )
123
+ ret += '\n\n'
124
+ else:
125
+ ret += role + ':'
126
+ return ret
127
+ elif self.sep_style == SeparatorStyle.LLAMA2:
128
+ seps = [self.sep, self.sep2]
129
+ if self.system_message:
130
+ ret = system_prompt
131
+ else:
132
+ ret = '[INST] '
133
+ for i, (role, message) in enumerate(self.messages):
134
+ tag = self.roles[i % 2]
135
+ if message:
136
+ if i == 0:
137
+ ret += message + ' '
138
+ else:
139
+ ret += tag + ' ' + message + seps[i % 2]
140
+ else:
141
+ ret += tag
142
+ return ret
143
+ elif self.sep_style == SeparatorStyle.CHATGLM:
144
+ # source: https://huggingface.co/THUDM/chatglm-6b/blob/1d240ba371910e9282298d4592532d7f0f3e9f3e/modeling_chatglm.py#L1302-L1308
145
+ # source2: https://huggingface.co/THUDM/chatglm2-6b/blob/e186c891cf64310ac66ef10a87e6635fa6c2a579/modeling_chatglm.py#L926
146
+ round_add_n = 1 if self.name == 'chatglm2' else 0
147
+ if system_prompt:
148
+ ret = system_prompt + self.sep
149
+ else:
150
+ ret = ''
151
+
152
+ for i, (role, message) in enumerate(self.messages):
153
+ if i % 2 == 0:
154
+ ret += f'[Round {i//2 + round_add_n}]{self.sep}'
155
+
156
+ if message:
157
+ ret += f'{role}:{message}{self.sep}'
158
+ else:
159
+ ret += f'{role}:'
160
+ return ret
161
+ elif self.sep_style == SeparatorStyle.CHATML:
162
+ ret = '' if system_prompt == '' else system_prompt + self.sep + '\n'
163
+ for role, message in self.messages:
164
+ if message:
165
+ ret += role + '\n' + message + self.sep + '\n'
166
+ else:
167
+ ret += role + '\n'
168
+ return ret
169
+ elif self.sep_style == SeparatorStyle.CHATGLM3:
170
+ ret = ''
171
+ if self.system_message:
172
+ ret += system_prompt
173
+ for role, message in self.messages:
174
+ if message:
175
+ ret += role + '\n' + ' ' + message
176
+ else:
177
+ ret += role
178
+ return ret
179
+ elif self.sep_style == SeparatorStyle.CHATINTERN:
180
+ # source: https://huggingface.co/internlm/internlm-chat-7b-8k/blob/bd546fa984b4b0b86958f56bf37f94aa75ab8831/modeling_internlm.py#L771
181
+ seps = [self.sep, self.sep2]
182
+ ret = system_prompt
183
+ for i, (role, message) in enumerate(self.messages):
184
+ # if i % 2 == 0:
185
+ # ret += "<s>"
186
+ if message:
187
+ ret += role + ':' + message + seps[i % 2] + '\n'
188
+ else:
189
+ ret += role + ':'
190
+ return ret
191
+ elif self.sep_style == SeparatorStyle.DOLLY:
192
+ seps = [self.sep, self.sep2]
193
+ ret = system_prompt
194
+ for i, (role, message) in enumerate(self.messages):
195
+ if message:
196
+ ret += role + ':\n' + message + seps[i % 2]
197
+ if i % 2 == 1:
198
+ ret += '\n\n'
199
+ else:
200
+ ret += role + ':\n'
201
+ return ret
202
+ elif self.sep_style == SeparatorStyle.PHOENIX:
203
+ ret = system_prompt
204
+ for role, message in self.messages:
205
+ if message:
206
+ ret += role + ': ' + '<s>' + message + '</s>'
207
+ else:
208
+ ret += role + ': ' + '<s>'
209
+ return ret
210
+ elif self.sep_style == SeparatorStyle.ROBIN:
211
+ ret = system_prompt + self.sep
212
+ for role, message in self.messages:
213
+ if message:
214
+ ret += role + ':\n' + message + self.sep
215
+ else:
216
+ ret += role + ':\n'
217
+ return ret
218
+ elif self.sep_style == SeparatorStyle.FALCON_CHAT:
219
+ ret = ''
220
+ if self.system_message:
221
+ ret += system_prompt + self.sep
222
+ for role, message in self.messages:
223
+ if message:
224
+ ret += role + ': ' + message + self.sep
225
+ else:
226
+ ret += role + ':'
227
+
228
+ return ret
229
+ elif self.sep_style == SeparatorStyle.INTERNVL_ZH:
230
+ seps = [self.sep, self.sep2]
231
+ ret = self.system_message + seps[0]
232
+ for i, (role, message) in enumerate(self.messages):
233
+ if message:
234
+ ret += role + ': ' + message + seps[i % 2]
235
+ else:
236
+ ret += role + ':'
237
+ return ret
238
+ elif self.sep_style == SeparatorStyle.MPT:
239
+ ret = system_prompt + self.sep
240
+ for role, message in self.messages:
241
+ if message:
242
+ if type(message) is tuple:
243
+ message, _, _ = message
244
+ ret += role + message + self.sep
245
+ else:
246
+ ret += role
247
+ return ret
248
+ else:
249
+ raise ValueError(f'Invalid style: {self.sep_style}')
250
+
251
+ def set_system_message(self, system_message: str):
252
+ """Set the system message."""
253
+ self.system_message = system_message
254
+
255
+ def append_message(self, role: str, message: str):
256
+ """Append a new message."""
257
+ self.messages.append([role, message])
258
+
259
+ def update_last_message(self, message: str):
260
+ """Update the last output.
261
+
262
+ The last message is typically set to be None when constructing the prompt,
263
+ so we need to update it in-place after getting the response from a model.
264
+ """
265
+ self.messages[-1][1] = message
266
+
267
+ def to_gradio_chatbot(self):
268
+ """Convert the conversation to gradio chatbot format."""
269
+ ret = []
270
+ for i, (role, msg) in enumerate(self.messages[self.offset :]):
271
+ if i % 2 == 0:
272
+ ret.append([msg, None])
273
+ else:
274
+ ret[-1][-1] = msg
275
+ return ret
276
+
277
+ def to_openai_api_messages(self):
278
+ """Convert the conversation to OpenAI chat completion format."""
279
+ ret = [{'role': 'system', 'content': self.system_message}]
280
+
281
+ for i, (_, msg) in enumerate(self.messages[self.offset :]):
282
+ if i % 2 == 0:
283
+ ret.append({'role': 'user', 'content': msg})
284
+ else:
285
+ if msg is not None:
286
+ ret.append({'role': 'assistant', 'content': msg})
287
+ return ret
288
+
289
+ def copy(self):
290
+ return Conversation(
291
+ name=self.name,
292
+ system_template=self.system_template,
293
+ system_message=self.system_message,
294
+ roles=self.roles,
295
+ messages=[[x, y] for x, y in self.messages],
296
+ offset=self.offset,
297
+ sep_style=self.sep_style,
298
+ sep=self.sep,
299
+ sep2=self.sep2,
300
+ stop_str=self.stop_str,
301
+ stop_token_ids=self.stop_token_ids,
302
+ )
303
+
304
+ def dict(self):
305
+ return {
306
+ 'template_name': self.name,
307
+ 'system_message': self.system_message,
308
+ 'roles': self.roles,
309
+ 'messages': self.messages,
310
+ 'offset': self.offset,
311
+ }
312
+
313
+
314
+ # A global registry for all conversation templates
315
+ conv_templates: Dict[str, Conversation] = {}
316
+
317
+
318
+ def register_conv_template(template: Conversation, override: bool = False):
319
+ """Register a new conversation template."""
320
+ if not override:
321
+ assert (
322
+ template.name not in conv_templates
323
+ ), f'{template.name} has been registered.'
324
+
325
+ conv_templates[template.name] = template
326
+
327
+
328
+ def get_conv_template(name: str) -> Conversation:
329
+ """Get a conversation template."""
330
+ return conv_templates[name].copy()
331
+
332
+
333
+ # Both Hermes-2 and internlm2-chat are chatml-format conversation templates. The difference
334
+ # is that during training, the preprocessing function for the Hermes-2 template doesn't add
335
+ # <s> at the beginning of the tokenized sequence, while the internlm2-chat template does.
336
+ # Therefore, they are completely equivalent during inference.
337
+ register_conv_template(
338
+ Conversation(
339
+ name='Hermes-2',
340
+ system_template='<|im_start|>system\n{system_message}',
341
+ # note: The new system prompt was not used here to avoid changes in benchmark performance.
342
+ # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
343
+ system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
344
+ roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
345
+ sep_style=SeparatorStyle.MPT,
346
+ sep='<|im_end|>',
347
+ stop_token_ids=[
348
+ 2,
349
+ 6,
350
+ 7,
351
+ 8,
352
+ ],
353
+ stop_str='<|endoftext|>',
354
+ )
355
+ )
356
+
357
+
358
+ register_conv_template(
359
+ Conversation(
360
+ name='internlm2-chat',
361
+ system_template='<|im_start|>system\n{system_message}',
362
+ # note: The new system prompt was not used here to avoid changes in benchmark performance.
363
+ # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
364
+ system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
365
+ roles=('<|im_start|>user\n', '<|im_start|>assistant\n'),
366
+ sep_style=SeparatorStyle.MPT,
367
+ sep='<|im_end|>',
368
+ stop_token_ids=[
369
+ 2,
370
+ 92543,
371
+ 92542
372
+ ]
373
+ )
374
+ )
375
+
376
+
377
+ register_conv_template(
378
+ Conversation(
379
+ name='phi3-chat',
380
+ system_template='<|system|>\n{system_message}',
381
+ # note: The new system prompt was not used here to avoid changes in benchmark performance.
382
+ # system_message='我是书生·万象,英文名是InternVL,是由上海人工智能实验室、清华大学及多家合作单位联合开发的多模态大语言模型。',
383
+ system_message='你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。',
384
+ roles=('<|user|>\n', '<|assistant|>\n'),
385
+ sep_style=SeparatorStyle.MPT,
386
+ sep='<|end|>',
387
+ stop_token_ids=[
388
+ 2,
389
+ 32000,
390
+ 32007
391
+ ]
392
+ )
393
+ )
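A hedged sketch (not part of the repository) of how the templates registered above are used: `get_conv_template` returns a copy, messages are appended per role, and `get_prompt` renders the ChatML-style string. (The Chinese system message identifies the model as InternVL, a multimodal large model developed by Shanghai AI Laboratory together with SenseTime, and asks it to act as a helpful, harmless assistant.)

```python
# Minimal sketch; assumes a checkout of this repository so the module imports locally.
from conversation import get_conv_template

conv = get_conv_template('internlm2-chat')
conv.append_message(conv.roles[0], 'Describe the image in detail.')
conv.append_message(conv.roles[1], None)  # None leaves the assistant turn open
prompt = conv.get_prompt()
print(prompt)
# <|im_start|>system
# ...system message...<|im_end|><|im_start|>user
# Describe the image in detail.<|im_end|><|im_start|>assistant
```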
examples/image1.jpg ADDED
examples/image2.jpg ADDED
examples/red-panda.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d921c07bb97224d65a37801541d246067f0d506f08723ffa1ad85c217907ccb8
3
+ size 1867237
generation_config.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.37.2",
4
+ "eos_token_id": [
5
+ 92542,
6
+ 92543
7
+ ]
8
+ }
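A hedged note on the file above: two end-of-sequence ids are registered, so generation stops on either token. A minimal sketch (not part of this commit) of reading it back; the hub id is an assumption.

```python
# Minimal sketch; the hub id is an assumption, a local path works as well.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained('OpenGVLab/InternVL2-26B')
print(gen_cfg.eos_token_id)  # [92542, 92543]
```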
model-00001-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43bbb2f1ae6d94ebbe6869a4c45b93cb19a562fe06dfab87fa5340c02c9c2f3d
3
+ size 4988569440
model-00002-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0bed4eccf3cffcb860c1c41269c41527a6bef7c9ef6892be8f5d32f74886e075
3
+ size 4937253584
model-00003-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0b06e8453356567b7d952d1bd5770388cddca47d268cce41aee418e7b3d820b6
3
+ size 4801189400
model-00004-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2724dbd89ff4dc8f445e8338597db8ac874615ed61b67b6f324c20f3050d6923
3
+ size 4882322840
model-00005-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f160b3f91fedb979f755f05d019547006327b37edd398f991dd3cfc768c4a896
3
+ size 4882322880
model-00006-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1477fad5896b88336cb7702e4e352fecbbd2f8b69fe7ef6a55a61455cba93b5c
3
+ size 4983011128
model-00007-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dbefad21e014e31a437eb34160e0b5c4ab588b6db54535f57b3deb0111a71a6b
3
+ size 4957820488
model-00008-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a147a1e457752174dbee123ed8dd70d65f906c375d11d1935d8d5d2125dc692f
3
+ size 4882322880
model-00009-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0d25f9d58383af211d24cb3cf89a53ced6b42b17a525833077773411de0e77a3
3
+ size 4983011128
model-00010-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8e15047078bd0fa40e40cd59c5b8baba3d4231a49a079b37314beaf8c032ce9d
3
+ size 4957820488
model-00011-of-00011.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:d7decf1bc6ef55c1d7348be8c48ea9409c0113b3f6626d4e1500151555e0d7e0
3
+ size 1772842232
model.safetensors.index.json ADDED
@@ -0,0 +1,941 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 51028372224
4
+ },
5
+ "weight_map": {
6
+ "language_model.model.layers.0.attention.wo.weight": "model-00003-of-00011.safetensors",
7
+ "language_model.model.layers.0.attention.wqkv.weight": "model-00003-of-00011.safetensors",
8
+ "language_model.model.layers.0.attention_norm.weight": "model-00003-of-00011.safetensors",
9
+ "language_model.model.layers.0.feed_forward.w1.weight": "model-00003-of-00011.safetensors",
10
+ "language_model.model.layers.0.feed_forward.w2.weight": "model-00003-of-00011.safetensors",
11
+ "language_model.model.layers.0.feed_forward.w3.weight": "model-00003-of-00011.safetensors",
12
+ "language_model.model.layers.0.ffn_norm.weight": "model-00003-of-00011.safetensors",
13
+ "language_model.model.layers.1.attention.wo.weight": "model-00003-of-00011.safetensors",
14
+ "language_model.model.layers.1.attention.wqkv.weight": "model-00003-of-00011.safetensors",
15
+ "language_model.model.layers.1.attention_norm.weight": "model-00003-of-00011.safetensors",
16
+ "language_model.model.layers.1.feed_forward.w1.weight": "model-00003-of-00011.safetensors",
17
+ "language_model.model.layers.1.feed_forward.w2.weight": "model-00003-of-00011.safetensors",
18
+ "language_model.model.layers.1.feed_forward.w3.weight": "model-00003-of-00011.safetensors",
19
+ "language_model.model.layers.1.ffn_norm.weight": "model-00003-of-00011.safetensors",
20
+ "language_model.model.layers.10.attention.wo.weight": "model-00005-of-00011.safetensors",
21
+ "language_model.model.layers.10.attention.wqkv.weight": "model-00005-of-00011.safetensors",
22
+ "language_model.model.layers.10.attention_norm.weight": "model-00005-of-00011.safetensors",
23
+ "language_model.model.layers.10.feed_forward.w1.weight": "model-00005-of-00011.safetensors",
24
+ "language_model.model.layers.10.feed_forward.w2.weight": "model-00005-of-00011.safetensors",
25
+ "language_model.model.layers.10.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
26
+ "language_model.model.layers.10.ffn_norm.weight": "model-00005-of-00011.safetensors",
27
+ "language_model.model.layers.11.attention.wo.weight": "model-00005-of-00011.safetensors",
28
+ "language_model.model.layers.11.attention.wqkv.weight": "model-00005-of-00011.safetensors",
29
+ "language_model.model.layers.11.attention_norm.weight": "model-00005-of-00011.safetensors",
30
+ "language_model.model.layers.11.feed_forward.w1.weight": "model-00005-of-00011.safetensors",
31
+ "language_model.model.layers.11.feed_forward.w2.weight": "model-00005-of-00011.safetensors",
32
+ "language_model.model.layers.11.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
33
+ "language_model.model.layers.11.ffn_norm.weight": "model-00005-of-00011.safetensors",
34
+ "language_model.model.layers.12.attention.wo.weight": "model-00005-of-00011.safetensors",
35
+ "language_model.model.layers.12.attention.wqkv.weight": "model-00005-of-00011.safetensors",
36
+ "language_model.model.layers.12.attention_norm.weight": "model-00005-of-00011.safetensors",
37
+ "language_model.model.layers.12.feed_forward.w1.weight": "model-00005-of-00011.safetensors",
38
+ "language_model.model.layers.12.feed_forward.w2.weight": "model-00005-of-00011.safetensors",
39
+ "language_model.model.layers.12.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
40
+ "language_model.model.layers.12.ffn_norm.weight": "model-00005-of-00011.safetensors",
41
+ "language_model.model.layers.13.attention.wo.weight": "model-00005-of-00011.safetensors",
42
+ "language_model.model.layers.13.attention.wqkv.weight": "model-00005-of-00011.safetensors",
43
+ "language_model.model.layers.13.attention_norm.weight": "model-00005-of-00011.safetensors",
44
+ "language_model.model.layers.13.feed_forward.w1.weight": "model-00005-of-00011.safetensors",
45
+ "language_model.model.layers.13.feed_forward.w2.weight": "model-00005-of-00011.safetensors",
46
+ "language_model.model.layers.13.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
47
+ "language_model.model.layers.13.ffn_norm.weight": "model-00005-of-00011.safetensors",
48
+ "language_model.model.layers.14.attention.wo.weight": "model-00005-of-00011.safetensors",
49
+ "language_model.model.layers.14.attention.wqkv.weight": "model-00005-of-00011.safetensors",
50
+ "language_model.model.layers.14.attention_norm.weight": "model-00005-of-00011.safetensors",
51
+ "language_model.model.layers.14.feed_forward.w1.weight": "model-00005-of-00011.safetensors",
52
+ "language_model.model.layers.14.feed_forward.w2.weight": "model-00005-of-00011.safetensors",
53
+ "language_model.model.layers.14.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
54
+ "language_model.model.layers.14.ffn_norm.weight": "model-00005-of-00011.safetensors",
55
+ "language_model.model.layers.15.attention.wo.weight": "model-00005-of-00011.safetensors",
56
+ "language_model.model.layers.15.attention.wqkv.weight": "model-00005-of-00011.safetensors",
57
+ "language_model.model.layers.15.attention_norm.weight": "model-00006-of-00011.safetensors",
58
+ "language_model.model.layers.15.feed_forward.w1.weight": "model-00005-of-00011.safetensors",
59
+ "language_model.model.layers.15.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
60
+ "language_model.model.layers.15.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
61
+ "language_model.model.layers.15.ffn_norm.weight": "model-00006-of-00011.safetensors",
62
+ "language_model.model.layers.16.attention.wo.weight": "model-00006-of-00011.safetensors",
63
+ "language_model.model.layers.16.attention.wqkv.weight": "model-00006-of-00011.safetensors",
64
+ "language_model.model.layers.16.attention_norm.weight": "model-00006-of-00011.safetensors",
65
+ "language_model.model.layers.16.feed_forward.w1.weight": "model-00006-of-00011.safetensors",
66
+ "language_model.model.layers.16.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
67
+ "language_model.model.layers.16.feed_forward.w3.weight": "model-00006-of-00011.safetensors",
68
+ "language_model.model.layers.16.ffn_norm.weight": "model-00006-of-00011.safetensors",
69
+ "language_model.model.layers.17.attention.wo.weight": "model-00006-of-00011.safetensors",
70
+ "language_model.model.layers.17.attention.wqkv.weight": "model-00006-of-00011.safetensors",
71
+ "language_model.model.layers.17.attention_norm.weight": "model-00006-of-00011.safetensors",
72
+ "language_model.model.layers.17.feed_forward.w1.weight": "model-00006-of-00011.safetensors",
73
+ "language_model.model.layers.17.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
74
+ "language_model.model.layers.17.feed_forward.w3.weight": "model-00006-of-00011.safetensors",
75
+ "language_model.model.layers.17.ffn_norm.weight": "model-00006-of-00011.safetensors",
76
+ "language_model.model.layers.18.attention.wo.weight": "model-00006-of-00011.safetensors",
77
+ "language_model.model.layers.18.attention.wqkv.weight": "model-00006-of-00011.safetensors",
78
+ "language_model.model.layers.18.attention_norm.weight": "model-00006-of-00011.safetensors",
79
+ "language_model.model.layers.18.feed_forward.w1.weight": "model-00006-of-00011.safetensors",
80
+ "language_model.model.layers.18.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
81
+ "language_model.model.layers.18.feed_forward.w3.weight": "model-00006-of-00011.safetensors",
82
+ "language_model.model.layers.18.ffn_norm.weight": "model-00006-of-00011.safetensors",
83
+ "language_model.model.layers.19.attention.wo.weight": "model-00006-of-00011.safetensors",
84
+ "language_model.model.layers.19.attention.wqkv.weight": "model-00006-of-00011.safetensors",
85
+ "language_model.model.layers.19.attention_norm.weight": "model-00006-of-00011.safetensors",
86
+ "language_model.model.layers.19.feed_forward.w1.weight": "model-00006-of-00011.safetensors",
87
+ "language_model.model.layers.19.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
88
+ "language_model.model.layers.19.feed_forward.w3.weight": "model-00006-of-00011.safetensors",
89
+ "language_model.model.layers.19.ffn_norm.weight": "model-00006-of-00011.safetensors",
90
+ "language_model.model.layers.2.attention.wo.weight": "model-00003-of-00011.safetensors",
91
+ "language_model.model.layers.2.attention.wqkv.weight": "model-00003-of-00011.safetensors",
92
+ "language_model.model.layers.2.attention_norm.weight": "model-00003-of-00011.safetensors",
93
+ "language_model.model.layers.2.feed_forward.w1.weight": "model-00003-of-00011.safetensors",
94
+ "language_model.model.layers.2.feed_forward.w2.weight": "model-00003-of-00011.safetensors",
95
+ "language_model.model.layers.2.feed_forward.w3.weight": "model-00003-of-00011.safetensors",
96
+ "language_model.model.layers.2.ffn_norm.weight": "model-00003-of-00011.safetensors",
97
+ "language_model.model.layers.20.attention.wo.weight": "model-00006-of-00011.safetensors",
98
+ "language_model.model.layers.20.attention.wqkv.weight": "model-00006-of-00011.safetensors",
99
+ "language_model.model.layers.20.attention_norm.weight": "model-00006-of-00011.safetensors",
100
+ "language_model.model.layers.20.feed_forward.w1.weight": "model-00006-of-00011.safetensors",
101
+ "language_model.model.layers.20.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
102
+ "language_model.model.layers.20.feed_forward.w3.weight": "model-00006-of-00011.safetensors",
103
+ "language_model.model.layers.20.ffn_norm.weight": "model-00006-of-00011.safetensors",
104
+ "language_model.model.layers.21.attention.wo.weight": "model-00006-of-00011.safetensors",
105
+ "language_model.model.layers.21.attention.wqkv.weight": "model-00006-of-00011.safetensors",
106
+ "language_model.model.layers.21.attention_norm.weight": "model-00006-of-00011.safetensors",
107
+ "language_model.model.layers.21.feed_forward.w1.weight": "model-00006-of-00011.safetensors",
108
+ "language_model.model.layers.21.feed_forward.w2.weight": "model-00006-of-00011.safetensors",
109
+ "language_model.model.layers.21.feed_forward.w3.weight": "model-00006-of-00011.safetensors",
110
+ "language_model.model.layers.21.ffn_norm.weight": "model-00006-of-00011.safetensors",
111
+ "language_model.model.layers.22.attention.wo.weight": "model-00007-of-00011.safetensors",
112
+ "language_model.model.layers.22.attention.wqkv.weight": "model-00006-of-00011.safetensors",
113
+ "language_model.model.layers.22.attention_norm.weight": "model-00007-of-00011.safetensors",
114
+ "language_model.model.layers.22.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
115
+ "language_model.model.layers.22.feed_forward.w2.weight": "model-00007-of-00011.safetensors",
116
+ "language_model.model.layers.22.feed_forward.w3.weight": "model-00007-of-00011.safetensors",
117
+ "language_model.model.layers.22.ffn_norm.weight": "model-00007-of-00011.safetensors",
118
+ "language_model.model.layers.23.attention.wo.weight": "model-00007-of-00011.safetensors",
119
+ "language_model.model.layers.23.attention.wqkv.weight": "model-00007-of-00011.safetensors",
120
+ "language_model.model.layers.23.attention_norm.weight": "model-00007-of-00011.safetensors",
121
+ "language_model.model.layers.23.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
122
+ "language_model.model.layers.23.feed_forward.w2.weight": "model-00007-of-00011.safetensors",
123
+ "language_model.model.layers.23.feed_forward.w3.weight": "model-00007-of-00011.safetensors",
124
+ "language_model.model.layers.23.ffn_norm.weight": "model-00007-of-00011.safetensors",
125
+ "language_model.model.layers.24.attention.wo.weight": "model-00007-of-00011.safetensors",
126
+ "language_model.model.layers.24.attention.wqkv.weight": "model-00007-of-00011.safetensors",
127
+ "language_model.model.layers.24.attention_norm.weight": "model-00007-of-00011.safetensors",
128
+ "language_model.model.layers.24.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
129
+ "language_model.model.layers.24.feed_forward.w2.weight": "model-00007-of-00011.safetensors",
130
+ "language_model.model.layers.24.feed_forward.w3.weight": "model-00007-of-00011.safetensors",
131
+ "language_model.model.layers.24.ffn_norm.weight": "model-00007-of-00011.safetensors",
132
+ "language_model.model.layers.25.attention.wo.weight": "model-00007-of-00011.safetensors",
133
+ "language_model.model.layers.25.attention.wqkv.weight": "model-00007-of-00011.safetensors",
134
+ "language_model.model.layers.25.attention_norm.weight": "model-00007-of-00011.safetensors",
135
+ "language_model.model.layers.25.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
136
+ "language_model.model.layers.25.feed_forward.w2.weight": "model-00007-of-00011.safetensors",
137
+ "language_model.model.layers.25.feed_forward.w3.weight": "model-00007-of-00011.safetensors",
138
+ "language_model.model.layers.25.ffn_norm.weight": "model-00007-of-00011.safetensors",
139
+ "language_model.model.layers.26.attention.wo.weight": "model-00007-of-00011.safetensors",
140
+ "language_model.model.layers.26.attention.wqkv.weight": "model-00007-of-00011.safetensors",
141
+ "language_model.model.layers.26.attention_norm.weight": "model-00007-of-00011.safetensors",
142
+ "language_model.model.layers.26.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
143
+ "language_model.model.layers.26.feed_forward.w2.weight": "model-00007-of-00011.safetensors",
144
+ "language_model.model.layers.26.feed_forward.w3.weight": "model-00007-of-00011.safetensors",
145
+ "language_model.model.layers.26.ffn_norm.weight": "model-00007-of-00011.safetensors",
146
+ "language_model.model.layers.27.attention.wo.weight": "model-00007-of-00011.safetensors",
147
+ "language_model.model.layers.27.attention.wqkv.weight": "model-00007-of-00011.safetensors",
148
+ "language_model.model.layers.27.attention_norm.weight": "model-00007-of-00011.safetensors",
149
+ "language_model.model.layers.27.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
150
+ "language_model.model.layers.27.feed_forward.w2.weight": "model-00007-of-00011.safetensors",
151
+ "language_model.model.layers.27.feed_forward.w3.weight": "model-00007-of-00011.safetensors",
152
+ "language_model.model.layers.27.ffn_norm.weight": "model-00007-of-00011.safetensors",
153
+ "language_model.model.layers.28.attention.wo.weight": "model-00007-of-00011.safetensors",
154
+ "language_model.model.layers.28.attention.wqkv.weight": "model-00007-of-00011.safetensors",
155
+ "language_model.model.layers.28.attention_norm.weight": "model-00008-of-00011.safetensors",
156
+ "language_model.model.layers.28.feed_forward.w1.weight": "model-00007-of-00011.safetensors",
157
+ "language_model.model.layers.28.feed_forward.w2.weight": "model-00008-of-00011.safetensors",
158
+ "language_model.model.layers.28.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
159
+ "language_model.model.layers.28.ffn_norm.weight": "model-00008-of-00011.safetensors",
160
+ "language_model.model.layers.29.attention.wo.weight": "model-00008-of-00011.safetensors",
161
+ "language_model.model.layers.29.attention.wqkv.weight": "model-00008-of-00011.safetensors",
162
+ "language_model.model.layers.29.attention_norm.weight": "model-00008-of-00011.safetensors",
163
+ "language_model.model.layers.29.feed_forward.w1.weight": "model-00008-of-00011.safetensors",
164
+ "language_model.model.layers.29.feed_forward.w2.weight": "model-00008-of-00011.safetensors",
165
+ "language_model.model.layers.29.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
166
+ "language_model.model.layers.29.ffn_norm.weight": "model-00008-of-00011.safetensors",
167
+ "language_model.model.layers.3.attention.wo.weight": "model-00003-of-00011.safetensors",
168
+ "language_model.model.layers.3.attention.wqkv.weight": "model-00003-of-00011.safetensors",
169
+ "language_model.model.layers.3.attention_norm.weight": "model-00004-of-00011.safetensors",
170
+ "language_model.model.layers.3.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
171
+ "language_model.model.layers.3.feed_forward.w2.weight": "model-00004-of-00011.safetensors",
172
+ "language_model.model.layers.3.feed_forward.w3.weight": "model-00004-of-00011.safetensors",
173
+ "language_model.model.layers.3.ffn_norm.weight": "model-00004-of-00011.safetensors",
174
+ "language_model.model.layers.30.attention.wo.weight": "model-00008-of-00011.safetensors",
175
+ "language_model.model.layers.30.attention.wqkv.weight": "model-00008-of-00011.safetensors",
176
+ "language_model.model.layers.30.attention_norm.weight": "model-00008-of-00011.safetensors",
177
+ "language_model.model.layers.30.feed_forward.w1.weight": "model-00008-of-00011.safetensors",
178
+ "language_model.model.layers.30.feed_forward.w2.weight": "model-00008-of-00011.safetensors",
179
+ "language_model.model.layers.30.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
180
+ "language_model.model.layers.30.ffn_norm.weight": "model-00008-of-00011.safetensors",
181
+ "language_model.model.layers.31.attention.wo.weight": "model-00008-of-00011.safetensors",
182
+ "language_model.model.layers.31.attention.wqkv.weight": "model-00008-of-00011.safetensors",
183
+ "language_model.model.layers.31.attention_norm.weight": "model-00008-of-00011.safetensors",
184
+ "language_model.model.layers.31.feed_forward.w1.weight": "model-00008-of-00011.safetensors",
185
+ "language_model.model.layers.31.feed_forward.w2.weight": "model-00008-of-00011.safetensors",
186
+ "language_model.model.layers.31.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
187
+ "language_model.model.layers.31.ffn_norm.weight": "model-00008-of-00011.safetensors",
188
+ "language_model.model.layers.32.attention.wo.weight": "model-00008-of-00011.safetensors",
189
+ "language_model.model.layers.32.attention.wqkv.weight": "model-00008-of-00011.safetensors",
190
+ "language_model.model.layers.32.attention_norm.weight": "model-00008-of-00011.safetensors",
191
+ "language_model.model.layers.32.feed_forward.w1.weight": "model-00008-of-00011.safetensors",
192
+ "language_model.model.layers.32.feed_forward.w2.weight": "model-00008-of-00011.safetensors",
193
+ "language_model.model.layers.32.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
194
+ "language_model.model.layers.32.ffn_norm.weight": "model-00008-of-00011.safetensors",
195
+ "language_model.model.layers.33.attention.wo.weight": "model-00008-of-00011.safetensors",
196
+ "language_model.model.layers.33.attention.wqkv.weight": "model-00008-of-00011.safetensors",
197
+ "language_model.model.layers.33.attention_norm.weight": "model-00008-of-00011.safetensors",
198
+ "language_model.model.layers.33.feed_forward.w1.weight": "model-00008-of-00011.safetensors",
199
+ "language_model.model.layers.33.feed_forward.w2.weight": "model-00008-of-00011.safetensors",
200
+ "language_model.model.layers.33.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
201
+ "language_model.model.layers.33.ffn_norm.weight": "model-00008-of-00011.safetensors",
202
+ "language_model.model.layers.34.attention.wo.weight": "model-00008-of-00011.safetensors",
203
+ "language_model.model.layers.34.attention.wqkv.weight": "model-00008-of-00011.safetensors",
204
+ "language_model.model.layers.34.attention_norm.weight": "model-00009-of-00011.safetensors",
205
+ "language_model.model.layers.34.feed_forward.w1.weight": "model-00008-of-00011.safetensors",
206
+ "language_model.model.layers.34.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
207
+ "language_model.model.layers.34.feed_forward.w3.weight": "model-00008-of-00011.safetensors",
208
+ "language_model.model.layers.34.ffn_norm.weight": "model-00009-of-00011.safetensors",
209
+ "language_model.model.layers.35.attention.wo.weight": "model-00009-of-00011.safetensors",
210
+ "language_model.model.layers.35.attention.wqkv.weight": "model-00009-of-00011.safetensors",
211
+ "language_model.model.layers.35.attention_norm.weight": "model-00009-of-00011.safetensors",
212
+ "language_model.model.layers.35.feed_forward.w1.weight": "model-00009-of-00011.safetensors",
213
+ "language_model.model.layers.35.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
214
+ "language_model.model.layers.35.feed_forward.w3.weight": "model-00009-of-00011.safetensors",
215
+ "language_model.model.layers.35.ffn_norm.weight": "model-00009-of-00011.safetensors",
216
+ "language_model.model.layers.36.attention.wo.weight": "model-00009-of-00011.safetensors",
217
+ "language_model.model.layers.36.attention.wqkv.weight": "model-00009-of-00011.safetensors",
218
+ "language_model.model.layers.36.attention_norm.weight": "model-00009-of-00011.safetensors",
219
+ "language_model.model.layers.36.feed_forward.w1.weight": "model-00009-of-00011.safetensors",
220
+ "language_model.model.layers.36.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
221
+ "language_model.model.layers.36.feed_forward.w3.weight": "model-00009-of-00011.safetensors",
222
+ "language_model.model.layers.36.ffn_norm.weight": "model-00009-of-00011.safetensors",
223
+ "language_model.model.layers.37.attention.wo.weight": "model-00009-of-00011.safetensors",
224
+ "language_model.model.layers.37.attention.wqkv.weight": "model-00009-of-00011.safetensors",
225
+ "language_model.model.layers.37.attention_norm.weight": "model-00009-of-00011.safetensors",
226
+ "language_model.model.layers.37.feed_forward.w1.weight": "model-00009-of-00011.safetensors",
227
+ "language_model.model.layers.37.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
228
+ "language_model.model.layers.37.feed_forward.w3.weight": "model-00009-of-00011.safetensors",
229
+ "language_model.model.layers.37.ffn_norm.weight": "model-00009-of-00011.safetensors",
230
+ "language_model.model.layers.38.attention.wo.weight": "model-00009-of-00011.safetensors",
231
+ "language_model.model.layers.38.attention.wqkv.weight": "model-00009-of-00011.safetensors",
232
+ "language_model.model.layers.38.attention_norm.weight": "model-00009-of-00011.safetensors",
233
+ "language_model.model.layers.38.feed_forward.w1.weight": "model-00009-of-00011.safetensors",
234
+ "language_model.model.layers.38.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
235
+ "language_model.model.layers.38.feed_forward.w3.weight": "model-00009-of-00011.safetensors",
236
+ "language_model.model.layers.38.ffn_norm.weight": "model-00009-of-00011.safetensors",
237
+ "language_model.model.layers.39.attention.wo.weight": "model-00009-of-00011.safetensors",
238
+ "language_model.model.layers.39.attention.wqkv.weight": "model-00009-of-00011.safetensors",
239
+ "language_model.model.layers.39.attention_norm.weight": "model-00009-of-00011.safetensors",
240
+ "language_model.model.layers.39.feed_forward.w1.weight": "model-00009-of-00011.safetensors",
241
+ "language_model.model.layers.39.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
242
+ "language_model.model.layers.39.feed_forward.w3.weight": "model-00009-of-00011.safetensors",
243
+ "language_model.model.layers.39.ffn_norm.weight": "model-00009-of-00011.safetensors",
244
+ "language_model.model.layers.4.attention.wo.weight": "model-00004-of-00011.safetensors",
245
+ "language_model.model.layers.4.attention.wqkv.weight": "model-00004-of-00011.safetensors",
246
+ "language_model.model.layers.4.attention_norm.weight": "model-00004-of-00011.safetensors",
247
+ "language_model.model.layers.4.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
248
+ "language_model.model.layers.4.feed_forward.w2.weight": "model-00004-of-00011.safetensors",
249
+ "language_model.model.layers.4.feed_forward.w3.weight": "model-00004-of-00011.safetensors",
250
+ "language_model.model.layers.4.ffn_norm.weight": "model-00004-of-00011.safetensors",
251
+ "language_model.model.layers.40.attention.wo.weight": "model-00009-of-00011.safetensors",
252
+ "language_model.model.layers.40.attention.wqkv.weight": "model-00009-of-00011.safetensors",
253
+ "language_model.model.layers.40.attention_norm.weight": "model-00009-of-00011.safetensors",
254
+ "language_model.model.layers.40.feed_forward.w1.weight": "model-00009-of-00011.safetensors",
255
+ "language_model.model.layers.40.feed_forward.w2.weight": "model-00009-of-00011.safetensors",
256
+ "language_model.model.layers.40.feed_forward.w3.weight": "model-00009-of-00011.safetensors",
257
+ "language_model.model.layers.40.ffn_norm.weight": "model-00009-of-00011.safetensors",
258
+ "language_model.model.layers.41.attention.wo.weight": "model-00010-of-00011.safetensors",
259
+ "language_model.model.layers.41.attention.wqkv.weight": "model-00009-of-00011.safetensors",
260
+ "language_model.model.layers.41.attention_norm.weight": "model-00010-of-00011.safetensors",
261
+ "language_model.model.layers.41.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
262
+ "language_model.model.layers.41.feed_forward.w2.weight": "model-00010-of-00011.safetensors",
263
+ "language_model.model.layers.41.feed_forward.w3.weight": "model-00010-of-00011.safetensors",
264
+ "language_model.model.layers.41.ffn_norm.weight": "model-00010-of-00011.safetensors",
265
+ "language_model.model.layers.42.attention.wo.weight": "model-00010-of-00011.safetensors",
266
+ "language_model.model.layers.42.attention.wqkv.weight": "model-00010-of-00011.safetensors",
267
+ "language_model.model.layers.42.attention_norm.weight": "model-00010-of-00011.safetensors",
268
+ "language_model.model.layers.42.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
269
+ "language_model.model.layers.42.feed_forward.w2.weight": "model-00010-of-00011.safetensors",
270
+ "language_model.model.layers.42.feed_forward.w3.weight": "model-00010-of-00011.safetensors",
271
+ "language_model.model.layers.42.ffn_norm.weight": "model-00010-of-00011.safetensors",
272
+ "language_model.model.layers.43.attention.wo.weight": "model-00010-of-00011.safetensors",
273
+ "language_model.model.layers.43.attention.wqkv.weight": "model-00010-of-00011.safetensors",
274
+ "language_model.model.layers.43.attention_norm.weight": "model-00010-of-00011.safetensors",
275
+ "language_model.model.layers.43.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
276
+ "language_model.model.layers.43.feed_forward.w2.weight": "model-00010-of-00011.safetensors",
277
+ "language_model.model.layers.43.feed_forward.w3.weight": "model-00010-of-00011.safetensors",
278
+ "language_model.model.layers.43.ffn_norm.weight": "model-00010-of-00011.safetensors",
279
+ "language_model.model.layers.44.attention.wo.weight": "model-00010-of-00011.safetensors",
280
+ "language_model.model.layers.44.attention.wqkv.weight": "model-00010-of-00011.safetensors",
281
+ "language_model.model.layers.44.attention_norm.weight": "model-00010-of-00011.safetensors",
282
+ "language_model.model.layers.44.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
283
+ "language_model.model.layers.44.feed_forward.w2.weight": "model-00010-of-00011.safetensors",
284
+ "language_model.model.layers.44.feed_forward.w3.weight": "model-00010-of-00011.safetensors",
285
+ "language_model.model.layers.44.ffn_norm.weight": "model-00010-of-00011.safetensors",
286
+ "language_model.model.layers.45.attention.wo.weight": "model-00010-of-00011.safetensors",
287
+ "language_model.model.layers.45.attention.wqkv.weight": "model-00010-of-00011.safetensors",
288
+ "language_model.model.layers.45.attention_norm.weight": "model-00010-of-00011.safetensors",
289
+ "language_model.model.layers.45.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
290
+ "language_model.model.layers.45.feed_forward.w2.weight": "model-00010-of-00011.safetensors",
291
+ "language_model.model.layers.45.feed_forward.w3.weight": "model-00010-of-00011.safetensors",
292
+ "language_model.model.layers.45.ffn_norm.weight": "model-00010-of-00011.safetensors",
293
+ "language_model.model.layers.46.attention.wo.weight": "model-00010-of-00011.safetensors",
294
+ "language_model.model.layers.46.attention.wqkv.weight": "model-00010-of-00011.safetensors",
295
+ "language_model.model.layers.46.attention_norm.weight": "model-00010-of-00011.safetensors",
296
+ "language_model.model.layers.46.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
297
+ "language_model.model.layers.46.feed_forward.w2.weight": "model-00010-of-00011.safetensors",
298
+ "language_model.model.layers.46.feed_forward.w3.weight": "model-00010-of-00011.safetensors",
299
+ "language_model.model.layers.46.ffn_norm.weight": "model-00010-of-00011.safetensors",
300
+ "language_model.model.layers.47.attention.wo.weight": "model-00010-of-00011.safetensors",
301
+ "language_model.model.layers.47.attention.wqkv.weight": "model-00010-of-00011.safetensors",
302
+ "language_model.model.layers.47.attention_norm.weight": "model-00011-of-00011.safetensors",
303
+ "language_model.model.layers.47.feed_forward.w1.weight": "model-00010-of-00011.safetensors",
304
+ "language_model.model.layers.47.feed_forward.w2.weight": "model-00011-of-00011.safetensors",
305
+ "language_model.model.layers.47.feed_forward.w3.weight": "model-00011-of-00011.safetensors",
306
+ "language_model.model.layers.47.ffn_norm.weight": "model-00011-of-00011.safetensors",
307
+ "language_model.model.layers.5.attention.wo.weight": "model-00004-of-00011.safetensors",
308
+ "language_model.model.layers.5.attention.wqkv.weight": "model-00004-of-00011.safetensors",
309
+ "language_model.model.layers.5.attention_norm.weight": "model-00004-of-00011.safetensors",
310
+ "language_model.model.layers.5.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
311
+ "language_model.model.layers.5.feed_forward.w2.weight": "model-00004-of-00011.safetensors",
312
+ "language_model.model.layers.5.feed_forward.w3.weight": "model-00004-of-00011.safetensors",
313
+ "language_model.model.layers.5.ffn_norm.weight": "model-00004-of-00011.safetensors",
314
+ "language_model.model.layers.6.attention.wo.weight": "model-00004-of-00011.safetensors",
315
+ "language_model.model.layers.6.attention.wqkv.weight": "model-00004-of-00011.safetensors",
316
+ "language_model.model.layers.6.attention_norm.weight": "model-00004-of-00011.safetensors",
317
+ "language_model.model.layers.6.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
318
+ "language_model.model.layers.6.feed_forward.w2.weight": "model-00004-of-00011.safetensors",
319
+ "language_model.model.layers.6.feed_forward.w3.weight": "model-00004-of-00011.safetensors",
320
+ "language_model.model.layers.6.ffn_norm.weight": "model-00004-of-00011.safetensors",
321
+ "language_model.model.layers.7.attention.wo.weight": "model-00004-of-00011.safetensors",
322
+ "language_model.model.layers.7.attention.wqkv.weight": "model-00004-of-00011.safetensors",
323
+ "language_model.model.layers.7.attention_norm.weight": "model-00004-of-00011.safetensors",
324
+ "language_model.model.layers.7.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
325
+ "language_model.model.layers.7.feed_forward.w2.weight": "model-00004-of-00011.safetensors",
326
+ "language_model.model.layers.7.feed_forward.w3.weight": "model-00004-of-00011.safetensors",
327
+ "language_model.model.layers.7.ffn_norm.weight": "model-00004-of-00011.safetensors",
328
+ "language_model.model.layers.8.attention.wo.weight": "model-00004-of-00011.safetensors",
329
+ "language_model.model.layers.8.attention.wqkv.weight": "model-00004-of-00011.safetensors",
330
+ "language_model.model.layers.8.attention_norm.weight": "model-00004-of-00011.safetensors",
331
+ "language_model.model.layers.8.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
332
+ "language_model.model.layers.8.feed_forward.w2.weight": "model-00004-of-00011.safetensors",
333
+ "language_model.model.layers.8.feed_forward.w3.weight": "model-00004-of-00011.safetensors",
334
+ "language_model.model.layers.8.ffn_norm.weight": "model-00004-of-00011.safetensors",
335
+ "language_model.model.layers.9.attention.wo.weight": "model-00004-of-00011.safetensors",
336
+ "language_model.model.layers.9.attention.wqkv.weight": "model-00004-of-00011.safetensors",
337
+ "language_model.model.layers.9.attention_norm.weight": "model-00005-of-00011.safetensors",
338
+ "language_model.model.layers.9.feed_forward.w1.weight": "model-00004-of-00011.safetensors",
339
+ "language_model.model.layers.9.feed_forward.w2.weight": "model-00005-of-00011.safetensors",
340
+ "language_model.model.layers.9.feed_forward.w3.weight": "model-00005-of-00011.safetensors",
341
+ "language_model.model.layers.9.ffn_norm.weight": "model-00005-of-00011.safetensors",
342
+ "language_model.model.norm.weight": "model-00011-of-00011.safetensors",
343
+ "language_model.model.tok_embeddings.weight": "model-00003-of-00011.safetensors",
344
+ "language_model.output.weight": "model-00011-of-00011.safetensors",
345
+ "mlp1.0.bias": "model-00011-of-00011.safetensors",
346
+ "mlp1.0.weight": "model-00011-of-00011.safetensors",
347
+ "mlp1.1.bias": "model-00011-of-00011.safetensors",
348
+ "mlp1.1.weight": "model-00011-of-00011.safetensors",
349
+ "mlp1.3.bias": "model-00011-of-00011.safetensors",
350
+ "mlp1.3.weight": "model-00011-of-00011.safetensors",
351
+ "vision_model.embeddings.class_embedding": "model-00001-of-00011.safetensors",
352
+ "vision_model.embeddings.patch_embedding.bias": "model-00001-of-00011.safetensors",
353
+ "vision_model.embeddings.patch_embedding.weight": "model-00001-of-00011.safetensors",
354
+ "vision_model.embeddings.position_embedding": "model-00001-of-00011.safetensors",
355
+ "vision_model.encoder.layers.0.attn.k_norm.weight": "model-00001-of-00011.safetensors",
356
+ "vision_model.encoder.layers.0.attn.proj.bias": "model-00001-of-00011.safetensors",
357
+ "vision_model.encoder.layers.0.attn.proj.weight": "model-00001-of-00011.safetensors",
358
+ "vision_model.encoder.layers.0.attn.q_norm.weight": "model-00001-of-00011.safetensors",
359
+ "vision_model.encoder.layers.0.attn.qkv.weight": "model-00001-of-00011.safetensors",
360
+ "vision_model.encoder.layers.0.ls1": "model-00001-of-00011.safetensors",
361
+ "vision_model.encoder.layers.0.ls2": "model-00001-of-00011.safetensors",
362
+ "vision_model.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00011.safetensors",
363
+ "vision_model.encoder.layers.0.mlp.fc1.weight": "model-00001-of-00011.safetensors",
364
+ "vision_model.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00011.safetensors",
365
+ "vision_model.encoder.layers.0.mlp.fc2.weight": "model-00001-of-00011.safetensors",
366
+ "vision_model.encoder.layers.0.norm1.weight": "model-00001-of-00011.safetensors",
367
+ "vision_model.encoder.layers.0.norm2.weight": "model-00001-of-00011.safetensors",
368
+ "vision_model.encoder.layers.1.attn.k_norm.weight": "model-00001-of-00011.safetensors",
369
+ "vision_model.encoder.layers.1.attn.proj.bias": "model-00001-of-00011.safetensors",
370
+ "vision_model.encoder.layers.1.attn.proj.weight": "model-00001-of-00011.safetensors",
371
+ "vision_model.encoder.layers.1.attn.q_norm.weight": "model-00001-of-00011.safetensors",
372
+ "vision_model.encoder.layers.1.attn.qkv.weight": "model-00001-of-00011.safetensors",
373
+ "vision_model.encoder.layers.1.ls1": "model-00001-of-00011.safetensors",
374
+ "vision_model.encoder.layers.1.ls2": "model-00001-of-00011.safetensors",
375
+ "vision_model.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00011.safetensors",
376
+ "vision_model.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00011.safetensors",
377
+ "vision_model.encoder.layers.1.mlp.fc2.bias": "model-00001-of-00011.safetensors",
378
+ "vision_model.encoder.layers.1.mlp.fc2.weight": "model-00001-of-00011.safetensors",
379
+ "vision_model.encoder.layers.1.norm1.weight": "model-00001-of-00011.safetensors",
380
+ "vision_model.encoder.layers.1.norm2.weight": "model-00001-of-00011.safetensors",
381
+ "vision_model.encoder.layers.10.attn.k_norm.weight": "model-00001-of-00011.safetensors",
382
+ "vision_model.encoder.layers.10.attn.proj.bias": "model-00001-of-00011.safetensors",
383
+ "vision_model.encoder.layers.10.attn.proj.weight": "model-00001-of-00011.safetensors",
384
+ "vision_model.encoder.layers.10.attn.q_norm.weight": "model-00001-of-00011.safetensors",
385
+ "vision_model.encoder.layers.10.attn.qkv.weight": "model-00001-of-00011.safetensors",
386
+ "vision_model.encoder.layers.10.ls1": "model-00001-of-00011.safetensors",
387
+ "vision_model.encoder.layers.10.ls2": "model-00001-of-00011.safetensors",
388
+ "vision_model.encoder.layers.10.mlp.fc1.bias": "model-00001-of-00011.safetensors",
389
+ "vision_model.encoder.layers.10.mlp.fc1.weight": "model-00001-of-00011.safetensors",
390
+ "vision_model.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00011.safetensors",
391
+ "vision_model.encoder.layers.10.mlp.fc2.weight": "model-00001-of-00011.safetensors",
392
+ "vision_model.encoder.layers.10.norm1.weight": "model-00001-of-00011.safetensors",
393
+ "vision_model.encoder.layers.10.norm2.weight": "model-00001-of-00011.safetensors",
394
+ "vision_model.encoder.layers.11.attn.k_norm.weight": "model-00001-of-00011.safetensors",
395
+ "vision_model.encoder.layers.11.attn.proj.bias": "model-00001-of-00011.safetensors",
396
+ "vision_model.encoder.layers.11.attn.proj.weight": "model-00001-of-00011.safetensors",
397
+ "vision_model.encoder.layers.11.attn.q_norm.weight": "model-00001-of-00011.safetensors",
398
+ "vision_model.encoder.layers.11.attn.qkv.weight": "model-00001-of-00011.safetensors",
399
+ "vision_model.encoder.layers.11.ls1": "model-00001-of-00011.safetensors",
400
+ "vision_model.encoder.layers.11.ls2": "model-00001-of-00011.safetensors",
401
+ "vision_model.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00011.safetensors",
402
+ "vision_model.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00011.safetensors",
403
+ "vision_model.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00011.safetensors",
404
+ "vision_model.encoder.layers.11.mlp.fc2.weight": "model-00001-of-00011.safetensors",
405
+ "vision_model.encoder.layers.11.norm1.weight": "model-00001-of-00011.safetensors",
406
+ "vision_model.encoder.layers.11.norm2.weight": "model-00001-of-00011.safetensors",
407
+ "vision_model.encoder.layers.12.attn.k_norm.weight": "model-00001-of-00011.safetensors",
408
+ "vision_model.encoder.layers.12.attn.proj.bias": "model-00001-of-00011.safetensors",
409
+ "vision_model.encoder.layers.12.attn.proj.weight": "model-00001-of-00011.safetensors",
410
+ "vision_model.encoder.layers.12.attn.q_norm.weight": "model-00001-of-00011.safetensors",
411
+ "vision_model.encoder.layers.12.attn.qkv.weight": "model-00001-of-00011.safetensors",
412
+ "vision_model.encoder.layers.12.ls1": "model-00001-of-00011.safetensors",
413
+ "vision_model.encoder.layers.12.ls2": "model-00001-of-00011.safetensors",
414
+ "vision_model.encoder.layers.12.mlp.fc1.bias": "model-00001-of-00011.safetensors",
415
+ "vision_model.encoder.layers.12.mlp.fc1.weight": "model-00001-of-00011.safetensors",
416
+ "vision_model.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00011.safetensors",
417
+ "vision_model.encoder.layers.12.mlp.fc2.weight": "model-00001-of-00011.safetensors",
418
+ "vision_model.encoder.layers.12.norm1.weight": "model-00001-of-00011.safetensors",
419
+ "vision_model.encoder.layers.12.norm2.weight": "model-00001-of-00011.safetensors",
420
+ "vision_model.encoder.layers.13.attn.k_norm.weight": "model-00001-of-00011.safetensors",
421
+ "vision_model.encoder.layers.13.attn.proj.bias": "model-00001-of-00011.safetensors",
422
+ "vision_model.encoder.layers.13.attn.proj.weight": "model-00001-of-00011.safetensors",
423
+ "vision_model.encoder.layers.13.attn.q_norm.weight": "model-00001-of-00011.safetensors",
424
+ "vision_model.encoder.layers.13.attn.qkv.weight": "model-00001-of-00011.safetensors",
425
+ "vision_model.encoder.layers.13.ls1": "model-00001-of-00011.safetensors",
426
+ "vision_model.encoder.layers.13.ls2": "model-00001-of-00011.safetensors",
427
+ "vision_model.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00011.safetensors",
428
+ "vision_model.encoder.layers.13.mlp.fc1.weight": "model-00001-of-00011.safetensors",
429
+ "vision_model.encoder.layers.13.mlp.fc2.bias": "model-00001-of-00011.safetensors",
430
+ "vision_model.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00011.safetensors",
431
+ "vision_model.encoder.layers.13.norm1.weight": "model-00001-of-00011.safetensors",
432
+ "vision_model.encoder.layers.13.norm2.weight": "model-00001-of-00011.safetensors",
433
+ "vision_model.encoder.layers.14.attn.k_norm.weight": "model-00001-of-00011.safetensors",
434
+ "vision_model.encoder.layers.14.attn.proj.bias": "model-00001-of-00011.safetensors",
435
+ "vision_model.encoder.layers.14.attn.proj.weight": "model-00001-of-00011.safetensors",
436
+ "vision_model.encoder.layers.14.attn.q_norm.weight": "model-00001-of-00011.safetensors",
437
+ "vision_model.encoder.layers.14.attn.qkv.weight": "model-00001-of-00011.safetensors",
438
+ "vision_model.encoder.layers.14.ls1": "model-00001-of-00011.safetensors",
439
+ "vision_model.encoder.layers.14.ls2": "model-00001-of-00011.safetensors",
440
+ "vision_model.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00011.safetensors",
441
+ "vision_model.encoder.layers.14.mlp.fc1.weight": "model-00001-of-00011.safetensors",
442
+ "vision_model.encoder.layers.14.mlp.fc2.bias": "model-00001-of-00011.safetensors",
443
+ "vision_model.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00011.safetensors",
444
+ "vision_model.encoder.layers.14.norm1.weight": "model-00001-of-00011.safetensors",
445
+ "vision_model.encoder.layers.14.norm2.weight": "model-00001-of-00011.safetensors",
446
+ "vision_model.encoder.layers.15.attn.k_norm.weight": "model-00001-of-00011.safetensors",
447
+ "vision_model.encoder.layers.15.attn.proj.bias": "model-00001-of-00011.safetensors",
448
+ "vision_model.encoder.layers.15.attn.proj.weight": "model-00001-of-00011.safetensors",
449
+ "vision_model.encoder.layers.15.attn.q_norm.weight": "model-00001-of-00011.safetensors",
450
+ "vision_model.encoder.layers.15.attn.qkv.weight": "model-00001-of-00011.safetensors",
451
+ "vision_model.encoder.layers.15.ls1": "model-00001-of-00011.safetensors",
452
+ "vision_model.encoder.layers.15.ls2": "model-00001-of-00011.safetensors",
453
+ "vision_model.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00011.safetensors",
454
+ "vision_model.encoder.layers.15.mlp.fc1.weight": "model-00001-of-00011.safetensors",
455
+ "vision_model.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00011.safetensors",
456
+ "vision_model.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00011.safetensors",
457
+ "vision_model.encoder.layers.15.norm1.weight": "model-00001-of-00011.safetensors",
458
+ "vision_model.encoder.layers.15.norm2.weight": "model-00001-of-00011.safetensors",
459
+ "vision_model.encoder.layers.16.attn.k_norm.weight": "model-00001-of-00011.safetensors",
460
+ "vision_model.encoder.layers.16.attn.proj.bias": "model-00001-of-00011.safetensors",
461
+ "vision_model.encoder.layers.16.attn.proj.weight": "model-00001-of-00011.safetensors",
462
+ "vision_model.encoder.layers.16.attn.q_norm.weight": "model-00001-of-00011.safetensors",
463
+ "vision_model.encoder.layers.16.attn.qkv.weight": "model-00001-of-00011.safetensors",
464
+ "vision_model.encoder.layers.16.ls1": "model-00001-of-00011.safetensors",
465
+ "vision_model.encoder.layers.16.ls2": "model-00001-of-00011.safetensors",
466
+ "vision_model.encoder.layers.16.mlp.fc1.bias": "model-00001-of-00011.safetensors",
467
+ "vision_model.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00011.safetensors",
468
+ "vision_model.encoder.layers.16.mlp.fc2.bias": "model-00001-of-00011.safetensors",
469
+ "vision_model.encoder.layers.16.mlp.fc2.weight": "model-00001-of-00011.safetensors",
470
+ "vision_model.encoder.layers.16.norm1.weight": "model-00001-of-00011.safetensors",
471
+ "vision_model.encoder.layers.16.norm2.weight": "model-00001-of-00011.safetensors",
472
+ "vision_model.encoder.layers.17.attn.k_norm.weight": "model-00001-of-00011.safetensors",
473
+ "vision_model.encoder.layers.17.attn.proj.bias": "model-00001-of-00011.safetensors",
474
+ "vision_model.encoder.layers.17.attn.proj.weight": "model-00001-of-00011.safetensors",
475
+ "vision_model.encoder.layers.17.attn.q_norm.weight": "model-00001-of-00011.safetensors",
476
+ "vision_model.encoder.layers.17.attn.qkv.weight": "model-00001-of-00011.safetensors",
477
+ "vision_model.encoder.layers.17.ls1": "model-00001-of-00011.safetensors",
478
+ "vision_model.encoder.layers.17.ls2": "model-00001-of-00011.safetensors",
479
+ "vision_model.encoder.layers.17.mlp.fc1.bias": "model-00001-of-00011.safetensors",
480
+ "vision_model.encoder.layers.17.mlp.fc1.weight": "model-00001-of-00011.safetensors",
481
+ "vision_model.encoder.layers.17.mlp.fc2.bias": "model-00001-of-00011.safetensors",
482
+ "vision_model.encoder.layers.17.mlp.fc2.weight": "model-00001-of-00011.safetensors",
483
+ "vision_model.encoder.layers.17.norm1.weight": "model-00001-of-00011.safetensors",
484
+ "vision_model.encoder.layers.17.norm2.weight": "model-00001-of-00011.safetensors",
485
+ "vision_model.encoder.layers.18.attn.k_norm.weight": "model-00001-of-00011.safetensors",
486
+ "vision_model.encoder.layers.18.attn.proj.bias": "model-00001-of-00011.safetensors",
487
+ "vision_model.encoder.layers.18.attn.proj.weight": "model-00001-of-00011.safetensors",
488
+ "vision_model.encoder.layers.18.attn.q_norm.weight": "model-00001-of-00011.safetensors",
489
+ "vision_model.encoder.layers.18.attn.qkv.weight": "model-00001-of-00011.safetensors",
490
+ "vision_model.encoder.layers.18.ls1": "model-00001-of-00011.safetensors",
491
+ "vision_model.encoder.layers.18.ls2": "model-00001-of-00011.safetensors",
492
+ "vision_model.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00011.safetensors",
493
+ "vision_model.encoder.layers.18.mlp.fc1.weight": "model-00001-of-00011.safetensors",
494
+ "vision_model.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00011.safetensors",
495
+ "vision_model.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00011.safetensors",
496
+ "vision_model.encoder.layers.18.norm1.weight": "model-00001-of-00011.safetensors",
497
+ "vision_model.encoder.layers.18.norm2.weight": "model-00001-of-00011.safetensors",
498
+ "vision_model.encoder.layers.19.attn.k_norm.weight": "model-00001-of-00011.safetensors",
499
+ "vision_model.encoder.layers.19.attn.proj.bias": "model-00001-of-00011.safetensors",
500
+ "vision_model.encoder.layers.19.attn.proj.weight": "model-00001-of-00011.safetensors",
501
+ "vision_model.encoder.layers.19.attn.q_norm.weight": "model-00001-of-00011.safetensors",
502
+ "vision_model.encoder.layers.19.attn.qkv.weight": "model-00001-of-00011.safetensors",
503
+ "vision_model.encoder.layers.19.ls1": "model-00001-of-00011.safetensors",
504
+ "vision_model.encoder.layers.19.ls2": "model-00001-of-00011.safetensors",
505
+ "vision_model.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00011.safetensors",
506
+ "vision_model.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00011.safetensors",
507
+ "vision_model.encoder.layers.19.mlp.fc2.bias": "model-00001-of-00011.safetensors",
508
+ "vision_model.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00011.safetensors",
509
+ "vision_model.encoder.layers.19.norm1.weight": "model-00001-of-00011.safetensors",
510
+ "vision_model.encoder.layers.19.norm2.weight": "model-00001-of-00011.safetensors",
511
+ "vision_model.encoder.layers.2.attn.k_norm.weight": "model-00001-of-00011.safetensors",
512
+ "vision_model.encoder.layers.2.attn.proj.bias": "model-00001-of-00011.safetensors",
513
+ "vision_model.encoder.layers.2.attn.proj.weight": "model-00001-of-00011.safetensors",
514
+ "vision_model.encoder.layers.2.attn.q_norm.weight": "model-00001-of-00011.safetensors",
515
+ "vision_model.encoder.layers.2.attn.qkv.weight": "model-00001-of-00011.safetensors",
516
+ "vision_model.encoder.layers.2.ls1": "model-00001-of-00011.safetensors",
517
+ "vision_model.encoder.layers.2.ls2": "model-00001-of-00011.safetensors",
518
+ "vision_model.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00011.safetensors",
519
+ "vision_model.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00011.safetensors",
520
+ "vision_model.encoder.layers.2.mlp.fc2.bias": "model-00001-of-00011.safetensors",
521
+ "vision_model.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00011.safetensors",
522
+ "vision_model.encoder.layers.2.norm1.weight": "model-00001-of-00011.safetensors",
523
+ "vision_model.encoder.layers.2.norm2.weight": "model-00001-of-00011.safetensors",
524
+ "vision_model.encoder.layers.20.attn.k_norm.weight": "model-00001-of-00011.safetensors",
525
+ "vision_model.encoder.layers.20.attn.proj.bias": "model-00002-of-00011.safetensors",
526
+ "vision_model.encoder.layers.20.attn.proj.weight": "model-00002-of-00011.safetensors",
527
+ "vision_model.encoder.layers.20.attn.q_norm.weight": "model-00001-of-00011.safetensors",
528
+ "vision_model.encoder.layers.20.attn.qkv.weight": "model-00001-of-00011.safetensors",
529
+ "vision_model.encoder.layers.20.ls1": "model-00001-of-00011.safetensors",
530
+ "vision_model.encoder.layers.20.ls2": "model-00001-of-00011.safetensors",
531
+ "vision_model.encoder.layers.20.mlp.fc1.bias": "model-00002-of-00011.safetensors",
532
+ "vision_model.encoder.layers.20.mlp.fc1.weight": "model-00002-of-00011.safetensors",
533
+ "vision_model.encoder.layers.20.mlp.fc2.bias": "model-00002-of-00011.safetensors",
534
+ "vision_model.encoder.layers.20.mlp.fc2.weight": "model-00002-of-00011.safetensors",
535
+ "vision_model.encoder.layers.20.norm1.weight": "model-00002-of-00011.safetensors",
536
+ "vision_model.encoder.layers.20.norm2.weight": "model-00002-of-00011.safetensors",
537
+ "vision_model.encoder.layers.21.attn.k_norm.weight": "model-00002-of-00011.safetensors",
538
+ "vision_model.encoder.layers.21.attn.proj.bias": "model-00002-of-00011.safetensors",
539
+ "vision_model.encoder.layers.21.attn.proj.weight": "model-00002-of-00011.safetensors",
540
+ "vision_model.encoder.layers.21.attn.q_norm.weight": "model-00002-of-00011.safetensors",
541
+ "vision_model.encoder.layers.21.attn.qkv.weight": "model-00002-of-00011.safetensors",
542
+ "vision_model.encoder.layers.21.ls1": "model-00002-of-00011.safetensors",
543
+ "vision_model.encoder.layers.21.ls2": "model-00002-of-00011.safetensors",
544
+ "vision_model.encoder.layers.21.mlp.fc1.bias": "model-00002-of-00011.safetensors",
545
+ "vision_model.encoder.layers.21.mlp.fc1.weight": "model-00002-of-00011.safetensors",
546
+ "vision_model.encoder.layers.21.mlp.fc2.bias": "model-00002-of-00011.safetensors",
547
+ "vision_model.encoder.layers.21.mlp.fc2.weight": "model-00002-of-00011.safetensors",
548
+ "vision_model.encoder.layers.21.norm1.weight": "model-00002-of-00011.safetensors",
549
+ "vision_model.encoder.layers.21.norm2.weight": "model-00002-of-00011.safetensors",
550
+ "vision_model.encoder.layers.22.attn.k_norm.weight": "model-00002-of-00011.safetensors",
551
+ "vision_model.encoder.layers.22.attn.proj.bias": "model-00002-of-00011.safetensors",
552
+ "vision_model.encoder.layers.22.attn.proj.weight": "model-00002-of-00011.safetensors",
553
+ "vision_model.encoder.layers.22.attn.q_norm.weight": "model-00002-of-00011.safetensors",
554
+ "vision_model.encoder.layers.22.attn.qkv.weight": "model-00002-of-00011.safetensors",
555
+ "vision_model.encoder.layers.22.ls1": "model-00002-of-00011.safetensors",
556
+ "vision_model.encoder.layers.22.ls2": "model-00002-of-00011.safetensors",
557
+ "vision_model.encoder.layers.22.mlp.fc1.bias": "model-00002-of-00011.safetensors",
558
+ "vision_model.encoder.layers.22.mlp.fc1.weight": "model-00002-of-00011.safetensors",
559
+ "vision_model.encoder.layers.22.mlp.fc2.bias": "model-00002-of-00011.safetensors",
560
+ "vision_model.encoder.layers.22.mlp.fc2.weight": "model-00002-of-00011.safetensors",
561
+ "vision_model.encoder.layers.22.norm1.weight": "model-00002-of-00011.safetensors",
562
+ "vision_model.encoder.layers.22.norm2.weight": "model-00002-of-00011.safetensors",
563
+ "vision_model.encoder.layers.23.attn.k_norm.weight": "model-00002-of-00011.safetensors",
564
+ "vision_model.encoder.layers.23.attn.proj.bias": "model-00002-of-00011.safetensors",
565
+ "vision_model.encoder.layers.23.attn.proj.weight": "model-00002-of-00011.safetensors",
566
+ "vision_model.encoder.layers.23.attn.q_norm.weight": "model-00002-of-00011.safetensors",
567
+ "vision_model.encoder.layers.23.attn.qkv.weight": "model-00002-of-00011.safetensors",
568
+ "vision_model.encoder.layers.23.ls1": "model-00002-of-00011.safetensors",
569
+ "vision_model.encoder.layers.23.ls2": "model-00002-of-00011.safetensors",
570
+ "vision_model.encoder.layers.23.mlp.fc1.bias": "model-00002-of-00011.safetensors",
571
+ "vision_model.encoder.layers.23.mlp.fc1.weight": "model-00002-of-00011.safetensors",
572
+ "vision_model.encoder.layers.23.mlp.fc2.bias": "model-00002-of-00011.safetensors",
573
+ "vision_model.encoder.layers.23.mlp.fc2.weight": "model-00002-of-00011.safetensors",
574
+ "vision_model.encoder.layers.23.norm1.weight": "model-00002-of-00011.safetensors",
575
+ "vision_model.encoder.layers.23.norm2.weight": "model-00002-of-00011.safetensors",
576
+ "vision_model.encoder.layers.24.attn.k_norm.weight": "model-00002-of-00011.safetensors",
577
+ "vision_model.encoder.layers.24.attn.proj.bias": "model-00002-of-00011.safetensors",
578
+ "vision_model.encoder.layers.24.attn.proj.weight": "model-00002-of-00011.safetensors",
579
+ "vision_model.encoder.layers.24.attn.q_norm.weight": "model-00002-of-00011.safetensors",
580
+ "vision_model.encoder.layers.24.attn.qkv.weight": "model-00002-of-00011.safetensors",
581
+ "vision_model.encoder.layers.24.ls1": "model-00002-of-00011.safetensors",
582
+ "vision_model.encoder.layers.24.ls2": "model-00002-of-00011.safetensors",
583
+ "vision_model.encoder.layers.24.mlp.fc1.bias": "model-00002-of-00011.safetensors",
584
+ "vision_model.encoder.layers.24.mlp.fc1.weight": "model-00002-of-00011.safetensors",
585
+ "vision_model.encoder.layers.24.mlp.fc2.bias": "model-00002-of-00011.safetensors",
586
+ "vision_model.encoder.layers.24.mlp.fc2.weight": "model-00002-of-00011.safetensors",
587
+ "vision_model.encoder.layers.24.norm1.weight": "model-00002-of-00011.safetensors",
588
+ "vision_model.encoder.layers.24.norm2.weight": "model-00002-of-00011.safetensors",
589
+ "vision_model.encoder.layers.25.attn.k_norm.weight": "model-00002-of-00011.safetensors",
590
+ "vision_model.encoder.layers.25.attn.proj.bias": "model-00002-of-00011.safetensors",
591
+ "vision_model.encoder.layers.25.attn.proj.weight": "model-00002-of-00011.safetensors",
592
+ "vision_model.encoder.layers.25.attn.q_norm.weight": "model-00002-of-00011.safetensors",
593
+ "vision_model.encoder.layers.25.attn.qkv.weight": "model-00002-of-00011.safetensors",
594
+ "vision_model.encoder.layers.25.ls1": "model-00002-of-00011.safetensors",
595
+ "vision_model.encoder.layers.25.ls2": "model-00002-of-00011.safetensors",
596
+ "vision_model.encoder.layers.25.mlp.fc1.bias": "model-00002-of-00011.safetensors",
597
+ "vision_model.encoder.layers.25.mlp.fc1.weight": "model-00002-of-00011.safetensors",
598
+ "vision_model.encoder.layers.25.mlp.fc2.bias": "model-00002-of-00011.safetensors",
599
+ "vision_model.encoder.layers.25.mlp.fc2.weight": "model-00002-of-00011.safetensors",
600
+ "vision_model.encoder.layers.25.norm1.weight": "model-00002-of-00011.safetensors",
601
+ "vision_model.encoder.layers.25.norm2.weight": "model-00002-of-00011.safetensors",
602
+ "vision_model.encoder.layers.26.attn.k_norm.weight": "model-00002-of-00011.safetensors",
603
+ "vision_model.encoder.layers.26.attn.proj.bias": "model-00002-of-00011.safetensors",
604
+ "vision_model.encoder.layers.26.attn.proj.weight": "model-00002-of-00011.safetensors",
605
+ "vision_model.encoder.layers.26.attn.q_norm.weight": "model-00002-of-00011.safetensors",
606
+ "vision_model.encoder.layers.26.attn.qkv.weight": "model-00002-of-00011.safetensors",
607
+ "vision_model.encoder.layers.26.ls1": "model-00002-of-00011.safetensors",
608
+ "vision_model.encoder.layers.26.ls2": "model-00002-of-00011.safetensors",
609
+ "vision_model.encoder.layers.26.mlp.fc1.bias": "model-00002-of-00011.safetensors",
610
+ "vision_model.encoder.layers.26.mlp.fc1.weight": "model-00002-of-00011.safetensors",
611
+ "vision_model.encoder.layers.26.mlp.fc2.bias": "model-00002-of-00011.safetensors",
612
+ "vision_model.encoder.layers.26.mlp.fc2.weight": "model-00002-of-00011.safetensors",
613
+ "vision_model.encoder.layers.26.norm1.weight": "model-00002-of-00011.safetensors",
614
+ "vision_model.encoder.layers.26.norm2.weight": "model-00002-of-00011.safetensors",
615
+ "vision_model.encoder.layers.27.attn.k_norm.weight": "model-00002-of-00011.safetensors",
616
+ "vision_model.encoder.layers.27.attn.proj.bias": "model-00002-of-00011.safetensors",
617
+ "vision_model.encoder.layers.27.attn.proj.weight": "model-00002-of-00011.safetensors",
618
+ "vision_model.encoder.layers.27.attn.q_norm.weight": "model-00002-of-00011.safetensors",
619
+ "vision_model.encoder.layers.27.attn.qkv.weight": "model-00002-of-00011.safetensors",
620
+ "vision_model.encoder.layers.27.ls1": "model-00002-of-00011.safetensors",
621
+ "vision_model.encoder.layers.27.ls2": "model-00002-of-00011.safetensors",
622
+ "vision_model.encoder.layers.27.mlp.fc1.bias": "model-00002-of-00011.safetensors",
623
+ "vision_model.encoder.layers.27.mlp.fc1.weight": "model-00002-of-00011.safetensors",
624
+ "vision_model.encoder.layers.27.mlp.fc2.bias": "model-00002-of-00011.safetensors",
625
+ "vision_model.encoder.layers.27.mlp.fc2.weight": "model-00002-of-00011.safetensors",
626
+ "vision_model.encoder.layers.27.norm1.weight": "model-00002-of-00011.safetensors",
627
+ "vision_model.encoder.layers.27.norm2.weight": "model-00002-of-00011.safetensors",
628
+ "vision_model.encoder.layers.28.attn.k_norm.weight": "model-00002-of-00011.safetensors",
629
+ "vision_model.encoder.layers.28.attn.proj.bias": "model-00002-of-00011.safetensors",
630
+ "vision_model.encoder.layers.28.attn.proj.weight": "model-00002-of-00011.safetensors",
631
+ "vision_model.encoder.layers.28.attn.q_norm.weight": "model-00002-of-00011.safetensors",
632
+ "vision_model.encoder.layers.28.attn.qkv.weight": "model-00002-of-00011.safetensors",
633
+ "vision_model.encoder.layers.28.ls1": "model-00002-of-00011.safetensors",
634
+ "vision_model.encoder.layers.28.ls2": "model-00002-of-00011.safetensors",
635
+ "vision_model.encoder.layers.28.mlp.fc1.bias": "model-00002-of-00011.safetensors",
636
+ "vision_model.encoder.layers.28.mlp.fc1.weight": "model-00002-of-00011.safetensors",
637
+ "vision_model.encoder.layers.28.mlp.fc2.bias": "model-00002-of-00011.safetensors",
638
+ "vision_model.encoder.layers.28.mlp.fc2.weight": "model-00002-of-00011.safetensors",
639
+ "vision_model.encoder.layers.28.norm1.weight": "model-00002-of-00011.safetensors",
640
+ "vision_model.encoder.layers.28.norm2.weight": "model-00002-of-00011.safetensors",
641
+ "vision_model.encoder.layers.29.attn.k_norm.weight": "model-00002-of-00011.safetensors",
642
+ "vision_model.encoder.layers.29.attn.proj.bias": "model-00002-of-00011.safetensors",
643
+ "vision_model.encoder.layers.29.attn.proj.weight": "model-00002-of-00011.safetensors",
644
+ "vision_model.encoder.layers.29.attn.q_norm.weight": "model-00002-of-00011.safetensors",
645
+ "vision_model.encoder.layers.29.attn.qkv.weight": "model-00002-of-00011.safetensors",
646
+ "vision_model.encoder.layers.29.ls1": "model-00002-of-00011.safetensors",
647
+ "vision_model.encoder.layers.29.ls2": "model-00002-of-00011.safetensors",
648
+ "vision_model.encoder.layers.29.mlp.fc1.bias": "model-00002-of-00011.safetensors",
649
+ "vision_model.encoder.layers.29.mlp.fc1.weight": "model-00002-of-00011.safetensors",
650
+ "vision_model.encoder.layers.29.mlp.fc2.bias": "model-00002-of-00011.safetensors",
651
+ "vision_model.encoder.layers.29.mlp.fc2.weight": "model-00002-of-00011.safetensors",
652
+ "vision_model.encoder.layers.29.norm1.weight": "model-00002-of-00011.safetensors",
653
+ "vision_model.encoder.layers.29.norm2.weight": "model-00002-of-00011.safetensors",
654
+ "vision_model.encoder.layers.3.attn.k_norm.weight": "model-00001-of-00011.safetensors",
655
+ "vision_model.encoder.layers.3.attn.proj.bias": "model-00001-of-00011.safetensors",
656
+ "vision_model.encoder.layers.3.attn.proj.weight": "model-00001-of-00011.safetensors",
657
+ "vision_model.encoder.layers.3.attn.q_norm.weight": "model-00001-of-00011.safetensors",
658
+ "vision_model.encoder.layers.3.attn.qkv.weight": "model-00001-of-00011.safetensors",
659
+ "vision_model.encoder.layers.3.ls1": "model-00001-of-00011.safetensors",
660
+ "vision_model.encoder.layers.3.ls2": "model-00001-of-00011.safetensors",
661
+ "vision_model.encoder.layers.3.mlp.fc1.bias": "model-00001-of-00011.safetensors",
662
+ "vision_model.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00011.safetensors",
663
+ "vision_model.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00011.safetensors",
664
+ "vision_model.encoder.layers.3.mlp.fc2.weight": "model-00001-of-00011.safetensors",
665
+ "vision_model.encoder.layers.3.norm1.weight": "model-00001-of-00011.safetensors",
666
+ "vision_model.encoder.layers.3.norm2.weight": "model-00001-of-00011.safetensors",
667
+ "vision_model.encoder.layers.30.attn.k_norm.weight": "model-00002-of-00011.safetensors",
668
+ "vision_model.encoder.layers.30.attn.proj.bias": "model-00002-of-00011.safetensors",
669
+ "vision_model.encoder.layers.30.attn.proj.weight": "model-00002-of-00011.safetensors",
670
+ "vision_model.encoder.layers.30.attn.q_norm.weight": "model-00002-of-00011.safetensors",
671
+ "vision_model.encoder.layers.30.attn.qkv.weight": "model-00002-of-00011.safetensors",
672
+ "vision_model.encoder.layers.30.ls1": "model-00002-of-00011.safetensors",
673
+ "vision_model.encoder.layers.30.ls2": "model-00002-of-00011.safetensors",
674
+ "vision_model.encoder.layers.30.mlp.fc1.bias": "model-00002-of-00011.safetensors",
675
+ "vision_model.encoder.layers.30.mlp.fc1.weight": "model-00002-of-00011.safetensors",
676
+ "vision_model.encoder.layers.30.mlp.fc2.bias": "model-00002-of-00011.safetensors",
677
+ "vision_model.encoder.layers.30.mlp.fc2.weight": "model-00002-of-00011.safetensors",
678
+ "vision_model.encoder.layers.30.norm1.weight": "model-00002-of-00011.safetensors",
679
+ "vision_model.encoder.layers.30.norm2.weight": "model-00002-of-00011.safetensors",
680
+ "vision_model.encoder.layers.31.attn.k_norm.weight": "model-00002-of-00011.safetensors",
681
+ "vision_model.encoder.layers.31.attn.proj.bias": "model-00002-of-00011.safetensors",
682
+ "vision_model.encoder.layers.31.attn.proj.weight": "model-00002-of-00011.safetensors",
683
+ "vision_model.encoder.layers.31.attn.q_norm.weight": "model-00002-of-00011.safetensors",
684
+ "vision_model.encoder.layers.31.attn.qkv.weight": "model-00002-of-00011.safetensors",
685
+ "vision_model.encoder.layers.31.ls1": "model-00002-of-00011.safetensors",
686
+ "vision_model.encoder.layers.31.ls2": "model-00002-of-00011.safetensors",
687
+ "vision_model.encoder.layers.31.mlp.fc1.bias": "model-00002-of-00011.safetensors",
688
+ "vision_model.encoder.layers.31.mlp.fc1.weight": "model-00002-of-00011.safetensors",
689
+ "vision_model.encoder.layers.31.mlp.fc2.bias": "model-00002-of-00011.safetensors",
690
+ "vision_model.encoder.layers.31.mlp.fc2.weight": "model-00002-of-00011.safetensors",
691
+ "vision_model.encoder.layers.31.norm1.weight": "model-00002-of-00011.safetensors",
692
+ "vision_model.encoder.layers.31.norm2.weight": "model-00002-of-00011.safetensors",
693
+ "vision_model.encoder.layers.32.attn.k_norm.weight": "model-00002-of-00011.safetensors",
694
+ "vision_model.encoder.layers.32.attn.proj.bias": "model-00002-of-00011.safetensors",
695
+ "vision_model.encoder.layers.32.attn.proj.weight": "model-00002-of-00011.safetensors",
696
+ "vision_model.encoder.layers.32.attn.q_norm.weight": "model-00002-of-00011.safetensors",
697
+ "vision_model.encoder.layers.32.attn.qkv.weight": "model-00002-of-00011.safetensors",
698
+ "vision_model.encoder.layers.32.ls1": "model-00002-of-00011.safetensors",
699
+ "vision_model.encoder.layers.32.ls2": "model-00002-of-00011.safetensors",
700
+ "vision_model.encoder.layers.32.mlp.fc1.bias": "model-00002-of-00011.safetensors",
701
+ "vision_model.encoder.layers.32.mlp.fc1.weight": "model-00002-of-00011.safetensors",
702
+ "vision_model.encoder.layers.32.mlp.fc2.bias": "model-00002-of-00011.safetensors",
703
+ "vision_model.encoder.layers.32.mlp.fc2.weight": "model-00002-of-00011.safetensors",
704
+ "vision_model.encoder.layers.32.norm1.weight": "model-00002-of-00011.safetensors",
705
+ "vision_model.encoder.layers.32.norm2.weight": "model-00002-of-00011.safetensors",
706
+ "vision_model.encoder.layers.33.attn.k_norm.weight": "model-00002-of-00011.safetensors",
707
+ "vision_model.encoder.layers.33.attn.proj.bias": "model-00002-of-00011.safetensors",
708
+ "vision_model.encoder.layers.33.attn.proj.weight": "model-00002-of-00011.safetensors",
709
+ "vision_model.encoder.layers.33.attn.q_norm.weight": "model-00002-of-00011.safetensors",
710
+ "vision_model.encoder.layers.33.attn.qkv.weight": "model-00002-of-00011.safetensors",
711
+ "vision_model.encoder.layers.33.ls1": "model-00002-of-00011.safetensors",
712
+ "vision_model.encoder.layers.33.ls2": "model-00002-of-00011.safetensors",
713
+ "vision_model.encoder.layers.33.mlp.fc1.bias": "model-00002-of-00011.safetensors",
714
+ "vision_model.encoder.layers.33.mlp.fc1.weight": "model-00002-of-00011.safetensors",
715
+ "vision_model.encoder.layers.33.mlp.fc2.bias": "model-00002-of-00011.safetensors",
716
+ "vision_model.encoder.layers.33.mlp.fc2.weight": "model-00002-of-00011.safetensors",
717
+ "vision_model.encoder.layers.33.norm1.weight": "model-00002-of-00011.safetensors",
718
+ "vision_model.encoder.layers.33.norm2.weight": "model-00002-of-00011.safetensors",
719
+ "vision_model.encoder.layers.34.attn.k_norm.weight": "model-00002-of-00011.safetensors",
720
+ "vision_model.encoder.layers.34.attn.proj.bias": "model-00002-of-00011.safetensors",
721
+ "vision_model.encoder.layers.34.attn.proj.weight": "model-00002-of-00011.safetensors",
722
+ "vision_model.encoder.layers.34.attn.q_norm.weight": "model-00002-of-00011.safetensors",
723
+ "vision_model.encoder.layers.34.attn.qkv.weight": "model-00002-of-00011.safetensors",
724
+ "vision_model.encoder.layers.34.ls1": "model-00002-of-00011.safetensors",
725
+ "vision_model.encoder.layers.34.ls2": "model-00002-of-00011.safetensors",
726
+ "vision_model.encoder.layers.34.mlp.fc1.bias": "model-00002-of-00011.safetensors",
727
+ "vision_model.encoder.layers.34.mlp.fc1.weight": "model-00002-of-00011.safetensors",
728
+ "vision_model.encoder.layers.34.mlp.fc2.bias": "model-00002-of-00011.safetensors",
729
+ "vision_model.encoder.layers.34.mlp.fc2.weight": "model-00002-of-00011.safetensors",
730
+ "vision_model.encoder.layers.34.norm1.weight": "model-00002-of-00011.safetensors",
731
+ "vision_model.encoder.layers.34.norm2.weight": "model-00002-of-00011.safetensors",
732
+ "vision_model.encoder.layers.35.attn.k_norm.weight": "model-00002-of-00011.safetensors",
733
+ "vision_model.encoder.layers.35.attn.proj.bias": "model-00002-of-00011.safetensors",
734
+ "vision_model.encoder.layers.35.attn.proj.weight": "model-00002-of-00011.safetensors",
735
+ "vision_model.encoder.layers.35.attn.q_norm.weight": "model-00002-of-00011.safetensors",
736
+ "vision_model.encoder.layers.35.attn.qkv.weight": "model-00002-of-00011.safetensors",
737
+ "vision_model.encoder.layers.35.ls1": "model-00002-of-00011.safetensors",
738
+ "vision_model.encoder.layers.35.ls2": "model-00002-of-00011.safetensors",
739
+ "vision_model.encoder.layers.35.mlp.fc1.bias": "model-00002-of-00011.safetensors",
740
+ "vision_model.encoder.layers.35.mlp.fc1.weight": "model-00002-of-00011.safetensors",
741
+ "vision_model.encoder.layers.35.mlp.fc2.bias": "model-00002-of-00011.safetensors",
742
+ "vision_model.encoder.layers.35.mlp.fc2.weight": "model-00002-of-00011.safetensors",
743
+ "vision_model.encoder.layers.35.norm1.weight": "model-00002-of-00011.safetensors",
744
+ "vision_model.encoder.layers.35.norm2.weight": "model-00002-of-00011.safetensors",
745
+ "vision_model.encoder.layers.36.attn.k_norm.weight": "model-00002-of-00011.safetensors",
746
+ "vision_model.encoder.layers.36.attn.proj.bias": "model-00002-of-00011.safetensors",
747
+ "vision_model.encoder.layers.36.attn.proj.weight": "model-00002-of-00011.safetensors",
748
+ "vision_model.encoder.layers.36.attn.q_norm.weight": "model-00002-of-00011.safetensors",
749
+ "vision_model.encoder.layers.36.attn.qkv.weight": "model-00002-of-00011.safetensors",
750
+ "vision_model.encoder.layers.36.ls1": "model-00002-of-00011.safetensors",
751
+ "vision_model.encoder.layers.36.ls2": "model-00002-of-00011.safetensors",
752
+ "vision_model.encoder.layers.36.mlp.fc1.bias": "model-00002-of-00011.safetensors",
753
+ "vision_model.encoder.layers.36.mlp.fc1.weight": "model-00002-of-00011.safetensors",
754
+ "vision_model.encoder.layers.36.mlp.fc2.bias": "model-00002-of-00011.safetensors",
755
+ "vision_model.encoder.layers.36.mlp.fc2.weight": "model-00002-of-00011.safetensors",
756
+ "vision_model.encoder.layers.36.norm1.weight": "model-00002-of-00011.safetensors",
757
+ "vision_model.encoder.layers.36.norm2.weight": "model-00002-of-00011.safetensors",
758
+ "vision_model.encoder.layers.37.attn.k_norm.weight": "model-00002-of-00011.safetensors",
759
+ "vision_model.encoder.layers.37.attn.proj.bias": "model-00002-of-00011.safetensors",
760
+ "vision_model.encoder.layers.37.attn.proj.weight": "model-00002-of-00011.safetensors",
761
+ "vision_model.encoder.layers.37.attn.q_norm.weight": "model-00002-of-00011.safetensors",
762
+ "vision_model.encoder.layers.37.attn.qkv.weight": "model-00002-of-00011.safetensors",
763
+ "vision_model.encoder.layers.37.ls1": "model-00002-of-00011.safetensors",
764
+ "vision_model.encoder.layers.37.ls2": "model-00002-of-00011.safetensors",
765
+ "vision_model.encoder.layers.37.mlp.fc1.bias": "model-00002-of-00011.safetensors",
766
+ "vision_model.encoder.layers.37.mlp.fc1.weight": "model-00002-of-00011.safetensors",
767
+ "vision_model.encoder.layers.37.mlp.fc2.bias": "model-00002-of-00011.safetensors",
768
+ "vision_model.encoder.layers.37.mlp.fc2.weight": "model-00002-of-00011.safetensors",
769
+ "vision_model.encoder.layers.37.norm1.weight": "model-00002-of-00011.safetensors",
770
+ "vision_model.encoder.layers.37.norm2.weight": "model-00002-of-00011.safetensors",
771
+ "vision_model.encoder.layers.38.attn.k_norm.weight": "model-00002-of-00011.safetensors",
772
+ "vision_model.encoder.layers.38.attn.proj.bias": "model-00002-of-00011.safetensors",
773
+ "vision_model.encoder.layers.38.attn.proj.weight": "model-00002-of-00011.safetensors",
774
+ "vision_model.encoder.layers.38.attn.q_norm.weight": "model-00002-of-00011.safetensors",
775
+ "vision_model.encoder.layers.38.attn.qkv.weight": "model-00002-of-00011.safetensors",
776
+ "vision_model.encoder.layers.38.ls1": "model-00002-of-00011.safetensors",
777
+ "vision_model.encoder.layers.38.ls2": "model-00002-of-00011.safetensors",
778
+ "vision_model.encoder.layers.38.mlp.fc1.bias": "model-00002-of-00011.safetensors",
779
+ "vision_model.encoder.layers.38.mlp.fc1.weight": "model-00002-of-00011.safetensors",
780
+ "vision_model.encoder.layers.38.mlp.fc2.bias": "model-00002-of-00011.safetensors",
781
+ "vision_model.encoder.layers.38.mlp.fc2.weight": "model-00002-of-00011.safetensors",
782
+ "vision_model.encoder.layers.38.norm1.weight": "model-00002-of-00011.safetensors",
783
+ "vision_model.encoder.layers.38.norm2.weight": "model-00002-of-00011.safetensors",
784
+ "vision_model.encoder.layers.39.attn.k_norm.weight": "model-00002-of-00011.safetensors",
785
+ "vision_model.encoder.layers.39.attn.proj.bias": "model-00002-of-00011.safetensors",
786
+ "vision_model.encoder.layers.39.attn.proj.weight": "model-00002-of-00011.safetensors",
787
+ "vision_model.encoder.layers.39.attn.q_norm.weight": "model-00002-of-00011.safetensors",
788
+ "vision_model.encoder.layers.39.attn.qkv.weight": "model-00002-of-00011.safetensors",
789
+ "vision_model.encoder.layers.39.ls1": "model-00002-of-00011.safetensors",
790
+ "vision_model.encoder.layers.39.ls2": "model-00002-of-00011.safetensors",
791
+ "vision_model.encoder.layers.39.mlp.fc1.bias": "model-00002-of-00011.safetensors",
792
+ "vision_model.encoder.layers.39.mlp.fc1.weight": "model-00002-of-00011.safetensors",
793
+ "vision_model.encoder.layers.39.mlp.fc2.bias": "model-00002-of-00011.safetensors",
794
+ "vision_model.encoder.layers.39.mlp.fc2.weight": "model-00002-of-00011.safetensors",
795
+ "vision_model.encoder.layers.39.norm1.weight": "model-00002-of-00011.safetensors",
796
+ "vision_model.encoder.layers.39.norm2.weight": "model-00002-of-00011.safetensors",
797
+ "vision_model.encoder.layers.4.attn.k_norm.weight": "model-00001-of-00011.safetensors",
798
+ "vision_model.encoder.layers.4.attn.proj.bias": "model-00001-of-00011.safetensors",
799
+ "vision_model.encoder.layers.4.attn.proj.weight": "model-00001-of-00011.safetensors",
800
+ "vision_model.encoder.layers.4.attn.q_norm.weight": "model-00001-of-00011.safetensors",
801
+ "vision_model.encoder.layers.4.attn.qkv.weight": "model-00001-of-00011.safetensors",
802
+ "vision_model.encoder.layers.4.ls1": "model-00001-of-00011.safetensors",
803
+ "vision_model.encoder.layers.4.ls2": "model-00001-of-00011.safetensors",
804
+ "vision_model.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00011.safetensors",
805
+ "vision_model.encoder.layers.4.mlp.fc1.weight": "model-00001-of-00011.safetensors",
806
+ "vision_model.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00011.safetensors",
807
+ "vision_model.encoder.layers.4.mlp.fc2.weight": "model-00001-of-00011.safetensors",
808
+ "vision_model.encoder.layers.4.norm1.weight": "model-00001-of-00011.safetensors",
809
+ "vision_model.encoder.layers.4.norm2.weight": "model-00001-of-00011.safetensors",
810
+ "vision_model.encoder.layers.40.attn.k_norm.weight": "model-00002-of-00011.safetensors",
811
+ "vision_model.encoder.layers.40.attn.proj.bias": "model-00002-of-00011.safetensors",
812
+ "vision_model.encoder.layers.40.attn.proj.weight": "model-00002-of-00011.safetensors",
813
+ "vision_model.encoder.layers.40.attn.q_norm.weight": "model-00002-of-00011.safetensors",
814
+ "vision_model.encoder.layers.40.attn.qkv.weight": "model-00002-of-00011.safetensors",
815
+ "vision_model.encoder.layers.40.ls1": "model-00002-of-00011.safetensors",
816
+ "vision_model.encoder.layers.40.ls2": "model-00002-of-00011.safetensors",
817
+ "vision_model.encoder.layers.40.mlp.fc1.bias": "model-00003-of-00011.safetensors",
818
+ "vision_model.encoder.layers.40.mlp.fc1.weight": "model-00003-of-00011.safetensors",
819
+ "vision_model.encoder.layers.40.mlp.fc2.bias": "model-00003-of-00011.safetensors",
820
+ "vision_model.encoder.layers.40.mlp.fc2.weight": "model-00003-of-00011.safetensors",
821
+ "vision_model.encoder.layers.40.norm1.weight": "model-00003-of-00011.safetensors",
822
+ "vision_model.encoder.layers.40.norm2.weight": "model-00003-of-00011.safetensors",
823
+ "vision_model.encoder.layers.41.attn.k_norm.weight": "model-00003-of-00011.safetensors",
824
+ "vision_model.encoder.layers.41.attn.proj.bias": "model-00003-of-00011.safetensors",
825
+ "vision_model.encoder.layers.41.attn.proj.weight": "model-00003-of-00011.safetensors",
826
+ "vision_model.encoder.layers.41.attn.q_norm.weight": "model-00003-of-00011.safetensors",
827
+ "vision_model.encoder.layers.41.attn.qkv.weight": "model-00003-of-00011.safetensors",
828
+ "vision_model.encoder.layers.41.ls1": "model-00003-of-00011.safetensors",
829
+ "vision_model.encoder.layers.41.ls2": "model-00003-of-00011.safetensors",
830
+ "vision_model.encoder.layers.41.mlp.fc1.bias": "model-00003-of-00011.safetensors",
831
+ "vision_model.encoder.layers.41.mlp.fc1.weight": "model-00003-of-00011.safetensors",
832
+ "vision_model.encoder.layers.41.mlp.fc2.bias": "model-00003-of-00011.safetensors",
833
+ "vision_model.encoder.layers.41.mlp.fc2.weight": "model-00003-of-00011.safetensors",
834
+ "vision_model.encoder.layers.41.norm1.weight": "model-00003-of-00011.safetensors",
835
+ "vision_model.encoder.layers.41.norm2.weight": "model-00003-of-00011.safetensors",
836
+ "vision_model.encoder.layers.42.attn.k_norm.weight": "model-00003-of-00011.safetensors",
837
+ "vision_model.encoder.layers.42.attn.proj.bias": "model-00003-of-00011.safetensors",
838
+ "vision_model.encoder.layers.42.attn.proj.weight": "model-00003-of-00011.safetensors",
839
+ "vision_model.encoder.layers.42.attn.q_norm.weight": "model-00003-of-00011.safetensors",
840
+ "vision_model.encoder.layers.42.attn.qkv.weight": "model-00003-of-00011.safetensors",
841
+ "vision_model.encoder.layers.42.ls1": "model-00003-of-00011.safetensors",
842
+ "vision_model.encoder.layers.42.ls2": "model-00003-of-00011.safetensors",
843
+ "vision_model.encoder.layers.42.mlp.fc1.bias": "model-00003-of-00011.safetensors",
844
+ "vision_model.encoder.layers.42.mlp.fc1.weight": "model-00003-of-00011.safetensors",
845
+ "vision_model.encoder.layers.42.mlp.fc2.bias": "model-00003-of-00011.safetensors",
846
+ "vision_model.encoder.layers.42.mlp.fc2.weight": "model-00003-of-00011.safetensors",
847
+ "vision_model.encoder.layers.42.norm1.weight": "model-00003-of-00011.safetensors",
848
+ "vision_model.encoder.layers.42.norm2.weight": "model-00003-of-00011.safetensors",
849
+ "vision_model.encoder.layers.43.attn.k_norm.weight": "model-00003-of-00011.safetensors",
850
+ "vision_model.encoder.layers.43.attn.proj.bias": "model-00003-of-00011.safetensors",
851
+ "vision_model.encoder.layers.43.attn.proj.weight": "model-00003-of-00011.safetensors",
852
+ "vision_model.encoder.layers.43.attn.q_norm.weight": "model-00003-of-00011.safetensors",
853
+ "vision_model.encoder.layers.43.attn.qkv.weight": "model-00003-of-00011.safetensors",
854
+ "vision_model.encoder.layers.43.ls1": "model-00003-of-00011.safetensors",
855
+ "vision_model.encoder.layers.43.ls2": "model-00003-of-00011.safetensors",
856
+ "vision_model.encoder.layers.43.mlp.fc1.bias": "model-00003-of-00011.safetensors",
857
+ "vision_model.encoder.layers.43.mlp.fc1.weight": "model-00003-of-00011.safetensors",
858
+ "vision_model.encoder.layers.43.mlp.fc2.bias": "model-00003-of-00011.safetensors",
859
+ "vision_model.encoder.layers.43.mlp.fc2.weight": "model-00003-of-00011.safetensors",
860
+ "vision_model.encoder.layers.43.norm1.weight": "model-00003-of-00011.safetensors",
861
+ "vision_model.encoder.layers.43.norm2.weight": "model-00003-of-00011.safetensors",
862
+ "vision_model.encoder.layers.44.attn.k_norm.weight": "model-00003-of-00011.safetensors",
863
+ "vision_model.encoder.layers.44.attn.proj.bias": "model-00003-of-00011.safetensors",
864
+ "vision_model.encoder.layers.44.attn.proj.weight": "model-00003-of-00011.safetensors",
865
+ "vision_model.encoder.layers.44.attn.q_norm.weight": "model-00003-of-00011.safetensors",
866
+ "vision_model.encoder.layers.44.attn.qkv.weight": "model-00003-of-00011.safetensors",
867
+ "vision_model.encoder.layers.44.ls1": "model-00003-of-00011.safetensors",
868
+ "vision_model.encoder.layers.44.ls2": "model-00003-of-00011.safetensors",
869
+ "vision_model.encoder.layers.44.mlp.fc1.bias": "model-00003-of-00011.safetensors",
870
+ "vision_model.encoder.layers.44.mlp.fc1.weight": "model-00003-of-00011.safetensors",
871
+ "vision_model.encoder.layers.44.mlp.fc2.bias": "model-00003-of-00011.safetensors",
872
+ "vision_model.encoder.layers.44.mlp.fc2.weight": "model-00003-of-00011.safetensors",
873
+ "vision_model.encoder.layers.44.norm1.weight": "model-00003-of-00011.safetensors",
874
+ "vision_model.encoder.layers.44.norm2.weight": "model-00003-of-00011.safetensors",
875
+ "vision_model.encoder.layers.5.attn.k_norm.weight": "model-00001-of-00011.safetensors",
876
+ "vision_model.encoder.layers.5.attn.proj.bias": "model-00001-of-00011.safetensors",
877
+ "vision_model.encoder.layers.5.attn.proj.weight": "model-00001-of-00011.safetensors",
878
+ "vision_model.encoder.layers.5.attn.q_norm.weight": "model-00001-of-00011.safetensors",
879
+ "vision_model.encoder.layers.5.attn.qkv.weight": "model-00001-of-00011.safetensors",
880
+ "vision_model.encoder.layers.5.ls1": "model-00001-of-00011.safetensors",
881
+ "vision_model.encoder.layers.5.ls2": "model-00001-of-00011.safetensors",
882
+ "vision_model.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00011.safetensors",
883
+ "vision_model.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00011.safetensors",
884
+ "vision_model.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00011.safetensors",
885
+ "vision_model.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00011.safetensors",
886
+ "vision_model.encoder.layers.5.norm1.weight": "model-00001-of-00011.safetensors",
887
+ "vision_model.encoder.layers.5.norm2.weight": "model-00001-of-00011.safetensors",
888
+ "vision_model.encoder.layers.6.attn.k_norm.weight": "model-00001-of-00011.safetensors",
889
+ "vision_model.encoder.layers.6.attn.proj.bias": "model-00001-of-00011.safetensors",
890
+ "vision_model.encoder.layers.6.attn.proj.weight": "model-00001-of-00011.safetensors",
891
+ "vision_model.encoder.layers.6.attn.q_norm.weight": "model-00001-of-00011.safetensors",
892
+ "vision_model.encoder.layers.6.attn.qkv.weight": "model-00001-of-00011.safetensors",
893
+ "vision_model.encoder.layers.6.ls1": "model-00001-of-00011.safetensors",
894
+ "vision_model.encoder.layers.6.ls2": "model-00001-of-00011.safetensors",
895
+ "vision_model.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00011.safetensors",
896
+ "vision_model.encoder.layers.6.mlp.fc1.weight": "model-00001-of-00011.safetensors",
897
+ "vision_model.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00011.safetensors",
898
+ "vision_model.encoder.layers.6.mlp.fc2.weight": "model-00001-of-00011.safetensors",
899
+ "vision_model.encoder.layers.6.norm1.weight": "model-00001-of-00011.safetensors",
900
+ "vision_model.encoder.layers.6.norm2.weight": "model-00001-of-00011.safetensors",
901
+ "vision_model.encoder.layers.7.attn.k_norm.weight": "model-00001-of-00011.safetensors",
902
+ "vision_model.encoder.layers.7.attn.proj.bias": "model-00001-of-00011.safetensors",
903
+ "vision_model.encoder.layers.7.attn.proj.weight": "model-00001-of-00011.safetensors",
904
+ "vision_model.encoder.layers.7.attn.q_norm.weight": "model-00001-of-00011.safetensors",
905
+ "vision_model.encoder.layers.7.attn.qkv.weight": "model-00001-of-00011.safetensors",
906
+ "vision_model.encoder.layers.7.ls1": "model-00001-of-00011.safetensors",
907
+ "vision_model.encoder.layers.7.ls2": "model-00001-of-00011.safetensors",
908
+ "vision_model.encoder.layers.7.mlp.fc1.bias": "model-00001-of-00011.safetensors",
909
+ "vision_model.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00011.safetensors",
910
+ "vision_model.encoder.layers.7.mlp.fc2.bias": "model-00001-of-00011.safetensors",
911
+ "vision_model.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00011.safetensors",
912
+ "vision_model.encoder.layers.7.norm1.weight": "model-00001-of-00011.safetensors",
913
+ "vision_model.encoder.layers.7.norm2.weight": "model-00001-of-00011.safetensors",
914
+ "vision_model.encoder.layers.8.attn.k_norm.weight": "model-00001-of-00011.safetensors",
915
+ "vision_model.encoder.layers.8.attn.proj.bias": "model-00001-of-00011.safetensors",
916
+ "vision_model.encoder.layers.8.attn.proj.weight": "model-00001-of-00011.safetensors",
917
+ "vision_model.encoder.layers.8.attn.q_norm.weight": "model-00001-of-00011.safetensors",
918
+ "vision_model.encoder.layers.8.attn.qkv.weight": "model-00001-of-00011.safetensors",
919
+ "vision_model.encoder.layers.8.ls1": "model-00001-of-00011.safetensors",
920
+ "vision_model.encoder.layers.8.ls2": "model-00001-of-00011.safetensors",
921
+ "vision_model.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00011.safetensors",
922
+ "vision_model.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00011.safetensors",
923
+ "vision_model.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00011.safetensors",
924
+ "vision_model.encoder.layers.8.mlp.fc2.weight": "model-00001-of-00011.safetensors",
925
+ "vision_model.encoder.layers.8.norm1.weight": "model-00001-of-00011.safetensors",
926
+ "vision_model.encoder.layers.8.norm2.weight": "model-00001-of-00011.safetensors",
927
+ "vision_model.encoder.layers.9.attn.k_norm.weight": "model-00001-of-00011.safetensors",
928
+ "vision_model.encoder.layers.9.attn.proj.bias": "model-00001-of-00011.safetensors",
929
+ "vision_model.encoder.layers.9.attn.proj.weight": "model-00001-of-00011.safetensors",
930
+ "vision_model.encoder.layers.9.attn.q_norm.weight": "model-00001-of-00011.safetensors",
931
+ "vision_model.encoder.layers.9.attn.qkv.weight": "model-00001-of-00011.safetensors",
932
+ "vision_model.encoder.layers.9.ls1": "model-00001-of-00011.safetensors",
933
+ "vision_model.encoder.layers.9.ls2": "model-00001-of-00011.safetensors",
934
+ "vision_model.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00011.safetensors",
935
+ "vision_model.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00011.safetensors",
936
+ "vision_model.encoder.layers.9.mlp.fc2.bias": "model-00001-of-00011.safetensors",
937
+ "vision_model.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00011.safetensors",
938
+ "vision_model.encoder.layers.9.norm1.weight": "model-00001-of-00011.safetensors",
939
+ "vision_model.encoder.layers.9.norm2.weight": "model-00001-of-00011.safetensors"
940
+ }
941
+ }
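
The `weight_map` closed above records, for every parameter tensor, which of the 11 `model-XXXXX-of-00011.safetensors` shards stores it. As a quick illustration (not part of the uploaded files), the index can be loaded and queried directly; the file name `model.safetensors.index.json` is the usual convention for this index and is an assumption here.

```python
# Hypothetical helper: look up which shard stores a given parameter.
# Assumes the index file follows the conventional "model.safetensors.index.json" naming.
import json

with open("model.safetensors.index.json") as f:
    index = json.load(f)

weight_map = index["weight_map"]  # parameter name -> shard file name
name = "vision_model.encoder.layers.44.mlp.fc1.weight"
print(name, "->", weight_map.get(name))  # expected: model-00003-of-00011.safetensors
```

Loaders such as `transformers` consult this same index automatically when reassembling the sharded checkpoint, so a manual lookup like the one above is only needed for inspection or debugging.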
modeling_intern_vit.py ADDED
@@ -0,0 +1,429 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+ from typing import Optional, Tuple, Union
7
+
8
+ import torch
9
+ import torch.nn.functional as F
10
+ import torch.utils.checkpoint
11
+ from einops import rearrange
12
+ from timm.models.layers import DropPath
13
+ from torch import nn
14
+ from transformers.activations import ACT2FN
15
+ from transformers.modeling_outputs import (BaseModelOutput,
16
+ BaseModelOutputWithPooling)
17
+ from transformers.modeling_utils import PreTrainedModel
18
+ from transformers.utils import logging
19
+
20
+ from .configuration_intern_vit import InternVisionConfig
21
+
22
+ try:
23
+ from flash_attn.bert_padding import pad_input, unpad_input
24
+ from flash_attn.flash_attn_interface import \
25
+ flash_attn_varlen_qkvpacked_func
26
+ has_flash_attn = True
27
+ except:
28
+ print('FlashAttention2 is not installed.')
29
+ has_flash_attn = False
30
+
31
+ logger = logging.get_logger(__name__)
32
+
33
+
34
+ class FlashAttention(nn.Module):
35
+ """Implement the scaled dot product attention with softmax.
36
+ Arguments
37
+ ---------
38
+ softmax_scale: The temperature to use for the softmax attention.
39
+ (default: 1/sqrt(d_keys) where d_keys is computed at
40
+ runtime)
41
+ attention_dropout: The dropout rate to apply to the attention
42
+ (default: 0.0)
43
+ """
44
+
45
+ def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
46
+ super().__init__()
47
+ self.softmax_scale = softmax_scale
48
+ self.dropout_p = attention_dropout
49
+
50
+ def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
51
+ max_s=None, need_weights=False):
52
+ """Implements the multihead softmax attention.
53
+ Arguments
54
+ ---------
55
+ qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None
56
+ if unpadded: (nnz, 3, h, d)
57
+ key_padding_mask: a bool tensor of shape (B, S)
58
+ """
59
+ assert not need_weights
60
+ assert qkv.dtype in [torch.float16, torch.bfloat16]
61
+ assert qkv.is_cuda
62
+
63
+ if cu_seqlens is None:
64
+ batch_size = qkv.shape[0]
65
+ seqlen = qkv.shape[1]
66
+ if key_padding_mask is None:
67
+ qkv = rearrange(qkv, 'b s ... -> (b s) ...')
68
+ max_s = seqlen
69
+ cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
70
+ device=qkv.device)
71
+ output = flash_attn_varlen_qkvpacked_func(
72
+ qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
73
+ softmax_scale=self.softmax_scale, causal=causal
74
+ )
75
+ output = rearrange(output, '(b s) ... -> b s ...', b=batch_size)
76
+ else:
77
+ nheads = qkv.shape[-2]
78
+ x = rearrange(qkv, 'b s three h d -> b s (three h d)')
79
+ x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
80
+ x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
81
+ output_unpad = flash_attn_varlen_qkvpacked_func(
82
+ x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
83
+ softmax_scale=self.softmax_scale, causal=causal
84
+ )
85
+ output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
86
+ indices, batch_size, seqlen),
87
+ 'b s (h d) -> b s h d', h=nheads)
88
+ else:
89
+ assert max_s is not None
90
+ output = flash_attn_varlen_qkvpacked_func(
91
+ qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
92
+ softmax_scale=self.softmax_scale, causal=causal
93
+ )
94
+
95
+ return output, None
96
+
97
+
98
+ class InternRMSNorm(nn.Module):
99
+ def __init__(self, hidden_size, eps=1e-6):
100
+ super().__init__()
101
+ self.weight = nn.Parameter(torch.ones(hidden_size))
102
+ self.variance_epsilon = eps
103
+
104
+ def forward(self, hidden_states):
105
+ input_dtype = hidden_states.dtype
106
+ hidden_states = hidden_states.to(torch.float32)
107
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
108
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
109
+ return self.weight * hidden_states.to(input_dtype)
110
+
111
+
112
+ try:
113
+ from apex.normalization import FusedRMSNorm
114
+
115
+ InternRMSNorm = FusedRMSNorm # noqa
116
+
117
+ logger.info('Discovered apex.normalization.FusedRMSNorm - will use it instead of InternRMSNorm')
118
+ except ImportError:
119
+ # using the normal InternRMSNorm
120
+ pass
121
+ except Exception:
122
+ logger.warning('discovered apex but it failed to load, falling back to InternRMSNorm')
123
+ pass
124
+
125
+
126
+ NORM2FN = {
127
+ 'rms_norm': InternRMSNorm,
128
+ 'layer_norm': nn.LayerNorm,
129
+ }
130
+
131
+
132
+ class InternVisionEmbeddings(nn.Module):
133
+ def __init__(self, config: InternVisionConfig):
134
+ super().__init__()
135
+ self.config = config
136
+ self.embed_dim = config.hidden_size
137
+ self.image_size = config.image_size
138
+ self.patch_size = config.patch_size
139
+
140
+ self.class_embedding = nn.Parameter(
141
+ torch.randn(1, 1, self.embed_dim),
142
+ )
143
+
144
+ self.patch_embedding = nn.Conv2d(
145
+ in_channels=3, out_channels=self.embed_dim, kernel_size=self.patch_size, stride=self.patch_size
146
+ )
147
+
148
+ self.num_patches = (self.image_size // self.patch_size) ** 2
149
+ self.num_positions = self.num_patches + 1
150
+
151
+ self.position_embedding = nn.Parameter(torch.randn(1, self.num_positions, self.embed_dim))
152
+
153
+ def _get_pos_embed(self, pos_embed, H, W):
154
+ target_dtype = pos_embed.dtype
155
+ pos_embed = pos_embed.float().reshape(
156
+ 1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
157
+ pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
158
+ reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
159
+ return pos_embed
160
+
161
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
162
+ target_dtype = self.patch_embedding.weight.dtype
163
+ patch_embeds = self.patch_embedding(pixel_values) # shape = [*, channel, width, height]
164
+ batch_size, _, height, width = patch_embeds.shape
165
+ patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
166
+ class_embeds = self.class_embedding.expand(batch_size, 1, -1).to(target_dtype)
167
+ embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
168
+ position_embedding = torch.cat([
169
+ self.position_embedding[:, :1, :],
170
+ self._get_pos_embed(self.position_embedding[:, 1:, :], height, width)
171
+ ], dim=1)
172
+ embeddings = embeddings + position_embedding.to(target_dtype)
173
+ return embeddings
174
+
175
+
176
+ class InternAttention(nn.Module):
177
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
178
+
179
+ def __init__(self, config: InternVisionConfig):
180
+ super().__init__()
181
+ self.config = config
182
+ self.embed_dim = config.hidden_size
183
+ self.num_heads = config.num_attention_heads
184
+ self.use_flash_attn = config.use_flash_attn and has_flash_attn
185
+ if config.use_flash_attn and not has_flash_attn:
186
+ print('Warning: Flash Attention is not available, use_flash_attn is set to False.')
187
+ self.head_dim = self.embed_dim // self.num_heads
188
+ if self.head_dim * self.num_heads != self.embed_dim:
189
+ raise ValueError(
190
+ f'embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:'
191
+ f' {self.num_heads}).'
192
+ )
193
+
194
+ self.scale = self.head_dim ** -0.5
195
+ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=config.qkv_bias)
196
+ self.attn_drop = nn.Dropout(config.attention_dropout)
197
+ self.proj_drop = nn.Dropout(config.dropout)
198
+
199
+ self.qk_normalization = config.qk_normalization
200
+
201
+ if self.qk_normalization:
202
+ self.q_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
203
+ self.k_norm = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
204
+
205
+ if self.use_flash_attn:
206
+ self.inner_attn = FlashAttention(attention_dropout=config.attention_dropout)
207
+ self.proj = nn.Linear(self.embed_dim, self.embed_dim)
208
+
209
+ def _naive_attn(self, x):
210
+ B, N, C = x.shape
211
+ qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
212
+ q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)
213
+
214
+ if self.qk_normalization:
215
+ B_, H_, N_, D_ = q.shape
216
+ q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
217
+ k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
218
+
219
+ attn = ((q * self.scale) @ k.transpose(-2, -1))
220
+ attn = attn.softmax(dim=-1)
221
+ attn = self.attn_drop(attn)
222
+
223
+ x = (attn @ v).transpose(1, 2).reshape(B, N, C)
224
+ x = self.proj(x)
225
+ x = self.proj_drop(x)
226
+ return x
227
+
228
+ def _flash_attn(self, x, key_padding_mask=None, need_weights=False):
229
+ qkv = self.qkv(x)
230
+ qkv = rearrange(qkv, 'b s (three h d) -> b s three h d', three=3, h=self.num_heads)
231
+
232
+ if self.qk_normalization:
233
+ q, k, v = qkv.unbind(2)
234
+ q = self.q_norm(q.flatten(-2, -1)).view(q.shape)
235
+ k = self.k_norm(k.flatten(-2, -1)).view(k.shape)
236
+ qkv = torch.stack([q, k, v], dim=2)
237
+
238
+ context, _ = self.inner_attn(
239
+ qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=False
240
+ )
241
+ outs = self.proj(rearrange(context, 'b s h d -> b s (h d)'))
242
+ outs = self.proj_drop(outs)
243
+ return outs
244
+
245
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
246
+ x = self._naive_attn(hidden_states) if not self.use_flash_attn else self._flash_attn(hidden_states)
247
+ return x
248
+
249
+
250
+ class InternMLP(nn.Module):
251
+ def __init__(self, config: InternVisionConfig):
252
+ super().__init__()
253
+ self.config = config
254
+ self.act = ACT2FN[config.hidden_act]
255
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
256
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
257
+
258
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
259
+ hidden_states = self.fc1(hidden_states)
260
+ hidden_states = self.act(hidden_states)
261
+ hidden_states = self.fc2(hidden_states)
262
+ return hidden_states
263
+
264
+
265
+ class InternVisionEncoderLayer(nn.Module):
266
+ def __init__(self, config: InternVisionConfig, drop_path_rate: float):
267
+ super().__init__()
268
+ self.embed_dim = config.hidden_size
269
+ self.intermediate_size = config.intermediate_size
270
+ self.norm_type = config.norm_type
271
+
272
+ self.attn = InternAttention(config)
273
+ self.mlp = InternMLP(config)
274
+ self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
275
+ self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
276
+
277
+ self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
278
+ self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
279
+ self.drop_path1 = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
280
+ self.drop_path2 = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()
281
+
282
+ def forward(
283
+ self,
284
+ hidden_states: torch.Tensor,
285
+ ) -> Tuple[torch.FloatTensor, Optional[torch.FloatTensor], Optional[Tuple[torch.FloatTensor]]]:
286
+ """
287
+ Args:
288
+ hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
289
+ """
290
+ hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states).to(hidden_states.dtype)) * self.ls1)
291
+
292
+ hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states).to(hidden_states.dtype)) * self.ls2)
293
+
294
+ return hidden_states
295
+
296
+
297
+ class InternVisionEncoder(nn.Module):
298
+ """
299
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
300
+ [`InternEncoderLayer`].
301
+
302
+ Args:
303
+ config (`InternConfig`):
304
+ The corresponding vision configuration for the `InternEncoder`.
305
+ """
306
+
307
+ def __init__(self, config: InternVisionConfig):
308
+ super().__init__()
309
+ self.config = config
310
+ # stochastic depth decay rule
311
+ dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, config.num_hidden_layers)]
312
+ self.layers = nn.ModuleList([
313
+ InternVisionEncoderLayer(config, dpr[idx]) for idx in range(config.num_hidden_layers)])
314
+ self.gradient_checkpointing = True
315
+
316
+ def forward(
317
+ self,
318
+ inputs_embeds,
319
+ output_hidden_states: Optional[bool] = None,
320
+ return_dict: Optional[bool] = None,
321
+ ) -> Union[Tuple, BaseModelOutput]:
322
+ r"""
323
+ Args:
324
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
325
+ Embedded representation of the inputs. Should be float, not int tokens.
326
+ output_hidden_states (`bool`, *optional*):
327
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
328
+ for more detail.
329
+ return_dict (`bool`, *optional*):
330
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
331
+ """
332
+ output_hidden_states = (
333
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
334
+ )
335
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
336
+
337
+ encoder_states = () if output_hidden_states else None
338
+ hidden_states = inputs_embeds
339
+
340
+ for idx, encoder_layer in enumerate(self.layers):
341
+ if output_hidden_states:
342
+ encoder_states = encoder_states + (hidden_states,)
343
+ if self.gradient_checkpointing and self.training:
344
+ layer_outputs = torch.utils.checkpoint.checkpoint(
345
+ encoder_layer,
346
+ hidden_states)
347
+ else:
348
+ layer_outputs = encoder_layer(
349
+ hidden_states,
350
+ )
351
+ hidden_states = layer_outputs
352
+
353
+ if output_hidden_states:
354
+ encoder_states = encoder_states + (hidden_states,)
355
+
356
+ if not return_dict:
357
+ return tuple(v for v in [hidden_states, encoder_states] if v is not None)
358
+ return BaseModelOutput(
359
+ last_hidden_state=hidden_states, hidden_states=encoder_states
360
+ )
361
+
362
+
363
+ class InternVisionModel(PreTrainedModel):
364
+ main_input_name = 'pixel_values'
365
+ _supports_flash_attn_2 = True
366
+ config_class = InternVisionConfig
367
+ _no_split_modules = ['InternVisionEncoderLayer']
368
+
369
+ def __init__(self, config: InternVisionConfig):
370
+ super().__init__(config)
371
+ self.config = config
372
+
373
+ self.embeddings = InternVisionEmbeddings(config)
374
+ self.encoder = InternVisionEncoder(config)
375
+
376
+ def resize_pos_embeddings(self, old_size, new_size, patch_size):
377
+ pos_emb = self.embeddings.position_embedding
378
+ _, num_positions, embed_dim = pos_emb.shape
379
+ cls_emb = pos_emb[:, :1, :]
380
+ pos_emb = pos_emb[:, 1:, :].reshape(1, old_size // patch_size, old_size // patch_size, -1).permute(0, 3, 1, 2)
381
+ pos_emb = F.interpolate(pos_emb.float(), size=new_size // patch_size, mode='bicubic', align_corners=False)
382
+ pos_emb = pos_emb.to(cls_emb.dtype).reshape(1, embed_dim, -1).permute(0, 2, 1)
383
+ pos_emb = torch.cat([cls_emb, pos_emb], dim=1)
384
+ self.embeddings.position_embedding = nn.Parameter(pos_emb)
385
+ self.embeddings.image_size = new_size
386
+ logger.info('Resized position embeddings from {} to {}'.format(old_size, new_size))
387
+
388
+ def get_input_embeddings(self):
389
+ return self.embeddings
390
+
391
+ def forward(
392
+ self,
393
+ pixel_values: Optional[torch.FloatTensor] = None,
394
+ output_hidden_states: Optional[bool] = None,
395
+ return_dict: Optional[bool] = None,
396
+ pixel_embeds: Optional[torch.FloatTensor] = None,
397
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
398
+ output_hidden_states = (
399
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
400
+ )
401
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
402
+
403
+ if pixel_values is None and pixel_embeds is None:
404
+ raise ValueError('You have to specify pixel_values or pixel_embeds')
405
+
406
+ if pixel_embeds is not None:
407
+ hidden_states = pixel_embeds
408
+ else:
409
+ if len(pixel_values.shape) == 4:
410
+ hidden_states = self.embeddings(pixel_values)
411
+ else:
412
+ raise ValueError(f'wrong pixel_values size: {pixel_values.shape}')
413
+ encoder_outputs = self.encoder(
414
+ inputs_embeds=hidden_states,
415
+ output_hidden_states=output_hidden_states,
416
+ return_dict=return_dict,
417
+ )
418
+ last_hidden_state = encoder_outputs.last_hidden_state
419
+ pooled_output = last_hidden_state[:, 0, :]
420
+
421
+ if not return_dict:
422
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
423
+
424
+ return BaseModelOutputWithPooling(
425
+ last_hidden_state=last_hidden_state,
426
+ pooler_output=pooled_output,
427
+ hidden_states=encoder_outputs.hidden_states,
428
+ attentions=encoder_outputs.attentions,
429
+ )
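
The vision encoder above keeps a single learned position-embedding table sized for `image_size`/`patch_size` and, in `InternVisionEmbeddings._get_pos_embed`, bicubically interpolates it to whatever patch grid the incoming tiles actually produce. Below is a minimal, self-contained sketch of that resizing step; the sizes are made-up placeholders for illustration, not the model's real dimensions.

```python
# Sketch of the position-embedding interpolation used in _get_pos_embed above.
# embed_dim / grid sizes are illustrative placeholders, not the real config values.
import torch
import torch.nn.functional as F

embed_dim, old_grid, new_h, new_w = 64, 32, 16, 24          # hypothetical sizes
pos_embed = torch.randn(1, old_grid * old_grid, embed_dim)  # patch slots, CLS excluded

# (1, N, C) -> (1, C, old_grid, old_grid) -> bicubic resize -> (1, new_h * new_w, C)
grid = pos_embed.reshape(1, old_grid, old_grid, embed_dim).permute(0, 3, 1, 2)
grid = F.interpolate(grid.float(), size=(new_h, new_w), mode='bicubic', align_corners=False)
resized = grid.reshape(1, embed_dim, new_h * new_w).permute(0, 2, 1)
print(resized.shape)  # torch.Size([1, 384, 64])
```

The CLS position embedding is kept as-is and concatenated back in front of the interpolated grid, matching the `torch.cat` in `InternVisionEmbeddings.forward`.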
modeling_internlm2.py ADDED
@@ -0,0 +1,1480 @@
1
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
2
+ #
3
+ # This code is based on transformers/src/transformers/models/llama/modeling_llama.py
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ PyTorch InternLM2 model."""
17
+ import math
18
+ import queue
19
+ import threading
20
+ import warnings
21
+ from typing import List, Optional, Tuple, Union
22
+
23
+ import torch
24
+ import torch.nn.functional as F
25
+ import torch.utils.checkpoint
26
+ from einops import rearrange
27
+ from torch import nn
28
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
29
+ from transformers.activations import ACT2FN
30
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
31
+ CausalLMOutputWithPast,
32
+ SequenceClassifierOutputWithPast)
33
+ from transformers.modeling_utils import PreTrainedModel
34
+ from transformers.utils import (add_start_docstrings,
35
+ add_start_docstrings_to_model_forward, logging,
36
+ replace_return_docstrings)
37
+
38
+ try:
39
+ from transformers.generation.streamers import BaseStreamer
40
+ except: # noqa # pylint: disable=bare-except
41
+ BaseStreamer = None
42
+
43
+ from .configuration_internlm2 import InternLM2Config
44
+ import os
45
+ logger = logging.get_logger(__name__)
46
+
47
+ _CONFIG_FOR_DOC = 'InternLM2Config'
48
+
49
+ flash_attn_func, flash_attn_varlen_func = None, None
50
+ pad_input, index_first_axis, unpad_input = None, None, None
51
+ try:
52
+ from flash_attn import flash_attn_func as _flash_attn_func
53
+ from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
54
+ from flash_attn.bert_padding import index_first_axis as _index_first_axis
55
+ from flash_attn.bert_padding import pad_input as _pad_input
56
+ from flash_attn.bert_padding import unpad_input as _unpad_input
57
+
58
+ flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
59
+ pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
60
+ has_flash_attn = True
61
+ except:
62
+ has_flash_attn = False
63
+
64
+
65
+ def _import_flash_attn():
66
+ global flash_attn_func, flash_attn_varlen_func
67
+ global pad_input, index_first_axis, unpad_input
68
+ try:
69
+ from flash_attn import flash_attn_func as _flash_attn_func
70
+ from flash_attn import \
71
+ flash_attn_varlen_func as _flash_attn_varlen_func
72
+ from flash_attn.bert_padding import \
73
+ index_first_axis as _index_first_axis
74
+ from flash_attn.bert_padding import pad_input as _pad_input
75
+ from flash_attn.bert_padding import unpad_input as _unpad_input
76
+ flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
77
+ pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
78
+ except ImportError:
79
+ raise ImportError('flash_attn is not installed.')
80
+
81
+
82
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
83
+ def _get_unpad_data(attention_mask):
84
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
85
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
86
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
87
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
88
+ return (
89
+ indices,
90
+ cu_seqlens,
91
+ max_seqlen_in_batch,
92
+ )
93
+
94
+
95
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
96
+ def _make_causal_mask(
97
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
98
+ ):
99
+ """
100
+ Make causal mask used for bi-directional self-attention.
101
+ """
102
+ bsz, tgt_len = input_ids_shape
103
+ mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
104
+ mask_cond = torch.arange(mask.size(-1), device=device)
105
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
106
+ mask = mask.to(dtype)
107
+
108
+ if past_key_values_length > 0:
109
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
110
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
111
+
112
+
113
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
114
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
115
+ """
116
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
117
+ """
118
+ bsz, src_len = mask.size()
119
+ tgt_len = tgt_len if tgt_len is not None else src_len
120
+
121
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
122
+
123
+ inverted_mask = 1.0 - expanded_mask
124
+
125
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
126
+
127
+
128
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->InternLM2
129
+ class InternLM2RMSNorm(nn.Module):
130
+ def __init__(self, hidden_size, eps=1e-6):
131
+ """
132
+ InternLM2RMSNorm is equivalent to T5LayerNorm
133
+ """
134
+ super().__init__()
135
+ self.weight = nn.Parameter(torch.ones(hidden_size))
136
+ self.variance_epsilon = eps
137
+
138
+ def forward(self, hidden_states):
139
+ input_dtype = hidden_states.dtype
140
+ hidden_states = hidden_states.to(torch.float32)
141
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
142
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
143
+ return self.weight * hidden_states.to(input_dtype)
144
+
145
+
146
+ # Copied from transformers.model.llama.modeling_llama.LlamaRotaryEmbedding with Llama->InternLM2
147
+ class InternLM2RotaryEmbedding(nn.Module):
148
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
149
+ super().__init__()
150
+
151
+ self.dim = dim
152
+ self.max_position_embeddings = max_position_embeddings
153
+ self.base = base
154
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
155
+ self.register_buffer('inv_freq', inv_freq, persistent=False)
156
+
157
+ # Build here to make `torch.jit.trace` work.
158
+ self._set_cos_sin_cache(
159
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
160
+ )
161
+
162
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
163
+ self.max_seq_len_cached = seq_len
164
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
165
+
166
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq)
167
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
168
+ emb = torch.cat((freqs, freqs), dim=-1)
169
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
170
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
171
+
172
+ def forward(self, x, seq_len=None):
173
+ # x: [bs, num_attention_heads, seq_len, head_size]
174
+ if seq_len > self.max_seq_len_cached:
175
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=torch.float32)
176
+
177
+ return (
178
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
179
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
180
+ )
181
+
182
+
183
+ # Copied from transformers.model.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->InternLM2
184
+ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
185
+ """InternLM2RotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
186
+
187
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
188
+ self.scaling_factor = scaling_factor
189
+ super().__init__(dim, max_position_embeddings, base, device)
190
+
191
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
192
+ self.max_seq_len_cached = seq_len
193
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
194
+ t = t / self.scaling_factor
195
+
196
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq)
197
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
198
+ emb = torch.cat((freqs, freqs), dim=-1)
199
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
200
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
201
+
202
+
203
+ # Copied from transformers.model.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->InternLM2
204
+ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
205
+ """InternLM2RotaryEmbedding extended with Dynamic NTK scaling.
206
+ Credits to the Reddit users /u/bloc97 and /u/emozilla.
207
+ """
208
+
209
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
210
+ self.scaling_factor = scaling_factor
211
+ super().__init__(dim, max_position_embeddings, base, device)
212
+
213
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
214
+ self.max_seq_len_cached = seq_len
215
+
216
+ if seq_len > self.max_position_embeddings:
217
+ base = self.base * (
218
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
219
+ ) ** (self.dim / (self.dim - 2))
220
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
221
+ self.register_buffer('inv_freq', inv_freq, persistent=False)
222
+
223
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
224
+
225
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq)
226
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
227
+ emb = torch.cat((freqs, freqs), dim=-1)
228
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
229
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
230
+
231
+
232
+ # Copied from transformers.model.llama.modeling_llama.rotate_half
233
+ def rotate_half(x):
234
+ """Rotates half the hidden dims of the input."""
235
+ x1 = x[..., : x.shape[-1] // 2]
236
+ x2 = x[..., x.shape[-1] // 2 :]
237
+ return torch.cat((-x2, x1), dim=-1)
238
+
239
+
240
+ # Copied from transformers.model.llama.modeling_llama.apply_rotary_pos_emb
241
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
242
+ """Applies Rotary Position Embedding to the query and key tensors."""
243
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
244
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
245
+ q_embed = (q * cos) + (rotate_half(q) * sin)
246
+ k_embed = (k * cos) + (rotate_half(k) * sin)
247
+ return q_embed, k_embed
248
+
249
+
250
+ class InternLM2MLP(nn.Module):
251
+ def __init__(self, config):
252
+ super().__init__()
253
+ self.config = config
254
+ self.hidden_size = config.hidden_size
255
+ self.intermediate_size = config.intermediate_size
256
+ self.w1 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
257
+ self.w3 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
258
+ self.w2 = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
259
+ self.act_fn = ACT2FN[config.hidden_act]
260
+
261
+ def forward(self, x):
262
+ down_proj = self.w2(self.act_fn(self.w1(x)) * self.w3(x))
263
+
264
+ return down_proj
265
+
266
+
267
+ # Copied from transformers.model.llama.modeling_llama.repeat_kv
268
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
269
+ """
270
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
271
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
272
+ """
273
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
274
+ if n_rep == 1:
275
+ return hidden_states
276
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
277
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
278
+
279
+
280
+ # Modified from transformers.model.llama.modeling_llama.LlamaAttention
281
+ class InternLM2Attention(nn.Module):
282
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
283
+
284
+ def __init__(self, config: InternLM2Config):
285
+ super().__init__()
286
+ self.config = config
287
+ self.hidden_size = config.hidden_size
288
+ self.num_heads = config.num_attention_heads
289
+ self.head_dim = self.hidden_size // self.num_heads
290
+ self.num_key_value_heads = config.num_key_value_heads
291
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
292
+ self.max_position_embeddings = config.max_position_embeddings
293
+ self.is_causal = True
294
+
295
+ if (self.head_dim * self.num_heads) != self.hidden_size:
296
+ raise ValueError(
297
+ f'hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}'
298
+ f' and `num_heads`: {self.num_heads}).'
299
+ )
300
+
301
+ self.wqkv = nn.Linear(
302
+ self.hidden_size,
303
+ (self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
304
+ bias=config.bias,
305
+ )
306
+
307
+ self.wo = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
308
+ self._init_rope()
309
+
310
+ # YOPO configuration:
311
+ self.attncut = True
312
+ self.headcut = True
313
+ self.layercut = True
314
+ self.layercut_idx = 36
315
+ self.offset = 41
316
+ head_num = 24
317
+ self.mask = torch.load("headcut_mask/internvl2.0_26B/mask_"+str(head_num)+".pth")
318
+ def _init_rope(self):
319
+ if self.config.rope_scaling is None:
320
+ self.rotary_emb = InternLM2RotaryEmbedding(
321
+ self.head_dim,
322
+ max_position_embeddings=self.max_position_embeddings,
323
+ base=self.config.rope_theta,
324
+ )
325
+ else:
326
+ scaling_type = self.config.rope_scaling['type']
327
+ scaling_factor = self.config.rope_scaling['factor']
328
+ if scaling_type == 'dynamic':
329
+ self.rotary_emb = InternLM2DynamicNTKScalingRotaryEmbedding(
330
+ self.head_dim,
331
+ max_position_embeddings=self.max_position_embeddings,
332
+ base=self.config.rope_theta,
333
+ scaling_factor=scaling_factor,
334
+ )
335
+ elif scaling_type == 'linear':
336
+ self.rotary_emb = InternLM2LinearScalingRotaryEmbedding(
337
+ self.head_dim,
338
+ max_position_embeddings=self.max_position_embeddings,
339
+ base=self.config.rope_theta,
340
+ scaling_factor=scaling_factor,
341
+ )
342
+ else:
343
+ raise ValueError("Currently we only support rotary embedding's type being 'dynamic' or 'linear'.")
344
+ return self.rotary_emb
345
+
346
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
347
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
348
+
349
+ def local_mask(self, h, w, window):
350
+ height, width = h, w
351
+ num_pixels = height * width
352
+
353
+ # Generate grid of positions
354
+ rows = torch.arange(height)
355
+ cols = torch.arange(width)
356
+ grid_r, grid_c = torch.meshgrid(rows, cols, indexing='ij') # Shape: (h, w)
357
+ positions = torch.stack([grid_r.flatten(), grid_c.flatten()], dim=1) # Shape: (h*w, 2)
358
+
359
+ # Compute pairwise differences between positions
360
+ positions_i = positions.unsqueeze(1) # Shape: (h*w, 1, 2)
361
+ positions_j = positions.unsqueeze(0) # Shape: (1, h*w, 2)
362
+ delta = positions_i - positions_j # Shape: (h*w, h*w, 2)
363
+ delta_abs = delta.abs() # Absolute differences
364
+
365
+ # Create neighbor mask for the local (window x window) neighborhood
366
+ neighbor_mask = (delta_abs[..., 0] <= int((window-1)/2)) & (delta_abs[..., 1] <= int((window-1)/2)) # Shape: (h*w, h*w)
367
+
368
+ # Initialize the attention mask
369
+ attention_mask = torch.full((num_pixels, num_pixels), float('-inf'))
370
+ attention_mask[neighbor_mask] = 0.0 # Keep the local window neighborhood at 0, leave the rest at -inf
371
+ return attention_mask
372
+
373
+ def forward(
374
+ self,
375
+ hidden_states: torch.Tensor,
376
+ attention_mask: Optional[torch.Tensor] = None,
377
+ position_ids: Optional[torch.LongTensor] = None,
378
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
379
+ output_attentions: bool = False,
380
+ use_cache: bool = False,
381
+ idx: int = 0,
382
+ **kwargs,
383
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
384
+ if 'padding_mask' in kwargs:
385
+ warnings.warn(
386
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. '
387
+ 'Please make sure use `attention_mask` instead.`'
388
+ )
389
+
390
+ bsz, q_len, _ = hidden_states.size()
391
+
392
+ qkv_states = self.wqkv(hidden_states)
393
+
394
+ qkv_states = rearrange(
395
+ qkv_states,
396
+ 'b q (h gs d) -> b q h gs d',
397
+ gs=2 + self.num_key_value_groups,
398
+ d=self.head_dim,
399
+ )
400
+
401
+ query_states = qkv_states[..., : self.num_key_value_groups, :]
402
+ query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
403
+ key_states = qkv_states[..., -2, :]
404
+ value_states = qkv_states[..., -1, :]
405
+
406
+ query_states = query_states.transpose(1, 2)
407
+ key_states = key_states.transpose(1, 2)
408
+ value_states = value_states.transpose(1, 2)
409
+
410
+ kv_seq_len = key_states.shape[-2]
411
+ if past_key_value is not None:
412
+ kv_seq_len += past_key_value[0].shape[-2]
413
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
414
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
415
+
416
+ if past_key_value is not None:
417
+ # reuse k, v, self_attention
418
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
419
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
420
+
421
+ past_key_value = (key_states, value_states) if use_cache else None
422
+
423
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
424
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
425
+
426
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
427
+
428
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
429
+ raise ValueError(
430
+ f'Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is'
431
+ f' {attn_weights.size()}'
432
+ )
433
+
434
+ if attention_mask is not None:
435
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
436
+ raise ValueError(
437
+ f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
438
+ )
439
+ attn_weights = attn_weights + attention_mask
440
+
441
+ # upcast attention to fp32
442
+
443
+ image_token_num = int(os.environ.get('IMAGE_TOKEN_NUM'))
444
+
445
+
446
+ # YOPO implementation
447
+
448
+ if self.attncut:
449
+ h = int(int(os.environ.get('IMAGE_H'))/2)
450
+ if attn_weights.shape[2]>image_token_num:
451
+ self.mask_local = self.local_mask(h, h, int(h/2)) # 1/4 window
452
+ mask = attn_weights.clone()*0
453
+ temp = mask[:,:,self.offset:self.offset+image_token_num,self.offset:self.offset+image_token_num]
454
+ temp = temp.reshape(temp.shape[0],48, int(temp.shape[2]/(h*h)),h*h,int(temp.shape[2]/(h*h)),h*h)
455
+ temp2 = self.mask_local.unsqueeze(1).unsqueeze(0).unsqueeze(0).unsqueeze(0)
456
+ temp[:,:,:,:,:,:]=temp2.cuda()
457
+ attn_weights = attn_weights + mask
458
+
459
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
460
+
461
+ if self.headcut:
462
+ if idx>=2:
463
+ mask = self.mask[idx].unsqueeze(1).unsqueeze(1).unsqueeze(0).cuda()
464
+ attn_weights[:,:,:,self.offset:image_token_num]= attn_weights[:,:,:,self.offset:image_token_num] * mask
465
+
466
+ if self.layercut and idx>=self.layercut_idx:
467
+ if attn_weights.shape[2]>image_token_num:
468
+ attn_weights[:,:,image_token_num+self.offset:,self.offset:self.offset+image_token_num]=0
469
+ else:
470
+ attn_weights[:,:,:,self.offset:self.offset+image_token_num]=0
471
+
472
+ attn_output = torch.matmul(attn_weights, value_states)
473
+
474
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
475
+ raise ValueError(
476
+ f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is'
477
+ f' {attn_output.size()}'
478
+ )
479
+
480
+ attn_output = attn_output.transpose(1, 2).contiguous()
481
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
482
+
483
+ attn_output = self.wo(attn_output)
484
+
485
+ if not output_attentions:
486
+ attn_weights = None
487
+
488
+ return attn_output, attn_weights, past_key_value
489
+
490
+
491
+ # Modified from transformers.model.llama.modeling_llama.InternLM2FlashAttention2
492
+ class InternLM2FlashAttention2(InternLM2Attention):
493
+ """
494
+ InternLM2 flash attention module. This module inherits from `InternLM2Attention` as the weights of the module stays
495
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
496
+ flash attention and deal with padding tokens in case the input contains any of them.
497
+ """
498
+
499
+ def forward(
500
+ self,
501
+ hidden_states: torch.Tensor,
502
+ attention_mask: Optional[torch.LongTensor] = None,
503
+ position_ids: Optional[torch.LongTensor] = None,
504
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
505
+ output_attentions: bool = False,
506
+ use_cache: bool = False,
507
+ **kwargs,
508
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
509
+ # InternLM2FlashAttention2 attention does not support output_attentions
510
+ if 'padding_mask' in kwargs:
511
+ warnings.warn(
512
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. '
513
+ 'Please make sure to use `attention_mask` instead.'
514
+ )
515
+
516
+ # overwrite attention_mask with padding_mask
517
+ attention_mask = kwargs.pop('padding_mask')
518
+
519
+ output_attentions = False
520
+
521
+ bsz, q_len, _ = hidden_states.size()
522
+
523
+ qkv_states = self.wqkv(hidden_states)
524
+
525
+ qkv_states = rearrange(
526
+ qkv_states,
527
+ 'b q (h gs d) -> b q h gs d',
528
+ gs=2 + self.num_key_value_groups,
529
+ d=self.head_dim,
530
+ )
531
+
532
+ query_states = qkv_states[..., : self.num_key_value_groups, :]
533
+ query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
534
+ key_states = qkv_states[..., -2, :]
535
+ value_states = qkv_states[..., -1, :]
536
+
537
+ query_states = query_states.transpose(1, 2)
538
+ key_states = key_states.transpose(1, 2)
539
+ value_states = value_states.transpose(1, 2)
540
+
541
+ kv_seq_len = key_states.shape[-2]
542
+ if past_key_value is not None:
543
+ kv_seq_len += past_key_value[0].shape[-2]
544
+
545
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
546
+
547
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
548
+
549
+ if past_key_value is not None:
550
+ # reuse k, v, self_attention
551
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
552
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
553
+
554
+ past_key_value = (key_states, value_states) if use_cache else None
555
+
556
+ query_states = query_states.transpose(1, 2)
557
+ key_states = key_states.transpose(1, 2)
558
+ value_states = value_states.transpose(1, 2)
559
+
560
+ attn_output = self._flash_attention_forward(
561
+ query_states, key_states, value_states, attention_mask, q_len
562
+ )
563
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
564
+ attn_output = self.wo(attn_output)
565
+
566
+ if not output_attentions:
567
+ attn_weights = None
568
+
569
+ return attn_output, attn_weights, past_key_value
570
+
571
+ def _flash_attention_forward(
572
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
573
+ ):
574
+ """
575
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
576
+ first unpad the input, then computes the attention scores and pad the final attention scores.
577
+
578
+ Args:
579
+ query_states (`torch.Tensor`):
580
+ Input query states to be passed to Flash Attention API
581
+ key_states (`torch.Tensor`):
582
+ Input key states to be passed to Flash Attention API
583
+ value_states (`torch.Tensor`):
584
+ Input value states to be passed to Flash Attention API
585
+ attention_mask (`torch.Tensor`):
586
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
587
+ position of padding tokens and 1 for the position of non-padding tokens.
588
+ dropout (`int`, *optional*):
589
+ Attention dropout
590
+ softmax_scale (`float`, *optional*):
591
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
592
+ """
593
+ # Contains at least one padding token in the sequence
594
+ causal = self.is_causal and query_length != 1
595
+ if attention_mask is not None:
596
+ batch_size = query_states.shape[0]
597
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._unpad_input(
598
+ query_states, key_states, value_states, attention_mask, query_length
599
+ )
600
+
601
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
602
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
603
+
604
+ attn_output_unpad = flash_attn_varlen_func(
605
+ query_states,
606
+ key_states,
607
+ value_states,
608
+ cu_seqlens_q=cu_seqlens_q,
609
+ cu_seqlens_k=cu_seqlens_k,
610
+ max_seqlen_q=max_seqlen_in_batch_q,
611
+ max_seqlen_k=max_seqlen_in_batch_k,
612
+ dropout_p=dropout,
613
+ softmax_scale=softmax_scale,
614
+ causal=causal,
615
+ )
616
+
617
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
618
+ else:
619
+ attn_output = flash_attn_func(
620
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
621
+ )
622
+
623
+ return attn_output
624
+
625
+ def _unpad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
626
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
627
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
628
+
629
+ key_layer = index_first_axis(
630
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
631
+ )
632
+ value_layer = index_first_axis(
633
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
634
+ )
635
+
636
+ if query_length == kv_seq_len:
637
+ query_layer = index_first_axis(
638
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
639
+ )
640
+ cu_seqlens_q = cu_seqlens_k
641
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
642
+ indices_q = indices_k
643
+ elif query_length == 1:
644
+ max_seqlen_in_batch_q = 1
645
+ cu_seqlens_q = torch.arange(
646
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
647
+ ) # There is a memcpy here, that is very bad.
648
+ indices_q = cu_seqlens_q[:-1]
649
+ query_layer = query_layer.squeeze(1)
650
+ else:
651
+ # The -q_len: slice assumes left padding.
652
+ attention_mask = attention_mask[:, -query_length:]
653
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
654
+
655
+ return (
656
+ query_layer,
657
+ key_layer,
658
+ value_layer,
659
+ indices_q.to(torch.int64),
660
+ (cu_seqlens_q, cu_seqlens_k),
661
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
662
+ )
663
+
664
+
665
+ INTERNLM2_ATTENTION_CLASSES = {
666
+ 'eager': InternLM2Attention,
667
+ 'flash_attention_2': InternLM2FlashAttention2,
668
+ }
669
+
670
+
671
+ # Modified from transformers.model.llama.modeling_llama.LlamaDecoderLayer
672
+ class InternLM2DecoderLayer(nn.Module):
673
+ def __init__(self, config: InternLM2Config):
674
+ super().__init__()
675
+ self.hidden_size = config.hidden_size
676
+
677
+ self.attention = INTERNLM2_ATTENTION_CLASSES[config.attn_implementation](config=config)
678
+
679
+ self.feed_forward = InternLM2MLP(config)
680
+ self.attention_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
681
+ self.ffn_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
682
+
683
+ def forward(
684
+ self,
685
+ hidden_states: torch.Tensor,
686
+ attention_mask: Optional[torch.Tensor] = None,
687
+ position_ids: Optional[torch.LongTensor] = None,
688
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
689
+ output_attentions: Optional[bool] = False,
690
+ use_cache: Optional[bool] = False,
691
+ idx: Optional[int] = 0,
692
+ **kwargs,
693
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
694
+ """
695
+ Args:
696
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
697
+ attention_mask (`torch.FloatTensor`, *optional*):
698
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
699
+ query_sequence_length, key_sequence_length)` if default attention is used.
700
+ output_attentions (`bool`, *optional*):
701
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
702
+ returned tensors for more detail.
703
+ use_cache (`bool`, *optional*):
704
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
705
+ (see `past_key_values`).
706
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
707
+ """
708
+ if 'padding_mask' in kwargs:
709
+ warnings.warn(
710
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. '
711
+ 'Please make sure to use `attention_mask` instead.'
712
+ )
713
+
714
+ residual = hidden_states
715
+
716
+ hidden_states = self.attention_norm(hidden_states)
717
+
718
+ # Self Attention
719
+ hidden_states, self_attn_weights, present_key_value = self.attention(
720
+ hidden_states=hidden_states,
721
+ attention_mask=attention_mask,
722
+ position_ids=position_ids,
723
+ past_key_value=past_key_value,
724
+ output_attentions=output_attentions,
725
+ use_cache=use_cache,
726
+ idx = idx,
727
+ **kwargs,
728
+ )
729
+ hidden_states = residual + hidden_states
730
+
731
+ # Fully Connected
732
+ residual = hidden_states
733
+ hidden_states = self.ffn_norm(hidden_states)
734
+ hidden_states = self.feed_forward(hidden_states)
735
+ hidden_states = residual + hidden_states
736
+
737
+ outputs = (hidden_states,)
738
+
739
+ if output_attentions:
740
+ outputs += (self_attn_weights,)
741
+
742
+ if use_cache:
743
+ outputs += (present_key_value,)
744
+
745
+ return outputs
746
+
747
+
748
+ InternLM2_START_DOCSTRING = r"""
749
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
750
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
751
+ etc.)
752
+
753
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
754
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
755
+ and behavior.
756
+
757
+ Parameters:
758
+ config ([`InternLM2Config`]):
759
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
760
+ load the weights associated with the model, only the configuration. Check out the
761
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
762
+ """
763
+
764
+
765
+ # Copied from transformers.models.llama.modeling_llama.LlamaPreTrainedModel with Llama->InternLM2
766
+ @add_start_docstrings(
767
+ 'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
768
+ InternLM2_START_DOCSTRING,
769
+ )
770
+ class InternLM2PreTrainedModel(PreTrainedModel):
771
+ config_class = InternLM2Config
772
+ base_model_prefix = 'model'
773
+ supports_gradient_checkpointing = True
774
+ _no_split_modules = ['InternLM2DecoderLayer']
775
+ _skip_keys_device_placement = 'past_key_values'
776
+ _supports_flash_attn_2 = True
777
+
778
+ def _init_weights(self, module):
779
+ std = self.config.initializer_range
780
+ if isinstance(module, nn.Linear):
781
+ module.weight.data.normal_(mean=0.0, std=std)
782
+ if module.bias is not None:
783
+ module.bias.data.zero_()
784
+ elif isinstance(module, nn.Embedding):
785
+ module.weight.data.normal_(mean=0.0, std=std)
786
+ if module.padding_idx is not None:
787
+ module.weight.data[module.padding_idx].zero_()
788
+
789
+
790
+ InternLM2_INPUTS_DOCSTRING = r"""
791
+ Args:
792
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
793
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
794
+ it.
795
+
796
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
797
+ [`PreTrainedTokenizer.__call__`] for details.
798
+
799
+ [What are input IDs?](../glossary#input-ids)
800
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
801
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
802
+
803
+ - 1 for tokens that are **not masked**,
804
+ - 0 for tokens that are **masked**.
805
+
806
+ [What are attention masks?](../glossary#attention-mask)
807
+
808
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
809
+ [`PreTrainedTokenizer.__call__`] for details.
810
+
811
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
812
+ `past_key_values`).
813
+
814
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
815
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
816
+ information on the default strategy.
817
+
818
+ - 1 indicates the head is **not masked**,
819
+ - 0 indicates the head is **masked**.
820
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
821
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
822
+ config.n_positions - 1]`.
823
+
824
+ [What are position IDs?](../glossary#position-ids)
825
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or
826
+ when `config.use_cache=True`):
827
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
828
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
829
+ `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)`.
830
+
831
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
832
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
833
+
834
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
835
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
836
+ of shape `(batch_size, sequence_length)`.
837
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
838
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
839
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
840
+ model's internal embedding lookup matrix.
841
+ use_cache (`bool`, *optional*):
842
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
843
+ `past_key_values`).
844
+ output_attentions (`bool`, *optional*):
845
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
846
+ tensors for more detail.
847
+ output_hidden_states (`bool`, *optional*):
848
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
849
+ more detail.
850
+ return_dict (`bool`, *optional*):
851
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
852
+ """
853
+
854
+
855
+ # Modified from transformers.model.llama.modeling_llama.LlamaModel
856
+ @add_start_docstrings(
857
+ 'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
858
+ InternLM2_START_DOCSTRING,
859
+ )
860
+ class InternLM2Model(InternLM2PreTrainedModel):
861
+ """
862
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`InternLM2DecoderLayer`]
863
+
864
+ Args:
865
+ config: InternLM2Config
866
+ """
867
+
868
+ _auto_class = 'AutoModel'
869
+
870
+ def __init__(self, config: InternLM2Config):
871
+ super().__init__(config)
872
+ self.padding_idx = config.pad_token_id
873
+ self.vocab_size = config.vocab_size
874
+ self.config = config
875
+ if not has_flash_attn:
876
+ self.config.attn_implementation = 'eager'
877
+ print('Warning: Flash attention is not available, using eager attention instead.')
878
+
879
+ self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
880
+
881
+ self.layers = nn.ModuleList([InternLM2DecoderLayer(config) for _ in range(config.num_hidden_layers)])
882
+ self.norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
883
+
884
+ self.gradient_checkpointing = False
885
+ # Initialize weights and apply final processing
886
+ self.post_init()
887
+
888
+ def get_input_embeddings(self):
889
+ return self.tok_embeddings
890
+
891
+ def set_input_embeddings(self, value):
892
+ self.tok_embeddings = value
893
+
894
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
895
+ # create causal mask
896
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
897
+ combined_attention_mask = None
898
+ if input_shape[-1] > 1:
899
+ combined_attention_mask = _make_causal_mask(
900
+ input_shape,
901
+ inputs_embeds.dtype,
902
+ device=inputs_embeds.device,
903
+ past_key_values_length=past_key_values_length,
904
+ )
905
+
906
+ if attention_mask is not None:
907
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
908
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
909
+ inputs_embeds.device
910
+ )
911
+ combined_attention_mask = (
912
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
913
+ )
914
+
915
+ return combined_attention_mask
916
+
917
+ @add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
918
+ def forward(
919
+ self,
920
+ input_ids: torch.LongTensor = None,
921
+ attention_mask: Optional[torch.Tensor] = None,
922
+ position_ids: Optional[torch.LongTensor] = None,
923
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
924
+ inputs_embeds: Optional[torch.FloatTensor] = None,
925
+ use_cache: Optional[bool] = None,
926
+ output_attentions: Optional[bool] = None,
927
+ output_hidden_states: Optional[bool] = None,
928
+ return_dict: Optional[bool] = None,
929
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
930
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
931
+ output_hidden_states = (
932
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
933
+ )
934
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
935
+
936
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
937
+
938
+ if self.config.attn_implementation == 'flash_attention_2':
939
+ _import_flash_attn()
940
+
941
+ # retrieve input_ids and inputs_embeds
942
+ if input_ids is not None and inputs_embeds is not None:
943
+ raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
944
+ elif input_ids is not None:
945
+ batch_size, seq_length = input_ids.shape[:2]
946
+ elif inputs_embeds is not None:
947
+ batch_size, seq_length = inputs_embeds.shape[:2]
948
+ else:
949
+ raise ValueError('You have to specify either input_ids or inputs_embeds')
950
+
951
+ seq_length_with_past = seq_length
952
+ past_key_values_length = 0
953
+ if past_key_values is not None:
954
+ past_key_values_length = past_key_values[0][0].shape[2]
955
+ seq_length_with_past = seq_length_with_past + past_key_values_length
956
+
957
+ if position_ids is None:
958
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
959
+ position_ids = torch.arange(
960
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
961
+ )
962
+ position_ids = position_ids.unsqueeze(0)
963
+
964
+ if inputs_embeds is None:
965
+ inputs_embeds = self.tok_embeddings(input_ids)
966
+
967
+ if self.config.attn_implementation == 'flash_attention_2':
968
+ # 2d mask is passed through the layers
969
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
970
+ else:
971
+ if attention_mask is None:
972
+ attention_mask = torch.ones(
973
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
974
+ )
975
+ attention_mask = self._prepare_decoder_attention_mask(
976
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
977
+ )
978
+
979
+ # embed positions
980
+ hidden_states = inputs_embeds
981
+
982
+ if self.gradient_checkpointing and self.training:
983
+ if use_cache:
984
+ logger.warning_once(
985
+ '`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...'
986
+ )
987
+ use_cache = False
988
+
989
+ # decoder layers
990
+ all_hidden_states = () if output_hidden_states else None
991
+ all_self_attns = () if output_attentions else None
992
+ next_decoder_cache = () if use_cache else None
993
+
994
+ for idx, decoder_layer in enumerate(self.layers):
995
+ if output_hidden_states:
996
+ all_hidden_states += (hidden_states,)
997
+
998
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
999
+
1000
+ if self.gradient_checkpointing and self.training:
1001
+
1002
+ def create_custom_forward(module):
1003
+ def custom_forward(*inputs):
1004
+ # None for past_key_value
1005
+ return module(*inputs, output_attentions, None)
1006
+
1007
+ return custom_forward
1008
+
1009
+ layer_outputs = torch.utils.checkpoint.checkpoint(
1010
+ create_custom_forward(decoder_layer),
1011
+ hidden_states,
1012
+ attention_mask,
1013
+ position_ids,
1014
+ None,
1015
+ )
1016
+ else:
1017
+ layer_outputs = decoder_layer(
1018
+ hidden_states,
1019
+ attention_mask=attention_mask,
1020
+ position_ids=position_ids,
1021
+ past_key_value=past_key_value,
1022
+ output_attentions=output_attentions,
1023
+ use_cache=use_cache,
1024
+ idx=idx,
1025
+ )
1026
+
1027
+ hidden_states = layer_outputs[0]
1028
+
1029
+ if use_cache:
1030
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
1031
+
1032
+ if output_attentions:
1033
+ all_self_attns += (layer_outputs[1],)
1034
+
1035
+ hidden_states = self.norm(hidden_states)
1036
+
1037
+ # add hidden states from the last decoder layer
1038
+ if output_hidden_states:
1039
+ all_hidden_states += (hidden_states,)
1040
+
1041
+ next_cache = next_decoder_cache if use_cache else None
1042
+ if not return_dict:
1043
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1044
+ return BaseModelOutputWithPast(
1045
+ last_hidden_state=hidden_states,
1046
+ past_key_values=next_cache,
1047
+ hidden_states=all_hidden_states,
1048
+ attentions=all_self_attns,
1049
+ )
1050
+
1051
+
1052
+ # Modified from transformers.model.llama.modeling_llama.LlamaForCausalLM
1053
+ class InternLM2ForCausalLM(InternLM2PreTrainedModel):
1054
+ _auto_class = 'AutoModelForCausalLM'
1055
+
1056
+ _tied_weights_keys = ['output.weight']
1057
+
1058
+ def __init__(self, config):
1059
+ super().__init__(config)
1060
+ self.model = InternLM2Model(config)
1061
+ self.vocab_size = config.vocab_size
1062
+ self.output = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1063
+
1064
+ # Initialize weights and apply final processing
1065
+ self.post_init()
1066
+
1067
+ def get_input_embeddings(self):
1068
+ return self.model.tok_embeddings
1069
+
1070
+ def set_input_embeddings(self, value):
1071
+ self.model.tok_embeddings = value
1072
+
1073
+ def get_output_embeddings(self):
1074
+ return self.output
1075
+
1076
+ def set_output_embeddings(self, new_embeddings):
1077
+ self.output = new_embeddings
1078
+
1079
+ def set_decoder(self, decoder):
1080
+ self.model = decoder
1081
+
1082
+ def get_decoder(self):
1083
+ return self.model
1084
+
1085
+ @add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
1086
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1087
+ def forward(
1088
+ self,
1089
+ input_ids: torch.LongTensor = None,
1090
+ attention_mask: Optional[torch.Tensor] = None,
1091
+ position_ids: Optional[torch.LongTensor] = None,
1092
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1093
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1094
+ labels: Optional[torch.LongTensor] = None,
1095
+ use_cache: Optional[bool] = None,
1096
+ output_attentions: Optional[bool] = None,
1097
+ output_hidden_states: Optional[bool] = None,
1098
+ return_dict: Optional[bool] = None,
1099
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1100
+ r"""
1101
+ Args:
1102
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1103
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1104
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1105
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1106
+
1107
+ Returns:
1108
+
1109
+ Example:
1110
+
1111
+ ```python
1112
+ >>> from transformers import AutoTokenizer, InternLM2ForCausalLM
1113
+
1114
+ >>> model = InternLM2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1115
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1116
+
1117
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1118
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1119
+
1120
+ >>> # Generate
1121
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1122
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1123
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1124
+ ```"""
1125
+
1126
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1127
+ output_hidden_states = (
1128
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1129
+ )
1130
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1131
+
1132
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1133
+ outputs = self.model(
1134
+ input_ids=input_ids,
1135
+ attention_mask=attention_mask,
1136
+ position_ids=position_ids,
1137
+ past_key_values=past_key_values,
1138
+ inputs_embeds=inputs_embeds,
1139
+ use_cache=use_cache,
1140
+ output_attentions=output_attentions,
1141
+ output_hidden_states=output_hidden_states,
1142
+ return_dict=return_dict,
1143
+ )
1144
+
1145
+ hidden_states = outputs[0]
1146
+ logits = self.output(hidden_states)
1147
+ logits = logits.float()
1148
+
1149
+ loss = None
1150
+ if labels is not None:
1151
+ # Shift so that tokens < n predict n
1152
+ shift_logits = logits[..., :-1, :].contiguous()
1153
+ shift_labels = labels[..., 1:].contiguous()
1154
+ # Flatten the tokens
1155
+ loss_fct = CrossEntropyLoss()
1156
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1157
+ shift_labels = shift_labels.view(-1)
1158
+ # Enable model parallelism
1159
+ shift_labels = shift_labels.to(shift_logits.device)
1160
+ loss = loss_fct(shift_logits, shift_labels)
1161
+
1162
+ if not return_dict:
1163
+ output = (logits,) + outputs[1:]
1164
+ return (loss,) + output if loss is not None else output
1165
+
1166
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1167
+ output = CausalLMOutputWithPast(
1168
+ loss=loss,
1169
+ logits=logits,
1170
+ past_key_values=outputs.past_key_values,
1171
+ hidden_states=outputs.hidden_states,
1172
+ attentions=outputs.attentions,
1173
+ )
1174
+ output['logits'] = output['logits'].to(device)
1175
+ return output
1176
+
1177
+ def prepare_inputs_for_generation(
1178
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1179
+ ):
1180
+ if past_key_values is not None:
1181
+ past_length = past_key_values[0][0].shape[2]
1182
+
1183
+ # Some generation methods already pass only the last input ID
1184
+ if input_ids.shape[1] > past_length:
1185
+ remove_prefix_length = past_length
1186
+ else:
1187
+ # Default to old behavior: keep only final ID
1188
+ remove_prefix_length = input_ids.shape[1] - 1
1189
+
1190
+ input_ids = input_ids[:, remove_prefix_length:]
1191
+
1192
+ position_ids = kwargs.get('position_ids', None)
1193
+ if attention_mask is not None and position_ids is None:
1194
+ # create position_ids on the fly for batch generation
1195
+ position_ids = attention_mask.long().cumsum(-1) - 1
1196
+ position_ids.masked_fill_(attention_mask == 0, 1)
1197
+ if past_key_values:
1198
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1199
+
1200
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1201
+ if inputs_embeds is not None and past_key_values is None:
1202
+ model_inputs = {'inputs_embeds': inputs_embeds}
1203
+ else:
1204
+ model_inputs = {'input_ids': input_ids}
1205
+
1206
+ model_inputs.update(
1207
+ {
1208
+ 'position_ids': position_ids,
1209
+ 'past_key_values': past_key_values,
1210
+ 'use_cache': kwargs.get('use_cache'),
1211
+ 'attention_mask': attention_mask,
1212
+ }
1213
+ )
1214
+ return model_inputs
1215
+
1216
+ @staticmethod
1217
+ def _reorder_cache(past_key_values, beam_idx):
1218
+ reordered_past = ()
1219
+ for layer_past in past_key_values:
1220
+ reordered_past += (
1221
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1222
+ )
1223
+ return reordered_past
1224
+
1225
+ def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = [], meta_instruction=''):
1226
+ if tokenizer.add_bos_token:
1227
+ prompt = ''
1228
+ else:
1229
+ prompt = tokenizer.bos_token
1230
+ if meta_instruction:
1231
+ prompt += f"""<|im_start|>system\n{meta_instruction}<|im_end|>\n"""
1232
+ for record in history:
1233
+ prompt += f"""<|im_start|>user\n{record[0]}<|im_end|>\n<|im_start|>assistant\n{record[1]}<|im_end|>\n"""
1234
+ prompt += f"""<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"""
1235
+ return tokenizer([prompt], return_tensors='pt')
1236
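For reference, a minimal sketch of the ChatML-style prompt that `build_inputs` assembles for a single turn (a restatement of the string formatting above; the explicit BOS prefix is only added when the tokenizer does not insert one itself):

```python
# Illustrative only: the prompt string produced for one user turn with no prior history.
meta_instruction = "..."  # placeholder system prompt
query = "Hello!"
prompt = (
    f"<|im_start|>system\n{meta_instruction}<|im_end|>\n"
    f"<|im_start|>user\n{query}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)
```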
+
1237
+ @torch.no_grad()
1238
+ def chat(
1239
+ self,
1240
+ tokenizer,
1241
+ query: str,
1242
+ history: List[Tuple[str, str]] = [],
1243
+ streamer: Optional[BaseStreamer] = None,
1244
+ max_new_tokens: int = 1024,
1245
+ do_sample: bool = True,
1246
+ temperature: float = 0.8,
1247
+ top_p: float = 0.8,
1248
+ meta_instruction: str = 'You are an AI assistant whose name is InternLM (书生·浦语).\n'
1249
+ '- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n'
1250
+ '- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.',
1251
+ **kwargs,
1252
+ ):
1253
+ inputs = self.build_inputs(tokenizer, query, history, meta_instruction)
1254
+ inputs = {k: v.to(self.device) for k, v in inputs.items() if torch.is_tensor(v)}
1255
+ # also add end-of-assistant token in eos token id to avoid unnecessary generation
1256
+ eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(['<|im_end|>'])[0]]
1257
+ outputs = self.generate(
1258
+ **inputs,
1259
+ streamer=streamer,
1260
+ max_new_tokens=max_new_tokens,
1261
+ do_sample=do_sample,
1262
+ temperature=temperature,
1263
+ top_p=top_p,
1264
+ eos_token_id=eos_token_id,
1265
+ **kwargs,
1266
+ )
1267
+ outputs = outputs[0].cpu().tolist()[len(inputs['input_ids'][0]) :]
1268
+ response = tokenizer.decode(outputs, skip_special_tokens=True)
1269
+ response = response.split('<|im_end|>')[0]
1270
+ history = history + [(query, response)]
1271
+ return response, history
1272
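A minimal usage sketch for the `chat` helper above, assuming a checkpoint directory (the path below is a placeholder) that ships this modeling file and is loaded with `trust_remote_code=True`:

```python
# Illustrative only; "path/to/internlm2-checkpoint" is a placeholder, not a real repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/internlm2-checkpoint", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/internlm2-checkpoint", torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

response, history = model.chat(tokenizer, "Hello!", history=[])
print(response)
```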
+
1273
+ @torch.no_grad()
1274
+ def stream_chat(
1275
+ self,
1276
+ tokenizer,
1277
+ query: str,
1278
+ history: List[Tuple[str, str]] = [],
1279
+ max_new_tokens: int = 1024,
1280
+ do_sample: bool = True,
1281
+ temperature: float = 0.8,
1282
+ top_p: float = 0.8,
1283
+ **kwargs,
1284
+ ):
1285
+ """
1286
+ Return a generator in format: (response, history)
1287
+ Eg.
1288
+ ('你好,有什么可以帮助您的吗', [('你好', '你好,有什么可以帮助您的吗')])
1289
+ ('你好,有什么可以帮助您的吗?', [('你好', '你好,有什么可以帮助您的吗?')])
1290
+ """
1291
+ if BaseStreamer is None:
1292
+ raise ModuleNotFoundError(
1293
+ 'The version of `transformers` is too low. Please make sure '
1294
+ 'that you have installed `transformers>=4.28.0`.'
1295
+ )
1296
+
1297
+ response_queue = queue.Queue(maxsize=20)
1298
+
1299
+ class ChatStreamer(BaseStreamer):
1300
+ def __init__(self, tokenizer) -> None:
1301
+ super().__init__()
1302
+ self.tokenizer = tokenizer
1303
+ self.queue = response_queue
1304
+ self.query = query
1305
+ self.history = history
1306
+ self.response = ''
1307
+ self.cache = []
1308
+ self.received_inputs = False
1309
+ self.queue.put((self.response, history + [(self.query, self.response)]))
1310
+
1311
+ def put(self, value):
1312
+ if len(value.shape) > 1 and value.shape[0] > 1:
1313
+ raise ValueError('ChatStreamer only supports batch size 1')
1314
+ elif len(value.shape) > 1:
1315
+ value = value[0]
1316
+
1317
+ if not self.received_inputs:
1318
+ # The first received value is input_ids, ignore here
1319
+ self.received_inputs = True
1320
+ return
1321
+
1322
+ self.cache.extend(value.tolist())
1323
+ token = self.tokenizer.decode(self.cache, skip_special_tokens=True)
1324
+ if token.strip() != '<|im_end|>':
1325
+ self.response = self.response + token
1326
+ history = self.history + [(self.query, self.response)]
1327
+ self.queue.put((self.response, history))
1328
+ self.cache = []
1329
+ else:
1330
+ self.end()
1331
+
1332
+ def end(self):
1333
+ self.queue.put(None)
1334
+
1335
+ def stream_producer():
1336
+ return self.chat(
1337
+ tokenizer=tokenizer,
1338
+ query=query,
1339
+ streamer=ChatStreamer(tokenizer=tokenizer),
1340
+ history=history,
1341
+ max_new_tokens=max_new_tokens,
1342
+ do_sample=do_sample,
1343
+ temperature=temperature,
1344
+ top_p=top_p,
1345
+ **kwargs,
1346
+ )
1347
+
1348
+ def consumer():
1349
+ producer = threading.Thread(target=stream_producer)
1350
+ producer.start()
1351
+ while True:
1352
+ res = response_queue.get()
1353
+ if res is None:
1354
+ return
1355
+ yield res
1356
+
1357
+ return consumer()
1358
+
1359
+
1360
+ # Copied from transformers.model.llama.modeling_llama.LlamaForSequenceClassification with Llama->InternLM2
1361
+ @add_start_docstrings(
1362
+ """
1363
+ The InternLM2 Model transformer with a sequence classification head on top (linear layer).
1364
+
1365
+ [`InternLM2ForSequenceClassification`] uses the last token in order to do the classification,
1366
+ as other causal models (e.g. GPT-2) do.
1367
+
1368
+ Since it does classification on the last token, it requires to know the position of the last token. If a
1369
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1370
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1371
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1372
+ each row of the batch).
1373
+ """,
1374
+ InternLM2_START_DOCSTRING,
1375
+ )
1376
+ class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
1377
+ def __init__(self, config):
1378
+ super().__init__(config)
1379
+ self.num_labels = config.num_labels
1380
+ self.model = InternLM2Model(config)
1381
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1382
+
1383
+ # Initialize weights and apply final processing
1384
+ self.post_init()
1385
+
1386
+ def get_input_embeddings(self):
1387
+ return self.model.tok_embeddings
1388
+
1389
+ def set_input_embeddings(self, value):
1390
+ self.model.tok_embeddings = value
1391
+
1392
+ @add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
1393
+ def forward(
1394
+ self,
1395
+ input_ids: torch.LongTensor = None,
1396
+ attention_mask: Optional[torch.Tensor] = None,
1397
+ position_ids: Optional[torch.LongTensor] = None,
1398
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1399
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1400
+ labels: Optional[torch.LongTensor] = None,
1401
+ use_cache: Optional[bool] = None,
1402
+ output_attentions: Optional[bool] = None,
1403
+ output_hidden_states: Optional[bool] = None,
1404
+ return_dict: Optional[bool] = None,
1405
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1406
+ r"""
1407
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1408
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1409
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1410
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1411
+ """
1412
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1413
+
1414
+ transformer_outputs = self.model(
1415
+ input_ids,
1416
+ attention_mask=attention_mask,
1417
+ position_ids=position_ids,
1418
+ past_key_values=past_key_values,
1419
+ inputs_embeds=inputs_embeds,
1420
+ use_cache=use_cache,
1421
+ output_attentions=output_attentions,
1422
+ output_hidden_states=output_hidden_states,
1423
+ return_dict=return_dict,
1424
+ )
1425
+ hidden_states = transformer_outputs[0]
1426
+ logits = self.score(hidden_states)
1427
+
1428
+ if input_ids is not None:
1429
+ batch_size = input_ids.shape[0]
1430
+ else:
1431
+ batch_size = inputs_embeds.shape[0]
1432
+
1433
+ if self.config.pad_token_id is None and batch_size != 1:
1434
+ raise ValueError('Cannot handle batch sizes > 1 if no padding token is defined.')
1435
+ if self.config.pad_token_id is None:
1436
+ sequence_lengths = -1
1437
+ else:
1438
+ if input_ids is not None:
1439
+ sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
1440
+ logits.device
1441
+ )
1442
+ else:
1443
+ sequence_lengths = -1
1444
+
1445
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1446
+
1447
+ loss = None
1448
+ if labels is not None:
1449
+ labels = labels.to(logits.device)
1450
+ if self.config.problem_type is None:
1451
+ if self.num_labels == 1:
1452
+ self.config.problem_type = 'regression'
1453
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1454
+ self.config.problem_type = 'single_label_classification'
1455
+ else:
1456
+ self.config.problem_type = 'multi_label_classification'
1457
+
1458
+ if self.config.problem_type == 'regression':
1459
+ loss_fct = MSELoss()
1460
+ if self.num_labels == 1:
1461
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1462
+ else:
1463
+ loss = loss_fct(pooled_logits, labels)
1464
+ elif self.config.problem_type == 'single_label_classification':
1465
+ loss_fct = CrossEntropyLoss()
1466
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1467
+ elif self.config.problem_type == 'multi_label_classification':
1468
+ loss_fct = BCEWithLogitsLoss()
1469
+ loss = loss_fct(pooled_logits, labels)
1470
+ if not return_dict:
1471
+ output = (pooled_logits,) + transformer_outputs[1:]
1472
+ return ((loss,) + output) if loss is not None else output
1473
+
1474
+ return SequenceClassifierOutputWithPast(
1475
+ loss=loss,
1476
+ logits=pooled_logits,
1477
+ past_key_values=transformer_outputs.past_key_values,
1478
+ hidden_states=transformer_outputs.hidden_states,
1479
+ attentions=transformer_outputs.attentions,
1480
+ )
modeling_internlm2_temp.py ADDED
@@ -0,0 +1,1478 @@
1
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
2
+ #
3
+ # This code is based on transformers/src/transformers/models/llama/modeling_llama.py
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ PyTorch InternLM2 model."""
17
+ import math
18
+ import queue
19
+ import threading
20
+ import warnings
21
+ from typing import List, Optional, Tuple, Union
22
+
23
+ import torch
24
+ import torch.nn.functional as F
25
+ import torch.utils.checkpoint
26
+ from einops import rearrange
27
+ from torch import nn
28
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
29
+ from transformers.activations import ACT2FN
30
+ from transformers.modeling_outputs import (BaseModelOutputWithPast,
31
+ CausalLMOutputWithPast,
32
+ SequenceClassifierOutputWithPast)
33
+ from transformers.modeling_utils import PreTrainedModel
34
+ from transformers.utils import (add_start_docstrings,
35
+ add_start_docstrings_to_model_forward, logging,
36
+ replace_return_docstrings)
37
+
38
+ try:
39
+ from transformers.generation.streamers import BaseStreamer
40
+ except: # noqa # pylint: disable=bare-except
41
+ BaseStreamer = None
42
+
43
+ from .configuration_internlm2 import InternLM2Config
44
+ import os
45
+ logger = logging.get_logger(__name__)
46
+
47
+ _CONFIG_FOR_DOC = 'InternLM2Config'
48
+
49
+ flash_attn_func, flash_attn_varlen_func = None, None
50
+ pad_input, index_first_axis, unpad_input = None, None, None
51
+ try:
52
+ from flash_attn import flash_attn_func as _flash_attn_func
53
+ from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
54
+ from flash_attn.bert_padding import index_first_axis as _index_first_axis
55
+ from flash_attn.bert_padding import pad_input as _pad_input
56
+ from flash_attn.bert_padding import unpad_input as _unpad_input
57
+
58
+ flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
59
+ pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
60
+ has_flash_attn = True
61
+ except:
62
+ has_flash_attn = False
63
+
64
+
65
+ def _import_flash_attn():
66
+ global flash_attn_func, flash_attn_varlen_func
67
+ global pad_input, index_first_axis, unpad_input
68
+ try:
69
+ from flash_attn import flash_attn_func as _flash_attn_func
70
+ from flash_attn import \
71
+ flash_attn_varlen_func as _flash_attn_varlen_func
72
+ from flash_attn.bert_padding import \
73
+ index_first_axis as _index_first_axis
74
+ from flash_attn.bert_padding import pad_input as _pad_input
75
+ from flash_attn.bert_padding import unpad_input as _unpad_input
76
+ flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
77
+ pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
78
+ except ImportError:
79
+ raise ImportError('flash_attn is not installed.')
80
+
81
+
82
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
83
+ def _get_unpad_data(attention_mask):
84
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
85
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
86
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
87
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.torch.int32), (1, 0))
88
+ return (
89
+ indices,
90
+ cu_seqlens,
91
+ max_seqlen_in_batch,
92
+ )
93
+
94
+
95
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
96
+ def _make_causal_mask(
97
+ input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
98
+ ):
99
+ """
100
+ Make causal mask used for bi-directional self-attention.
101
+ """
102
+ bsz, tgt_len = input_ids_shape
103
+ mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
104
+ mask_cond = torch.arange(mask.size(-1), device=device)
105
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
106
+ mask = mask.to(dtype)
107
+
108
+ if past_key_values_length > 0:
109
+ mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
110
+ return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
111
+
112
+
113
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
114
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
115
+ """
116
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
117
+ """
118
+ bsz, src_len = mask.size()
119
+ tgt_len = tgt_len if tgt_len is not None else src_len
120
+
121
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
122
+
123
+ inverted_mask = 1.0 - expanded_mask
124
+
125
+ return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
126
+
127
+
128
+ # Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->InternLM2
129
+ class InternLM2RMSNorm(nn.Module):
130
+ def __init__(self, hidden_size, eps=1e-6):
131
+ """
132
+ InternLM2RMSNorm is equivalent to T5LayerNorm
133
+ """
134
+ super().__init__()
135
+ self.weight = nn.Parameter(torch.ones(hidden_size))
136
+ self.variance_epsilon = eps
137
+
138
+ def forward(self, hidden_states):
139
+ input_dtype = hidden_states.dtype
140
+ hidden_states = hidden_states.to(torch.float32)
141
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
142
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
143
+ return self.weight * hidden_states.to(input_dtype)
144
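As a reference restatement of the forward pass above (the same computation, not additional behaviour):

```python
# Illustrative only: RMS normalization without mean subtraction, computed in float32
# exactly as InternLM2RMSNorm.forward does.
import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return weight * (x.float() * torch.rsqrt(variance + eps)).to(x.dtype)
```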
+
145
+
146
+ # Copied from transformers.model.llama.modeling_llama.LlamaRotaryEmbedding with Llama->InternLM2
147
+ class InternLM2RotaryEmbedding(nn.Module):
148
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
149
+ super().__init__()
150
+
151
+ self.dim = dim
152
+ self.max_position_embeddings = max_position_embeddings
153
+ self.base = base
154
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
155
+ self.register_buffer('inv_freq', inv_freq, persistent=False)
156
+
157
+ # Build here to make `torch.jit.trace` work.
158
+ self._set_cos_sin_cache(
159
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
160
+ )
161
+
162
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
163
+ self.max_seq_len_cached = seq_len
164
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
165
+
166
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq)
167
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
168
+ emb = torch.cat((freqs, freqs), dim=-1)
169
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
170
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
171
+
172
+ def forward(self, x, seq_len=None):
173
+ # x: [bs, num_attention_heads, seq_len, head_size]
174
+ if seq_len > self.max_seq_len_cached:
175
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=torch.float32)
176
+
177
+ return (
178
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
179
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
180
+ )
181
+
182
+
183
+ # Copied from transformers.model.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->InternLM2
184
+ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
185
+ """InternLM2RotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
186
+
187
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
188
+ self.scaling_factor = scaling_factor
189
+ super().__init__(dim, max_position_embeddings, base, device)
190
+
191
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
192
+ self.max_seq_len_cached = seq_len
193
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
194
+ t = t / self.scaling_factor
195
+
196
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq)
197
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
198
+ emb = torch.cat((freqs, freqs), dim=-1)
199
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
200
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
201
+
202
+
203
+ # Copied from transformers.model.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->InternLM2
204
+ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
205
+ """InternLM2RotaryEmbedding extended with Dynamic NTK scaling.
206
+ Credits to the Reddit users /u/bloc97 and /u/emozilla.
207
+ """
208
+
209
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
210
+ self.scaling_factor = scaling_factor
211
+ super().__init__(dim, max_position_embeddings, base, device)
212
+
213
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
214
+ self.max_seq_len_cached = seq_len
215
+
216
+ if seq_len > self.max_position_embeddings:
217
+ base = self.base * (
218
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
219
+ ) ** (self.dim / (self.dim - 2))
220
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
221
+ self.register_buffer('inv_freq', inv_freq, persistent=False)
222
+
223
+ t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
224
+
225
+ freqs = torch.einsum('i,j->ij', t, self.inv_freq)
226
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
227
+ emb = torch.cat((freqs, freqs), dim=-1)
228
+ self.register_buffer('cos_cached', emb.cos().to(dtype), persistent=False)
229
+ self.register_buffer('sin_cached', emb.sin().to(dtype), persistent=False)
230
+
231
+
232
+ # Copied from transformers.model.llama.modeling_llama.rotate_half
233
+ def rotate_half(x):
234
+ """Rotates half the hidden dims of the input."""
235
+ x1 = x[..., : x.shape[-1] // 2]
236
+ x2 = x[..., x.shape[-1] // 2 :]
237
+ return torch.cat((-x2, x1), dim=-1)
238
+
239
+
240
+ # Copied from transformers.model.llama.modeling_llama.apply_rotary_pos_emb
241
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
242
+ """Applies Rotary Position Embedding to the query and key tensors."""
243
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
244
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
245
+ q_embed = (q * cos) + (rotate_half(q) * sin)
246
+ k_embed = (k * cos) + (rotate_half(k) * sin)
247
+ return q_embed, k_embed
248
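For reference, the rotation applied above pairs feature `i` with feature `i + dim/2` (the `rotate_half` convention); a restatement of what it does to one such pair, not additional behaviour:

```python
# Illustrative only, for a query/key feature pair (x1, x2) = (x[..., i], x[..., i + dim // 2])
# at position p:
#   x1' = x1 * cos(p * theta_i) - x2 * sin(p * theta_i)
#   x2' = x2 * cos(p * theta_i) + x1 * sin(p * theta_i)
# where theta_i = base ** (-2 * i / dim), matching the inv_freq buffer built above.
```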
+
249
+
250
+ class InternLM2MLP(nn.Module):
251
+ def __init__(self, config):
252
+ super().__init__()
253
+ self.config = config
254
+ self.hidden_size = config.hidden_size
255
+ self.intermediate_size = config.intermediate_size
256
+ self.w1 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
257
+ self.w3 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
258
+ self.w2 = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
259
+ self.act_fn = ACT2FN[config.hidden_act]
260
+
261
+ def forward(self, x):
262
+ down_proj = self.w2(self.act_fn(self.w1(x)) * self.w3(x))
263
+
264
+ return down_proj
265
+
266
+
267
+ # Copied from transformers.model.llama.modeling_llama.repeat_kv
268
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
269
+ """
270
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
271
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
272
+ """
273
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
274
+ if n_rep == 1:
275
+ return hidden_states
276
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
277
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
278
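A small shape example for `repeat_kv`; the specific head counts are an assumption here (grouped-query attention with 48 query heads, the head count that the attention masking code later in this file also hard-codes):

```python
# Illustrative only: with num_key_value_heads=8 and n_rep=6 (48 query heads in total),
# repeat_kv maps (batch, 8, seq_len, head_dim) -> (batch, 48, seq_len, head_dim),
# i.e. the same result as torch.repeat_interleave(x, repeats=6, dim=1).
```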
+
279
+
280
+ # Modified from transformers.model.llama.modeling_llama.LlamaAttention
281
+ class InternLM2Attention(nn.Module):
282
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
283
+
284
+ def __init__(self, config: InternLM2Config):
285
+ super().__init__()
286
+ self.config = config
287
+ self.hidden_size = config.hidden_size
288
+ self.num_heads = config.num_attention_heads
289
+ self.head_dim = self.hidden_size // self.num_heads
290
+ self.num_key_value_heads = config.num_key_value_heads
291
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
292
+ self.max_position_embeddings = config.max_position_embeddings
293
+ self.is_causal = True
294
+
295
+ if (self.head_dim * self.num_heads) != self.hidden_size:
296
+ raise ValueError(
297
+ f'hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}'
298
+ f' and `num_heads`: {self.num_heads}).'
299
+ )
300
+
301
+ self.wqkv = nn.Linear(
302
+ self.hidden_size,
303
+ (self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
304
+ bias=config.bias,
305
+ )
306
+
307
+ self.wo = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
308
+ self._init_rope()
309
+ self.mask = torch.load("headcut_mask/mask.pth")  # per-layer, per-head image-attention mask used by the headcut branch in forward()
310
+ def _init_rope(self):
311
+ if self.config.rope_scaling is None:
312
+ self.rotary_emb = InternLM2RotaryEmbedding(
313
+ self.head_dim,
314
+ max_position_embeddings=self.max_position_embeddings,
315
+ base=self.config.rope_theta,
316
+ )
317
+ else:
318
+ scaling_type = self.config.rope_scaling['type']
319
+ scaling_factor = self.config.rope_scaling['factor']
320
+ if scaling_type == 'dynamic':
321
+ self.rotary_emb = InternLM2DynamicNTKScalingRotaryEmbedding(
322
+ self.head_dim,
323
+ max_position_embeddings=self.max_position_embeddings,
324
+ base=self.config.rope_theta,
325
+ scaling_factor=scaling_factor,
326
+ )
327
+ elif scaling_type == 'linear':
328
+ self.rotary_emb = InternLM2LinearScalingRotaryEmbedding(
329
+ self.head_dim,
330
+ max_position_embeddings=self.max_position_embeddings,
331
+ base=self.config.rope_theta,
332
+ scaling_factor=scaling_factor,
333
+ )
334
+ else:
335
+ raise ValueError("Currently we only support rotary embedding's type being 'dynamic' or 'linear'.")
336
+ return self.rotary_emb
337
+
338
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
339
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
340
+
341
+ def local_mask(self, h, w, window):
342
+ height, width = h, w
343
+ num_pixels = height * width
344
+
345
+ # Generate grid of positions
346
+ rows = torch.arange(height)
347
+ cols = torch.arange(width)
348
+ grid_r, grid_c = torch.meshgrid(rows, cols, indexing='ij') # Shape: (24, 24)
349
+ positions = torch.stack([grid_r.flatten(), grid_c.flatten()], dim=1) # Shape: (576, 2)
350
+
351
+ # Compute pairwise differences between positions
352
+ positions_i = positions.unsqueeze(1) # Shape: (576, 1, 2)
353
+ positions_j = positions.unsqueeze(0) # Shape: (1, 576, 2)
354
+ delta = positions_i - positions_j # Shape: (576, 576, 2)
355
+ delta_abs = delta.abs() # Absolute differences
356
+
357
+ # Create neighbor mask for the (window x window) neighborhood
358
+ neighbor_mask = (delta_abs[..., 0] <= int((window-1)/2)) & (delta_abs[..., 1] <= int((window-1)/2)) # Shape: (576, 576)
359
+
360
+ # Initialize the attention mask
361
+ attention_mask = torch.full((num_pixels, num_pixels), float('-inf'))
362
+ attention_mask[neighbor_mask] = 0.0 # Set the window neighborhood to 0, others to -inf
363
+ return attention_mask
364
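A small sanity-check sketch of `local_mask` on a 3x3 token grid (illustrative only; the method is called unbound because it does not use `self`):

```python
# 9x9 mask for a 3x3 grid with a 3-wide window: 0 inside the window, -inf outside,
# so after softmax each image token only attends to spatially nearby tokens.
m = InternLM2Attention.local_mask(None, 3, 3, 3)
assert m[0, 4].item() == 0.0              # (0,0) can attend to its diagonal neighbour (1,1)
assert m[0, 8].item() == float('-inf')    # (0,0) cannot attend to the far corner (2,2)
```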
+
365
+ def forward(
366
+ self,
367
+ hidden_states: torch.Tensor,
368
+ attention_mask: Optional[torch.Tensor] = None,
369
+ position_ids: Optional[torch.LongTensor] = None,
370
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
371
+ output_attentions: bool = False,
372
+ use_cache: bool = False,
373
+ idx: int = 0,
374
+ **kwargs,
375
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
376
+ if 'padding_mask' in kwargs:
377
+ warnings.warn(
378
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. '
379
+ 'Please make sure use `attention_mask` instead.`'
380
+ )
381
+
382
+ bsz, q_len, _ = hidden_states.size()
383
+
384
+ qkv_states = self.wqkv(hidden_states)
385
+
386
+ qkv_states = rearrange(
387
+ qkv_states,
388
+ 'b q (h gs d) -> b q h gs d',
389
+ gs=2 + self.num_key_value_groups,
390
+ d=self.head_dim,
391
+ )
392
+
393
+ query_states = qkv_states[..., : self.num_key_value_groups, :]
394
+ query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
395
+ key_states = qkv_states[..., -2, :]
396
+ value_states = qkv_states[..., -1, :]
397
+
398
+ query_states = query_states.transpose(1, 2)
399
+ key_states = key_states.transpose(1, 2)
400
+ value_states = value_states.transpose(1, 2)
401
+
402
+ kv_seq_len = key_states.shape[-2]
403
+ if past_key_value is not None:
404
+ kv_seq_len += past_key_value[0].shape[-2]
405
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
406
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
407
+
408
+ if past_key_value is not None:
409
+ # reuse k, v, self_attention
410
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
411
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
412
+
413
+ past_key_value = (key_states, value_states) if use_cache else None
414
+
415
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
416
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
417
+
418
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
419
+
420
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
421
+ raise ValueError(
422
+ f'Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is'
423
+ f' {attn_weights.size()}'
424
+ )
425
+
426
+ if attention_mask is not None:
427
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
428
+ raise ValueError(
429
+ f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
430
+ )
431
+ attn_weights = attn_weights + attention_mask
432
+
433
+ # upcast attention to fp32
434
+
435
+ image_token_num = int(os.environ.get('IMAGE_TOKEN_NUM'))
436
+ # import pdb; pdb.set_trace()
437
+
438
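The three hard-coded flags below enable the image-token attention pruning implemented in this file; an annotation of the branches that follow (and of the runtime inputs they rely on), not code that is executed:

```python
# Annotation only (not part of the model file):
# - attncut:  during prefill (sequence longer than IMAGE_TOKEN_NUM), adds a local window mask
#             over the image-token block so image tokens only attend to spatially nearby image
#             tokens; the window is built by local_mask with h = IMAGE_H / 2.
# - headcut:  from layer idx >= 2, multiplies the post-softmax attention that each head pays to
#             image-token columns by the per-layer head mask loaded from headcut_mask/mask.pth.
# - layercut: from layer idx >= 36, zeroes the attention that subsequent (text) tokens pay to
#             the image tokens.
# The IMAGE_TOKEN_NUM and IMAGE_H environment variables must be set before this forward pass runs;
# the fixed offset 41 is assumed to be the number of prompt tokens preceding the image tokens.
```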
+ attncut = True
439
+ headcut = True
440
+ layercut = True
441
+ if attncut:
442
+ h = int(int(os.environ.get('IMAGE_H'))/2)
443
+ if attn_weights.shape[2]>image_token_num:
444
+ self.mask_local = self.local_mask(h, h, int(h/2)) # 1/4 window
445
+ #import pdb; pdb.set_trace()
446
+ mask = attn_weights.clone()*0
447
+ temp = mask[:,:,41:41+image_token_num,41:41+image_token_num]
448
+ temp = temp.reshape(temp.shape[0],48, int(temp.shape[2]/(h*h)),h*h,int(temp.shape[2]/(h*h)),h*h)
449
+ temp2 = self.mask_local.unsqueeze(1).unsqueeze(0).unsqueeze(0).unsqueeze(0)
450
+ temp[:,:,:,:,:,:]=temp2
451
+ attn_weights = attn_weights + mask
452
+
453
+
454
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
455
+
456
+ if headcut:
457
+ if idx>=2:
458
+ mask = self.mask[idx].unsqueeze(1).unsqueeze(1).unsqueeze(0).cuda()
459
+ attn_weights[:,:,:,41:image_token_num]= attn_weights[:,:,:,41:image_token_num] * mask
460
+ # import pdb; pdb.set_trace()
461
+ if layercut:
462
+ if idx>=36:
463
+ #import pdb; pdb.set_trace()
464
+ if attn_weights.shape[2]>image_token_num:
465
+ # import pdb; pdb.set_trace()
466
+ attn_weights[:,:,image_token_num+41:,41:41+image_token_num]=0
467
+ else:
468
+ attn_weights[:,:,:,41:41+image_token_num]=0
469
+
470
+ attn_output = torch.matmul(attn_weights, value_states)
471
+
472
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
473
+ raise ValueError(
474
+ f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is'
475
+ f' {attn_output.size()}'
476
+ )
477
+
478
+ attn_output = attn_output.transpose(1, 2).contiguous()
479
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
480
+
481
+ attn_output = self.wo(attn_output)
482
+
483
+ if not output_attentions:
484
+ attn_weights = None
485
+
486
+ return attn_output, attn_weights, past_key_value
487
+
488
+
489
+ # Modified from transformers.model.llama.modeling_llama.LlamaFlashAttention2
490
+ class InternLM2FlashAttention2(InternLM2Attention):
491
+ """
492
+ InternLM2 flash attention module. This module inherits from `InternLM2Attention` as the weights of the module stay
493
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
494
+ flash attention and deal with padding tokens in case the input contains any of them.
495
+ """
496
+
497
+ def forward(
498
+ self,
499
+ hidden_states: torch.Tensor,
500
+ attention_mask: Optional[torch.LongTensor] = None,
501
+ position_ids: Optional[torch.LongTensor] = None,
502
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
503
+ output_attentions: bool = False,
504
+ use_cache: bool = False,
505
+ **kwargs,
506
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
507
+ # InternLM2FlashAttention2 attention does not support output_attentions
508
+ if 'padding_mask' in kwargs:
509
+ warnings.warn(
510
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. '
511
+ 'Please make sure use `attention_mask` instead.`'
512
+ )
513
+
514
+ # overwrite attention_mask with padding_mask
515
+ attention_mask = kwargs.pop('padding_mask')
516
+
517
+ output_attentions = False
518
+
519
+ bsz, q_len, _ = hidden_states.size()
520
+
521
+ qkv_states = self.wqkv(hidden_states)
522
+
523
+ qkv_states = rearrange(
524
+ qkv_states,
525
+ 'b q (h gs d) -> b q h gs d',
526
+ gs=2 + self.num_key_value_groups,
527
+ d=self.head_dim,
528
+ )
529
+
530
+ query_states = qkv_states[..., : self.num_key_value_groups, :]
531
+ query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
532
+ key_states = qkv_states[..., -2, :]
533
+ value_states = qkv_states[..., -1, :]
534
+
535
+ query_states = query_states.transpose(1, 2)
536
+ key_states = key_states.transpose(1, 2)
537
+ value_states = value_states.transpose(1, 2)
538
+
539
+ kv_seq_len = key_states.shape[-2]
540
+ if past_key_value is not None:
541
+ kv_seq_len += past_key_value[0].shape[-2]
542
+
543
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
544
+
545
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
546
+
547
+ if past_key_value is not None:
548
+ # reuse k, v, self_attention
549
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
550
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
551
+
552
+ past_key_value = (key_states, value_states) if use_cache else None
553
+
554
+ query_states = query_states.transpose(1, 2)
555
+ key_states = key_states.transpose(1, 2)
556
+ value_states = value_states.transpose(1, 2)
557
+
558
+ attn_output = self._flash_attention_forward(
559
+ query_states, key_states, value_states, attention_mask, q_len
560
+ )
561
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
562
+ attn_output = self.wo(attn_output)
563
+
564
+ if not output_attentions:
565
+ attn_weights = None
566
+
567
+ return attn_output, attn_weights, past_key_value
568
+
569
+ def _flash_attention_forward(
570
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
571
+ ):
572
+ """
573
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token,
574
+ it first unpads the input, then computes the attention scores and pads the final attention scores.
575
+
576
+ Args:
577
+ query_states (`torch.Tensor`):
578
+ Input query states to be passed to Flash Attention API
579
+ key_states (`torch.Tensor`):
580
+ Input key states to be passed to Flash Attention API
581
+ value_states (`torch.Tensor`):
582
+ Input value states to be passed to Flash Attention API
583
+ attention_mask (`torch.Tensor`):
584
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
585
+ position of padding tokens and 1 for the position of non-padding tokens.
586
+ dropout (`float`, *optional*):
587
+ Attention dropout
588
+ softmax_scale (`float`, *optional*):
589
+ The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim)
590
+ """
591
+ # Contains at least one padding token in the sequence
592
+ causal = self.is_causal and query_length != 1
593
+ if attention_mask is not None:
594
+ batch_size = query_states.shape[0]
595
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._unpad_input(
596
+ query_states, key_states, value_states, attention_mask, query_length
597
+ )
598
+
599
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
600
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
601
+
602
+ attn_output_unpad = flash_attn_varlen_func(
603
+ query_states,
604
+ key_states,
605
+ value_states,
606
+ cu_seqlens_q=cu_seqlens_q,
607
+ cu_seqlens_k=cu_seqlens_k,
608
+ max_seqlen_q=max_seqlen_in_batch_q,
609
+ max_seqlen_k=max_seqlen_in_batch_k,
610
+ dropout_p=dropout,
611
+ softmax_scale=softmax_scale,
612
+ causal=causal,
613
+ )
614
+
615
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
616
+ else:
617
+ attn_output = flash_attn_func(
618
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
619
+ )
620
+
621
+ return attn_output
622
+
623
+ def _unpad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
624
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
625
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
626
+
627
+ key_layer = index_first_axis(
628
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
629
+ )
630
+ value_layer = index_first_axis(
631
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
632
+ )
633
+
634
+ if query_length == kv_seq_len:
635
+ query_layer = index_first_axis(
636
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
637
+ )
638
+ cu_seqlens_q = cu_seqlens_k
639
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
640
+ indices_q = indices_k
641
+ elif query_length == 1:
642
+ max_seqlen_in_batch_q = 1
643
+ cu_seqlens_q = torch.arange(
644
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
645
+ ) # There is a memcpy here, that is very bad.
646
+ indices_q = cu_seqlens_q[:-1]
647
+ query_layer = query_layer.squeeze(1)
648
+ else:
649
+ # The -q_len: slice assumes left padding.
650
+ attention_mask = attention_mask[:, -query_length:]
651
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
652
+
653
+ return (
654
+ query_layer,
655
+ key_layer,
656
+ value_layer,
657
+ indices_q.to(torch.int64),
658
+ (cu_seqlens_q, cu_seqlens_k),
659
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
660
+ )
661
+
662
+
663
+ INTERNLM2_ATTENTION_CLASSES = {
664
+ 'eager': InternLM2Attention,
665
+ 'flash_attention_2': InternLM2FlashAttention2,
666
+ }
667
+
668
+
669
+ # Modified from transformers.model.llama.modeling_llama.LlamaDecoderLayer
670
+ class InternLM2DecoderLayer(nn.Module):
671
+ def __init__(self, config: InternLM2Config):
672
+ super().__init__()
673
+ self.hidden_size = config.hidden_size
674
+
675
+ self.attention = INTERNLM2_ATTENTION_CLASSES[config.attn_implementation](config=config)
676
+
677
+ self.feed_forward = InternLM2MLP(config)
678
+ self.attention_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
679
+ self.ffn_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
680
+
681
+ def forward(
682
+ self,
683
+ hidden_states: torch.Tensor,
684
+ attention_mask: Optional[torch.Tensor] = None,
685
+ position_ids: Optional[torch.LongTensor] = None,
686
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
687
+ output_attentions: Optional[bool] = False,
688
+ use_cache: Optional[bool] = False,
689
+ idx: Optional[int] = 0,
690
+ **kwargs,
691
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
692
+ """
693
+ Args:
694
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
695
+ attention_mask (`torch.FloatTensor`, *optional*):
696
+ attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
697
+ query_sequence_length, key_sequence_length)` if default attention is used.
698
+ output_attentions (`bool`, *optional*):
699
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
700
+ returned tensors for more detail.
701
+ use_cache (`bool`, *optional*):
702
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
703
+ (see `past_key_values`).
704
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
705
+ """
706
+ if 'padding_mask' in kwargs:
707
+ warnings.warn(
708
+ 'Passing `padding_mask` is deprecated and will be removed in v4.37. '
709
+ 'Please make sure to use `attention_mask` instead.'
710
+ )
711
+
712
+ residual = hidden_states
713
+
714
+ hidden_states = self.attention_norm(hidden_states)
715
+
716
+ # Self Attention
717
+ hidden_states, self_attn_weights, present_key_value = self.attention(
718
+ hidden_states=hidden_states,
719
+ attention_mask=attention_mask,
720
+ position_ids=position_ids,
721
+ past_key_value=past_key_value,
722
+ output_attentions=output_attentions,
723
+ use_cache=use_cache,
724
+ idx=idx,
725
+ **kwargs,
726
+ )
727
+ hidden_states = residual + hidden_states
728
+
729
+ # Fully Connected
730
+ residual = hidden_states
731
+ hidden_states = self.ffn_norm(hidden_states)
732
+ hidden_states = self.feed_forward(hidden_states)
733
+ hidden_states = residual + hidden_states
734
+
735
+ outputs = (hidden_states,)
736
+
737
+ if output_attentions:
738
+ outputs += (self_attn_weights,)
739
+
740
+ if use_cache:
741
+ outputs += (present_key_value,)
742
+
743
+ return outputs
744
+
745
+
746
+ InternLM2_START_DOCSTRING = r"""
747
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
748
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
749
+ etc.)
750
+
751
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
752
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
753
+ and behavior.
754
+
755
+ Parameters:
756
+ config ([`InternLM2Config`]):
757
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
758
+ load the weights associated with the model, only the configuration. Check out the
759
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
760
+ """
761
+
762
+
763
+ # Copied from transformers.models.llama.modeling_llama.LlamaPreTrainedModel with Llama->InternLM2
764
+ @add_start_docstrings(
765
+ 'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
766
+ InternLM2_START_DOCSTRING,
767
+ )
768
+ class InternLM2PreTrainedModel(PreTrainedModel):
769
+ config_class = InternLM2Config
770
+ base_model_prefix = 'model'
771
+ supports_gradient_checkpointing = True
772
+ _no_split_modules = ['InternLM2DecoderLayer']
773
+ _skip_keys_device_placement = 'past_key_values'
774
+ _supports_flash_attn_2 = True
775
+
776
+ def _init_weights(self, module):
777
+ std = self.config.initializer_range
778
+ if isinstance(module, nn.Linear):
779
+ module.weight.data.normal_(mean=0.0, std=std)
780
+ if module.bias is not None:
781
+ module.bias.data.zero_()
782
+ elif isinstance(module, nn.Embedding):
783
+ module.weight.data.normal_(mean=0.0, std=std)
784
+ if module.padding_idx is not None:
785
+ module.weight.data[module.padding_idx].zero_()
786
+
787
+
788
+ InternLM2_INPUTS_DOCSTRING = r"""
789
+ Args:
790
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
791
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
792
+ it.
793
+
794
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
795
+ [`PreTrainedTokenizer.__call__`] for details.
796
+
797
+ [What are input IDs?](../glossary#input-ids)
798
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
799
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
800
+
801
+ - 1 for tokens that are **not masked**,
802
+ - 0 for tokens that are **masked**.
803
+
804
+ [What are attention masks?](../glossary#attention-mask)
805
+
806
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
807
+ [`PreTrainedTokenizer.__call__`] for details.
808
+
809
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
810
+ `past_key_values`).
811
+
812
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
813
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
814
+ information on the default strategy.
815
+
816
+ - 1 indicates the head is **not masked**,
817
+ - 0 indicates the head is **masked**.
818
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
819
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
820
+ config.n_positions - 1]`.
821
+
822
+ [What are position IDs?](../glossary#position-ids)
823
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or
824
+ when `config.use_cache=True`):
825
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
826
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
827
+ `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)`.
828
+
829
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
830
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
831
+
832
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
833
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
834
+ of shape `(batch_size, sequence_length)`.
835
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
836
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
837
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
838
+ model's internal embedding lookup matrix.
839
+ use_cache (`bool`, *optional*):
840
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
841
+ `past_key_values`).
842
+ output_attentions (`bool`, *optional*):
843
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
844
+ tensors for more detail.
845
+ output_hidden_states (`bool`, *optional*):
846
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
847
+ more detail.
848
+ return_dict (`bool`, *optional*):
849
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
850
+ """
851
+
852
+
853
+ # Modified from transformers.model.llama.modeling_llama.LlamaModel
854
+ @add_start_docstrings(
855
+ 'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
856
+ InternLM2_START_DOCSTRING,
857
+ )
858
+ class InternLM2Model(InternLM2PreTrainedModel):
859
+ """
860
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`InternLM2DecoderLayer`]
861
+
862
+ Args:
863
+ config: InternLM2Config
864
+ """
865
+
866
+ _auto_class = 'AutoModel'
867
+
868
+ def __init__(self, config: InternLM2Config):
869
+ super().__init__(config)
870
+ self.padding_idx = config.pad_token_id
871
+ self.vocab_size = config.vocab_size
872
+ self.config = config
873
+ if not has_flash_attn:
874
+ self.config.attn_implementation = 'eager'
875
+ print('Warning: Flash attention is not available, using eager attention instead.')
876
+
877
+ self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
878
+
879
+ self.layers = nn.ModuleList([InternLM2DecoderLayer(config) for _ in range(config.num_hidden_layers)])
880
+ self.norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
881
+
882
+ self.gradient_checkpointing = False
883
+ # Initialize weights and apply final processing
884
+ self.post_init()
885
+
886
+ def get_input_embeddings(self):
887
+ return self.tok_embeddings
888
+
889
+ def set_input_embeddings(self, value):
890
+ self.tok_embeddings = value
891
+
892
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
893
+ # create causal mask
894
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
895
+ combined_attention_mask = None
896
+ if input_shape[-1] > 1:
897
+ combined_attention_mask = _make_causal_mask(
898
+ input_shape,
899
+ inputs_embeds.dtype,
900
+ device=inputs_embeds.device,
901
+ past_key_values_length=past_key_values_length,
902
+ )
903
+
904
+ if attention_mask is not None:
905
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
906
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
907
+ inputs_embeds.device
908
+ )
909
+ combined_attention_mask = (
910
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
911
+ )
912
+
913
+ return combined_attention_mask
914
+
915
+ @add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
916
+ def forward(
917
+ self,
918
+ input_ids: torch.LongTensor = None,
919
+ attention_mask: Optional[torch.Tensor] = None,
920
+ position_ids: Optional[torch.LongTensor] = None,
921
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
922
+ inputs_embeds: Optional[torch.FloatTensor] = None,
923
+ use_cache: Optional[bool] = None,
924
+ output_attentions: Optional[bool] = None,
925
+ output_hidden_states: Optional[bool] = None,
926
+ return_dict: Optional[bool] = None,
927
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
928
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
929
+ output_hidden_states = (
930
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
931
+ )
932
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
933
+
934
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
935
+
936
+ if self.config.attn_implementation == 'flash_attention_2':
937
+ _import_flash_attn()
938
+
939
+ # retrieve input_ids and inputs_embeds
940
+ if input_ids is not None and inputs_embeds is not None:
941
+ raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
942
+ elif input_ids is not None:
943
+ batch_size, seq_length = input_ids.shape[:2]
944
+ elif inputs_embeds is not None:
945
+ batch_size, seq_length = inputs_embeds.shape[:2]
946
+ else:
947
+ raise ValueError('You have to specify either input_ids or inputs_embeds')
948
+
949
+ seq_length_with_past = seq_length
950
+ past_key_values_length = 0
951
+ if past_key_values is not None:
952
+ past_key_values_length = past_key_values[0][0].shape[2]
953
+ seq_length_with_past = seq_length_with_past + past_key_values_length
954
+
955
+ if position_ids is None:
956
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
957
+ position_ids = torch.arange(
958
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
959
+ )
960
+ position_ids = position_ids.unsqueeze(0)
961
+
962
+ if inputs_embeds is None:
963
+ inputs_embeds = self.tok_embeddings(input_ids)
964
+
965
+ if self.config.attn_implementation == 'flash_attention_2':
966
+ # 2d mask is passed through the layers
967
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
968
+ else:
969
+ if attention_mask is None:
970
+ attention_mask = torch.ones(
971
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
972
+ )
973
+ attention_mask = self._prepare_decoder_attention_mask(
974
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
975
+ )
976
+
977
+ # embed positions
978
+ hidden_states = inputs_embeds
979
+
980
+ if self.gradient_checkpointing and self.training:
981
+ if use_cache:
982
+ logger.warning_once(
983
+ '`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...'
984
+ )
985
+ use_cache = False
986
+
987
+ # decoder layers
988
+ all_hidden_states = () if output_hidden_states else None
989
+ all_self_attns = () if output_attentions else None
990
+ next_decoder_cache = () if use_cache else None
991
+
992
+ for idx, decoder_layer in enumerate(self.layers):
993
+ if output_hidden_states:
994
+ all_hidden_states += (hidden_states,)
995
+
996
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
997
+
998
+ if self.gradient_checkpointing and self.training:
999
+
1000
+ def create_custom_forward(module):
1001
+ def custom_forward(*inputs):
1002
+ # None for past_key_value
1003
+ return module(*inputs, output_attentions, None)
1004
+
1005
+ return custom_forward
1006
+
1007
+ layer_outputs = torch.utils.checkpoint.checkpoint(
1008
+ create_custom_forward(decoder_layer),
1009
+ hidden_states,
1010
+ attention_mask,
1011
+ position_ids,
1012
+ None,
1013
+ )
1014
+ else:
1015
+ layer_outputs = decoder_layer(
1016
+ hidden_states,
1017
+ attention_mask=attention_mask,
1018
+ position_ids=position_ids,
1019
+ past_key_value=past_key_value,
1020
+ output_attentions=output_attentions,
1021
+ use_cache=use_cache,
1022
+ idx=idx,
1023
+ )
1024
+
1025
+ hidden_states = layer_outputs[0]
1026
+
1027
+ if use_cache:
1028
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
1029
+
1030
+ if output_attentions:
1031
+ all_self_attns += (layer_outputs[1],)
1032
+
1033
+ hidden_states = self.norm(hidden_states)
1034
+
1035
+ # add hidden states from the last decoder layer
1036
+ if output_hidden_states:
1037
+ all_hidden_states += (hidden_states,)
1038
+
1039
+ next_cache = next_decoder_cache if use_cache else None
1040
+ if not return_dict:
1041
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1042
+ return BaseModelOutputWithPast(
1043
+ last_hidden_state=hidden_states,
1044
+ past_key_values=next_cache,
1045
+ hidden_states=all_hidden_states,
1046
+ attentions=all_self_attns,
1047
+ )
1048
+
1049
+
1050
+ # Modified from transformers.model.llama.modeling_llama.LlamaForCausalLM
1051
+ class InternLM2ForCausalLM(InternLM2PreTrainedModel):
1052
+ _auto_class = 'AutoModelForCausalLM'
1053
+
1054
+ _tied_weights_keys = ['output.weight']
1055
+
1056
+ def __init__(self, config):
1057
+ super().__init__(config)
1058
+ self.model = InternLM2Model(config)
1059
+ self.vocab_size = config.vocab_size
1060
+ self.output = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1061
+
1062
+ # Initialize weights and apply final processing
1063
+ self.post_init()
1064
+
1065
+ def get_input_embeddings(self):
1066
+ return self.model.tok_embeddings
1067
+
1068
+ def set_input_embeddings(self, value):
1069
+ self.model.tok_embeddings = value
1070
+
1071
+ def get_output_embeddings(self):
1072
+ return self.output
1073
+
1074
+ def set_output_embeddings(self, new_embeddings):
1075
+ self.output = new_embeddings
1076
+
1077
+ def set_decoder(self, decoder):
1078
+ self.model = decoder
1079
+
1080
+ def get_decoder(self):
1081
+ return self.model
1082
+
1083
+ @add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
1084
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1085
+ def forward(
1086
+ self,
1087
+ input_ids: torch.LongTensor = None,
1088
+ attention_mask: Optional[torch.Tensor] = None,
1089
+ position_ids: Optional[torch.LongTensor] = None,
1090
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1091
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1092
+ labels: Optional[torch.LongTensor] = None,
1093
+ use_cache: Optional[bool] = None,
1094
+ output_attentions: Optional[bool] = None,
1095
+ output_hidden_states: Optional[bool] = None,
1096
+ return_dict: Optional[bool] = None,
1097
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1098
+ r"""
1099
+ Args:
1100
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1101
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1102
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1103
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1104
+
1105
+ Returns:
1106
+
1107
+ Example:
1108
+
1109
+ ```python
1110
+ >>> from transformers import AutoTokenizer, InternLM2ForCausalLM
1111
+
1112
+ >>> model = InternLM2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1113
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1114
+
1115
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1116
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1117
+
1118
+ >>> # Generate
1119
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1120
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1121
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1122
+ ```"""
1123
+
1124
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1125
+ output_hidden_states = (
1126
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1127
+ )
1128
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1129
+
1130
+ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
1131
+ outputs = self.model(
1132
+ input_ids=input_ids,
1133
+ attention_mask=attention_mask,
1134
+ position_ids=position_ids,
1135
+ past_key_values=past_key_values,
1136
+ inputs_embeds=inputs_embeds,
1137
+ use_cache=use_cache,
1138
+ output_attentions=output_attentions,
1139
+ output_hidden_states=output_hidden_states,
1140
+ return_dict=return_dict,
1141
+ )
1142
+
1143
+ hidden_states = outputs[0]
1144
+ logits = self.output(hidden_states)
1145
+ logits = logits.float()
1146
+
1147
+ loss = None
1148
+ if labels is not None:
1149
+ # Shift so that tokens < n predict n
1150
+ shift_logits = logits[..., :-1, :].contiguous()
1151
+ shift_labels = labels[..., 1:].contiguous()
1152
+ # Flatten the tokens
1153
+ loss_fct = CrossEntropyLoss()
1154
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1155
+ shift_labels = shift_labels.view(-1)
1156
+ # Enable model parallelism
1157
+ shift_labels = shift_labels.to(shift_logits.device)
1158
+ loss = loss_fct(shift_logits, shift_labels)
1159
+
1160
+ if not return_dict:
1161
+ output = (logits,) + outputs[1:]
1162
+ return (loss,) + output if loss is not None else output
1163
+
1164
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1165
+ output = CausalLMOutputWithPast(
1166
+ loss=loss,
1167
+ logits=logits,
1168
+ past_key_values=outputs.past_key_values,
1169
+ hidden_states=outputs.hidden_states,
1170
+ attentions=outputs.attentions,
1171
+ )
1172
+ output['logits'] = output['logits'].to(device)
1173
+ return output
1174
+
1175
+ def prepare_inputs_for_generation(
1176
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1177
+ ):
1178
+ if past_key_values is not None:
1179
+ past_length = past_key_values[0][0].shape[2]
1180
+
1181
+ # Some generation methods already pass only the last input ID
1182
+ if input_ids.shape[1] > past_length:
1183
+ remove_prefix_length = past_length
1184
+ else:
1185
+ # Default to old behavior: keep only final ID
1186
+ remove_prefix_length = input_ids.shape[1] - 1
1187
+
1188
+ input_ids = input_ids[:, remove_prefix_length:]
1189
+
1190
+ position_ids = kwargs.get('position_ids', None)
1191
+ if attention_mask is not None and position_ids is None:
1192
+ # create position_ids on the fly for batch generation
1193
+ position_ids = attention_mask.long().cumsum(-1) - 1
1194
+ position_ids.masked_fill_(attention_mask == 0, 1)
1195
+ if past_key_values:
1196
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1197
+
1198
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1199
+ if inputs_embeds is not None and past_key_values is None:
1200
+ model_inputs = {'inputs_embeds': inputs_embeds}
1201
+ else:
1202
+ model_inputs = {'input_ids': input_ids}
1203
+
1204
+ model_inputs.update(
1205
+ {
1206
+ 'position_ids': position_ids,
1207
+ 'past_key_values': past_key_values,
1208
+ 'use_cache': kwargs.get('use_cache'),
1209
+ 'attention_mask': attention_mask,
1210
+ }
1211
+ )
1212
+ return model_inputs
1213
+
1214
+ @staticmethod
1215
+ def _reorder_cache(past_key_values, beam_idx):
1216
+ reordered_past = ()
1217
+ for layer_past in past_key_values:
1218
+ reordered_past += (
1219
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1220
+ )
1221
+ return reordered_past
1222
+
1223
+ def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = [], meta_instruction=''):
1224
+ if tokenizer.add_bos_token:
1225
+ prompt = ''
1226
+ else:
1227
+ prompt = tokenizer.bos_token
1228
+ if meta_instruction:
1229
+ prompt += f"""<|im_start|>system\n{meta_instruction}<|im_end|>\n"""
1230
+ for record in history:
1231
+ prompt += f"""<|im_start|>user\n{record[0]}<|im_end|>\n<|im_start|>assistant\n{record[1]}<|im_end|>\n"""
1232
+ prompt += f"""<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"""
1233
+ return tokenizer([prompt], return_tensors='pt')
1234
+
1235
+ @torch.no_grad()
1236
+ def chat(
1237
+ self,
1238
+ tokenizer,
1239
+ query: str,
1240
+ history: List[Tuple[str, str]] = [],
1241
+ streamer: Optional[BaseStreamer] = None,
1242
+ max_new_tokens: int = 1024,
1243
+ do_sample: bool = True,
1244
+ temperature: float = 0.8,
1245
+ top_p: float = 0.8,
1246
+ meta_instruction: str = 'You are an AI assistant whose name is InternLM (书生·浦语).\n'
1247
+ '- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n'
1248
+ '- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.',
1249
+ **kwargs,
1250
+ ):
1251
+ inputs = self.build_inputs(tokenizer, query, history, meta_instruction)
1252
+ inputs = {k: v.to(self.device) for k, v in inputs.items() if torch.is_tensor(v)}
1253
+ # also add end-of-assistant token in eos token id to avoid unnecessary generation
1254
+ eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(['<|im_end|>'])[0]]
1255
+ outputs = self.generate(
1256
+ **inputs,
1257
+ streamer=streamer,
1258
+ max_new_tokens=max_new_tokens,
1259
+ do_sample=do_sample,
1260
+ temperature=temperature,
1261
+ top_p=top_p,
1262
+ eos_token_id=eos_token_id,
1263
+ **kwargs,
1264
+ )
1265
+ outputs = outputs[0].cpu().tolist()[len(inputs['input_ids'][0]) :]
1266
+ response = tokenizer.decode(outputs, skip_special_tokens=True)
1267
+ response = response.split('<|im_end|>')[0]
1268
+ history = history + [(query, response)]
1269
+ return response, history
1270
+
1271
+ @torch.no_grad()
1272
+ def stream_chat(
1273
+ self,
1274
+ tokenizer,
1275
+ query: str,
1276
+ history: List[Tuple[str, str]] = [],
1277
+ max_new_tokens: int = 1024,
1278
+ do_sample: bool = True,
1279
+ temperature: float = 0.8,
1280
+ top_p: float = 0.8,
1281
+ **kwargs,
1282
+ ):
1283
+ """
1284
+ Return a generator in format: (response, history)
1285
+ Eg.
1286
+ ('你好,有什么可以帮助您的吗', [('你好', '你好,有什么可以帮助您的吗')])
1287
+ ('你好,有什么可以帮助您的吗?', [('你好', '你好,有什么可以帮助您的吗?')])
1288
+ """
1289
+ if BaseStreamer is None:
1290
+ raise ModuleNotFoundError(
1291
+ 'The version of `transformers` is too low. Please make sure '
1292
+ 'that you have installed `transformers>=4.28.0`.'
1293
+ )
1294
+
1295
+ response_queue = queue.Queue(maxsize=20)
1296
+
1297
+ class ChatStreamer(BaseStreamer):
1298
+ def __init__(self, tokenizer) -> None:
1299
+ super().__init__()
1300
+ self.tokenizer = tokenizer
1301
+ self.queue = response_queue
1302
+ self.query = query
1303
+ self.history = history
1304
+ self.response = ''
1305
+ self.cache = []
1306
+ self.received_inputs = False
1307
+ self.queue.put((self.response, history + [(self.query, self.response)]))
1308
+
1309
+ def put(self, value):
1310
+ if len(value.shape) > 1 and value.shape[0] > 1:
1311
+ raise ValueError('ChatStreamer only supports batch size 1')
1312
+ elif len(value.shape) > 1:
1313
+ value = value[0]
1314
+
1315
+ if not self.received_inputs:
1316
+ # The first received value is input_ids, ignore here
1317
+ self.received_inputs = True
1318
+ return
1319
+
1320
+ self.cache.extend(value.tolist())
1321
+ token = self.tokenizer.decode(self.cache, skip_special_tokens=True)
1322
+ if token.strip() != '<|im_end|>':
1323
+ self.response = self.response + token
1324
+ history = self.history + [(self.query, self.response)]
1325
+ self.queue.put((self.response, history))
1326
+ self.cache = []
1327
+ else:
1328
+ self.end()
1329
+
1330
+ def end(self):
1331
+ self.queue.put(None)
1332
+
1333
+ def stream_producer():
1334
+ return self.chat(
1335
+ tokenizer=tokenizer,
1336
+ query=query,
1337
+ streamer=ChatStreamer(tokenizer=tokenizer),
1338
+ history=history,
1339
+ max_new_tokens=max_new_tokens,
1340
+ do_sample=do_sample,
1341
+ temperature=temperature,
1342
+ top_p=top_p,
1343
+ **kwargs,
1344
+ )
1345
+
1346
+ def consumer():
1347
+ producer = threading.Thread(target=stream_producer)
1348
+ producer.start()
1349
+ while True:
1350
+ res = response_queue.get()
1351
+ if res is None:
1352
+ return
1353
+ yield res
1354
+
1355
+ return consumer()
1356
+
1357
+
1358
+ # Copied from transformers.model.llama.modeling_llama.LlamaForSequenceClassification with Llama->InternLM2
1359
+ @add_start_docstrings(
1360
+ """
1361
+ The InternLM2 Model transformer with a sequence classification head on top (linear layer).
1362
+
1363
+ [`InternLM2ForSequenceClassification`] uses the last token in order to do the classification,
1364
+ as other causal models (e.g. GPT-2) do.
1365
+
1366
+ Since it does classification on the last token, it needs to know the position of the last token. If a
1367
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1368
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1369
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1370
+ each row of the batch).
1371
+ """,
1372
+ InternLM2_START_DOCSTRING,
1373
+ )
1374
+ class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
1375
+ def __init__(self, config):
1376
+ super().__init__(config)
1377
+ self.num_labels = config.num_labels
1378
+ self.model = InternLM2Model(config)
1379
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1380
+
1381
+ # Initialize weights and apply final processing
1382
+ self.post_init()
1383
+
1384
+ def get_input_embeddings(self):
1385
+ return self.model.tok_embeddings
1386
+
1387
+ def set_input_embeddings(self, value):
1388
+ self.model.tok_embeddings = value
1389
+
1390
+ @add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
1391
+ def forward(
1392
+ self,
1393
+ input_ids: torch.LongTensor = None,
1394
+ attention_mask: Optional[torch.Tensor] = None,
1395
+ position_ids: Optional[torch.LongTensor] = None,
1396
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1397
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1398
+ labels: Optional[torch.LongTensor] = None,
1399
+ use_cache: Optional[bool] = None,
1400
+ output_attentions: Optional[bool] = None,
1401
+ output_hidden_states: Optional[bool] = None,
1402
+ return_dict: Optional[bool] = None,
1403
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1404
+ r"""
1405
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1406
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1407
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
1408
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1409
+ """
1410
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1411
+
1412
+ transformer_outputs = self.model(
1413
+ input_ids,
1414
+ attention_mask=attention_mask,
1415
+ position_ids=position_ids,
1416
+ past_key_values=past_key_values,
1417
+ inputs_embeds=inputs_embeds,
1418
+ use_cache=use_cache,
1419
+ output_attentions=output_attentions,
1420
+ output_hidden_states=output_hidden_states,
1421
+ return_dict=return_dict,
1422
+ )
1423
+ hidden_states = transformer_outputs[0]
1424
+ logits = self.score(hidden_states)
1425
+
1426
+ if input_ids is not None:
1427
+ batch_size = input_ids.shape[0]
1428
+ else:
1429
+ batch_size = inputs_embeds.shape[0]
1430
+
1431
+ if self.config.pad_token_id is None and batch_size != 1:
1432
+ raise ValueError('Cannot handle batch sizes > 1 if no padding token is defined.')
1433
+ if self.config.pad_token_id is None:
1434
+ sequence_lengths = -1
1435
+ else:
1436
+ if input_ids is not None:
1437
+ sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
1438
+ logits.device
1439
+ )
1440
+ else:
1441
+ sequence_lengths = -1
1442
+
1443
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1444
+
1445
+ loss = None
1446
+ if labels is not None:
1447
+ labels = labels.to(logits.device)
1448
+ if self.config.problem_type is None:
1449
+ if self.num_labels == 1:
1450
+ self.config.problem_type = 'regression'
1451
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1452
+ self.config.problem_type = 'single_label_classification'
1453
+ else:
1454
+ self.config.problem_type = 'multi_label_classification'
1455
+
1456
+ if self.config.problem_type == 'regression':
1457
+ loss_fct = MSELoss()
1458
+ if self.num_labels == 1:
1459
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1460
+ else:
1461
+ loss = loss_fct(pooled_logits, labels)
1462
+ elif self.config.problem_type == 'single_label_classification':
1463
+ loss_fct = CrossEntropyLoss()
1464
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1465
+ elif self.config.problem_type == 'multi_label_classification':
1466
+ loss_fct = BCEWithLogitsLoss()
1467
+ loss = loss_fct(pooled_logits, labels)
1468
+ if not return_dict:
1469
+ output = (pooled_logits,) + transformer_outputs[1:]
1470
+ return ((loss,) + output) if loss is not None else output
1471
+
1472
+ return SequenceClassifierOutputWithPast(
1473
+ loss=loss,
1474
+ logits=pooled_logits,
1475
+ past_key_values=transformer_outputs.past_key_values,
1476
+ hidden_states=transformer_outputs.hidden_states,
1477
+ attentions=transformer_outputs.attentions,
1478
+ )
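The `chat` helper defined above wraps `build_inputs` and `generate` into one call that returns `(response, history)`. A minimal usage sketch, assuming a checkpoint that ships this modeling file is loaded with `trust_remote_code=True` and a CUDA device is available (the repo id below is only illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = 'internlm/internlm2-chat-20b'  # illustrative checkpoint shipping this InternLM2 code
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()

# `chat` builds the <|im_start|>/<|im_end|> prompt, generates, and returns (response, history)
response, history = model.chat(tokenizer, 'Hello, who are you?', history=[], max_new_tokens=64)
print(response)
```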
modeling_internvl_chat.py ADDED
@@ -0,0 +1,364 @@
1
+ # --------------------------------------------------------
2
+ # InternVL
3
+ # Copyright (c) 2024 OpenGVLab
4
+ # Licensed under The MIT License [see LICENSE for details]
5
+ # --------------------------------------------------------
6
+ import warnings
7
+ from typing import Any, List, Optional, Tuple, Union
8
+
9
+ import torch.utils.checkpoint
10
+ import transformers
11
+ from torch import nn
12
+ from torch.nn import CrossEntropyLoss
13
+ from transformers import (AutoModel, GenerationConfig, LlamaForCausalLM,
14
+ LlamaTokenizer)
15
+ from transformers.modeling_outputs import CausalLMOutputWithPast
16
+ from transformers.modeling_utils import PreTrainedModel
17
+ from transformers.utils import ModelOutput, logging
18
+
19
+ from .configuration_internvl_chat import InternVLChatConfig
20
+ from .conversation import get_conv_template
21
+ from .modeling_intern_vit import InternVisionModel, has_flash_attn
22
+ from .modeling_internlm2 import InternLM2ForCausalLM
23
+
24
+ logger = logging.get_logger(__name__)
25
+ import os
26
+ image_token_num = 0
27
+
28
+ def version_cmp(v1, v2, op='eq'):
29
+ import operator
30
+
31
+ from packaging import version
32
+ op_func = getattr(operator, op)
33
+ return op_func(version.parse(v1), version.parse(v2))
34
+
35
+
36
+ class InternVLChatModel(PreTrainedModel):
37
+ config_class = InternVLChatConfig
38
+ main_input_name = 'pixel_values'
39
+ base_model_prefix = 'language_model'
40
+ _supports_flash_attn_2 = True
41
+ _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
42
+
43
+ def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None, use_flash_attn=True):
44
+ super().__init__(config)
45
+
46
+ assert version_cmp(transformers.__version__, '4.36.2', 'ge')
47
+ image_size = config.force_image_size or config.vision_config.image_size
48
+ patch_size = config.vision_config.patch_size
49
+ self.patch_size = patch_size
50
+ self.select_layer = config.select_layer
51
+ self.template = config.template
52
+ self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
53
+ self.downsample_ratio = config.downsample_ratio
54
+ self.ps_version = config.ps_version
55
+ use_flash_attn = use_flash_attn if has_flash_attn else False
56
+ use_flash_attn = False
57
+ config.vision_config.use_flash_attn = True if use_flash_attn else False
58
+ config.llm_config.attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
59
+
60
+ logger.info(f'num_image_token: {self.num_image_token}')
61
+ logger.info(f'ps_version: {self.ps_version}')
62
+ if vision_model is not None:
63
+ self.vision_model = vision_model
64
+ else:
65
+ self.vision_model = InternVisionModel(config.vision_config)
66
+ if language_model is not None:
67
+ self.language_model = language_model
68
+ else:
69
+ if config.llm_config.architectures[0] == 'LlamaForCausalLM':
70
+ self.language_model = LlamaForCausalLM(config.llm_config)
71
+ elif config.llm_config.architectures[0] == 'InternLM2ForCausalLM':
72
+ self.language_model = InternLM2ForCausalLM(config.llm_config)
73
+ else:
74
+ raise NotImplementedError(f'{config.llm_config.architectures[0]} is not implemented.')
75
+
76
+ vit_hidden_size = config.vision_config.hidden_size
77
+ llm_hidden_size = config.llm_config.hidden_size
78
+
79
+ self.mlp1 = nn.Sequential(
80
+ nn.LayerNorm(vit_hidden_size * int(1 / self.downsample_ratio) ** 2),
81
+ nn.Linear(vit_hidden_size * int(1 / self.downsample_ratio) ** 2, llm_hidden_size),
82
+ nn.GELU(),
83
+ nn.Linear(llm_hidden_size, llm_hidden_size)
84
+ )
85
+
86
+ self.img_context_token_id = None
87
+ self.conv_template = get_conv_template(self.template)
88
+ self.system_message = self.conv_template.system_message
89
+
90
+ def forward(
91
+ self,
92
+ pixel_values: torch.FloatTensor,
93
+ input_ids: torch.LongTensor = None,
94
+ attention_mask: Optional[torch.Tensor] = None,
95
+ position_ids: Optional[torch.LongTensor] = None,
96
+ image_flags: Optional[torch.LongTensor] = None,
97
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
98
+ labels: Optional[torch.LongTensor] = None,
99
+ use_cache: Optional[bool] = None,
100
+ output_attentions: Optional[bool] = None,
101
+ output_hidden_states: Optional[bool] = None,
102
+ return_dict: Optional[bool] = None,
103
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
104
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
105
+
106
+ image_flags = image_flags.squeeze(-1)
107
+ input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
108
+
109
+ vit_embeds = self.extract_feature(pixel_values)
110
+ vit_embeds = vit_embeds[image_flags == 1]
111
+ vit_batch_size = pixel_values.shape[0]
112
+
113
+ B, N, C = input_embeds.shape
114
+ input_embeds = input_embeds.reshape(B * N, C)
115
+
116
+ if torch.distributed.get_rank() == 0:
117
+ print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')
118
+
119
+ input_ids = input_ids.reshape(B * N)
120
+ selected = (input_ids == self.img_context_token_id)
121
+ try:
122
+ input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds.reshape(-1, C)
123
+ except Exception as e:
124
+ vit_embeds = vit_embeds.reshape(-1, C)
125
+ print(f'warning: {e}, input_embeds[selected].shape={input_embeds[selected].shape}, '
126
+ f'vit_embeds.shape={vit_embeds.shape}')
127
+ n_token = selected.sum()
128
+ input_embeds[selected] = input_embeds[selected] * 0.0 + vit_embeds[:n_token]
129
+
130
+ input_embeds = input_embeds.reshape(B, N, C)
131
+
132
+ outputs = self.language_model(
133
+ inputs_embeds=input_embeds,
134
+ attention_mask=attention_mask,
135
+ position_ids=position_ids,
136
+ past_key_values=past_key_values,
137
+ use_cache=use_cache,
138
+ output_attentions=output_attentions,
139
+ output_hidden_states=output_hidden_states,
140
+ return_dict=return_dict,
141
+ )
142
+ logits = outputs.logits
143
+
144
+ loss = None
145
+ if labels is not None:
146
+ # Shift so that tokens < n predict n
147
+ shift_logits = logits[..., :-1, :].contiguous()
148
+ shift_labels = labels[..., 1:].contiguous()
149
+ # Flatten the tokens
150
+ loss_fct = CrossEntropyLoss()
151
+ shift_logits = shift_logits.view(-1, self.language_model.config.vocab_size)
152
+ shift_labels = shift_labels.view(-1)
153
+ # Enable model parallelism
154
+ shift_labels = shift_labels.to(shift_logits.device)
155
+ loss = loss_fct(shift_logits, shift_labels)
156
+
157
+ if not return_dict:
158
+ output = (logits,) + outputs[1:]
159
+ return (loss,) + output if loss is not None else output
160
+
161
+ return CausalLMOutputWithPast(
162
+ loss=loss,
163
+ logits=logits,
164
+ past_key_values=outputs.past_key_values,
165
+ hidden_states=outputs.hidden_states,
166
+ attentions=outputs.attentions,
167
+ )
168
+
169
+ def pixel_shuffle(self, x, scale_factor=0.5):
170
+ n, w, h, c = x.size()
171
+ # N, W, H, C --> N, W, H * scale, C // scale
172
+ x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
173
+ # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
174
+ x = x.permute(0, 2, 1, 3).contiguous()
175
+ # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
176
+ x = x.view(n, int(h * scale_factor), int(w * scale_factor),
177
+ int(c / (scale_factor * scale_factor)))
178
+ if self.ps_version == 'v1':
179
+ warnings.warn("In ps_version 'v1', the height and width have not been swapped back, "
180
+ 'which results in a transposed image.')
181
+ else:
182
+ x = x.permute(0, 2, 1, 3).contiguous()
183
+ return x
184
+
185
+ def extract_feature(self, pixel_values):
186
+ if self.select_layer == -1:
187
+ vit_embeds = self.vision_model(
188
+ pixel_values=pixel_values,
189
+ output_hidden_states=False,
190
+ return_dict=True).last_hidden_state
191
+ else:
192
+ vit_embeds = self.vision_model(
193
+ pixel_values=pixel_values,
194
+ output_hidden_states=True,
195
+ return_dict=True).hidden_states[self.select_layer]
196
+ vit_embeds = vit_embeds[:, 1:, :]
197
+
198
+ h = w = int(vit_embeds.shape[1] ** 0.5)
199
+ os.environ['IMAGE_H'] = str(h)
200
+
201
+ vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
202
+ vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
203
+ vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
204
+ vit_embeds = self.mlp1(vit_embeds)
205
+ # import pdb; pdb.set_trace()
206
+ return vit_embeds
207
+
208
+ def batch_chat(self, tokenizer, pixel_values, questions, generation_config, num_patches_list=None,
209
+ history=None, return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
210
+ IMG_CONTEXT_TOKEN='<IMG_CONTEXT>', verbose=False, image_counts=None):
211
+ if history is not None or return_history:
212
+ print('Multi-turn chat is not yet supported in batch_chat.')
213
+ raise NotImplementedError
214
+
215
+ if image_counts is not None:
216
+ num_patches_list = image_counts
217
+ print('Warning: `image_counts` is deprecated. Please use `num_patches_list` instead.')
218
+
219
+ img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
220
+ self.img_context_token_id = img_context_token_id
221
+
222
+ if verbose and pixel_values is not None:
223
+ image_bs = pixel_values.shape[0]
224
+ print(f'dynamic ViT batch size: {image_bs}')
225
+
226
+ queries = []
227
+ for idx, num_patches in enumerate(num_patches_list):
228
+ question = questions[idx]
229
+ if pixel_values is not None and '<image>' not in question:
230
+ question = '<image>\n' + question
231
+ template = get_conv_template(self.template)
232
+ template.system_message = self.system_message
233
+ template.append_message(template.roles[0], question)
234
+ template.append_message(template.roles[1], None)
235
+ query = template.get_prompt()
236
+
237
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
238
+ query = query.replace('<image>', image_tokens, 1)
239
+ queries.append(query)
240
+
241
+ tokenizer.padding_side = 'left'
242
+ model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
243
+ input_ids = model_inputs['input_ids'].to(self.device)
244
+ attention_mask = model_inputs['attention_mask'].to(self.device)
245
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
246
+ generation_config['eos_token_id'] = eos_token_id
247
+ generation_output = self.generate(
248
+ pixel_values=pixel_values,
249
+ input_ids=input_ids,
250
+ attention_mask=attention_mask,
251
+ **generation_config
252
+ )
253
+ responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
254
+ responses = [response.split(template.sep)[0].strip() for response in responses]
255
+ return responses
256
+
257
+ def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
258
+ num_patches_list=None, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
259
+ verbose=False):
260
+
261
+ if history is None and pixel_values is not None and '<image>' not in question:
262
+ question = '<image>\n' + question
263
+
264
+ if num_patches_list is None:
265
+ num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
266
+ assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
267
+
268
+ img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
269
+ self.img_context_token_id = img_context_token_id
270
+
271
+ template = get_conv_template(self.template)
272
+ template.system_message = self.system_message
273
+ eos_token_id = tokenizer.convert_tokens_to_ids(template.sep)
274
+
275
+ history = [] if history is None else history
276
+ for (old_question, old_answer) in history:
277
+ template.append_message(template.roles[0], old_question)
278
+ template.append_message(template.roles[1], old_answer)
279
+ template.append_message(template.roles[0], question)
280
+ template.append_message(template.roles[1], None)
281
+ query = template.get_prompt()
282
+
283
+ if verbose and pixel_values is not None:
284
+ image_bs = pixel_values.shape[0]
285
+ print(f'dynamic ViT batch size: {image_bs}')
286
+
287
+ for num_patches in num_patches_list:
288
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
289
+ query = query.replace('<image>', image_tokens, 1)
290
+
291
+ model_inputs = tokenizer(query, return_tensors='pt')
292
+ input_ids = model_inputs['input_ids'].to(self.device)
293
+ attention_mask = model_inputs['attention_mask'].to(self.device)
294
+ generation_config['eos_token_id'] = eos_token_id
295
+ generation_output = self.generate(
296
+ pixel_values=pixel_values,
297
+ input_ids=input_ids,
298
+ attention_mask=attention_mask,
299
+ **generation_config
300
+ )
301
+ response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
302
+ response = response.split(template.sep)[0].strip()
303
+ history.append((question, response))
304
+ if return_history:
305
+ return response, history
306
+ else:
307
+ query_to_print = query.replace(IMG_CONTEXT_TOKEN, '')
308
+ query_to_print = query_to_print.replace(f'{IMG_START_TOKEN}{IMG_END_TOKEN}', '<image>')
309
+ if verbose:
310
+ print(query_to_print, response)
311
+ return response
312
+
313
+ @torch.no_grad()
314
+ def generate(
315
+ self,
316
+ pixel_values: Optional[torch.FloatTensor] = None,
317
+ input_ids: Optional[torch.FloatTensor] = None,
318
+ attention_mask: Optional[torch.LongTensor] = None,
319
+ visual_features: Optional[torch.FloatTensor] = None,
320
+ generation_config: Optional[GenerationConfig] = None,
321
+ output_hidden_states: Optional[bool] = None,
322
+ return_dict: Optional[bool] = None,
323
+ **generate_kwargs,
324
+ ) -> torch.LongTensor:
325
+
326
+ assert self.img_context_token_id is not None
327
+ if pixel_values is not None:
328
+ if visual_features is not None:
329
+ vit_embeds = visual_features
330
+ else:
331
+ vit_embeds = self.extract_feature(pixel_values)
332
+ input_embeds = self.language_model.get_input_embeddings()(input_ids)
333
+ B, N, C = input_embeds.shape
334
+ input_embeds = input_embeds.reshape(B * N, C)
335
+
336
+ input_ids = input_ids.reshape(B * N)
337
+ # import pdb; pdb.set_trace()
338
+ selected = (input_ids == self.img_context_token_id)
339
+ assert selected.sum() != 0
340
+ # import pdb; pdb.set_trace()
341
+ input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)
342
+
343
+ image_token_num = vit_embeds.shape[0] * vit_embeds.shape[1]
344
+ os.environ['IMAGE_TOKEN_NUM'] = str(image_token_num)
345
+ # import pdb; pdb.set_trace()
346
+ input_embeds = input_embeds.reshape(B, N, C)
347
+ else:
348
+
349
+ image_token_num = 0
350
+ os.environ['IMAGE_TOKEN_NUM'] = str(image_token_num)
351
+
352
+ input_embeds = self.language_model.get_input_embeddings()(input_ids)
353
+
354
+ outputs = self.language_model.generate(
355
+ inputs_embeds=input_embeds,
356
+ attention_mask=attention_mask,
357
+ generation_config=generation_config,
358
+ output_hidden_states=output_hidden_states,
359
+ return_dict=return_dict,
360
+ use_cache=True,
361
+ **generate_kwargs,
362
+ )
363
+
364
+ return outputs
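A rough usage sketch of the `chat` interface defined above. It assumes the repository is loaded with `trust_remote_code=True` and that `pixel_values` has already been produced by the image preprocessing pipeline (448x448 tiles normalized as in `preprocessor_config.json`), cast to the model's dtype and device:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL2-26B'
model = AutoModel.from_pretrained(path, torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

generation_config = dict(max_new_tokens=256, do_sample=False)

# pure-text turn: pixel_values=None skips the vision branch in `generate`
response = model.chat(tokenizer, None, 'Hello, who are you?', generation_config)

# single-image turn: pixel_values has shape (num_patches, 3, 448, 448)
response, history = model.chat(tokenizer, pixel_values, '<image>\nPlease describe the image.',
                               generation_config, return_history=True)
```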
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "crop_size": 448,
3
+ "do_center_crop": true,
4
+ "do_normalize": true,
5
+ "do_resize": true,
6
+ "feature_extractor_type": "CLIPFeatureExtractor",
7
+ "image_mean": [
8
+ 0.485,
9
+ 0.456,
10
+ 0.406
11
+ ],
12
+ "image_std": [
13
+ 0.229,
14
+ 0.224,
15
+ 0.225
16
+ ],
17
+ "resample": 3,
18
+ "size": 448
19
+ }
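The values above describe a standard CLIP-style preprocessing pipeline (resize and center-crop to 448, normalize with ImageNet mean/std, bicubic resampling since `resample=3` is PIL's BICUBIC). A rough torchvision equivalent, for illustration only:

```python
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

transform = T.Compose([
    T.Resize(448, interpolation=InterpolationMode.BICUBIC),  # "size": 448, "resample": 3
    T.CenterCrop(448),                                       # "crop_size": 448
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
```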
special_tokens_map.json ADDED
@@ -0,0 +1,41 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<img>",
4
+ "</img>",
5
+ "<IMG_CONTEXT>",
6
+ "<quad>",
7
+ "</quad>",
8
+ "<ref>",
9
+ "</ref>",
10
+ "<box>",
11
+ "</box>"
12
+ ],
13
+ "bos_token": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "eos_token": {
21
+ "content": "</s>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ },
27
+ "pad_token": {
28
+ "content": "</s>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ },
34
+ "unk_token": {
35
+ "content": "<unk>",
36
+ "lstrip": false,
37
+ "normalized": false,
38
+ "rstrip": false,
39
+ "single_word": false
40
+ }
41
+ }
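Among these special tokens, `<IMG_CONTEXT>` is the placeholder that `InternVLChatModel` swaps for visual embeddings, while `<img>`/`</img>` delimit the image block in the prompt. A quick sanity check, assuming the tokenizer is loaded as above:

```python
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'
img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
print(img_context_token_id)  # the id that `chat` assigns to self.img_context_token_id
```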
tokenization_internlm2.py ADDED
@@ -0,0 +1,235 @@
1
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
2
+ #
3
+ # This code is based on transformers/src/transformers/models/llama/tokenization_llama.py
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ """Tokenization classes for InternLM."""
18
+ import os
19
+ from shutil import copyfile
20
+ from typing import Any, Dict, List, Optional, Tuple
21
+
22
+ import sentencepiece as spm
23
+ from transformers.tokenization_utils import PreTrainedTokenizer
24
+ from transformers.utils import logging
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+ VOCAB_FILES_NAMES = {'vocab_file': './tokenizer.model'}
29
+
30
+ PRETRAINED_VOCAB_FILES_MAP = {}
31
+
32
+
33
+ # Modified from transformers.model.llama.tokenization_llama.LlamaTokenizer
34
+ class InternLM2Tokenizer(PreTrainedTokenizer):
35
+ """
36
+ Construct a InternLM2 tokenizer. Based on byte-level Byte-Pair-Encoding.
37
+
38
+ Args:
39
+ vocab_file (`str`):
40
+ Path to the vocabulary file.
41
+ """
42
+
43
+ vocab_files_names = VOCAB_FILES_NAMES
44
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
45
+ model_input_names = ['input_ids', 'attention_mask']
46
+ _auto_class = 'AutoTokenizer'
47
+
48
+ def __init__(
49
+ self,
50
+ vocab_file,
51
+ unk_token='<unk>',
52
+ bos_token='<s>',
53
+ eos_token='</s>',
54
+ pad_token='</s>',
55
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
56
+ add_bos_token=True,
57
+ add_eos_token=False,
58
+ decode_with_prefix_space=False,
59
+ clean_up_tokenization_spaces=False,
60
+ **kwargs,
61
+ ):
62
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
63
+ self.vocab_file = vocab_file
64
+ self.add_bos_token = add_bos_token
65
+ self.add_eos_token = add_eos_token
66
+ self.decode_with_prefix_space = decode_with_prefix_space
67
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
68
+ self.sp_model.Load(vocab_file)
69
+ self._no_prefix_space_tokens = None
70
+ super().__init__(
71
+ bos_token=bos_token,
72
+ eos_token=eos_token,
73
+ unk_token=unk_token,
74
+ pad_token=pad_token,
75
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
76
+ **kwargs,
77
+ )
78
+
79
+ @property
80
+ def no_prefix_space_tokens(self):
81
+ if self._no_prefix_space_tokens is None:
82
+ vocab = self.convert_ids_to_tokens(list(range(self.vocab_size)))
83
+ self._no_prefix_space_tokens = {i for i, tok in enumerate(vocab) if not tok.startswith('▁')}
84
+ return self._no_prefix_space_tokens
85
+
86
+ @property
87
+ def vocab_size(self):
88
+ """Returns vocab size"""
89
+ return self.sp_model.get_piece_size()
90
+
91
+ @property
92
+ def bos_token_id(self) -> Optional[int]:
93
+ return self.sp_model.bos_id()
94
+
95
+ @property
96
+ def eos_token_id(self) -> Optional[int]:
97
+ return self.sp_model.eos_id()
98
+
99
+ def get_vocab(self):
100
+ """Returns vocab as a dict"""
101
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
102
+ vocab.update(self.added_tokens_encoder)
103
+ return vocab
104
+
105
+ def _tokenize(self, text):
106
+ """Returns a tokenized string."""
107
+ return self.sp_model.encode(text, out_type=str)
108
+
109
+ def _convert_token_to_id(self, token):
110
+ """Converts a token (str) in an id using the vocab."""
111
+ return self.sp_model.piece_to_id(token)
112
+
113
+ def _convert_id_to_token(self, index):
114
+ """Converts an index (integer) in a token (str) using the vocab."""
115
+ token = self.sp_model.IdToPiece(index)
116
+ return token
117
+
118
+ def _maybe_add_prefix_space(self, tokens, decoded):
119
+ if tokens and tokens[0] not in self.no_prefix_space_tokens:
120
+ return ' ' + decoded
121
+ else:
122
+ return decoded
123
+
124
+ def convert_tokens_to_string(self, tokens):
125
+ """Converts a sequence of tokens (string) in a single string."""
126
+ current_sub_tokens = []
127
+ out_string = ''
128
+ prev_is_special = False
129
+ for token in tokens:
130
+ # make sure that special tokens are not decoded using sentencepiece model
131
+ if token in self.all_special_tokens:
132
+ if not prev_is_special:
133
+ out_string += ' '
134
+ out_string += self.sp_model.decode(current_sub_tokens) + token
135
+ prev_is_special = True
136
+ current_sub_tokens = []
137
+ else:
138
+ current_sub_tokens.append(token)
139
+ prev_is_special = False
140
+ out_string += self.sp_model.decode(current_sub_tokens)
141
+ out_string = self.clean_up_tokenization(out_string)
142
+ out_string = self._maybe_add_prefix_space(tokens=tokens, decoded=out_string)
143
+ return out_string[1:]
144
+
145
+ def save_vocabulary(self, save_directory, filename_prefix: Optional[str] = None) -> Tuple[str]:
146
+ """
147
+ Save the vocabulary and special tokens file to a directory.
148
+
149
+ Args:
150
+ save_directory (`str`):
151
+ The directory in which to save the vocabulary.
152
+
153
+ Returns:
154
+ `Tuple(str)`: Paths to the files saved.
155
+ """
156
+ if not os.path.isdir(save_directory):
157
+ logger.error(f'Vocabulary path ({save_directory}) should be a directory')
158
+ return
159
+ out_vocab_file = os.path.join(
160
+ save_directory, (filename_prefix + '-' if filename_prefix else '') + VOCAB_FILES_NAMES['vocab_file']
161
+ )
162
+
163
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
164
+ copyfile(self.vocab_file, out_vocab_file)
165
+ elif not os.path.isfile(self.vocab_file):
166
+ with open(out_vocab_file, 'wb') as fi:
167
+ content_spiece_model = self.sp_model.serialized_model_proto()
168
+ fi.write(content_spiece_model)
169
+
170
+ return (out_vocab_file,)
171
+
172
+ def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
173
+ if self.add_bos_token:
174
+ bos_token_ids = [self.bos_token_id]
175
+ else:
176
+ bos_token_ids = []
177
+
178
+ output = bos_token_ids + token_ids_0
179
+
180
+ if token_ids_1 is not None:
181
+ output = output + token_ids_1
182
+
183
+ if self.add_eos_token:
184
+ output = output + [self.eos_token_id]
185
+
186
+ return output
187
+
188
+ def get_special_tokens_mask(
189
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
190
+ ) -> List[int]:
191
+ """
192
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
193
+ special tokens using the tokenizer `prepare_for_model` method.
194
+
195
+ Args:
196
+ token_ids_0 (`List[int]`):
197
+ List of IDs.
198
+ token_ids_1 (`List[int]`, *optional*):
199
+ Optional second list of IDs for sequence pairs.
200
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
201
+ Whether or not the token list is already formatted with special tokens for the model.
202
+
203
+ Returns:
204
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
205
+ """
206
+ if already_has_special_tokens:
207
+ return super().get_special_tokens_mask(
208
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
209
+ )
210
+
211
+ if token_ids_1 is None:
212
+ return [1] + ([0] * len(token_ids_0)) + [1]
213
+ return [1] + ([0] * len(token_ids_0)) + [1, 1] + ([0] * len(token_ids_1)) + [1]
214
+
215
+ def create_token_type_ids_from_sequences(
216
+ self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
217
+ ) -> List[int]:
218
+ """
219
+ Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make
220
+ use of token type ids, therefore a list of zeros is returned.
221
+
222
+ Args:
223
+ token_ids_0 (`List[int]`):
224
+ List of IDs.
225
+ token_ids_1 (`List[int]`, *optional*):
226
+ Optional second list of IDs for sequence pairs.
227
+
228
+ Returns:
229
+ `List[int]`: List of zeros.
230
+ """
231
+ eos = [self.eos_token_id]
232
+
233
+ if token_ids_1 is None:
234
+ return len(token_ids_0 + eos) * [0]
235
+ return len(token_ids_0 + eos + token_ids_1 + eos) * [0]
tokenization_internlm2_fast.py ADDED
@@ -0,0 +1,211 @@
+ # Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on transformers/src/transformers/models/llama/tokenization_llama_fast.py
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ """Tokenization Fast class for InternLM."""
+ import os
+ from shutil import copyfile
+ from typing import Any, Dict, Optional, Tuple
+
+ from tokenizers import Tokenizer, decoders, normalizers, processors
+ from tokenizers.models import BPE
+ from transformers.convert_slow_tokenizer import (SLOW_TO_FAST_CONVERTERS,
+                                                  SentencePieceExtractor,
+                                                  SpmConverter)
+ from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
+ from transformers.utils import logging
+
+ from .tokenization_internlm2 import InternLM2Tokenizer
+
+ logger = logging.get_logger(__name__)
+
+ VOCAB_FILES_NAMES = {'vocab_file': './tokenizer.model'}
+
+
+ # Modified from transformers.convert_slow_tokenizer.LlamaConverter
+ class InternLM2Converter(SpmConverter):
+     handle_byte_fallback = True
+
+     def vocab(self, proto):
+         vocab = [
+             ('<unk>', 0.0),
+             ('<s>', 0.0),
+             ('</s>', 0.0),
+         ]
+         vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
+         return vocab
+
+     def unk_id(self, proto):
+         unk_id = 0
+         return unk_id
+
+     def decoder(self, replacement, add_prefix_space):
+         return decoders.Sequence(
+             [
+                 decoders.Replace('▁', ' '),
+                 decoders.ByteFallback(),
+                 decoders.Fuse(),
+                 decoders.Strip(content=' ', left=1),
+             ]
+         )
+
+     def tokenizer(self, proto):
+         model_type = proto.trainer_spec.model_type
+         vocab_scores = self.vocab(proto)
+         # special tokens
+         added_tokens = self.original_tokenizer.added_tokens_decoder
+         for i in range(len(vocab_scores)):
+             piece, score = vocab_scores[i]
+             if i in added_tokens:
+                 vocab_scores[i] = (added_tokens[i].content, score)
+         if model_type == 1:
+             raise RuntimeError('InternLM2 is supposed to be a BPE model!')
+
+         elif model_type == 2:
+             _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
+             bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
+             tokenizer = Tokenizer(
+                 BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
+             )
+             tokenizer.add_special_tokens(
+                 [added_token for index, added_token in added_tokens.items()]
+             )
+         else:
+             raise Exception(
+                 "You're trying to run a `Unigram` model but your file was trained with a different algorithm"
+             )
+
+         return tokenizer
+
+     def normalizer(self, proto):
+         normalizers_list = []
+         if proto.normalizer_spec.add_dummy_prefix:
+             normalizers_list.append(normalizers.Prepend(prepend='▁'))
+         normalizers_list.append(normalizers.Replace(pattern=' ', content='▁'))
+         return normalizers.Sequence(normalizers_list)
+
+     def pre_tokenizer(self, replacement, add_prefix_space):
+         return None
+
+
+ SLOW_TO_FAST_CONVERTERS['InternLM2Tokenizer'] = InternLM2Converter
+
+
+ # Modified from transformers.model.llama.tokenization_llama_fast.LlamaTokenizerFast -> InternLM2TokenizerFast
+ class InternLM2TokenizerFast(PreTrainedTokenizerFast):
+     vocab_files_names = VOCAB_FILES_NAMES
+     slow_tokenizer_class = InternLM2Tokenizer
+     padding_side = 'left'
+     model_input_names = ['input_ids', 'attention_mask']
+     _auto_class = 'AutoTokenizer'
+
+     def __init__(
+         self,
+         vocab_file,
+         unk_token='<unk>',
+         bos_token='<s>',
+         eos_token='</s>',
+         pad_token='</s>',
+         sp_model_kwargs: Optional[Dict[str, Any]] = None,
+         add_bos_token=True,
+         add_eos_token=False,
+         decode_with_prefix_space=False,
+         clean_up_tokenization_spaces=False,
+         **kwargs,
+     ):
+         super().__init__(
+             vocab_file=vocab_file,
+             unk_token=unk_token,
+             bos_token=bos_token,
+             eos_token=eos_token,
+             pad_token=pad_token,
+             sp_model_kwargs=sp_model_kwargs,
+             add_bos_token=add_bos_token,
+             add_eos_token=add_eos_token,
+             decode_with_prefix_space=decode_with_prefix_space,
+             clean_up_tokenization_spaces=clean_up_tokenization_spaces,
+             **kwargs,
+         )
+         self._add_bos_token = add_bos_token
+         self._add_eos_token = add_eos_token
+         self.update_post_processor()
+         self.vocab_file = vocab_file
+
+     @property
+     def can_save_slow_tokenizer(self) -> bool:
+         return os.path.isfile(self.vocab_file) if self.vocab_file else False
+
+     def update_post_processor(self):
+         """
+         Updates the underlying post processor with the current `bos_token` and `eos_token`.
+         """
+         bos = self.bos_token
+         bos_token_id = self.bos_token_id
+         if bos is None and self.add_bos_token:
+             raise ValueError('add_bos_token = True but bos_token = None')
+
+         eos = self.eos_token
+         eos_token_id = self.eos_token_id
+         if eos is None and self.add_eos_token:
+             raise ValueError('add_eos_token = True but eos_token = None')
+
+         single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
+         pair = f"{single}{(' '+bos+':1') if self.add_bos_token else ''} $B:1{(' '+eos+':1') if self.add_eos_token else ''}"
+
+         special_tokens = []
+         if self.add_bos_token:
+             special_tokens.append((bos, bos_token_id))
+         if self.add_eos_token:
+             special_tokens.append((eos, eos_token_id))
+         self._tokenizer.post_processor = processors.TemplateProcessing(
+             single=single, pair=pair, special_tokens=special_tokens
+         )
+
+     @property
+     def add_eos_token(self):
+         return self._add_eos_token
+
+     @property
+     def add_bos_token(self):
+         return self._add_bos_token
+
+     @add_eos_token.setter
+     def add_eos_token(self, value):
+         self._add_eos_token = value
+         self.update_post_processor()
+
+     @add_bos_token.setter
+     def add_bos_token(self, value):
+         self._add_bos_token = value
+         self.update_post_processor()
+
+     def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
+         if not self.can_save_slow_tokenizer:
+             raise ValueError(
+                 'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow '
+                 'tokenizer.'
+             )
+
+         if not os.path.isdir(save_directory):
+             logger.error(f'Vocabulary path ({save_directory}) should be a directory')
+             return
+         out_vocab_file = os.path.join(
+             save_directory, (filename_prefix + '-' if filename_prefix else '') + VOCAB_FILES_NAMES['vocab_file']
+         )
+
+         if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
+             copyfile(self.vocab_file, out_vocab_file)
+
+         return (out_vocab_file,)
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f868398fc4e05ee1e8aeba95ddf18ddcc45b8bce55d5093bead5bbf80429b48b
+ size 1477754
tokenizer_config.json ADDED
@@ -0,0 +1,173 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92538": {
+       "content": "<|plugin|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92539": {
+       "content": "<|interpreter|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92540": {
+       "content": "<|action_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92541": {
+       "content": "<|action_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92542": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92543": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92544": {
+       "content": "<img>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92545": {
+       "content": "</img>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92546": {
+       "content": "<IMG_CONTEXT>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92547": {
+       "content": "<quad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92548": {
+       "content": "</quad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92549": {
+       "content": "<ref>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92550": {
+       "content": "</ref>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92551": {
+       "content": "<box>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "92552": {
+       "content": "</box>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<img>",
+     "</img>",
+     "<IMG_CONTEXT>",
+     "<quad>",
+     "</quad>",
+     "<ref>",
+     "</ref>",
+     "<box>",
+     "</box>"
+   ],
+   "auto_map": {
+     "AutoTokenizer": [
+       "tokenization_internlm2.InternLM2Tokenizer",
+       null
+     ]
+   },
+   "bos_token": "<s>",
+   "chat_template": "{{ bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "model_max_length": 8192,
+   "pad_token": "</s>",
+   "tokenizer_class": "InternLM2Tokenizer",
+   "unk_token": "<unk>"
+ }
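
The tokenizer files added in this commit are meant to be loaded through `transformers` with `trust_remote_code=True`, which resolves the `auto_map` entry above to `tokenization_internlm2.InternLM2Tokenizer`. A minimal sketch of that flow is shown below; the repo id is assumed to be `OpenGVLab/InternVL2-26B`, and the expected prompt string follows directly from the `chat_template` in this `tokenizer_config.json`.

```python
# Minimal sketch (not part of the uploaded files): load the custom tokenizer and
# render a chat prompt with the template defined in tokenizer_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL2-26B',  # assumed repo id
    trust_remote_code=True,     # needed so auto_map can load tokenization_internlm2.py
    use_fast=False,             # selects the slow InternLM2Tokenizer registered above
)

messages = [{'role': 'user', 'content': 'Hello!'}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expected: <s><|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n
```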