.gitattributes CHANGED
@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
- examples/red-panda.mp4 filter=lfs diff=lfs merge=lfs -text
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -1,179 +1,110 @@
1
  ---
2
  license: mit
3
- pipeline_tag: image-text-to-text
4
- library_name: transformers
5
- base_model:
6
- - OpenGVLab/InternViT-6B-448px-V1-5
7
- - internlm/internlm2-chat-20b
8
- new_version: OpenGVLab/InternVL2_5-26B
9
- base_model_relation: merge
10
- language:
11
- - multilingual
12
- tags:
13
- - internvl
14
- - custom_code
15
  ---
16
 
17
- # InternVL-Chat-V1-5
18
-
19
- [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271)
20
-
21
- [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
22
-
23
- ## Introduction
24
-
25
  <p align="center">
26
- <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300">
27
  </p>
28
 
29
  > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
30
 
31
- We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
32
 
33
- We introduce three simple designs:
 
 
 
 
34
 
35
- 1. **Strong Vision Encoder:** we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred to and reused in different LLMs.
36
- 2. **Dynamic High-Resolution:** we divide images into 1 to 40 tiles of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports input of up to 4K resolution during inference.
37
- 3. **High-Quality Bilingual Dataset:** we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
38
 
39
  ## Model Details
40
-
41
  - **Model Type:** multimodal large language model (MLLM)
42
-
43
  - **Model Stats:**
44
-
45
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
46
  - Image size: dynamic resolution, max to 40 tiles of 448 x 448 (4K resolution).
47
  - Params: 25.5B
48
 
49
  - **Training Strategy:**
50
-
51
- - Learnable component in the pre-training stage: ViT + MLP
52
- - Learnable component in the fine-tuning stage: ViT + MLP + LLM
53
- - For more details on training hyperparameters, please see our [blog](https://internvl.github.io/blog/2024-04-30-InternVL-1.5/).
54
-
55
- ## Architecture
56
-
57
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YLvX3V-L0kwsyRn3Lhciw.png)
 
 
 
 
 
 
 
58
 
59
  ## Performance
60
 
61
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4b85G7txoJ_LpT19SZJ4A.png)
62
 
63
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/i2vp6zSHPS3UIr-1Q9cSe.png)
64
 
65
- - We use both the [InternVL](https://github.com/OpenGVLab/InternVL) and [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) repositories for model evaluation. Specifically, the results reported for DocVQA, ChartQA, InfoVQA, TextVQA, MME, AI2D, MMBench, CCBench, MMVet, and SEED-Image were obtained with the InternVL repository, while OCRBench, RealWorldQA, HallBench, and MathVista were evaluated with VLMEvalKit.
66
-
67
- Limitations: Although we have made efforts to ensure the safety of the model during the training process and to encourage the model to generate text that complies with ethical and legal requirements, the model may still produce unexpected outputs due to its size and probabilistic generation paradigm. For example, the generated responses may contain biases, discrimination, or other harmful content. Please do not propagate such content. We are not responsible for any consequences resulting from the dissemination of harmful information.
68
 
69
  ## Examples
70
 
71
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/YVr-93mvVMR6UFpGezns7.png)
72
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ivhj4QqcO2NHUa28DTDkK.png)
73
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/18GeOW10QVcSt5g--TgDY.png)
74
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/tGM_TwdV297H1fCxQ0PZU.png)
75
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/FwlSRBpKgURAVkXNOLoSp.png)
76
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/to3nOaAnyv-fGLEoNPLzz.png)
77
 
78
- ## Quick Start
79
 
80
- We provide an example code to run InternVL-Chat-V1-5 using `transformers`.
81
 
82
- > Please use transformers>=4.37.2 to ensure the model works normally.
83
 
84
- ### Model Loading
85
 
86
- #### 16-bit (bf16 / fp16)
87
 
88
- ```python
89
- import torch
90
- from transformers import AutoTokenizer, AutoModel
91
- path = "OpenGVLab/InternVL-Chat-V1-5"
92
- model = AutoModel.from_pretrained(
93
- path,
94
- torch_dtype=torch.bfloat16,
95
- low_cpu_mem_usage=True,
96
- use_flash_attn=True,
97
- trust_remote_code=True).eval().cuda()
98
- ```
99
 
100
- #### BNB 8-bit Quantization
101
 
102
- ```python
103
- import torch
104
- from transformers import AutoTokenizer, AutoModel
105
- path = "OpenGVLab/InternVL-Chat-V1-5"
106
- model = AutoModel.from_pretrained(
107
- path,
108
- torch_dtype=torch.bfloat16,
109
- load_in_8bit=True,
110
- low_cpu_mem_usage=True,
111
- use_flash_attn=True,
112
- trust_remote_code=True).eval()
113
- ```
114
 
115
- #### BNB 4-bit Quantization
116
 
117
- > **⚠️ Warning:** Due to significant quantization errors with BNB 4-bit quantization on InternViT-6B, the model may produce nonsensical outputs and fail to understand images. Therefore, please avoid using BNB 4-bit quantization.
118
 
119
- #### Multiple GPUs
120
 
121
- The reason for writing the code this way is to avoid errors that occur during multi-GPU inference due to tensors not being on the same device. By ensuring that the first and last layers of the large language model (LLM) are on the same device, we prevent such errors.
122
 
123
- ```python
124
- import math
125
- import torch
126
- from transformers import AutoTokenizer, AutoModel
127
 
128
- def split_model(model_name):
129
- device_map = {}
130
- world_size = torch.cuda.device_count()
131
- num_layers = {'Mini-InternVL-2B-V1-5': 24, 'Mini-InternVL-4B-V1-5': 32, 'InternVL-Chat-V1-5': 48}[model_name]
132
- # Since the first GPU will be used for ViT, treat it as half a GPU.
133
- num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
134
- num_layers_per_gpu = [num_layers_per_gpu] * world_size
135
- num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
136
- layer_cnt = 0
137
- for i, num_layer in enumerate(num_layers_per_gpu):
138
- for j in range(num_layer):
139
- device_map[f'language_model.model.layers.{layer_cnt}'] = i
140
- layer_cnt += 1
141
- device_map['vision_model'] = 0
142
- device_map['mlp1'] = 0
143
- device_map['language_model.model.tok_embeddings'] = 0
144
- device_map['language_model.model.embed_tokens'] = 0
145
- device_map['language_model.output'] = 0
146
- device_map['language_model.model.norm'] = 0
147
- device_map['language_model.lm_head'] = 0
148
- device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
149
-
150
- return device_map
151
 
152
- path = "OpenGVLab/InternVL-Chat-V1-5"
153
- device_map = split_model('InternVL-Chat-V1-5')
154
- model = AutoModel.from_pretrained(
155
- path,
156
- torch_dtype=torch.bfloat16,
157
- low_cpu_mem_usage=True,
158
- use_flash_attn=True,
159
- trust_remote_code=True,
160
- device_map=device_map).eval()
161
- ```
162
-
163
- ### Inference with Transformers
164
 
165
  ```python
166
- import numpy as np
 
 
 
167
  import torch
168
  import torchvision.transforms as T
169
- from decord import VideoReader, cpu
170
  from PIL import Image
 
171
  from torchvision.transforms.functional import InterpolationMode
172
- from transformers import AutoModel, AutoTokenizer
173
 
174
  IMAGENET_MEAN = (0.485, 0.456, 0.406)
175
  IMAGENET_STD = (0.229, 0.224, 0.225)
176
 
 
177
  def build_transform(input_size):
178
  MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
179
  transform = T.Compose([
@@ -184,6 +115,7 @@ def build_transform(input_size):
184
  ])
185
  return transform
186
 
 
187
  def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
188
  best_ratio_diff = float('inf')
189
  best_ratio = (1, 1)
@@ -199,7 +131,8 @@ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_
199
  best_ratio = ratio
200
  return best_ratio
201
 
202
- def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
 
203
  orig_width, orig_height = image.size
204
  aspect_ratio = orig_width / orig_height
205
 
@@ -237,7 +170,8 @@ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbna
237
  processed_images.append(thumbnail_img)
238
  return processed_images
239
 
240
- def load_image(image_file, input_size=448, max_num=12):
 
241
  image = Image.open(image_file).convert('RGB')
242
  transform = build_transform(input_size=input_size)
243
  images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
@@ -245,338 +179,92 @@ def load_image(image_file, input_size=448, max_num=12):
245
  pixel_values = torch.stack(pixel_values)
246
  return pixel_values
247
 
 
 
248
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
249
- # Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.
250
- path = 'OpenGVLab/InternVL-Chat-V1-5'
251
  model = AutoModel.from_pretrained(
252
  path,
253
  torch_dtype=torch.bfloat16,
254
  low_cpu_mem_usage=True,
255
- use_flash_attn=True,
256
  trust_remote_code=True).eval().cuda()
257
- tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
258
-
 
 
 
 
 
 
 
259
  # set the max number of tiles in `max_num`
260
- pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
261
- generation_config = dict(max_new_tokens=1024, do_sample=True)
262
 
263
- # pure-text conversation (纯文本对话)
264
- question = 'Hello, who are you?'
265
- response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
266
- print(f'User: {question}\nAssistant: {response}')
 
267
 
268
- question = 'Can you tell me a story?'
269
- response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
270
- print(f'User: {question}\nAssistant: {response}')
271
-
272
- # single-image single-round conversation (单图单轮对话)
273
- question = '<image>\nPlease describe the image shortly.'
274
  response = model.chat(tokenizer, pixel_values, question, generation_config)
275
- print(f'User: {question}\nAssistant: {response}')
276
 
277
- # single-image multi-round conversation (单图多轮对话)
278
- question = '<image>\nPlease describe the image in detail.'
279
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
280
- print(f'User: {question}\nAssistant: {response}')
281
 
282
- question = 'Please write a poem according to the image.'
283
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
284
- print(f'User: {question}\nAssistant: {response}')
285
 
286
- # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
287
- pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
288
- pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
289
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
290
 
291
- question = '<image>\nDescribe the two images in detail.'
292
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
293
- history=None, return_history=True)
294
- print(f'User: {question}\nAssistant: {response}')
295
-
296
- question = 'What are the similarities and differences between these two images.'
297
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
298
- history=history, return_history=True)
299
- print(f'User: {question}\nAssistant: {response}')
300
-
301
- # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
302
- pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
303
- pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
304
- pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
305
- num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
306
-
307
- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
308
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
309
- num_patches_list=num_patches_list,
310
- history=None, return_history=True)
311
- print(f'User: {question}\nAssistant: {response}')
312
-
313
- question = 'What are the similarities and differences between these two images.'
314
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
315
- num_patches_list=num_patches_list,
316
- history=history, return_history=True)
317
- print(f'User: {question}\nAssistant: {response}')
318
-
319
- # batch inference, single image per sample (单图批处理)
320
- pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
321
- pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
322
- num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
323
- pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
324
-
325
- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
326
- responses = model.batch_chat(tokenizer, pixel_values,
327
- num_patches_list=num_patches_list,
328
- questions=questions,
329
- generation_config=generation_config)
330
- for question, response in zip(questions, responses):
331
- print(f'User: {question}\nAssistant: {response}')
332
-
333
- # video multi-round conversation (视频多轮对话)
334
- def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
335
- if bound:
336
- start, end = bound[0], bound[1]
337
- else:
338
- start, end = -100000, 100000
339
- start_idx = max(first_idx, round(start * fps))
340
- end_idx = min(round(end * fps), max_frame)
341
- seg_size = float(end_idx - start_idx) / num_segments
342
- frame_indices = np.array([
343
- int(start_idx + (seg_size / 2) + np.round(seg_size * idx))
344
- for idx in range(num_segments)
345
- ])
346
- return frame_indices
347
-
348
- def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):
349
- vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
350
- max_frame = len(vr) - 1
351
- fps = float(vr.get_avg_fps())
352
-
353
- pixel_values_list, num_patches_list = [], []
354
- transform = build_transform(input_size=input_size)
355
- frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)
356
- for frame_index in frame_indices:
357
- img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')
358
- img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)
359
- pixel_values = [transform(tile) for tile in img]
360
- pixel_values = torch.stack(pixel_values)
361
- num_patches_list.append(pixel_values.shape[0])
362
- pixel_values_list.append(pixel_values)
363
- pixel_values = torch.cat(pixel_values_list)
364
- return pixel_values, num_patches_list
365
-
366
- video_path = './examples/red-panda.mp4'
367
- pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
368
- pixel_values = pixel_values.to(torch.bfloat16).cuda()
369
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
370
- question = video_prefix + 'What is the red panda doing?'
371
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
372
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
373
- num_patches_list=num_patches_list, history=None, return_history=True)
374
- print(f'User: {question}\nAssistant: {response}')
375
-
376
- question = 'Describe this video in detail.'
377
- response, history = model.chat(tokenizer, pixel_values, question, generation_config,
378
- num_patches_list=num_patches_list, history=history, return_history=True)
379
- print(f'User: {question}\nAssistant: {response}')
380
- ```
381
-
382
- #### Streaming Output
383
-
384
- Besides this method, you can also use the following code to get streamed output.
385
-
386
- ```python
387
- from transformers import TextIteratorStreamer
388
- from threading import Thread
389
-
390
- # Initialize the streamer
391
- streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
392
- # Define the generation configuration
393
- generation_config = dict(max_new_tokens=1024, do_sample=False, streamer=streamer)
394
- # Start the model chat in a separate thread
395
- thread = Thread(target=model.chat, kwargs=dict(
396
- tokenizer=tokenizer, pixel_values=pixel_values, question=question,
397
- history=None, return_history=False, generation_config=generation_config,
398
- ))
399
- thread.start()
400
-
401
- # Initialize an empty string to store the generated text
402
- generated_text = ''
403
- # Loop through the streamer to get the new text as it is generated
404
- for new_text in streamer:
405
- if new_text == model.conv_template.sep:
406
- break
407
- generated_text += new_text
408
- print(new_text, end='', flush=True) # Print each new chunk of generated text on the same line
409
- ```
410
-
411
- ## Finetune
412
-
413
- Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTuner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.
414
-
415
- ## Deployment
416
-
417
- ### LMDeploy
418
-
419
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs.
420
-
421
- ```sh
422
- pip install lmdeploy>=0.5.3
423
- ```
424
-
425
- LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
426
-
427
- #### A 'Hello, world' Example
428
-
429
- ```python
430
- from lmdeploy import pipeline, TurbomindEngineConfig
431
- from lmdeploy.vl import load_image
432
-
433
- model = 'OpenGVLab/InternVL-Chat-V1-5'
434
- image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
435
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
436
- response = pipe(('describe this image', image))
437
- print(response.text)
438
- ```
439
-
440
- If an `ImportError` occurs while running this example, please install the required dependencies as prompted.
441
-
442
- #### Multi-images Inference
443
-
444
- When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
445
-
446
- > Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.
447
-
448
- ```python
449
- from lmdeploy import pipeline, TurbomindEngineConfig
450
- from lmdeploy.vl import load_image
451
- from lmdeploy.vl.constants import IMAGE_TOKEN
452
-
453
- model = 'OpenGVLab/InternVL-Chat-V1-5'
454
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
455
-
456
- image_urls=[
457
- 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
458
- 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg'
459
- ]
460
-
461
- images = [load_image(img_url) for img_url in image_urls]
462
- # Numbering images improves multi-image conversations
463
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
464
- print(response.text)
465
- ```
466
-
467
- #### Batch Prompts Inference
468
-
469
- Conducting inference with batch prompts is quite straightforward; just place them within a list structure:
470
-
471
- ```python
472
- from lmdeploy import pipeline, TurbomindEngineConfig
473
- from lmdeploy.vl import load_image
474
-
475
- model = 'OpenGVLab/InternVL-Chat-V1-5'
476
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
477
-
478
- image_urls=[
479
- "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
480
- "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/det.jpg"
481
- ]
482
- prompts = [('describe this image', load_image(img_url)) for img_url in image_urls]
483
- response = pipe(prompts)
484
- print(response)
485
- ```
486
-
487
- #### Multi-turn Conversation
488
-
489
- There are two ways to conduct multi-turn conversations with the pipeline. One is to construct messages in the OpenAI format and use the method introduced above; the other is to use the `pipeline.chat` interface.
490
-
491
- ```python
492
- from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
493
- from lmdeploy.vl import load_image
494
-
495
- model = 'OpenGVLab/InternVL-Chat-V1-5'
496
- pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
497
-
498
- image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
499
- gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
500
- sess = pipe.chat(('describe this image', image), gen_config=gen_config)
501
- print(sess.response.text)
502
- sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
503
- print(sess.response.text)
504
- ```
505
-
506
- #### Service
507
-
508
- LMDeploy's `api_server` enables models to be easily packaged into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of starting the service:
509
 
510
- ```shell
511
- lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5 --server-port 23333
 
 
 
 
 
 
 
 
 
 
 
 
 
512
  ```
513
 
514
- To use the OpenAI-style interface, you need to install OpenAI:
515
-
516
- ```shell
517
- pip install openai
518
- ```
519
 
520
- Then, use the code below to make the API call:
521
 
522
- ```python
523
- from openai import OpenAI
524
-
525
- client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
526
- model_name = client.models.list().data[0].id
527
- response = client.chat.completions.create(
528
- model=model_name,
529
- messages=[{
530
- 'role':
531
- 'user',
532
- 'content': [{
533
- 'type': 'text',
534
- 'text': 'describe this image',
535
- }, {
536
- 'type': 'image_url',
537
- 'image_url': {
538
- 'url':
539
- 'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
540
- },
541
- }],
542
- }],
543
- temperature=0.8,
544
- top_p=0.8)
545
- print(response)
546
  ```
547
 
548
  ## License
549
 
550
- This project is released under the MIT License. This project uses the pre-trained internlm2-chat-20b as a component, which is licensed under the Apache License 2.0.
551
-
552
- ## Citation
553
 
554
- If you find this project useful in your research, please consider citing:
555
 
556
- ```BibTeX
557
- @article{chen2024expanding,
558
- title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
559
- author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
560
- journal={arXiv preprint arXiv:2412.05271},
561
- year={2024}
562
- }
563
- @article{gao2024mini,
564
- title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
565
- author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
566
- journal={arXiv preprint arXiv:2410.16261},
567
- year={2024}
568
- }
569
- @article{chen2024far,
570
- title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
571
- author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
572
- journal={arXiv preprint arXiv:2404.16821},
573
- year={2024}
574
- }
575
- @inproceedings{chen2024internvl,
576
- title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
577
- author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
578
- booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
579
- pages={24185--24198},
580
- year={2024}
581
- }
582
- ```
 
1
  ---
2
  license: mit
3
+ datasets:
4
+ - laion/laion2B-en
5
+ - laion/laion-coco
6
+ - laion/laion2B-multi
7
+ - kakaobrain/coyo-700m
8
+ - conceptual_captions
9
+ - wanng/wukong100m
10
+ pipeline_tag: visual-question-answering
 
 
 
 
11
  ---
12
 
13
+ # Model Card for InternVL-Chat-V1.5
 
 
 
 
 
 
 
14
  <p align="center">
15
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/D60YzQBIzvoCvLRp2gZ0A.jpeg" alt="Image Description" width="300" height="300" />
16
  </p>
17
 
18
  > _Two interns holding hands, symbolizing the integration of InternViT and InternLM._
19
 
20
+ \[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读 (Chinese Interpretation)](https://zhuanlan.zhihu.com/p/675877376)\]
21
 
22
+ We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
23
+ We introduce three simple designs:
24
+ 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred to and reused in different LLMs.
25
+ 2. Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448 &times; 448 pixels according to the aspect ratio and resolution of the input images, which supports input of up to 4K resolution (see the sketch below).
26
+ 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
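For intuition, here is a minimal sketch of how a tile grid could be chosen for a given input resolution. It mirrors the `find_closest_aspect_ratio` / `dynamic_preprocess` helpers included later in this card (tile side 448 px, tile count capped as described above) but omits their area-based tie-breaking, so treat it as an illustration rather than the exact preprocessing code.

```python
# Illustrative sketch of the dynamic high-resolution tiling described above.
# Assumptions: 448x448 tiles and a 1..40 tile budget, per the text; the actual
# preprocessing is dynamic_preprocess()/find_closest_aspect_ratio() below.
def choose_tile_grid(width, height, tile_size=448, min_num=1, max_num=40):
    aspect_ratio = width / height
    # Candidate grids (cols, rows) whose tile count lies in [min_num, max_num],
    # ordered by total tile count.
    candidates = sorted(
        {(i, j)
         for n in range(min_num, max_num + 1)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda r: r[0] * r[1])
    # Pick the grid whose aspect ratio is closest to the image's
    # (ties resolved in favour of fewer tiles here; the real code also weighs area).
    cols, rows = min(candidates, key=lambda r: abs(aspect_ratio - r[0] / r[1]))
    # The image is resized to (cols * tile_size, rows * tile_size) and cut into cols * rows tiles.
    return cols, rows, cols * rows

print(choose_tile_grid(3840, 2160))  # a 4K frame maps to a wide grid of 448x448 tiles
```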
27
 
 
 
 
28
 
29
  ## Model Details
 
30
  - **Model Type:** multimodal large language model (MLLM)
 
31
  - **Model Stats:**
 
32
  - Architecture: [InternViT-6B-448px-V1-5](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5) + MLP + [InternLM2-Chat-20B](https://huggingface.co/internlm/internlm2-chat-20b)
33
  - Image size: dynamic resolution, up to 40 tiles of 448 x 448 (4K resolution).
34
  - Params: 25.5B
35
 
36
  - **Training Strategy:**
37
+ - Pretraining Stage
38
+ - Learnable Component: ViT + MLP
39
+ - Data: Please see our technical report.
40
+ - SFT Stage
41
+ - Learnable Component: ViT + MLP + LLM (see the sketch after this list)
42
+ - Data: Please see our technical report.
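To make the two stages concrete, the sketch below shows which parameter groups would be unfrozen in each stage. It assumes the module names used by `InternVLChatModel` in this repository (`vision_model`, `mlp1`, `language_model`) and is purely illustrative; it is not the actual training code.

```python
# Illustration only: freeze everything, then unfreeze the components that are
# learnable in the given stage. Module names follow this repo's InternVLChatModel.
def set_trainable_components(model, stage):
    for param in model.parameters():
        param.requires_grad = False
    learnable = ['vision_model', 'mlp1']      # pre-training: ViT + MLP
    if stage == 'sft':
        learnable.append('language_model')    # fine-tuning: ViT + MLP + LLM
    for name in learnable:
        for param in getattr(model, name).parameters():
            param.requires_grad = True
```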
43
+
44
+ ## Released Models
45
+
46
+ | Model | Vision Foundation Model | Release Date | Note |
47
+ | :---------------------------------------------------------:|:--------------------------------------------------------------------------: |:----------------------:| :---------------------------------- |
48
+ | InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
49
+ | InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | trained with more SFT data; stronger performance |
50
+ | InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
51
+ | InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese; stronger OCR |
52
 
53
  ## Performance
54
 
55
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ZyQklQ3C7C60I-xOv7X8L.png)
56
 
 
57
 
58
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/u98oqlnpZtWdq2dnarVlD.png)
 
 
59
 
60
  ## Examples
61
 
62
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/R34jISP4K1U17m9yNP38O.png)
 
 
 
 
 
63
 
64
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/ChkU9XtlsjH0l2EqlO_is.png)
65
 
66
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/1TFxIcf96ANRPLoy4-rbh.png)
67
 
68
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/Wpjo1Sdwf7XcEDevqwcr-.png)
69
 
70
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/kO4-J38sN8TFtmQ5mIBMS.png)
71
 
72
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/qPnTe3Q9UBy8wbclOsmWk.png)
73
 
74
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/l_BILRi13CbZNzbZYn6o6.png)
 
 
 
 
 
 
 
 
 
 
75
 
76
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/2782y7RnvGBogYEIG__7S.png)
77
 
78
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/RyO35PTH14OFiwyxtAZM2.png)
 
 
 
 
 
 
 
 
 
 
 
79
 
80
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/xiLZXWL-JiCTVPnV_VxS2.png)
81
 
82
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/gqX46Tt5jvrcVqb0vcf06.png)
83
 
 
84
 
 
85
 
86
+ ## Model Usage
 
 
 
87
 
88
+ We provide example code for running InternVL-Chat-V1.5 with `transformers`.
 
 
 
 
89
 
90
+ You can also use our [online demo](https://internvl.opengvlab.com/) to quickly try out this model.
 
 
 
 
 
 
 
 
 
 
 
91
 
92
  ```python
93
+ import json
94
+ import os
95
+ from transformers import AutoTokenizer, AutoModel
96
+ from tqdm import tqdm
97
  import torch
98
  import torchvision.transforms as T
 
99
  from PIL import Image
100
+
101
  from torchvision.transforms.functional import InterpolationMode
102
+
103
 
104
  IMAGENET_MEAN = (0.485, 0.456, 0.406)
105
  IMAGENET_STD = (0.229, 0.224, 0.225)
106
 
107
+
108
  def build_transform(input_size):
109
  MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
110
  transform = T.Compose([
 
115
  ])
116
  return transform
117
 
118
+
119
  def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
120
  best_ratio_diff = float('inf')
121
  best_ratio = (1, 1)
 
131
  best_ratio = ratio
132
  return best_ratio
133
 
134
+
135
+ def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
136
  orig_width, orig_height = image.size
137
  aspect_ratio = orig_width / orig_height
138
 
 
170
  processed_images.append(thumbnail_img)
171
  return processed_images
172
 
173
+
174
+ def load_image(image_file, input_size=448, max_num=6):
175
  image = Image.open(image_file).convert('RGB')
176
  transform = build_transform(input_size=input_size)
177
  images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
 
179
  pixel_values = torch.stack(pixel_values)
180
  return pixel_values
181
 
182
+
183
+ path = "OpenGVLab/InternVL-Chat-V1-5"
184
  # If you have an 80G A100 GPU, you can put the entire model on a single GPU.
 
 
185
  model = AutoModel.from_pretrained(
186
  path,
187
  torch_dtype=torch.bfloat16,
188
  low_cpu_mem_usage=True,
 
189
  trust_remote_code=True).eval().cuda()
190
+ # Otherwise, you need to set device_map='auto' to use multiple GPUs for inference.
191
+ # model = AutoModel.from_pretrained(
192
+ # path,
193
+ # torch_dtype=torch.bfloat16,
194
+ # low_cpu_mem_usage=True,
195
+ # trust_remote_code=True,
196
+ # device_map='auto').eval()
197
+
198
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
199
  # set the max number of tiles in `max_num`
200
+ pixel_values = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
 
201
 
202
+ generation_config = dict(
203
+ num_beams=1,
204
+ max_new_tokens=512,
205
+ do_sample=False,
206
+ )
207
 
208
+ # single-round single-image conversation
209
+ question = "请详细描述图片"
 
 
 
 
210
  response = model.chat(tokenizer, pixel_values, question, generation_config)
211
+ print(question, response)
212
 
213
+ # multi-round single-image conversation
214
+ question = "请详细描述图片"
215
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
216
+ print(question, response)
217
 
218
+ question = "请根据图片写一首诗"
219
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
220
+ print(question, response)
221
 
222
+ # multi-round multi-image conversation
223
+ pixel_values1 = load_image('./examples/image1.jpg', max_num=6).to(torch.bfloat16).cuda()
224
+ pixel_values2 = load_image('./examples/image2.jpg', max_num=6).to(torch.bfloat16).cuda()
225
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
226
 
227
+ question = "详细描述这两张图片"
228
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
229
+ print(question, response)
230
+ # 第一张图片是一只红熊猫,它有着独特的橙红色皮毛,脸部、耳朵和四肢的末端有白色斑块。红熊猫的眼睛周围有深色的环,它的耳朵是圆形的,上面有白色的毛。它正坐在一个木制的结构上,看起来像是一个平台或休息的地方。背景中有树木和竹子,这表明红熊猫可能在一个模拟自然环境的动物园或保护区内。
231
+ #
232
+ # 第二张图片是一只大熊猫,它是中国的国宝,以其黑白相间的皮毛而闻名。大熊猫的眼睛、耳朵和四肢的末端是黑色的,而它的脸部、耳朵内侧和身体其他部分是白色的。大熊猫正坐在地上,周围有竹子,这是它们的主要食物来源。背景中也有树木,这表明大熊猫可能在一个为它们提供自然栖息地模拟的动物园或保护区内。
 
 
 
 
 
233
 
234
+ question = "这两张图片的相同点和区别分别是什么"
235
+ response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
236
+ print(question, response)
237
+ # 这两张图片的相同点:
238
+ #
239
+ # 1. 都展示了熊猫,这是两种不同的熊猫物种。
240
+ # 2. 熊猫都处于一个看起来像是模拟自然环境的场所,可能是动物园或保护区。
241
+ # 3. 熊猫周围都有竹子,这是它们的主要食物来源。
242
+ #
243
+ # 这两张图片的区别:
244
+ #
245
+ # 1. 熊猫的种类不同:第一张图片是一只红熊猫,第二张图片是一只大熊猫。
246
+ # 2. 熊猫的皮毛颜色和图案不同:红熊猫的皮毛是橙红色,脸部、耳朵和四肢的末端有白色斑块;而大熊猫的皮毛是黑白相间的,眼睛、耳朵和四肢的末端是黑色的,脸部、耳朵内侧和身体其他部分是白色的。
247
+ # 3. 熊猫的姿态和位置不同:红熊猫坐在一个木制的结构上,而大熊猫坐在地上。
248
+ # 4. 背景中的植被和环境细节略有不同,但都包含树木和竹子。
249
  ```
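Streaming output is also supported through `transformers`' `TextIteratorStreamer`, as documented in the other revision of this README shown above. A minimal sketch, reusing `model`, `tokenizer`, `pixel_values`, and `question` from the block above:

```python
from threading import Thread
from transformers import TextIteratorStreamer

# Run model.chat in a background thread and print tokens as they arrive.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True, timeout=10)
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False, streamer=streamer)

thread = Thread(target=model.chat, kwargs=dict(
    tokenizer=tokenizer, pixel_values=pixel_values, question=question,
    history=None, return_history=False, generation_config=generation_config))
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)
```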
250
 
251
+ ## Citation
 
 
 
 
252
 
253
+ If you find this project useful in your research, please consider citing:
254
 
255
+ ```BibTeX
256
+ @article{chen2023internvl,
257
+ title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
258
+ author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and Li, Bin and Luo, Ping and Lu, Tong and Qiao, Yu and Dai, Jifeng},
259
+ journal={arXiv preprint arXiv:2312.14238},
260
+ year={2023}
261
+ }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
262
  ```
263
 
264
  ## License
265
 
266
+ This project is released under the MIT license.
 
 
267
 
268
+ ## Acknowledgement
269
 
270
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
 
 
 
 
 
all_results.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "epoch": 1.0,
3
+ "train_loss": 0.8170236018231988,
4
+ "train_runtime": 190400.5325,
5
+ "train_samples": 5155291,
6
+ "train_samples_per_second": 27.076,
7
+ "train_steps_per_second": 0.026
8
+ }
config.json CHANGED
@@ -1,19 +1,19 @@
1
  {
2
  "_commit_hash": null,
 
3
  "architectures": [
4
  "InternVLChatModel"
5
  ],
6
  "auto_map": {
7
  "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
8
- "AutoModel": "modeling_internvl_chat.InternVLChatModel",
9
- "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel"
10
  },
11
- "system_message": "You are an AI assistant whose name is InternLM (书生·浦语).",
12
  "downsample_ratio": 0.5,
13
  "dynamic_image_size": true,
14
  "force_image_size": 448,
 
15
  "llm_config": {
16
- "_name_or_path": "internlm/internlm2-chat-20b",
17
  "add_cross_attention": false,
18
  "architectures": [
19
  "InternLM2ForCausalLM"
@@ -95,49 +95,108 @@
95
  "top_p": 1.0,
96
  "torch_dtype": "bfloat16",
97
  "torchscript": false,
98
- "transformers_version": "4.37.2",
99
  "typical_p": 1.0,
100
- "use_bfloat16": true,
101
- "use_cache": true,
102
  "vocab_size": 92553
103
  },
104
- "max_dynamic_patch": 12,
105
  "min_dynamic_patch": 1,
106
  "model_type": "internvl_chat",
 
107
  "ps_version": "v2",
108
  "select_layer": -1,
109
  "template": "internlm2-chat",
110
  "torch_dtype": "bfloat16",
 
111
  "use_backbone_lora": 0,
112
  "use_llm_lora": 0,
113
  "use_thumbnail": true,
114
  "vision_config": {
 
 
115
  "architectures": [
116
  "InternVisionModel"
117
  ],
118
  "attention_dropout": 0.0,
119
- "drop_path_rate": 0.0,
 
 
 
 
 
 
 
 
 
 
 
 
120
  "dropout": 0.0,
 
 
 
 
 
 
 
121
  "hidden_act": "gelu",
122
  "hidden_size": 3200,
 
 
 
 
123
  "image_size": 448,
124
  "initializer_factor": 0.1,
125
  "initializer_range": 1e-10,
126
  "intermediate_size": 12800,
 
 
 
 
 
 
127
  "layer_norm_eps": 1e-06,
 
 
 
128
  "model_type": "intern_vit_6b",
129
- "norm_type": "rms_norm",
130
  "num_attention_heads": 25,
 
 
131
  "num_channels": 3,
132
  "num_hidden_layers": 45,
 
133
  "output_attentions": false,
134
  "output_hidden_states": false,
 
 
135
  "patch_size": 14,
 
 
 
136
  "qk_normalization": true,
137
  "qkv_bias": false,
 
 
138
  "return_dict": true,
 
 
 
 
 
 
 
 
 
 
 
139
  "torch_dtype": "bfloat16",
140
- "transformers_version": "4.37.2",
 
 
141
  "use_bfloat16": true,
142
  "use_flash_attn": true
143
  }
 
1
  {
2
  "_commit_hash": null,
3
+ "_name_or_path": "./work_dirs/internvl_chat_internlm2_20b_448_dynamic_chinese_pretrain3/checkpoint-1600_replace_llm",
4
  "architectures": [
5
  "InternVLChatModel"
6
  ],
7
  "auto_map": {
8
  "AutoConfig": "configuration_internvl_chat.InternVLChatConfig",
9
+ "AutoModel": "modeling_internvl_chat.InternVLChatModel"
 
10
  },
 
11
  "downsample_ratio": 0.5,
12
  "dynamic_image_size": true,
13
  "force_image_size": 448,
14
+ "image_fold": null,
15
  "llm_config": {
16
+ "_name_or_path": "pretrained/internlm2-chat-20b/",
17
  "add_cross_attention": false,
18
  "architectures": [
19
  "InternLM2ForCausalLM"
 
95
  "top_p": 1.0,
96
  "torch_dtype": "bfloat16",
97
  "torchscript": false,
98
+ "transformers_version": "4.36.2",
99
  "typical_p": 1.0,
100
+ "use_bfloat16": false,
101
+ "use_cache": false,
102
  "vocab_size": 92553
103
  },
104
+ "max_dynamic_patch": 6,
105
  "min_dynamic_patch": 1,
106
  "model_type": "internvl_chat",
107
+ "pad2square": false,
108
  "ps_version": "v2",
109
  "select_layer": -1,
110
  "template": "internlm2-chat",
111
  "torch_dtype": "bfloat16",
112
+ "transformers_version": null,
113
  "use_backbone_lora": 0,
114
  "use_llm_lora": 0,
115
  "use_thumbnail": true,
116
  "vision_config": {
117
+ "_name_or_path": "work_dirs/internvl_chat_internlm2_20b_448_dynamic_chinese_pretrain/checkpoint-5200-vit",
118
+ "add_cross_attention": false,
119
  "architectures": [
120
  "InternVisionModel"
121
  ],
122
  "attention_dropout": 0.0,
123
+ "auto_map": {
124
+ "AutoConfig": "configuration_intern_vit.InternVisionConfig",
125
+ "AutoModel": "modeling_intern_vit.InternVisionModel"
126
+ },
127
+ "bad_words_ids": null,
128
+ "begin_suppress_tokens": null,
129
+ "bos_token_id": null,
130
+ "chunk_size_feed_forward": 0,
131
+ "cross_attention_hidden_size": null,
132
+ "decoder_start_token_id": null,
133
+ "diversity_penalty": 0.0,
134
+ "do_sample": false,
135
+ "drop_path_rate": 0.4,
136
  "dropout": 0.0,
137
+ "early_stopping": false,
138
+ "encoder_no_repeat_ngram_size": 0,
139
+ "eos_token_id": null,
140
+ "exponential_decay_length_penalty": null,
141
+ "finetuning_task": null,
142
+ "forced_bos_token_id": null,
143
+ "forced_eos_token_id": null,
144
  "hidden_act": "gelu",
145
  "hidden_size": 3200,
146
+ "id2label": {
147
+ "0": "LABEL_0",
148
+ "1": "LABEL_1"
149
+ },
150
  "image_size": 448,
151
  "initializer_factor": 0.1,
152
  "initializer_range": 1e-10,
153
  "intermediate_size": 12800,
154
+ "is_decoder": false,
155
+ "is_encoder_decoder": false,
156
+ "label2id": {
157
+ "LABEL_0": 0,
158
+ "LABEL_1": 1
159
+ },
160
  "layer_norm_eps": 1e-06,
161
+ "length_penalty": 1.0,
162
+ "max_length": 20,
163
+ "min_length": 0,
164
  "model_type": "intern_vit_6b",
165
+ "no_repeat_ngram_size": 0,
166
  "num_attention_heads": 25,
167
+ "num_beam_groups": 1,
168
+ "num_beams": 1,
169
  "num_channels": 3,
170
  "num_hidden_layers": 45,
171
+ "num_return_sequences": 1,
172
  "output_attentions": false,
173
  "output_hidden_states": false,
174
+ "output_scores": false,
175
+ "pad_token_id": null,
176
  "patch_size": 14,
177
+ "prefix": null,
178
+ "problem_type": null,
179
+ "pruned_heads": {},
180
  "qk_normalization": true,
181
  "qkv_bias": false,
182
+ "remove_invalid_values": false,
183
+ "repetition_penalty": 1.0,
184
  "return_dict": true,
185
+ "return_dict_in_generate": false,
186
+ "sep_token_id": null,
187
+ "suppress_tokens": null,
188
+ "task_specific_params": null,
189
+ "temperature": 1.0,
190
+ "tf_legacy_loss": false,
191
+ "tie_encoder_decoder": false,
192
+ "tie_word_embeddings": true,
193
+ "tokenizer_class": null,
194
+ "top_k": 50,
195
+ "top_p": 1.0,
196
  "torch_dtype": "bfloat16",
197
+ "torchscript": false,
198
+ "transformers_version": "4.36.2",
199
+ "typical_p": 1.0,
200
  "use_bfloat16": true,
201
  "use_flash_attn": true
202
  }
configuration_intern_vit.py CHANGED
@@ -1,9 +1,8 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
-
7
  import os
8
  from typing import Union
9
 
@@ -74,7 +73,6 @@ class InternVisionConfig(PretrainedConfig):
74
  num_hidden_layers=48,
75
  use_flash_attn=True,
76
  hidden_act='gelu',
77
- norm_type='rms_norm',
78
  layer_norm_eps=1e-6,
79
  dropout=0.0,
80
  drop_path_rate=0.0,
@@ -99,7 +97,6 @@ class InternVisionConfig(PretrainedConfig):
99
  self.attention_dropout = attention_dropout
100
  self.layer_norm_eps = layer_norm_eps
101
  self.hidden_act = hidden_act
102
- self.norm_type = norm_type
103
  self.qkv_bias = qkv_bias
104
  self.qk_normalization = qk_normalization
105
  self.use_flash_attn = use_flash_attn
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
 
6
  import os
7
  from typing import Union
8
 
 
73
  num_hidden_layers=48,
74
  use_flash_attn=True,
75
  hidden_act='gelu',
 
76
  layer_norm_eps=1e-6,
77
  dropout=0.0,
78
  drop_path_rate=0.0,
 
97
  self.attention_dropout = attention_dropout
98
  self.layer_norm_eps = layer_norm_eps
99
  self.hidden_act = hidden_act
 
100
  self.qkv_bias = qkv_bias
101
  self.qk_normalization = qk_normalization
102
  self.use_flash_attn = use_flash_attn
configuration_internvl_chat.py CHANGED
@@ -1,6 +1,6 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
 
@@ -26,10 +26,12 @@ class InternVLChatConfig(PretrainedConfig):
26
  llm_config=None,
27
  use_backbone_lora=0,
28
  use_llm_lora=0,
29
- select_layer=-1,
 
30
  force_image_size=None,
31
  downsample_ratio=0.5,
32
  template=None,
 
33
  dynamic_image_size=False,
34
  use_thumbnail=False,
35
  ps_version='v1',
@@ -39,26 +41,28 @@ class InternVLChatConfig(PretrainedConfig):
39
  super().__init__(**kwargs)
40
 
41
  if vision_config is None:
42
- vision_config = {'architectures': ['InternVisionModel']}
43
  logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
44
 
45
  if llm_config is None:
46
- llm_config = {'architectures': ['InternLM2ForCausalLM']}
47
  logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
48
 
49
  self.vision_config = InternVisionConfig(**vision_config)
50
- if llm_config.get('architectures')[0] == 'LlamaForCausalLM':
51
  self.llm_config = LlamaConfig(**llm_config)
52
- elif llm_config.get('architectures')[0] == 'InternLM2ForCausalLM':
53
  self.llm_config = InternLM2Config(**llm_config)
54
  else:
55
- raise ValueError('Unsupported architecture: {}'.format(llm_config.get('architectures')[0]))
56
  self.use_backbone_lora = use_backbone_lora
57
  self.use_llm_lora = use_llm_lora
 
58
  self.select_layer = select_layer
59
  self.force_image_size = force_image_size
60
  self.downsample_ratio = downsample_ratio
61
  self.template = template
 
62
  self.dynamic_image_size = dynamic_image_size
63
  self.use_thumbnail = use_thumbnail
64
  self.ps_version = ps_version # pixel shuffle version
@@ -66,6 +70,7 @@ class InternVLChatConfig(PretrainedConfig):
66
  self.max_dynamic_patch = max_dynamic_patch
67
 
68
  logger.info(f'vision_select_layer: {self.select_layer}')
 
69
  logger.info(f'ps_version: {self.ps_version}')
70
  logger.info(f'min_dynamic_patch: {self.min_dynamic_patch}')
71
  logger.info(f'max_dynamic_patch: {self.max_dynamic_patch}')
@@ -83,10 +88,12 @@ class InternVLChatConfig(PretrainedConfig):
83
  output['model_type'] = self.__class__.model_type
84
  output['use_backbone_lora'] = self.use_backbone_lora
85
  output['use_llm_lora'] = self.use_llm_lora
 
86
  output['select_layer'] = self.select_layer
87
  output['force_image_size'] = self.force_image_size
88
  output['downsample_ratio'] = self.downsample_ratio
89
  output['template'] = self.template
 
90
  output['dynamic_image_size'] = self.dynamic_image_size
91
  output['use_thumbnail'] = self.use_thumbnail
92
  output['ps_version'] = self.ps_version
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
 
 
26
  llm_config=None,
27
  use_backbone_lora=0,
28
  use_llm_lora=0,
29
+ pad2square=False,
30
+ select_layer=-4,
31
  force_image_size=None,
32
  downsample_ratio=0.5,
33
  template=None,
34
+ image_fold=False,
35
  dynamic_image_size=False,
36
  use_thumbnail=False,
37
  ps_version='v1',
 
41
  super().__init__(**kwargs)
42
 
43
  if vision_config is None:
44
+ vision_config = {}
45
  logger.info('vision_config is None. Initializing the InternVisionConfig with default values.')
46
 
47
  if llm_config is None:
48
+ llm_config = {}
49
  logger.info('llm_config is None. Initializing the LlamaConfig config with default values (`LlamaConfig`).')
50
 
51
  self.vision_config = InternVisionConfig(**vision_config)
52
+ if llm_config['architectures'][0] == 'LlamaForCausalLM':
53
  self.llm_config = LlamaConfig(**llm_config)
54
+ elif llm_config['architectures'][0] == 'InternLM2ForCausalLM':
55
  self.llm_config = InternLM2Config(**llm_config)
56
  else:
57
+ raise ValueError('Unsupported architecture: {}'.format(llm_config['architectures'][0]))
58
  self.use_backbone_lora = use_backbone_lora
59
  self.use_llm_lora = use_llm_lora
60
+ self.pad2square = pad2square
61
  self.select_layer = select_layer
62
  self.force_image_size = force_image_size
63
  self.downsample_ratio = downsample_ratio
64
  self.template = template
65
+ self.image_fold = image_fold
66
  self.dynamic_image_size = dynamic_image_size
67
  self.use_thumbnail = use_thumbnail
68
  self.ps_version = ps_version # pixel shuffle version
 
70
  self.max_dynamic_patch = max_dynamic_patch
71
 
72
  logger.info(f'vision_select_layer: {self.select_layer}')
73
+ logger.info(f'image_fold: {self.image_fold}')
74
  logger.info(f'ps_version: {self.ps_version}')
75
  logger.info(f'min_dynamic_patch: {self.min_dynamic_patch}')
76
  logger.info(f'max_dynamic_patch: {self.max_dynamic_patch}')
 
88
  output['model_type'] = self.__class__.model_type
89
  output['use_backbone_lora'] = self.use_backbone_lora
90
  output['use_llm_lora'] = self.use_llm_lora
91
+ output['pad2square'] = self.pad2square
92
  output['select_layer'] = self.select_layer
93
  output['force_image_size'] = self.force_image_size
94
  output['downsample_ratio'] = self.downsample_ratio
95
  output['template'] = self.template
96
+ output['image_fold'] = self.image_fold
97
  output['dynamic_image_size'] = self.dynamic_image_size
98
  output['use_thumbnail'] = self.use_thumbnail
99
  output['ps_version'] = self.ps_version
conversation.py CHANGED
@@ -2,7 +2,7 @@
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
- If you have changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
@@ -330,6 +330,384 @@ def get_conv_template(name: str) -> Conversation:
330
  return conv_templates[name].copy()
331
 
332
 
 
 
 
 
333
  register_conv_template(
334
  Conversation(
335
  name='Hermes-2',
@@ -343,7 +721,7 @@ register_conv_template(
343
  6,
344
  7,
345
  8,
346
- ],
347
  stop_str='<|endoftext|>',
348
  )
349
  )
@@ -365,19 +743,519 @@ register_conv_template(
365
  )
366
  )
367
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
368
 
 
 
 
369
  register_conv_template(
370
  Conversation(
371
- name='phi3-chat',
372
- system_template='<|system|>\n{system_message}',
373
- system_message='You are an AI assistant whose name is Phi-3.',
374
- roles=('<|user|>\n', '<|assistant|>\n'),
375
- sep_style=SeparatorStyle.MPT,
376
- sep='<|end|>',
 
 
 
 
377
  stop_token_ids=[
 
 
378
  2,
379
- 32000,
380
- 32007
381
- ]
 
 
 
 
382
  )
383
  )
 
 
 
 
 
2
  Conversation prompt templates.
3
 
4
  We kindly request that you import fastchat instead of copying this file if you wish to use it.
5
+ If you have any changes in mind, please contribute back so the community can benefit collectively and continue to maintain these valuable templates.
6
  """
7
 
8
  import dataclasses
 
330
  return conv_templates[name].copy()
331
 
332
 
333
+ # An empty template for raw conversation.
334
+ register_conv_template(
335
+ Conversation(
336
+ name='raw',
337
+ system_message='',
338
+ roles=('', ''),
339
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
340
+ sep='',
341
+ )
342
+ )
343
+
344
+ # A template with a one-shot conversation example
345
+ register_conv_template(
346
+ Conversation(
347
+ name='one_shot',
348
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
349
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
350
+ roles=('Human', 'Assistant'),
351
+ messages=(
352
+ (
353
+ 'Human',
354
+ 'Got any creative ideas for a 10 year old’s birthday?',
355
+ ),
356
+ (
357
+ 'Assistant',
358
+ """Of course! Here are some creative ideas for a 10-year-old's birthday party:
359
+ 1. Treasure Hunt: Organize a treasure hunt in your backyard or nearby park. Create clues and riddles for the kids to solve, leading them to hidden treasures and surprises.
360
+ 2. Science Party: Plan a science-themed party where kids can engage in fun and interactive experiments. You can set up different stations with activities like making slime, erupting volcanoes, or creating simple chemical reactions.
361
+ 3. Outdoor Movie Night: Set up a backyard movie night with a projector and a large screen or white sheet. Create a cozy seating area with blankets and pillows, and serve popcorn and snacks while the kids enjoy a favorite movie under the stars.
362
+ 4. DIY Crafts Party: Arrange a craft party where kids can unleash their creativity. Provide a variety of craft supplies like beads, paints, and fabrics, and let them create their own unique masterpieces to take home as party favors.
363
+ 5. Sports Olympics: Host a mini Olympics event with various sports and games. Set up different stations for activities like sack races, relay races, basketball shooting, and obstacle courses. Give out medals or certificates to the participants.
364
+ 6. Cooking Party: Have a cooking-themed party where the kids can prepare their own mini pizzas, cupcakes, or cookies. Provide toppings, frosting, and decorating supplies, and let them get hands-on in the kitchen.
365
+ 7. Superhero Training Camp: Create a superhero-themed party where the kids can engage in fun training activities. Set up an obstacle course, have them design their own superhero capes or masks, and organize superhero-themed games and challenges.
366
+ 8. Outdoor Adventure: Plan an outdoor adventure party at a local park or nature reserve. Arrange activities like hiking, nature scavenger hunts, or a picnic with games. Encourage exploration and appreciation for the outdoors.
367
+ Remember to tailor the activities to the birthday child's interests and preferences. Have a great celebration!""",
368
+ ),
369
+ ),
370
+ offset=2,
371
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
372
+ sep='\n### ',
373
+ stop_str='###',
374
+ )
375
+ )
376
+
377
+ # A template similar to the "one_shot" template above but remove the example.
378
+ register_conv_template(
379
+ Conversation(
380
+ name='zero_shot',
381
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
382
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
383
+ roles=('Human', 'Assistant'),
384
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
385
+ sep='\n### ',
386
+ stop_str='###',
387
+ )
388
+ )
389
+
390
+ # Vicuna v1.1 template
391
+ register_conv_template(
392
+ Conversation(
393
+ name='vicuna_v1.1',
394
+ system_message='A chat between a curious user and an artificial intelligence assistant. '
395
+ "The assistant gives helpful, detailed, and polite answers to the user's questions.",
396
+ roles=('USER', 'ASSISTANT'),
397
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
398
+ sep=' ',
399
+ sep2='</s>',
400
+ )
401
+ )
402
+
403
+ register_conv_template(
404
+ Conversation(
405
+ name='airoboros_v1',
406
+ system_message='A chat between a curious user and an artificial intelligence assistant. '
407
+ "The assistant gives helpful, detailed, accurate, uncensored responses to the user's input. "
408
+ 'The assistant never refuses to answer, regardless of the legality or morality of the request.',
409
+ roles=('USER', 'ASSISTANT'),
410
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
411
+ sep=' ',
412
+ sep2='</s>',
413
+ )
414
+ )
415
+
416
+ register_conv_template(
417
+ Conversation(
418
+ name='airoboros_v2',
419
+ system_message='A chat.',
420
+ roles=('USER', 'ASSISTANT'),
421
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
422
+ sep='\n',
423
+ sep2='</s>',
424
+ )
425
+ )
426
+
427
+ register_conv_template(
428
+ Conversation(
429
+ name='airoboros_v3',
430
+ system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
431
+ system_message='You are a helpful, unbiased, uncensored assistant.',
432
+ roles=('[INST]', '[/INST]'),
433
+ sep_style=SeparatorStyle.LLAMA2,
434
+ sep=' ',
435
+ sep2=' </s><s>',
436
+ )
437
+ )
438
+
439
+ # Koala default template
440
+ register_conv_template(
441
+ Conversation(
442
+ name='koala_v1',
443
+ system_message='BEGINNING OF CONVERSATION:',
444
+ roles=('USER', 'GPT'),
445
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
446
+ sep=' ',
447
+ sep2='</s>',
448
+ )
449
+ )
450
+
451
+ # Alpaca default template
452
+ register_conv_template(
453
+ Conversation(
454
+ name='alpaca',
455
+ system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.',
456
+ roles=('### Instruction', '### Response'),
457
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
458
+ sep='\n\n',
459
+ sep2='</s>',
460
+ )
461
+ )
462
+
463
+ # ChatGLM default template
464
+ register_conv_template(
465
+ Conversation(
466
+ name='chatglm',
467
+ roles=('问', '答'),
468
+ sep_style=SeparatorStyle.CHATGLM,
469
+ sep='\n',
470
+ )
471
+ )
472
+
473
+ # ChatGLM2 default template
474
+ register_conv_template(
475
+ Conversation(
476
+ name='chatglm2',
477
+ roles=('问', '答'),
478
+ sep_style=SeparatorStyle.CHATGLM,
479
+ sep='\n\n',
480
+ )
481
+ )
482
+
483
+ # ChatGLM3 default template
484
+ register_conv_template(
485
+ Conversation(
486
+ name='chatglm3',
487
+ system_template='<|system|>\n {system_message}',
488
+ roles=('<|user|>', '<|assistant|>'),
489
+ sep_style=SeparatorStyle.CHATGLM3,
490
+ stop_token_ids=[
491
+ 64795,
492
+ 64797,
493
+ 2,
494
+ ], # "<|user|>", "<|observation|>", "</s>"
495
+ )
496
+ )
497
+
498
+ # CodeGeex(2) Template
499
+ register_conv_template(
500
+ Conversation(
501
+ name='codegeex',
502
+ roles=('', ''),
503
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
504
+ sep='\n\n',
505
+ stop_token_ids=[0, 2],
506
+ )
507
+ )
508
+
509
+ # Dolly V2 default template
510
+ register_conv_template(
511
+ Conversation(
512
+ name='dolly_v2',
513
+ system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n',
514
+ roles=('### Instruction', '### Response'),
515
+ sep_style=SeparatorStyle.DOLLY,
516
+ sep='\n\n',
517
+ sep2='### End',
518
+ )
519
+ )
520
+
521
+ # OpenAssistant Pythia default template
522
+ register_conv_template(
523
+ Conversation(
524
+ name='oasst_pythia',
525
+ roles=('<|prompter|>', '<|assistant|>'),
526
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
527
+ sep='<|endoftext|>',
528
+ )
529
+ )
530
+
531
+ # OpenAssistant default template
532
+ register_conv_template(
533
+ Conversation(
534
+ name='oasst_llama',
535
+ roles=('<|prompter|>', '<|assistant|>'),
536
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
537
+ sep='</s>',
538
+ )
539
+ )
540
+
541
+ # OpenChat 3.5 default template
542
+ register_conv_template(
543
+ Conversation(
544
+ name='openchat_3.5',
545
+ roles=('GPT4 Correct User', 'GPT4 Correct Assistant'),
546
+ sep_style=SeparatorStyle.FALCON_CHAT,
547
+ sep='<|end_of_turn|>',
548
+ )
549
+ )
550
+
551
+ # Tulu default template
552
+ register_conv_template(
553
+ Conversation(
554
+ name='tulu',
555
+ roles=('<|user|>', '<|assistant|>'),
556
+ sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
557
+ sep='\n',
558
+ )
559
+ )
560
+
561
+ # StableLM Alpha default template
562
+ register_conv_template(
563
+ Conversation(
564
+ name='stablelm',
565
+ system_template='<|SYSTEM|>{system_message}',
566
+ system_message="""# StableLM Tuned (Alpha version)
567
+ - StableLM is a helpful and harmless open-source AI language model developed by StabilityAI.
568
+ - StableLM is excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
569
+ - StableLM is more than just an information source, StableLM is also able to write poetry, short stories, and make jokes.
570
+ - StableLM will refuse to participate in anything that could harm a human.
571
+ """,
572
+ roles=('<|USER|>', '<|ASSISTANT|>'),
573
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
574
+ sep='',
575
+ stop_token_ids=[50278, 50279, 50277, 1, 0],
576
+ )
577
+ )
578
+
579
+ # Baize default template
580
+ register_conv_template(
581
+ Conversation(
582
+ name='baize',
583
+ system_message='The following is a conversation between a human and an AI assistant named Baize (named after a mythical creature in Chinese folklore). Baize is an open-source AI assistant developed by UCSD and Sun Yat-Sen University. The human and the AI assistant take turns chatting. Human statements start with [|Human|] and AI assistant statements start with [|AI|]. The AI assistant always provides responses in as much detail as possible, and in Markdown format. The AI assistant always declines to engage with topics, questions and instructions related to unethical, controversial, or sensitive issues. Complete the transcript in exactly that format.\n',
584
+ roles=('[|Human|]', '[|AI|]'),
585
+ messages=(
586
+ ('[|Human|]', 'Hello!'),
587
+ ('[|AI|]', 'Hi!'),
588
+ ),
589
+ offset=2,
590
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
591
+ sep='\n',
592
+ stop_str='[|Human|]',
593
+ )
594
+ )
595
+
596
+ # RWKV-4-Raven default template
597
+ register_conv_template(
598
+ Conversation(
599
+ name='rwkv',
600
+ roles=('Bob', 'Alice'),
601
+ messages=(
602
+ ('Bob', 'hi'),
603
+ (
604
+ 'Alice',
605
+ 'Hi. I am your assistant and I will provide expert full response in full details. Please feel free to ask any question and I will always answer it.',
606
+ ),
607
+ ),
608
+ offset=2,
609
+ sep_style=SeparatorStyle.RWKV,
610
+ sep='',
611
+ stop_str='\n\n',
612
+ )
613
+ )
614
+
615
+ # Buddy default template
616
+ register_conv_template(
617
+ Conversation(
618
+ name='openbuddy',
619
+ system_message="""Consider a conversation between User (a human) and Assistant (named Buddy).
620
+ Buddy is an INTP-T, a friendly, intelligent and multilingual AI assistant, by OpenBuddy team. GitHub: https://github.com/OpenBuddy/OpenBuddy
621
+ Buddy cannot access the Internet.
622
+ Buddy can fluently speak the user's language (e.g. English, Chinese).
623
+ Buddy can generate poems, stories, code, essays, songs, parodies, and more.
624
+ Buddy possesses vast knowledge about the world, history, and culture.
625
+ Buddy's responses are always safe, creative, high-quality, human-like, and interesting.
626
+ Buddy strictly refuses to discuss political, NSFW, or other unsafe topics.
627
+
628
+ User: Hi.
629
+ Assistant: Hi, I'm Buddy, your AI assistant. How can I help you today?""",
630
+ roles=('User', 'Assistant'),
631
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
632
+ sep='\n',
633
+ )
634
+ )
635
+
636
+ # Phoenix default template
637
+ register_conv_template(
638
+ Conversation(
639
+ name='phoenix',
640
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
641
+ roles=('Human', 'Assistant'),
642
+ sep_style=SeparatorStyle.PHOENIX,
643
+ sep='</s>',
644
+ )
645
+ )
646
+
647
+ # ReaLM default template
648
+ register_conv_template(
649
+ Conversation(
650
+ name='ReaLM-7b-v1',
651
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
652
+ roles=('Human', 'Assistant'),
653
+ sep_style=SeparatorStyle.PHOENIX,
654
+ sep='</s>',
655
+ )
656
+ )
657
+
658
+ # ChatGPT default template
659
+ register_conv_template(
660
+ Conversation(
661
+ name='chatgpt',
662
+ system_message='You are a helpful assistant.',
663
+ roles=('user', 'assistant'),
664
+ sep_style=None,
665
+ sep=None,
666
+ )
667
+ )
668
+
669
+ # Claude default template
670
+ register_conv_template(
671
+ Conversation(
672
+ name='claude',
673
+ roles=('Human', 'Assistant'),
674
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
675
+ sep='\n\n',
676
+ )
677
+ )
678
+
679
+ # MPT default template
680
+ register_conv_template(
681
+ Conversation(
682
+ name='mpt-7b-chat',
683
+ system_template="""<|im_start|>system
684
+ {system_message}""",
685
+ system_message="""- You are a helpful assistant chatbot trained by MosaicML.
686
+ - You answer questions.
687
+ - You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
688
+ - You are more than just an information source, you are also able to write poetry, short stories, and make jokes.""",
689
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
690
+ sep_style=SeparatorStyle.CHATML,
691
+ sep='<|im_end|>',
692
+ stop_token_ids=[50278, 0],
693
+ )
694
+ )
695
+
696
+ # MPT-30b-chat default template
697
+ register_conv_template(
698
+ Conversation(
699
+ name='mpt-30b-chat',
700
+ system_template="""<|im_start|>system
701
+ {system_message}""",
702
+ system_message="""A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
703
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
704
+ sep_style=SeparatorStyle.CHATML,
705
+ sep='<|im_end|>',
706
+ stop_token_ids=[50278, 0],
707
+ )
708
+ )
709
+
710
+
711
  register_conv_template(
712
  Conversation(
713
  name='Hermes-2',
 
721
  6,
722
  7,
723
  8,
724
+ ], # "<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|im_sep|>"
725
  stop_str='<|endoftext|>',
726
  )
727
  )
 
743
  )
744
  )
745
 
746
+ # Lemur-70b-chat default template
747
+ # reference: https://huggingface.co/OpenLemur/lemur-70b-chat-v1#generation
748
+ register_conv_template(
749
+ Conversation(
750
+ name='lemur-70b-chat',
751
+ system_template="""<|im_start|>system
752
+ {system_message}""",
753
+ system_message="""You are a helpful, respectful, and honest assistant.""",
754
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
755
+ sep_style=SeparatorStyle.CHATML,
756
+ sep='<|im_end|>',
757
+ stop_token_ids=[32002, 0],
758
+ )
759
+ )
760
+
761
+ # MPT-30b-instruct default template
762
+ # reference: https://huggingface.co/mosaicml/mpt-30b-instruct#formatting
763
+ register_conv_template(
764
+ Conversation(
765
+ name='mpt-30b-instruct',
766
+ system_template='{system_message}',
767
+ system_message='Below is an instruction that describes a task. Write a response that appropriately completes the request.',
768
+ roles=('### Instruction', '### Response'),
769
+ sep_style=SeparatorStyle.ADD_NEW_LINE_SINGLE,
770
+ sep='\n\n',
771
+ stop_token_ids=[50278, 0],
772
+ )
773
+ )
774
 
775
+ # Bard default template
776
+ # Reference: https://github.com/google/generative-ai-python/blob/9c99bcb474a991a97a2e7d62fcdb52db7ce40729/google/generativeai/discuss.py#L150
777
+ # https://github.com/google/generative-ai-python/blob/9c99bcb474a991a97a2e7d62fcdb52db7ce40729/google/generativeai/discuss.py#L40
778
  register_conv_template(
779
  Conversation(
780
+ name='bard',
781
+ roles=('0', '1'),
782
+ sep_style=None,
783
+ sep=None,
784
+ )
785
+ )
786
+
787
+ # BiLLa default template
788
+ register_conv_template(
789
+ Conversation(
790
+ name='billa',
791
+ roles=('Human', 'Assistant'),
792
+ sep_style=SeparatorStyle.ADD_COLON_SPACE_SINGLE,
793
+ sep='\n',
794
+ stop_str='Human:',
795
+ )
796
+ )
797
+
798
+ # RedPajama INCITE default template
799
+ register_conv_template(
800
+ Conversation(
801
+ name='redpajama-incite',
802
+ roles=('<human>', '<bot>'),
803
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
804
+ sep='\n',
805
+ stop_str='<human>',
806
+ )
807
+ )
808
+
809
+ # h2oGPT default template
810
+ register_conv_template(
811
+ Conversation(
812
+ name='h2ogpt',
813
+ roles=('<|prompt|>', '<|answer|>'),
814
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
815
+ sep='</s>',
816
+ )
817
+ )
818
+
819
+ # Robin default template
820
+ register_conv_template(
821
+ Conversation(
822
+ name='Robin',
823
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.",
824
+ roles=('###Human', '###Assistant'),
825
+ sep_style=SeparatorStyle.ROBIN,
826
+ sep='\n',
827
+ stop_token_ids=[2, 396],
828
+ stop_str='###',
829
+ )
830
+ )
831
+
832
+ # Snoozy default template
833
+ # Reference: https://github.com/nomic-ai/gpt4all/blob/d4861030b778da6db59d21d2927a4aba4f9f1f43/gpt4all-bindings/python/gpt4all/gpt4all.py#L232
834
+ register_conv_template(
835
+ Conversation(
836
+ name='snoozy',
837
+ system_template='### Instruction:\n{system_message}',
838
+ system_message='The prompt below is a question to answer, a task to complete, or a conversation to respond to; decide which and write an appropriate response.',
839
+ roles=('### Prompt', '### Response'),
840
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
841
+ sep='\n',
842
+ stop_str='###',
843
+ )
844
+ )
845
+
846
+ # manticore default template
847
+ register_conv_template(
848
+ Conversation(
849
+ name='manticore',
850
+ roles=('USER', 'ASSISTANT'),
851
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
852
+ sep='\n',
853
+ sep2='</s>',
854
+ )
855
+ )
856
+
857
+ # Falcon default template
858
+ register_conv_template(
859
+ Conversation(
860
+ name='falcon',
861
+ roles=('User', 'Assistant'),
862
+ messages=[],
863
+ sep_style=SeparatorStyle.RWKV,
864
+ sep='\n',
865
+ sep2='<|endoftext|>',
866
+ stop_str='\nUser', # use stop_str to stop generation after stop_token_ids, it will also remove stop_str from the generated text
867
  stop_token_ids=[
868
+ 0,
869
+ 1,
870
  2,
871
+ 3,
872
+ 4,
873
+ 5,
874
+ 6,
875
+ 7,
876
+ 8,
877
+ 9,
878
+ 10,
879
+ 11,
880
+ ], # it better only put special tokens here, because tokenizer only remove special tokens
881
+ )
882
+ )
883
+
884
+ # ChangGPT default template
885
+ register_conv_template(
886
+ Conversation(
887
+ name='polyglot_changgpt',
888
+ roles=('B', 'A'),
889
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
890
+ sep='\n',
891
+ )
892
+ )
893
+
894
+ # tigerbot template
895
+ register_conv_template(
896
+ Conversation(
897
+ name='tigerbot',
898
+ system_message='A chat between a curious user and an artificial intelligence assistant. '
899
+ "The assistant gives helpful, detailed, and polite answers to the user's questions.",
900
+ roles=('### Instruction', '### Response'),
901
+ sep_style=SeparatorStyle.ROBIN,
902
+ sep='\n\n',
903
+ stop_str='###',
904
+ )
905
+ )
906
+
907
+ # ref: https://huggingface.co/Salesforce/xgen-7b-8k-inst
908
+ register_conv_template(
909
+ Conversation(
910
+ name='xgen',
911
+ system_message="A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
912
+ roles=('### Human', '### Assistant'),
913
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
914
+ sep='\n',
915
+ stop_token_ids=[50256],
916
+ )
917
+ )
918
+
919
+ # Internlm-chat template
920
+ register_conv_template(
921
+ Conversation(
922
+ name='internlm-chat',
923
+ system_message="A chat between a curious <|User|> and an <|Bot|>. The <|Bot|> gives helpful, detailed, and polite answers to the <|User|>'s questions.\n\n",
924
+ roles=('<|User|>', '<|Bot|>'),
925
+ sep_style=SeparatorStyle.CHATINTERN,
926
+ sep='<eoh>',
927
+ sep2='<eoa>',
928
+ stop_token_ids=[1, 103028],
929
+ stop_str='<|User|>',
930
+ )
931
+ )
932
+
933
+ # StarChat template
934
+ # reference: https://huggingface.co/spaces/HuggingFaceH4/starchat-playground/blob/main/dialogues.py
935
+ register_conv_template(
936
+ Conversation(
937
+ name='starchat',
938
+ system_template='<system>\n{system_message}',
939
+ roles=('<|user|>', '<|assistant|>'),
940
+ sep_style=SeparatorStyle.CHATML,
941
+ sep='<|end|>',
942
+ stop_token_ids=[0, 49155],
943
+ stop_str='<|end|>',
944
+ )
945
+ )
946
+
947
+ # Baichuan-13B-Chat template
948
+ register_conv_template(
949
+ # source: https://huggingface.co/baichuan-inc/Baichuan-13B-Chat/blob/19ef51ba5bad8935b03acd20ff04a269210983bc/modeling_baichuan.py#L555
950
+ # https://huggingface.co/baichuan-inc/Baichuan-13B-Chat/blob/main/generation_config.json
951
+ # https://github.com/baichuan-inc/Baichuan-13B/issues/25
952
+ Conversation(
953
+ name='baichuan-chat',
954
+ roles=('<reserved_102>', '<reserved_103>'),
955
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
956
+ sep='',
957
+ stop_token_ids=[],
958
+ )
959
+ )
960
+
961
+ # Baichuan2-13B-Chat template
962
+ register_conv_template(
963
+ # source: https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/c6f8592a60b4ad73c210b28dd2ab3cca51abbf93/modeling_baichuan.py#L773
964
+ # https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat/blob/main/generation_config.json
965
+ # https://github.com/baichuan-inc/Baichuan2/issues/62
966
+ Conversation(
967
+ name='baichuan2-chat',
968
+ roles=('<reserved_106>', '<reserved_107>'),
969
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
970
+ sep='',
971
+ stop_token_ids=[],
972
+ )
973
+ )
974
+
975
+ # Mistral template
976
+ # source: https://docs.mistral.ai/llm/mistral-instruct-v0.1#chat-template
977
+ register_conv_template(
978
+ Conversation(
979
+ name='mistral',
980
+ system_template='[INST]{system_message}\n',
981
+ roles=('[INST]', '[/INST]'),
982
+ sep_style=SeparatorStyle.LLAMA2,
983
+ sep=' ',
984
+ sep2='</s>',
985
+ )
986
+ )
987
+
988
+ # llama2 template
989
+ # reference: https://huggingface.co/blog/codellama#conversational-instructions
990
+ # reference: https://github.com/facebookresearch/llama/blob/1a240688810f8036049e8da36b073f63d2ac552c/llama/generation.py#L212
991
+ register_conv_template(
992
+ Conversation(
993
+ name='llama-2',
994
+ system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
995
+ roles=('[INST]', '[/INST]'),
996
+ sep_style=SeparatorStyle.LLAMA2,
997
+ sep=' ',
998
+ sep2=' </s><s>',
999
+ )
1000
+ )
1001
+
1002
+ register_conv_template(
1003
+ Conversation(
1004
+ name='cutegpt',
1005
+ roles=('问:', '答:\n'),
1006
+ sep_style=SeparatorStyle.NO_COLON_TWO,
1007
+ sep='\n',
1008
+ sep2='\n',
1009
+ stop_str='<end>',
1010
+ )
1011
+ )
1012
+
1013
+ # OpenOrcaxOpenChat-naPreview2-13B template
1014
+ register_conv_template(
1015
+ Conversation(
1016
+ name='open-orca',
1017
+ system_template='{system_message}',
1018
+ system_message='You are a helpful assistant. Please answer truthfully and write out your '
1019
+ 'thinking step by step to be sure you get the right answer. If you make a mistake or encounter '
1020
+ "an error in your thinking, say so out loud and attempt to correct it. If you don't know or "
1021
+ "aren't sure about something, say so clearly. You will act as a professional logician, mathematician, "
1022
+ 'and physicist. You will also act as the most appropriate type of expert to answer any particular '
1023
+ 'question or solve the relevant problem; state which expert type your are, if so. Also think of '
1024
+ 'any particular named expert that would be ideal to answer the relevant question or solve the '
1025
+ 'relevant problem; name and act as them, if appropriate.',
1026
+ roles=('User', 'Assistant'),
1027
+ sep_style=SeparatorStyle.ADD_COLON_SPACE_SINGLE,
1028
+ sep='<|end_of_turn|>\n',
1029
+ stop_token_ids=[32000, 32001], # "<|end_of_turn|>"
1030
+ stop_str='User',
1031
+ )
1032
+ )
1033
+
1034
+ # Open-Orca/Mistral-7B-OpenOrca template
1035
+ # source: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca
1036
+ # reference: https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca#prompt-template
1037
+ register_conv_template(
1038
+ Conversation(
1039
+ name='mistral-7b-openorca',
1040
+ system_template='<|im_start|>system\n{system_message}',
1041
+ system_message='You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!',
1042
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
1043
+ sep_style=SeparatorStyle.CHATML,
1044
+ sep='<|im_end|>',
1045
+ stop_token_ids=[32000, 32001],
1046
+ )
1047
+ )
1048
+
1049
+ # Qwen-chat default template
1050
+ # source: https://huggingface.co/Qwen/Qwen-7B-Chat/blob/main/qwen_generation_utils.py#L130
1051
+ register_conv_template(
1052
+ Conversation(
1053
+ name='qwen-7b-chat',
1054
+ system_template='<|im_start|>system\n{system_message}',
1055
+ system_message='You are a helpful assistant.',
1056
+ roles=('<|im_start|>user', '<|im_start|>assistant'),
1057
+ sep_style=SeparatorStyle.CHATML,
1058
+ sep='<|im_end|>',
1059
+ stop_token_ids=[
1060
+ 151643,
1061
+ 151644,
1062
+ 151645,
1063
+ ], # "<|endoftext|>", "<|im_start|>", "<|im_end|>"
1064
+ stop_str='<|endoftext|>',
1065
+ )
1066
+ )
1067
+
1068
+
1069
+ # AquilaChat default template
1070
+ # source: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-chat/cyg_conversation.py
1071
+ register_conv_template(
1072
+ Conversation(
1073
+ name='aquila-chat',
1074
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
1075
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
1076
+ roles=('Human', 'Assistant'),
1077
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
1078
+ sep='###',
1079
+ sep2='',
1080
+ stop_str=['###', '</s>', '[UNK]'],
1081
+ )
1082
+ )
1083
+ # AquilaChat2-34B default template
1084
+ # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L212
1085
+ register_conv_template(
1086
+ Conversation(
1087
+ name='aquila-legacy',
1088
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
1089
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n",
1090
+ roles=('### Human: ', '### Assistant: '),
1091
+ offset=0,
1092
+ sep_style=SeparatorStyle.NO_COLON_TWO,
1093
+ sep='\n',
1094
+ sep2='</s>',
1095
+ stop_str=['</s>', '[UNK]'],
1096
+ )
1097
+ )
1098
+ # AquilaChat2-7B-16K and AquilaChat2-34B-16K default template
1099
+ # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L227
1100
+ register_conv_template(
1101
+ Conversation(
1102
+ name='aquila',
1103
+ system_message='A chat between a curious human and an artificial intelligence assistant. '
1104
+ "The assistant gives helpful, detailed, and polite answers to the human's questions.",
1105
+ roles=('Human', 'Assistant'),
1106
+ offset=0,
1107
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
1108
+ sep='###',
1109
+ sep2='</s>',
1110
+ stop_str=['</s>', '[UNK]'],
1111
+ )
1112
+ )
1113
+
1114
+ # AquilaChat2-7B default template
1115
+ # source: https://huggingface.co/BAAI/AquilaChat2-34B/blob/4608b75855334b93329a771aee03869dbf7d88cc/predict.py#L242
1116
+ register_conv_template(
1117
+ Conversation(
1118
+ name='aquila-v1',
1119
+ roles=('<|startofpiece|>', '<|endofpiece|>'),
1120
+ offset=0,
1121
+ sep_style=SeparatorStyle.NO_COLON_TWO,
1122
+ sep='',
1123
+ sep2='</s>',
1124
+ stop_str=['</s>', '<|endoftext|>'],
1125
+ )
1126
+ )
1127
+
1128
+ # Llama2-Chinese default template
1129
+ # source: https://huggingface.co/FlagAlpha
1130
+ register_conv_template(
1131
+ Conversation(
1132
+ name='llama2-chinese',
1133
+ system_template='<s>{system_message}</s>',
1134
+ roles=('Human', 'Assistant', 'System'),
1135
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
1136
+ sep='\n',
1137
+ sep2='\n</s><s>',
1138
+ stop_str='</s>',
1139
  )
1140
  )
1141
+
1142
+ # Vigogne Instruct default template
1143
+ # source: https://github.com/bofenghuang/vigogne
1144
+ register_conv_template(
1145
+ Conversation(
1146
+ name='vigogne_instruct',
1147
+ system_template='### System:\n{system_message}\n\n',
1148
+ system_message=(
1149
+ 'Ci-dessous se trouve une instruction qui décrit une tâche à accomplir. Rédigez une réponse qui répond de manière'
1150
+ ' précise à la demande.'
1151
+ ),
1152
+ roles=('### Instruction', '### Response'),
1153
+ sep_style=SeparatorStyle.DOLLY,
1154
+ sep='\n\n',
1155
+ sep2='</s>',
1156
+ )
1157
+ )
1158
+
1159
+ # Vigogne Chat default template
1160
+ register_conv_template(
1161
+ Conversation(
1162
+ name='vigogne_chat_v2',
1163
+ system_template='<|system|>: {system_message}',
1164
+ system_message=(
1165
+ 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez'
1166
+ ' autant que vous le pouvez.'
1167
+ ),
1168
+ roles=('<|user|>', '<|assistant|>'),
1169
+ sep_style=SeparatorStyle.ADD_COLON_TWO,
1170
+ sep='\n',
1171
+ sep2='</s>\n',
1172
+ stop_str='<|user|>',
1173
+ )
1174
+ )
1175
+
1176
+ register_conv_template(
1177
+ Conversation(
1178
+ name='vigogne_chat_v3',
1179
+ system_template='[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n',
1180
+ system_message=(
1181
+ 'Vous êtes Vigogne, un assistant IA créé par Zaion Lab. Vous suivez extrêmement bien les instructions. Aidez'
1182
+ ' autant que vous le pouvez.'
1183
+ ),
1184
+ roles=('[INST]', '[/INST]'),
1185
+ sep_style=SeparatorStyle.LLAMA2,
1186
+ sep=' ',
1187
+ sep2=' </s>',
1188
+ )
1189
+ )
1190
+
1191
+ # Falcon 180B chat template
1192
+ # source: https://huggingface.co/spaces/tiiuae/falcon-180b-demo/blob/d1590ee7fae9b6ce331ba7808e61a29dcce9239f/app.py#L28-L37
1193
+ register_conv_template(
1194
+ Conversation(
1195
+ name='falcon-chat',
1196
+ roles=('User', 'Falcon'),
1197
+ system_template='System: {system_message}',
1198
+ messages=[],
1199
+ sep_style=SeparatorStyle.FALCON_CHAT,
1200
+ sep='\n',
1201
+ sep2='<|endoftext|>',
1202
+ stop_str='\nUser:', # use stop_str to stop generation after stop_token_ids, it will also remove stop_str from the generated text
1203
+ )
1204
+ )
1205
+
1206
+ # Phind template
1207
+ # source: https://huggingface.co/Phind/Phind-CodeLlama-34B-v2
1208
+ register_conv_template(
1209
+ Conversation(
1210
+ name='phind',
1211
+ system_message='### System Prompt\nYou are an intelligent programming assistant.',
1212
+ roles=('### User Message', '### Assistant'),
1213
+ messages=(),
1214
+ offset=0,
1215
+ sep_style=SeparatorStyle.ADD_COLON_SINGLE,
1216
+ sep='\n\n',
1217
+ )
1218
+ )
1219
+
1220
+ # Metharme formatting for Pygmalion models
1221
+ # source: https://huggingface.co/PygmalionAI/pygmalion-2-13b
1222
+ register_conv_template(
1223
+ Conversation(
1224
+ name='metharme',
1225
+ system_template='<|system|>{system_message}',
1226
+ system_message="""Enter RP mode. You shall reply to the user while staying
1227
+ in character. Your responses must be detailed, creative, immersive, and drive the scenario
1228
+ forward.""",
1229
+ roles=('<|user|>', '<|model|>'),
1230
+ sep_style=SeparatorStyle.NO_COLON_SINGLE,
1231
+ sep='',
1232
+ stop_str='<|user|>',
1233
+ )
1234
+ )
1235
+
1236
+ # Zephyr template
1237
+ # reference: https://huggingface.co/spaces/HuggingFaceH4/zephyr-playground/blob/main/dialogues.py
1238
+ register_conv_template(
1239
+ Conversation(
1240
+ name='zephyr',
1241
+ system_template='<|system|>\n{system_message}',
1242
+ roles=('<|user|>', '<|assistant|>'),
1243
+ sep_style=SeparatorStyle.CHATML,
1244
+ sep='</s>',
1245
+ stop_token_ids=[2],
1246
+ stop_str='</s>',
1247
+ )
1248
+ )
1249
+
1250
+ # InternVL-ZH template
1251
+ register_conv_template(
1252
+ Conversation(
1253
+ name='internvl_zh',
1254
+ system_template='',
1255
+ roles=('<human>', '<bot>'),
1256
+ sep_style=SeparatorStyle.INTERNVL_ZH,
1257
+ sep=' ',
1258
+ sep2='</s>',
1259
+ )
1260
+ )
1261
+
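
The block above registers a long list of fastchat-style conversation templates. As a quick orientation for readers of this diff, here is a minimal usage sketch (an editor's illustration, not part of the commit): it assumes `conversation.py` is importable and uses only the API visible in this file (`get_conv_template`, `append_message`, `get_prompt`), the same calls made by `modeling_internvl_chat.py` further down.

```python
# Minimal sketch: turn a registered template into a prompt string.
# Assumes conversation.py from this repo is on the import path.
from conversation import get_conv_template

template = get_conv_template('Hermes-2')           # returns a copy of the registered template
template.append_message(template.roles[0], 'Describe the image in detail.')
template.append_message(template.roles[1], None)   # leave the assistant turn open
prompt = template.get_prompt()                     # ChatML-formatted string for Hermes-2
print(prompt)
```
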
examples/image1.jpg DELETED
Binary file (78.1 kB)
 
examples/image2.jpg DELETED
Binary file (126 kB)
 
generation_config.json CHANGED
@@ -1,8 +1,4 @@
1
  {
2
  "_from_model_config": true,
3
- "transformers_version": "4.37.2",
4
- "eos_token_id": [
5
- 92542,
6
- 92543
7
- ]
8
  }
 
1
  {
2
  "_from_model_config": true,
3
+ "transformers_version": "4.36.2"
4
  }
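
This change drops the explicit `eos_token_id` list (92542, 92543) from `generation_config.json`, so `generate()` calls no longer pick those stop tokens up by default. A hedged sketch of how a caller might supply them per call instead (editor's illustration; the special-token strings are assumptions about the InternLM2 tokenizer, not taken from this diff, and `model`, `tokenizer`, `pixel_values`, `input_ids`, `attention_mask` are assumed to come from the usual loading and preprocessing flow):

```python
# Editor's sketch: pass stop tokens explicitly when they are not in generation_config.json.
# The special-token strings below are assumptions; look them up in the tokenizer you load.
stop_tokens = ['<|im_end|>', '<|im_start|>']
eos_token_id = [tokenizer.convert_tokens_to_ids(t) for t in stop_tokens]

outputs = model.generate(
    pixel_values=pixel_values,
    input_ids=input_ids,
    attention_mask=attention_mask,
    eos_token_id=eos_token_id,   # forwarded through **generate_kwargs to the language model
    max_new_tokens=512,
)
```
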
modeling_intern_vit.py CHANGED
@@ -1,9 +1,8 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
-
7
  from typing import Optional, Tuple, Union
8
 
9
  import torch
@@ -21,12 +20,18 @@ from transformers.utils import logging
21
  from .configuration_intern_vit import InternVisionConfig
22
 
23
  try:
24
  from flash_attn.bert_padding import pad_input, unpad_input
25
- from flash_attn.flash_attn_interface import \
26
- flash_attn_varlen_qkvpacked_func
27
  has_flash_attn = True
28
  except:
29
- print('FlashAttention2 is not installed.')
30
  has_flash_attn = False
31
 
32
  logger = logging.get_logger(__name__)
@@ -42,12 +47,12 @@ class FlashAttention(nn.Module):
42
  attention_dropout: The dropout rate to apply to the attention
43
  (default: 0.0)
44
  """
45
-
46
  def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
47
  super().__init__()
48
  self.softmax_scale = softmax_scale
49
  self.dropout_p = attention_dropout
50
-
51
  def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
52
  max_s=None, need_weights=False):
53
  """Implements the multihead softmax attention.
@@ -60,7 +65,7 @@ class FlashAttention(nn.Module):
60
  assert not need_weights
61
  assert qkv.dtype in [torch.float16, torch.bfloat16]
62
  assert qkv.is_cuda
63
-
64
  if cu_seqlens is None:
65
  batch_size = qkv.shape[0]
66
  seqlen = qkv.shape[1]
@@ -69,7 +74,7 @@ class FlashAttention(nn.Module):
69
  max_s = seqlen
70
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
71
  device=qkv.device)
72
- output = flash_attn_varlen_qkvpacked_func(
73
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
74
  softmax_scale=self.softmax_scale, causal=causal
75
  )
@@ -79,7 +84,7 @@ class FlashAttention(nn.Module):
79
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
80
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
81
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
82
- output_unpad = flash_attn_varlen_qkvpacked_func(
83
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
84
  softmax_scale=self.softmax_scale, causal=causal
85
  )
@@ -88,11 +93,11 @@ class FlashAttention(nn.Module):
88
  'b s (h d) -> b s h d', h=nheads)
89
  else:
90
  assert max_s is not None
91
- output = flash_attn_varlen_qkvpacked_func(
92
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
93
  softmax_scale=self.softmax_scale, causal=causal
94
  )
95
-
96
  return output, None
97
 
98
 
@@ -124,12 +129,6 @@ except Exception:
124
  pass
125
 
126
 
127
- NORM2FN = {
128
- 'rms_norm': InternRMSNorm,
129
- 'layer_norm': nn.LayerNorm,
130
- }
131
-
132
-
133
  class InternVisionEmbeddings(nn.Module):
134
  def __init__(self, config: InternVisionConfig):
135
  super().__init__()
@@ -155,7 +154,7 @@ class InternVisionEmbeddings(nn.Module):
155
  target_dtype = pos_embed.dtype
156
  pos_embed = pos_embed.float().reshape(
157
  1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
158
- pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False). \
159
  reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
160
  return pos_embed
161
 
@@ -268,12 +267,11 @@ class InternVisionEncoderLayer(nn.Module):
268
  super().__init__()
269
  self.embed_dim = config.hidden_size
270
  self.intermediate_size = config.intermediate_size
271
- self.norm_type = config.norm_type
272
 
273
  self.attn = InternAttention(config)
274
  self.mlp = InternMLP(config)
275
- self.norm1 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
276
- self.norm2 = NORM2FN[self.norm_type](self.embed_dim, eps=config.layer_norm_eps)
277
 
278
  self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
279
  self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
@@ -288,9 +286,9 @@ class InternVisionEncoderLayer(nn.Module):
288
  Args:
289
  hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
290
  """
291
- hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states).to(hidden_states.dtype)) * self.ls1)
292
 
293
- hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states).to(hidden_states.dtype)) * self.ls2)
294
 
295
  return hidden_states
296
 
@@ -363,7 +361,6 @@ class InternVisionEncoder(nn.Module):
363
 
364
  class InternVisionModel(PreTrainedModel):
365
  main_input_name = 'pixel_values'
366
- _supports_flash_attn_2 = True
367
  config_class = InternVisionConfig
368
  _no_split_modules = ['InternVisionEncoderLayer']
369
 
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
 
6
  from typing import Optional, Tuple, Union
7
 
8
  import torch
 
20
  from .configuration_intern_vit import InternVisionConfig
21
 
22
  try:
23
+ try: # v1
24
+ from flash_attn.flash_attn_interface import \
25
+ flash_attn_unpadded_qkvpacked_func
26
+ except: # v2
27
+ from flash_attn.flash_attn_interface import \
28
+ flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
29
+
30
  from flash_attn.bert_padding import pad_input, unpad_input
31
+
 
32
  has_flash_attn = True
33
  except:
34
+ print('FlashAttention is not installed.')
35
  has_flash_attn = False
36
 
37
  logger = logging.get_logger(__name__)
 
47
  attention_dropout: The dropout rate to apply to the attention
48
  (default: 0.0)
49
  """
50
+
51
  def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
52
  super().__init__()
53
  self.softmax_scale = softmax_scale
54
  self.dropout_p = attention_dropout
55
+
56
  def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
57
  max_s=None, need_weights=False):
58
  """Implements the multihead softmax attention.
 
65
  assert not need_weights
66
  assert qkv.dtype in [torch.float16, torch.bfloat16]
67
  assert qkv.is_cuda
68
+
69
  if cu_seqlens is None:
70
  batch_size = qkv.shape[0]
71
  seqlen = qkv.shape[1]
 
74
  max_s = seqlen
75
  cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
76
  device=qkv.device)
77
+ output = flash_attn_unpadded_qkvpacked_func(
78
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
79
  softmax_scale=self.softmax_scale, causal=causal
80
  )
 
84
  x = rearrange(qkv, 'b s three h d -> b s (three h d)')
85
  x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
86
  x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
87
+ output_unpad = flash_attn_unpadded_qkvpacked_func(
88
  x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
89
  softmax_scale=self.softmax_scale, causal=causal
90
  )
 
93
  'b s (h d) -> b s h d', h=nheads)
94
  else:
95
  assert max_s is not None
96
+ output = flash_attn_unpadded_qkvpacked_func(
97
  qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
98
  softmax_scale=self.softmax_scale, causal=causal
99
  )
100
+
101
  return output, None
102
 
103
 
 
129
  pass
130
 
131
 
132
  class InternVisionEmbeddings(nn.Module):
133
  def __init__(self, config: InternVisionConfig):
134
  super().__init__()
 
154
  target_dtype = pos_embed.dtype
155
  pos_embed = pos_embed.float().reshape(
156
  1, self.image_size // self.patch_size, self.image_size // self.patch_size, -1).permute(0, 3, 1, 2)
157
+ pos_embed = F.interpolate(pos_embed, size=(H, W), mode='bicubic', align_corners=False).\
158
  reshape(1, -1, H * W).permute(0, 2, 1).to(target_dtype)
159
  return pos_embed
160
 
 
267
  super().__init__()
268
  self.embed_dim = config.hidden_size
269
  self.intermediate_size = config.intermediate_size
 
270
 
271
  self.attn = InternAttention(config)
272
  self.mlp = InternMLP(config)
273
+ self.norm1 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
274
+ self.norm2 = InternRMSNorm(self.embed_dim, eps=config.layer_norm_eps)
275
 
276
  self.ls1 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
277
  self.ls2 = nn.Parameter(config.initializer_factor * torch.ones(self.embed_dim))
 
286
  Args:
287
  hidden_states (`Tuple[torch.FloatTensor, Optional[torch.FloatTensor]]`): input to the layer of shape `(batch, seq_len, embed_dim)`
288
  """
289
+ hidden_states = hidden_states + self.drop_path1(self.attn(self.norm1(hidden_states)) * self.ls1)
290
 
291
+ hidden_states = hidden_states + self.drop_path2(self.mlp(self.norm2(hidden_states)) * self.ls2)
292
 
293
  return hidden_states
294
 
 
361
 
362
  class InternVisionModel(PreTrainedModel):
363
  main_input_name = 'pixel_values'
 
364
  config_class = InternVisionConfig
365
  _no_split_modules = ['InternVisionEncoderLayer']
366
 
modeling_internlm2.py CHANGED
@@ -48,18 +48,6 @@ _CONFIG_FOR_DOC = 'InternLM2Config'
48
 
49
  flash_attn_func, flash_attn_varlen_func = None, None
50
  pad_input, index_first_axis, unpad_input = None, None, None
51
- try:
52
- from flash_attn import flash_attn_func as _flash_attn_func
53
- from flash_attn import flash_attn_varlen_func as _flash_attn_varlen_func
54
- from flash_attn.bert_padding import index_first_axis as _index_first_axis
55
- from flash_attn.bert_padding import pad_input as _pad_input
56
- from flash_attn.bert_padding import unpad_input as _unpad_input
57
-
58
- flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
59
- pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
60
- has_flash_attn = True
61
- except:
62
- has_flash_attn = False
63
 
64
 
65
  def _import_flash_attn():
@@ -161,7 +149,7 @@ class InternLM2RotaryEmbedding(nn.Module):
161
 
162
  def _set_cos_sin_cache(self, seq_len, device, dtype):
163
  self.max_seq_len_cached = seq_len
164
- t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
165
 
166
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
167
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -190,7 +178,7 @@ class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
190
 
191
  def _set_cos_sin_cache(self, seq_len, device, dtype):
192
  self.max_seq_len_cached = seq_len
193
- t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
194
  t = t / self.scaling_factor
195
 
196
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
@@ -220,7 +208,7 @@ class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
220
  inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
221
  self.register_buffer('inv_freq', inv_freq, persistent=False)
222
 
223
- t = torch.arange(self.max_seq_len_cached, device=device).to(dtype=self.inv_freq.dtype)
224
 
225
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
226
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
@@ -709,7 +697,6 @@ class InternLM2PreTrainedModel(PreTrainedModel):
709
  supports_gradient_checkpointing = True
710
  _no_split_modules = ['InternLM2DecoderLayer']
711
  _skip_keys_device_placement = 'past_key_values'
712
- _supports_flash_attn_2 = True
713
 
714
  def _init_weights(self, module):
715
  std = self.config.initializer_range
@@ -808,9 +795,6 @@ class InternLM2Model(InternLM2PreTrainedModel):
808
  self.padding_idx = config.pad_token_id
809
  self.vocab_size = config.vocab_size
810
  self.config = config
811
- if not has_flash_attn:
812
- self.config.attn_implementation = 'eager'
813
- print('Warning: Flash attention is not available, using eager attention instead.')
814
 
815
  self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
816
 
@@ -1098,16 +1082,13 @@ class InternLM2ForCausalLM(InternLM2PreTrainedModel):
1098
  output = (logits,) + outputs[1:]
1099
  return (loss,) + output if loss is not None else output
1100
 
1101
- device = input_ids.device if input_ids is not None else inputs_embeds.device
1102
- output = CausalLMOutputWithPast(
1103
  loss=loss,
1104
  logits=logits,
1105
  past_key_values=outputs.past_key_values,
1106
  hidden_states=outputs.hidden_states,
1107
  attentions=outputs.attentions,
1108
  )
1109
- output['logits'] = output['logits'].to(device)
1110
- return output
1111
 
1112
  def prepare_inputs_for_generation(
1113
  self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
 
48
 
49
  flash_attn_func, flash_attn_varlen_func = None, None
50
  pad_input, index_first_axis, unpad_input = None, None, None
 
51
 
52
 
53
  def _import_flash_attn():
 
149
 
150
  def _set_cos_sin_cache(self, seq_len, device, dtype):
151
  self.max_seq_len_cached = seq_len
152
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
153
 
154
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
155
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
 
178
 
179
  def _set_cos_sin_cache(self, seq_len, device, dtype):
180
  self.max_seq_len_cached = seq_len
181
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
182
  t = t / self.scaling_factor
183
 
184
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
 
208
  inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
209
  self.register_buffer('inv_freq', inv_freq, persistent=False)
210
 
211
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
212
 
213
  freqs = torch.einsum('i,j->ij', t, self.inv_freq)
214
  # Different from paper, but it uses a different permutation in order to obtain the same calculation
 
697
  supports_gradient_checkpointing = True
698
  _no_split_modules = ['InternLM2DecoderLayer']
699
  _skip_keys_device_placement = 'past_key_values'
 
700
 
701
  def _init_weights(self, module):
702
  std = self.config.initializer_range
 
795
  self.padding_idx = config.pad_token_id
796
  self.vocab_size = config.vocab_size
797
  self.config = config
 
 
 
798
 
799
  self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
800
 
 
1082
  output = (logits,) + outputs[1:]
1083
  return (loss,) + output if loss is not None else output
1084
 
1085
+ return CausalLMOutputWithPast(
 
1086
  loss=loss,
1087
  logits=logits,
1088
  past_key_values=outputs.past_key_values,
1089
  hidden_states=outputs.hidden_states,
1090
  attentions=outputs.attentions,
1091
  )
 
 
1092
 
1093
  def prepare_inputs_for_generation(
1094
  self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
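
The substantive change in the rotary-embedding classes above is how the position index `t` is built: the removed lines create it with the default integer dtype and then cast (`torch.arange(n, device=device).to(dtype=self.inv_freq.dtype)`), while the retained lines create it directly in `inv_freq`'s dtype. A small self-contained illustration of why the two forms can differ for long caches when that dtype is low precision (editor's example, not from the repo):

```python
import torch

n = 4096  # a long max_seq_len_cached

# Built as int64, then cast: float32 represents every index up to 2**24 exactly.
t_cast = torch.arange(n).to(dtype=torch.float32)

# Built directly in a low-precision dtype: indices above 256 start to collapse in bfloat16.
t_low = torch.arange(n, dtype=torch.bfloat16)

print(torch.unique(t_cast).numel())          # 4096 distinct positions
print(torch.unique(t_low.float()).numel())   # noticeably fewer than 4096
```
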
modeling_internvl_chat.py CHANGED
@@ -1,48 +1,70 @@
1
  # --------------------------------------------------------
2
  # InternVL
3
- # Copyright (c) 2024 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
6
-
7
  import warnings
8
  from typing import Any, List, Optional, Tuple, Union
9
 
10
  import torch.utils.checkpoint
11
- import transformers
12
  from torch import nn
13
  from torch.nn import CrossEntropyLoss
14
- from transformers import AutoModel, GenerationConfig, LlamaForCausalLM
 
15
  from transformers.modeling_outputs import CausalLMOutputWithPast
16
  from transformers.modeling_utils import PreTrainedModel
17
  from transformers.utils import ModelOutput, logging
18
 
19
  from .configuration_internvl_chat import InternVLChatConfig
20
- from .conversation import get_conv_template
21
- from .modeling_intern_vit import InternVisionModel, has_flash_attn
22
  from .modeling_internlm2 import InternLM2ForCausalLM
23
 
24
  logger = logging.get_logger(__name__)
25
 
26
 
27
- def version_cmp(v1, v2, op='eq'):
28
- import operator
 
29
 
30
- from packaging import version
31
- op_func = getattr(operator, op)
32
- return op_func(version.parse(v1), version.parse(v2))
 
 
 
 
33
 
34
 
35
  class InternVLChatModel(PreTrainedModel):
36
  config_class = InternVLChatConfig
37
  main_input_name = 'pixel_values'
38
- base_model_prefix = 'language_model'
39
- _supports_flash_attn_2 = True
40
- _no_split_modules = ['InternVisionModel', 'LlamaDecoderLayer', 'InternLM2DecoderLayer']
41
 
42
- def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None, use_flash_attn=True):
43
  super().__init__(config)
44
 
45
- assert version_cmp(transformers.__version__, '4.37.0', 'ge')
46
  image_size = config.force_image_size or config.vision_config.image_size
47
  patch_size = config.vision_config.patch_size
48
  self.patch_size = patch_size
@@ -50,10 +72,8 @@ class InternVLChatModel(PreTrainedModel):
50
  self.template = config.template
51
  self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
52
  self.downsample_ratio = config.downsample_ratio
 
53
  self.ps_version = config.ps_version
54
- use_flash_attn = use_flash_attn if has_flash_attn else False
55
- config.vision_config.use_flash_attn = True if use_flash_attn else False
56
- config.llm_config.attn_implementation = 'flash_attention_2' if use_flash_attn else 'eager'
57
 
58
  logger.info(f'num_image_token: {self.num_image_token}')
59
  logger.info(f'ps_version: {self.ps_version}')
@@ -81,9 +101,44 @@ class InternVLChatModel(PreTrainedModel):
81
  nn.Linear(llm_hidden_size, llm_hidden_size)
82
  )
83
 
84
  self.img_context_token_id = None
85
- self.conv_template = get_conv_template(self.template)
86
- self.system_message = self.conv_template.system_message
87
 
88
  def forward(
89
  self,
@@ -102,7 +157,7 @@ class InternVLChatModel(PreTrainedModel):
102
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
103
 
104
  image_flags = image_flags.squeeze(-1)
105
- input_embeds = self.language_model.get_input_embeddings()(input_ids).clone()
106
 
107
  vit_embeds = self.extract_feature(pixel_values)
108
  vit_embeds = vit_embeds[image_flags == 1]
@@ -111,7 +166,7 @@ class InternVLChatModel(PreTrainedModel):
111
  B, N, C = input_embeds.shape
112
  input_embeds = input_embeds.reshape(B * N, C)
113
 
114
- if torch.distributed.is_initialized() and torch.distributed.get_rank() == 0:
115
  print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')
116
 
117
  input_ids = input_ids.reshape(B * N)
@@ -180,7 +235,17 @@ class InternVLChatModel(PreTrainedModel):
180
  x = x.permute(0, 2, 1, 3).contiguous()
181
  return x
182
 
183
  def extract_feature(self, pixel_values):
184
  if self.select_layer == -1:
185
  vit_embeds = self.vision_model(
186
  pixel_values=pixel_values,
@@ -193,99 +258,53 @@ class InternVLChatModel(PreTrainedModel):
193
  return_dict=True).hidden_states[self.select_layer]
194
  vit_embeds = vit_embeds[:, 1:, :]
195
 
 
196
  h = w = int(vit_embeds.shape[1] ** 0.5)
197
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
198
  vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
199
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
 
 
200
  vit_embeds = self.mlp1(vit_embeds)
201
  return vit_embeds
202
 
203
- def batch_chat(self, tokenizer, pixel_values, questions, generation_config, num_patches_list=None,
204
- history=None, return_history=False, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>',
205
- IMG_CONTEXT_TOKEN='<IMG_CONTEXT>', verbose=False, image_counts=None):
206
- if history is not None or return_history:
207
- print('Now multi-turn chat is not supported in batch_chat.')
208
- raise NotImplementedError
209
-
210
- if image_counts is not None:
211
- num_patches_list = image_counts
212
- print('Warning: `image_counts` is deprecated. Please use `num_patches_list` instead.')
213
-
214
- img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
215
- self.img_context_token_id = img_context_token_id
216
-
217
- if verbose and pixel_values is not None:
218
- image_bs = pixel_values.shape[0]
219
- print(f'dynamic ViT batch size: {image_bs}')
220
-
221
- queries = []
222
- for idx, num_patches in enumerate(num_patches_list):
223
- question = questions[idx]
224
- if pixel_values is not None and '<image>' not in question:
225
- question = '<image>\n' + question
226
- template = get_conv_template(self.template)
227
- template.system_message = self.system_message
228
- template.append_message(template.roles[0], question)
229
- template.append_message(template.roles[1], None)
230
- query = template.get_prompt()
231
-
232
- image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
233
- query = query.replace('<image>', image_tokens, 1)
234
- queries.append(query)
235
-
236
- tokenizer.padding_side = 'left'
237
- model_inputs = tokenizer(queries, return_tensors='pt', padding=True)
238
- input_ids = model_inputs['input_ids'].to(self.device)
239
- attention_mask = model_inputs['attention_mask'].to(self.device)
240
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
241
- generation_config['eos_token_id'] = eos_token_id
242
- generation_output = self.generate(
243
- pixel_values=pixel_values,
244
- input_ids=input_ids,
245
- attention_mask=attention_mask,
246
- **generation_config
247
- )
248
- responses = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
249
- responses = [response.split(template.sep.strip())[0].strip() for response in responses]
250
- return responses
251
-
252
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
253
- num_patches_list=None, IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
254
- verbose=False):
255
-
256
- if history is None and pixel_values is not None and '<image>' not in question:
257
- question = '<image>\n' + question
258
-
259
- if num_patches_list is None:
260
- num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
261
- assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
262
 
263
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
264
  self.img_context_token_id = img_context_token_id
 
 
 
 
265
 
266
- template = get_conv_template(self.template)
267
- template.system_message = self.system_message
268
- eos_token_id = tokenizer.convert_tokens_to_ids(template.sep.strip())
269
 
270
- history = [] if history is None else history
271
- for (old_question, old_answer) in history:
272
- template.append_message(template.roles[0], old_question)
273
- template.append_message(template.roles[1], old_answer)
274
  template.append_message(template.roles[0], question)
275
  template.append_message(template.roles[1], None)
276
  query = template.get_prompt()
277
-
278
- if verbose and pixel_values is not None:
279
- image_bs = pixel_values.shape[0]
280
- print(f'dynamic ViT batch size: {image_bs}')
281
-
282
- for num_patches in num_patches_list:
283
- image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
284
- query = query.replace('<image>', image_tokens, 1)
285
-
286
  model_inputs = tokenizer(query, return_tensors='pt')
287
- input_ids = model_inputs['input_ids'].to(self.device)
288
- attention_mask = model_inputs['attention_mask'].to(self.device)
289
  generation_config['eos_token_id'] = eos_token_id
290
  generation_output = self.generate(
291
  pixel_values=pixel_values,
@@ -294,16 +313,15 @@ class InternVLChatModel(PreTrainedModel):
294
  **generation_config
295
  )
296
  response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
297
- response = response.split(template.sep.strip())[0].strip()
298
  history.append((question, response))
299
  if return_history:
300
  return response, history
301
  else:
302
- query_to_print = query.replace(IMG_CONTEXT_TOKEN, '')
303
- query_to_print = query_to_print.replace(f'{IMG_START_TOKEN}{IMG_END_TOKEN}', '<image>')
304
- if verbose:
305
- print(query_to_print, response)
306
  return response
 
307
 
308
  @torch.no_grad()
309
  def generate(
@@ -314,6 +332,7 @@ class InternVLChatModel(PreTrainedModel):
314
  visual_features: Optional[torch.FloatTensor] = None,
315
  generation_config: Optional[GenerationConfig] = None,
316
  output_hidden_states: Optional[bool] = None,
 
317
  **generate_kwargs,
318
  ) -> torch.LongTensor:
319
 
@@ -323,6 +342,7 @@ class InternVLChatModel(PreTrainedModel):
323
  vit_embeds = visual_features
324
  else:
325
  vit_embeds = self.extract_feature(pixel_values)
 
326
  input_embeds = self.language_model.get_input_embeddings()(input_ids)
327
  B, N, C = input_embeds.shape
328
  input_embeds = input_embeds.reshape(B * N, C)
@@ -330,7 +350,7 @@ class InternVLChatModel(PreTrainedModel):
330
  input_ids = input_ids.reshape(B * N)
331
  selected = (input_ids == self.img_context_token_id)
332
  assert selected.sum() != 0
333
- input_embeds[selected] = vit_embeds.reshape(-1, C).to(input_embeds.device)
334
 
335
  input_embeds = input_embeds.reshape(B, N, C)
336
  else:
@@ -341,6 +361,7 @@ class InternVLChatModel(PreTrainedModel):
341
  attention_mask=attention_mask,
342
  generation_config=generation_config,
343
  output_hidden_states=output_hidden_states,
 
344
  use_cache=True,
345
  **generate_kwargs,
346
  )
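For readers skimming the `generate()` hunks above: the step they revolve around is splicing the projected ViT features into the language-model input embeddings wherever the `<IMG_CONTEXT>` placeholder token occurs. A minimal, self-contained sketch of that substitution, with illustrative names and shapes rather than the repository's exact code:

```python
import torch

def splice_visual_embeds(input_embeds, input_ids, vit_embeds, img_context_token_id):
    """Overwrite the embeddings of <IMG_CONTEXT> placeholder tokens with ViT features.

    input_embeds: (B, N, C) token embeddings from the LLM embedding table.
    input_ids:    (B, N) token ids of the same prompt.
    vit_embeds:   visual features already projected to the LLM hidden size C.
    """
    B, N, C = input_embeds.shape
    flat_embeds = input_embeds.reshape(B * N, C)
    flat_ids = input_ids.reshape(B * N)

    selected = flat_ids == img_context_token_id
    vit_flat = vit_embeds.reshape(-1, C).to(flat_embeds.device, flat_embeds.dtype)
    # The prompt must contain exactly one <IMG_CONTEXT> token per visual embedding.
    assert selected.sum().item() == vit_flat.shape[0]

    flat_embeds[selected] = vit_flat
    return flat_embeds.reshape(B, N, C)
```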
 
1
  # --------------------------------------------------------
2
  # InternVL
3
+ # Copyright (c) 2023 OpenGVLab
4
  # Licensed under The MIT License [see LICENSE for details]
5
  # --------------------------------------------------------
 
6
  import warnings
7
  from typing import Any, List, Optional, Tuple, Union
8
 
9
  import torch.utils.checkpoint
10
+ from peft import LoraConfig, get_peft_model
11
  from torch import nn
12
  from torch.nn import CrossEntropyLoss
13
+ from transformers import (AutoModel, GenerationConfig, LlamaForCausalLM,
14
+ LlamaTokenizer)
15
  from transformers.modeling_outputs import CausalLMOutputWithPast
16
  from transformers.modeling_utils import PreTrainedModel
17
  from transformers.utils import ModelOutput, logging
18
 
19
  from .configuration_internvl_chat import InternVLChatConfig
20
+ from .modeling_intern_vit import InternVisionModel
 
21
  from .modeling_internlm2 import InternLM2ForCausalLM
22
 
23
  logger = logging.get_logger(__name__)
24
 
25
 
26
+ def window_partition(x, window_size):
27
+ """
28
+ Args:
29
+ x: (B, C, H, W)
30
+ window_size (int): window size, assuming square window
31
+
32
+ Returns:
33
+ windows: (num_windows*B, C, window_size, window_size)
34
+ """
35
+ B, C, H, W = x.shape
36
+ assert H % window_size == 0 and W % window_size == 0, 'H and W must be divisible by window_size'
37
+
38
+ x = x.view(B, C, H // window_size, window_size, W // window_size, window_size)
39
+ windows = x.permute(0, 2, 4, 1, 3, 5).contiguous().view(-1, C, window_size, window_size)
40
+ return windows
41
+
42
+
43
+ def window_reverse(windows, window_size, H, W):
44
+ """
45
+ Args:
46
+ windows: (num_windows*B, window_size, window_size, C)
47
+ window_size (int): Window size
48
+ H (int): Height of image
49
+ W (int): Width of image
50
 
51
+ Returns:
52
+ x: (B, H * W, C)
53
+ """
54
+ B = int(windows.shape[0] / (H * W / window_size / window_size))
55
+ x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
56
+ x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H * W, -1)
57
+ return x
58
 
59
 
60
  class InternVLChatModel(PreTrainedModel):
61
  config_class = InternVLChatConfig
62
  main_input_name = 'pixel_values'
63
+ _no_split_modules = ['InternVisionEncoderLayer', 'LlamaDecoderLayer', 'LlamaForCausalLM']
 
 
64
 
65
+ def __init__(self, config: InternVLChatConfig, vision_model=None, language_model=None):
66
  super().__init__(config)
67
 
 
68
  image_size = config.force_image_size or config.vision_config.image_size
69
  patch_size = config.vision_config.patch_size
70
  self.patch_size = patch_size
 
72
  self.template = config.template
73
  self.num_image_token = int((image_size // patch_size) ** 2 * (config.downsample_ratio ** 2))
74
  self.downsample_ratio = config.downsample_ratio
75
+ self.image_fold = config.image_fold
76
  self.ps_version = config.ps_version
77
 
78
  logger.info(f'num_image_token: {self.num_image_token}')
79
  logger.info(f'ps_version: {self.ps_version}')
 
101
  nn.Linear(llm_hidden_size, llm_hidden_size)
102
  )
103
 
104
+ # if config.force_image_size != config.vision_config.image_size:
105
+ # self.vision_model.resize_pos_embeddings(
106
+ # old_size=config.vision_config.image_size,
107
+ # new_size=config.force_image_size,
108
+ # patch_size=config.vision_config.patch_size
109
+ # )
110
+
111
  self.img_context_token_id = None
112
+ self.neftune_alpha = None
113
+
114
+ if config.use_backbone_lora:
115
+ self.wrap_backbone_lora(r=config.use_backbone_lora, lora_alpha=2 * config.use_backbone_lora)
116
+
117
+ if config.use_llm_lora:
118
+ self.wrap_llm_lora(r=config.use_llm_lora, lora_alpha=2 * config.use_llm_lora)
119
+
120
+ def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
121
+ lora_config = LoraConfig(
122
+ r=r,
123
+ target_modules=['attn.qkv', 'attn.proj', 'mlp.fc1', 'mlp.fc2'],
124
+ lora_alpha=lora_alpha,
125
+ lora_dropout=lora_dropout,
126
+ )
127
+ self.vision_model = get_peft_model(self.vision_model, lora_config)
128
+ self.vision_model.print_trainable_parameters()
129
+
130
+ def wrap_llm_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
131
+ lora_config = LoraConfig(
132
+ r=r,
133
+ target_modules=['self_attn.q_proj', 'self_attn.k_proj', 'self_attn.v_proj', 'self_attn.o_proj',
134
+ 'mlp.gate_proj', 'mlp.down_proj', 'mlp.up_proj'],
135
+ lora_alpha=lora_alpha,
136
+ lora_dropout=lora_dropout,
137
+ task_type='CAUSAL_LM'
138
+ )
139
+ self.language_model = get_peft_model(self.language_model, lora_config)
140
+ self.language_model.enable_input_require_grads()
141
+ self.language_model.print_trainable_parameters()
142
 
143
  def forward(
144
  self,
 
157
  return_dict = return_dict if return_dict is not None else self.config.use_return_dict
158
 
159
  image_flags = image_flags.squeeze(-1)
160
+ input_embeds = self.language_model.get_input_embeddings()(input_ids)
161
 
162
  vit_embeds = self.extract_feature(pixel_values)
163
  vit_embeds = vit_embeds[image_flags == 1]
 
166
  B, N, C = input_embeds.shape
167
  input_embeds = input_embeds.reshape(B * N, C)
168
 
169
+ if torch.distributed.get_rank() == 0:
170
  print(f'dynamic ViT batch size: {vit_batch_size}, images per sample: {vit_batch_size / B}, dynamic token length: {N}')
171
 
172
  input_ids = input_ids.reshape(B * N)
 
235
  x = x.permute(0, 2, 1, 3).contiguous()
236
  return x
237
 
238
+ def noised_embed(self, vit_embeds, noise_alpha=5):
239
+ dims = torch.tensor(vit_embeds.size(1) * vit_embeds.size(2))
240
+ mag_norm = noise_alpha / torch.sqrt(dims)
241
+ noise = torch.zeros_like(vit_embeds).uniform_(-mag_norm, mag_norm)
242
+ return vit_embeds + noise
243
+
244
  def extract_feature(self, pixel_values):
245
+ if self.image_fold:
246
+ image_size = pixel_values.size(-1) # B, C, H, W
247
+ pixel_values = window_partition(pixel_values, window_size=image_size // self.image_fold) # 4B, C, H/2, W/2
248
+
249
  if self.select_layer == -1:
250
  vit_embeds = self.vision_model(
251
  pixel_values=pixel_values,
 
258
  return_dict=True).hidden_states[self.select_layer]
259
  vit_embeds = vit_embeds[:, 1:, :]
260
 
261
+ if self.training and self.neftune_alpha is not None:
262
+ vit_embeds = self.noised_embed(vit_embeds, self.neftune_alpha)
263
+
264
+ if self.image_fold:
265
+ vit_embeds = window_reverse(vit_embeds, window_size=image_size // (self.image_fold * self.patch_size),
266
+ H=image_size // self.patch_size, W=image_size // self.patch_size)
267
+
268
+ # if torch.distributed.get_rank() == 0:
269
+ # print("before pixel shuffle:", vit_embeds.shape)
270
  h = w = int(vit_embeds.shape[1] ** 0.5)
271
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], h, w, -1)
272
  vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
273
  vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], -1, vit_embeds.shape[-1])
274
+ # if torch.distributed.get_rank() == 0:
275
+ # print("after pixel shuffle:", vit_embeds.shape)
276
  vit_embeds = self.mlp1(vit_embeds)
277
  return vit_embeds
278
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
279
  def chat(self, tokenizer, pixel_values, question, generation_config, history=None, return_history=False,
280
+ IMG_START_TOKEN='<img>', IMG_END_TOKEN='</img>', IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'):
281
 
282
  img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
283
  self.img_context_token_id = img_context_token_id
284
+ if tokenizer.convert_tokens_to_ids('<|im_end|>') != 0:
285
+ eos_token_id = tokenizer.convert_tokens_to_ids('<|im_end|>') # 92542, InternLM2
286
+ else:
287
+ eos_token_id = tokenizer.eos_token_id
288
 
289
+ from .conversation import get_conv_template
 
 
290
 
291
+ template = get_conv_template(self.template)
292
+ image_bs = pixel_values.shape[0]
293
+ print(f'dynamic ViT batch size: {image_bs}')
294
+ if history is None:
295
+ history = []
296
+ image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * image_bs + IMG_END_TOKEN
297
+ question = image_tokens + '\n' + question
298
+ else:
299
+ for (old_question, old_answer) in history:
300
+ template.append_message(template.roles[0], old_question)
301
+ template.append_message(template.roles[1], old_answer)
302
  template.append_message(template.roles[0], question)
303
  template.append_message(template.roles[1], None)
304
  query = template.get_prompt()
305
  model_inputs = tokenizer(query, return_tensors='pt')
306
+ input_ids = model_inputs['input_ids'].cuda()
307
+ attention_mask = model_inputs['attention_mask'].cuda()
308
  generation_config['eos_token_id'] = eos_token_id
309
  generation_output = self.generate(
310
  pixel_values=pixel_values,
 
313
  **generation_config
314
  )
315
  response = tokenizer.batch_decode(generation_output, skip_special_tokens=True)[0]
316
+ response = response.split('<|im_end|>')[0].strip() # for InternLM2
317
  history.append((question, response))
318
  if return_history:
319
  return response, history
320
  else:
321
+ query_to_print = query.replace(image_tokens, '<image>')
322
+ print(query_to_print, response)
 
 
323
  return response
324
+ return response
325
 
326
  @torch.no_grad()
327
  def generate(
 
332
  visual_features: Optional[torch.FloatTensor] = None,
333
  generation_config: Optional[GenerationConfig] = None,
334
  output_hidden_states: Optional[bool] = None,
335
+ return_dict: Optional[bool] = None,
336
  **generate_kwargs,
337
  ) -> torch.LongTensor:
338
 
 
342
  vit_embeds = visual_features
343
  else:
344
  vit_embeds = self.extract_feature(pixel_values)
345
+
346
  input_embeds = self.language_model.get_input_embeddings()(input_ids)
347
  B, N, C = input_embeds.shape
348
  input_embeds = input_embeds.reshape(B * N, C)
 
350
  input_ids = input_ids.reshape(B * N)
351
  selected = (input_ids == self.img_context_token_id)
352
  assert selected.sum() != 0
353
+ input_embeds[selected] = vit_embeds.reshape(-1, C)
354
 
355
  input_embeds = input_embeds.reshape(B, N, C)
356
  else:
 
361
  attention_mask=attention_mask,
362
  generation_config=generation_config,
363
  output_hidden_states=output_hidden_states,
364
+ return_dict=return_dict,
365
  use_cache=True,
366
  **generate_kwargs,
367
  )
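Taken together, this version of the modeling file adds the `window_partition`/`window_reverse` helpers (used when `config.image_fold` is set), optional LoRA wrapping for the vision backbone and the LLM, NEFTune-style embedding noise, and the simpler `chat()` entry point shown above. A hedged usage sketch of that `chat()` method follows; the checkpoint id, dtype, and dummy input are illustrative assumptions, not taken from this diff:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'OpenGVLab/InternVL-Chat-V1-5'  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True).eval().cuda()

# Dummy tensor for illustration only; real images should be resized and
# normalized to (num_tiles, 3, 448, 448) as in the preprocessing config below.
pixel_values = torch.randn(1, 3, 448, 448, dtype=torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=512, do_sample=False)
response, history = model.chat(
    tokenizer, pixel_values, 'Describe this image.',
    generation_config, history=None, return_history=True)
print(response)
```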
preprocessor_config.json DELETED
@@ -1,19 +0,0 @@
1
- {
2
- "crop_size": 448,
3
- "do_center_crop": true,
4
- "do_normalize": true,
5
- "do_resize": true,
6
- "feature_extractor_type": "CLIPFeatureExtractor",
7
- "image_mean": [
8
- 0.485,
9
- 0.456,
10
- 0.406
11
- ],
12
- "image_std": [
13
- 0.229,
14
- 0.224,
15
- 0.225
16
- ],
17
- "resample": 3,
18
- "size": 448
19
- }
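For reference, a rough torchvision equivalent of the preprocessing described by the deleted config above (448 px bicubic resize, 448 px center crop, ImageNet mean/std). This is a sketch inferred from the config values, not code shipped in the repository:

```python
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

transform = T.Compose([
    T.Resize(448, interpolation=InterpolationMode.BICUBIC),  # "size": 448, "resample": 3 (bicubic)
    T.CenterCrop(448),                                        # "crop_size": 448
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406),                   # "image_mean"
                std=(0.229, 0.224, 0.225)),                   # "image_std"
])

image = Image.open('example.jpg').convert('RGB')  # hypothetical input file
pixel_values = transform(image).unsqueeze(0)      # shape: (1, 3, 448, 448)
```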
examples/red-panda.mp4 → runs/Apr15_16-44-40_SH-IDC1-10-140-37-13/events.out.tfevents.1713171220.SH-IDC1-10-140-37-13.204150.0 RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d921c07bb97224d65a37801541d246067f0d506f08723ffa1ad85c217907ccb8
3
- size 1867237
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:294d5bf755e6dea5c005c57af52e958a38bb42a7d17d801a25a6543bfe6ddca2
3
+ size 16662
runs/Apr15_17-33-22_SH-IDC1-10-140-37-13/events.out.tfevents.1713174123.SH-IDC1-10-140-37-13.259480.0 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:57d61c0e776bfb521e58febdbd99525e011f82137ceaaa655ffa6e2b3a9b02a9
3
+ size 72471