vanewu committed on
Commit
78728a3
·
1 Parent(s): 6fa2ca0

NewAcceleration (#19)

- remove files (2b20dbd049a0be820cd120a1447d6499b2d34d8a)
- Update for huggingface hub (b061fc933c657f766d8322eff24fc1f8ff06ea8d)

CHANGELOG.md DELETED
@@ -1,3 +0,0 @@
- ## v1.0
-
- - Add accelerated ChatGLM-6B model (from: https://huggingface.co/THUDM/chatglm-6b)
Dockerfile DELETED
@@ -1,11 +0,0 @@
- FROM nvcr.io/nvidia/pytorch:23.02-py3
-
- WORKDIR /workdir
-
- COPY requirements.txt /workdir/
-
- # since installing icetk will install protobuf 3.18.3, and we need protobuf==3.20.3
- RUN pip install -r requirements.txt && \
-     pip install protobuf==3.20.3
-
-
README.md DELETED
@@ -1,120 +0,0 @@
- ---
- license: creativeml-openrail-m
- language:
- - en
- tags:
- - LLM
- - tensorRT
- - ChatGLM
- ---
- ## Model Card for lyraChatGLM
-
- lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.
-
- The inference speed of lyraChatGLM has achieved a **10x** speedup over the early original version. We are still working hard to further improve the performance.
-
- Among its main features are:
-
- - weights: original ChatGLM-6B weights released by THUDM.
- - device: lyraChatGLM is mainly based on TensorRT compiled for SM=80 (A100, for example).
- - batch_size: compiled with dynamic batch size, max batch_size = 8
-
- ## Speed
-
- ### test environment
-
- - device: Nvidia A100 40G
- - batch size: 8
-
- **Because the early ChatGLM version did not support batch inference, `original` in the table below was measured at batch_size=1.**
-
-
- **According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), this bug has been fixed and the speed at batch_size=8 reaches up to 137 tokens/s. We will evaluate and update the latest performance.**
-
- |version|speed|
- |:-:|:-:|
- |original|30 tokens/s|
- |lyraChatGLM|310 tokens/s|
-
-
- ## Model Sources
-
- - **Repository:** https://huggingface.co/THUDM/chatglm-6b
-
- ## Try Demo in 2 fast steps
-
- ``` bash
- #step 1
- git clone https://huggingface.co/TMElyralab/lyraChatGLM
- cd lyraChatGLM
-
- #step 2
- docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
- ```
-
- ## Uses
-
- ```python
- from transformers import AutoTokenizer
- from lyraChatGLM import GLM6B, FasterChatGLM
- import os
-
- current_workdir = os.path.dirname(__file__)
-
- MAX_OUT_LEN = 100
- chatglm6b_dir = os.path.join(current_workdir, "models")
- tokenizer = AutoTokenizer.from_pretrained(chatglm6b_dir, trust_remote_code=True)
- input_str = ["为什么我们需要对深度学习模型加速?", ]
- inputs = tokenizer(input_str, return_tensors="pt", padding=True)
- input_ids = inputs.input_ids.to('cuda:0')
-
- plan_path = os.path.join(current_workdir, "models/glm6b-bs8.ftm")
-
- # kernel for chat model.
- kernel = GLM6B(plan_path=plan_path,
-                batch_size=1,
-                num_beams=1,
-                use_cache=True,
-                num_heads=32,
-                emb_size_per_heads=128,
-                decoder_layers=28,
-                vocab_size=150528,
-                max_seq_len=MAX_OUT_LEN)
-
- chat = FasterChatGLM(model_dir=chatglm6b_dir, kernel=kernel).half().cuda()
-
- # generate
- sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
- # de-tokenize model output to text
- res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
- print(res)
- ```
- ## Demo output
-
- ### input
- 为什么我们需要对深度学习模型加速? 。
-
- ### output
- 为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。
-
- 以下是一些原因我们需要对深度学习模型加速:
-
- 1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。
-
- ### TODO:
-
- We plan to implement a FasterTransformer version to publish a much faster release. Stay tuned!
-
- ## Citation
- ``` bibtex
- @Misc{lyraChatGLM2023,
-   author = {Kangjian Wu, Zhengtao Wang, Yibo Lu, Bin Wu},
-   title = {lyraChatGLM: Accelerating ChatGLM by 10x+},
-   howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
-   year = {2023}
- }
- ```
-
- ## Report bug
- - start a discussion to report any bugs!--> https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- - report bug with a `[bug]` mark in the title.
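
The deleted README quotes three throughput figures, which imply two different speedup ratios depending on the baseline. The sketch below is plain arithmetic on those quoted numbers; the helper name `speedup` is hypothetical and not part of the repository, and it assumes the 137 tokens/s figure refers to the original model after the batching fix.

```python
def speedup(accelerated_tps: float, baseline_tps: float) -> float:
    """Ratio of accelerated throughput to baseline throughput (tokens/s)."""
    return accelerated_tps / baseline_tps


# Figures quoted in the README: lyraChatGLM at batch_size=8, the original
# at batch_size=1, and the original at batch_size=8 after the fix.
LYRA_TPS = 310
ORIGINAL_BS1_TPS = 30
ORIGINAL_BS8_TPS = 137

print(f"vs. original at batch_size=1: {speedup(LYRA_TPS, ORIGINAL_BS1_TPS):.1f}x")
print(f"vs. original at batch_size=8: {speedup(LYRA_TPS, ORIGINAL_BS8_TPS):.1f}x")
```

The headline 10x figure corresponds to the first ratio, where the two systems were measured at different batch sizes.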
demo.py CHANGED
@@ -1,35 +1,20 @@
- # coding=utf-8
-
- from transformers import AutoTokenizer
- from lyraChatGLM import GLM6B, FasterChatGLM
- import os
-
- current_workdir = os.path.dirname(__file__)
-
- MAX_OUT_LEN = 100
- chatglm6b_dir = os.path.join(current_workdir, "models")
- tokenizer = AutoTokenizer.from_pretrained(chatglm6b_dir, trust_remote_code=True)
- input_str = ["为什么我们需要对深度学习模型加速?", ]
- inputs = tokenizer(input_str, return_tensors="pt", padding=True)
- input_ids = inputs.input_ids.to('cuda:0')
-
- plan_path = os.path.join(current_workdir, "models/glm6b-bs8.ftm")
-
- # kernel for chat model.
- kernel = GLM6B(plan_path=plan_path,
-                batch_size=1,
-                num_beams=1,
-                use_cache=True,
-                num_heads=32,
-                emb_size_per_heads=128,
-                decoder_layers=28,
-                vocab_size=150528,
-                max_seq_len=MAX_OUT_LEN)
-
- chat = FasterChatGLM(model_dir=chatglm6b_dir, kernel=kernel).half().cuda()
-
- # generate
- sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
- # de-tokenize model output to text
- res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
- print(res)
+ from lyraChatGLM import LyraChatGLM6B
+
+ model_path = "./models/1-gpu-fp16.h5"
+ tokenizer_path = "./models"
+ data_type = "fp16"
+ int8_mode = 0
+ max_output_length = 150
+ arch = "Ampere" # Ampere or Volta
+
+ model = LyraChatGLM6B(model_path, tokenizer_path, data_type, int8_mode, arch)
+ prompt = "今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服裤子鞋子搭配。"
+ test_batch_size = 256
+
+ prompts = [prompt, ]
+
+
+ # If you want to get different output in same batch, you can set do_sample to True
+ output_texts = model.generate(prompts, output_length=max_output_length,top_k=30, top_p=0.85, temperature=0.35, repetition_penalty=1.2, do_sample=False)
+
+ print(output_texts)
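
The new demo defines `test_batch_size` but passes a single-element `prompts` list to `generate`. One plausible intent, assumed here rather than taken from the commit, is to replicate the prompt to fill a batch. The helper `build_batch` below is hypothetical and only illustrates that idea:

```python
from typing import List


def build_batch(prompt: str, batch_size: int) -> List[str]:
    """Repeat one prompt so that it fills a batch of the requested size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompt] * batch_size


prompts = build_batch("example prompt", 4)
print(len(prompts))
```

Such a list could be passed as the first argument to `model.generate` in place of `[prompt, ]`. Per the comment in the demo, `do_sample=True` would be needed to get different outputs within the same batch.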
lyraChatGLM/__init__.py CHANGED
@@ -1,10 +1 @@
- import os
- import ctypes
-
- current_workdir = os.path.dirname(__file__)
- ctypes.cdll.LoadLibrary(os.path.join(current_workdir, "libnvinfer_plugin.so"))
- os.environ["TORCH_USE_RTLD_GLOBAL"]="YES"
-
- import torch
- from .glm import GLM6B
- from .model import FasterChatGLM
+ from .lyra_glm import LyraChatGLM6B
lyraChatGLM/config.py ADDED
@@ -0,0 +1,31 @@
+ import dataclasses
+ from typing import Optional
+
+
+ @dataclasses.dataclass
+ class ChatGLM6BParam:
+     num_heads: int = 32
+     size_per_head: int = 128
+     inter_size: int = 16384
+     num_layers: int = 28
+     vocab_size: int = 130528
+     start_id: Optional[int] = 130004
+     end_id: Optional[int] = 130005
+     tensor_para_size: int = 1
+     pipeline_para_size: int = 1
+     remove_padding: bool = True
+     shared_contexts_ratio: float = 1.0
+     layernorm_eps: float = 1e-5
+     weights_data_type: str = "fp16"
+
+     def __post_init__(self):
+         if not 0.0 <= self.shared_contexts_ratio <= 1.0:
+             raise ValueError(
+                 f'Got an invalid value of shared_context_ratio '
+                 f'{self.shared_contexts_ratio} - range: [0.0, 1.0]')
+
+     def asdict(self):
+         return dataclasses.asdict(self)
+
+
+ CHATGLM_6B_PARAM = ChatGLM6BParam()
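
`ChatGLM6BParam` relies on the dataclass `__post_init__` hook to reject an out-of-range `shared_contexts_ratio` at construction time. A minimal, self-contained sketch of that pattern follows; the class `RatioParam` is a hypothetical stand-in with a single field, not code from the commit.

```python
import dataclasses


@dataclasses.dataclass
class RatioParam:
    # Hypothetical stand-in for ChatGLM6BParam with one validated field.
    shared_contexts_ratio: float = 1.0

    def __post_init__(self):
        # Called by the generated __init__ after the fields are assigned,
        # so an invalid value is rejected before the object is used.
        if not 0.0 <= self.shared_contexts_ratio <= 1.0:
            raise ValueError(
                f"shared_contexts_ratio must be in [0.0, 1.0], "
                f"got {self.shared_contexts_ratio}")


print(dataclasses.asdict(RatioParam(0.5)))

try:
    RatioParam(1.5)
except ValueError as err:
    print(f"rejected: {err}")
```

Note that `__post_init__` only runs at construction; assigning a new value to the field afterwards is not re-validated.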
lyraChatGLM/lyra_glm.py ADDED
@@ -0,0 +1,174 @@
+ from __future__ import annotations
+
+ import configparser
+ import pathlib
+ import typing
+
+ import torch
+ import transformers
+
+ from .config import CHATGLM_6B_PARAM
+ from .model import ChatGLM6BModel
+
+
+ class LyraChatGLM6B:
+     def __init__(self, model_path, tokenizer_path=None, dtype='fp16', int8_mode=0, arch="Ampere") -> None:
+         self.model_path = model_path
+         self.tokenizer_path = tokenizer_path
+         self.dtype = dtype
+         self.arch=arch
+         if dtype != 'int8':
+             int8_mode = 0
+         self.int8_mode = int8_mode
+
+         self.model, self.tokenizer = self.load_model_and_tokenizer()
+         if not (arch in ["Ampere", "Volta"]):
+             raise ValueError("Only support GPU device Ampere(A100,A10) or Volta(V100)")
+
+         print("Got model and tokenizer")
+
+     def load_model_and_tokenizer(self):
+         if self.tokenizer_path is None:
+             tokenizer_path = self.model_path
+         else:
+             tokenizer_path = self.tokenizer_path
+
+         print(f'Loading tokenizer from {pathlib.Path(tokenizer_path).parent}')
+         tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)
+
+         checkpoint_path = pathlib.Path(self.model_path)
+
+         config_path = checkpoint_path.parent / 'config.ini'
+
+         if config_path.exists():
+             # Read model params from config.
+             cfg = configparser.ConfigParser()
+             cfg.read(config_path)
+             model_name = 'glm6b'
+             inference_data_type = self.dtype
+             if inference_data_type == None:
+                 inference_data_type = cfg.get(model_name, "weight_data_type")
+             model_args = dict(
+                 head_num=cfg.getint(model_name, 'head_num'),
+                 size_per_head=cfg.getint(model_name, "size_per_head"),
+                 layer_num=cfg.getint(model_name, "num_layer"),
+                 tensor_para_size=cfg.getint(model_name, "tensor_para_size"),
+                 vocab_size=cfg.getint(model_name, "vocab_size"),
+                 start_id=cfg.getint(model_name, "start_id"),
+                 end_id=cfg.getint(model_name, "end_id"),
+                 weights_data_type=cfg.get(model_name, "weight_data_type"),
+                 layernorm_eps=cfg.getfloat(model_name, 'layernorm_eps'),
+                 inference_data_type=inference_data_type)
+         else:
+             inference_data_type = self.dtype
+             if inference_data_type == None:
+                 inference_data_type = CHATGLM_6B_PARAM.weights_data_type
+             model_args = dict(head_num=CHATGLM_6B_PARAM.num_heads,
+                               size_per_head=CHATGLM_6B_PARAM.size_per_head,
+                               vocab_size=CHATGLM_6B_PARAM.vocab_size,
+                               start_id=CHATGLM_6B_PARAM.start_id or tokenizer.bos_token_id,
+                               end_id=CHATGLM_6B_PARAM.end_id or tokenizer.eos_token_id,
+                               layer_num=CHATGLM_6B_PARAM.num_layers,
+                               tensor_para_size=CHATGLM_6B_PARAM.tensor_para_size,
+                               weights_data_type=CHATGLM_6B_PARAM.weights_data_type,
+                               layernorm_eps=CHATGLM_6B_PARAM.layernorm_eps,
+                               inference_data_type=inference_data_type,
+                               )
+
+         # update common parameters
+         model_args.update(dict(
+             rotary_embedding_dim=64,
+             max_seq_len=0,  # for position seq embedding
+             pipeline_para_size=CHATGLM_6B_PARAM.pipeline_para_size,
+             shared_contexts_ratio=CHATGLM_6B_PARAM.shared_contexts_ratio,
+             int8_mode=self.int8_mode
+         ))
+
+         print('[INFO] Load Our Highly Optimized LyraChatGLM6B model')
+         for k, v in model_args.items():
+             print(f' - {k.ljust(25, ".")}: {v}')
+
+         # Check sanity and consistency between the model and tokenizer.
+         checklist = ['head_num', 'size_per_head', 'vocab_size', 'layer_num',
+                      'tensor_para_size', 'tensor_para_size', 'weights_data_type']
+         if None in [model_args[k] for k in checklist]:
+             none_params = [p for p in checklist if model_args[p] is None]
+             print(f'[WARNING] Found None parameters {none_params}. They must '
+                   f'be provided either by config file or CLI arguments.')
+         if model_args['start_id'] != tokenizer.bos_token_id:
+             print('[WARNING] Given start_id is not matched with the bos token '
+                   'id of the pretrained tokenizer.')
+         if model_args['end_id'] not in (tokenizer.pad_token_id, tokenizer.eos_token_id):
+             print('[WARNING] Given end_id is not matched with neither pad '
+                   'token id nor eos token id of the pretrained tokenizer.')
+
+         print(f'Loading tokenizer from {self.model_path}')
+         model = ChatGLM6BModel(arch=self.arch,**model_args)
+         if not model.load(ckpt_path=self.model_path):
+             print('[WARNING] Skip model loading since no checkpoints are found')
+
+         return model, tokenizer
+
+     def generate(self, prompts: typing.List[str] | str,
+                  output_length: int = 512,
+                  beam_width: int = 1,
+                  top_k: typing.Optional[torch.IntTensor] = 1,
+                  top_p: typing.Optional[torch.FloatTensor] = 1.0,
+                  beam_search_diversity_rate: typing.Optional[torch.FloatTensor] = 0.0,
+                  temperature: typing.Optional[torch.FloatTensor] = 1.0,
+                  len_penalty: typing.Optional[torch.FloatTensor] = 0.0,
+                  repetition_penalty: typing.Optional[torch.FloatTensor] = 1.0,
+                  presence_penalty: typing.Optional[torch.FloatTensor] = None,
+                  min_length: typing.Optional[torch.IntTensor] = None,
+                  bad_words_list: typing.Optional[torch.IntTensor] = None,
+                  do_sample: bool = False,
+                  return_output_length: bool = False,
+                  return_cum_log_probs: int = 0):
+         #
+         if isinstance(prompts, str):
+             prompts = [prompts, ]
+
+         inputs = prompts
+
+         batch_size = len(inputs)
+         ones_int = torch.ones(size=[batch_size], dtype=torch.int32)
+         ones_float = torch.ones(size=[batch_size], dtype=torch.float32)
+
+         input_token_ids = self.tokenizer(prompts, return_tensors="pt", padding=True).input_ids.int()
+         input_lengths = torch.IntTensor([len(ids) for ids in input_token_ids])
+         mask_positions = torch.IntTensor([seq.index(130001) for seq in input_token_ids.tolist()])
+
+         random_seed = None
+         if do_sample:
+             random_seed = torch.randint(0, 262144, (batch_size,), dtype=torch.long)
+
+         outputs = self.model(start_ids=input_token_ids,
+                              start_lengths=input_lengths,
+                              mask_positions=mask_positions,
+                              output_len=output_length,
+                              beam_width=beam_width,
+                              top_k=top_k*ones_int,
+                              top_p=top_p*ones_float,
+                              beam_search_diversity_rate=beam_search_diversity_rate*ones_float,
+                              temperature=temperature*ones_float,
+                              len_penalty=len_penalty*ones_float,
+                              repetition_penalty=repetition_penalty*ones_float,
+                              presence_penalty=presence_penalty,
+                              min_length=min_length,
+                              random_seed=random_seed,
+                              bad_words_list=bad_words_list,
+                              return_output_length=return_output_length,
+                              return_cum_log_probs=return_cum_log_probs)
+
+         if return_cum_log_probs > 0:
+             outputs = outputs[0]  # output_token_ids.
+
+         # Slice the generated token ids of the 1st beam result.
+         # output = input tokens + generated tokens.
+         output_token_ids = [out[0, length:].cpu()
+                             for out, length in zip(outputs, input_lengths)]
+
+         output_texts = self.tokenizer.batch_decode(
+             output_token_ids, skip_special_tokens=False)
+
+         return output_texts
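
The last step of `generate` keeps the first beam of each sample and drops the prompt tokens, so only the continuation is decoded. A minimal sketch of that slicing, using plain Python lists in place of tensors; the function name `strip_prompt` and the toy token ids are assumptions for illustration only.

```python
from typing import List


def strip_prompt(outputs: List[List[List[int]]],
                 input_lengths: List[int]) -> List[List[int]]:
    """Keep the first beam of each sample and drop its prompt tokens.

    outputs is indexed as [batch][beam][token]; each row holds the prompt
    tokens followed by the generated continuation.
    """
    return [out[0][length:] for out, length in zip(outputs, input_lengths)]


# Two samples, one beam each, with prompts padded to length 3.
outputs = [
    [[11, 12, 13, 101, 102]],
    [[21, 22, 23, 201, 202]],
]
print(strip_prompt(outputs, [3, 3]))
```

Because the tokenizer in `generate` is called with `padding=True`, every row of `input_token_ids` has the same padded length, so `input_lengths` holds the same value for every sample in a batch.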
lyraChatGLM/model.py CHANGED
@@ -1,131 +1,625 @@
- import torch
- from transformers.modeling_outputs import CausalLMOutputWithPast
- from transformers.modeling_utils import PreTrainedModel
- from transformers import AutoConfig
- from typing import Dict, List, Tuple, Union, Optional
-
-
- class FasterChatGLM(PreTrainedModel):
-     def __init__(self, model_dir, kernel, *inputs, **kwargs):
-         config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
-         config.n_head = config.num_attention_heads
-         config.n_embd = config.hidden_size
-         config.n_layer = config.num_layers
-         super().__init__(config, *inputs, **kwargs)
-         self.kernel = kernel
-         self.fake_reg = torch.nn.Linear(2, 2)
-         self.position_encoding_2d = True
-
-     def forward(self, input_ids, position_ids, attention_mask, past_key_values, *args, **kwargs):
-         inputs_values = [input_ids, position_ids, attention_mask]
-         if past_key_values is not None:
-             inputs_values = inputs_values + past_key_values
-
-         computed = self.kernel.infer(inputs_values)
-         logits = computed[0]
-         if len(computed) == 1:
-             present_key_values = None
-         else:
-             present_key_values = computed[1:]
-
-         return CausalLMOutputWithPast(logits=logits, past_key_values=present_key_values)
-
-     def get_masks_and_position_ids(self, seq, mask_position, context_length, device, gmask=False):
-         attention_mask = torch.ones((1, context_length, context_length), device=device)
-         attention_mask.tril_()
-         attention_mask[..., :context_length - 1] = 1
-         attention_mask.unsqueeze_(1)
-         attention_mask = (attention_mask < 0.5).bool()
-
-         if self.position_encoding_2d:
-             seq_length = seq.index(150004)
-             position_ids = torch.arange(context_length, dtype=torch.long, device=device)
-             if not gmask:
-                 position_ids[seq_length:] = mask_position
-             block_position_ids = torch.cat((
-                 torch.zeros(seq_length, dtype=torch.long, device=device),
-                 torch.arange(context_length - seq_length, dtype=torch.long, device=device) + 1
-             ))
-             position_ids = torch.stack((position_ids, block_position_ids), dim=0)
-         else:
-             position_ids = torch.arange(context_length, dtype=torch.long, device=device)
-             if not gmask:
-                 position_ids[context_length - 1:] = mask_position
-
-         position_ids = position_ids.unsqueeze(0)
-
-         return attention_mask, position_ids
-
-     def prepare_one_sample(self, input_id, mask_token, past, past_key_values, use_gmask):
-
-         seq = input_id.tolist()
-         mask_position = seq.index(mask_token)
-
-         if mask_token not in seq:
-             raise ValueError("You have to add either [MASK] or [gMASK] in your input")
-
-         # only last token for input_ids if past is not None
-         if past is not None or past_key_values is not None:
-             context_length = seq.index(150004)
-             last_token = input_id[-1].unsqueeze(-1).unsqueeze(0)  # 2 dim
-             proc_input_id = last_token
-             if self.position_encoding_2d:
-                 position_ids = torch.tensor([[[mask_position], [len(seq) - context_length]]], dtype=torch.long,
-                                             device=input_id.device)
-             else:
-                 position_ids = torch.tensor([[mask_position]], dtype=torch.long, device=input_id.device)
-
-             attention_mask = torch.zeros(1, 1, 1, 1, device=input_id.device)
-         else:
-             proc_input_id = input_id.unsqueeze(0)
-             attention_mask, position_ids = self.get_masks_and_position_ids(
-                 seq=seq,
-                 mask_position=mask_position,
-                 context_length=len(seq),
-                 device=input_id.device,
-                 gmask=use_gmask
-             )
-
-         return (proc_input_id.to(torch.int32), position_ids.to(torch.int32),
-                 attention_mask.to(torch.bool))
-
-     def prepare_inputs_for_generation(
-             self,
-             input_ids: torch.LongTensor,
-             past: Optional[torch.Tensor] = None,
-             past_key_values: Optional[torch.Tensor] = None,
-             attention_mask: Optional[torch.Tensor] = None,
-             use_cache: bool = None,
-             **kwargs
-     ) -> dict:
-
-         MASK, gMASK = 150000, 150001
-         mask_token = MASK if MASK in input_ids else gMASK
-         use_gmask = False if MASK in input_ids else gMASK
-
-         batch_input_ids, batch_position_ids, batch_attention_mask = [], [], []
-         for input_id in input_ids:
-             proc_input_id, position_id, attention_mask = self.prepare_one_sample(
-                 input_id, mask_token, past, past_key_values, use_gmask)
-             batch_input_ids.append(proc_input_id)
-             batch_position_ids.append(position_id)
-             batch_attention_mask.append(attention_mask)
-
-         batch_input_ids = torch.vstack(batch_input_ids)
-         batch_position_ids = torch.vstack(batch_position_ids)
-         batch_attention_mask = torch.vstack(batch_attention_mask)
-
-         if past is None:
-             past = past_key_values
-
-         if past is not None or past_key_values is not None:
-             self.kernel.set_context_mode(False)
-         else:
-             self.kernel.set_context_mode(self.config.use_cache)
-
-         return {
-             "input_ids": batch_input_ids,
-             "past_key_values": past_key_values,
-             "position_ids": batch_position_ids,
-             "attention_mask": batch_attention_mask
-         }
1
+ import os
2
+ import h5py
3
+ import pathlib
4
+ import typing
5
+
6
+ import numpy as np
7
  import torch
8
+ import torch.distributed as dist
9
+ import torch.nn as nn
10
+
11
+ str_type_map = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}
12
+
13
+
14
+ class ChatGLM6BWeights:
15
+ def __init__(
16
+ self, head_num, size_per_head, layer_num, vocab_size, max_seq_len, tensor_para_size, pipeline_para_size,
17
+ weights_data_type: typing.Union[str, np.dtype],
18
+ inference_data_type: str, has_adapters: bool = False, adapter_inter_size: int = 0, gpt_with_moe: bool = False,
19
+ has_positional_encoding: bool = False, has_pre_decoder_layernorm: bool = False,
20
+ has_post_decoder_layernorm: bool = True, int8_mode: int = 0, inter_size: int = 0):
21
+ assert(head_num % tensor_para_size == 0)
22
+ if int8_mode == 1:
23
+ torch_infer_dtype = str_type_map[inference_data_type]
24
+ assert torch_infer_dtype == torch.float16 or torch_infer_dtype == torch.bfloat16, "Weight only quant only supported for infer type fp16 or bf16."
25
+ quant = torch.ops.fastertransformer.symmetric_quantize_last_axis_of_batched_matrix
26
+ self.weight_transpose_calibrate_quantize = lambda x: quant(x, torch.int8)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
27
  else:
28
+ assert int8_mode == 0, "Invalid int8 mode for GPT. Must be 0 or 1"
29
+
30
+ self.head_num = head_num
31
+ self.size_per_head = size_per_head
32
+ self.layer_num = layer_num
33
+ self.vocab_size = vocab_size
34
+ self.max_seq_len = max_seq_len
35
+ self.tensor_para_size = tensor_para_size
36
+ self.pipeline_para_size = pipeline_para_size
37
+ self.layers_per_device = layer_num // pipeline_para_size
38
+
39
+ self.has_adapters = has_adapters
40
+ self.adapter_inter_size = adapter_inter_size
41
+ self.gpt_with_moe = gpt_with_moe
42
+ self.has_positional_encoding = has_positional_encoding
43
+ self.has_pre_decoder_layernorm = has_pre_decoder_layernorm
44
+ self.has_post_decoder_layernorm = has_post_decoder_layernorm
45
+
46
+ local_head_num = head_num // tensor_para_size
47
+ global_head_num = head_num
48
+ local_hidden_units = local_head_num * size_per_head
49
+ global_hidden_units = global_head_num * size_per_head
50
+ local_inter_size = local_hidden_units * 4
51
+ if inter_size != 0:
52
+ assert inter_size % tensor_para_size == 0, f"inter_size({inter_size}) \% tensor_para_size({tensor_para_size}) must be 0"
53
+ local_inter_size = inter_size // tensor_para_size
54
+ local_adapter_inter_size = self.adapter_inter_size // tensor_para_size
55
+
56
+ self.local_head_num = local_head_num
57
+ self.global_head_num = global_head_num
58
+ self.local_hidden_units = local_hidden_units
59
+ self.global_hidden_units = global_hidden_units
60
+ self.local_inter_size = local_inter_size
61
+
62
+ self.int8_mode = int8_mode
63
+ self.share_embed = False
64
+
65
+ if isinstance(weights_data_type, str):
66
+ try:
67
+ weights_data_type = {
68
+ "fp16": np.float16,
69
+ "fp32": np.float32,
70
+ "float16": np.float16,
71
+ "float32": np.float32,
72
+ }[weights_data_type]
73
+ except KeyError:
74
+ raise ValueError(f"Don't know how to interpret weights_data_type: {weights_data_type}")
75
+
76
+ assert weights_data_type in [np.float32, np.float16]
77
+ self.weights_data_type = weights_data_type
78
+ self.inference_data_type = inference_data_type
79
+
80
+ self.w = []
81
+ self.int8_w = []
82
+ self.scale = []
83
+
84
+ # Transformer blocks
85
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
86
+ * layer_num) # self_layernorm_gamma
87
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
88
+ * layer_num) # self_layernorm_beta
89
+ self.w.extend([torch.zeros(global_hidden_units, local_hidden_units * 3,
90
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # self_kernel
91
+ self.w.extend([torch.zeros(local_hidden_units * 3, dtype=str_type_map[self.inference_data_type])]
92
+ * layer_num) # self_bias
93
+ self.w.extend(
94
+ [torch.zeros(local_hidden_units, global_hidden_units, dtype=str_type_map[self.inference_data_type])] *
95
+ layer_num) # self_output_kernel
96
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
97
+ * layer_num) # self_output_bias
98
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
99
+ * layer_num) # ffn_layernorm_gamma
100
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
101
+ * layer_num) # ffn_layernorm_beta
102
+ self.w.extend(
103
+ [torch.zeros(global_hidden_units, local_inter_size, dtype=str_type_map[self.inference_data_type])] *
104
+ layer_num) # ffn_kernel1
105
+ self.w.extend([torch.zeros(local_inter_size, dtype=str_type_map[self.inference_data_type])]
106
+ * layer_num) # ffn_bias1
107
+ self.w.extend(
108
+ [torch.zeros(local_inter_size, global_hidden_units, dtype=str_type_map[self.inference_data_type])] *
109
+ layer_num) # ffn_kernel2
110
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[self.inference_data_type])]
111
+ * layer_num) # ffn_bias2
112
+
113
+ optional_adapter_offset = 0
114
+
115
+ # After Transformer blocks
116
+ if self.has_pre_decoder_layernorm:
117
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
118
+ self.inference_data_type])) # embedding layernorm gamma
119
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
120
+ self.inference_data_type])) # embedding layernorm beta
121
+ optional_adapter_offset += 2
122
+ if self.has_post_decoder_layernorm:
123
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
124
+ self.inference_data_type])) # final layernorm gamma
125
+ self.w.append(torch.zeros(global_hidden_units, dtype=str_type_map[
126
+ self.inference_data_type])) # final layernorm beta
127
+ optional_adapter_offset += 2
128
+ if self.has_positional_encoding:
129
+ self.w.append(torch.zeros(max_seq_len, global_hidden_units, dtype=str_type_map[
130
+ self.inference_data_type])) # position_encoding_table
131
+ optional_adapter_offset += 1
132
+
133
+ self.pre_embed_idx = len(self.w)
134
+ self.w.append(torch.zeros(vocab_size, global_hidden_units,
135
+ dtype=str_type_map[self.inference_data_type])) # embedding_table
136
+ self.post_embed_idx = len(self.w)
137
+ self.w.append(torch.zeros(vocab_size, global_hidden_units, dtype=str_type_map[
138
+ self.inference_data_type])) # post embedding_kernel
139
+ self.adapter_offset = 2 + optional_adapter_offset
140
 
141
+ self.w.extend([torch.empty(0, dtype=str_type_map[self.inference_data_type])] * layer_num) # gating_weight
142
+ self.adapter_offset += layer_num
143
 
144
+ # adapters
145
+ if self.has_adapters:
146
+ self.w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
147
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor1_kernel1
148
+ self.w.extend([torch.zeros(local_adapter_inter_size, dtype=str_type_map[
149
+ self.inference_data_type])] * layer_num) # adaptor1_bias1
150
+ self.w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
151
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor1_kernel2
152
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[
153
+ self.inference_data_type])] * layer_num) # adaptor1_bias2
154
+ self.w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
155
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor2_kernel1
156
+ self.w.extend([torch.zeros(local_adapter_inter_size, dtype=str_type_map[
157
+ self.inference_data_type])] * layer_num) # adaptor2_bias1
158
+ self.w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
159
+ dtype=str_type_map[self.inference_data_type])] * layer_num) # adaptor2_kernel2
160
+ self.w.extend([torch.zeros(global_hidden_units, dtype=str_type_map[
161
+ self.inference_data_type])] * layer_num) # adaptor2_bias2
162
 
163
+ # Initialization
164
+ self._map(lambda w: torch.nn.init.normal_(w, mean=0., std=1.))
165
 
166
+ if (self.int8_mode != 0):
167
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_hidden_units *
168
+ 3, dtype=torch.int8)] * layer_num) # self_int8_kernel
169
+ self.scale.extend([torch.zeros(local_hidden_units * 3, dtype=torch.float)] * layer_num) # self_scale
170
+ self.int8_w.extend([torch.zeros(local_hidden_units, global_hidden_units, dtype=torch.int8)]
171
+ * layer_num) # self_output_int8_kernel
172
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # self_output_scale
173
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_inter_size,
174
+ dtype=torch.int8)] * layer_num) # ffn_int8_kernel1
175
+ self.scale.extend([torch.zeros(local_inter_size, dtype=torch.float)] * layer_num) # ffn_scale1
176
+ self.int8_w.extend([torch.zeros(local_inter_size, global_hidden_units,
177
+ dtype=torch.int8)] * layer_num) # ffn_int8_kernel2
178
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # ffn_scale2
179
 
180
+ if self.has_adapters:
181
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
182
+ dtype=torch.int8)] * layer_num) # adaptor1_int8_kernel1
183
+ self.scale.extend([torch.zeros(local_adapter_inter_size, dtype=torch.float)]
184
+ * layer_num) # adaptor1_scale1
185
+ self.int8_w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
186
+ dtype=torch.int8)] * layer_num) # adaptor1_int8_kernel2
187
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # adaptor1_scale2
188
+ self.int8_w.extend([torch.zeros(global_hidden_units, local_adapter_inter_size,
189
+ dtype=torch.int8)] * layer_num) # adaptor2_int8_kernel1
190
+ self.scale.extend([torch.zeros(local_adapter_inter_size, dtype=torch.float)]
191
+ * layer_num) # adaptor2_scale1
192
+ self.int8_w.extend([torch.zeros(local_adapter_inter_size, global_hidden_units,
193
+ dtype=torch.int8)] * layer_num) # adaptor2_int8_kernel2
194
+ self.scale.extend([torch.zeros(global_hidden_units, dtype=torch.float)] * layer_num) # adaptor2_scale2
195
 
196
+ def __getitem__(self, idx):
197
+ return self.w[idx]
198
+
199
+ def __setitem__(self, idx, val):
200
+ self.w[idx] = val
201
+
202
+ def __len__(self):
203
+ return len(self.w)
204
+
205
+ def _map(self, func):
206
+ assert self.pre_embed_idx < self.post_embed_idx, \
207
+ "Pre decoder embedding index should be lower than post decoder embedding index."
208
+ for i in range(len(self.w)):
209
+ if isinstance(self.w[i], list):
210
+ for j in range(len(self.w[i])):
211
+ self.w[i][j] = func(self.w[i][j])
212
  else:
213
+ if self.share_embed and i == self.post_embed_idx:
214
+ # If sharing the pre and post embedding, any mapping to
215
+ # the pre decoder weight will give the same output to the
216
+ # post decoder weight, so we just copy here.
217
+ self.w[self.post_embed_idx] = self.w[self.pre_embed_idx]
218
+ else:
219
+ self.w[i] = func(self.w[i])
220
 
221
+ def _map_int8(self, func):
222
+ for i in range(len(self.int8_w)):
223
+ if isinstance(self.int8_w[i], list):
224
+ for j in range(len(self.int8_w[i])):
225
+ self.int8_w[i][j] = func(self.int8_w[i][j])
226
+
227
+ else:
228
+ self.int8_w[i] = func(self.int8_w[i])
229
+ for i in range(len(self.scale)):
230
+ if isinstance(self.scale[i], list):
231
+ for j in range(len(self.scale[i])):
232
+ self.scale[i][j] = func(self.scale[i][j])
233
+
234
+ else:
235
+ self.scale[i] = func(self.scale[i])
236
+
237
+ def _map_int8_scales(self, func):
238
+ for i in range(len(self.scale)):
239
+ if isinstance(self.scale[i], list):
240
+ for j in range(len(self.scale[i])):
241
+ self.scale[i][j] = func(self.scale[i][j])
242
+
243
+ else:
244
+ self.scale[i] = func(self.scale[i])
245
+
246
+ def load(self, ckpt_path, tp_rank, pipeline_para_rank):
247
+ if not os.path.exists(ckpt_path):
248
+ raise FileNotFoundError(f"Failed to find {ckpt_path}")
249
+ w = []
250
+
251
+ type_map = {np.float32: torch.float32, np.float16: torch.float16}
252
+ # Load
253
+
254
+ def is_load(i): return i >= self.layers_per_device * \
255
+ pipeline_para_rank and i < self.layers_per_device * (pipeline_para_rank + 1)
256
+
257
+ h5f = h5py.File(ckpt_path, "r")
258
+
259
+ def load_to_torch(key, is_load: bool):
260
+ if is_load:
261
+ npdata = h5f[key]["weights"][:]
262
+ return torch.from_numpy(npdata).to(str_type_map[self.inference_data_type])
263
+ else:
264
+ return torch.empty(0).to(str_type_map[self.inference_data_type])
265
+ w.extend([load_to_torch(f"model.layers.{i}.input_layernorm.weight", is_load(i))
266
+ for i in range(self.layer_num)])
267
+ w.extend([load_to_torch(f"model.layers.{i}.input_layernorm.bias", is_load(i))
268
+ for i in range(self.layer_num)])
269
+ w.extend(
270
+ [load_to_torch(
271
+ f"model.layers.{i}.attention.query_key_value.weight.{tp_rank}", is_load(i))
272
+ for i in range(self.layer_num)])
273
+ w.extend([
274
+ load_to_torch(
275
+ f"model.layers.{i}.attention.query_key_value.bias.{tp_rank}", is_load(i))
276
+ for i in range(self.layer_num)])
277
+ w.extend([load_to_torch(f"model.layers.{i}.attention.dense.weight.{tp_rank}",
278
+ is_load(i)) for i in range(self.layer_num)])
279
+ w.extend([load_to_torch(f"model.layers.{i}.attention.dense.bias", is_load(i))
280
+ for i in range(self.layer_num)])
281
+ w.extend([load_to_torch(f"model.layers.{i}.post_attention_layernorm.weight",
282
+ is_load(i)) for i in range(self.layer_num)])
283
+ w.extend([load_to_torch(f"model.layers.{i}.post_attention_layernorm.bias",
284
+ is_load(i)) for i in range(self.layer_num)])
285
+ w.extend(
286
+ [load_to_torch(f"model.layers.{i}.mlp.dense_h_to_4h.weight.{tp_rank}", is_load(i))
287
+ for i in range(self.layer_num)])
288
+ w.extend(
289
+ [load_to_torch(f"model.layers.{i}.mlp.dense_h_to_4h.bias.{tp_rank}", is_load(i))
290
+ for i in range(self.layer_num)])
291
+ w.extend(
292
+ [load_to_torch(f"model.layers.{i}.mlp.dense_4h_to_h.weight.{tp_rank}", is_load(i))
293
+ for i in range(self.layer_num)])
294
+ w.extend([load_to_torch(f"model.layers.{i}.mlp.dense_4h_to_h.bias", is_load(i)) for i in range(self.layer_num)])
295
+
296
+ if self.has_pre_decoder_layernorm:
297
+ w.append(load_to_torch(f"model.pre_decoder_layernorm.weight", True))
298
+ w.append(load_to_torch(f"model.pre_decoder_layernorm.bias", True))
299
+
300
+ if self.has_post_decoder_layernorm:
301
+ w.append(load_to_torch(f"model.final_layernorm.weight", True))
302
+ w.append(load_to_torch(f"model.final_layernorm.bias", True))
303
+
304
+ if self.has_positional_encoding:
305
+ wpe = load_to_torch(f"model.wpe", True).reshape(-1, self.global_hidden_units)
306
+ assert self.max_seq_len <= wpe.size(0), (
307
+ f"max_seq_len ({self.max_seq_len}) must not exceed "
308
+ f"the value of maximum sequence length during training ({wpe.size(0)})."
309
+ )
310
+ w.append(wpe)
311
+ w.append(load_to_torch(f"model.wte", True))
312
+ self.share_embed = True
313
+ w.append(torch.empty(0).to(str_type_map[self.inference_data_type]))
314
+
315
+ gate_list = []
316
+ for i in range(self.layer_num):
317
+ gate_list.append(load_to_torch(f"model.layers.{i}.mlp.moe.gate.wg.weight", False))
318
+ w.extend(gate_list)
319
+
320
+ if self.has_adapters:
321
+ w.extend(
322
+ [load_to_torch(
323
+ f"model.layers.{i}.after_attention_adapter.dense_h_to_4h.weight.{tp_rank}", is_load(i))
324
+ for i in range(self.layer_num)])
325
+ w.extend([
326
+ load_to_torch(
327
+ f"model.layers.{i}.after_attention_adapter.dense_h_to_4h.bias.{tp_rank}", is_load(i))
328
+ for i in range(self.layer_num)])
329
+ w.extend(
330
+ [load_to_torch(
331
+ f"model.layers.{i}.after_attention_adapter.dense_4h_to_h.weight.{tp_rank}", is_load(i))
332
+ for i in range(self.layer_num)])
333
+ w.extend(
334
+ [load_to_torch(f"model.layers.{i}.after_attention_adapter.dense_4h_to_h.bias", is_load(i))
335
+ for i in range(self.layer_num)])
336
+ w.extend(
337
+ [load_to_torch(f"model.layers.{i}.after_ffn_adapter.dense_h_to_4h.weight.{tp_rank}", is_load(i))
338
+ for i in range(self.layer_num)])
339
+ w.extend(
340
+ [load_to_torch(f"model.layers.{i}.after_ffn_adapter.dense_h_to_4h.bias.{tp_rank}", is_load(i))
341
+ for i in range(self.layer_num)])
342
+ w.extend(
343
+ [load_to_torch(f"model.layers.{i}.after_ffn_adapter.dense_4h_to_h.weight.{tp_rank}", is_load(i))
344
+ for i in range(self.layer_num)])
345
+ w.extend([load_to_torch(
346
+ f"model.layers.{i}.after_ffn_adapter.dense_4h_to_h.bias", is_load(i)) for i in range(self.layer_num)])
347
+
348
+ assert len(self.w) == len(w)
349
+
350
+ # Reshape
351
+ try:
352
+ for i in range(len(w)):
353
+ if w[i].nelement() == self.w[i].nelement():
354
+ self.w[i] = w[i].reshape(self.w[i].shape)
355
+ else:
356
+ self.w[i] = w[i]
357
+
358
+ except RuntimeError:
359
+ raise RuntimeError(
360
+ f"head_num, size_per_head, vocab_size, and max_seq_len must be the same as the ones during training "
361
+ f"(idx: {i} expected shape: {self.w[i].shape} got shape: {w[i].shape})."
362
  )
363
 
364
+ # Transpose, calibrate, and quantize the kernels
365
+ layer_num = self.layer_num
366
+ if self.int8_mode != 0:
367
+ for i in range(layer_num):
368
+ self.int8_w[i + 0 * layer_num], self.scale[i + 0 *
369
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[2 * layer_num + i])
370
+ self.int8_w[i + 1 * layer_num], self.scale[i + 1 *
371
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[4 * layer_num + i])
372
+ self.int8_w[i + 2 * layer_num], self.scale[i + 2 *
373
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[8 * layer_num + i])
374
+ self.int8_w[i + 3 * layer_num], self.scale[i + 3 *
375
+ layer_num] = self.weight_transpose_calibrate_quantize(self.w[10 * layer_num + i])
376
+
377
+ # We clear the original weights since they are no longer needed
378
+ if self.int8_mode == 1:
379
+ self.w[2 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
380
+ self.w[4 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
381
+ self.w[8 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
382
+ self.w[10 * layer_num + i] = torch.empty(0).to(str_type_map[self.inference_data_type])
383
+
384
+ if self.has_adapters:
385
+ self.int8_w[i + 4 * layer_num], self.scale[i + 4 * layer_num] = self.weight_transpose_calibrate_quantize(
386
+ self.w[12 * layer_num + i + self.adapter_offset])
387
+ self.int8_w[i + 5 * layer_num], self.scale[i + 5 * layer_num] = self.weight_transpose_calibrate_quantize(
388
+ self.w[14 * layer_num + i + self.adapter_offset])
389
+ self.int8_w[i + 6 * layer_num], self.scale[i + 6 * layer_num] = self.weight_transpose_calibrate_quantize(
390
+ self.w[16 * layer_num + i + self.adapter_offset])
391
+ self.int8_w[i + 7 * layer_num], self.scale[i + 7 * layer_num] = self.weight_transpose_calibrate_quantize(
392
+ self.w[18 * layer_num + i + self.adapter_offset])
393
+
394
+ # Similar to above:
395
+ if self.int8_mode == 1:
396
+ self.w[12 * layer_num + i + self.adapter_offset] = torch.empty(
397
+ 0).to(str_type_map[self.inference_data_type])
398
+ self.w[14 * layer_num + i + self.adapter_offset] = torch.empty(
399
+ 0).to(str_type_map[self.inference_data_type])
400
+ self.w[16 * layer_num + i + self.adapter_offset] = torch.empty(
401
+ 0).to(str_type_map[self.inference_data_type])
402
+ self.w[18 * layer_num + i + self.adapter_offset] = torch.empty(
403
+ 0).to(str_type_map[self.inference_data_type])
404
+ return True
405
+
406
+
407
+ class ChatGLM6BModel(nn.Module):
408
+ def __init__(self,
409
+ head_num, size_per_head,
410
+ vocab_size,
411
+ rotary_embedding_dim,
412
+ start_id, end_id, layer_num,
413
+ arch,
414
+ max_seq_len: int,
415
+ tensor_para_size: int,
416
+ pipeline_para_size: int,
417
+ inference_data_type: str,
418
+ inter_size: int = 0,
419
+ # glm_variant_params
420
+ layernorm_eps: float = 1e-5,
421
+ layernorm_type: typing.Literal['pre_layernorm', 'post_layernorm'] = "pre_layernorm",
422
+ activation_type: str = "Gelu",
423
+ gpt_with_moe: bool = False,
424
+ expert_num: int = 0,
425
+ moe_k: int = 0,
426
+ moe_layer_index: typing.List = [],
427
+ has_positional_encoding: bool = False,
428
+ has_pre_decoder_layernorm: bool = False,
429
+ has_post_decoder_layernorm: bool = True,
430
+ has_adapters: bool = False,
431
+ adapter_inter_size: int = 0,
432
+ use_attention_linear_bias: bool = False,
433
+ int8_mode: int = 0,
434
+ weights_data_type: typing.Union[str, np.dtype] = np.float32,
435
+ shared_contexts_ratio: float = 1.0):
436
+ super().__init__()
437
+ self.head_num = head_num
438
+ self.size_per_head = size_per_head
439
+ self.vocab_size = vocab_size
440
+ self.rotary_embedding_dim = rotary_embedding_dim
441
+ self.start_id = start_id
442
+ self.end_id = end_id
443
+ self.layer_num = layer_num
444
+ self.inter_size = inter_size if inter_size != 0 else 4 * self.head_num * self.size_per_head
445
+ self.arch = arch
446
+ # glm_variant_params
447
+ self.layernorm_eps = layernorm_eps
448
+ self.layernorm_type = layernorm_type
449
+ self.activation_type = activation_type
450
+ self.gpt_with_moe = gpt_with_moe
451
+ self.expert_num = expert_num
452
+ self.moe_k = moe_k
453
+ self.moe_layer_index = moe_layer_index
454
+ self.has_positional_encoding = has_positional_encoding
455
+ self.has_pre_decoder_layernorm = has_pre_decoder_layernorm
456
+ self.has_post_decoder_layernorm = has_post_decoder_layernorm
457
+ self.has_adapters = has_adapters
458
+ self.adapter_inter_size = adapter_inter_size
459
+ self.use_attention_linear_bias = use_attention_linear_bias
460
+
461
+ # multi-gpu params
462
+ self.tensor_para_size = tensor_para_size
463
+ self.pipeline_para_size = pipeline_para_size
464
+ self.use_sparse_gemm = False
465
+ self.build_model = False
466
+ self.int8_mode = int8_mode
467
+ self.weights_data_type = weights_data_type
468
+ self.shared_contexts_ratio = shared_contexts_ratio
469
+
470
+ assert torch.cuda.is_available(), "CUDA is required for this model."
471
+
472
+ assert head_num % tensor_para_size == 0, "head_num must be a multiple of tensor_para_size."
473
+ assert layer_num % pipeline_para_size == 0, "layer_num must be a multiple of pipeline_para_size."
474
+
475
+ # Load the compiled C++ model library into PyTorch.
476
+ if arch == "Ampere":
477
+ lib_path = pathlib.Path(__file__).parent / "ftlib" / "libth_transformer_sm80.so"
478
+ elif arch == "Volta":
479
+ lib_path = pathlib.Path(__file__).parent / "ftlib" / "libth_transformer_sm70.so"
480
+ torch.classes.load_library(os.path.abspath(lib_path))
481
+
482
+ # Prepare weights
483
+ self.weights = ChatGLM6BWeights(head_num, size_per_head, layer_num, vocab_size,
484
+ max_seq_len, tensor_para_size, pipeline_para_size,
485
+ weights_data_type=weights_data_type,
486
+ inference_data_type=inference_data_type,
487
+ gpt_with_moe=self.gpt_with_moe,
488
+ has_positional_encoding=self.has_positional_encoding,
489
+ has_pre_decoder_layernorm=self.has_pre_decoder_layernorm,
490
+ has_post_decoder_layernorm=self.has_post_decoder_layernorm,
491
+ has_adapters=self.has_adapters,
492
+ adapter_inter_size=self.adapter_inter_size,
493
+ int8_mode=int8_mode,
494
+ inter_size=inter_size)
495
+
496
+ # Prepare for tensor/pipeline parallel
497
+ try:
498
+ dist.init_process_group(backend='mpi')
499
+ except Exception:
500
+ print("[WARNING] init_process_group failed; assuming the process group is already initialized")
501
+ self.rank = dist.get_rank()
502
+ self.device_count = torch.cuda.device_count()
503
+ self.device = self.rank % self.device_count
504
+ torch.cuda.set_device(self.device)
505
+
506
+ world_size = dist.get_world_size()
507
+ assert world_size == tensor_para_size * pipeline_para_size, "tensor_para_size * pipeline_para_size must be equal to world_size."
508
+
509
+ self.tensor_para_rank = self.rank % self.tensor_para_size
510
+ self.pipeline_para_rank = self.rank // self.tensor_para_size
511
+
512
+ def load(self, ckpt_path):
513
+ is_load = self.weights.load(ckpt_path, tp_rank=self.tensor_para_rank,
514
+ pipeline_para_rank=self.pipeline_para_rank)
515
+ self.cuda()
516
+ torch.cuda.empty_cache() # clean cache for model weight preprocessing
517
+ return is_load
518
+
519
+ def sparse(self):
520
+ if not self.use_sparse_gemm:
521
+ self.use_sparse_gemm = True
522
+
523
+ def cuda(self):
524
+ self.weights._map(lambda w: w.cuda(self.device))
525
+ if self.int8_mode != 0:
526
+ self.weights._map_int8(lambda w: w.cuda(self.device))
527
+
528
+ if self.build_model:
529
+ del self.model
530
+ self.build_model = False
531
+
532
+ self.model = torch.classes.FasterTransformer.GlmOp(
533
+ self.head_num, self.size_per_head, self.inter_size,
534
+ self.layer_num,
535
+ self.expert_num,
536
+ self.moe_k,
537
+ self.moe_layer_index,
538
+ self.vocab_size,
539
+ self.rotary_embedding_dim,
540
+ self.start_id, self.end_id,
541
+ self.tensor_para_size, self.pipeline_para_size, self.int8_mode,
542
+ # GLM variant parameters
543
+ self.layernorm_eps,
544
+ self.layernorm_type,
545
+ self.activation_type,
546
+ self.has_positional_encoding,
547
+ self.has_pre_decoder_layernorm,
548
+ self.has_post_decoder_layernorm,
549
+ self.has_adapters,
550
+ self.adapter_inter_size,
551
+ self.use_attention_linear_bias,
552
+ self.weights.w,
553
+ self.weights.int8_w,
554
+ self.weights.scale,
555
+ self.shared_contexts_ratio)
556
+ self.build_model = True
557
+
558
+ def forward(self,
559
+ start_ids: torch.IntTensor,
560
+ start_lengths: torch.IntTensor,
561
+ mask_positions: torch.IntTensor,
562
+ output_len: int,
563
+ beam_width: int = 1,
564
+ top_k: typing.Optional[torch.IntTensor] = None,
565
+ top_p: typing.Optional[torch.FloatTensor] = None,
566
+ beam_search_diversity_rate: typing.Optional[torch.FloatTensor] = None,
567
+ temperature: typing.Optional[torch.FloatTensor] = None,
568
+ len_penalty: typing.Optional[torch.FloatTensor] = None,
569
+ repetition_penalty: typing.Optional[torch.FloatTensor] = None,
570
+ presence_penalty: typing.Optional[torch.FloatTensor] = None,
571
+ min_length: typing.Optional[torch.IntTensor] = None,
572
+ random_seed: typing.Optional[torch.LongTensor] = None,
573
+ bad_words_list: typing.Optional[torch.IntTensor] = None,
574
+ return_output_length: bool = False,
575
+ return_cum_log_probs: int = 0):
576
+ if not self.build_model:
577
+ # Build the model lazily when load() was never called
578
+ self.cuda()
579
+ torch.cuda.empty_cache() # clean cache for model weight preprocessing
580
+ input_len = start_ids.size(1)
581
+ assert input_len > 0, "input len must be larger than zero. For an unconditional case, use start_id as the first token."
582
+
583
+ # Inputs to device
584
+ start_ids = start_ids.cuda(self.device)
585
+ start_lengths = start_lengths.cuda(self.device)
586
+ mask_positions = mask_positions.cuda(self.device)
587
+
588
+ # outputs: output_ids, output_lengths, output_cum_log_probs (optional)
589
+ outputs = self.model.forward(start_ids,
590
+ start_lengths,
591
+ mask_positions,
592
+ output_len,
593
+ beam_width, # optional, can be None
594
+ top_k, # optional, can be None
595
+ top_p, # optional, can be None
596
+ beam_search_diversity_rate, # optional, can be None
597
+ temperature, # optional, can be None
598
+ len_penalty, # optional, can be None
599
+ repetition_penalty, # optional, can be None
600
+ presence_penalty, # optional, can be None
601
+ min_length, # optional, can be None
602
+ random_seed, # optional, can be None
603
+ bad_words_list, # optional, can be None
604
+ return_cum_log_probs) # optional, can be None
605
+ if return_cum_log_probs == 0:
606
+ output_ids, output_lengths = outputs
607
  else:
608
+ output_ids, output_lengths, output_cum_log_probs = outputs
609
+ if return_output_length:
610
+ if return_cum_log_probs > 0:
611
+ return output_ids, output_lengths, output_cum_log_probs
612
+ else:
613
+ return output_ids, output_lengths
614
+ else:
615
+ return output_ids
616
+
617
+ def set_input_tensor(self, input_tensor):
618
+ """Set input tensor to be used instead of forward()'s input.
619
+
620
+ When doing pipeline parallelism the input from the previous
621
+ stage comes from communication, not from the input, so the
622
+ model's forward_step_func won't have it. This function is thus
623
+ used by internal code to bypass the input provided by the
624
+ forward_step_func"""
625
+ self.input_tensor = input_tensor
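The rank arithmetic in `ChatGLM6BModel.__init__` and the `is_load` predicate in `ChatGLM6BWeights.load` can be illustrated with a small, self-contained sketch. It assumes `layers_per_device = layer_num // pipeline_para_size`, which is not shown in this part of the diff, and `rank_layout` is a hypothetical helper name that is not part of the commit.

```python
from typing import List, Tuple


def rank_layout(rank: int, tensor_para_size: int, pipeline_para_size: int,
                layer_num: int) -> Tuple[int, int, List[int]]:
    """Return (tensor_para_rank, pipeline_para_rank, owned_layer_indices)."""
    assert layer_num % pipeline_para_size == 0, \
        "layer_num must be a multiple of pipeline_para_size."
    assert 0 <= rank < tensor_para_size * pipeline_para_size, "rank out of range"

    # Same arithmetic as in ChatGLM6BModel.__init__.
    tensor_para_rank = rank % tensor_para_size
    pipeline_para_rank = rank // tensor_para_size

    # Assumption: each pipeline stage holds an equal, contiguous slice of layers.
    layers_per_device = layer_num // pipeline_para_size
    first = layers_per_device * pipeline_para_rank
    last = layers_per_device * (pipeline_para_rank + 1)

    # Mirrors the is_load(i) predicate: first <= i < last.
    owned = [i for i in range(layer_num) if first <= i < last]
    return tensor_para_rank, pipeline_para_rank, owned


if __name__ == "__main__":
    # Hypothetical 4-process run: 2-way tensor parallel, 2-way pipeline parallel.
    for rank in range(4):
        tp, pp, owned = rank_layout(rank, 2, 2, 28)
        print(f"rank={rank} tp={tp} pp={pp} layers={owned[0]}..{owned[-1]}")
```

With `tensor_para_size = 1` and a single pipeline stage, as in the `config.ini` added by this commit, every layer belongs to rank 0.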
models/config.ini ADDED
@@ -0,0 +1,13 @@
1
+ [glm6b]
2
+ model_name = chatglm-6b
3
+ head_num = 32
4
+ size_per_head = 128
5
+ inter_size = 16384
6
+ max_pos_seq_len = 2048
7
+ num_layer = 28
8
+ vocab_size = 130528
9
+ start_id = 130004
10
+ end_id = 130005
11
+ weight_data_type = fp16
12
+ tensor_para_size = 1
13
+ layernorm_eps = 1e-5
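A minimal sketch of how the `[glm6b]` section above could be read with Python's standard `configparser`. The commit does not show how the file is actually consumed, so `read_glm_config` is a hypothetical helper, and the INI text is inlined here only to keep the example self-contained.

```python
import configparser

# Same keys and values as the [glm6b] section added in this commit.
CONFIG_TEXT = """
[glm6b]
model_name = chatglm-6b
head_num = 32
size_per_head = 128
inter_size = 16384
max_pos_seq_len = 2048
num_layer = 28
vocab_size = 130528
start_id = 130004
end_id = 130005
weight_data_type = fp16
tensor_para_size = 1
layernorm_eps = 1e-5
"""

INT_KEYS = ("head_num", "size_per_head", "inter_size", "max_pos_seq_len",
            "num_layer", "vocab_size", "start_id", "end_id", "tensor_para_size")


def read_glm_config(text: str, section: str = "glm6b") -> dict:
    """Parse the INI text and convert numeric fields to int/float."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sec = parser[section]
    cfg = {key: sec.getint(key) for key in INT_KEYS}
    cfg["layernorm_eps"] = sec.getfloat("layernorm_eps")
    cfg["model_name"] = sec.get("model_name")
    cfg["weight_data_type"] = sec.get("weight_data_type")
    return cfg


cfg = read_glm_config(CONFIG_TEXT)
# The hidden size is implied by the head count and per-head size.
hidden_size = cfg["head_num"] * cfg["size_per_head"]
print(cfg["model_name"], hidden_size, cfg["inter_size"])
```

The derived hidden size matches the `hidden_size` of 4096 in the deleted `config.json`, and `inter_size` equals the `4 * head_num * size_per_head` fallback used in `ChatGLM6BModel.__init__`.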
models/config.json DELETED
@@ -1,25 +0,0 @@
1
- {
2
- "_name_or_path": "THUDM/chatglm-6b",
3
- "architectures": [
4
- "ChatGLMModel"
5
- ],
6
- "auto_map": {
7
- "AutoConfig": "configuration_chatglm.ChatGLMConfig",
8
- "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
9
- "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
10
- },
11
- "bos_token_id": 150004,
12
- "eos_token_id": 150005,
13
- "hidden_size": 4096,
14
- "inner_hidden_size": 16384,
15
- "layernorm_epsilon": 1e-05,
16
- "max_sequence_length": 2048,
17
- "model_type": "chatglm",
18
- "num_attention_heads": 32,
19
- "num_layers": 28,
20
- "position_encoding_2d": true,
21
- "torch_dtype": "float16",
22
- "transformers_version": "4.23.1",
23
- "use_cache": true,
24
- "vocab_size": 150528
25
- }
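The deleted `config.json` and the added `config.ini` disagree on vocabulary size and on the start/end token ids. The sketch below only checks the arithmetic on values copied from this diff. The shift equals the `num_image_tokens=20000` default that appears in the updated `SPTokenizer`, but the commit does not state that this is the cause, so treat that link as an assumption; `id_shift` is a hypothetical helper name.

```python
# Values copied from the deleted models/config.json.
OLD_CONFIG_JSON = {"vocab_size": 150528, "bos_token_id": 150004, "eos_token_id": 150005}
# Values copied from the added models/config.ini.
NEW_CONFIG_INI = {"vocab_size": 130528, "start_id": 130004, "end_id": 130005}

# Default of the num_image_tokens argument in the updated SPTokenizer.
NUM_IMAGE_TOKENS = 20000

# Which old key corresponds to which new key (naming differs between the files).
KEY_PAIRS = [("vocab_size", "vocab_size"),
             ("bos_token_id", "start_id"),
             ("eos_token_id", "end_id")]


def id_shift(old: dict, new: dict) -> dict:
    """Return old value minus new value for every paired key."""
    return {old_key: old[old_key] - new[new_key] for old_key, new_key in KEY_PAIRS}


shifts = id_shift(OLD_CONFIG_JSON, NEW_CONFIG_INI)
print(shifts)
```

All three differences are the same constant, so ids produced against the old vocabulary layout are not interchangeable with the new one without accounting for that offset.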
models/configuration_chatglm.py DELETED
@@ -1,92 +0,0 @@
1
- """ ChatGLM model configuration """
2
-
3
- from transformers.configuration_utils import PretrainedConfig
4
- from transformers.utils import logging
5
-
6
- logger = logging.get_logger(__name__)
7
-
8
-
9
- class ChatGLMConfig(PretrainedConfig):
10
- r"""
11
- This is the configuration class to store the configuration of a [`~ChatGLMModel`].
12
- It is used to instantiate an ChatGLM model according to the specified arguments, defining the model
13
- architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
14
- the ChatGLM-6B [THUDM/ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) architecture.
15
-
16
- Configuration objects inherit from [`PretrainedConfig`] and can be used
17
- to control the model outputs. Read the documentation from [`PretrainedConfig`]
18
- for more information.
19
-
20
-
21
- Args:
22
- vocab_size (`int`, *optional*, defaults to 150528):
23
- Vocabulary size of the ChatGLM-6B model. Defines the number of different tokens that can be represented by the
24
- `inputs_ids` passed when calling [`~ChatGLMModel`] or
25
- [`~TFChatGLMModel`].
26
- hidden_size (`int`, *optional*, defaults to 4096):
27
- Dimension of the encoder layers and the pooler layer.
28
- num_hidden_layers (`int`, *optional*, defaults to 28):
29
- Number of hidden layers in the Transformer encoder.
30
- num_attention_heads (`int`, *optional*, defaults to 32):
31
- Number of attention heads for each attention layer in the Transformer encoder.
32
- inner_hidden_size (`int`, *optional*, defaults to 16384):
33
- Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
34
- max_sequence_length (`int`, *optional*, defaults to 512):
35
- The maximum sequence length that this model might ever be used with.
36
- Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
37
- layernorm_epsilon (`float`, *optional*, defaults to 1e-5):
38
- The epsilon used by the layer normalization layers.
39
- use_cache (`bool`, *optional*, defaults to `True`):
40
- Whether the model should return the last key/values attentions (not used by all models).
41
- Example:
42
-
43
- ```python
44
- >>> from configuration_chatglm import ChatGLMConfig
45
- >>> from modeling_chatglm import ChatGLMModel
46
-
47
- >>> # Initializing a ChatGLM-6B THUDM/ChatGLM-6B style configuration
48
- >>> configuration = ChatGLMConfig()
49
-
50
- >>> # Initializing a model from the THUDM/ChatGLM-6B style configuration
51
- >>> model = ChatGLMModel(configuration)
52
-
53
- >>> # Accessing the model configuration
54
- >>> configuration = model.config
55
- ```
56
- """
57
- model_type = "chatglm"
58
-
59
- def __init__(
60
- self,
61
- vocab_size=150528,
62
- hidden_size=4096,
63
- num_layers=28,
64
- num_attention_heads=32,
65
- layernorm_epsilon=1e-5,
66
- use_cache=False,
67
- bos_token_id=150004,
68
- eos_token_id=150005,
69
- pad_token_id=0,
70
- max_sequence_length=2048,
71
- inner_hidden_size=16384,
72
- position_encoding_2d=True,
73
- **kwargs
74
- ):
75
- self.num_layers = num_layers
76
- self.vocab_size = vocab_size
77
- self.hidden_size = hidden_size
78
- self.num_attention_heads = num_attention_heads
79
- self.max_sequence_length = max_sequence_length
80
- self.layernorm_epsilon = layernorm_epsilon
81
- self.inner_hidden_size = inner_hidden_size
82
- self.use_cache = use_cache
83
- self.bos_token_id = bos_token_id
84
- self.eos_token_id = eos_token_id
85
- self.pad_token_id = pad_token_id
86
- self.position_encoding_2d = position_encoding_2d
87
- super().__init__(
88
- pad_token_id=pad_token_id,
89
- bos_token_id=bos_token_id,
90
- eos_token_id=eos_token_id,
91
- **kwargs
92
- )
models/ice_text.model DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:99871e0c85db81ad7af1028854fd091cd5778c8414ae9d94bbbc10d02c831c21
3
- size 2699926
models/tokenization_chatglm.py CHANGED
@@ -1,17 +1,13 @@
1
  """Tokenization classes for ChatGLM."""
2
- import sys
3
- import unicodedata
4
  from typing import List, Optional, Union
5
- from functools import lru_cache
6
  import os
7
- import collections
8
- import re
9
 
10
  from transformers.tokenization_utils import PreTrainedTokenizer
11
- from icetk.text_tokenizer import TextTokenizer
12
- from icetk.utils import auto_create
13
- import icetk.sentencepiece_model_pb2 as sp_model
14
- from transformers.utils import logging
 
15
 
16
  logger = logging.get_logger(__name__)
17
 
@@ -20,61 +16,55 @@ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
20
  }
21
 
22
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23
  class SPTokenizer:
24
  def __init__(
25
- self,
26
- vocab_file,
27
- max_blank_length=80,
28
- byte_fallback=True,
 
29
  ):
30
  assert vocab_file is not None
31
  self.vocab_file = vocab_file
 
32
  self.special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "<unused_0>", "<sop>", "<eop>", "<ENC>", "<dBLOCK>"]
33
  self.max_blank_length = max_blank_length
34
  self.byte_fallback = byte_fallback
35
- self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
36
- self.special_text_tokenizer = self._build_text_tokenizer(encode_special_tokens=True)
37
-
38
- @staticmethod
39
- def _configure_tokenizer(
40
- text_tokenizer: TextTokenizer,
41
- special_tokens: List[str],
42
- max_blank_length: int,
43
- byte_fallback: bool,
44
- encode_special_tokens=False,
45
- ):
46
- # special token
47
- special_token_type = 4 if encode_special_tokens else 3 # 3 - CONTROL, 4 - USER_DEFINE
48
- for token in special_tokens:
49
- text_tokenizer.proto.pieces.append(
50
- sp_model.ModelProto.SentencePiece(piece=token, score=0.0, type=special_token_type)
51
- )
52
- # whitespaces
53
- for token in [SPTokenizer.get_tab_token()] + [
54
- SPTokenizer.get_blank_token(i) for i in range(2, max_blank_length + 1)
55
- ]:
56
- text_tokenizer.proto.pieces.append(sp_model.ModelProto.SentencePiece(piece=token, score=0.0, type=4))
57
- # byte fallback
58
- if byte_fallback:
59
- text_tokenizer.proto.trainer_spec.byte_fallback = True
60
- for i in range(256):
61
- text_tokenizer.proto.pieces.append(
62
- sp_model.ModelProto.SentencePiece(piece="<0x{:02X}>".format(i), score=0.0, type=6)
63
- )
64
- text_tokenizer.refresh()
65
-
66
- def _build_text_tokenizer(self, encode_special_tokens=False):
67
- tokenizer = TextTokenizer(self.vocab_file)
68
- self._configure_tokenizer(
69
- tokenizer, self.special_tokens, self.max_blank_length, self.byte_fallback, encode_special_tokens
70
- )
71
- return tokenizer
72
 
73
- def _get_text_tokenizer(self, encode_special_tokens=False):
74
- if encode_special_tokens:
75
- return self.special_text_tokenizer
76
- else:
77
- return self.text_tokenizer
78
 
79
  @staticmethod
80
  def get_blank_token(length: int):
@@ -85,10 +75,6 @@ class SPTokenizer:
85
  def get_tab_token():
86
  return f"<|tab|>"
87
 
88
- @property
89
- def num_image_tokens(self):
90
- return 20000
91
-
92
  @property
93
  def num_text_tokens(self):
94
  return self.text_tokenizer.num_tokens
@@ -112,7 +98,7 @@ class SPTokenizer:
112
  return text
113
 
114
  def encode(
115
- self, text: str, linebreak=True, whitespaces=True, special_tokens=False, add_dummy_prefix=True
116
  ) -> List[int]:
117
  """
118
  @param text: Text to encode.
@@ -124,22 +110,31 @@ class SPTokenizer:
124
  text = self._preprocess(text, linebreak, whitespaces)
125
  if not add_dummy_prefix:
126
  text = "<n>" + text
127
- tmp = self._get_text_tokenizer(encode_special_tokens=special_tokens).encode(text)
128
  tokens = [x + self.num_image_tokens for x in tmp]
129
  return tokens if add_dummy_prefix else tokens[2:]
130
 
131
- def decode(self, text_ids: List[int], special_tokens=False) -> str:
132
- ids = [int(_id) - self.num_image_tokens for _id in text_ids]
133
- ids = [_id for _id in ids if _id >= 0]
134
- text = self._get_text_tokenizer(encode_special_tokens=special_tokens).decode(ids)
135
  text = text.replace("<n>", "\n")
136
  text = text.replace(SPTokenizer.get_tab_token(), "\t")
137
  for i in range(2, self.max_blank_length + 1):
138
  text = text.replace(self.get_blank_token(i), " " * i)
139
  return text
140
 
 
 
 
 
 
 
 
 
 
 
 
 
141
  def tokenize(
142
- self, text: str, linebreak=True, whitespaces=True, special_tokens=False, add_dummy_prefix=True
143
  ) -> List[str]:
144
  """
145
  @param text: Text to encode.
@@ -151,7 +146,7 @@ class SPTokenizer:
151
  text = self._preprocess(text, linebreak, whitespaces)
152
  if not add_dummy_prefix:
153
  text = "<n>" + text
154
- tokens = self._get_text_tokenizer(encode_special_tokens=special_tokens).tokenize(text)
155
  return tokens if add_dummy_prefix else tokens[2:]
156
 
157
  def __getitem__(self, x: Union[int, str]):
@@ -180,25 +175,36 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
180
 
181
  vocab_files_names = {"vocab_file": "ice_text.model"}
182
  max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
183
- model_input_names = ["input_ids"]
184
 
185
  def __init__(
186
  self,
187
  vocab_file,
188
  do_lower_case=False,
189
  remove_space=False,
190
- bos_token='sop',
191
- eos_token='eos',
192
- eop_token='eop',
193
  mask_token='[MASK]',
194
  gmask_token='[gMASK]',
195
  padding_side="left",
 
 
 
196
  **kwargs
197
  ) -> None:
198
  super().__init__(
199
  do_lower_case=do_lower_case,
200
  remove_space=remove_space,
201
  padding_side=padding_side,
 
 
 
 
 
 
 
 
202
  **kwargs
203
  )
204
 
@@ -208,23 +214,29 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
208
 
209
  self.bos_token = bos_token
210
  self.eos_token = eos_token
211
- self.eop_token = eop_token
212
  self.mask_token = mask_token
213
- self.gMASK_token = gmask_token
214
 
215
- self.sp_tokenizer = SPTokenizer(vocab_file)
216
 
217
  """ Initialisation """
218
 
219
  @property
220
- def eop_token_id(self) -> Optional[int]:
 
 
 
 
 
 
221
  """
222
- `Optional[int]`: Id of the end of sentence token in the vocabulary. Returns `None` if the token has not been
223
  set.
224
  """
225
- if self.eop_token is None:
226
  return None
227
- return self.convert_tokens_to_ids(self.eop_token)
228
 
229
  @property
230
  def vocab_size(self):
@@ -256,25 +268,21 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
256
 
257
  return seq
258
 
259
- def decode(
 
 
 
260
  self,
261
- token_ids: Union[List[int], List[List[int]]],
262
- skip_special_tokens: bool = False,
263
- clean_up_tokenization_spaces: bool = True,
264
- spaces_between_special_tokens: bool = True,
265
  **kwargs
266
  ) -> str:
267
- if isinstance(token_ids[0], list):
268
- tokens = []
269
- for single_token_ids in token_ids:
270
- if self.pad_token_id in single_token_ids: # remove pad
271
- single_token_ids = list(filter((self.pad_token_id).__ne__, single_token_ids))
272
- tokens.append(self.sp_tokenizer.decode(single_token_ids))
273
- return (tokens)
274
- else:
275
- if self.pad_token_id in token_ids: # remove pad
276
- token_ids = list(filter((self.pad_token_id).__ne__, token_ids))
277
- return self.sp_tokenizer.decode(token_ids)
278
 
279
  def _convert_token_to_id(self, token):
280
  """ Converts a token (str) in an id using the vocab. """
@@ -299,7 +307,7 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
299
  """
300
  if os.path.isdir(save_directory):
301
  vocab_file = os.path.join(
302
- save_directory, VOCAB_FILES_NAMES["vocab_file"]
303
  )
304
  else:
305
  vocab_file = save_directory
@@ -331,16 +339,105 @@ class ChatGLMTokenizer(PreTrainedTokenizer):
331
  Returns:
332
  `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
333
  """
 
 
 
334
  if token_ids_1 is not None:
335
- token_ids_0 += token_ids_1
336
- mask_ids = self.sp_tokenizer[self.mask_token]
337
- gmask_ids = self.sp_tokenizer[self.gMASK_token]
338
- if mask_ids not in token_ids_0 and gmask_ids not in token_ids_0:
339
- token_ids_0 += [gmask_ids]
340
-
341
- if token_ids_0[-1] != mask_ids and token_ids_0[-1] != gmask_ids:
342
- token_ids_0 += [self.sp_tokenizer[self.eos_token]]
343
 
344
- token_ids_0 += [self.sp_tokenizer[self.bos_token]]
 
 
 
 
 
 
 
 
 
345
 
346
- return token_ids_0
1
  """Tokenization classes for ChatGLM."""
 
 
2
  from typing import List, Optional, Union
 
3
  import os
 
 
4
 
5
  from transformers.tokenization_utils import PreTrainedTokenizer
6
+ from transformers.utils import logging, PaddingStrategy
7
+ from transformers.tokenization_utils_base import EncodedInput, BatchEncoding
8
+ from typing import Dict
9
+ import sentencepiece as spm
10
+ import numpy as np
11
 
12
  logger = logging.get_logger(__name__)
13
 
 
16
  }
17
 
18
 
19
+ class TextTokenizer:
20
+ def __init__(self, model_path):
21
+ self.sp = spm.SentencePieceProcessor()
22
+ self.sp.Load(model_path)
23
+ self.num_tokens = self.sp.vocab_size()
24
+
25
+ def encode(self, text):
26
+ return self.sp.EncodeAsIds(text)
27
+
28
+ def decode(self, ids: List[int]):
29
+ return self.sp.DecodeIds(ids)
30
+
31
+ def tokenize(self, text):
32
+ return self.sp.EncodeAsPieces(text)
33
+
34
+ def convert_tokens_to_string(self, tokens):
35
+ return self.sp.DecodePieces(tokens)
36
+
37
+ def convert_tokens_to_ids(self, tokens):
38
+ return [self.sp.PieceToId(token) for token in tokens]
39
+
40
+ def convert_token_to_id(self, token):
41
+ return self.sp.PieceToId(token)
42
+
43
+ def convert_id_to_token(self, idx):
44
+ return self.sp.IdToPiece(idx)
45
+
46
+ def __len__(self):
47
+ return self.num_tokens
48
+
49
+
50
  class SPTokenizer:
51
  def __init__(
52
+ self,
53
+ vocab_file,
54
+ num_image_tokens=20000,
55
+ max_blank_length=80,
56
+ byte_fallback=True,
57
  ):
58
  assert vocab_file is not None
59
  self.vocab_file = vocab_file
60
+ self.num_image_tokens = num_image_tokens
61
  self.special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "<unused_0>", "<sop>", "<eop>", "<ENC>", "<dBLOCK>"]
62
  self.max_blank_length = max_blank_length
63
  self.byte_fallback = byte_fallback
64
+ self.text_tokenizer = TextTokenizer(vocab_file)
65
 
66
+ def _get_text_tokenizer(self):
67
+ return self.text_tokenizer
 
 
 
68
 
69
  @staticmethod
70
  def get_blank_token(length: int):
 
75
  def get_tab_token():
76
  return f"<|tab|>"
77
 
 
 
 
 
78
  @property
79
  def num_text_tokens(self):
80
  return self.text_tokenizer.num_tokens
 
98
  return text
99
 
100
  def encode(
101
+ self, text: str, linebreak=True, whitespaces=True, add_dummy_prefix=True
102
  ) -> List[int]:
103
  """
104
  @param text: Text to encode.
 
110
  text = self._preprocess(text, linebreak, whitespaces)
111
  if not add_dummy_prefix:
112
  text = "<n>" + text
113
+ tmp = self._get_text_tokenizer().encode(text)
114
  tokens = [x + self.num_image_tokens for x in tmp]
115
  return tokens if add_dummy_prefix else tokens[2:]
116
 
117
+ def postprocess(self, text):
 
 
 
118
  text = text.replace("<n>", "\n")
119
  text = text.replace(SPTokenizer.get_tab_token(), "\t")
120
  for i in range(2, self.max_blank_length + 1):
121
  text = text.replace(self.get_blank_token(i), " " * i)
122
  return text
123
 
124
+ def decode(self, text_ids: List[int]) -> str:
125
+ ids = [int(_id) - self.num_image_tokens for _id in text_ids]
126
+ ids = [_id for _id in ids if _id >= 0]
127
+ text = self._get_text_tokenizer().decode(ids)
128
+ text = self.postprocess(text)
129
+ return text
130
+
131
+ def decode_tokens(self, tokens: List[str]) -> str:
132
+ text = self._get_text_tokenizer().convert_tokens_to_string(tokens)
133
+ text = self.postprocess(text)
134
+ return text
135
+
136
  def tokenize(
137
+ self, text: str, linebreak=True, whitespaces=True, add_dummy_prefix=True
138
  ) -> List[str]:
139
  """
140
  @param text: Text to encode.
 
146
  text = self._preprocess(text, linebreak, whitespaces)
147
  if not add_dummy_prefix:
148
  text = "<n>" + text
149
+ tokens = self._get_text_tokenizer().tokenize(text)
150
  return tokens if add_dummy_prefix else tokens[2:]
151
 
152
  def __getitem__(self, x: Union[int, str]):
 
175
 
176
  vocab_files_names = {"vocab_file": "ice_text.model"}
177
  max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
178
+ model_input_names = ["input_ids", "attention_mask", "position_ids"]
179
 
180
  def __init__(
181
  self,
182
  vocab_file,
183
  do_lower_case=False,
184
  remove_space=False,
185
+ bos_token='<sop>',
186
+ eos_token='<eop>',
187
+ end_token='</s>',
188
  mask_token='[MASK]',
189
  gmask_token='[gMASK]',
190
  padding_side="left",
191
+ pad_token="<pad>",
192
+ unk_token="<unk>",
193
+ num_image_tokens=20000,
194
  **kwargs
195
  ) -> None:
196
  super().__init__(
197
  do_lower_case=do_lower_case,
198
  remove_space=remove_space,
199
  padding_side=padding_side,
200
+ bos_token=bos_token,
201
+ eos_token=eos_token,
202
+ end_token=end_token,
203
+ mask_token=mask_token,
204
+ gmask_token=gmask_token,
205
+ pad_token=pad_token,
206
+ unk_token=unk_token,
207
+ num_image_tokens=num_image_tokens,
208
  **kwargs
209
  )
210
 
 
214
 
215
  self.bos_token = bos_token
216
  self.eos_token = eos_token
217
+ self.end_token = end_token
218
  self.mask_token = mask_token
219
+ self.gmask_token = gmask_token
220
 
221
+ self.sp_tokenizer = SPTokenizer(vocab_file, num_image_tokens=num_image_tokens)
222
 
223
  """ Initialisation """
224
 
225
  @property
226
+ def gmask_token_id(self) -> Optional[int]:
227
+ if self.gmask_token is None:
228
+ return None
229
+ return self.convert_tokens_to_ids(self.gmask_token)
230
+
231
+ @property
232
+ def end_token_id(self) -> Optional[int]:
233
  """
234
+ `Optional[int]`: Id of the end of context token in the vocabulary. Returns `None` if the token has not been
235
  set.
236
  """
237
+ if self.end_token is None:
238
  return None
239
+ return self.convert_tokens_to_ids(self.end_token)
240
 
241
  @property
242
  def vocab_size(self):
 
268
 
269
  return seq
270
 
271
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
272
+ return self.sp_tokenizer.decode_tokens(tokens)
273
+
274
+ def _decode(
275
  self,
276
+ token_ids: Union[int, List[int]],
 
 
 
277
  **kwargs
278
  ) -> str:
279
+ if isinstance(token_ids, int):
280
+ token_ids = [token_ids]
281
+ if len(token_ids) == 0:
282
+ return ""
283
+ if self.pad_token_id in token_ids: # remove pad
284
+ token_ids = list(filter((self.pad_token_id).__ne__, token_ids))
285
+ return super()._decode(token_ids, **kwargs)
 
 
 
 
286
 
287
  def _convert_token_to_id(self, token):
288
  """ Converts a token (str) in an id using the vocab. """
 
307
  """
308
  if os.path.isdir(save_directory):
309
  vocab_file = os.path.join(
310
+ save_directory, self.vocab_files_names["vocab_file"]
311
  )
312
  else:
313
  vocab_file = save_directory
 
339
  Returns:
340
  `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
341
  """
342
+ gmask_id = self.sp_tokenizer[self.gmask_token]
343
+ eos_id = self.sp_tokenizer[self.eos_token]
344
+ token_ids_0 = token_ids_0 + [gmask_id, self.sp_tokenizer[self.bos_token]]
345
  if token_ids_1 is not None:
346
+ token_ids_0 = token_ids_0 + token_ids_1 + [eos_id]
347
+ return token_ids_0
 
 
 
 
 
 
348
 
349
+ def _pad(
350
+ self,
351
+ encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
352
+ max_length: Optional[int] = None,
353
+ padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
354
+ pad_to_multiple_of: Optional[int] = None,
355
+ return_attention_mask: Optional[bool] = None,
356
+ ) -> dict:
357
+ """
358
+ Pad encoded inputs (on left/right and up to predefined length or max length in the batch)
359
 
360
+ Args:
361
+ encoded_inputs:
362
+ Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
363
+ max_length: maximum length of the returned list and optionally padding length (see below).
364
+ Will truncate by taking into account the special tokens.
365
+ padding_strategy: PaddingStrategy to use for padding.
366
+
367
+ - PaddingStrategy.LONGEST Pad to the longest sequence in the batch
368
+ - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
369
+ - PaddingStrategy.DO_NOT_PAD: Do not pad
370
+ The tokenizer padding sides are defined in self.padding_side:
371
+
372
+ - 'left': pads on the left of the sequences
373
+ - 'right': pads on the right of the sequences
374
+ pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
375
+ This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
376
+ `>= 7.5` (Volta).
377
+ return_attention_mask:
378
+ (optional) Set to False to avoid returning attention mask (default: set to model specifics)
379
+ """
380
+ # Load from model defaults
381
+ bos_token_id = self.sp_tokenizer[self.bos_token]
382
+ mask_token_id = self.sp_tokenizer[self.mask_token]
383
+ gmask_token_id = self.sp_tokenizer[self.gmask_token]
384
+ assert self.padding_side == "left"
385
+
386
+ required_input = encoded_inputs[self.model_input_names[0]]
387
+ seq_length = len(required_input)
388
+
389
+ if padding_strategy == PaddingStrategy.LONGEST:
390
+ max_length = len(required_input)
391
+
392
+ if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
393
+ max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of
394
+
395
+ needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length
396
+
397
+ # Initialize attention mask if not present.
398
+ if max_length is not None:
399
+ if "attention_mask" not in encoded_inputs:
400
+ if bos_token_id in required_input:
401
+ context_length = required_input.index(bos_token_id)
402
+ else:
403
+ context_length = seq_length
404
+ attention_mask = np.ones((1, seq_length, seq_length))
405
+ attention_mask = np.tril(attention_mask)
406
+ attention_mask[:, :, :context_length] = 1
407
+ attention_mask = np.bool_(attention_mask < 0.5)
408
+ encoded_inputs["attention_mask"] = attention_mask
409
+
410
+ if "position_ids" not in encoded_inputs:
411
+ if bos_token_id in required_input:
412
+ context_length = required_input.index(bos_token_id)
413
+ else:
414
+ context_length = seq_length
415
+ position_ids = np.arange(seq_length, dtype=np.int64)
416
+ mask_token = mask_token_id if mask_token_id in required_input else gmask_token_id
417
+ if mask_token in required_input:
418
+ mask_position = required_input.index(mask_token)
419
+ position_ids[context_length:] = mask_position
420
+ block_position_ids = np.concatenate(
421
+ [np.zeros(context_length, dtype=np.int64),
422
+ np.arange(1, seq_length - context_length + 1, dtype=np.int64)])
423
+ encoded_inputs["position_ids"] = np.stack([position_ids, block_position_ids], axis=0)
424
+
425
+ if needs_to_be_padded:
426
+ difference = max_length - len(required_input)
427
+
428
+ if "attention_mask" in encoded_inputs:
429
+ encoded_inputs["attention_mask"] = np.pad(encoded_inputs["attention_mask"],
430
+ pad_width=[(0, 0), (difference, 0), (difference, 0)],
431
+ mode='constant', constant_values=True)
432
+ if "token_type_ids" in encoded_inputs:
433
+ encoded_inputs["token_type_ids"] = [self.pad_token_type_id] * difference + encoded_inputs[
434
+ "token_type_ids"
435
+ ]
436
+ if "special_tokens_mask" in encoded_inputs:
437
+ encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"]
438
+ if "position_ids" in encoded_inputs:
439
+ encoded_inputs["position_ids"] = np.pad(encoded_inputs["position_ids"],
440
+ pad_width=[(0, 0), (difference, 0)])
441
+ encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input
442
+
443
+ return encoded_inputs
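
The new `build_inputs_with_special_tokens` and `_pad` above work together: the first appends `[gMASK]` and `<sop>` after the prompt, and the second uses the position of `<sop>` to derive the attention mask and the two-row `position_ids`. The following is a minimal pure-Python sketch of that logic, not the code from the commit. The function names and the three token ids are hypothetical placeholders, the leading batch dimension of the mask is dropped, left padding is omitted, and only the `[gMASK]` case is handled (the diff also checks `[MASK]`).

```python
from typing import List, Optional, Tuple

# Placeholder ids for illustration only; the real ids come from the vocab file.
GMASK_ID, BOS_ID, EOS_ID = 9001, 9002, 9003


def build_inputs(prompt: List[int], answer: Optional[List[int]] = None) -> List[int]:
    """Layout used by the new code: prompt + [gMASK, <sop>] (+ answer + <eop>)."""
    ids = prompt + [GMASK_ID, BOS_ID]
    if answer is not None:
        ids = ids + answer + [EOS_ID]
    return ids


def masked_positions(ids: List[int]) -> List[List[bool]]:
    """Return an n x n grid where True marks a position that is NOT attended to."""
    n = len(ids)
    context_length = ids.index(BOS_ID) if BOS_ID in ids else n
    # Row i may attend to column j if j lies in the context (before <sop>)
    # or if j <= i (causal part). Everything else is masked.
    return [[not (j < context_length or j <= i) for j in range(n)] for i in range(n)]


def position_ids_2d(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Return (position_ids, block_position_ids) as two plain lists."""
    n = len(ids)
    context_length = ids.index(BOS_ID) if BOS_ID in ids else n
    position = list(range(n))
    if GMASK_ID in ids:
        mask_position = ids.index(GMASK_ID)
        # Every token from <sop> onward reuses the position of the mask token.
        for k in range(context_length, n):
            position[k] = mask_position
    # Block positions are 0 inside the context and count up from 1 afterwards.
    block = [0] * context_length + list(range(1, n - context_length + 1))
    return position, block


ids = build_inputs([11, 12, 13])
print(ids)
print(position_ids_2d(ids))
print(masked_positions(ids)[0])
```

For a three-token prompt, only the `<sop>` column is hidden from the context rows, and the `<sop>` row itself can attend to everything before it.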
models/tokenizer_config.json CHANGED
@@ -1,8 +1,8 @@
1
  {
2
  "name_or_path": "THUDM/chatglm-6b",
3
  "bos_token": "<sop>",
4
- "eop_token": "<eop>",
5
- "eos_token": "</s>",
6
  "gmask_token": "[gMASK]",
7
  "mask_token": "[MASK]",
8
  "pad_token": "<pad>",
@@ -10,6 +10,7 @@
10
  "remove_space": false,
11
  "do_lower_case": false,
12
  "tokenizer_class": "ChatGLMTokenizer",
 
13
  "auto_map": {
14
  "AutoTokenizer": [
15
  "tokenization_chatglm.ChatGLMTokenizer",
 
1
  {
2
  "name_or_path": "THUDM/chatglm-6b",
3
  "bos_token": "<sop>",
4
+ "eos_token": "<eop>",
5
+ "end_token": "</s>",
6
  "gmask_token": "[gMASK]",
7
  "mask_token": "[MASK]",
8
  "pad_token": "<pad>",
 
10
  "remove_space": false,
11
  "do_lower_case": false,
12
  "tokenizer_class": "ChatGLMTokenizer",
13
+ "num_image_tokens": 0,
14
  "auto_map": {
15
  "AutoTokenizer": [
16
  "tokenization_chatglm.ChatGLMTokenizer",
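
The added `"num_image_tokens": 0` entry is forwarded by `ChatGLMTokenizer.__init__` to `SPTokenizer`, whose `encode` adds `num_image_tokens` to every SentencePiece id and whose `decode` subtracts it again and drops negative results. The sketch below restates only that offset arithmetic, assuming nothing beyond what the diff shows; `shift_ids` and `unshift_ids` are hypothetical helper names, not part of the repository.

```python
from typing import List


def shift_ids(text_ids: List[int], num_image_tokens: int) -> List[int]:
    """Offset applied in SPTokenizer.encode: move text ids past the image-token range."""
    return [i + num_image_tokens for i in text_ids]


def unshift_ids(ids: List[int], num_image_tokens: int) -> List[int]:
    """Offset removed in SPTokenizer.decode: ids that fall below zero are discarded."""
    shifted = [int(i) - num_image_tokens for i in ids]
    return [i for i in shifted if i >= 0]


# With the constructor default of 20000, text ids are shifted by 20000.
print(shift_ids([5, 7], 20000))
# With the configured value of 0, the mapping is the identity.
print(shift_ids([5, 7], 0))
# Ids inside the image range are dropped when decoding with a non-zero offset.
print(unshift_ids([20005, 3, 20007], 20000))
```

With the value set to 0 in `tokenizer_config.json`, the ids produced by the tokenizer match the raw SentencePiece ids, which is presumably why the entry was added alongside the vocabulary change.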
requirements.txt CHANGED
@@ -1,4 +1,8 @@
1
  icetk
2
- torch
3
  transformers
4
-
 
 
 
 
 
1
  icetk
2
+ cpm_kernels
3
  transformers
4
+ huggingface_hub
5
+ numpy
6
+ setuptools
7
+ torch
8
+ protobuf==3.20.3