vanewu committed on
Commit
2b20dbd
·
1 Parent(s): 6fa2ca0

remove files

CHANGELOG.md DELETED
@@ -1,3 +0,0 @@
- ## v1.0
-
- - Add accelerated ChatGLM-6B model (from: https://huggingface.co/THUDM/chatglm-6b)
Dockerfile DELETED
@@ -1,11 +0,0 @@
- FROM nvcr.io/nvidia/pytorch:23.02-py3
-
- WORKDIR /workdir
-
- COPY requirements.txt /workdir/
-
- # installing icetk pulls in protobuf 3.18.3, but we need protobuf==3.20.3, so pin it afterwards
- RUN pip install -r requirements.txt && \
-     pip install protobuf==3.20.3
-
-
README.md DELETED
@@ -1,120 +0,0 @@
- ---
- license: creativeml-openrail-m
- language:
- - en
- tags:
- - LLM
- - tensorRT
- - ChatGLM
- ---
- ## Model Card for lyraChatGLM
-
- lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.
-
- lyraChatGLM runs inference about **10x** faster than the early original version. We are still working hard to further improve the performance.
-
- Among its main features are:
-
- - weights: original ChatGLM-6B weights released by THUDM.
- - device: lyraChatGLM is mainly based on TensorRT compiled for SM=80 (A100, for example).
- - batch_size: compiled with dynamic batch size, max batch_size = 8
-
- ## Speed
-
- ### test environment
-
- - device: Nvidia A100 40G
- - batch size: 8
-
- **Since the early ChatGLM version didn't support batch inference, `original` in the table below was measured at batch_size=1.**
-
-
- **According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), this bug has been fixed and the speed at batch_size=8 reaches up to 137 tokens/s. We will re-evaluate and update the performance figures.**
-
- |version|speed|
- |:-:|:-:|
- |original|30 tokens/s|
- |lyraChatGLM|310 tokens/s|
-
-
- ## Model Sources
-
- - **Repository:** https://huggingface.co/THUDM/chatglm-6b
-
- ## Try the demo in 2 quick steps
-
- ``` bash
- #step 1
- git clone https://huggingface.co/TMElyralab/lyraChatGLM
- cd lyraChatGLM
-
- #step 2
- docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
- ```
-
- ## Uses
-
- ```python
- from transformers import AutoTokenizer
- from lyraChatGLM import GLM6B, FasterChatGLM
- import os
-
- current_workdir = os.path.dirname(__file__)
-
- MAX_OUT_LEN = 100
- chatglm6b_dir = os.path.join(current_workdir, "models")
- tokenizer = AutoTokenizer.from_pretrained(chatglm6b_dir, trust_remote_code=True)
- input_str = ["为什么我们需要对深度学习模型加速?", ]
- inputs = tokenizer(input_str, return_tensors="pt", padding=True)
- input_ids = inputs.input_ids.to('cuda:0')
-
- plan_path = os.path.join(current_workdir, "models/glm6b-bs8.ftm")
-
- # kernel for chat model.
- kernel = GLM6B(plan_path=plan_path,
-                batch_size=1,
-                num_beams=1,
-                use_cache=True,
-                num_heads=32,
-                emb_size_per_heads=128,
-                decoder_layers=28,
-                vocab_size=150528,
-                max_seq_len=MAX_OUT_LEN)
-
- chat = FasterChatGLM(model_dir=chatglm6b_dir, kernel=kernel).half().cuda()
-
- # generate
- sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
- # de-tokenize model output to text
- res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
- print(res)
- ```
- ## Demo output
-
- ### input
- 为什么我们需要对深度学习模型加速? 。 (English: "Why do we need to accelerate deep learning models?")
-
- ### output
- 为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。 (English: "Why do we need to accelerate deep learning models? Training deep learning models requires a large amount of computing resources; training in particular needs large amounts of memory, GPUs (graphics processing units), and other computing resources. Training a deep learning model therefore takes time, and if the model cannot be trained quickly, training may progress slowly or fail altogether.")
-
- 以下是一些原因我们需要对深度学习模型加速: (English: "Here are some reasons why we need to accelerate deep learning models:")
-
- 1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。 (English: "1. Training deep neural networks requires a large amount of computing resources, and deep neural networks in particular need even more, so faster training is required.")
-
- ### TODO:
-
- We plan to implement a FasterTransformer version to publish a much faster release. Stay tuned!
-
- ## Citation
- ``` bibtex
- @Misc{lyraChatGLM2023,
-   author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
-   title =        {lyraChatGLM: Accelerating ChatGLM by 10x+},
-   howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
-   year =         {2023}
- }
- ```
-
- ## Report bug
- - Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- - Mark bug reports with `[bug]` in the title.
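Note on the deleted `README.md`: the "10x" headline can be checked against the speed table. The sketch below is only arithmetic on the numbers quoted in the README. The helper name `speedup` is ours, and reading the 137 tokens/s figure as the original model's batch_size=8 throughput is an assumption based on the linked discussion, not something the README states outright.

```python
def speedup(baseline_tokens_per_s: float, accelerated_tokens_per_s: float) -> float:
    """Return how many times faster the accelerated model generates tokens."""
    if baseline_tokens_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return accelerated_tokens_per_s / baseline_tokens_per_s


# Values copied from the README speed table (tokens/s).
ORIGINAL_BS1 = 30.0   # original model, measured at batch_size=1
LYRA_BS8 = 310.0      # lyraChatGLM, measured at batch_size=8

# Assumption: 137 tokens/s is the original model at batch_size=8 after the fix.
ORIGINAL_BS8 = 137.0

headline = speedup(ORIGINAL_BS1, LYRA_BS8)
like_for_like = speedup(ORIGINAL_BS8, LYRA_BS8)
print(f"headline: {headline:.1f}x, same batch size: {like_for_like:.1f}x")
```

The headline ratio compares different batch sizes, which is why the README promises an updated measurement.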
demo.py DELETED
@@ -1,35 +0,0 @@
- # coding=utf-8
-
- from transformers import AutoTokenizer
- from lyraChatGLM import GLM6B, FasterChatGLM
- import os
-
- current_workdir = os.path.dirname(__file__)
-
- MAX_OUT_LEN = 100
- chatglm6b_dir = os.path.join(current_workdir, "models")
- tokenizer = AutoTokenizer.from_pretrained(chatglm6b_dir, trust_remote_code=True)
- input_str = ["为什么我们需要对深度学习模型加速?", ]
- inputs = tokenizer(input_str, return_tensors="pt", padding=True)
- input_ids = inputs.input_ids.to('cuda:0')
-
- plan_path = os.path.join(current_workdir, "models/glm6b-bs8.ftm")
-
- # kernel for chat model.
- kernel = GLM6B(plan_path=plan_path,
-                batch_size=1,
-                num_beams=1,
-                use_cache=True,
-                num_heads=32,
-                emb_size_per_heads=128,
-                decoder_layers=28,
-                vocab_size=150528,
-                max_seq_len=MAX_OUT_LEN)
-
- chat = FasterChatGLM(model_dir=chatglm6b_dir, kernel=kernel).half().cuda()
-
- # generate
- sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
- # de-tokenize model output to text
- res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
- print(res)
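Note on the deleted `demo.py`: the demo sends a single prompt, while the README states the engine was compiled with a maximum batch size of 8. A minimal sketch of splitting a longer prompt list into batches that respect that limit follows. The helper `iter_batches` and the constant name are hypothetical and not part of the repository. Whether the `GLM6B` kernel's `batch_size` argument must match each batch is not confirmed by the source.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 8  # from the README: "max batch_size = 8"


def iter_batches(prompts: Sequence[str], batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `prompts`, each at most `batch_size` long."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


# Hypothetical workload: 19 placeholder prompts.
prompts = [f"prompt {i}" for i in range(19)]
batch_sizes = [len(batch) for batch in iter_batches(prompts)]
print(batch_sizes)
```

Each yielded batch could then be tokenized with `padding=True`, as the demo already does for its one-element list.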
lyraChatGLM/__init__.py DELETED
@@ -1,10 +0,0 @@
- import os
- import ctypes
-
- current_workdir = os.path.dirname(__file__)
- ctypes.cdll.LoadLibrary(os.path.join(current_workdir, "libnvinfer_plugin.so"))
- os.environ["TORCH_USE_RTLD_GLOBAL"] = "YES"
-
- import torch
- from .glm import GLM6B
- from .model import FasterChatGLM
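Note on the deleted `lyraChatGLM/__init__.py`: the package loads `libnvinfer_plugin.so` with `ctypes` before importing `torch`, and fails with a low-level loader error if the file is missing. A minimal sketch of the same load step with an explicit existence check is shown below. The function `load_plugin` is hypothetical and not part of the repository.

```python
import ctypes
import os


def load_plugin(lib_dir: str, lib_name: str = "libnvinfer_plugin.so") -> ctypes.CDLL:
    """Load the TensorRT plugin library from `lib_dir`, failing early with a clear message."""
    lib_path = os.path.join(lib_dir, lib_name)
    if not os.path.isfile(lib_path):
        raise FileNotFoundError(f"TensorRT plugin library not found: {lib_path}")
    # Same call the package uses; the returned handle keeps the library loaded.
    return ctypes.cdll.LoadLibrary(lib_path)
```

Calling `load_plugin(os.path.dirname(__file__))` at the top of the package would mirror the original behaviour while turning a missing file into a readable error.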
lyraChatGLM/model.py DELETED
@@ -1,131 +0,0 @@
- import torch
- from transformers.modeling_outputs import CausalLMOutputWithPast
- from transformers.modeling_utils import PreTrainedModel
- from transformers import AutoConfig
- from typing import Dict, List, Tuple, Union, Optional
-
-
- class FasterChatGLM(PreTrainedModel):
-     def __init__(self, model_dir, kernel, *inputs, **kwargs):
-         config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
-         config.n_head = config.num_attention_heads
-         config.n_embd = config.hidden_size
-         config.n_layer = config.num_layers
-         super().__init__(config, *inputs, **kwargs)
-         self.kernel = kernel
-         self.fake_reg = torch.nn.Linear(2, 2)
-         self.position_encoding_2d = True
-
-     def forward(self, input_ids, position_ids, attention_mask, past_key_values, *args, **kwargs):
-         inputs_values = [input_ids, position_ids, attention_mask]
-         if past_key_values is not None:
-             inputs_values = inputs_values + past_key_values
-
-         computed = self.kernel.infer(inputs_values)
-         logits = computed[0]
-         if len(computed) == 1:
-             present_key_values = None
-         else:
-             present_key_values = computed[1:]
-
-         return CausalLMOutputWithPast(logits=logits, past_key_values=present_key_values)
-
-     def get_masks_and_position_ids(self, seq, mask_position, context_length, device, gmask=False):
-         attention_mask = torch.ones((1, context_length, context_length), device=device)
-         attention_mask.tril_()
-         attention_mask[..., :context_length - 1] = 1
-         attention_mask.unsqueeze_(1)
-         attention_mask = (attention_mask < 0.5).bool()
-
-         if self.position_encoding_2d:
-             seq_length = seq.index(150004)
-             position_ids = torch.arange(context_length, dtype=torch.long, device=device)
-             if not gmask:
-                 position_ids[seq_length:] = mask_position
-             block_position_ids = torch.cat((
-                 torch.zeros(seq_length, dtype=torch.long, device=device),
-                 torch.arange(context_length - seq_length, dtype=torch.long, device=device) + 1
-             ))
-             position_ids = torch.stack((position_ids, block_position_ids), dim=0)
-         else:
-             position_ids = torch.arange(context_length, dtype=torch.long, device=device)
-             if not gmask:
-                 position_ids[context_length - 1:] = mask_position
-
-         position_ids = position_ids.unsqueeze(0)
-
-         return attention_mask, position_ids
-
-     def prepare_one_sample(self, input_id, mask_token, past, past_key_values, use_gmask):
-
-         seq = input_id.tolist()
-         if mask_token not in seq:
-             raise ValueError("You have to add either [MASK] or [gMASK] in your input")
-
-         mask_position = seq.index(mask_token)  # safe: membership was checked above
-
-         # only last token for input_ids if past is not None
-         if past is not None or past_key_values is not None:
-             context_length = seq.index(150004)
-             last_token = input_id[-1].unsqueeze(-1).unsqueeze(0)  # 2 dim
-             proc_input_id = last_token
-             if self.position_encoding_2d:
-                 position_ids = torch.tensor([[[mask_position], [len(seq) - context_length]]], dtype=torch.long,
-                                             device=input_id.device)
-             else:
-                 position_ids = torch.tensor([[mask_position]], dtype=torch.long, device=input_id.device)
-
-             attention_mask = torch.zeros(1, 1, 1, 1, device=input_id.device)
-         else:
-             proc_input_id = input_id.unsqueeze(0)
-             attention_mask, position_ids = self.get_masks_and_position_ids(
-                 seq=seq,
-                 mask_position=mask_position,
-                 context_length=len(seq),
-                 device=input_id.device,
-                 gmask=use_gmask
-             )
-
-         return (proc_input_id.to(torch.int32), position_ids.to(torch.int32),
-                 attention_mask.to(torch.bool))
-
-     def prepare_inputs_for_generation(
-             self,
-             input_ids: torch.LongTensor,
-             past: Optional[torch.Tensor] = None,
-             past_key_values: Optional[torch.Tensor] = None,
-             attention_mask: Optional[torch.Tensor] = None,
-             use_cache: Optional[bool] = None,
-             **kwargs
-     ) -> dict:
-
-         MASK, gMASK = 150000, 150001
-         mask_token = MASK if MASK in input_ids else gMASK
-         use_gmask = False if MASK in input_ids else True
-
-         batch_input_ids, batch_position_ids, batch_attention_mask = [], [], []
-         for input_id in input_ids:
-             proc_input_id, position_id, attention_mask = self.prepare_one_sample(
-                 input_id, mask_token, past, past_key_values, use_gmask)
-             batch_input_ids.append(proc_input_id)
-             batch_position_ids.append(position_id)
-             batch_attention_mask.append(attention_mask)
-
-         batch_input_ids = torch.vstack(batch_input_ids)
-         batch_position_ids = torch.vstack(batch_position_ids)
-         batch_attention_mask = torch.vstack(batch_attention_mask)
-
-         if past is None:
-             past = past_key_values
-
-         if past is not None or past_key_values is not None:
-             self.kernel.set_context_mode(False)
-         else:
-             self.kernel.set_context_mode(self.config.use_cache)
-
-         return {
-             "input_ids": batch_input_ids,
-             "past_key_values": past_key_values,
-             "position_ids": batch_position_ids,
-             "attention_mask": batch_attention_mask
-         }
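Note on the deleted `lyraChatGLM/model.py`: `get_masks_and_position_ids` builds two rows of position ids when `position_encoding_2d` is set. The first row counts positions in the prompt, and the second counts positions in the generated block after the `<sop>` token (id 150004). The sketch below restates that logic with plain Python lists so it can be followed without a GPU or `torch`. The function name `position_ids_2d` and the placeholder token ids 11, 12, 13 are ours.

```python
from typing import List, Tuple

BOS_ID = 150004    # bos_token_id (<sop>) from models/config.json
GMASK_ID = 150001  # gMASK id used in prepare_inputs_for_generation


def position_ids_2d(seq: List[int], mask_position: int, gmask: bool) -> Tuple[List[int], List[int]]:
    """List-based restatement of the 2D branch of get_masks_and_position_ids."""
    context_length = len(seq)
    seq_length = seq.index(BOS_ID)  # number of tokens before <sop>
    position_ids = list(range(context_length))
    if not gmask:
        # With [MASK], every position from <sop> onward points back at the mask.
        for i in range(seq_length, context_length):
            position_ids[i] = mask_position
    # Zeros over the prompt, then 1, 2, ... over the generated block.
    block_position_ids = [0] * seq_length + list(range(1, context_length - seq_length + 1))
    return position_ids, block_position_ids


# Placeholder prompt tokens followed by [gMASK] and <sop>.
seq = [11, 12, 13, GMASK_ID, BOS_ID]
pos, block = position_ids_2d(seq, mask_position=seq.index(GMASK_ID), gmask=True)
print(pos, block)
```

In the original code the two rows are stacked into one tensor and given a leading batch dimension before being passed to the kernel.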
models/config.json DELETED
@@ -1,25 +0,0 @@
- {
-   "_name_or_path": "THUDM/chatglm-6b",
-   "architectures": [
-     "ChatGLMModel"
-   ],
-   "auto_map": {
-     "AutoConfig": "configuration_chatglm.ChatGLMConfig",
-     "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
-     "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
-   },
-   "bos_token_id": 150004,
-   "eos_token_id": 150005,
-   "hidden_size": 4096,
-   "inner_hidden_size": 16384,
-   "layernorm_epsilon": 1e-05,
-   "max_sequence_length": 2048,
-   "model_type": "chatglm",
-   "num_attention_heads": 32,
-   "num_layers": 28,
-   "position_encoding_2d": true,
-   "torch_dtype": "float16",
-   "transformers_version": "4.23.1",
-   "use_cache": true,
-   "vocab_size": 150528
- }
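Note on the deleted `models/config.json`: the `GLM6B` kernel arguments hard-coded in `demo.py` (`num_heads=32`, `emb_size_per_heads=128`, `decoder_layers=28`, `vocab_size=150528`) can be derived from this file. A minimal sketch of that derivation follows, using an inline copy of the relevant fields. The dictionary name `kernel_kwargs` is ours, and whether the kernel accepts values computed this way is an assumption.

```python
import json

# Relevant fields copied from models/config.json.
CONFIG_JSON = """
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_layers": 28,
  "vocab_size": 150528
}
"""

config = json.loads(CONFIG_JSON)

# Per-head embedding size is the hidden size split evenly across heads.
head_dim, remainder = divmod(config["hidden_size"], config["num_attention_heads"])
if remainder:
    raise ValueError("hidden_size must be divisible by num_attention_heads")

kernel_kwargs = {
    "num_heads": config["num_attention_heads"],
    "emb_size_per_heads": head_dim,
    "decoder_layers": config["num_layers"],
    "vocab_size": config["vocab_size"],
}
print(kernel_kwargs)
```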
models/configuration_chatglm.py DELETED
@@ -1,92 +0,0 @@
- """ ChatGLM model configuration """
-
- from transformers.configuration_utils import PretrainedConfig
- from transformers.utils import logging
-
- logger = logging.get_logger(__name__)
-
-
- class ChatGLMConfig(PretrainedConfig):
-     r"""
-     This is the configuration class to store the configuration of a [`~ChatGLMModel`].
-     It is used to instantiate a ChatGLM model according to the specified arguments, defining the model
-     architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
-     the ChatGLM-6B [THUDM/ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) architecture.
-
-     Configuration objects inherit from [`PretrainedConfig`] and can be used
-     to control the model outputs. Read the documentation from [`PretrainedConfig`]
-     for more information.
-
-
-     Args:
-         vocab_size (`int`, *optional*, defaults to 150528):
-             Vocabulary size of the ChatGLM-6B model. Defines the number of different tokens that can be represented by the
-             `inputs_ids` passed when calling [`~ChatGLMModel`] or
-             [`~TFChatGLMModel`].
-         hidden_size (`int`, *optional*, defaults to 4096):
-             Dimension of the encoder layers and the pooler layer.
-         num_layers (`int`, *optional*, defaults to 28):
-             Number of hidden layers in the Transformer encoder.
-         num_attention_heads (`int`, *optional*, defaults to 32):
-             Number of attention heads for each attention layer in the Transformer encoder.
-         inner_hidden_size (`int`, *optional*, defaults to 16384):
-             Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
-         max_sequence_length (`int`, *optional*, defaults to 2048):
-             The maximum sequence length that this model might ever be used with.
-             Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-         layernorm_epsilon (`float`, *optional*, defaults to 1e-5):
-             The epsilon used by the layer normalization layers.
-         use_cache (`bool`, *optional*, defaults to `False`):
-             Whether the model should return the last key/values attentions (not used by all models).
-     Example:
-
-     ```python
-     >>> from configuration_chatglm import ChatGLMConfig
-     >>> from modeling_chatglm import ChatGLMModel
-
-     >>> # Initializing a ChatGLM-6B THUDM/ChatGLM-6B style configuration
-     >>> configuration = ChatGLMConfig()
-
-     >>> # Initializing a model from the THUDM/ChatGLM-6B style configuration
-     >>> model = ChatGLMModel(configuration)
-
-     >>> # Accessing the model configuration
-     >>> configuration = model.config
-     ```
-     """
-     model_type = "chatglm"
-
-     def __init__(
-             self,
-             vocab_size=150528,
-             hidden_size=4096,
-             num_layers=28,
-             num_attention_heads=32,
-             layernorm_epsilon=1e-5,
-             use_cache=False,
-             bos_token_id=150004,
-             eos_token_id=150005,
-             pad_token_id=0,
-             max_sequence_length=2048,
-             inner_hidden_size=16384,
-             position_encoding_2d=True,
-             **kwargs
-     ):
-         self.num_layers = num_layers
-         self.vocab_size = vocab_size
-         self.hidden_size = hidden_size
-         self.num_attention_heads = num_attention_heads
-         self.max_sequence_length = max_sequence_length
-         self.layernorm_epsilon = layernorm_epsilon
-         self.inner_hidden_size = inner_hidden_size
-         self.use_cache = use_cache
-         self.bos_token_id = bos_token_id
-         self.eos_token_id = eos_token_id
-         self.pad_token_id = pad_token_id
-         self.position_encoding_2d = position_encoding_2d
-         super().__init__(
-             pad_token_id=pad_token_id,
-             bos_token_id=bos_token_id,
-             eos_token_id=eos_token_id,
-             **kwargs
-         )
models/ice_text.model DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:99871e0c85db81ad7af1028854fd091cd5778c8414ae9d94bbbc10d02c831c21
- size 2699926
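Note on the deleted `models/ice_text.model`: what the diff shows is a Git LFS pointer, not the tokenizer model itself. The three lines give the pointer spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of reading those fields is shown below. The function `parse_lfs_pointer` is ours and handles only the simple `key value` layout seen here.

```python
from typing import Dict

# Pointer contents copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:99871e0c85db81ad7af1028854fd091cd5778c8414ae9d94bbbc10d02c831c21
size 2699926
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each non-empty 'key value' line of a Git LFS pointer into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
algo, _, digest = pointer["oid"].partition(":")
size_bytes = int(pointer["size"])
print(algo, digest[:12], size_bytes)
```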
models/tokenization_chatglm.py DELETED
@@ -1,346 +0,0 @@
- """Tokenization classes for ChatGLM."""
- import sys
- import unicodedata
- from typing import List, Optional, Union
- from functools import lru_cache
- import os
- import collections
- import re
-
- from transformers.tokenization_utils import PreTrainedTokenizer
- from icetk.text_tokenizer import TextTokenizer
- from icetk.utils import auto_create
- import icetk.sentencepiece_model_pb2 as sp_model
- from transformers.utils import logging
-
- logger = logging.get_logger(__name__)
-
- PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-     "THUDM/chatglm-6b": 2048,
- }
-
-
- class SPTokenizer:
-     def __init__(
-         self,
-         vocab_file,
-         max_blank_length=80,
-         byte_fallback=True,
-     ):
-         assert vocab_file is not None
-         self.vocab_file = vocab_file
-         self.special_tokens = ["[MASK]", "[gMASK]", "[sMASK]", "<unused_0>", "<sop>", "<eop>", "<ENC>", "<dBLOCK>"]
-         self.max_blank_length = max_blank_length
-         self.byte_fallback = byte_fallback
-         self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
-         self.special_text_tokenizer = self._build_text_tokenizer(encode_special_tokens=True)
-
-     @staticmethod
-     def _configure_tokenizer(
-         text_tokenizer: TextTokenizer,
-         special_tokens: List[str],
-         max_blank_length: int,
-         byte_fallback: bool,
-         encode_special_tokens=False,
-     ):
-         # special token
-         special_token_type = 4 if encode_special_tokens else 3  # 3 - CONTROL, 4 - USER_DEFINE
-         for token in special_tokens:
-             text_tokenizer.proto.pieces.append(
-                 sp_model.ModelProto.SentencePiece(piece=token, score=0.0, type=special_token_type)
-             )
-         # whitespaces
-         for token in [SPTokenizer.get_tab_token()] + [
-             SPTokenizer.get_blank_token(i) for i in range(2, max_blank_length + 1)
-         ]:
-             text_tokenizer.proto.pieces.append(sp_model.ModelProto.SentencePiece(piece=token, score=0.0, type=4))
-         # byte fallback
-         if byte_fallback:
-             text_tokenizer.proto.trainer_spec.byte_fallback = True
-             for i in range(256):
-                 text_tokenizer.proto.pieces.append(
-                     sp_model.ModelProto.SentencePiece(piece="<0x{:02X}>".format(i), score=0.0, type=6)
-                 )
-         text_tokenizer.refresh()
-
-     def _build_text_tokenizer(self, encode_special_tokens=False):
-         tokenizer = TextTokenizer(self.vocab_file)
-         self._configure_tokenizer(
-             tokenizer, self.special_tokens, self.max_blank_length, self.byte_fallback, encode_special_tokens
-         )
-         return tokenizer
-
-     def _get_text_tokenizer(self, encode_special_tokens=False):
-         if encode_special_tokens:
-             return self.special_text_tokenizer
-         else:
-             return self.text_tokenizer
-
-     @staticmethod
-     def get_blank_token(length: int):
-         assert length >= 2
-         return f"<|blank_{length}|>"
-
-     @staticmethod
-     def get_tab_token():
-         return "<|tab|>"
-
-     @property
-     def num_image_tokens(self):
-         return 20000
-
-     @property
-     def num_text_tokens(self):
-         return self.text_tokenizer.num_tokens
-
-     @property
-     def num_tokens(self):
-         return self.num_image_tokens + self.num_text_tokens
-
-     @staticmethod
-     def _encode_whitespaces(text: str, max_len: int = 80):
-         text = text.replace("\t", SPTokenizer.get_tab_token())
-         for i in range(max_len, 1, -1):
-             text = text.replace(" " * i, SPTokenizer.get_blank_token(i))
-         return text
-
-     def _preprocess(self, text: str, linebreak=True, whitespaces=True):
-         if linebreak:
-             text = text.replace("\n", "<n>")
-         if whitespaces:
-             text = self._encode_whitespaces(text, max_len=self.max_blank_length)
-         return text
-
-     def encode(
-         self, text: str, linebreak=True, whitespaces=True, special_tokens=False, add_dummy_prefix=True
-     ) -> List[int]:
-         """
-         @param text: Text to encode.
-         @param linebreak: Whether to encode newline (\n) in text.
-         @param whitespaces: Whether to encode multiple whitespaces or tab in text, useful for source code encoding.
-         @param special_tokens: Whether to encode special token ([MASK], [gMASK], etc.) in text.
-         @param add_dummy_prefix: Whether to add dummy blank space in the beginning.
-         """
-         text = self._preprocess(text, linebreak, whitespaces)
-         if not add_dummy_prefix:
-             text = "<n>" + text
-         tmp = self._get_text_tokenizer(encode_special_tokens=special_tokens).encode(text)
-         tokens = [x + self.num_image_tokens for x in tmp]
-         return tokens if add_dummy_prefix else tokens[2:]
-
-     def decode(self, text_ids: List[int], special_tokens=False) -> str:
-         ids = [int(_id) - self.num_image_tokens for _id in text_ids]
-         ids = [_id for _id in ids if _id >= 0]
-         text = self._get_text_tokenizer(encode_special_tokens=special_tokens).decode(ids)
-         text = text.replace("<n>", "\n")
-         text = text.replace(SPTokenizer.get_tab_token(), "\t")
-         for i in range(2, self.max_blank_length + 1):
-             text = text.replace(self.get_blank_token(i), " " * i)
-         return text
-
-     def tokenize(
-         self, text: str, linebreak=True, whitespaces=True, special_tokens=False, add_dummy_prefix=True
-     ) -> List[str]:
-         """
-         @param text: Text to encode.
-         @param linebreak: Whether to encode newline (\n) in text.
-         @param whitespaces: Whether to encode multiple whitespaces or tab in text, useful for source code encoding.
-         @param special_tokens: Whether to encode special token ([MASK], [gMASK], etc.) in text.
-         @param add_dummy_prefix: Whether to add dummy blank space in the beginning.
-         """
-         text = self._preprocess(text, linebreak, whitespaces)
-         if not add_dummy_prefix:
-             text = "<n>" + text
-         tokens = self._get_text_tokenizer(encode_special_tokens=special_tokens).tokenize(text)
-         return tokens if add_dummy_prefix else tokens[2:]
-
-     def __getitem__(self, x: Union[int, str]):
-         if isinstance(x, int):
-             if x < self.num_image_tokens:
-                 return "<image_{}>".format(x)
-             else:
-                 return self.text_tokenizer.convert_id_to_token(x - self.num_image_tokens)
-         elif isinstance(x, str):
-             if x.startswith("<image_") and x.endswith(">") and x[7:-1].isdigit():
-                 return int(x[7:-1])
-             else:
-                 return self.text_tokenizer.convert_token_to_id(x) + self.num_image_tokens
-         else:
-             raise ValueError("The key should be str or int.")
-
-
- class ChatGLMTokenizer(PreTrainedTokenizer):
-     """
-     Construct a ChatGLM tokenizer. Based on byte-level Byte-Pair-Encoding.
-
-     Args:
-         vocab_file (`str`):
-             Path to the vocabulary file.
-     """
-
-     vocab_files_names = {"vocab_file": "ice_text.model"}
-     max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-     model_input_names = ["input_ids"]
-
-     def __init__(
-         self,
-         vocab_file,
-         do_lower_case=False,
-         remove_space=False,
-         bos_token='sop',
-         eos_token='eos',
-         eop_token='eop',
-         mask_token='[MASK]',
-         gmask_token='[gMASK]',
-         padding_side="left",
-         **kwargs
-     ) -> None:
-         super().__init__(
-             do_lower_case=do_lower_case,
-             remove_space=remove_space,
-             padding_side=padding_side,
-             **kwargs
-         )
-
-         self.do_lower_case = do_lower_case
-         self.remove_space = remove_space
-         self.vocab_file = vocab_file
-
-         self.bos_token = bos_token
-         self.eos_token = eos_token
-         self.eop_token = eop_token
-         self.mask_token = mask_token
-         self.gMASK_token = gmask_token
-
-         self.sp_tokenizer = SPTokenizer(vocab_file)
-
-         """ Initialisation """
-
-     @property
-     def eop_token_id(self) -> Optional[int]:
-         """
-         `Optional[int]`: Id of the end of sentence token in the vocabulary. Returns `None` if the token has not been
-         set.
-         """
-         if self.eop_token is None:
-             return None
-         return self.convert_tokens_to_ids(self.eop_token)
-
-     @property
-     def vocab_size(self):
-         """ Returns vocab size """
-         return self.sp_tokenizer.num_tokens
-
-     def get_vocab(self):
-         """ Returns vocab as a dict """
-         vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
-         vocab.update(self.added_tokens_encoder)
-         return vocab
-
-     def preprocess_text(self, inputs):
-         if self.remove_space:
-             outputs = " ".join(inputs.strip().split())
-         else:
-             outputs = inputs
-
-         if self.do_lower_case:
-             outputs = outputs.lower()
-
-         return outputs
-
-     def _tokenize(self, text, **kwargs):
-         """ Returns a tokenized string. """
-         text = self.preprocess_text(text)
-
-         seq = self.sp_tokenizer.tokenize(text)
-
-         return seq
-
-     def decode(
-         self,
-         token_ids: Union[List[int], List[List[int]]],
-         skip_special_tokens: bool = False,
-         clean_up_tokenization_spaces: bool = True,
-         spaces_between_special_tokens: bool = True,
-         **kwargs
-     ) -> str:
-         if isinstance(token_ids[0], list):
-             tokens = []
-             for single_token_ids in token_ids:
-                 if self.pad_token_id in single_token_ids:  # remove pad
-                     single_token_ids = list(filter((self.pad_token_id).__ne__, single_token_ids))
-                 tokens.append(self.sp_tokenizer.decode(single_token_ids))
-             return (tokens)
-         else:
-             if self.pad_token_id in token_ids:  # remove pad
-                 token_ids = list(filter((self.pad_token_id).__ne__, token_ids))
-             return self.sp_tokenizer.decode(token_ids)
-
-     def _convert_token_to_id(self, token):
-         """ Converts a token (str) in an id using the vocab. """
-         return self.sp_tokenizer[token]
-
-     def _convert_id_to_token(self, index):
-         """Converts an index (integer) in a token (str) using the vocab."""
-         return self.sp_tokenizer[index]
-
-     def save_vocabulary(self, save_directory, filename_prefix=None):
-         """
-         Save the vocabulary and special tokens file to a directory.
-
-         Args:
-             save_directory (`str`):
-                 The directory in which to save the vocabulary.
-             filename_prefix (`str`, *optional*):
-                 An optional prefix to add to the names of the saved files.
-
-         Returns:
-             `Tuple(str)`: Paths to the files saved.
-         """
-         if os.path.isdir(save_directory):
-             vocab_file = os.path.join(
-                 save_directory, self.vocab_files_names["vocab_file"]
-             )
-         else:
-             vocab_file = save_directory
-
-         with open(self.vocab_file, 'rb') as fin:
-             proto_str = fin.read()
-
-         with open(vocab_file, "wb") as writer:
-             writer.write(proto_str)
-
-         return (vocab_file,)
-
-     def build_inputs_with_special_tokens(
-         self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
-     ) -> List[int]:
-         """
-         Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-         adding special tokens. When the input contains no mask token, a ChatGLM sequence has the following format:
-
-         - single sequence: `X [gMASK] <sop>`
-         - pair of sequences: `A B [gMASK] <sop>`
-
-         Args:
-             token_ids_0 (`List[int]`):
-                 List of IDs to which the special tokens will be added.
-             token_ids_1 (`List[int]`, *optional*):
-                 Optional second list of IDs for sequence pairs.
-
-         Returns:
-             `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
-         """
-         if token_ids_1 is not None:
-             token_ids_0 += token_ids_1
-         mask_ids = self.sp_tokenizer[self.mask_token]
-         gmask_ids = self.sp_tokenizer[self.gMASK_token]
-         if mask_ids not in token_ids_0 and gmask_ids not in token_ids_0:
-             token_ids_0 += [gmask_ids]
-
-         if token_ids_0[-1] != mask_ids and token_ids_0[-1] != gmask_ids:
-             token_ids_0 += [self.sp_tokenizer[self.eos_token]]
-
-         token_ids_0 += [self.sp_tokenizer[self.bos_token]]
-
-         return token_ids_0
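Note on the deleted `models/tokenization_chatglm.py`: before text reaches SentencePiece, `SPTokenizer` rewrites newlines as `<n>`, tabs as `<|tab|>`, and runs of 2 to 80 spaces as `<|blank_N|>` tokens, then reverses the mapping in `decode`. The sketch below reproduces only that string rewriting so the round trip can be tried without `icetk`. The standalone function names are ours.

```python
MAX_BLANK_LENGTH = 80  # default max_blank_length in SPTokenizer


def blank_token(length: int) -> str:
    """Token that stands for a run of `length` spaces (length >= 2)."""
    assert length >= 2
    return f"<|blank_{length}|>"


def encode_whitespaces(text: str, max_len: int = MAX_BLANK_LENGTH) -> str:
    text = text.replace("\t", "<|tab|>")
    # Longest runs first, so 4 spaces become one <|blank_4|>, not two <|blank_2|>.
    for i in range(max_len, 1, -1):
        text = text.replace(" " * i, blank_token(i))
    return text


def decode_whitespaces(text: str, max_len: int = MAX_BLANK_LENGTH) -> str:
    text = text.replace("<|tab|>", "\t")
    for i in range(2, max_len + 1):
        text = text.replace(blank_token(i), " " * i)
    return text


source = "def f():\n    return\t1"
encoded = encode_whitespaces(source.replace("\n", "<n>"))
decoded = decode_whitespaces(encoded).replace("<n>", "\n")
print(encoded)
```

Single spaces are left untouched, which is why the space in `def f()` survives encoding unchanged.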
models/tokenizer_config.json DELETED
@@ -1,19 +0,0 @@
- {
-   "name_or_path": "THUDM/chatglm-6b",
-   "bos_token": "<sop>",
-   "eop_token": "<eop>",
-   "eos_token": "</s>",
-   "gmask_token": "[gMASK]",
-   "mask_token": "[MASK]",
-   "pad_token": "<pad>",
-   "unk_token": "<unk>",
-   "remove_space": false,
-   "do_lower_case": false,
-   "tokenizer_class": "ChatGLMTokenizer",
-   "auto_map": {
-     "AutoTokenizer": [
-       "tokenization_chatglm.ChatGLMTokenizer",
-       null
-     ]
-   }
- }
requirements.txt DELETED
@@ -1,4 +0,0 @@
- icetk
- torch
- transformers
-