---
license: creativeml-openrail-m
language:
- en
tags:
- LLM
- tensorRT
- ChatGLM
---
## Model Card for lyraChatGLM

lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.

lyraChatGLM achieves a **10x** inference speedup over the early original version. We are still working to improve performance further.

Among its main features are:

- weights: original ChatGLM-6B weights released by THUDM.
- device: lyraChatGLM is built mainly on TensorRT and compiled for SM=80 (for example, the A100).
- batch_size: compiled with dynamic batch size; max batch_size = 8.

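Because the engine is compiled with a maximum batch size of 8, a larger set of prompts has to be split into groups of at most 8 before inference. The following is a minimal sketch of that splitting in plain Python. `chunk_prompts` and `MAX_BATCH_SIZE` are illustrative names and are not part of the lyraChatGLM package.

```python
from typing import Iterator, List

# Maximum batch size stated in the feature list above; the constant name is illustrative.
MAX_BATCH_SIZE = 8


def chunk_prompts(prompts: List[str],
                  max_batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `max_batch_size` prompts."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        yield prompts[start:start + max_batch_size]


# 19 prompts are split into two full batches and one partial batch.
batches = list(chunk_prompts([f"prompt {i}" for i in range(19)]))
batch_sizes = [len(batch) for batch in batches]
print(batch_sizes)
```

Each group could then be tokenized with `padding=True`, as in the usage example, and passed to the model one batch at a time.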
## Speed

### test environment

- device: Nvidia A100 40G
- batch size: 8

**Early ChatGLM versions did not support batch inference, so `original` in the table below was measured with batch_size=1.**


**According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), this bug has since been fixed, and the speed at batch_size=8 reaches up to 137 tokens/s. We will evaluate and update the latest performance.**

|version|speed|
|:-:|:-:|
|original|30 tokens/s|
|lyraChatGLM|310 tokens/s|
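
The speedup implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section. It reads the 137 tokens/s figure as the original model at batch_size=8, which is an interpretation of the note above and has not yet been re-measured by the authors.

```python
# Throughput figures quoted in this section, in tokens/s.
ORIGINAL_BS1 = 30   # original ChatGLM-6B, measured at batch_size=1
REPORTED_BS8 = 137  # figure from the linked discussion at batch_size=8 (not re-measured)
LYRA_BS8 = 310      # lyraChatGLM, measured at batch_size=8


def speedup(fast: float, slow: float) -> float:
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


speedup_vs_bs1 = round(speedup(LYRA_BS8, ORIGINAL_BS1), 2)
speedup_vs_bs8 = round(speedup(LYRA_BS8, REPORTED_BS8), 2)
print(speedup_vs_bs1, speedup_vs_bs8)  # 10.33 2.26
```

The roughly 10x figure therefore compares batch_size=8 against batch_size=1; against the reported batch_size=8 number, the ratio is closer to 2.3x.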


## Model Sources

- **Repository:** [https://huggingface.co/THUDM/chatglm-6b](https://huggingface.co/THUDM/chatglm-6b)

## Try the demo in 2 quick steps

``` bash
#step 1
git clone https://huggingface.co/TMElyralab/lyraChatGLM
cd lyraChatGLM

#step 2
docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
```

## Uses

```python
from transformers import AutoTokenizer
from lyraChatGLM import GLM6B, FasterChatGLM
import os

current_workdir = os.path.dirname(__file__)

MAX_OUT_LEN = 100
chatglm6b_dir = os.path.join(current_workdir, "models")
tokenizer = AutoTokenizer.from_pretrained(chatglm6b_dir, trust_remote_code=True)
input_str = ["为什么我们需要对深度学习模型加速?", ]
inputs = tokenizer(input_str, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to('cuda:0')

plan_path = os.path.join(current_workdir, "models/glm6b-bs8.ftm")

# kernel for chat model.
kernel = GLM6B(plan_path=plan_path,
               batch_size=1,
               num_beams=1,
               use_cache=True,
               num_heads=32,
               emb_size_per_heads=128,
               decoder_layers=28,
               vocab_size=150528,
               max_seq_len=MAX_OUT_LEN)

chat = FasterChatGLM(model_dir=chatglm6b_dir, kernel=kernel).half().cuda()

# generate
sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)
# de-tokenize model output to text
res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
print(res)
```
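
To reproduce a tokens/s figure like those in the Speed section, one can time a single generation call and divide the number of newly generated tokens by the elapsed time. The following is a rough sketch, not the benchmark script used by the authors. `measure_tokens_per_second` and `fake_generate` are hypothetical names, and the sketch assumes that each output sequence includes the prompt tokens (the demo output does begin with the input question).

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_second(generate: Callable[[], Sequence[Sequence[int]]],
                              prompt_lengths: Sequence[int]) -> float:
    """Time one call to `generate` and return newly generated tokens per second.

    Assumes `generate` returns one token-id sequence per prompt and that each
    sequence still contains its prompt tokens.
    """
    start = time.perf_counter()
    outputs = generate()
    elapsed = time.perf_counter() - start
    new_tokens = sum(len(out) - n for out, n in zip(outputs, prompt_lengths))
    return new_tokens / elapsed


# Stand-in generator so the sketch runs without a GPU. With the real model it
# could be replaced by a callable wrapping chat.generate(...), assuming the
# result can be converted to lists of token ids.
def fake_generate():
    time.sleep(0.01)
    return [[0] * 100, [0] * 100]


rate = measure_tokens_per_second(fake_generate, prompt_lengths=[10, 20])
print(f"{rate:.1f} tokens/s")
```

On a GPU, work may still be in flight when the call returns, so a real measurement would typically synchronize the device before reading the clock and average over several runs.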
## Demo output

### input
为什么我们需要对深度学习模型加速?

(English translation: Why do we need to accelerate deep learning models?)

### output
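为什么我们需要对深度学习模型加速? 深度学习模型的训练需要大量计算资源,特别是在训练模型时,需要大量的内存、GPU(图形处理器)和其他计算资源。因此,训练深度学习模型需要一定的时间,并且如果模型不能快速训练,则可能会导致训练进度缓慢或无法训练。

以下是一些原因我们需要对深度学习模型加速:

1. 训练深度神经网络需要大量的计算资源,特别是在训练深度神经网络时,需要更多的计算资源,因此需要更快的训练速度。

**English translation of the output:**

Why do we need to accelerate deep learning models? Training deep learning models requires substantial computing resources; training in particular needs large amounts of memory, GPUs (graphics processing units), and other computing resources. Training a deep learning model therefore takes time, and if the model cannot be trained quickly, training may progress slowly or become infeasible.

Here are some reasons why we need to accelerate deep learning models:

1. Training deep neural networks requires substantial computing resources, and training deep neural networks in particular requires even more, so faster training is needed.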

### TODO:

We plan to implement a FasterTransformer version to publish a much faster release. Stay tuned!

## Citation
``` bibtex
@Misc{lyraChatGLM2023,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraChatGLM: Accelerating ChatGLM by 10x+},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
  year =         {2023}
}
```

## Report bug
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark bug reports with `[bug]` in the title.