---
license: creativeml-openrail-m
language:
- en
tags:
- LLM
- tensorRT
- ChatGLM
---
## Model Card for lyraChatGLM
lyraChatGLM is currently the **fastest ChatGLM-6B** available. To the best of our knowledge, it is the **first accelerated version of ChatGLM-6B**.
Inference with lyraChatGLM is about **10x** faster than with the early original version. We are still working to improve performance further.
Among its main features are:
- weights: original ChatGLM-6B weights released by THUDM.
- device: lyraChatGLM is mainly based on TensorRT compiled for SM=80 (A100, for example).
- batch_size: compiled with dynamic batch size, max batch_size = 8
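
Because the engine is compiled with a maximum batch size of 8, a larger list of prompts has to be split before it is sent to the model. The following is a minimal sketch of that splitting step in plain Python; `chunk_prompts` and `MAX_BATCH_SIZE` are hypothetical names used for illustration and are not part of the lyraChatGLM package.

```python
from typing import Iterator, List

# Upper bound taken from the feature list above (max batch_size = 8).
MAX_BATCH_SIZE = 8


def chunk_prompts(prompts: List[str],
                  max_batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `prompts`, each at most `max_batch_size` long."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(prompts), max_batch_size):
        yield prompts[start:start + max_batch_size]


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(19)]
    for batch in chunk_prompts(prompts):
        # Each batch would be tokenized and passed to the model separately.
        print(len(batch))
```

With 19 prompts this prints batch sizes of 8, 8 and 3, so no call exceeds the compiled limit.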
## Speed
### test environment
- device: Nvidia A100 40G
- batch size: 8
**Because the early ChatGLM version did not support batch inference, the `original` figure in the table below was measured at batch_size=1.**
**According to [this discussion](https://huggingface.co/TMElyralab/lyraChatGLM/discussions/6), this bug has been fixed and the speed at batch_size=8 reaches up to 137 tokens/s. We will evaluate and update the latest performance.**
|version|speed|
|:-:|:-:|
|original|30 tokens/s|
|lyraChatGLM|310 tokens/s|
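
The speedup implied by these figures is a simple ratio. The sketch below works it out for the two baselines mentioned in this section. It reads the 137 tokens/s figure from the linked discussion as the original model's throughput at batch_size=8, which is an interpretation rather than a measured result; the constant and function names are illustrative only.

```python
# Throughput figures quoted in this section, in tokens per second.
ORIGINAL_BS1 = 30.0            # original ChatGLM-6B, batch_size=1 (table above)
ORIGINAL_BS8_REPORTED = 137.0  # assumption: original at batch_size=8, per the linked discussion
LYRA_BS8 = 310.0               # lyraChatGLM, batch_size=8 (table above)


def speedup(accelerated: float, baseline: float) -> float:
    """Return how many times faster `accelerated` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return accelerated / baseline


print(f"vs. batch_size=1 baseline: {speedup(LYRA_BS8, ORIGINAL_BS1):.1f}x")           # 10.3x
print(f"vs. batch_size=8 baseline: {speedup(LYRA_BS8, ORIGINAL_BS8_REPORTED):.1f}x")  # 2.3x
```

The 10x headline figure therefore compares a batched run against an unbatched baseline; against the reported batched baseline the ratio is closer to 2.3x.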
## Model Sources
- **Repository:** https://huggingface.co/THUDM/chatglm-6b
## Try Demo in 2 fast steps
``` bash
#step 1
git clone https://huggingface.co/TMElyralab/lyraChatGLM
cd lyraChatGLM
#step 2
docker run --gpus=1 --rm --net=host -v ${PWD}:/workdir yibolu96/lyra-chatglm-env:0.0.1 python3 /workdir/demo.py
```
## Uses
```python
import os

from transformers import AutoTokenizer
from lyraChatGLM import GLM6B, FasterChatGLM

# Resolve paths relative to this script (requires running as a file, not in a REPL).
current_workdir = os.path.dirname(__file__)

MAX_OUT_LEN = 100
chatglm6b_dir = os.path.join(current_workdir, "models")
tokenizer = AutoTokenizer.from_pretrained(chatglm6b_dir, trust_remote_code=True)

# Prompt: "Why do we need to accelerate deep learning models?"
input_str = ["为什么我们需要对深度学习模型加速?", ]
inputs = tokenizer(input_str, return_tensors="pt", padding=True)
input_ids = inputs.input_ids.to('cuda:0')

plan_path = os.path.join(current_workdir, "models/glm6b-bs8.ftm")

# kernel for chat model.
kernel = GLM6B(plan_path=plan_path,
               batch_size=1,
               num_beams=1,
               use_cache=True,
               num_heads=32,
               emb_size_per_heads=128,
               decoder_layers=28,
               vocab_size=150528,
               max_seq_len=MAX_OUT_LEN)

chat = FasterChatGLM(model_dir=chatglm6b_dir, kernel=kernel).half().cuda()

# generate
sample_output = chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN)

# de-tokenize model output to text
res = tokenizer.decode(sample_output[0], skip_special_tokens=True)
print(res)
```
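
To reproduce a tokens/s figure like the ones in the Speed section, the generation call can be timed. The following is a rough sketch rather than the benchmark actually used for the table: `tokens_per_second` and `count_new_tokens` are hypothetical helpers, and the sketch assumes that the generated sequences include the prompt tokens, which should be verified against real output.

```python
import time
from typing import Callable, Sequence


def count_new_tokens(outputs: Sequence[Sequence[int]], prompt_len: int) -> int:
    """Count tokens beyond the prompt, assuming each output echoes the prompt first."""
    return sum(max(len(seq) - prompt_len, 0) for seq in outputs)


def tokens_per_second(generate_fn: Callable[[], Sequence[Sequence[int]]],
                      prompt_len: int,
                      warmup: int = 1,
                      runs: int = 3) -> float:
    """Average generated tokens per second over `runs` calls to `generate_fn`.

    `generate_fn` must block until generation has finished; on a GPU that
    usually means synchronizing the device before it returns.
    """
    for _ in range(warmup):
        generate_fn()  # warm-up calls are not timed
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += count_new_tokens(generate_fn(), prompt_len)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total_tokens / elapsed


# Possible use with the objects from the example above (not executed here):
# rate = tokens_per_second(
#     lambda: chat.generate(inputs=input_ids, max_length=MAX_OUT_LEN),
#     prompt_len=input_ids.shape[1],
# )
```

Padding tokens in shorter sequences of a batch are counted as generated tokens by this sketch, so it can overestimate throughput for mixed-length batches.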
## Demo output
### input
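Why do we need to accelerate deep learning models?
### output
Why do we need to accelerate deep learning models? Training deep learning models requires substantial computing resources; in particular, training requires large amounts of memory, GPUs (graphics processing units), and other computing resources. Training a deep learning model therefore takes a certain amount of time, and if the model cannot be trained quickly, training may progress slowly or become impossible.
Here are some reasons why we need to accelerate deep learning models:
1. Training deep neural networks requires substantial computing resources, and training deep neural networks in particular requires even more, so faster training is needed.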
### TODO:
We plan to implement a FasterTransformer version to publish a much faster release. Stay tuned!
## Citation
``` bibtex
@Misc{lyraChatGLM2023,
author = {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
title = {lyraChatGLM: Accelerating ChatGLM by 10x+},
howpublished = {\url{https://huggingface.co/TMElyralab/lyraChatGLM}},
year = {2023}
}
```
## Report bug
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraChatGLM/discussions
- Mark bug reports with `[bug]` in the title.