Special Notes

We have released 7 lyraBaichuan models: lyraBaichuan-7B, lyraBaichuan-13B-Base, lyraBaichuan-13B-Chat, lyraBaichuan2-7B-Base, lyraBaichuan2-7B-Chat, lyraBaichuan2-13B-Base, and lyraBaichuan2-13B-Chat.

These highly optimized Baichuan models are suitable for Nvidia Ampere GPUs (A100/A10) as well as Volta (V100).

If you like our work and are considering joining us, feel free to drop a line at benbinwu@tencent.com.

Model Card for lyraBaichuan

lyraBaichuan is currently the fastest available implementation of the Baichuan models (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B). lyraBaichuan reaches an inference speed of up to 4300+ tokens/s on A100, up to 2.4x faster than the torch version.

Among its main features are:

  • device: Nvidia GPU with Ampere architecture (A100 or higher) or Volta architecture (V100).
  • batch_size: compiled with dynamic batch size; the maximum depends on the device.
  • MEMOPT mode: significantly reduced VRAM usage and increased speed

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.

Speed

  • Evaluated in tokens/s (number of input and output tokens divided by inference time), as sketched below
  • Tested on A100 40GB
  • MEMOPT mode
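
The throughput metric is simply the total number of tokens divided by wall-clock time. A minimal sketch of the computation (all values are illustrative placeholders, not benchmark results):

# Throughput = (input tokens + output tokens) / inference time, in tokens/s.
# All values below are illustrative placeholders, not measured results.
input_tokens = 32          # tokens in each prompt
output_tokens = 64         # tokens generated per prompt
batch_size = 8             # prompts processed in one generate call
elapsed_seconds = 0.81     # wall-clock time of the generate call

tokens_per_second = batch_size * (input_tokens + output_tokens) / elapsed_seconds
print(f"{tokens_per_second:.1f} tokens/s")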

Baichuan2-7B-Base

Version      | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64
Torch 2.0.1  | 41.2         | 323.2        | 640.0         | 1256.8        | 2231.0
lyraBaichuan | 125.9        | 948.1        | 1749.3        | 2974.0        | 4370.1

Baichuan2-13B-Base

Version      | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64
Torch 2.0.1  | 40.9         | 307.9        | 555.6         | 1010.4        | 1601.0
lyraBaichuan | 80.0         | 568.2        | 1124.4        | 1942.6        | 2828.0

Docker Environment Recommendation

  • For CUDA 11.x: we recommend nvcr.io/nvidia/pytorch:22.12-py3
  • For CUDA 12.0: we recommend nvcr.io/nvidia/pytorch:23.02-py3

# Pull the recommended image and start a container with GPU access,
# mounting the current directory into the container at /lyraBaichuan
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container: install the Python dependencies and run the demo
pip install -r requirements.txt
python demo.py
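
Before running the demo, it can help to confirm that the container actually sees the GPU. A minimal check, assuming the PyTorch build shipped in the image:

# Quick sanity check inside the container: verify GPU visibility and CUDA version.
import torch

print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # e.g. an A100/A10/V100 device name
print(torch.version.cuda)               # CUDA version PyTorch was built against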

Uses

from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

# Paths to the converted lyra model weights and the tokenizer
model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1          # 1 enables MEMOPT mode (lower VRAM usage, higher speed)
max_output_length = 64
arch = "Ampere"          # "Ampere" or "Volta"
cuda_version = 12        # CUDA version; we currently support 11 and 12

# To use a 7B model, initialize with lyraBaichuan7B instead
model = lyraBaichuan13B(model_path,
                        tokenizer_path = tokenizer_path,
                        dtype = inference_dtype,
                        memopt_mode = memopt_mode,
                        arch = arch,
                        cuda_version = cuda_version)

# Replicate the prompt to form a batch of size bs
bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
        prompts, output_length=max_output_length,
        top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
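
To reproduce a tokens/s number like those in the Speed section, you can time the generate call and count tokens with a tokenizer. A rough sketch, assuming the tokenizer files in tokenizer_path can be loaded via transformers with trust_remote_code (adjust to your setup, and subtract the prompt tokens if generate() returns them as part of the output):

import time
from transformers import AutoTokenizer

# Assumption: the tokenizer files in tokenizer_path are loadable this way.
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, trust_remote_code=True)

start = time.perf_counter()
output_texts = model.generate(
        prompts, output_length=max_output_length,
        top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)
elapsed = time.perf_counter() - start

# Count prompt and generated tokens across the whole batch.
total_tokens = sum(len(tokenizer.encode(p)) for p in prompts) \
             + sum(len(tokenizer.encode(o)) for o in output_texts)
print(f"{total_tokens / elapsed:.1f} tokens/s")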

Demo Outputs

Baichuan2-13B-Base

input

登鹳雀楼->王之涣

夜雨寄北->

output

李商隐

望洞庭->刘禹锡

黄鹤楼送孟浩然之广陵->李白

登岳阳楼->杜甫

秋词->刘禹锡

枫桥夜泊->张继

饮湖上初晴后雨->苏轼

浪淘沙->刘禹锡

TODO

  1. Support for int4 quantization
  2. Inference for longer contexts
  3. Streaming inference mode

Citation

@Misc{lyraBaichuan2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
  year =         {2023}
}
