Model Card for lyraLLMs

Introduction

We have released lyraLLMs, a highly optimized and easy-to-use inference engine for LLMs.

lyraLLMs runs on the following NVIDIA GPU architectures:

  • Volta (V100)
  • Turing (T4)
  • Ampere (A100/A10)
  • Ada Lovelace (RTX 4090, etc.)
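
One way to check whether a GPU falls into the architectures above is to look at its CUDA compute capability. The sketch below is only an illustration and is not part of lyraLLMs: the helper name `architecture_for` is hypothetical, and the table covers just the example GPUs listed above (7.0 for V100, 7.5 for T4, 8.0 and 8.6 for A100 and A10, 8.9 for RTX 4090).

```python
# Hypothetical helper, not part of the lyraLLMs API.
# Maps a CUDA compute capability (major, minor) to the architecture families listed above.
SUPPORTED_ARCHITECTURES = {
    (7, 0): "Volta",         # V100
    (7, 5): "Turing",        # T4
    (8, 0): "Ampere",        # A100
    (8, 6): "Ampere",        # A10
    (8, 9): "Ada Lovelace",  # RTX 4090
}


def architecture_for(capability):
    """Return the architecture name for a (major, minor) pair, or None if it is not listed."""
    return SUPPORTED_ARCHITECTURES.get(tuple(capability))
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns such a `(major, minor)` tuple for the current device, so `architecture_for(torch.cuda.get_device_capability())` would report the family, or `None` for a GPU outside this list.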

lyraLLMs supports many popular HuggingFace models; the converted models released so far are listed under Convert Models.

lyraLLMs is fast, memory-efficient, and easy to use, with:

  • State-of-the-art throughput (up to 7K tokens/s for LLaMA 13B)
  • Efficient memory usage of attention with FlashAttention2
  • Quantization: MEMOPT mode (W8A16, W4A16), KVCache Int8
  • Easy-to-use Python API to serve LLMs
  • Streaming outputs
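
As a rough illustration of why weight quantization saves memory, the sketch below estimates the weight footprint of a 13B-parameter model at 16, 8, and 4 bits per weight. It assumes that W8A16 and W4A16 denote 8-bit and 4-bit weights with 16-bit activations, which is the usual reading of this notation. It ignores quantization scales, activations, and the KV cache, so real memory usage will be higher. The function name is illustrative.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumed bit widths: FP16 = 16, W8A16 = 8, W4A16 = 4 bits per weight.
for label, bits in [("FP16", 16), ("W8A16", 8), ("W4A16", 4)]:
    print(f"{label}: {weight_memory_gib(13e9, bits):.1f} GiB")
```

Under these assumptions the weights shrink roughly in proportion to the bit width, which leaves more VRAM for larger batches.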

If you like our work and are interested in joining us, feel free to drop us a line at benbinwu@tencent.com

Speed

Settings

  • Throughput is measured in tokens/s (input + output tokens)
  • Tested on an A100 40G with CUDA 12.0
  • MEMOPT mode and KVCache Int8 are enabled
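
The metric above counts both prompt and generated tokens. A minimal sketch of that calculation follows. The function name and arguments are illustrative, it assumes every sequence in the batch has the same input and output length, and the exact token counting behind the published numbers is not specified in this card.

```python
def throughput_tokens_per_s(batch_size, input_tokens, output_tokens, elapsed_s):
    """Tokens/s, counting input + output tokens across the whole batch."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return batch_size * (input_tokens + output_tokens) / elapsed_s
```

For example, with made-up numbers, a batch of 64 prompts of 20 tokens each, generating 150 tokens each in 2.5 seconds, gives `throughput_tokens_per_s(64, 20, 150, 2.5)`, that is 64 × 170 / 2.5 tokens/s.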

Throughputs

XVERSE-13B-Chat

Input

北京的景点:故宫、天坛、万里长城等。\n深圳的景点:

Version       Batch Size 1   Batch Size 64   Batch Size 128   Batch Size 256   Batch Size 512
Torch 2.1.0   52.9           2308.1          OOM
lyraXVERSE    200.4          4624.8          5759.7           6075.6           5733

Baichuan2-7B-Base

Input

北京的景点:登鹳雀楼->王之涣\n夜雨寄北->

Version        Batch Size 1   Batch Size 8   Batch Size 16   Batch Size 32   Batch Size 64
Torch 2.0.1    41.2           323.2          640.0           1256.8          2231.0
lyraBaichuan   125.9          948.1          1749.3          2974.0          4370.1
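
To read the Baichuan2-7B-Base figures above as speedups, the short sketch below divides the lyraBaichuan throughput by the Torch throughput at each batch size. The numbers are copied from the table; the variable names are illustrative.

```python
# Throughput figures (tokens/s) copied from the Baichuan2-7B-Base table above.
batch_sizes = [1, 8, 16, 32, 64]
torch_tps = [41.2, 323.2, 640.0, 1256.8, 2231.0]
lyra_tps = [125.9, 948.1, 1749.3, 2974.0, 4370.1]

# Speedup of lyraBaichuan over Torch at each batch size.
speedups = {
    bs: lyra / base
    for bs, base, lyra in zip(batch_sizes, torch_tps, lyra_tps)
}

for bs, ratio in speedups.items():
    print(f"batch size {bs:>2}: {ratio:.2f}x")
```

On these figures the ratio goes from about 3.1x at batch size 1 to about 2.0x at batch size 64.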

Baichuan2-13B-Base

Input

北京的景点:登鹳雀楼->王之涣\n夜雨寄北->

Version        Batch Size 1   Batch Size 8   Batch Size 16   Batch Size 32   Batch Size 64
Torch 2.0.1    40.9           307.9          555.6           1010.4          1601.0
lyraBaichuan   80.0           568.2          1124.4          1942.6          2828.0

Yi-6B

Input

# write the quick sort algorithm

Version       Batch Size 1   Batch Size 8   Batch Size 16   Batch Size 32   Batch Size 64
Torch 2.1.0   31.4           247.5          490.4           987.2           1796.3
lyraLLaMA     93.8           735.6          2339.8          3020.9          4630.8

Yi-34B

Due to VRAM limitations, we could not profile the throughput of Yi-34B on the A100 40G using Torch.

Input

Let me tell you an interesting story about cat Tom and mouse Jerry,

Version     Batch Size 1   Batch Size 8   Batch Size 16   Batch Size 32   Batch Size 64
lyraLLaMA   52.5           399.4          753.0           1138.2          1926.2

Usage

Environment (Docker recommended)

  • For CUDA 11.X: we recommend nvcr.io/nvidia/pytorch:22.12-py3
  • For CUDA 12.0: we recommend nvcr.io/nvidia/pytorch:23.02-py3
docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v ./:/lyraLLMs nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt
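
The image recommendations above can be written as a small lookup. This is a sketch only: the helper name `recommended_image` is hypothetical, and it returns `None` for any CUDA version the card does not mention rather than guessing.

```python
# Hypothetical helper; the image tags are the ones recommended above.
def recommended_image(cuda_version):
    """Return the recommended NGC image for a CUDA version string such as '11.8', else None."""
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if major == 11:
        return "nvcr.io/nvidia/pytorch:22.12-py3"
    if (major, minor) == (12, 0):
        return "nvcr.io/nvidia/pytorch:23.02-py3"
    return None  # no recommendation is given for other versions
```

In an environment with PyTorch, `torch.version.cuda` reports the CUDA version PyTorch was built with as a string (or `None` for CPU-only builds), which could be passed to this helper.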

Convert Models

We have released multiple optimized models converted from original HuggingFace ones:

  • ChatGLM-6B
  • XVERSE-13B-Chat
  • LLaMA-Ziya-13B
  • Baichuan-7B, Baichuan-13B-Base, Baichuan-13B-Chat, Baichuan2-7B-Base, Baichuan2-7B-Chat, Baichuan2-13B-Base, and Baichuan2-13B-Chat
  • Yi-6B, Yi-34B

Feel free to contact us if you would like to convert a fine-tuned version of an LLM.

Inference

Refer to README.md for inference of converted models with lyraLLMs.

Python Demo

from lyra_llama import lyraLlama

model_path = 'XXX' # directory containing the converted model weights, config, and tokenizer files
data_type = 'fp16'
memopt_mode = 0 # set memopt_mode=1 to run inference in MEMOPT mode

model = lyraLlama(model_path, data_type, memopt_mode)

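# Prompt (Chinese): "List 3 different machine learning algorithms and explain where each is applicable."
prompts = '列出3个不同的机器学习算法,并说明它们的适用范围.'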
prompts = [prompts,] * 64

output_texts = model.generate(prompts, output_length=150, do_sample=False, top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0)
print(output_texts)
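
When the prompt list is larger than the batch size you want to run, one option is to split it into fixed-size batches and time the whole run. The helper below is a sketch and not part of lyraLLMs: `generate_in_batches` is a hypothetical name, and it assumes the `generate` callable takes a list of prompts plus keyword arguments and returns one output per prompt, as the demo above suggests.

```python
import time


def generate_in_batches(generate_fn, prompts, batch_size, **generate_kwargs):
    """Call generate_fn on consecutive slices of prompts; return (outputs, elapsed seconds)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    outputs = []
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # Assumption: generate_fn returns one output per prompt, in order.
        outputs.extend(generate_fn(batch, **generate_kwargs))
    elapsed = time.perf_counter() - start
    return outputs, elapsed
```

With the demo above, a call might look like `generate_in_batches(model.generate, prompts, 32, output_length=150, do_sample=False)`.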

Citation

@Misc{lyraLLMs2024,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Haoxiong Su and Bin Wu},
  title =        {lyraLLMs: A highly optimized and easy-to-use inference engine for LLMs},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLMs}},
  year =         {2024}
}

Report bug
