
xFasterTransformer Inference Framework

FastChat integrates the xFasterTransformer framework to provide faster inference on Intel CPUs.

Install xFasterTransformer

Set up the environment (please refer to this link for more details):

pip install xfastertransformer
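After installing, it can help to confirm that the package is importable from the Python interpreter that FastChat will use. The following is a minimal sketch; the function name `check_xft_install` is hypothetical, and the module name `xfastertransformer` is assumed to match the pip package name.

```shell
# Report whether the xfastertransformer Python package can be imported.
# Assumption: the importable module is named "xfastertransformer".
check_xft_install() {
    if python3 -c "import xfastertransformer" >/dev/null 2>&1; then
        echo "installed"
    else
        echo "missing"
    fi
}

check_xft_install
```

If this prints `missing`, check that `pip` and `python3` point to the same environment.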

Prepare models

Prepare Model (please refer to this link for more details):

python ./tools/chatglm_convert.py -i ${HF_DATASET_DIR} -o  ${OUTPUT_DIR}
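The conversion command above relies on `HF_DATASET_DIR` and `OUTPUT_DIR` being set. As a sketch, the wrapper below checks the input directory before calling the converter; the function name `convert_chatglm` is hypothetical, and it assumes the command is run from a directory that contains `tools/chatglm_convert.py`.

```shell
# Convert a Hugging Face ChatGLM checkpoint to the xFasterTransformer format.
# $1: directory with the Hugging Face model, $2: output directory.
convert_chatglm() {
    hf_dir="$1"
    out_dir="$2"
    if [ ! -d "$hf_dir" ]; then
        echo "input directory not found: $hf_dir" >&2
        return 1
    fi
    # Create the output directory if it does not exist yet.
    mkdir -p "$out_dir"
    python ./tools/chatglm_convert.py -i "$hf_dir" -o "$out_dir"
}
```

It would be called as `convert_chatglm "${HF_DATASET_DIR}" "${OUTPUT_DIR}"`, and the resulting output directory is what `--model-path` points to in the commands below.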

Parameters of xFasterTransformer

--enable-xft enables xFasterTransformer in FastChat.

--xft-max-seq-len sets the maximum number of tokens the model can process. This limit includes the input tokens.

--xft-dtype sets the data type xFasterTransformer uses for computation. Supported types are fp32, fp16, int8, bf16, and hybrid types such as bf16_fp16 and bf16_int8. For details on the data types, please refer to this link.
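A mistyped `--xft-dtype` value is easy to catch before launching. The sketch below checks a value against the data types named in this document; the variable `XFT_DTYPES` and the function `is_supported_xft_dtype` are hypothetical helpers, and the list may not be exhaustive for a given xFasterTransformer version.

```shell
# Data types named in this document (possibly not the complete set).
XFT_DTYPES="fp32 fp16 int8 bf16 bf16_fp16 bf16_int8"

# Return 0 if $1 is one of the listed data types, 1 otherwise.
is_supported_xft_dtype() {
    for candidate in $XFT_DTYPES; do
        if [ "$candidate" = "$1" ]; then
            return 0
        fi
    done
    return 1
}
```

For example, `is_supported_xft_dtype bf16_fp16` succeeds, while `is_supported_xft_dtype fp64` returns a non-zero status.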

Chat with the CLI:

# Run inference on all CPUs using float16
python3 -m fastchat.serve.cli \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype fp16

Or use numactl on a multi-socket server for better performance:

# Run inference on NUMA node 0 with data type bf16_fp16 (the first token uses bfloat16, the remaining tokens use float16)
numactl -N 0  --localalloc \
python3 -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
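Before pinning a process with `numactl -N`, it is worth checking how many NUMA nodes the machine exposes. The following is a minimal sketch for Linux; the function name `count_numa_nodes` is hypothetical, and it assumes the kernel lists nodes under `/sys/devices/system/node`.

```shell
# Count the NUMA nodes visible to the Linux kernel.
count_numa_nodes() {
    count=0
    # Each node appears as a directory named nodeN.
    for node_dir in /sys/devices/system/node/node[0-9]*; do
        if [ -d "$node_dir" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

count_numa_nodes
```

`numactl -H` prints the same topology in more detail, including the CPUs and memory attached to each node.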

Or use MPI to run inference across 2 sockets for better performance:

# Run inference on NUMA nodes 0 and 1 with data type bf16_fp16 (the first token uses bfloat16, the remaining tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0  --localalloc \
python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
-n 1 numactl -N 1  --localalloc \
python -m fastchat.serve.cli \
    --model-path /path/to/models/chatglm2_6b_cpu/ \
    --enable-xft \
    --xft-dtype bf16_fp16
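The MPI command uses `CORE_NUM_PER_SOCKET`, which must be set beforehand. One way to derive it, sketched below, is to read the "Core(s) per socket" field from `lscpu` output; the function name `cores_per_socket` is hypothetical, and the parsing assumes the English (`LC_ALL=C`) output format.

```shell
# Read lscpu output on stdin and print the number of cores per socket.
cores_per_socket() {
    awk -F: '/^Core\(s\) per socket/ {
        gsub(/[ \t]/, "", $2)   # strip padding around the value
        print $2
        exit
    }'
}
```

It can then be used as `CORE_NUM_PER_SOCKET=$(LC_ALL=C lscpu | cores_per_socket)` before running the `mpirun` command.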

Start model worker:

# Load the model with the default configuration (max sequence length 4096).
python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 

Or use numactl on a multi-socket server for better performance:

# Run the model worker on NUMA node 0 with data type bf16_fp16 (the first token uses bfloat16, the remaining tokens use float16)
numactl -N 0  --localalloc python3 -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 

Or use MPI to run inference across 2 sockets for better performance:

# Run the model worker on NUMA nodes 0 and 1 with data type bf16_fp16 (the first token uses bfloat16, the remaining tokens use float16)
OMP_NUM_THREADS=$CORE_NUM_PER_SOCKET LD_PRELOAD=libiomp5.so mpirun \
-n 1 numactl -N 0  --localalloc  python -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 : \
-n 1 numactl -N 1  --localalloc  python -m fastchat.serve.model_worker \
    --model-path /path/to/models \
    --enable-xft \
    --xft-dtype bf16_fp16 

For more details, please refer to this link