Efficient Inference Kernel Support for 1.58-bit

#8
by LeiWang1999 - opened

Check out this repo, guys! 🙂

https://github.com/microsoft/BitBLAS/tree/main/integration/BitNet

BitBLAS Results

Performance

Note: To reproduce the BitBLAS results, please check out benchmark_inference_latency.py. To reproduce the results of the original model, please check out the 1bitLLM/bitnet_b1_58-3B repo.

| Model | Device | batchsize | in_seq | model | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|---|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | LLAMA-3B | 177.6729107 | 64.17962909 |
| bitnet_b1_58-3B | A100 | 128 | 1 | LLAMA-3B | 188.6145592 | 63.48158518 |
| bitnet_b1_58-3B | A100 | 1 | 2048 | LLAMA-3B | 348.7066031 | 202.6877999 |

On-the-Fly GPU Memory Footprint

We measured the GPU memory footprint with the nvidia-smi command. Please check out nvidia_measure_memory.sh to record the real-time GPU memory usage, then start a benchmark_model_10k_loops.py workload to measure the overall GPU memory usage.
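For reference, here is a minimal sketch of how such a measurement can be taken, assuming the measurement script simply polls nvidia-smi in a loop; the polling interval and duration below are illustrative and not taken from the repo:

```python
# Illustrative only: samples GPU memory via nvidia-smi while a workload runs.
# Assumes nvidia_measure_memory.sh does something equivalent; check the repo
# for the actual measurement script.
import subprocess
import time

def gpu_memory_used_mb(gpu_index: int = 0) -> int:
    """Query the currently used GPU memory (in MB) through nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
        f"--id={gpu_index}",
    ])
    return int(out.decode().strip().splitlines()[0])

if __name__ == "__main__":
    # Start the benchmark workload (e.g. benchmark_model_10k_loops.py) in another
    # shell, then run this loop to record the peak footprint it reaches.
    peak = 0
    for _ in range(600):          # sample for roughly 60 seconds
        peak = max(peak, gpu_memory_used_mb())
        time.sleep(0.1)
    print(f"peak GPU memory: {peak} MB")
```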

| Model | Device | batchsize | in_seq | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|---|---|---|---|---|---|
| bitnet_b1_58-3B | A100 | 1 | 1 | 7595 MB | 1729 MB |
| bitnet_b1_58-3B | A100 | 128 | 1 | 7677 MB | 1789 MB |
| bitnet_b1_58-3B | A100 | 1 | 2048 | 8731 MB | 3163 MB |

Just replace the inference kernel of the BitnetLinear layer.
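A rough sketch of what that swap can look like is below. `build_bitblas_kernel` is a hypothetical stand-in for the kernel construction done in the BitBLAS integration code (not a real BitBLAS API name), and `BitnetLinear` is assumed to be the model's quantized linear module; see the linked repo for the actual code path.

```python
# Sketch only: route BitnetLinear's forward pass through an optimized low-bit
# matmul kernel instead of the default dequantize-then-matmul path.
# `build_bitblas_kernel` is a hypothetical placeholder for the kernel setup
# shown in the BitBLAS integration (it is not a real BitBLAS API name).
import torch
import torch.nn as nn

def patch_bitnet_linear(module: nn.Module, build_bitblas_kernel) -> None:
    """Recursively replace the forward of every BitnetLinear-like layer."""
    for child in module.children():
        patch_bitnet_linear(child, build_bitblas_kernel)

    if type(module).__name__ == "BitnetLinear":  # assumed class name from the post
        kernel = build_bitblas_kernel(module)    # builds a 1.58-bit matmul kernel

        def forward(x: torch.Tensor, _kernel=kernel, _bias=getattr(module, "bias", None)):
            out = _kernel(x)                     # low-bit weights consumed directly
            if _bias is not None:
                out = out + _bias
            return out

        module.forward = forward
```

Because the kernel consumes the low-bit weights directly instead of materializing a dequantized copy, this kind of swap is also what the memory-footprint numbers above reflect.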
