Quantized BitNet-B1-58-3B

This repository contains a quantized version of the 1bitLLM/bitnet_b1_58-3B model. While the original repository showcases impressive validation results, it only emulates BitNet's Linear layers, so its memory usage is similar to that of an fp16 model. By leveraging the QuantLinear module from AutoGPTQ, this repository makes it possible to save and run a genuinely 2-bit quantized model.
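To make the packing idea concrete, here is a minimal, self-contained sketch: four ternary weights, each encoded in 2 bits, fit into a single byte. The function names and the uint8 storage are illustrative assumptions only; AutoGPTQ's QuantLinear packs its codes into int32 buffers alongside scales, but the principle is the same.

import torch

def pack_ternary_to_2bit(w: torch.Tensor) -> torch.Tensor:
    # Pack a ternary {-1, 0, +1} weight tensor into 2-bit codes, 4 per byte.
    codes = (w + 1).to(torch.uint8).flatten()        # map {-1,0,1} -> {0,1,2}
    assert codes.numel() % 4 == 0, "pad to a multiple of 4 in practice"
    codes = codes.view(-1, 4)
    packed = (codes[:, 0]
              | (codes[:, 1] << 2)
              | (codes[:, 2] << 4)
              | (codes[:, 3] << 6))
    return packed                                    # one uint8 holds four weights

def unpack_2bit_to_ternary(packed: torch.Tensor) -> torch.Tensor:
    codes = torch.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], dim=1)
    return codes.flatten().to(torch.int8) - 1        # back to {-1,0,1}

w = torch.randint(-1, 2, (4, 8), dtype=torch.int8)   # toy ternary weight matrix
packed = pack_ternary_to_2bit(w)
assert torch.equal(unpack_2bit_to_ternary(packed).view(4, 8), w)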

The quantized model offers significant advantages in terms of model size and memory consumption. With a model size of just 1 GB, the quantized 3B model can run inference with a context size of 2048 while consuming only 4.5 GB of VRAM. Furthermore, since the weights used during execution are identical to those in the original repository, the perplexity (PPL) results remain unchanged.
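As a rough sanity check on those numbers: about 3 billion weights at 2 bits each come to roughly 0.75 GB, and the remaining fp16 tensors (embeddings, norms, and quantization scales) account for the rest of the ~1 GB on-disk size.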

Install

pip install -r requirements.txt

Quantization

The quantized model is already provided in this repository. However, if you wish to quantize the model yourself, the following command loads 1bitLLM/bitnet_b1_58-3B and saves the 2-bit quantized version to ./bitnet_b1_58-3B_quantized:

python quantization.py
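If you only need a mental model of what such a script does, the hypothetical skeleton below captures the flow (load the reference checkpoint, swap in packed 2-bit layers, save). The load/save calls are standard transformers APIs; the actual QuantLinear replacement performed by quantization.py is only indicated by a comment here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "1bitLLM/bitnet_b1_58-3B"
dst = "./bitnet_b1_58-3B_quantized"

# Load the reference (fp16-emulated) BitNet checkpoint.
# trust_remote_code may be needed if the repo ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.float16,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(src)

# In the real script, each Linear layer that emulates BitNet is swapped for an
# AutoGPTQ QuantLinear holding the packed 2-bit ternary weights (see the
# packing sketch above); that step is omitted in this skeleton.

model.save_pretrained(dst, safe_serialization=True)
tokenizer.save_pretrained(dst)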

Evaluation

python eval_ppl.py --hf_path ./ --seqlen 2048 --max_dataset_size 1000
python eval_task.py --hf_path ./ \
    --batch_size 1 \
    --tasks \
    --output_path result.json \
    --num_fewshot 0 \
    --ctx_size 2048
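For reference, the perplexity evaluation boils down to the standard sliding-window cross-entropy computation sketched below. The choice of wikitext-2 as the dataset and the plain AutoModelForCausalLM loading are assumptions made for illustration; eval_ppl.py's actual dataset handling, and the custom loading required for the packed QuantLinear layers, may differ.

import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(model_path: str, seqlen: int = 2048, max_samples: int = 1000) -> float:
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
    model.eval()

    # Concatenate the test split into one long token stream.
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    ids = tok("\n\n".join(data["text"]), return_tensors="pt").input_ids

    nlls, n_chunks = [], min(ids.shape[1] // seqlen, max_samples)
    for i in range(n_chunks):
        chunk = ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
        # labels == input_ids: the model returns the mean next-token NLL as .loss
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float())

    return math.exp(torch.stack(nlls).mean().item())

print(perplexity("./", seqlen=2048))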