---
license: mit
---

# Quantized BitNet-B1-58-3B


This repository contains a quantized version of the [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) model.

The original repository reports impressive validation results, but it emulates BitNet's Linear layers, so its memory usage is similar to that of an fp16 model. By using the QuantLinear module from [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ), this repository can export and run a 2-bit quantized model.

The quantized model is much smaller and uses much less memory. At a model size of just 1 GB, the quantized 3B model can run inference with a context size of 2048 while consuming only 4.5 GB of VRAM. Furthermore, because the weights used during execution are the same as those in the original repository, the perplexity (PPL) remains unchanged.
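
To illustrate why 2-bit storage need not change the weights, here is a minimal sketch that assumes ternary weights (-1, 0, +1), as in BitNet b1.58. It packs four weights into each byte and recovers them exactly. The function names `pack_ternary` and `unpack_ternary` and the bit layout are hypothetical, chosen for illustration. This is not the storage format used by AutoGPTQ's QuantLinear.

```python
# Illustrative sketch only: NOT AutoGPTQ's QuantLinear storage layout.
# It shows that ternary weights {-1, 0, +1} fit losslessly in 2 bits each.

def pack_ternary(weights):
    """Pack a list of ternary weights (-1, 0, +1) into bytes, four per byte."""
    packed = bytearray((len(weights) + 3) // 4)
    for i, w in enumerate(weights):
        if w not in (-1, 0, 1):
            raise ValueError(f"not a ternary weight: {w!r}")
        code = w + 1  # map -1, 0, +1 to the 2-bit codes 0, 1, 2
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    return [((packed[i // 4] >> (2 * (i % 4))) & 0b11) - 1 for i in range(count)]


weights = [-1, 0, 1, 1, 0, -1, -1, 0, 1]
packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))

print(len(packed))          # 3 bytes hold 9 weights
print(restored == weights)  # True: the round trip is lossless
```

As a rough size check under the same assumption, a round 3 billion weights at 2 bits each would occupy 3,000,000,000 × 2 / 8 = 750,000,000 bytes, or about 0.75 GB, which is the same order as the 1 GB model size quoted above.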

## Install

```
pip install -r requirements.txt
```

## Quantization

The quantized model is already provided in this repository. However, if you wish to quantize the model yourself, you can load it from `1bitLLM/bitnet_b1_58-3B` and save the quantized (2-bit) version to `./bitnet_b1_58-3B_quantized` by running the following command:

```
python quantization.py
```

## Evaluation

```
python eval_ppl.py --hf_path ./ --seqlen 2048 --max_dataset_size 1000
```

```
python eval_task.py --hf_path ./ \
    --batch_size 1 \
    --tasks \
    --output_path result.json \
    --num_fewshot 0 \
    --ctx_size 2048
```