kousw
/

bitnet_b1_58-3B_quantized



	---
	license: mit
	---

	# Quantized BitNet-B1-58-3B

	This repository contains a quantized version of the [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) model.
	The original repository reports strong validation results, but it only emulates BitNet's Linear layers, so its memory usage is similar to that of an fp16 model. By using the QuantLinear module from [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ), this repository can export and run a 2-bit quantized model.
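To illustrate why 2 bits per weight are enough, here is a minimal sketch that packs ternary weights (-1, 0, +1, the value set used by BitNet b1.58) into 32-bit words, 16 weights per word. The function names and the bit layout are assumptions chosen for illustration; this is not AutoGPTQ's actual QuantLinear storage format.

```python
# Illustrative only: hypothetical helpers, not AutoGPTQ's QuantLinear layout.

def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into 32-bit words, 16 weights per word."""
    if len(weights) % 16 != 0:
        raise ValueError("number of weights must be a multiple of 16")
    words = []
    for start in range(0, len(weights), 16):
        word = 0
        for offset, w in enumerate(weights[start:start + 16]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            code = w + 1  # map -1, 0, +1 to the 2-bit codes 0, 1, 2
            word |= code << (2 * offset)
        words.append(word)
    return words


def unpack_ternary(words):
    """Inverse of pack_ternary: recover the ternary weights from packed words."""
    weights = []
    for word in words:
        for offset in range(16):
            code = (word >> (2 * offset)) & 0b11
            weights.append(code - 1)
    return weights


weights = [-1, 0, 1, 1] * 4            # 16 ternary weights
packed = pack_ternary(weights)         # one 32-bit word
assert unpack_ternary(packed) == weights
```

Because packing and unpacking are lossless, a packed model of this kind runs with exactly the same weight values as the emulated one.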

	The quantized model is much smaller and uses far less memory. At a model size of just 1GB, the quantized 3B model can run inference with a context size of 2048 while using only 4.5GB of VRAM. Because the weights used during execution are the same as those in the original repository, the perplexity (PPL) is unchanged.
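As a rough sanity check on the size figure, the sketch below estimates weight storage at different bit widths. The parameter count of exactly 3 billion is an assumption taken from the nominal "3B" name, and the estimate ignores anything kept at higher precision (for example embeddings or scales), which is presumably why the reported size is somewhat larger than the 2-bit estimate.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 3_000_000_000  # assumption: nominal "3B" parameter count
GB = 1e9

fp16_gb = weight_bytes(NUM_PARAMS, 16) / GB  # 6.0
two_bit_gb = weight_bytes(NUM_PARAMS, 2) / GB  # 0.75

print(f"fp16:  {fp16_gb:.2f} GB")
print(f"2-bit: {two_bit_gb:.2f} GB")
```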


	## Install

	```
	pip install -r requirements.txt
	```

	## Quantization

	The quantized model is already provided in this repository. To quantize the model yourself, run the following command, which loads the model from `1bitLLM/bitnet_b1_58-3B` and saves the 2-bit quantized version to `./bitnet_b1_58-3B_quantized`:

	```
	python quantization.py
	```


	## Evaluation

	```
	python eval_ppl.py --hf_path ./ --seqlen 2048 --max_dataset_size 1000
	```
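For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition only; the function name is hypothetical, and it is not taken from `eval_ppl.py`, whose internals are not described here.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 0.25 to every token has a perplexity of about 4.
nlls = [-math.log(0.25)] * 8
print(perplexity(nlls))
```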
	```
	python eval_task.py --hf_path ./ \
	--batch_size 1 \
	--tasks \
	--output_path result.json \
	--num_fewshot 0 \
	--ctx_size 2048
	```