# cat-llama-3-70b-hqq
AI Model Name: Llama 3 70B ("Built with Meta Llama 3", license: https://llama.meta.com/llama3/license/)
How to quantize a 70B model so that it fits on 2x RTX 4090 GPUs:

I tried EXL2, AutoAWQ, and SqueezeLLM first, and they all failed for different reasons (issues opened with each project). HQQ worked:
I rented a 4x GPU instance with 1 TB of system RAM ($19/hr) on RunPod, with a 1024 GB container disk and a 1024 GB workspace disk. In hindsight, 2x GPUs with 80 GB of VRAM and 512 GB+ of system RAM should be enough, so I probably overpaid.
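As a rough sanity check on the VRAM requirement: a 70B-parameter model at 4 bits per weight, plus one scale and one zero point per group of 64 weights (stored here as 2 bytes each, ignoring the extra savings from meta-data quantization and offloading), comes out to roughly 40 GB. These are back-of-envelope assumptions, not measured values:

```python
# Back-of-envelope estimate of the 4-bit quantized weight footprint.
# Assumptions (not measured): 70e9 parameters, 4-bit weights,
# one 2-byte scale + one 2-byte zero per group of 64 weights.
params = 70e9
weight_gb = params * 4 / 8 / 1e9        # 4 bits per weight -> 35 GB
meta_gb = params / 64 * (2 + 2) / 1e9   # scale + zero per group of 64
total_gb = weight_gb + meta_gb
print(f"~{total_gb:.1f} GB of quantized weights")  # ~39.4 GB
```

That total excludes activations and the KV cache, which is why 2x 24 GB 4090s are a tight but plausible fit for inference, while quantizing from the bf16 originals (~140 GB) needs the much larger rented instance.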
Note that you need to fill in the access request form to get the 70B Meta weights.

You can copy/paste the following into the console and it will set everything up automatically:
```bash
apt update
apt install git-lfs vim -y
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
~/miniconda3/bin/conda init bash
source ~/.bashrc
conda create -n hqq python=3.10 -y && conda activate hqq
git lfs install
git clone https://github.com/mobiusml/hqq.git
cd hqq
pip install torch
pip install .
pip install "huggingface_hub[hf_transfer]"
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli login
```
Create a `quantize.py` file by copy/pasting this into the console:
```bash
cat > quantize.py << 'EOF'
import torch
from hqq.core.quantize import BaseQuantizeConfig
from hqq.engine.hf import HQQModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'
compute_dtype = torch.bfloat16

# 4-bit weights, group size 64; offload quantization meta-data to save VRAM
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)

# Quantize the scales/zeros themselves with a group size of 128
zero_scale_group_size = 128
quant_config['scale_quant_params']['group_size'] = zero_scale_group_size
quant_config['zero_quant_params']['group_size'] = zero_scale_group_size

# Load the full-precision model, quantize it, and save the result
model = HQQModelForCausalLM.from_pretrained(model_id)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=compute_dtype)
AutoHQQHFModel.save_quantized(model, save_dir)

# Sanity check: reload the quantized model
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()
EOF
```
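For intuition, the group-wise scheme configured above (`nbits=4, group_size=64`) can be illustrated with a minimal sketch. Note this is plain min/max affine quantization for illustration only; HQQ actually *optimizes* the zero/scale with a half-quadratic solver and additionally quantizes the meta-data itself:

```python
import numpy as np

# Minimal group-wise 4-bit affine quantization (illustration only).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64)).astype(np.float32)  # 8 groups of 64 weights

wmin = W.min(axis=1, keepdims=True)
wmax = W.max(axis=1, keepdims=True)
scale = (wmax - wmin) / 15.0              # 4 bits -> 16 levels (0..15)
zero = wmin

q = np.clip(np.round((W - zero) / scale), 0, 15).astype(np.uint8)  # quantize
W_hat = q * scale + zero                  # dequantize

max_err = np.abs(W - W_hat).max()         # bounded by scale/2 per group
```

Each group stores 64 4-bit codes plus its own `scale` and `zero`, which is where the per-group meta-data overhead comes from.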
Run the script:
```bash
python quantize.py
```
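Once quantization finishes, the saved model can be reloaded for inference. A hedged sketch (untested here; assumes the same `save_dir` as above and a machine with enough VRAM, and fetches the tokenizer from the original repo since it is not part of the quantized checkpoint):

```python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel

model_id = 'meta-llama/Meta-Llama-3-70B-Instruct'
save_dir = 'cat-llama-3-70b-hqq'

# Reload the quantized weights (much smaller than the bf16 originals)
model = AutoHQQHFModel.from_quantized(save_dir)
model.eval()

# Tokenizer comes from the original hub repo
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer('The capital of France is', return_tensors='pt').to('cuda')
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```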