Update README.md

dc4ff71 over 1 year ago

4.58 kB

	---
	inference: false
	license: llama2
	model_creator: WizardLM
	model_link: https://huggingface.co/WizardLM/WizardLM-70B-V1.0
	model_name: WizardLM 70B V1.0
	model_type: llama
	quantized_by: Thireus
	---

	# WizardLM 70B V1.0 – EXL2
	- Model creator: [WizardLM](https://huggingface.co/WizardLM)
	- Original model: [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)
	- Model used for quantization: [WizardLM 70B V1.0-HF](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF) – float16 of [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)

	## Models available in this repository

	\| Link \| BITS (-b) \| HEAD BITS (-hb) \| MEASUREMENT LENGTH (-ml) \| LENGTH (-l) \| CAL DATASET (-c) \| Size \| ExLlama \| Max Context Length \|
	\| ------ \| --------- \| --------------- \| ------------------------ \| ----------- \| ---------------- \| ---- \| ------- \| ------------------ \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-4.0bpw-h6-exl2/) \| 4.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 35GB \| [v2](https://github.com/turboderp/exllamav2) \| 4096 \|
	\| [here](https://huggingface.co/Thireus/WizardLM-70B-V1.0-HF-5.0bpw-h6-exl2/) \| 5.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| 44GB \| [v2](https://github.com/turboderp/exllamav2) \| 4096 \|
	\| _coming soon..._ \| 6.0 \| 6 \| 2048 \| 2048 \| [0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-raw-v1/train)* \| ...GB \| [v2](https://github.com/turboderp/exllamav2) \| 4096 \|

	\* wikitext-2-raw-v1

	## Description:

	_This repository contains EXL2 model files for [WizardLM's WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0)._

	EXL2 is a new format used by ExLlamaV2 – https://github.com/turboderp/exllamav2. EXL2 is based on the same optimization method as GPTQ. The format allows for mixing quantization
	levels within a model to achieve any average bitrate between 2 and 8 bits per weight.

	## Prompt template (official):

	```
	A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {prompt} ASSISTANT:
	```

	## Prompt template (suggested):

	```
	A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
	USER:
	{prompt}
	ASSISTANT:


	```

	## Quantization process:

	\| Original Model \| → \| (optional but recommended) Float16 Model* \| → \| Safetensor Model** \| → \| EXL2 Model \|
	\| -------------- \| --- \| ------------- \| --- \| ---------------- \| --- \| ---------- \|
	\| [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) \| → \| [WizardLM 70B V1.0-HF](https://huggingface.co/simsim314/WizardLM-70B-V1.0-HF)* \| → \| Safetensor** \| → \| EXL2 \|

	Example to convert WizardLM-70B-V1.0-HF to EXL2 4.0 bpw with 6-bit head:

	```
	mkdir -p ~/EXL2/WizardLM-70B-V1.0-HF_4bit # Create the output directory
	python convert.py -i ~/float16_safetensored/WizardLM-70B-V1.0-HF -o ~/EXL2/WizardLM-70B-V1.0-HF_4bit -c ~/EXL2/0000.parquet -b 4.0 -hb 6
	```

	\* Use the following script to convert your local pytorch_model bin files to float16 (you can also choose bfloat16) + safetensors all in one go:

	- https://github.com/oobabooga/text-generation-webui/blob/main/convert-to-safetensors.py
	(best for sharding and float16/FP16 or bfloat16/BF16 conversion)

	Example to convert [WizardLM 70B V1.0](https://huggingface.co/WizardLM/WizardLM-70B-V1.0) directly to float16 safetensors in 10GB shards:

	```
	python convert-to-safetensors.py ~/original/WizardLM-70B-V1.0 --output ~/float16_safetensored/WizardLM-70B-V1.0 --max-shard-size 10GB
	```

	Use `--bf16` if you'd like to try bfloat16 instead, but note that there are concerns about quantization quality – https://github.com/turboderp/exllamav2/issues/30#issuecomment-1719009289

	\\ Use any one of the following scripts to convert your local pytorch_model bin files to safetensors:

	- https://github.com/turboderp/exllamav2/blob/master/util/convert_safetensors.py (official ExLlamaV2)
	- https://huggingface.co/Panchovix/airoboros-l2-70b-gpt4-1.4.1-safetensors/blob/main/bin2safetensors/convert.py (recommended)
	- https://gist.github.com/epicfilemcnulty/1f55fd96b08f8d4d6693293e37b4c55e#file-2safetensors-py

	## Further reading:

	- https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html