---
|
inference: false |
|
language: |
|
- en |
|
license: other |
|
model_type: llama |
|
pipeline_tag: text-generation |
|
tags: |
|
- facebook |
|
- meta |
|
- pytorch |
|
- llama |
|
- llama-2 |
|
- gptq |
|
--- |
|
|
|
# Meta's Llama 2 13B GPTQ |
|
|
|
These files are GPTQ model files for [Meta's Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf). |
|
|
|
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of each option, its parameters, and the software used to create it.
|
|
|
|
|
## Repositories available |
|
|
|
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ) |
|
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-13B-GGML) |
|
* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Llama-2-13B-fp16) |
|
|
|
## Prompt template

```
### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:
```
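Concretely, filling this template is plain string substitution. A minimal sketch (the helper name `build_prompt` is illustrative, not part of any library):

```python
def build_prompt(system: str, instruction: str) -> str:
    """Fill the template above with a system message and a user instruction."""
    return (
        f"### System:\n{system}\n\n"
        f"### User:\n{instruction}\n\n"
        f"### Response:"
    )

print(build_prompt("You are a helpful assistant.", "Tell me about AI."))
```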
|
|
|
## Provided files |
|
|
|
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements. |
|
|
|
Each separate quant is in a different branch. See below for instructions on fetching from different branches. |
|
|
|
| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description | |
|
| ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- | |
|
| main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. | |
|
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. | |
|
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. | |
|
| gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. | |
|
| gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. | |
|
| gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. | |
|
| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. | |
|
| gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. | |
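As a rough rule of thumb (an assumption, not a measured figure), a GPTQ model needs at least its file size in VRAM for the weights, plus headroom for activations and the KV cache. A minimal sketch:

```python
def min_vram_gb(file_size_gb: float, overhead_gb: float = 2.0) -> float:
    """Rough lower bound on VRAM: quantised weights plus activation/KV-cache headroom.

    The 2 GB default overhead is an illustrative guess; real usage depends
    on context length and batch size.
    """
    return file_size_gb + overhead_gb

# e.g. the gptq-4bit-32g-actorder_True branch (8.00 GB file)
print(min_vram_gb(8.00))  # → 10.0
```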
|
|
|
## How to download from branches |
|
|
|
- In text-generation-webui, you can add `:branch` to the end of the download name, e.g. `TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True`
|
- With Git, you can clone a branch with:

```
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Llama-2-13B-GPTQ
```
|
- In Python Transformers code, the branch is the `revision` parameter; see below. |
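The `:branch` suffix used by text-generation-webui maps onto the `revision` argument in Python. A hypothetical helper (`split_model_spec` is not a real API, just a sketch of that mapping):

```python
def split_model_spec(spec: str) -> tuple[str, str]:
    """Split 'org/repo:branch' into (repo_id, revision), defaulting to 'main'.

    The returned revision is what you would pass as
    from_quantized(repo_id, revision=revision, ...).
    """
    repo_id, _, revision = spec.partition(":")
    return repo_id, revision or "main"

print(split_model_spec("TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True"))
```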
|
|
|
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
|
|
|
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui). |
|
|
|
It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install. |
|
|
|
1. Click the **Model tab**. |
|
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-13B-GPTQ`. |
|
- To download from a specific branch, enter for example `TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True` |
|
- See Provided Files above for the list of branches for each option.
|
3. Click **Download**. |
|
4. The model will start downloading. Once it's finished it will say "Done".
|
5. In the top left, click the refresh icon next to **Model**. |
|
6. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-13B-GPTQ` |
|
7. The model will automatically load, and is now ready for use! |
|
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right. |
|
* Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`. |
|
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started! |
|
|
|
## How to use this GPTQ model from Python code |
|
|
|
First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed: |
|
|
|
`GITHUB_ACTIONS=true pip install auto-gptq` |
|
|
|
Then try the following example code: |
|
|
|
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, get_gptq_peft_model

MODEL_PATH_GPTQ = "Llama-2-13B-GPTQ"
ADAPTER_DIR = "Llama-2-13B-GPTQ-Orca"  # local LoRA adapter trained on this base model

DEV = "cuda:0"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH_GPTQ, use_fast=True)

# Load the quantised base model
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_PATH_GPTQ,
    use_safetensors=True,
    trust_remote_code=False,
    use_triton=True,
    device=DEV,
    warmup_triton=False,
    trainable=True,
    inject_fused_attention=True,
    inject_fused_mlp=False,
)

# Attach the PEFT (LoRA) adapter for inference
model = get_gptq_peft_model(
    model,
    model_id=ADAPTER_DIR,
    train_mode=False,
)
model.eval()

# Generate from a prompt
prompt = "### System:\nYou are a helpful assistant.\n\n### User:\nTell me about AI.\n\n### Response:"
inputs = tokenizer(prompt, return_tensors="pt").to(DEV)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
|
|
|
## Compatibility |
|
|
|
The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork. |
|
|
|
|