tridungduong16's picture
Update README.md
a67a7cb
|
raw
history blame
5.88 kB
---
inference: false
language:
- en
license: other
model_type: llama
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- gptq
---
# Meta's Llama 2 13B GPTQ
These files are GPTQ model files for [Meta's Llama 2 13B](https://huggingface.co/meta-llama/Llama-2-13b-hf).
Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
## Repositories available
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-13B-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference](https://huggingface.co/TheBloke/Llama-2-13B-GGML)
* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/TheBloke/Llama-2-13B-fp16)
## Prompt template: None
```
### System:\n{system}\n\n### User:\n{instruction}\n\n### Response:
```
## Provided files
Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
Each separate quant is in a different branch. See below for instructions on fetching from different branches.
| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
| ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
| main | 4 | 128 | False | 7.26 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
| gptq-4bit-32g-actorder_True | 4 | 32 | True | 8.00 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-64g-actorder_True | 4 | 64 | True | 7.51 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-4bit-128g-actorder_True | 4 | 128 | True | 7.26 GB | True | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-8bit-128g-actorder_True | 8 | 128 | True | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and with Act Order for even higher accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-8bit-64g-actorder_True | 8 | 64 | True | 13.95 GB | False | AutoGPTQ | 8-bit, with group size 64g and Act Order for maximum inference quality. Poor AutoGPTQ CUDA speed. |
| gptq-8bit-128g-actorder_False | 8 | 128 | False | 13.65 GB | False | AutoGPTQ | 8-bit, with group size 128g for higher inference quality and without Act Order to improve AutoGPTQ speed. |
| gptq-8bit--1g-actorder_True | 8 | None | True | 13.36 GB | False | AutoGPTQ | 8-bit, with Act Order. No group size, to lower VRAM requirements and to improve AutoGPTQ speed. |
## How to download from branches
- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True`
- With Git, you can clone a branch with:
```
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Llama-2-13B-GPTQ`
```
- In Python Transformers code, the branch is the `revision` parameter; see below.
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-13B-GPTQ`.
- To download from a specific branch, enter for example `TheBloke/Llama-2-13B-GPTQ:gptq-4bit-32g-actorder_True`
- see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done"
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-13B-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
* Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
## How to use this GPTQ model from Python code
First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) installed:
`GITHUB_ACTIONS=true pip install auto-gptq`
Then try the following example code:
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import json
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig, get_gptq_peft_model
MODEL_PATH_GPTQ= "Llama-2-13B-GPTQ"
ADAPTER_DIR= "Llama-2-13B-GPTQ-Orca"
DEV = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH_GPTQ, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
MODEL_PATH_GPTQ,
use_safetensors=True,
trust_remote_code=False,
use_triton=True,
device="cuda:0",
warmup_triton=False,
trainable=True,
inject_fused_attention=True,
inject_fused_mlp=False,
)
model = get_gptq_peft_model(
model,
model_id=ADAPTER_DIR,
train_mode=False
)
model.eval()
```
## Compatibility
The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.