|
--- |
|
license: other |
|
inference: true |
|
tags: |
|
- vicuna |
|
--- |
|
 |
|
|
|
Q. Why quantize in 8-bit instead of 4-bit?
|
A. For evaluation purposes. In theory, an 8-bit quantized model should give slightly better perplexity (perhaps not noticeably so - to be evaluated) than a 4-bit quantized version. If you have more than 15GB of GPU VRAM available, you may want to try it out.
|
Note that 8-bit GPTQ quantization is not the same as loading the model in 8-bit precision. Loading the full-precision model in 8-bit (--load-in-8bit) comes with noticeable quality (perplexity) degradation.
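
For illustration, this is roughly how the two approaches differ when launching oobabooga/text-generation-webui (a sketch only: the model directory names are placeholders and exact flag names may vary between webui versions):

```
# GPTQ 8-bit quantized weights (this model), dequantized on the fly at inference time
python server.py --model vicuna-13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama

# bitsandbytes 8-bit loading of the full-precision HF model - the --load-in-8bit path discussed above
python server.py --model vicuna-13B-v1.1-HF --load-in-8bit
```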
|
|
|
Refs: |
|
- https://github.com/ggerganov/llama.cpp/pull/951 |
|
- https://news.ycombinator.com/item?id=35148542 |
|
- https://arxiv.org/abs/2105.03536 |
|
- https://github.com/IST-DASLab/gptq |
|
- https://arxiv.org/abs/2212.09720 |
|
|
|
<br> |
|
|
|
**This model is an 8-bit GPTQ quantization of Vicuna 13B v1.1.**
|
- 13B parameters |
|
- Group size: 128 |
|
- wbits: 8 |
|
- true-sequential: yes |
|
- act-order: yes |
|
- 8-bit quantized |
|
- Conversion process: LLaMa 13B -> LLaMa 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g |
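
For reference, a checkpoint with the settings listed above would be produced by a GPTQ-for-LLaMa command along the following lines (a sketch only: paths and the output file name are placeholders, and the exact CLI depends on the GPTQ-for-LLaMa revision):

```
# Quantize the Vicuna 13B v1.1 HF weights to 8-bit GPTQ with group size 128,
# using the c4 calibration set and the act-order / true-sequential options listed above
python llama.py ./vicuna-13B-v1.1-HF c4 --wbits 8 --groupsize 128 --true-sequential --act-order --save vicuna13B-v1.1-8bit-128g.pt
```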
|
|
|
<br> |
|
<br> |
|
|
|
# Basic installation procedure |
|
|
|
- Getting this running was a nightmare, so I will only briefly outline what you'll need. WSL in particular was painful to sort out.
|
- I will not provide installation support, sorry. |
|
- You can certainly use llama.cpp or other loaders that support 8-bit quantization; I just chose oobabooga/text-generation-webui.
|
- You will likely face many bugs before text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/reinstall packages.
|
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes. |
|
- If this model produces very slow answers (around 1 token/s), it means you are not using CUDA with bitsandbytes or your hardware needs an upgrade (see the quick check after this list).
|
- If this model produces answers with weird characters, it means you are not using the correct version of qwopqwop200/GPTQ-for-LLaMa as mentioned below. |
|
- If this model produces off-topic answers or talks to itself, it means you are not using the correct checkout 508de42 of qwopqwop200/GPTQ-for-LLaMa, as mentioned below.
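
Before digging into any of the above, a quick sanity check that PyTorch actually sees CUDA can save time (if this prints False, bitsandbytes and GPTQ will fall back to very slow paths):

```
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```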
|
|
|
CUDA (slow tokens/s):
|
``` |
|
git clone https://github.com/oobabooga/text-generation-webui |
|
cd text-generation-webui |
|
pip install -r requirements.txt |
|
|
|
mkdir repositories |
|
cd repositories |
|
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes") |
|
cd GPTQ-for-LLaMa |
|
pip install -r requirements.txt |
|
python setup_cuda.py install |
|
``` |
|
|
|
Triton (fast tokens/s) - works on Linux or on Windows via WSL (which is what I used):
|
``` |
|
git clone https://github.com/oobabooga/text-generation-webui |
|
cd text-generation-webui |
|
git fetch origin pull/1229/head:triton # This is the version that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229 |
|
git checkout triton |
|
pip install -r requirements.txt |
|
|
|
mkdir repositories |
|
cd repositories |
|
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda |
|
cd GPTQ-for-LLaMa |
|
git checkout 508de42 # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183 |
|
pip install -r requirements.txt |
|
``` |
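
Once either install finishes, launching the webui with this model looks roughly like this (a sketch only: the model directory name is a placeholder for wherever you placed these weights, and flags may differ between webui versions; drop --chat for plain text generation mode):

```
python server.py --model vicuna-13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama --chat
```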
|
|
|
<br> |
|
<br> |
|
|
|
# Testbench detail and results |
|
|
|
- Latest version of oobabooga/text-generation-webui plus PR https://github.com/oobabooga/text-generation-webui/pull/1229
|
|
|
- NVIDIA RTX 3090
|
- 32GB DDR4
|
- Intel i9-7980XE overclocked @ 4.6GHz
|
|
|
- 11 tokens/s on average with Triton |
|
- Preliminary observations: better results than --load-in-8bit (to be confirmed)
|
- Tested and working in both chat mode and text generation mode |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
<br> |
|
<br> |
|
|
|
# License |
|
|
|
Research only: non-commercial research purposes, with other restrictions applying. See the LICENSE file inherited from LLaMa.
|
|
|
LLaMA-13B converted to work with Transformers/HuggingFace is under a special license, please see the LICENSE file for details. |
|
|
|
https://www.reddit.com/r/LocalLLaMA/comments/12kl68j/comment/jg31ufe/ |
|
|
|
<br> |
|
<br> |
|
|
|
# Vicuna Model Card |
|
|
|
## Model details |
|
|
|
**Model type:** |
|
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. |
|
It is an auto-regressive language model, based on the transformer architecture. |
|
|
|
**Model date:** |
|
Vicuna was trained between March 2023 and April 2023. |
|
|
|
**Organizations developing the model:** |
|
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego. |
|
|
|
**Paper or resources for more information:** |
|
https://vicuna.lmsys.org/ |
|
|
|
**License:** |
|
Apache License 2.0 |
|
|
|
**Where to send questions or comments about the model:** |
|
https://github.com/lm-sys/FastChat/issues |
|
|
|
## Intended use |
|
**Primary intended uses:** |
|
The primary use of Vicuna is research on large language models and chatbots. |
|
|
|
**Primary intended users:** |
|
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence. |
|
|
|
## Training dataset |
|
70K conversations collected from ShareGPT.com. |
|
|
|
## Evaluation dataset |
|
A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details. |
|
|
|
## Major updates of weights v1.1 |
|
- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries. |
|
- Fix the supervised fine-tuning loss computation for better model quality. |