---
license: apache-2.0
inference: true
tags:
- vicuna
---

**This model is an 8-bit quantization of Vicuna 13B.**
- 13B parameters
- Group size: 128
- wbits: 8
- true-sequential: yes
- act-order: yes
- 8-bit quantized - read more about this here: https://github.com/ggerganov/llama.cpp/pull/951
- Conversion process: Llama 13B -> Llama 13B HF -> Vicuna13B-v1.1 HF -> Vicuna13B-v1.1-8bit-128g (a sketch of the corresponding quantization command follows this list)
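
The settings above roughly correspond to a GPTQ-for-LLaMa quantization call along these lines. This is only a sketch: the script name, the calibration dataset, and the file paths are assumptions and may differ between revisions of qwopqwop200/GPTQ-for-LLaMa.

```
# Hypothetical quantization command (not taken from this card):
# quantize the Vicuna-13B v1.1 HF checkpoint to 8-bit with group size 128.
python llama.py ./vicuna-13b-v1.1-hf c4 \
    --wbits 8 \
    --groupsize 128 \
    --true-sequential \
    --act-order \
    --save vicuna13b-v1.1-8bit-128g.pt
```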

<br>
<br>

# Basic installation procedure

- It was a nightmare; I will only briefly outline what you need. WSL in particular was painful to sort out.
- I will not provide installation support, sorry.
- You can certainly use llama.cpp or other loaders that support 8-bit quantization; I simply chose oobabooga/text-generation-webui.
- You will likely face many bugs before text-generation-webui loads, ranging from missing PATH or environment variables to having to manually pip uninstall/reinstall packages.
- The notes below will likely become outdated once both text-generation-webui and GPTQ-for-LLaMa receive the appropriate bug fixes.
- If this model produces very slow answers (1 token/s), you are not using CUDA for bitsandbytes or your hardware needs an upgrade (a quick GPU sanity check is shown right after this list).
- If this model produces answers with weird characters, you are not using the correct version of qwopqwop200/GPTQ-for-LLaMa mentioned below.
- If this model produces answers that are off topic or if it talks to itself, you are not using the correct checkout (508de42) of qwopqwop200/GPTQ-for-LLaMa mentioned below.
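
As a quick sanity check before debugging the loaders, confirm that PyTorch actually sees your GPU and which CUDA version it was built with (a generic check, not something from this repository):

```
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```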

CUDA (slow tokens/s):
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
```
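
If the CUDA kernel built correctly, importing it should succeed. The extension name `quant_cuda` is an assumption based on GPTQ-for-LLaMa's setup script and may differ in other revisions:

```
# Assumed extension name; adjust if your GPTQ-for-LLaMa revision uses a different one
python -c "import quant_cuda; print('quant_cuda kernel found')"
```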

Triton (fast tokens/s) - works on Windows with WSL (what I used) or on Linux:
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
git fetch origin pull/1229/head:triton # This is the branch that supports Triton - https://github.com/oobabooga/text-generation-webui/pull/1229
git checkout triton
pip install -r requirements.txt

mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # -b cuda
cd GPTQ-for-LLaMa
git checkout 508de42 # Before qwopqwop200 broke everything... - https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183
pip install -r requirements.txt
```
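
With either route installed, loading this model in text-generation-webui looks roughly like the sketch below. The model folder name is a placeholder: place the quantized weights under `models/` and adjust the flags to your text-generation-webui version, since the exact CLI options have changed over time.

```
cd text-generation-webui
# Hypothetical folder name under models/ - adjust to wherever you put the weights
python server.py --model vicuna-13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --chat
```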

<br>
<br>

# Testbench details and results

- Latest version of oobabooga/text-generation-webui + https://github.com/oobabooga/text-generation-webui/pull/1229

- NVIDIA RTX 3090
- 32 GB DDR4
- i9-7980XE OC @ 4.6 GHz

- 11 tokens/s on average with Triton
- Preliminary observation: better results than --load-in-8bit (to be confirmed)
- Tested and working in both chat mode and text generation mode

<br>
<br>

# Vicuna Model Card

## Model details

**Model type:**
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT.
It is an auto-regressive language model, based on the transformer architecture.

**Model date:**
Vicuna was trained between March 2023 and April 2023.

**Organizations developing the model:**
The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.

**Paper or resources for more information:**
https://vicuna.lmsys.org/

**License:**
Apache License 2.0

**Where to send questions or comments about the model:**
https://github.com/lm-sys/FastChat/issues

## Intended use
**Primary intended uses:**
The primary use of Vicuna is research on large language models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

## Training dataset
70K conversations collected from ShareGPT.com.

## Evaluation dataset
A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.

## Major updates of weights v1.1
- Refactor the tokenization and separator. In Vicuna v1.1, the separator has been changed from `"###"` to the EOS token `"</s>"`. This change makes it easier to determine the generation stop criteria and enables better compatibility with other libraries (a rough illustration of the format follows below).
- Fix the supervised fine-tuning loss computation for better model quality.