---
license: other
license_name: databricks-open-model-license
library_name: gguf
license_link: https://www.databricks.com/legal/open-model-license
pipeline_tag: text-generation
base_model: databricks/dbrx-instruct
---
**Quants from @phymbert (author of the support for this model in llama.cpp) are posted [here](https://huggingface.co/models?sort=created&search=gguf+phymbert)**  
The quants here are meant to test imatrix quantized weights.  
<i>If you run on Metal, you may need this [PR](https://github.com/ggerganov/llama.cpp/pull/6662)</i>

**Added `ggml-dbrx-instruct-16x12b-f16_imatrix-wiki.dat`, an imatrix computed over 2K batches (1M tokens) on FP16 weights using wiki.train.**

| Quant | IMatrix Quant/Dataset/Chunks | Size (GiB) | PPL (wiki.test) |
| -- | -- | -- | -- |
| IQ4_XS | Q8_0/wiki.train/200 | 65.29 | 5.2260 +/- 0.03558 |
| IQ4_XS | FP16/wiki.train/2000 | 65.29 | 5.2241 +/- 0.03559 |
| IQ4_XS | - | 66.05 | 5.2546 +/- 0.03570 |
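
To put the table in perspective, the sketch below compares each imatrix run against the run without an imatrix. It is a rough reading only: the values are copied from the table, the dictionary keys and the `compare` helper are names made up for this example, and treating the reported `+/-` as a yardstick ignores that all three runs share the same test set.

```python
# PPL values copied from the table above as (perplexity, reported +/-).
# The labels are illustrative names, not identifiers used by llama.cpp.
results = {
    "imatrix Q8_0, 200 chunks": (5.2260, 0.03558),
    "imatrix FP16, 2000 chunks": (5.2241, 0.03559),
    "no imatrix": (5.2546, 0.03570),
}

baseline_ppl, baseline_err = results["no imatrix"]


def compare(name):
    """Return (PPL improvement over the no-imatrix run, whether that
    improvement is smaller than the larger of the two reported +/- values)."""
    ppl, err = results[name]
    delta = baseline_ppl - ppl
    return delta, delta < max(err, baseline_err)


for name in results:
    if name == "no imatrix":
        continue
    delta, within = compare(name)
    print(f"{name}: PPL improvement {delta:.4f}, within reported +/-: {within}")
```

Both improvements come out smaller than the reported `+/-`, so at IQ4_XS the benefit of the imatrix on wiki.test looks modest.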

**2024-04-13**: Support for this model has just been merged - [`PR #6515`](https://github.com/ggerganov/llama.cpp/pull/6515).  
**<u>You will need this llama.cpp commit [`4bd0f93e`](https://github.com/ggerganov/llama.cpp/commit/4bd0f93e4ab4fe6682e7d0241c1bdec1397e954a) to run this model</u>**

Quants below IQ3 are very sensitive and unreliable so far (the imatrix may need to be trained on FP16 weights rather than Q8_0, and for longer than 200 chunks). Quants in this repo are tested by running the following command:
```
./build/bin/main -ngl 41 -c 4096 -s 0 -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite an essay about AI.<|im_end|>\n<|im_start|>assistant\n" -m ggml-dbrx-instruct-16x12b-<<quant-to-test>>.gguf
```

* GGUF importance matrix (imatrix) quants for https://huggingface.co/databricks/dbrx-instruct
* The importance matrix is trained for ~100K tokens (200 batches of 512 tokens) using [wiki.train.raw](https://huggingface.co/datasets/wikitext).
* [Which GGUF is right for me? (from Artefact2)](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9) - X axis is file size and Y axis is perplexity (lower perplexity is better quality).
* The [imatrix is being used on the K-quants](https://github.com/ggerganov/llama.cpp/pull/4930) as well (only for < Q6_K).
* You can merge GGUFs with `gguf-split --merge <first-chunk> <output-file>` although this is not required since [f482bb2e](https://github.com/ggerganov/llama.cpp/commit/f482bb2e4920e544651fb832f2e0bcb4d2ff69ab).
* What is an importance matrix (imatrix)? You can [read more about it from the author here](https://github.com/ggerganov/llama.cpp/pull/4861).
* How do I use imatrix quants? Just like any other GGUF. The `.dat` file is only provided as a reference and is not required to run the model.
* If you need to use IQ1, then use IQ1_M as IQ1_S is very unstable.
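
The token counts quoted for the two imatrix files follow from batches times tokens per batch. The helper below is a hypothetical one written for this note. The 512-token batch size is stated for the 200-batch run, and it is only assumed here for the 2K-batch FP16 run.

```python
# 512 tokens per batch is stated for the 200-batch imatrix;
# for the 2K-batch FP16 imatrix it is an assumption.
TOKENS_PER_BATCH = 512


def imatrix_tokens(batches, tokens_per_batch=TOKENS_PER_BATCH):
    """Total number of tokens seen while computing an importance matrix."""
    return batches * tokens_per_batch


print(imatrix_tokens(200))   # 102400, i.e. ~100K tokens
print(imatrix_tokens(2000))  # 1024000, i.e. ~1M tokens
```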

> DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments.

| Layers | Context | Template |
| --- | --- | --- |
| <pre>40</pre> | <pre>32768</pre> | <pre>\<\|im_start\|\>system<br>{system}\<\|im_end\|\><br>\<\|im_start\|\>user<br>{prompt}\<\|im_end\|\><br>\<\|im_start\|\>assistant<br> </pre> |
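
As a minimal sketch of how the template above can be filled in, the function below assembles a prompt string from a system message and a user message. `build_prompt` is a hypothetical helper, not part of llama.cpp. The string it returns contains real newlines, which is what the `\n` escapes in the test command expand to when `-e` is passed.

```python
def build_prompt(system, prompt):
    """Fill the chat template from the table above.

    The result ends with the opening of the assistant turn so that the
    model continues from there.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


text = build_prompt("You are a helpful assistant.", "Write an essay about AI.")
print(text)
```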

* 16x12B MoE
* 16 experts (12B params per single expert; top_k=4 routing)
* 36B active params (132B total params)
* Trained on 12T tokens
* 32k sequence length training
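
The "65x more possible combinations" figure in the quoted description can be checked by counting the ways to choose the active experts. This is a small sketch; `expert_combinations` is a name chosen for this example.

```python
from math import comb


def expert_combinations(num_experts, top_k):
    """Number of distinct expert subsets a top-k router can select."""
    return comb(num_experts, top_k)


dbrx = expert_combinations(16, 4)    # 16 experts, 4 chosen
mixtral = expert_combinations(8, 2)  # 8 experts, 2 chosen
print(dbrx, mixtral, dbrx // mixtral)  # 1820 28 65
```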