Update README.md
README.md
CHANGED
@@ -1,3 +1,11 @@
----
-license: apache-2.0
----
+---
+license: apache-2.0
+---
+
+This is a GGUF conversion of https://huggingface.co/minghaoyan/Wide-Sheared-LLaMA-543M, based on the paper "Decoding Speculative Decoding" by Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman.
+
+https://arxiv.org/pdf/2402.01528
+
+For those not familiar with speculative decoding, it is a technique for accelerating inference of a larger model by pairing it with a smaller draft model. The draft model rapidly generates a candidate token sequence, which the large model then verifies in parallel. At the first position where the drafted sequence differs from what the large model would have generated, the large model's token is used instead (the large model always either corrects one token or appends one), and the draft model then drafts new tokens from that point forward, with the process repeating. As a result, the same sequence is generated as the large model would produce on its own, but at a significantly accelerated rate.
+
+The Wide Sheared LLaMA models by minghaoyan are optimized for use as speculative decoding draft models. To use one with llama.cpp, pass it with the "-md <GGUF_filename>" option, and consider tuning the --draft parameter, which sets how many tokens are drafted per step.
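The draft-then-verify loop described in the README text can be illustrated with a minimal Python sketch. It is not how llama.cpp implements the feature: it assumes greedy decoding only, verifies drafted tokens one at a time instead of in a single batched pass, and uses hypothetical toy functions (`toy_target`, `toy_draft`) in place of real models. The function names and the `n_draft` parameter are made up for this example.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, n_draft: int = 4) -> List[int]:
    """Greedy speculative decoding sketch (assumption: no sampling)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft model proposes n_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each drafted position. A real engine
        #    would score all positions in one forward pass; this toy loop
        #    calls the target once per position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target adds one more token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


# Toy models (hypothetical): the draft agrees with the target except when
# the prefix length is a multiple of 3.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 7 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, n_new=20)
    slow = greedy_decode(toy_target, prompt, n_new=20)
    print(fast == slow)
```

Because every token that is kept is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target alone. The speedup in a real system comes from verifying the whole draft in one batched pass, which this sketch does not model.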