Nafnlaus committed
Commit a2b600d
1 Parent(s): 45d06fe

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -6,6 +6,6 @@ This is a GGUF conversion of https://huggingface.co/minghaoyan/Wide-Sheared-LLaM
 
 https://arxiv.org/pdf/2402.01528
 
-For those not familiar with speculative decoding, it is a technique to accelerate inference of larger models by pairing them with a smaller draft model; the draft model is used to rapidly generate a possible predictive token sequence, which the large model then simultaneously verify. Wherever the drafted token sequence would differ from what the large model would have generated, the large model's token is used instead (the large model will always correct or add one token), and the small model then drafts new tokens from that point forward, with the process repeating. As a result, the same sequence is generated, but at a significantly accelerated rate.
+For those not familiar with speculative decoding, it is a technique to accelerate inference of larger models based on the premise that some tokens are much easier to predict than others. A large model is paired with a smaller draft model. The draft model is used to rapidly generate a possible predictive token sequence, which the large model then simultaneously verifies. Wherever the drafted token sequence would differ from what the large model would have generated, the large model's token is used instead (the large model will always correct or add one token), and the small model then drafts new tokens from that point forward, with the process repeating. As a result, the same sequence is generated, but at a significantly accelerated rate.
 
 The wide sheared LLaMA models by minghaoyan are optimized for use as speculative decoding draft models. To use these with llama.cpp, use the "-md <GGUF_filename>" option, and consider tuning the --draft parameter.
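
To make the draft-verify-correct loop described in the updated paragraph concrete, here is a minimal greedy sketch in Python. Everything in it is illustrative: the `NextToken` stubs stand in for real target and draft model forward passes, and a production implementation such as llama.cpp's scores all drafted positions in a single batched pass rather than one call per token.

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function over integer token ids.
NextToken = Callable[[List[int]], int]

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int,
                       n_draft: int = 3) -> List[int]:
    """Greedy speculative decoding: draft, verify, correct, repeat."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. Draft: the small model cheaply proposes up to n_draft tokens.
        proposed: List[int] = []
        for _ in range(n_draft):
            proposed.append(draft(seq + proposed))
        # 2. Verify: check the target model's prediction at every drafted
        #    position (a real implementation scores all positions in one
        #    batched forward pass; this loop mimics that result).
        accepted = 0
        for tok in proposed:
            if target(seq + proposed[:accepted]) != tok:
                break
            accepted += 1
        seq += proposed[:accepted]
        # 3. The target model always contributes one token: it either
        #    corrects the first mismatch or extends a fully accepted run.
        if len(seq) < goal:
            seq.append(target(seq))
    return seq[:goal]

if __name__ == "__main__":
    # Toy stand-ins: the target counts 0..9 cyclically; the draft agrees
    # except when the correct next token is a multiple of 5.
    def target(seq: List[int]) -> int:
        return (seq[-1] + 1) % 10
    def draft(seq: List[int]) -> int:
        t = target(seq)
        return (t + 1) % 10 if t % 5 == 0 else t
    out = speculative_decode(target, draft, [0], n_new=12)
    assert out == [i % 10 for i in range(13)]  # identical to target-only decoding
    print(out)  # [0, 1, 2, ..., 9, 0, 1, 2]
```

Because rejected positions fall back to the target model's own token, the output is identical to what the target model would produce alone; the draft model only changes how much work each verification step amortizes.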
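As for the invocation the README mentions, a run might look like the sketch below. The binary name, model filenames, prompt, and token counts are placeholders rather than files from this repository; the flag spellings follow llama.cpp's speculative decoding example.

```sh
# Placeholders throughout: substitute your target model, this repo's draft
# GGUF, and your own prompt.
./llama-speculative \
    -m  ./models/target-model-Q8_0.gguf \
    -md ./Wide-Sheared-LLaMA-draft-Q8_0.gguf \
    --draft 8 \
    -p "Once upon a time" -n 128
```

The --draft value caps how many tokens the draft model proposes per verification round; the best setting depends on how often the draft agrees with the target model, so it is worth trying several values.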