Nafnlaus/Wide-Sheared-LLaMA-796M

license: apache-2.0

This is a GGUF conversion of https://huggingface.co/minghaoyan/Wide-Sheared-LLaMA-796M. The original model is based on the paper "Decoding Speculative Decoding" by Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman.

https://arxiv.org/pdf/2402.01528

For those unfamiliar with speculative decoding, it is a technique for accelerating inference of larger models, based on the premise that some tokens are much easier to predict than others. A large model is paired with a smaller draft model. The draft model rapidly generates a candidate token sequence, which the large model then verifies in parallel. At the first position where the drafted sequence differs from what the large model would have generated, the large model's token is used instead and the remaining drafted tokens are discarded. If every drafted token is accepted, the large model adds one more token, so each verification step always yields at least one token from the large model. The draft model then drafts new tokens from that point forward, and the process repeats. As a result, the same sequence is generated as the large model would have produced on its own, but at a significantly accelerated rate.
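The draft-and-verify loop described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not how llama.cpp or the paper implements it: the "models" are plain functions that return the next token greedily, the names `speculative_decode`, `greedy_decode`, `toy_target`, `toy_draft`, and `n_draft` are hypothetical, and verification is written as a loop, whereas a real implementation would score all drafted positions in one batched forward pass of the large model.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token sequence to its
# greedy next token. Real models return logits; this is a simplification.
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new_tokens: int, n_draft: int = 4) -> List[int]:
    """Greedy speculative decoding with a hypothetical draft length n_draft."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes n_draft tokens autoregressively.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each drafted position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if expected != tok:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
        else:
            # Every drafted token matched: the target adds one more token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


def greedy_decode(target: Model, prompt: List[int],
                  max_new_tokens: int) -> List[int]:
    """Baseline: the target model generating one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


# Hypothetical toy models: the draft agrees with the target most of the
# time and is deliberately wrong at every third sequence length.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 7 + 3) % 50


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    prompt = [1, 2, 3]
    fast = speculative_decode(toy_target, toy_draft, prompt, max_new_tokens=20)
    slow = greedy_decode(toy_target, prompt, max_new_tokens=20)
    print(fast == slow)
```

Because every token that is kept is either confirmed or supplied by the target model, the speculative output equals the plain greedy output in this sketch. The speedup in practice comes from how many drafted tokens are accepted per verification step, which the toy does not measure.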

The Wide Sheared LLaMA models by minghaoyan are optimized for use as speculative decoding draft models. To use this one with llama.cpp, pass the GGUF file as the draft model with the "-md" option, and consider tuning the "--draft" parameter, which sets how many tokens are drafted at a time.