Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
Installation
Method 1: With pip
pip install medusa-llm
Method 2: From source
git clone https://github.com/FasterDecoding/Medusa.git
cd Medusa
pip install -e .
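After either method, a quick check that the package can be found by the interpreter may save some debugging. The sketch below assumes the top-level module is named `medusa`, as the `python -m medusa.inference.cli` commands in this README suggest; `is_installed` is a hypothetical helper for illustration, not part of Medusa.

```python
import importlib.util


def is_installed(module_name: str = "medusa") -> bool:
    """Return True if a top-level module with this name can be located.

    Hypothetical helper: it only looks the module up and does not import it,
    so it will not catch errors raised while the package initializes.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the installed package exposes a top-level `medusa` module.
    print("medusa importable:", is_installed())
```

If this reports that the module is missing, make sure the same Python environment was active when running pip.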
Model Weights
Size | Chat Command | Hugging Face Repo |
---|---|---|
7B | python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-7b-v1.3 | FasterDecoding/medusa-vicuna-7b-v1.3 |
13B | python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-13b-v1.3 | FasterDecoding/medusa-vicuna-13b-v1.3 |
33B | python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-33b-v1.3 | FasterDecoding/medusa-vicuna-33b-v1.3 |
Inference
We currently support single-GPU inference with a batch size of 1, which is the most common setup for local model hosting. We are actively working to extend Medusa's capabilities by integrating it into other inference frameworks. Please don't hesitate to reach out if you are interested in contributing to this effort.
You can use the following command to launch the CLI:
python -m medusa.inference.cli --model [path of medusa model]
You can also pass --load-in-8bit or --load-in-4bit to load the base model in quantized format.
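For scripted launches, the command and the optional quantization flag can be assembled programmatically. The following is a minimal sketch, not part of Medusa: `build_cli_command` and its `quantization` parameter are hypothetical names, and treating the two flags as mutually exclusive is an assumption of this sketch rather than something the CLI is documented to enforce. Only the module path and the flags shown above are taken from this README.

```python
import shlex
import sys
from typing import List, Optional


def build_cli_command(model: str, quantization: Optional[str] = None) -> List[str]:
    """Build the argument list for launching the Medusa CLI.

    Hypothetical helper. `quantization` may be None, "8bit", or "4bit";
    this sketch assumes at most one quantization flag is passed at a time.
    """
    if quantization not in (None, "8bit", "4bit"):
        raise ValueError("quantization must be None, '8bit', or '4bit'")
    # Run the CLI module with the current interpreter.
    cmd = [sys.executable, "-m", "medusa.inference.cli", "--model", model]
    if quantization is not None:
        # Maps to --load-in-8bit or --load-in-4bit.
        cmd.append(f"--load-in-{quantization}")
    return cmd


if __name__ == "__main__":
    cmd = build_cli_command("FasterDecoding/medusa-vicuna-7b-v1.3", "8bit")
    # Print the command for inspection; pass `cmd` to subprocess.run to launch it.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when the model path contains spaces.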