Medusa

 Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads

| Blog | Codebase |


Installation

Method 1: With pip

pip install medusa-llm

Method 2: From source

git clone https://github.com/FasterDecoding/Medusa.git
cd Medusa
pip install -e .

Model Weights

| Size | Chat Command | Hugging Face Repo |
| --- | --- | --- |
| 7B | python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-7b-v1.3 | FasterDecoding/medusa-vicuna-7b-v1.3 |
| 13B | python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-13b-v1.3 | FasterDecoding/medusa-vicuna-13b-v1.3 |
| 33B | python -m medusa.inference.cli --model FasterDecoding/medusa-vicuna-33b-v1.3 | FasterDecoding/medusa-vicuna-33b-v1.3 |

Inference

We currently support inference on a single GPU with batch size 1, which is the most common setup for local model hosting. We are actively working to extend Medusa's capabilities by integrating it into other inference frameworks. Please don't hesitate to reach out if you are interested in contributing to this effort.

Use the following command to launch the CLI:

python -m medusa.inference.cli --model [path of medusa model]

You can also pass --load-in-8bit or --load-in-4bit to load the base model in a quantized format.
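If you launch the CLI from a script, a small helper can assemble the command from the options described above. The following is a minimal sketch, not part of Medusa: the function name build_medusa_cli_command and its quantization parameter are hypothetical, and it assumes that only one of the two quantization flags is passed at a time. Only the module path and the flags --model, --load-in-8bit, and --load-in-4bit come from this README.

```python
import sys
from typing import List, Optional

# Flags as documented in this README; the mapping keys are hypothetical names.
_QUANT_FLAGS = {"8bit": "--load-in-8bit", "4bit": "--load-in-4bit"}


def build_medusa_cli_command(model: str, quantization: Optional[str] = None) -> List[str]:
    """Return the argv list for launching the Medusa CLI (hypothetical helper).

    model: local path or Hugging Face repo name of a Medusa model.
    quantization: None, "8bit", or "4bit". Assumption: the two
    quantization flags are treated as mutually exclusive here.
    """
    if not model:
        raise ValueError("model must be a non-empty path or repo name")
    # Use the current interpreter so the command runs in the same environment.
    command = [sys.executable, "-m", "medusa.inference.cli", "--model", model]
    if quantization is not None:
        if quantization not in _QUANT_FLAGS:
            raise ValueError(f"quantization must be one of {sorted(_QUANT_FLAGS)} or None")
        command.append(_QUANT_FLAGS[quantization])
    return command


if __name__ == "__main__":
    # Print the command instead of running it; running requires medusa-llm and a GPU.
    print(" ".join(build_medusa_cli_command("FasterDecoding/medusa-vicuna-7b-v1.3", "8bit")))
```

To actually start the chat session, the returned list can be passed to subprocess.run(command, check=True), provided medusa-llm is installed in the same environment.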