example code doesn't work at all
Output is <pad> tokens only.
Prompt: Write me a poem about Machine Learning.
mlx 0.15.2
mlx-lm 0.15.0
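For reference, one way to confirm the installed versions above is via the standard importlib.metadata (a quick check, any pip show equivalent works too):
from importlib.metadata import version
# Print the installed mlx and mlx-lm package versions for the bug report.
print("mlx", version("mlx"), "mlx-lm", version("mlx-lm"))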
The example code should work fine:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
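Note that the CLI output below shows mlx_lm.generate wrapping the prompt in Gemma's chat turn markers, while the snippet above passes the raw string. A minimal sketch of applying the same template from Python, assuming the tokenizer returned by load exposes the Hugging Face apply_chat_template method (this formats the prompt for the instruct model, but does not by itself explain the pad-only output):
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")

# Wrap the raw prompt in Gemma's chat format, mirroring what the
# mlx_lm.generate CLI does. Assumes the tokenizer returned by load()
# delegates to a Hugging Face tokenizer with apply_chat_template.
messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)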
Reproducible here:
% mlx_lm.generate --model "mlx-community/gemma-2-27b-it-8bit" --prompt "Hello"
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 31152.83it/s]
==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.538 tokens-per-sec
Generation: 1.840 tokens-per-sec
% python3 prince.py
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 34820.64it/s]
==========
Prompt: hello
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.124 tokens-per-sec
Generation: 2.043 tokens-per-sec
Yep, very bad experience.
It doesn't work, but someone still tells you it works.
The example code should work fine:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
Did you really test the code?
Very bad experience. It doesn't work, but someone still tells you it works.
I have previously noticed differences with mlx-vlm (and PaliGemma) vs. the official demo on HF as well, but didn't have time to pursue this further. Perhaps there is an underlying MLX issue? I am using macOS 14.3 on an M3 Max.
By contrast, the 9B-FP16 variant does work:
% mlx_lm.generate --model "mlx-community/gemma-2-9b-it-fp16" --prompt "Hello"
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 17614.90it/s]
==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model
Hello! 👋
How can I help you today? 😊
==========
Prompt: 6.337 tokens-per-sec
Generation: 13.758 tokens-per-sec
It was an oversight on my part. There is a tiny bug with the 27B version, and it should be fixed soon:
https://github.com/ml-explore/mlx-examples/pull/857
Fixed ✅
pip install -U mlx-lm
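After upgrading, the original snippet can be re-run to confirm the fix; a minimal sketch, assuming mlx-lm exposes its version through standard package metadata:
from importlib.metadata import version
from mlx_lm import load, generate

# Confirm the upgraded mlx-lm is the one being imported.
print("mlx-lm", version("mlx-lm"))

# Re-run the snippet that previously produced only <pad> tokens.
model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)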