example code doesn't work at all
Output is <pad> tokens only.
Prompt: Write me a poem about Machine Learning.
mlx 0.15.2
mlx-lm 0.15.0
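For reference, one way to confirm the installed versions above is via the standard importlib.metadata (a quick check, any pip show equivalent works too):
from importlib.metadata import version
# Print the installed mlx and mlx-lm package versions for the bug report.
print("mlx", version("mlx"), "mlx-lm", version("mlx-lm"))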
The example code should work fine:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
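Note that the CLI output below shows mlx_lm.generate wrapping the prompt in Gemma's chat turn markers, while the snippet above passes the raw string. A minimal sketch of applying the same template from Python, assuming the tokenizer returned by load exposes the Hugging Face apply_chat_template method (this formats the prompt for the instruct model, but does not by itself explain the pad-only output):
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")

# Wrap the raw prompt in Gemma's chat format, mirroring what the
# mlx_lm.generate CLI does. Assumes the tokenizer returned by load()
# delegates to a Hugging Face tokenizer with apply_chat_template.
messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)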
Reproducible here:
% mlx_lm.generate --model "mlx-community/gemma-2-27b-it-8bit" --prompt "Hello"
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 31152.83it/s]
==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.538 tokens-per-sec
Generation: 1.840 tokens-per-sec
% python3 prince.py
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 34820.64it/s]
==========
Prompt: hello
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.124 tokens-per-sec
Generation: 2.043 tokens-per-sec
Yep, very bad experience.
It doesn't work, but someone still tells you it works.
The example code should work fine:
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
Did you really test the code?
Very bad experience. It doesn't work, but someone still tells you it works.
I have previously noticed differences with mlx-vlm (and PaliGemma) vs. the official demo on HF as well, but didn't have time to pursue this further. Perhaps there is an underlying MLX issue? I am using macOS 14.3 on an M3 Max.
By contrast, the 9B-FP16 variant does work:
% mlx_lm.generate --model "mlx-community/gemma-2-9b-it-fp16" --prompt "Hello"
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 17614.90it/s]
==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model
Hello! 👋
How can I help you today? 😊
==========
Prompt: 6.337 tokens-per-sec
Generation: 13.758 tokens-per-sec
It was an oversight on my part. There is a tiny bug with the 27B version, and it should be fixed soon:
https://github.com/ml-explore/mlx-examples/pull/857
Fixed ✅
pip install -U mlx-lm
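After upgrading, the original snippet can be re-run to confirm the fix; a minimal sketch, assuming mlx-lm exposes its version through standard package metadata:
from importlib.metadata import version
from mlx_lm import load, generate

# Confirm the upgraded mlx-lm is the one being imported.
print("mlx-lm", version("mlx-lm"))

# Re-run the snippet that previously produced only <pad> tokens.
model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)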