And where is the GGUF file itself?

Give the guy some time; new repos always get made first so the upload scripts can do their job.
It may not even be possible to convert it yet.

Sorry, I launched this last night thinking it was the exact same model as Mistral 7B, so it would all be fine. However, it uses a slightly different, but not that different, tokenizer.

I am helping test this PR; once it's resolved it should be pretty quick :)
https://github.com/ggerganov/llama.cpp/pull/8579
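
For anyone wondering what the pipeline looks like once support lands, it's the standard llama.cpp convert-then-quantize flow. A rough sketch only; the local paths, output filenames, and the Q4_K_M pick are just placeholders:

$ python convert_hf_to_gguf.py ../models/mistralai/Mistral-Nemo-Instruct-2407 \
    --outfile Mistral-Nemo-Instruct-2407.fp16.gguf \
    --outtype f16
$ ./llama-quantize Mistral-Nemo-Instruct-2407.fp16.gguf \
    Mistral-Nemo-Instruct-2407.Q4_K_M.gguf Q4_K_M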

@MaziyarPanahi Kindly let us know when the quants are ready :)
Thank you.

Of course, the PR is ready to be merged. So hopefully it will be ready today :)

is merged ;)

The PR seems to be just one piece of support for Mistral-Nemo-Instruct-2407; it may need a few more PRs.
I'll keep an eye on it and upload the quants the moment it's possible.

@MaziyarPanahi Does that mean it's just a workaround and not a fix?

It's not a workaround; it's just one part of the overall solution to the "Support Mistral-Nemo-Instruct-2407 128K" issue.
If you use this part on its own, the model will start loading but then fail with a wrong tensor shape error, because Mistral-Nemo uses non-standard tensor shapes.
The llama.cpp team is currently working on that part of the issue.
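
(For context, if I remember the config right: Nemo declares an explicit head_dim of 128 with 32 attention heads, so the attention projections are 32 × 128 = 4096 wide rather than the 5120 hidden size that the loading code used to assume, hence the shape mismatch.)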

Last PR is merged and models are being uploaded!

Can confirm that they work :3
I've tested Q4_K_S with b3437 and it's coherent up to 16K, with cache quantization too.

Nice!!!! Love to see how far we can go with the context length here! :D

Thanks for the fine quants!

I threw a friend's 450-page Ph.D. dissertation (just over ~50k tokens) at the Q8_0 and it returned a decent rough summary. I can almost fit the full 128k context on my 3090 Ti's 24GB of VRAM (had to dial it back just a bit to avoid OOM when offloading all layers).
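
Back-of-envelope on why it only just doesn't fit, assuming (from memory) Nemo's 40 layers, 8 KV heads, and head_dim 128, and roughly 1 byte per element for the q8_0 KV cache on top of the ~13 GB Q8_0 weights:

$ echo $((2 * 40 * 8 * 128 * 102400))   # (K+V) * layers * KV heads * head_dim * ctx ≈ 8.4 GB at ~1 byte/element
8388608000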

I'll likely use this model to experiment quickly generating summaries of medium sized chunks of text (up to 16k or 32k).

Runtime

$ ./llama-server --version
version: 3441 (081fe431)
built with cc (GCC) 14.1.1 20240522 for x86_64-pc-linux-gnu

$ ./llama-server \
    --model "../models/MaziyarPanahi/Mistral-Nemo-Instruct-2407-GGUF/Mistral-Nemo-Instruct-2407.Q8_0.gguf" \
    --n-gpu-layers 41 \
    --ctx-size 102400 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --threads 8 \
    --flash-attn \
    --mlock \
    --n-predict -1 \
    --host 127.0.0.1 \
    --port 8080
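
Once it's up, a quick sanity check against llama-server's built-in health endpoint (same host/port as above) before throwing a dissertation at it:

$ curl http://127.0.0.1:8080/health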

Client Config

{
    "temperature": 0.2,
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
    "repeat_penalty": 1.1,
    "n_predict": -1,
    "seed": -1,
}

Mixtral Prompt Format

Make sure to use the correct Mixtral prompt format, being mindful of preserving whitespace and of whether (and how) to fudge in a "system prompt".

If you use the wrong prompt format (e.g. ChatML), it sometimes evaluates the entire prompt and immediately returns end-of-string, generating nothing.

[INST] Just tell it what to do here without system prompt and keep the space in front. [/INST]
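
For example, a raw request to the server's /completion endpoint using that format (the prompt text is just a placeholder; the sampling values mirror the client config above):

$ curl http://127.0.0.1:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "[INST] Summarize the following text in a few bullet points: ...text here... [/INST]", "n_predict": -1, "temperature": 0.2, "top_k": 40, "top_p": 0.95, "min_p": 0.05, "repeat_penalty": 1.1}'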

Example Timings

INFO [           print_timings] prompt eval time     =   34172.36 ms / 51617 tokens (    0.66 ms per token,  1510.49 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 n_prompt_tokens_processed=51617 t_token=0.6620369258190131 n_tokens_second=1510.489764242212
INFO [           print_timings] generation eval time =   25648.80 ms /   557 runs   (   46.05 ms per token,    21.72 tokens per second) | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_token_generation=25648.798 n_decoded=557 t_token=46.04811131059246 n_tokens_second=21.716417276162417
INFO [           print_timings]           total time =   59821.16 ms | tid="125836361121792" timestamp=1721676458 id_slot=0 id_task=1010 t_prompt_processing=34172.36 t_token_generation=25648.798 t_total=59821.157999999996

Cheers!
