When will there be a GGML version?
Is it possible to have a GGML version?
There is already one from TheBloke (https://huggingface.co/TheBloke/Llama-2-7B-32K-Instruct-GGML), but unfortunately it only outputs gibberish for me.
What prompt are you using? People say this model uses a different prompt than the original Llama chat prompt. @pbkowalski
@CUIGuy I've tried both the specified variant, [INST]...[/INST], and others, but the output is just symbols regardless.
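For reference, this is the prompt template being tested in this thread, sketched as a tiny helper (the function name is mine; the exact template is whatever the model card specifies):

```python
def build_prompt(instruction: str) -> str:
    # Template tried in this thread: [INST]\n{instruction}\n[/INST]\n\n
    return f"[INST]\n{instruction}\n[/INST]\n\n"

print(repr(build_prompt("Write a poem about cats")))
# '[INST]\nWrite a poem about cats\n[/INST]\n\n'
```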
@pbkowalski for which quantization levels did you observe this?
@mauriceweber I've only tried Q2_K, Q4_0, and Q4_1.
The output I get from Q4_1:
'[INST]\nWrite a poem about cats\n[\INST]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
I tried different prompts and likewise got only long sequences of "\n". Could it be that something breaks in the tokenization of the input?
Can someone with access to the unquantized model verify the token sequence for the following?
```python
m.tokenize("[INST]\nWrite a poem about cats\n[/INST]\n\n".encode('utf8'))
```
```
[1, 29961, 25580, 29962, 13, 6113, 263, 26576, 1048, 274, 1446, 13, 29961, 29914, 25580, 29962, 13, 13]
```
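In case it helps, here is a minimal sketch of how to run the same check against one of the quantized files with the llama-cpp-python bindings (the model path is a placeholder; `vocab_only` loads just the tokenizer, and `tokenize` prepends the BOS token 1 by default):

```python
from llama_cpp import Llama

# Placeholder path to one of TheBloke's quantized files
llm = Llama(
    model_path="./llama-2-7b-32k-instruct.ggmlv3.q4_1.bin",
    vocab_only=True,  # only the tokenizer is needed for this check
)

tokens = llm.tokenize("[INST]\nWrite a poem about cats\n[/INST]\n\n".encode("utf8"))
expected = [1, 29961, 25580, 29962, 13, 6113, 263, 26576, 1048, 274,
            1446, 13, 29961, 29914, 25580, 29962, 13, 13]

# If the sequences diverge, the gibberish is likely a tokenization
# problem rather than a quantization problem.
print(tokens == expected)
print(tokens)
```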
Based on my experience, the Q2...Q4 quantizations are too small for proper output: even when they generate "useful" text (rather than just newlines), these models hallucinate far too much. The Q8_0 quantization, however, works pretty well, and with llama.cpp, 16GB of RAM allows context lengths up to 16k and 24GB up to 32k (tested on a 15" MacBook Air with 24GB of unified memory).
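For anyone trying to reproduce this, a minimal sketch of the long-context setup with the llama-cpp-python bindings (the file name is a placeholder, and depending on how the model was converted you may also need to set the RoPE scaling parameters):

```python
from llama_cpp import Llama

# Placeholder path to the Q8_0 file; n_ctx=16384 fits in ~16GB RAM
# per the observation above, n_ctx=32768 needs ~24GB.
llm = Llama(
    model_path="./llama-2-7b-32k-instruct.ggmlv3.q8_0.bin",
    n_ctx=16384,
)

out = llm("[INST]\nWrite a poem about cats\n[/INST]\n\n", max_tokens=256)
print(out["choices"][0]["text"])
```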