32K GGUF of LLAMA3-8B-INSTRUCT 🚀

THIS IS NOT A FINETUNE. IT JUST WORKS GREAT VIA YARN SCALING.

Custom imatrix edge-quants tested OK at 4, 3, and 2 bit.

To take advantage of the long context, set the context size with -c 32000 when you run the model in llama.cpp.

How to run the model in interactive mode with llama.cpp, reading a long prompt from a text file with -f:

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make -j

./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f mylongprompt.txt -ngl 33 -n 2000 -i -c 32000
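
If you switch between quants or prompt files a lot, the command above can be wrapped in a small shell function. This is only a sketch: MODEL, PROMPT_FILE, CTX and run_llama are made-up names, not part of llama.cpp, and the flags are copied unchanged from the command above.

```shell
# Illustrative defaults (hypothetical variable names); override them in the environment.
MODEL="${MODEL:-llama3ins-8b-32k-q4ns.gguf}"
PROMPT_FILE="${PROMPT_FILE:-mylongprompt.txt}"
CTX="${CTX:-32000}"

# Run the interactive command from above, but fail early with a clear
# message if the model or the prompt file is not where we expect it.
run_llama() {
  if [ ! -f "$MODEL" ] || [ ! -f "$PROMPT_FILE" ]; then
    echo "missing $MODEL or $PROMPT_FILE" >&2
    return 1
  fi
  ./main -m "$MODEL" --temp 0.3 --color -f "$PROMPT_FILE" -ngl 33 -n 2000 -i -c "$CTX"
}
```

Call run_llama from inside the llama.cpp directory, for example with MODEL set to one of the other quant files listed in the benchmarks.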

Prompt format: paste a prompt of up to 32000 tokens inside the user{} brackets.

Put this inside your mylongprompt.txt file, or copy it from below and add it to the command above like this: -p "<|im_start....."

<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on mars via aldrin cycler orbits DUMP THE BIG LONG PROMPT HERE.}
<|im_end|>assistant
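
To avoid hand-editing the template every time, the three lines above can be generated by a small shell function. A minimal sketch, assuming you want exactly the layout shown above; make_prompt is a hypothetical helper name, not something shipped with llama.cpp.

```shell
# Hypothetical helper: print the prompt template shown above with the
# system message ($1) and the long user prompt ($2) filled in.
make_prompt() {
  system_msg=$1
  user_msg=$2
  printf '<|im_start|>system{%s}<|im_end|>\n' "$system_msg"
  printf '<|im_start|>user{%s}\n' "$user_msg"
  printf '<|im_end|>assistant\n'
}
```

For example, make_prompt "You are a helpful assistant." "$(cat mydocument.txt)" > mylongprompt.txt writes a file that can be passed with -f, where mydocument.txt stands for whatever long text you want to ask about.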

Perplexity Benchmarks

./perplexity -m ../llama3ins-8b-32k-f16.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.10 seconds per pass - ETA 0.13 minutes
[1]6.1736,[2]6.8769,[3]7.4226,[4]8.0199,[5]8.4531,[6]8.7808,[7]9.3213,[8]10.0461,[9]10.7468,[10]11.0909,[11]11.2691,[12]11.4318,[13]11.9160,[14]11.4038,[15]11.2641,[16]10.9073,
Final estimate: PPL = 10.9073 +/- 0.50026

./perplexity -m ../llama3ins-8b-32k-q8.gguf -ngl 99 -f wiki.test.raw --chunks 16   # YES, 8BIT IS BETTER THAN THE BF16 -> F16 CONVERSION
perplexity: 2.38 seconds per pass - ETA 0.15 minutes
[1]6.1454,[2]6.8672,[3]7.4109,[4]8.0148,[5]8.4472,[6]8.7771,[7]9.3182,[8]10.0466,[9]10.7509,[10]11.0836,[11]11.2563,[12]11.4218,[13]11.9095,[14]11.4000,[15]11.2587,[16]10.9028,
Final estimate: PPL = 10.9028 +/- 0.49958

./perplexity -m ../llama3ins-8b-32k-q6.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.36 seconds per pass - ETA 0.15 minutes
[1]6.0654,[2]6.7806,[3]7.3319,[4]7.9600,[5]8.3961,[6]8.7512,[7]9.2932,[8]10.0314,[9]10.7402,[10]11.0786,[11]11.2597,[12]11.4410,[13]11.9342,[14]11.4223,[15]11.2818,[16]10.9354,
Final estimate: PPL = 10.9354 +/- 0.50190

./perplexity -m ../llama3ins-8b-32k-q5km.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.0044,[2]6.8263,[3]7.3989,[4]8.0044,[5]8.4508,[6]8.7716,[7]9.3220,[8]10.0606,[9]10.7709,[10]11.1098,[11]11.2956,[12]11.4743,[13]11.9661,[14]11.4569,[15]11.3028,[16]10.9474,
Final estimate: PPL = 10.9474 +/- 0.50185

./perplexity -m ../llama3ins-8b-32k-q4ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.5618,[2]7.1233,[3]7.5647,[4]8.1198,[5]8.5365,[6]8.8386,[7]9.4233,[8]10.1359,[9]10.8601,[10]11.1981,[11]11.3705,[12]11.5619,[13]12.0492,[14]11.5287,[15]11.3823,[16]11.0269,
Final estimate: PPL = 11.0269 +/- 0.50623

IQ4_XS, NON-IMATRIX, FOR REFERENCE: quite a bit worse than my imatrix one
perplexity: 7.41 seconds per pass - ETA 0.48 minutes
[1]6.9103,[2]7.4907,[3]7.9577,[4]8.3949,[5]8.8029,[6]9.0275,[7]9.6252,[8]10.2914,[9]10.9833,[10]11.3498,[11]11.5059,[12]11.7275,[13]12.1804,[14]11.6848,[15]11.5226,[16]11.1761,
Final estimate: PPL = 11.1761 +/- 0.51803

./perplexity -m ../llama3ins-8b-32k-q3ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.43 seconds per pass - ETA 0.15 minutes
[1]6.6955,[2]7.2732,[3]7.9483,[4]8.5310,[5]9.0020,[6]9.3664,[7]9.9324,[8]10.7019,[9]11.4163,[10]11.6981,[11]11.8420,[12]12.1191,[13]12.6709,[14]12.1222,[15]11.9778,[16]11.5624,
Final estimate: PPL = 11.5624 +/- 0.53444

./perplexity -m ../llama3ins-8b-32k-q2ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   # SURPRISINGLY USABLE
perplexity: 2.48 seconds per pass - ETA 0.15 minutes
[1]7.0861,[2]7.8057,[3]8.5360,[4]9.1910,[5]9.6240,[6]10.0848,[7]10.7928,[8]11.4729,[9]12.3032,[10]12.5115,[11]12.7422,[12]13.1224,[13]13.7716,[14]13.1772,[15]13.0020,[16]12.5578,
Final estimate: PPL = 12.5578 +/- 0.57323

./perplexity -m ../llama3ins-8b-32k-q1ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   # ONE BIT TURNS TO JUNK
perplexity: 2.41 seconds per pass - ETA 0.15 minutes
[1]15.1640,[2]16.2585,[3]17.8912,[4]18.2226,[5]18.4974,[6]19.2407,[7]20.0085,[8]21.6465,[9]22.7656,[10]22.7903,[11]23.2208,[12]24.2318,[13]25.7172,[14]24.5111,[15]23.8096,[16]22.7933,
Final estimate: PPL = 22.7933 +/- 1.05192
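
The final estimates are easier to compare side by side than scattered through the logs. The sketch below assumes you have saved output like the above to a file (ppl.log is a made-up name) and pulls out each model file, its final PPL, and the percent change relative to the first result in the file. summarize_ppl is a hypothetical helper, not part of llama.cpp.

```shell
# Hypothetical helper: read perplexity logs on stdin and print one line per
# "Final estimate", with the change relative to the first result seen.
summarize_ppl() {
  awk '
    # Remember the model file from each "./perplexity -m <file>" command line.
    $1 == "./perplexity" {
      for (i = 2; i < NF; i++) if ($i == "-m") model = $(i + 1)
      sub(/^.*\//, "", model)   # drop any leading directory
    }
    # Print the model, its final PPL, and the change relative to the first result.
    /^Final estimate: PPL =/ {
      ppl = $5
      if (base == "") base = ppl
      printf "%-28s %8.4f %+7.2f%%\n", model, ppl, 100 * (ppl - base) / base
      model = "(no command line)"   # label for runs logged without their command
    }
  '
}
```

Run it as summarize_ppl < ppl.log. With the f16 run first in the file, every other row shows how far that quant drifts from f16. Runs logged without their command line, like the IQ4_XS reference, show up as "(no command line)".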

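Yes, 8-bit q8_0 is slightly better than f16 (PPL 10.9028 vs 10.9073, well within the reported error), because converting from bf16 to f16 is lossy: f16 has more mantissa bits than bf16 but a much narrower exponent range (5 exponent bits vs 8), so values outside that range get clipped or flushed toward zero. The ns quants are custom nisten quants and work well down to 2 bit. A 1.75-bit quant is included for reference; however, its perplexity tanks and its output is incoherent.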

Built with Meta Llama 3

Model size: 8.03B params (GGUF). Architecture: llama.