Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1

first mixtral8x22b finetune 💫

This is a hand-rolled quantization built off a custom but backwards-compatible fork of llama.cpp. Hoping to push edgequants to the main llama.cpp repo soon.

MAKE SURE TO MERGE TOGETHER THE TWO PARTS AFTER DOWNLOADING

E.g. download the orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:

cd ~/Downloads

cat orpo4ns.gguf.part* > orpo4ns.gguf

cd llamacppFolderLocation

./server -m ~/Downloads/orpo4ns.gguf -ngl 56

Careful: the merge (cat) can take 5 minutes, or up to 10-15 minutes on slow instances. Check progress with ls -la.
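A small sketch for double-checking the merge, assuming the parts sit in one folder and are named like orpo4ns.gguf.part0 and orpo4ns.gguf.part1. The function name merge_parts is made up for this example. It concatenates the parts, then compares the merged file size against the sum of the part sizes:

```shell
# merge_parts OUT: concatenate OUT.part* into OUT and sanity-check the size.
# Hypothetical helper for illustration, not part of llama.cpp.
merge_parts() {
  out="$1"
  # The shell expands part* in sorted order (part0, part1, ...).
  cat "$out".part* > "$out" || return 1

  # Sum the sizes of the individual parts.
  expected=0
  for part in "$out".part*; do
    expected=$(( expected + $(wc -c < "$part") ))
  done

  # Compare against the size of the merged file.
  actual=$(( $(wc -c < "$out") ))
  if [ "$actual" -eq "$expected" ]; then
    echo "merged OK: $actual bytes"
  else
    echo "size mismatch: got $actual bytes, expected $expected" >&2
    return 1
  fi
}

# Usage (after downloading both parts):
#   merge_parts ~/Downloads/orpo4ns.gguf
```

A matching size only shows that nothing was dropped during the merge; it does not prove the downloads themselves were complete.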

For LM Studio you need to copy the full (merged) orpo3ns.gguf file to your ~/.cache/lm-studio/models/YourNAME/

orpo4ns.gguf is the fastest and the recommended one. The 2bit versions are also done but not recommended.

The imatrix.dat file was calculated over 1000 chunks with wikitext.train.raw (included).

Wrote a bit of custom C++ to avoid quantizing certain layers; tested fully compatible with llama.cpp as of 10 April 2024.

I'm no longer using the gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.

Run with llama.cpp

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbit shipments?"

Perplexity benchmarks

Command I used to run these on a 48-core CPU-only machine. You can add -ngl 16 to offload 16 layers (or more) to the GPU on your own.

./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48

Lower is better. The F16 baseline is ~2.3; the 3bit 58GB version, however, is surprisingly not far off.

orpo4ns.gguf is the fastest because of 4bit/8bit optimizations in most hardware.

orpo4ns.gguf FILESIZE: 71260 MB
[1]2.6970,[2]3.1781,[3]3.7390,[4]3.4159,[5]2.8977,[6]2.7126,[7]2.5597,[8]2.5013,[9]2.5279,[10]2.5175,[11]2.5315,[12]2.5455,
Final estimate: PPL = 2.5455 +/- 0.07697

orpo3ns.gguf FILESIZE: 58536 MB
[1]2.8042,[2]3.3418,[3]3.9400,[4]3.5859,[5]3.2042,[6]3.0524,[7]2.9738,[8]2.9695,[9]3.0232,[10]3.0099,[11]3.0510,[12]3.0589,
Final estimate: PPL = 3.0589 +/- 0.09882

orpo3nm.gguf FILESIZE: 60828 MB
[1]2.8435,[2]3.2998,[3]3.8984,[4]3.4821,[5]3.1084,[6]2.9597,[7]2.8[9]2.9155,[10]2.9218,[11]2.9613,[12]2.9709,
Final estimate: PPL = 2.9709 +/- 0.09419

orpo3nl.gguf FILESIZE: 65405 MB
[1]2.8175,[2]3.2506,[3]3.8241,[4]3.4152,[5]2.9970,[6]2.8455,[7]2.7358,[8]2.7120,[9]2.7955,[10]2.8003,[11]2.8254,[12]2.8371,
Final estimate: PPL = 2.8371 +/- 0.08781

orpo2n.gguf FILESIZE: 49420 MB
[1]3.0082,[2]3.5829,[3]4.1414,[4]4.1671,[5]3.8567,[6]3.7209,[7]3.7150,[8]3.7210,[9]3.8445,[10]3.9332,[11]4.0879,[12]4.0884,
Final estimate: PPL = 4.0884 +/- 0.1499

orpo2ns.gguf FILESIZE: 44026 MB
[1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
Final estimate: PPL = 4.4659 +/- 0.16582

People on Twitter seem very happy with the 4bit version, getting roughly 3x higher speeds (13.01 tps on an M3 Max MacBook) than 4bit MLX (4.5 tps).

The 3bit version is surprisingly usable even though it is only 58GB. Use 3ns or 3nm if you have a 64GB Mac.
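To compare the files against a memory budget, the sizes and final PPL numbers above can be dropped into a small table. This is only a rough sketch: the function names quants and fits_under are made up here, the ~2.3 F16 baseline is approximate, and the budget you pass in is your own assumption (file size alone does not cover context/KV cache or whatever else is running on the machine).

```shell
# name, file size in MB, final PPL (copied from the benchmark results above)
quants() {
cat <<'EOF'
orpo4ns 71260 2.5455
orpo3nl 65405 2.8371
orpo3nm 60828 2.9709
orpo3ns 58536 3.0589
orpo2n 49420 4.0884
orpo2ns 44026 4.4659
EOF
}

# fits_under BUDGET_MB: list the quants whose file size is at most BUDGET_MB,
# with the PPL increase relative to the approximate F16 baseline of 2.3.
fits_under() {
  quants | awk -v budget="$1" -v base=2.3 \
    '$2 <= budget { printf "%s %d MB +%.0f%% PPL\n", $1, $2, ($3 / base - 1) * 100 }'
}

# Usage: everything that fits in an (assumed) 60000 MB budget
#   fits_under 60000
```

With a 60000 MB budget this lists orpo3ns plus the two 2bit files; raising the budget to 61000 also brings in orpo3nm.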

nisten/orpo-zephyr-8x22-EdgeQuants-gguf