---
base_model: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
license: apache-2.0
---

# Importance-Matrix quantizations of HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
# The first Mixtral-8x22B finetune 💫

This is a hand-rolled quantization built on a custom but backwards-compatible fork of llama.cpp.
I'm hoping to push the edge-quant changes to the main llama.cpp repo soon.

## MAKE SURE TO MERGE THE TWO PARTS TOGETHER AFTER DOWNLOADING
## i.e. download the orpo4ns.gguf.part0 & orpo4ns.gguf.part1 files, then:
```
cd ~/Downloads

# merge the two downloaded parts into a single gguf file
cat orpo4ns.gguf.part* > orpo4ns.gguf

# then start the llama.cpp server from wherever you built it
cd llamacppFolderLocation

./server -m ~/Downloads/orpo4ns.gguf -ngl 56
```
Careful: merging can take 5 minutes, or 10-15 on slow instances; check progress with `ls -la`.
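
To confirm the merge completed, the part sizes should add up to the merged file size; a quick sketch using portable `wc -c`:

```
# byte counts of the parts (with a total line) and of the merged file -- the totals should match
wc -c orpo4ns.gguf.part0 orpo4ns.gguf.part1
wc -c orpo4ns.gguf
```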

For LM Studio you need to copy the full merged file (e.g. orpo3ns.gguf) to your ~/.cache/lm-studio/models/YourNAME/ folder.
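
A minimal sketch of that copy step (the `YourNAME` and model subfolder names are placeholders; adjust to however you organize your LM Studio models directory):

```
mkdir -p ~/.cache/lm-studio/models/YourNAME/zephyr-orpo-141b
cp ~/Downloads/orpo3ns.gguf ~/.cache/lm-studio/models/YourNAME/zephyr-orpo-141b/
```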

## orpo4ns.gguf is the fastest and is recommended; 2-bit versions are also provided but not recommended.

The imatrix.dat file was calculated over 1000 chunks of wikitext.train.raw (included in this repo).
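
For reference, mainline llama.cpp can generate a comparable importance matrix with its `imatrix` tool; a sketch under the settings described above (the F16 input filename is illustrative, and exact flags can vary between llama.cpp versions):

```
# compute an importance matrix over 1000 chunks of the wikitext training text
./imatrix -m zephyr-orpo-141b-f16.gguf -f wikitext.train.raw -o imatrix.dat --chunks 1000 -t 48
```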

I wrote a bit of custom C++ to avoid quantizing certain layers; the output is tested fully compatible with mainline llama.cpp as of 10 April 2024.
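
The layer-skipping logic lives in the fork, but the imatrix itself plugs into the stock `quantize` tool; roughly, the quantization step looks like this (filenames and the quant type are illustrative, not the exact recipe used here):

```
# quantize the F16 model guided by the importance matrix
./quantize --imatrix imatrix.dat zephyr-orpo-141b-f16.gguf orpo4ns.gguf IQ4_NL 48
```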

I'm no longer using the gguf-split tensor sharding because the memory swapping slows down GPU inference a lot.
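
Since the parts are plain byte splits rather than gguf-split shards (which is why `cat` can rejoin them), they can be produced with something like GNU `split`; a sketch, with the chunk size chosen only to stay under upload limits:

```
# cut the merged gguf into raw ~48GB chunks named orpo4ns.gguf.part0, .part1, ...
split -b 48000m -d -a 1 orpo4ns.gguf orpo4ns.gguf.part
```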

# Run with llama.cpp 

```
# clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/ && make -j

# run a short generation with the 4-bit quant (adjust -t to your core count)
./main -m orpo4ns.gguf -n 256 -t 64 --temp 0.2 --color -p "How to build a city on mars via aldrin cycler orbit shipments?"
```
# Perplexity benchmarks

This is the command I used to run these on a 48-core, CPU-only machine; you can add -ngl 16 to offload 16 (or more) layers to your GPU.

```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 48 ``` 
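
The same run with some layers offloaded to a GPU, as mentioned above (the layer count is only an example; pick whatever fits your VRAM):

```./perplexity -m orpo4ns.gguf -f wiki.test.raw --chunks 12 -t 16 -ngl 16 ```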

# Lower is better. The F16 baseline is ~2.3; the 3-bit 58GB version is surprisingly not far off.
# orpo4ns.gguf is the fastest because of the 4-bit/8-bit optimizations in most hardware.

```bash
orpo4ns.gguf FILESIZE: 71260 MB
[1]2.6970,[2]3.1781,[3]3.7390,[4]3.4159,[5]2.8977,[6]2.7126,[7]2.5597,[8]2.5013,[9]2.5279,[10]2.5175,[11]2.5315,[12]2.5455,
Final estimate: PPL = 2.5455 +/- 0.07697

orpo3ns.gguf FILESIZE: 58536 MB
[1]2.8042,[2]3.3418,[3]3.9400,[4]3.5859,[5]3.2042,[6]3.0524,[7]2.9738,[8]2.9695,[9]3.0232,[10]3.0099,[11]3.0510,[12]3.0589,
Final estimate: PPL = 3.0589 +/- 0.09882

orpo3nm.gguf FILESIZE: 60828 MB
[1]2.8435,[2]3.2998,[3]3.8984,[4]3.4821,[5]3.1084,[6]2.9597,[7]2.8[9]2.9155,[10]2.9218,[11]2.9613,[12]2.9709,
Final estimate: PPL = 2.9709 +/- 0.09419

orpo3nl.gguf FILESIZE: 65405 MB
[1]2.8175,[2]3.2506,[3]3.8241,[4]3.4152,[5]2.9970,[6]2.8455,[7]2.7358,[8]2.7120,[9]2.7955,[10]2.8003,[11]2.8254,[12]2.8371,
Final estimate: PPL = 2.8371 +/- 0.08781

orpo2n.gguf FILESIZE: 49420 MB
[1]3.0082,[2]3.5829,[3]4.1414,[4]4.1671,[5]3.8567,[6]3.7209,[7]3.7150,[8]3.7210,[9]3.8445,[10]3.9332,[11]4.0879,[12]4.0884,
Final estimate: PPL = 4.0884 +/- 0.1499

orpo2ns.gguf FILESIZE: 44026 MB
[1]3.0077,[2]3.5575,[3]4.1028,[4]4.4088,[5]4.2206,[6]4.1056,[7]4.1029,[8]4.1305,[9]4.1791,[10]4.3247,[11]4.4759,[12]4.4659,
Final estimate: PPL = 4.4659 +/- 0.16582
```
People on Twitter seem very happy with the 4-bit version, getting about 3x higher speeds (13.01 tok/s on an M3 Max MacBook) than 4-bit MLX (4.5 tok/s).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/id32eagz3KNxiK3NC6cTv.png)


# The 3-bit version is surprisingly usable even though it is only 58GB. Use 3ns or 3nm if you have a 64GB Mac.