# FlatDolphinMaid-8x7B 4bpw

Exllama quant of [Undi95/FlatDolphinMaid-8x7B](https://huggingface.co/Undi95/FlatDolphinMaid-8x7B)

You probably want the [3.5bpw](https://huggingface.co/Kooten/FlatDolphinMaid-8x7B-3.5bpw-exl2) version. It just fits in 24GB of VRAM at half context (16384).
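
If you are unsure how to grab a quant, here is a minimal download sketch using the `huggingface_hub` Python library (the repo id comes from the link above; the local path is just an example):

```python
from huggingface_hub import snapshot_download

# Download the 3.5bpw quant into a local folder (path is an example)
model_dir = snapshot_download(
    repo_id="Kooten/FlatDolphinMaid-8x7B-3.5bpw-exl2",
    local_dir="models/FlatDolphinMaid-8x7B-3.5bpw-exl2",
)
print(model_dir)
```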

If you really want the larger context, the [3bpw](https://huggingface.co/Kooten/FlatDolphinMaid-8x7B-3bpw-exl2) should do it, but you are probably better off with the GGUF version at higher quants.

I did make a [4bpw](https://huggingface.co/Kooten/FlatDolphinMaid-8x7B-4bpw-exl2); it might work in a headless or multi-GPU setup.

Other BPWs: [3.0bpw](https://huggingface.co/Kooten/FlatDolphinMaid-8x7B-3bpw-exl2), [3.5bpw](https://huggingface.co/Kooten/FlatDolphinMaid-8x7B-3.5bpw-exl2), [4.0bpw](https://huggingface.co/Kooten/FlatDolphinMaid-8x7B-4bpw-exl2)

Make sure you **enable the 8-bit cache**.
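
For reference, a minimal loading sketch with the `exllamav2` Python library, assuming the 3.5bpw quant downloaded above; class names follow exllamav2's examples and may differ between versions, and the 16384 value mirrors the half-context note above. In front ends like text-generation-webui, the 8-bit cache is usually a `cache_8bit` option instead.

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_8bit,  # 8-bit cache: roughly halves KV-cache VRAM vs FP16
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/FlatDolphinMaid-8x7B-3.5bpw-exl2"  # example local path
config.prepare()
config.max_seq_len = 16384  # half context, per the note above

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # this is where the 8-bit cache is enabled
model.load_autosplit(cache)  # splits weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

print(generator.generate_simple("Once upon a time", settings, 64))
```

Halving the cache is presumably what lets the 3.5bpw quant "just fit" at 16384 context on a 24GB card.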

### Prompt format: