Any chance for an EXL2 version?
@LoneStriker and @Panchovix quantized EXL2 versions of Goliath 120B. Any chance we could get an EXL2 version of this model, too? It's tuned on top of Goliath and in my tests it's also at the top of my rankings!
I can put it in the queue. 120Bs are monsters to chew through though.
If LoneStriker takes it into his queue, then nice! And I can confirm, it takes a good while to do an EXL2 quant of a 120B model.
Thanks to both of you! 120B is definitely a beast, but even down to 3-bit it's still beating all the 70Bs and with EXL2 it even runs nicely fast. So looking forward to this, thanks a lot!
Various bit rates should appear here (3.0 and 4.5 are done, other variants still quantizing):
https://huggingface.co/models?search=lonestriker%20tess-xl
Awesome job! Can we also get a sample script on how to run it with EXL2?
Most people will just run it under ooba text gen webui using the exllamav2 loader. The one setting you may need to change, however, is to uncheck this option:
For low-bit-rate quants, the ooba web interface will spit out gibberish if it is left checked (it's on by default).
If you want to use it programmatically in Python, you can use Turboderp's exllamav2 project. Script here:
https://github.com/turboderp/exllamav2/blob/master/examples/chat.py
You can run the 3.0bpw quant on 2x 3090s/4090s at a reasonably fast inference speed.
What I do, at least, for easier testing is to use exui (from the same developer as exllama): after downloading the model directly, I load it with that UI.
https://github.com/turboderp/exui
You can also use the original exllamav2 project, for example to run some benchmarks or to install it from source for use in other backends: https://github.com/turboderp/exllamav2
exui is recommended (though very few people know about it yet). One benefit is that you can use speculative decoding, as @Panchovix shows above. You basically get a 50-100% speedup in exchange for a small bit of VRAM to run the draft model.
The 3.0 quants have been saving my butt. They give normal 70B speeds and what feels like 80-90% of the quality of 4KS/4KM GGUF. Only 3500 or so tokens of context, though.
Wow, that was faster than I expected. Thanks a lot, @LoneStriker !
And thanks for the recommendation of exui, @Panchovix - speculative decoding sounds interesting and useful. But that UI doesn't have an API, does it? My frontend is SillyTavern, so I need a backend that can be used with it, either something OpenAI API-compatible or e.g. ooba text gen webui (which is now OpenAI API-compatible, too).
@wolfram in that case I suggest using tabbyAPI: https://github.com/theroyallab/tabbyAPI
It is a very lightweight API loader for exllamav2/gptq models, and it will work with ST.
Though it isn't as easy as ooba.
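For anyone who wants to hit an OpenAI-compatible backend from a script rather than from SillyTavern, here is a minimal sketch of what such a request could look like. The base URL and port, the `/v1/completions` path, and the Bearer-token header are assumptions based on the usual OpenAI-style conventions, not something confirmed in this thread, and `build_completion_request` is a hypothetical helper; check your backend's own docs and config for the real values.

```python
import json
import urllib.request


def build_completion_request(base_url, api_key, prompt, max_tokens=200):
    """Build (but do not send) a POST request for an OpenAI-style completions endpoint.

    Assumptions: the server exposes /v1/completions and accepts a Bearer token.
    """
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Usage (needs a running server; address and key below are placeholders):
# req = build_completion_request("http://127.0.0.1:5000", "YOUR_API_KEY", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

Building the request separately from sending it makes it easy to inspect the URL and JSON body before pointing it at a real server.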
@Panchovix Why not ooba? Does tabbyAPI support speculative decoding or what would be the advantage?
ooba does not support exl2 speculative decoding. exui and tabbyAPI both support it.
3.0bpw barely fits in 48 GB. Even the tiniest draft model won't fit alongside it.
Yup, 3.0 is a tight fit, even with 8-bit cache enabled. If you want to go lower, grab a 2.18, 2.4 or 2.85 bpw model:
https://huggingface.co/LoneStriker?search_models=tess-xl
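A rough back-of-the-envelope check on why 3.0 bpw is such a tight fit: the weights alone take roughly (parameters × bits per weight ÷ 8) bytes. The sketch below assumes a nominal 120e9 parameters (the exact count is not stated in this thread) and counts weights only, so the KV cache, activations, and per-GPU overhead all come on top.

```python
def weight_size_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bpw / 8 / 1e9


# Assumption: nominal "120B" parameter count; the real model may differ slightly.
N_PARAMS = 120e9

for bpw in (2.18, 2.4, 2.85, 3.0, 4.5):
    size = weight_size_gb(N_PARAMS, bpw)
    print(f"{bpw:>4} bpw -> ~{size:.1f} GB of weights (cache and overhead not included)")
```

Under these assumptions, 3.0 bpw comes out at about 45 GB of weights, which leaves very little of a 48 GB budget for context or a draft model, while the lower-bpw variants free up a few GB each.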