exl2-2 please?

#1
by Thireus - opened

Would you be able to issue an exl2-2 version of this 6.0bpw model please? :)

I'll put it on my list. I'm not sure how much improvement we'll see for the 6bpw model. For the lower bpw models, things are definitely better.

Indeed, I don't expect much improvement either, but I'm very curious to see the results. Thank you.

It's up, but with a caveat: the quant enhancements have not been finalized, so there's a chance we may have to redo the quants. Worth a test though to compare:
https://huggingface.co/LoneStriker/dolphin-2.2-70b-6.0bpw-h6-exl2-2

Some improvement on wikitext ppl:

  • dolphin-2.2-70b-6.0bpw-h6-exl2: 3.9869189262390137
  • dolphin-2.2-70b-6.0bpw-h6-exl2-2: 3.9655632972717285
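
For anyone curious how a wikitext ppl figure like the ones above is produced: it's the exponentiated mean negative log-likelihood of each next token over the test text. Here's a minimal sketch of that computation (illustrative only; `model` is a stand-in for any HF-style causal LM, and for exl2 quants it's the exllamav2 repo's own eval scripts that actually produce these numbers):

```python
# Sketch of a perplexity computation over a pre-tokenized text stream.
# Hypothetical helper; chunking is simplified (no overlapping context windows).
import math
import torch

def perplexity(model, input_ids: torch.Tensor, stride: int = 2048) -> float:
    """exp(average negative log-likelihood per token) over the stream."""
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, input_ids.size(1) - 1, stride):
        chunk = input_ids[:, start : start + stride + 1]
        with torch.no_grad():
            logits = model(chunk[:, :-1]).logits        # next-token logits
        targets = chunk[:, 1:]                          # tokens to predict
        logprobs = torch.log_softmax(logits.float(), dim=-1)
        token_ll = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        nll_sum -= token_ll.sum().item()
        n_tokens += targets.numel()
    return math.exp(nll_sum / n_tokens)
```

Lower is better, so the exl2-2 quant is a small but real win even at 6.0bpw.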

When you say the quant enhancements have not been finalized, which step of the conversion process do you mean? Measuring quantization impact...? Would it be worth redoing the quants?

Turboderp was still finalizing the quantization enhancements. Initially, the new quants showed improvements but had instabilities at certain model sizes (such as 13B). He's gone back to using measurements, but faster than before. Both methods improve perplexity substantially, particularly at lower bpw. At this point, ~5bpw is nearly indistinguishable from fp16. I believe the quants I've re-done should be good. Going forward, I'll be using his latest measurement method. It should be merged into main shortly (if it hasn't been already).
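
For context on what "using measurements" means here: convert.py in the exllamav2 repo does a measurement pass over the model (scoring quantization error per layer) before committing to a bit allocation, and that measurement can be saved and reused across bitrates. A rough sketch of the two-pass flow, with made-up paths and flags as I recall them from around that version (check `python convert.py -h` for the current interface):

```python
# Illustrative only: paths are hypothetical, and flag behavior should be
# verified against the exllamav2 checkout you actually have.
import subprocess

MODEL_IN = "/models/dolphin-2.2-70b"   # fp16 source weights (made-up path)
WORK_DIR = "/tmp/exl2-work"            # scratch directory for the converter

# Pass 1: measure per-layer quantization impact and save the results.
subprocess.run([
    "python", "exllamav2/convert.py",
    "-i", MODEL_IN, "-o", WORK_DIR,
    "-om", f"{WORK_DIR}/measurement.json",   # write the measurement file
], check=True)

# Pass 2 (repeatable for any target bitrate): quantize using that measurement.
subprocess.run([
    "python", "exllamav2/convert.py",
    "-i", MODEL_IN, "-o", WORK_DIR,
    "-m", f"{WORK_DIR}/measurement.json",    # reuse instead of re-measuring
    "-b", "6.0",                             # target bits per weight
    "-hb", "6",                              # head bits (the "h6" in the name)
    "-cf", "/models/dolphin-2.2-70b-6.0bpw-h6-exl2-2",  # compiled output dir
], check=True)
```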

@LoneStriker do I understand correctly that the new quantization method should be in the newly released 0.0.11 (since the dev branch was completely merged into it)?
And that it's enabled when you don't specify a calibration dataset?

Yes, that's correct. By default you'll get the built-in calibration dataset, which is constructed from a diverse set of texts. You can still specify a calibration dataset if you wish, for example if your model uses a language not covered by the built-in one. But Turbo tried to include lots of different text types, so this is almost never needed.
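
Concretely, that just means omitting the calibration flag when converting. A sketch (hypothetical paths, and the `-c` flag name is from convert.py as I recall it, so double-check against the repo):

```python
# Illustrative only: shows the default vs. custom calibration choice.
import subprocess

base = [
    "python", "exllamav2/convert.py",
    "-i", "/models/my-model", "-o", "/tmp/exl2-work",
    "-cf", "/models/my-model-6.0bpw-h6-exl2",
    "-b", "6.0", "-hb", "6",
]

# Default: no -c flag, so the built-in multi-domain calibration data is used.
subprocess.run(base, check=True)

# Alternative: pass your own parquet file, e.g. for a language the built-in
# set doesn't cover (made-up path).
# subprocess.run(base + ["-c", "/data/my_calibration.parquet"], check=True)
```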
