request awq version

#1 opened by Komposter43

request awq version

I tried creating an AWQ version, but it seems harder than I thought. For a start, doing so requires 800 GB of RAM. That alone wouldn't be an issue, but all of this RAM apparently needs to be on a single node, and the machine with the most RAM I have only has 512 GB. It also doesn't seem possible to stream the model from SSD or use swap. Maybe swap could work in principle, but for me it froze my system, so I gave up after it made no progress for half an hour.

I then moved to RunPod, rented a 4x L40 container, and followed https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4#quantization-reproduction. There I was affected by https://github.com/casper-hansen/AutoAWQ/issues/558 (PR: https://github.com/huggingface/transformers/pull/33742), which I fixed using pip install git+https://github.com/davedgd/transformers@patch-1. Unfortunately, I then had to cancel my attempt when I realized that running the AWQ quantization would take 10 hours, which would cost me around $50 and seems a bit much. If you have a very good reason why you need an AWQ version, I could consider making one despite the cost.
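For reference, the core of that reproduction recipe looks roughly like this (a minimal sketch assuming AutoAWQ plus the patched transformers mentioned above; the model and output paths are placeholders, not the actual repo names):

```python
# Minimal AWQ INT4 quantization sketch, following the hugging-quants
# reproduction steps. Paths below are placeholders.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "path/to/Meta-Llama-3.1-405B-Instruct-Uncensored"  # placeholder
quant_path = "Meta-Llama-3.1-405B-Instruct-Uncensored-AWQ-INT4"  # placeholder

quant_config = {
    "zero_point": True,   # asymmetric quantization
    "q_group_size": 128,  # group size for the 4-bit weights
    "w_bit": 4,           # INT4 weights
    "version": "GEMM",
}

# Loading the full checkpoint into system memory is what needs the ~800 GB
# of RAM on a single node.
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration and quantization; this is the multi-hour step.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```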

I recommend you instead use the i1 GGUF version from mradermacher, available at https://huggingface.co/mradermacher/Meta-Llama-3.1-405B-Instruct-Uncensored-i1-GGUF. I really don't see why you would want to use AWQ over GGUF.
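If it helps, running one of those GGUF quants could look something like this with llama-cpp-python (a minimal sketch; the filename is hypothetical, pick whichever i1 quant fits your hardware, and note that the larger multi-part quants need their .partXofY files concatenated into a single .gguf before loading):

```python
from llama_cpp import Llama

# Hypothetical filename; choose a quant level that fits your VRAM/RAM.
llm = Llama(
    model_path="Meta-Llama-3.1-405B-Instruct-Uncensored.i1-IQ2_M.gguf",
    n_gpu_layers=-1,  # offload as many layers as fit on the GPU
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```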
