Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training

#3
by mradermacher - opened

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I assume that is in GB and not GiB, in which case 474 GiB might fit, as we have 503 GiB of RAM (after subtracting the RAM reserved for hardware), but it would be extremely tight given the RAM required for context.
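A quick back-of-the-envelope check, assuming the reported 509G is decimal gigabytes:

```python
# Back-of-the-envelope check (assuming the listed 509G means decimal GB):
size_bytes = 509 * 10**9
size_gib = size_bytes / 2**30
print(f"{size_gib:.0f} GiB")  # ~474 GiB, against 503 GiB of usable RAM
```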

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

Q6_K is fine for me. Q8_0 might not fit without offloading, and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus, there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally, you should give it a try and see if it fits, but if not, Q6_K is fine for me.

I just checked, and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf, so please give it a try and see if it fits. I believe it should fit if nothing else is running, as the model has such a small number of layers. If it doesn't fit, use Q6_K instead.

474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I'll try an offload of 1 and 0, then Q6. Hopefully it does not crash.

I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

457.4g after warming up.

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)

llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1 and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?


I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
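A rough way to see that difference on Linux is to compare file-backed versus anonymous resident memory for the two kinds of processes; a minimal sketch, with placeholder PIDs:

```python
# Minimal sketch: file-backed RSS (the mmap'd model pages imatrix reads) can be
# reclaimed under memory pressure, while anonymous RSS (the buffers the
# quantisation tasks allocate) cannot. Linux only; the PIDs are placeholders.
def rss_breakdown(pid: int) -> dict[str, int]:
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("VmRSS", "RssAnon", "RssFile", "RssShmem"):
                fields[key] = int(rest.split()[0])  # values are in kB
    return fields

# Hypothetical usage:
# rss_breakdown(12345)  # imatrix: mostly RssFile
# rss_breakdown(23456)  # quantize: mostly RssAnon
```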


dstat shows a relatively high disk read rate, so imatrix might now be streaming from SSD.


Yes, it is clearly streaming from SSD now.


Once the quantisation tasks are interrupted it should work without SSD streaming again.

Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell Richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)

In other news, my main home server (which I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant; I practically couldn't do anything.

And if we don't talk to each other much, Merry Christmas and a Happy New Year :)

I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.

It seems to work. That means we will soon be down to exactly one create_repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.

At the very least, though, we regain the ability to upload quants separately, without being rate-limited by those calls.

(D'oh, forgot the README patcher)
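For what it's worth, the kind of call such a replacement could be built around is the plain file upload from huggingface_hub, which only writes into an already-existing repo; a minimal sketch (repo id and file name are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()

# Sketch: push one quant into a repo that already exists. Unlike the stock
# huggingface-cli path (which triggers the implicit create_repo call discussed
# above), upload_file only consumes upload quota, so a single explicit
# create_repo per new repo remains.
api.upload_file(
    path_or_fileobj="snowflake-arctic-instruct.Q6_K.gguf",   # placeholder
    path_in_repo="snowflake-arctic-instruct.Q6_K.gguf",
    repo_id="mradermacher/snowflake-arctic-instruct-GGUF",   # placeholder
    repo_type="model",
)
```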

Not a single rate limit hit tonight, so it seems it worked: what was counted was the total number of calls to create_repo, whether successful or not, but fortunately not the rate-limited calls themselves.

On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.

Maybe I can finally find some time to link the download page before it becomes obsolete.

gee, found another locking bug that kept jobs from being started all night.

Since I am still not in such great shape, I opted to kill all the processes holding locks, and this got it going again, but without a post-mortem. We'll have to do this a few more times, I guess, to find the issue ...

gee, found another locking bug that kept jobs from being started all night.

Awesome to hear that you were able to find and fix another locking bug. I can only imagine how complex maintaining this entire system must be. I wrote a distributed system for the satellite project I'm doing together with Richard, where we have around 30 concurrent workers that often only stay for a few hours, and there were so many edge cases to consider.

Don't know if I can do it, but I plan to queue more models before the queue dries out - otherwise, I'll have to tell Richard that soon his box will be idle and needs to be taken over, and then a short time later, I will beg to get exclusive access again :)

Richard would for sure appreciate it if you can keep fully utilizing his server and don't run out of models for him to quant. If the queue gets too small, you can maybe make it so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first, which to my knowledge are the only servers where someone has to pay for electricity.
Just so you know, we are currently also using the same server that hosts rich1 for a satellite project worker, so when we had that rich1 LXC outage, we just scaled up the satellite worker to use all resources and scaled it down again once your LXC container was fixed. I'm sure Richard will always find some other temporary use for this server should the queue ever run dry. I also have quite close contact with him, so don't worry about it.

In other news, my main home server (which I need for breathing and basic survival, figuratively speaking :) is restored to a state where I can actually use it in read-write again. Doesn't mean much to you, but the last weeks were... unpleasant; I practically couldn't do anything.

I'm so glad to hear that. This for sure must have been a really bad time for you.

I think I'll make an attempt at a huggingface-cli replacement that doesn't call create_repo.

That sounds like a great idea.

It seems to work. That means we will soon be down to exactly one create_repo call per repo we create. However, I have the suspicion that maybe only successful calls are counted, in which case we just hit the limit. The only thing we could do then is mix static and imatrix quants to halve the creation rate. Or sit it out and hope for the best.
At the very least, though, we regain the ability to upload quants separately, without being rate-limited by those calls.
Not a single rate limit hit tonight, so it seems it worked: what was counted was the total number of calls to create_repo, whether successful or not, but fortunately not the rate-limited calls themselves.
On the other hand, we had some big models, so fewer repo creations overall. But we didn't even get through the newly queued models yesterday due to the rate limit.

Wow thanks a lot! This is awesome. I'm so happy we managed to find a workaround to avoid the rate limit.

Maybe I can finally find some time to link the download page before it becomes obsolete.

It would be really cool if you could do so. I love your download page! It would be great if you could show me an example before you do all of them, as this might be the last time we change all the model cards, so it needs to be as good as possible. Something else I noticed is that sometimes our quants appear as "Finetunes" instead of "Quantizations" in the parent model, as can be seen in https://huggingface.co/models?other=base_model:finetune:nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored - maybe this can be fixed as well when we have to update all model cards anyway.
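If it helps, that listing is controlled by the base_model_relation field in the model card metadata (when it is absent, the Hub guesses), so the fix could be folded into the same model card update; a sketch, with the repo id as an example only:

```python
from huggingface_hub import metadata_update

# Sketch: explicitly tag the quant repo as a quantization of its parent so the
# Hub lists it under "Quantizations" rather than "Finetunes".
# The repo id below is only an example, not an actual repo from this thread.
metadata_update(
    "mradermacher/Meta-Llama-3.1-405B-Instruct-Uncensored-GGUF",
    {
        "base_model": "nicoboss/Meta-Llama-3.1-405B-Instruct-Uncensored",
        "base_model_relation": "quantized",
    },
    overwrite=True,
)
```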

And if we don't talk to each other much, Merry Christmas and a Happy New Year :)

I wish you a happy new year as well!

I can only imagine how complex maintaining this entire system must be.

The problem is that code is constantly added and algorithms changed while the system is running :-)

[download page] It would be great if you can show me an example before

I hope I can do it incrementally, e.g. for new models only at first. But yeah, I'll try to ask for your opinion. If you wish, you can even make a suggestion - I want to use some custom CSS to make a small box with only the link and some very short explanation, such as "Compare static/imatrix quants, download and search on our... [[[Model summary page for this model]]]" or so. Suggestions or even outright examples are welcome :*)

so all the remaining models are getting priority scheduled to rich1 so marco and nico1 run dry first

The problem is in the specifics. Unless requested, models get queued in batches, and then we have two choices: leave a model in the queue, or queue it somewhere. In the latter case, we can choose where to queue.

If rich1 simply had priority, it would accept all models until the budget is full or the queue size limit is reached, neither of which is likely for daily batches, and also not desirable. At the moment, it is kind of distributed by speed, as nodes do static quants first, so faster nodes gobble up more jobs.

We need some kind of back pressure/weighting. And something like this is implemented (differently for models with nice <= 50), but it wouldn't be able to avoid scheduling on nico1 or marco. The current scheduling restrictions on nico1 are nice, because they mostly answer the question at night, and I will put a different scheduling restriction on marco (basically take it out completely once our queue is usually empty).

The only solution, I am afraid, is to essentially block nico1 completely (other than imatrix generation). And that might be doable; after all, we did this for many months. Or we only manually schedule jobs on nico1. Or only bigger jobs, which would be delayed on the much slower rich1 (which might also conceivably be busy with other jobs, as it is effectively a shared server). Something like that. Share your thoughts :)
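To make the back-pressure idea a bit more concrete, one possible shape for such a weighting, purely as a hypothetical sketch (the node names are from this thread; the speeds, penalties and scoring rule are invented, not the real scheduler):

```python
# Hypothetical back-pressure sketch: cheap/shared nodes stay attractive until
# their backlog grows; paid-electricity nodes carry a fixed penalty.
# The numbers below are invented for illustration only.
NODES = {
    # name: (relative speed, cost penalty)
    "rich1": (1.0, 0.0),
    "marco": (1.5, 2.0),
    "nico1": (3.0, 2.0),
}

def pick_node(queue_depth: dict[str, int]) -> str:
    def score(name: str) -> float:
        speed, penalty = NODES[name]
        # Deeper queues and higher cost make a node less attractive.
        return queue_depth.get(name, 0) / speed + penalty
    return min(NODES, key=score)

# With a shallow backlog, rich1 keeps winning; only once its queue grows does
# the penalty on the paid-electricity nodes get outweighed.
print(pick_node({"rich1": 1, "marco": 0, "nico1": 0}))  # -> rich1
print(pick_node({"rich1": 9, "marco": 0, "nico1": 0}))  # -> marco (or nico1)
```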

gpus@nico1

As an unrelated side note, for a few days now I have been using only one graphics card on purpose, except when I was in trouble (because of scheduling or downtime issues unrelated to the actual model workload), and at the moment, one gfx card is sufficient.

I really do plan to queue a few more months' worth of models before the queue runs dry, though.

Update: Yeah, I think that's it - disable automatic quanting on nico1 except maybe for requested models (nice <= -1000), hand-queued models, and very big models.
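Written out as a rule, that would be roughly the following (the nice threshold comes from the message above; the size cutoff is an assumed placeholder):

```python
# Sketch of the proposed nico1 policy: only requested (nice <= -1000),
# hand-queued, or very big models get scheduled there automatically.
# The 200B "very big" cutoff is an assumption for illustration.
def eligible_for_nico1(nice: int, hand_queued: bool, size_b: float) -> bool:
    return nice <= -1000 or hand_queued or size_b >= 200
```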
