Join the force
Hello @mradermacher , as you noticed we have been competing on the number of models for quite a while. So instead of competing, want to join forces? I talked to @nicoboss , he is up for it, and I have my quant server ready for you with 2 big bananas (E5-2697Av4), 64 GB of RAM, and a 10 Gbps line!
Well, "take what I have" and "join forces" are not exactly the same thing. When we talked last about it, I realised we were doing very different things and thought diversity is good, especially when I actually saw what models you quantize and how :) BTW, I am far from beating your amount of models (remember, I have roughly two repos per model, so you have twice the amount), and wasn't in the business of competing, as it was clear I couldn't :)
But of course, I won't say no to such an offer, especially not at this moment (if you have seen my queue recently...).
So how do we go about it? Nico runs some virtualisation solution, and we decided on a Linux container to be able to access his graphics cards, but since direct hardware access is not a concern here, a more traditional VM would probably be the simplest option. I could give you an image, or you could create a VM with Debian 12/bookworm and my ssh key on it (nico can just copy the authorized_keys file).
Or, if you have any other ideas, let's talk.
Oh, and how much diskspace are you willing to give me? :)
Otherwise, welcome to team mradermacher. Really should have called it something else in the beginning.
Ah, and as for network access, I only need some port to reach ssh, and to be able to get a tunnel out (wireguard, udp). Having a random port go to the VM ssh port and forwarding UDP port 7103 to the same VM port would be ideal. I can help with all that, and am open to alternative arrangements, but I have total trust in you that you can figure everything out :)
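Roughly, what I have in mind on the host side is something like this (just a sketch; the VM address and the external ssh port are made-up placeholders, and plain iptables commands would do just as well):

```python
# Rough sketch of the forwarding described above. Placeholders: 10.0.0.2 for the
# VM/container, 22222 for the "random" external ssh port. These are ordinary
# iptables DNAT rules on the host; the Python wrapper is only for illustration.
import subprocess

VM_IP = "10.0.0.2"           # placeholder address of the VM/container
EXTERNAL_SSH_PORT = "22222"  # placeholder external port that should reach the VM's sshd

rules = [
    # TCP: external port 22222 -> VM port 22 (ssh)
    ["iptables", "-t", "nat", "-A", "PREROUTING", "-p", "tcp",
     "--dport", EXTERNAL_SSH_PORT, "-j", "DNAT", "--to-destination", f"{VM_IP}:22"],
    # UDP: port 7103 -> the same port on the VM (wireguard tunnel endpoint)
    ["iptables", "-t", "nat", "-A", "PREROUTING", "-p", "udp",
     "--dport", "7103", "-j", "DNAT", "--to-destination", f"{VM_IP}:7103"],
    # allow the forwarded packets through the FORWARD chain
    ["iptables", "-A", "FORWARD", "-d", VM_IP, "-j", "ACCEPT"],
]

for rule in rules:
    subprocess.run(rule, check=True)  # needs root; assumes IP forwarding is enabled
```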
No worries, I will help him set up everything infrastructure-wise. He already successfully created a Debian 12 LXC container. While a VM might be easier, those few percent of lost performance bother me, but if you prefer a VM I can also help him with that.
LXC sits perfectly well with me.
this brings me joy
@mradermacher Your new server "richard1" is ready. Make sure to abuse the internet as hard as you can. Details were provided by email by @nicoboss , so please check it as soon as you can
Oh, and how much diskspace are you willing to give me? :)
2 TB of SSD, as this is all he has. Some resources are currently still in use by his own quantize tasks, but those should be freed by tomorrow once the models currently being processed are done; just start your own tasks as soon as the container is ready. He is also running a satellite imagery data processing project for me for the next few weeks, but its resource usage will be minimal. Just go all in and try to use as much of this server's resources as you can. For his quantization tasks he usually runs 10 models in parallel and uses an increased number of connections to download them, in order to make optimal use of all available resources.
I'm on it. Wow, load average of 700 :)
Hahaha, switching to bbr instantly reduced bandwidth to 600kBps :)
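(For reference, "switching to bbr" is just the usual congestion-control sysctl; a sketch of the idea, not the exact commands I used - it needs root and the tcp_bbr module:)

```python
# Sketch: inspect/change the TCP congestion control algorithm via /proc.
# Writing requires root, and the tcp_bbr kernel module must be available.
from pathlib import Path

cc = Path("/proc/sys/net/ipv4/tcp_congestion_control")
available = Path("/proc/sys/net/ipv4/tcp_available_congestion_control")

print("available:", available.read_text().split())
print("current:  ", cc.read_text().strip())
# cc.write_text("bbr")    # switch new connections to bbr
# cc.write_text("cubic")  # or back to the usual default
```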
I've reduced the disk budget to 500GB+500 extra, which reflects the current abilities (1TB disk). That might mean it will take quite a while for quants to be able to start again, but it will probably not run out of diskspace. At least the speed has increased to ~40MB/s and will probably increase further while I am asleep.
Unfortunately, I can't really fix the situation, now that rich1 has been paused in a weird way. That's really bad timing.
Right, the pause script wasn't fully updated for rich1. I've fixed it and interrupted the jobs, as that seems to have been the intent of pausing. Of course, jobs will not easily clear at the moment. Very unfortunate timing.
Maybe the frustration and tiredness is speaking out of me right now, but I am close to saying: thanks for the generous offer (it is generous), but I don't think I can get enough use out of it. The constant unexplained/uncoordinated tinkering makes rich1 a fast moving and ever-changing target. Right now, my container is practically idle and the box swaps at >100MBps, even though supposedly it should be idle. The unreasonable requirements put on me, the obvious inability of the hardware to fulfill the requirements, and now the uncommunicated pause just at the very moment where things could have cleared up... I don't think I can handle that.
Sorry, I paused it because an hour ago there was 800 Mbit/s of incoming traffic for over 15 minutes from AWS CloudFront with the LXC container as destination, despite no HuggingFace downloading task or any other process downloading a significant amount of data running. This was only visible from the host's perspective and invisible when investigating from within the LXC container, meaning the traffic was likely just discarded, as there was no process in the LXC container to receive it. We wanted to see if pausing the tasks or even rebooting the LXC container would fix this, as we thought it could be a possible cause of today's internet issues, but in the meantime the incoming traffic stopped by itself.
I thought you made the pause script to be used in such situations and wasn't aware of the troubles pausing a host causes for you. Good that I'm aware of it now, as I even intended to use it to pause tasks while doing performance measurements for the eval project on StormPeak (the host of nico1). I really thought pausing/resuming a host is a relatively safe operation and never thought of it causing so much trouble. Sorry, this was all so avoidable and we could have just waited for the issue to fix itself, but that obviously wasn't known at the time. I resumed rich1 again.
unreasonable requirements put on me
the obvious inability of the hardware to fulfill the requirements
The idea of keeping the server CPU busy is, I think, not doable
I agree. It was a mistake pushing you so hard to optimize it. The server rent is relatively expensive, so Richard wants to see it used as much as possible, and with him being so generous in letting us use it I wanted to satisfy him. I was not aware that fully utilizing it was impossible, as on paper the hardware sounds great but in reality it turned out worse than expected. Please forget about any requirements and just make use of it as well as you can. Generally, please don't take my "requirements" so seriously. See them more as suggestions/recommendations which you can always decline. This is not a job, and we are all just doing what we believe is best for this project with the time and resources we are willing to invest. You don't have to impress me with your awesome abilities. I already know that you are fantastic.
The constant unexplained/uncoordinated tinkering makes rich1 a fast moving and ever-changing target.
I'm so sorry for this. I only wanted to help, but maybe it is better if I just leave rich1 alone, as otherwise there are just too many people working on it at the same time.
Right now, my container is practically idle and the box swaps at >100MBps, even though supposedly it should be idle.
This was probably because at this exact moment Richard was testing an improved version of the satellite script on the host. A project we should likely just abandon to not make rich1 even less stable than it already is.
We decided to indefinitely pause the satellite project for the sake of rich1 stability.
I thought you made the pause script to be used in such situations and wasn't aware of the troubles pausing a host causes for you.
I was just very frustrated and tired and had to vent. The pause script was made for this, and thanks to trying it out I was able to find a bug with it (interrupting didn't work because the paths are different and the script was not adjusted).
intended to use it to pause tasks while doing performance measurements for the eval project on StormPeak (the host of nico1)
I had zero issues with this on nico1, and of course you can use it and do it.
The frustrating aspect is that rich1 requires constant babysitting - partially because it has unique and hard-to-fulfill constraints. For example, the low-nice Athene* job was sitting at the front of the queue and couldn't proceed due to the low speed of model transfer. Unfortunately, all the queueing rules were made for high-nice-level/interactive tasks, most of which have been changed, but not all.
So what happened in this case is that the next job was just a few GB too large to fit, and the scheduler would not skip it and schedule a smaller, later job. This is to avoid priority inversion (a single large model in front of the queue would constantly be overtaken by smaller models), and to some extent this is also important for the low-priority jobs, but the result would have been that rich1 would run out of jobs. Since I didn't want this, I let some models through, but unfortunately, too many went through. Not too many by the scheduling rules, but too many because there was about 1TB of queued uploads, with consequently about 800GB of unaccounted extra space missing. This kind of would have worked, but badly, so I tried to schedule a few jobs manually, didn't notice that the queue was stopped, and altogether this was extremely frustrating for me.
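To illustrate the "no skipping" rule (an illustrative sketch only, not my actual scheduler code, and the sizes are invented):

```python
# Head-of-queue rule: only the frontmost job is considered, so a big model that
# doesn't fit blocks the queue instead of being perpetually overtaken by smaller
# models behind it (priority inversion). All sizes below are made up.
from collections import deque

def pick_next(queue, free_gb):
    """Return the next job to start, or None if the head does not fit yet."""
    if not queue:
        return None
    if queue[0]["size_gb"] <= free_gb:
        return queue.popleft()
    # Deliberately no scan further down the queue.
    return None

jobs = deque([
    {"name": "big-low-nice-model", "size_gb": 140},  # a few GB too large to fit
    {"name": "small-model-a", "size_gb": 15},
    {"name": "small-model-b", "size_gb": 9},
])

print(pick_next(jobs, free_gb=130))  # None -> the node sits idle until space frees up
```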
The result was the very thing I wanted to avoid: rich1 becoming completely idle.
800 Mbit/s of incoming traffic for over 15 minutes from AWS CloudFront with the LXC container
That is very strange - some process somewhere must have accepted it, though - AWS wouldn't keep sending (TCP) traffic if it got no ACKs, or if it got an RST, I would assume.
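If it happens again, something like this run inside the container would show whether any local socket could actually be terminating that traffic (just a sketch I haven't run, and it assumes the third-party psutil package is installed):

```python
# Sketch: list established TCP connections and their owning processes inside the
# container, to see whether anything local could be ACKing a CloudFront stream.
# Assumes the third-party psutil package; run as root to see all processes.
import psutil

for conn in psutil.net_connections(kind="tcp"):
    if conn.status != psutil.CONN_ESTABLISHED or not conn.raddr:
        continue
    name = psutil.Process(conn.pid).name() if conn.pid else "?"
    print(f"{conn.laddr.ip}:{conn.laddr.port} <- {conn.raddr.ip}:{conn.raddr.port} ({name})")
```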
It was a mistake pushing you so hard to optimize it.
I think it was asking for the impossible. The server is simply not up to the task, at least on some days. If there was endless network bandwidth, this wouldn't have happened. If the disk was infinitely large, we could queue as many upload tasks as we want. If the disk was infinitely fast and memory was infinite, we could run many quant tasks in parallel.
Even the most expensive server will be limited in what it can do without internet. However, rich1 can be extremely useful, if its limitations are accepted.
I think the problem is one of coordination - on nico1, there is a lot you can do without disturbing me, and it's pretty painless to pause things, slow things down etc. And you have more than half of the day where activity is very limited by itself (the night). And not least, you usually communicate with me before big changes.
On rich1, coexistence is much more limited - if I run two (big) quant jobs, that essentially monopolises the memory. The shifting conditions (the network connection slowing down to <10MBps every weekday afternoon, apparently) provide unique challenges. And the whole outside being a black box makes it hard to adjust.
We would need to coordinate more effectively. For example, if Richard finds a task that requires less network usage than quanting, it would make total sense for me to run at most one quant task, to reduce memory usage and leave half or more of the memory for these other tasks, which could in turn fill the gaps in CPU usage that are caused by "bad weather" etc.
Another issue is understanding - rich1 has a relatively old CPU with rather bad hyperthreading, so when Linux says it is 50% idle, it's probably more like 5% idle. With that understanding, the idle time really isn't so bad, in my interpretation - the 32 cores are usually busy. What does cause problems is many uploads (which are not free in terms of CPU and other resources such as disk) and other jobs such as noquant (which slows down the disk), which in turn makes it hard for rich1 to keep the CPUs busy.
Also, the disk, even if it "only" does ~1GBps, is very, very good, given the constant hammering and writes it gets. One just mustn't expect the impossible of the hardware, and should see what it can actually do.
A project we should likely just abandon to not make rich1 even less stable than it already is.
Depends. I think what's needed is more coordination - what resources does the satellite job need? If it is memory, we could limit the quanting to fewer jobs, or we could queue smaller models (but of course, the value of rich1 is that it has reasonable speeds for big models - it's likely faster than db1 and db2 together). And if there is then 10% idle time, or a few minutes of higher idle time because of disk I/O, so be it.
Or we could schedule smaller models again on rich1. It's not as fast as it might look - nico1 is probably about 5 times as fast, to put that into perspective. Smaller models would reduce memory pressure, freeing it for other tasks.
And if there are no other tasks for a while, we can adjust the parameters again. I can even make them dynamically adjustable by you. But again, even once adjusted, it can take quite a while for the queue to clear etc., so some planning/coordination is required.
I certainly do not have to hog this hardware completely. Or at all times.
Also, you mentioning it so often makes me super curious, want to share what that cool-sounding satellite project is to satisfy my morbid curiosity? :)