Inference with Dolly v2 on CPU using GGML
I have been able to run Dolly v2 on my machine with only a CPU and 16GB of RAM by using the ggml model format. Someone has put together an example of converting the original Dolly model files to the ggml file format.
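For anyone curious what the conversion involves, here is a rough sketch of the general shape of an HF-to-ggml converter (this is not the actual script; the model name, output file name, and header layout below are illustrative assumptions):

```python
# Minimal sketch of an HF -> ggml-style conversion: load the checkpoint, then
# write a magic number plus raw tensor data into one binary file. The real
# converter also writes hyperparameters and the tokenizer vocab.
import struct

import torch
from transformers import AutoModelForCausalLM

model_name = "databricks/dolly-v2-3b"   # assumed: the smallest Dolly v2 variant
out_path = "dolly-v2-3b-f16.bin"        # hypothetical output file name

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

with open(out_path, "wb") as fout:
    fout.write(struct.pack("i", 0x67676D6C))   # illustrative magic number ("ggml")
    # ... hyperparameters and vocab would be written here in a real converter ...
    for name, tensor in model.state_dict().items():
        data = tensor.to(torch.float16).cpu().numpy()
        encoded_name = name.encode("utf-8")
        # per-tensor header: number of dims and name length, then the shape
        fout.write(struct.pack("ii", data.ndim, len(encoded_name)))
        for dim in reversed(data.shape):
            fout.write(struct.pack("i", dim))
        fout.write(encoded_name)
        fout.write(data.tobytes())
```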
The original example is missing the Context portion of the prompt, but I don't think that will be difficult to add.
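Something along these lines should do it. Note the exact section headers ("### Instruction:", "Input:", "### Response:") are my assumption based on Dolly's instruction-tuning format, so check the repo's instruct_pipeline / training code for the canonical template:

```python
# Sketch of splicing the optional context into the prompt before generation.
def build_prompt(instruction: str, context: str = "") -> str:
    intro = ("Below is an instruction that describes a task. "
             "Write a response that appropriately completes the request.")
    if context:
        return (f"{intro}\n\n### Instruction:\n{instruction}\n\n"
                f"Input:\n{context}\n\n### Response:\n")
    return f"{intro}\n\n### Instruction:\n{instruction}\n\n### Response:\n"


print(build_prompt("Summarize the passage.", context="GGML lets LLMs run on CPUs."))
```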
If anyone is looking to run Dolly v2 on their CPU, this has been a viable solution for me. You may still encounter OOM errors running the conversion script on anything with less than 32 GB of RAM, but after the conversion finishes it has been smooth sailing. Is it possible that Databricks could provide an "official" copy of the ggml Dolly model?
Oh nice, how fast does it run? You can already run it as-is on CPUs, but is this faster or more memory efficient?
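(For reference, running it as-is on CPU looks roughly like the model card's example; the 3b checkpoint and float32 dtype here are just illustrative choices to keep memory low on a CPU-only box:)

```python
# Baseline: running Dolly v2 directly on CPU with the transformers pipeline.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-3b",
    torch_dtype=torch.float32,   # bfloat16 also works on recent CPUs
    trust_remote_code=True,      # needed for Dolly's custom instruction pipeline
)

res = generate_text("Explain what the ggml file format is for.")
print(res)  # output structure depends on the pipeline version; see the model card
```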
There are so many possible variations to maintain (more sizes, quantized, safetensors, ONNX?, this, etc.) that I don't think we'd do it, but others can.
The GGML model binaries I'm using are Dolly V2 converted to FP16 and also to a 5-bit quantized format, with the inference code implemented in C. The quantized models are of course more memory efficient, and in my experience running them through the C implementation is faster. I don't have access to my machine this week, but I promise to update my discussion post here with more metrics soon (TM).
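To put rough numbers on the memory savings, here is a back-of-the-envelope estimate for the weights alone (parameter counts are approximate, and this ignores the KV cache, activations, and quantization block overhead):

```python
# Approximate weight memory for the three Dolly v2 sizes at different precisions.
GIB = 1024 ** 3

def weight_gib(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / GIB

for name, params in [("dolly-v2-3b", 2.8e9), ("dolly-v2-7b", 6.9e9), ("dolly-v2-12b", 12e9)]:
    print(f"{name}: fp32 {weight_gib(params, 32):.1f} GiB, "
          f"fp16 {weight_gib(params, 16):.1f} GiB, "
          f"5-bit {weight_gib(params, 5):.1f} GiB")
```

By that estimate the 12b model drops from roughly 22 GiB at FP16 to around 7 GiB at 5-bit, which is why it fits in 16GB of RAM.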
The reasoning not to support every single variation of Dolly makes sense, and I'm not sure why I didn't reason that out first haha. I'll ask the GGML maintainers if they can add the files to their own HF repository.
This could be an important variation if it really is performant on a CPU, especially if it works on Macs, where the GPU is hard to get working.
Not out of the question for sure; I'm not the Decider either.
It could start as a separate model here.