GGML & GGUF

#1
by gotzmann - opened

@TheBloke Please, this model family looks very promising

@gotzmann have you been using this? I've been using it for STEM-related tasks and it's been a pleasant surprise!

I've done thorough research on the latest 70B models (Upstage, Samantha, Nous Hermes, ...) with our own benchmark, and I should say that Synthia got the best scores out of all of them!

We are going to use it within our chat system instead of the Upstage Instruct model.

BTW, I've converted Synthia v1.2 into Q4_K_M format (which we are using), so it fits on a pair of 3090 / 4090 cards or one A6000:

https://huggingface.co/gotzmann/Synthia-70B-v1.2-GGML
https://huggingface.co/gotzmann/Synthia-70B-v1.2-GGUF
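
If anyone wants to reproduce the conversion, the rough pipeline is: convert the HF weights to a full-precision GGUF, then quantize down to Q4_K_M with llama.cpp's quantize tool. A minimal sketch in Python below; the script name, flags and file paths are assumptions and may differ between llama.cpp versions:

```python
# Minimal sketch of the HF -> GGUF -> Q4_K_M pipeline using llama.cpp's tools.
# Paths, script names and flags are assumptions; check your llama.cpp checkout.
import subprocess

MODEL_DIR = "Synthia-70B-v1.2"           # directory with the original HF weights
F16_GGUF = "synthia-70b-v1.2.f16.gguf"   # intermediate full-precision GGUF
Q4_GGUF = "synthia-70b-v1.2.Q4_K_M.gguf"

# 1) Convert the Hugging Face checkpoint to an f16 GGUF file.
subprocess.run(
    ["python", "convert.py", MODEL_DIR, "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# 2) Quantize the f16 GGUF down to Q4_K_M.
subprocess.run(["./quantize", F16_GGUF, Q4_GGUF, "Q4_K_M"], check=True)
```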

Awesome!

Share your app too when you're ready. I can also help with some LinkedIn/Twitter re-sharing. :)

How do I use GGML or GGUF models? What's the best and fastest way to use them for inference? Do you have a suggested library? How many tokens/second can you achieve with those?

You can use them in llama.cpp, but I personally use them in the Oobabooga Web UI. GGUF is like GGML v2.0: it has extended metadata to store more information about the model, plus some additional improvements. GGML is unfortunately being deprecated pretty fast, and GGUF is the new format now. The GGML format was designed to run large AI models in CPU mode, so you can use DDR RAM instead of VRAM. I usually get 25-30 tokens per second.
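
If you want to try it outside the Web UI, llama-cpp-python loads GGUF files directly and lets you offload part of the model to the GPU. A minimal sketch, assuming a local GGUF file; the path, layer count and prompt format are placeholders, not the exact settings I use:

```python
# Minimal sketch: load a GGUF model with llama-cpp-python and offload some
# layers to the GPU. Model path, n_gpu_layers and the prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="synthia-70b-v1.2.Q4_K_M.gguf",  # local GGUF file
    n_ctx=4096,        # context window
    n_gpu_layers=40,   # layers to offload to VRAM; 0 = pure CPU
)

prompt = "SYSTEM: You are a helpful assistant.\nUSER: Explain GGUF in one sentence.\nASSISTANT:"

start = time.time()
out = llm(prompt, max_tokens=128, temperature=0.7)
elapsed = time.time() - start

print(out["choices"][0]["text"])
print(f'{out["usage"]["completion_tokens"] / elapsed:.1f} tokens/sec')
```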

Is anyone serving 70B inference fast enough to match GPT-3.5's generation time? My 70B models are served with Transformers text generation in 4-bit, and it's super slow!

Ideally, you need to run it on very large and powerful GPUs to get GPT-3.5-level inference speed. If you run it in CPU mode like I do, don't expect it to come close to GPT-3.5, but I'm pretty sure new optimizations are on the way (in CPU mode we can already partially offload GGML and GGUF models to the GPU to increase performance), so we may be able to squeeze out more speed in the near future. Or you can rent some powerful cloud GPUs and experiment with inference speed.
You can also try this to run your models:
https://github.com/huggingface/text-generation-inference
But I haven't tried it yet. If you do spin it up, querying the server is just an HTTP call to its /generate endpoint. A rough sketch (host, port and generation parameters are placeholders):
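
```python
# Rough sketch: query a running text-generation-inference server.
# The host/port and generation parameters below are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "USER: What is GGUF?\nASSISTANT:",
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```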

Yeah that’s what I’m using. @gotzmann any thoughts?

migtissera changed discussion status to closed
