Very fast: 3.12 tok/s on llama-2-openhermes.gguf.q5_1

#2
by boqsc - opened

If the models in this repository are correctly named after their quantisation, then they are quite fast: almost twice as fast as the ones at https://huggingface.co/s3nh/teknium-OpenHermes-13B-GGUF/tree/main

In addition, this release needs far less RAM to run.

Update:
And finally, the file size suggests that this might be the 7B version of OpenHermes and not the 13B one mentioned above.
So it is not surprising at all that it generates tokens very quickly.
That aside, what is really surprising is how well it performs at storytelling, at least in my testing so far.
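
For anyone who wants to sanity-check the 7B vs 13B guess from the file size, here is a rough sketch in Python. It assumes q5_1 costs about 6 bits per weight (32 weights packed into a 24-byte block) and uses rounded Llama 2 parameter counts; `estimate_size_gb` and `guess_model` are names I made up for illustration, and real GGUF files will differ a little because of metadata and tensors stored in other formats.

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8.
# Assumption: q5_1 uses ~6 bits per weight (24 bytes per block of 32 weights).
Q5_1_BITS_PER_WEIGHT = 6.0

# Assumption: approximate Llama 2 parameter counts.
PARAM_COUNTS = {"7B": 6.74e9, "13B": 13.02e9}


def estimate_size_gb(n_params, bits_per_weight=Q5_1_BITS_PER_WEIGHT):
    """Approximate file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def guess_model(file_size_gb):
    """Return the label whose estimated q5_1 size is closest to the file size."""
    return min(
        PARAM_COUNTS,
        key=lambda label: abs(estimate_size_gb(PARAM_COUNTS[label]) - file_size_gb),
    )


if __name__ == "__main__":
    for label, n_params in PARAM_COUNTS.items():
        print(f"{label}: about {estimate_size_gb(n_params):.1f} GB at q5_1")
```

Pass the size of the downloaded file (in GB) to `guess_model` to see which parameter count it sits closer to. It is only an estimate, but the two sizes are far enough apart that the answer should be clear.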
