
Use Ollama with any GGUF Model on Hugging Face Hub


🆕 You can now also run private GGUFs from the Hugging Face Hub.

Ollama is an application based on llama.cpp that lets you interact with LLMs directly on your computer. You can use any GGUF quants created by the community (bartowski, MaziyarPanahi and many more) on Hugging Face directly with Ollama, without creating a new Modelfile. At the time of writing there are 45K public GGUF checkpoints on the Hub, and you can run any of them with a single ollama run command. We also provide customisations such as choosing the quantization type, setting a system prompt, and more to improve your overall experience.

Getting started is as simple as:

  1. Enable ollama under your Local Apps settings.
  2. On a model page, choose ollama from Use this model dropdown. For example: bartowski/Llama-3.2-1B-Instruct-GGUF.

The snippet has the following format:

ollama run hf.co/{username}/{repository}

Please note that you can use both hf.co and huggingface.co as the domain name.
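For example, the following two commands are equivalent:

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF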

Here are some models you can try:

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF

Custom Quantization

By default, the Q4_K_M quantization scheme is used when it's present inside the model repo. If not, we default to a reasonable quantization type that is present inside the repo.

To select a different scheme, simply:

  1. From the Files and versions tab on a model page, open the GGUF viewer for a particular GGUF file.
  2. Choose ollama from Use this model dropdown.

The snippet has the following format (with the quantization tag added):

ollama run hf.co/{username}/{repository}:{quantization}

For example:

ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

# the quantization name is case-insensitive, this will also work
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m

# you can also directly use the full filename as a tag
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf

Custom Chat Template and Parameters

By default, a template is selected automatically from a list of commonly used templates, based on the built-in tokenizer.chat_template metadata stored inside the GGUF file.
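If you want to check which template Ollama ended up picking for a given model, you can print it with ollama show (a small sketch, assuming a recent Ollama version that supports the --template flag and that the model has already been pulled at least once):

ollama show hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF --template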

If your GGUF file doesn’t have a built-in template or if you want to customize your chat template, you can create a new file called template in the repository. The template must be a Go template, not a Jinja template. Here’s an example:

{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>

To learn more about the Go template format, please refer to this documentation.

You can optionally configure a system prompt by putting it into a new file named system in the repository.
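As a sketch, a system file could contain a single plain-text prompt such as:

You are a concise assistant that answers questions about the Hugging Face Hub.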

To change sampling parameters, create a file named params in the repository. The file must be in JSON format. For the list of all available parameters, please refer to this documentation.
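As a sketch, a params file could look like the following; the keys correspond to Ollama's sampling parameters and the values here are purely illustrative:

{
  "temperature": 0.7,
  "top_p": 0.9,
  "num_ctx": 4096,
  "stop": ["<|end|>"]
}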

Run Private GGUFs from the Hugging Face Hub

You can run private GGUFs from your personal account or from an associated organisation account in two simple steps:

  1. Copy your Ollama SSH key; you can do so via: cat ~/.ollama/id_ed25519.pub | pbcopy (see the note after this list if you're not on macOS).
  2. Add the corresponding key to your Hugging Face account by going to your account settings and clicking on Add new SSH key.
  3. That’s it! You can now run private GGUFs from the Hugging Face Hub: ollama run hf.co/{username}/{repository}.
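If you're not on macOS, pbcopy won't be available. As a sketch, you can print the key and copy it manually, or pipe it to your platform's clipboard tool (for example xclip on Linux, assuming it is installed):

cat ~/.ollama/id_ed25519.pub
# on Linux, assuming xclip is installed:
cat ~/.ollama/id_ed25519.pub | xclip -selection clipboard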
