Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)
Model Description
This repository contains quantized versions of the fine-tuned Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead model, which was trained via knowledge distillation from Gemini-2.0-Flash-Thinking-Exp. The fine-tuning teaches the model to reason through archival descriptions before producing Encoded Archival Description (EAD/XML) output, so that structured reasoning precedes the final archival XML.
This repository provides various GGUF quantized formats, allowing efficient inference on different hardware setups, including CPUs and GPUs.
Available GGUF Files
The following quantized versions of the model were generated using llama.cpp:
| File Name | Description |
|---|---|
| Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf | Ultra-low precision (2-bit) for extreme compression |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf | 3-bit quantization with mixed precision |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf | 4-bit quantization with mixed precision |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf | 5-bit quantization with mixed precision |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf | 6-bit quantization |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf | 8-bit quantization balancing speed and accuracy |
| Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf | 16-bit floating-point (fp16) version |
| Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf | Full-precision (fp32) version |
How to Use the Quantized Model
Running the Model with llama.cpp
To run the model using llama.cpp, use the following command (newer llama.cpp builds name the CLI binary llama-cli instead of main):

```bash
./main -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."
```
For optimal performance, ensure you select the right quantized version based on your hardware capabilities.
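You can also run the GGUF files from Python through the llama-cpp-python bindings. The sketch below is an illustration, not part of the original instructions; it assumes llama-cpp-python is installed and that the Q4_K_M file sits in the current directory:

```python
from llama_cpp import Llama

# Load the GGUF file; adjust n_ctx and the file name to your hardware and chosen quantization
llm = Llama(model_path="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Convert the following archival information into EAD/XML: ..."},
    ],
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])
```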
Running the Model with GPT4All
If using GPT4All, load the GGUF model with:
```python
from gpt4all import GPT4All

# Path (or file name) of the downloaded GGUF file; adjust to your chosen quantization
model_path = "Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf"
model = GPT4All(model_path)

response = model.generate("Convert the following archival information into EAD/XML:")
print(response)
```
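For multi-turn use, GPT4All also exposes a chat session context manager. A minimal sketch, assuming the same GGUF file as above:

```python
from gpt4all import GPT4All

model = GPT4All("Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf")

# Keep the system prompt and conversation state across several prompts
with model.chat_session(system_prompt="You are an archivist expert in EAD/XML format for archival records metadata."):
    answer = model.generate("Give me an example of <controlaccess> content.", max_tokens=512)
    print(answer)
```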
Running the Model with Ollama
If using Ollama, pull and run the GGUF model directly from the Hugging Face Hub:

```bash
ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0
```

Once the model is served, you can query the local Ollama endpoint over HTTP:
```python
import requests
import json

url = "http://localhost:11434/v1/chat/completions"

payload = json.dumps({
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {
            "role": "system",
            "content": "You are an archivist expert in EAD/XML format for archival records metadata."
        },
        {
            "role": "user",
            "content": "Give me an example of <controlaccess> content."
        }
    ],
    # Ollama-specific generation settings ("options", not "option"); they are honored by the
    # native /api/chat endpoint, while the OpenAI-compatible endpoint above may ignore them
    "options": {
        "num_ctx": 4096,
        "temperature": 0.1
    },
    "stream": False
})
headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```
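Alternatively, the official ollama Python client wraps the same local server. A minimal sketch, assuming `pip install ollama` and a running Ollama instance:

```python
import ollama

# Chat through the local Ollama server using the same model tag as above
response = ollama.chat(
    model="hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Give me an example of <controlaccess> content."},
    ],
    options={"num_ctx": 4096, "temperature": 0.1},
)
print(response["message"]["content"])
```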
Choosing the Right Quantization Format
- Lower-bit models (Q2_K, Q3_K_M, Q4_K_M): Best for low-memory devices, but may lose some accuracy.
- Mid-range (Q5_K_M, Q6_K): Good trade-off between speed and precision.
- Higher precision (Q8_0, fp16, fp32): Best for accuracy but requires more memory.
For CPU inference, Q4_K_M or Q5_K_M is recommended for a balance between efficiency and performance.
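If you script your downloads, here is a minimal sketch using huggingface_hub to fetch only the quantization you need from this repository:

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF file instead of cloning the whole repository;
# swap the filename for the quantization that fits your hardware
gguf_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF",
    filename="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
)
print(gguf_path)  # local path to pass to llama.cpp, GPT4All, etc.
```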
Limitations & Future Improvements
- Inference Speed: Ensure Sliding Window Attention (SWA) is disabled, as it may slow down inference. This setting applies to the original Transformers checkpoint rather than the GGUF files (see the sketch after this list). To disable it:

  ```python
  model.config.sliding_window = None
  ```

- Future Work:
- Further optimizations for CPU inference
- Additional fine-tuning on larger datasets
- Exploring LoRA/QLoRA for low-rank adaptation
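A minimal sketch of that setting, assuming you run the original Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead checkpoint with the transformers library rather than a GGUF file:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original distilled checkpoint (the GGUF files in this repo are for llama.cpp-style runtimes)
model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Disable Sliding Window Attention to avoid the inference slowdown noted above
model.config.sliding_window = None
```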
Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{geoffroy2025geminidistillqwen,
  author = {Géraldine Geoffroy},
  title = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
}
```