Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)
Model Description
This repository contains quantized versions of the fine-tuned Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead model, which was trained via knowledge distillation from Gemini-2.0-Flash-Thinking-Exp. The fine-tuning teaches the model to reason through archival descriptions before producing Encoded Archival Description (EAD/XML) output, so that structured reasoning precedes the final archival XML.
This repository provides various GGUF quantized formats, allowing efficient inference on different hardware setups, including CPUs and GPUs.
Available GGUF Files
The following quantized versions of the model were generated using llama.cpp:
| File Name | Description |
|---|---|
| Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf | Ultra-low precision (2-bit) for extreme compression |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf | 3-bit quantization with mixed precision |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf | 4-bit quantization with mixed precision |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf | 5-bit quantization with mixed precision |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf | 6-bit quantization |
| Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf | 8-bit quantization balancing speed and accuracy |
| Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf | 16-bit floating-point (fp16) version |
| Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf | Full-precision (fp32) version |
How to Use the Quantized Model
Running the Model with llama.cpp
To run the model using llama.cpp, use the following command (newer llama.cpp builds name the CLI binary llama-cli instead of main):

```bash
./main -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."
```
For optimal performance, ensure you select the right quantized version based on your hardware capabilities.
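You can also run the GGUF files from Python through the llama-cpp-python bindings. The sketch below is an illustration, not part of the original instructions; it assumes llama-cpp-python is installed and that the Q4_K_M file sits in the current directory:

```python
from llama_cpp import Llama

# Load the GGUF file; adjust n_ctx and the file name to your hardware and chosen quantization
llm = Llama(model_path="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Convert the following archival information into EAD/XML: ..."},
    ],
    temperature=0.1,
)
print(response["choices"][0]["message"]["content"])
```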
Running the Model with GPT4All
If using GPT4All, load the GGUF model with:
```python
from gpt4all import GPT4All

# Path (or file name) of the downloaded GGUF file; adjust to your chosen quantization
model_path = "Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf"
model = GPT4All(model_path)

response = model.generate("Convert the following archival information into EAD/XML:")
print(response)
```
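For multi-turn use, GPT4All also exposes a chat session context manager. A minimal sketch, assuming the same GGUF file as above:

```python
from gpt4all import GPT4All

model = GPT4All("Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf")

# Keep the system prompt and conversation state across several prompts
with model.chat_session(system_prompt="You are an archivist expert in EAD/XML format for archival records metadata."):
    answer = model.generate("Give me an example of <controlaccess> content.", max_tokens=512)
    print(answer)
```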
Running the Model with Ollama
If using Ollama, pull and run the GGUF model directly from the Hugging Face Hub:

```bash
ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0
```

Once the model is served, you can query the local Ollama endpoint over HTTP:
```python
import requests
import json

url = "http://localhost:11434/v1/chat/completions"

payload = json.dumps({
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {
            "role": "system",
            "content": "You are an archivist expert in EAD/XML format for archival records metadata."
        },
        {
            "role": "user",
            "content": "Give me an example of <controlaccess> content."
        }
    ],
    # Ollama-specific generation settings ("options", not "option"); they are honored by the
    # native /api/chat endpoint, while the OpenAI-compatible endpoint above may ignore them
    "options": {
        "num_ctx": 4096,
        "temperature": 0.1
    },
    "stream": False
})
headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)
```
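Alternatively, the official ollama Python client wraps the same local server. A minimal sketch, assuming `pip install ollama` and a running Ollama instance:

```python
import ollama

# Chat through the local Ollama server using the same model tag as above
response = ollama.chat(
    model="hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Give me an example of <controlaccess> content."},
    ],
    options={"num_ctx": 4096, "temperature": 0.1},
)
print(response["message"]["content"])
```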
Choosing the Right Quantization Format
- Lower-bit models (Q2_K, Q3_K_M, Q4_K_M): Best for low-memory devices, but may lose some accuracy.
- Mid-range (Q5_K_M, Q6_K): Good trade-off between speed and precision.
- Higher precision (Q8_0, fp16, fp32): Best for accuracy but requires more memory.
For CPU inference, Q4_K_M or Q5_K_M is recommended for a balance between efficiency and performance.
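If you script your downloads, here is a minimal sketch using huggingface_hub to fetch only the quantization you need from this repository:

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF file instead of cloning the whole repository;
# swap the filename for the quantization that fits your hardware
gguf_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF",
    filename="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
)
print(gguf_path)  # local path to pass to llama.cpp, GPT4All, etc.
```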
Limitations & Future Improvements
- Inference Speed: Ensure Sliding Window Attention (SWA) is disabled, as it may slow down inference. This setting applies to the original Transformers checkpoint rather than the GGUF files (see the sketch after this list). To disable it:

  ```python
  model.config.sliding_window = None
  ```

- Future Work:
- Further optimizations for CPU inference
- Additional fine-tuning on larger datasets
- Exploring LoRA/QLoRA for low-rank adaptation
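A minimal sketch of that setting, assuming you run the original Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead checkpoint with the transformers library rather than a GGUF file:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the original distilled checkpoint (the GGUF files in this repo are for llama.cpp-style runtimes)
model_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Disable Sliding Window Attention to avoid the inference slowdown noted above
model.config.sliding_window = None
```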
Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{geoffroy2025geminidistillqwen,
  author = {Géraldine Geoffroy},
  title = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
}
```