Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions (Distilled from Gemini-2.0-Flash-Thinking-Exp)

Model Description

This repository contains quantized versions of the fine-tuned Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead model, which was trained via knowledge distillation from Gemini-2.0-Flash-Thinking-Exp. The fine-tuning process teaches the model to reason through and generate Encoded Archival Description (EAD/XML) outputs, ensuring structured reasoning before final archival XML generation.

This repository provides various GGUF quantized formats, allowing efficient inference on different hardware setups, including CPUs and GPUs.


Available GGUF Files

The following quantized versions of the model were generated using llama.cpp:

  • Gemini-Distill-Qwen2.5-0.5B-ead-Q2_K.gguf: Ultra-low precision (2-bit) for extreme compression
  • Gemini-Distill-Qwen2.5-0.5B-ead-Q3_K_M.gguf: 3-bit quantization with mixed precision
  • Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf: 4-bit quantization with mixed precision
  • Gemini-Distill-Qwen2.5-0.5B-ead-Q5_K_M.gguf: 5-bit quantization with mixed precision
  • Gemini-Distill-Qwen2.5-0.5B-ead-Q6_K.gguf: 6-bit quantization
  • Gemini-Distill-Qwen2.5-0.5B-ead-Q8_0.gguf: 8-bit quantization for balance between speed and accuracy
  • Gemini-Distill-Qwen2.5-0.5B-ead-fp16.gguf: 16-bit floating point (fp16) version
  • Gemini-Distill-Qwen2.5-0.5B-ead-fp32.gguf: Full precision (fp32) version
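
Each file can be downloaded individually; below is a minimal sketch using the huggingface_hub client (assuming pip install huggingface_hub, with the Q4_K_M variant used as an example):

from huggingface_hub import hf_hub_download

# Download one quantized variant from this repository into the local HF cache
model_file = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF",
    filename="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
)
print(model_file)  # absolute path to the downloaded .gguf file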

How to Use the Quantized Model

Running the Model with llama.cpp

To run the model from the command line with llama.cpp, use a command like the following (recent llama.cpp releases name the CLI binary llama-cli rather than main):

./main -m Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf -p "Convert the following archival information into EAD/XML: ..."

For optimal performance, ensure you select the right quantized version based on your hardware capabilities.
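
The same file can also be used from Python through the llama-cpp-python bindings; a minimal sketch (assuming pip install llama-cpp-python and the Q4_K_M file in the working directory):

from llama_cpp import Llama

# Load the quantized model; n_ctx sets the context window size
llm = Llama(model_path="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf", n_ctx=4096)

# Plain text completion
output = llm(
    "Convert the following archival information into EAD/XML: ...",
    max_tokens=512,
    temperature=0.1,
)
print(output["choices"][0]["text"])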

Running the Model with GPT4All

If using GPT4All, load the GGUF model with:

from gpt4all import GPT4All

# model_name is the local GGUF file name; model_path is the directory that contains it
model = GPT4All(
    model_name="Gemini-Distill-Qwen2.5-0.5B-ead-Q4_K_M.gguf",
    model_path=".",           # adjust to the directory holding the .gguf file
    allow_download=False,     # load the local file instead of trying to download it
)
response = model.generate("Convert the following archival information into EAD/XML:")
print(response)
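
GPT4All also provides a chat_session context manager that keeps multi-turn history and accepts a system prompt; a short sketch (keyword arguments may differ slightly across gpt4all versions):

with model.chat_session(system_prompt="You are an archivist expert in EAD/XML format for archival records metadata."):
    reply = model.generate("Give me an example of <controlaccess> content.", max_tokens=512)
    print(reply)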

Running the Model with Ollama

If you use Ollama, pull and run the quantized model directly from the Hugging Face Hub:

ollama run hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0

Once the model is served locally, you can query it from Python through Ollama's REST API. Note that generation parameters such as num_ctx belong under an "options" key on the native /api/chat endpoint:

import requests

# Ollama's native chat endpoint; an OpenAI-compatible endpoint is also
# available at http://localhost:11434/v1/chat/completions
url = "http://localhost:11434/api/chat"

payload = {
    "model": "hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    "messages": [
        {
            "role": "system",
            "content": "You are an archivist expert in EAD/XML format for archival records metadata."
        },
        {
            "role": "user",
            "content": "Give me an example of <controlaccess> content."
        }
    ],
    "options": {
        "num_ctx": 4096,
        "temperature": 0.1
    },
    "stream": False
}

response = requests.post(url, json=payload)
print(response.json()["message"]["content"])
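
Equivalently, the official ollama Python client wraps the same local API; a sketch assuming pip install ollama:

import ollama

# Same request issued through the ollama Python client
result = ollama.chat(
    model="hf.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF:Q8_0",
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD/XML format for archival records metadata."},
        {"role": "user", "content": "Give me an example of <controlaccess> content."},
    ],
    options={"num_ctx": 4096, "temperature": 0.1},
)
print(result["message"]["content"])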

Choosing the Right Quantization Format

  • Lower-bit models (Q2_K, Q3_K_M, Q4_K_M): Best for low-memory devices, but may lose some accuracy.
  • Mid-range (Q5_K_M, Q6_K): Good trade-off between speed and precision.
  • Higher precision (Q8_0, fp16, fp32): Best for accuracy but requires more memory.

For CPU inference, Q4_K_M or Q5_K_M is recommended for a balance between efficiency and performance.


Limitations & Future Improvements

  • Inference Speed: When running the original Transformers checkpoint, make sure Sliding Window Attention (SWA) is disabled, as it may slow down inference (a sketch follows this list).
    • To disable: model.config.sliding_window = None
  • Future Work:
    • Further optimizations for CPU inference
    • Additional fine-tuning on larger datasets
    • Exploring LoRA/QLoRA for low-rank adaptation
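
A minimal sketch of disabling SWA when loading the non-quantized checkpoint with the transformers library (this config setting does not apply to the GGUF files, which are handled by the GGUF runtimes above):

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead"

# Disable Sliding Window Attention before instantiating the model
config = AutoConfig.from_pretrained(repo)
config.sliding_window = None

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, config=config)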

Citation & Acknowledgments

If you use this model in research or production, please cite:

@misc{geoffroy2025geminidistillqwen,
  author = {Géraldine Geoffroy},
  title = {Gemini-Distill-Qwen2.5-0.5B-ead GGUF Quantized Versions},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-GGUF}
}