# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX
## Model Description
This repository contains ONNX-optimized versions of Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead, a model distilled from Gemini-2.0-Flash-Thinking-Exp and fine-tuned specifically for reasoning about and generating structured Encoded Archival Description (EAD/XML).
ONNX conversion enables faster inference on a variety of hardware, including CPUs, GPUs, and specialized inference accelerators.
## Available ONNX Model Versions

The following ONNX variants, from full precision down to 4-bit quantization, cover different inference needs:
| File Name | Description |
|---|---|
| `model.onnx` | Full precision (fp32) version |
| `model_fp16.onnx` | Half precision (fp16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization for low-memory scenarios |
| `model_q4f16.onnx` | 4-bit quantization with fp16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | General quantized model for mixed precision |
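To fetch a single variant without cloning the whole repository, you can download just the file you need with `huggingface_hub`. A minimal sketch; the `filename` below assumes the ONNX files sit at the repository root, so adjust the path (e.g. `onnx/model_int8.onnx`) if they live in a subfolder:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Download only the int8 variant, e.g. for CPU inference
model_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="model_int8.onnx",  # assumed location; adjust if files are under onnx/
)
print(model_path)  # local cache path to pass to onnxruntime
```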
## How to Use the ONNX Model

### 1. Install Dependencies
Ensure you have the required dependencies for ONNX inference:
```bash
pip install onnxruntime
```
For GPU acceleration, install:
```bash
pip install onnxruntime-gpu
```
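Before choosing a model file, you can confirm which execution providers your `onnxruntime` build actually exposes:

```python
import onnxruntime as ort

# Lists the providers compiled into this build,
# e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"]
print(ort.get_available_providers())
```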
### 2. Load and Run Inference

You can use `onnxruntime` to load and run inference with the model:
```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model; CPUExecutionProvider is listed as a fallback
# in case CUDA is unavailable on the host
session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the inputs the exported graph expects (decoder exports often
# also require attention_mask, position_ids, or past_key_values)
print([inp.name for inp in session.get_inputs()])

# Prepare input data (example); token IDs must be int64
input_data = {"input_ids": np.array([[...]], dtype=np.int64)}  # Replace with tokenized input

# Run inference
outputs = session.run(None, input_data)

# Print the raw output tensors (logits)
print(outputs)
```
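The raw session API returns logits and leaves tokenization, any `past_key_values` handling, and the decoding loop to you. A higher-level option is Hugging Face Optimum, which wraps ONNX Runtime behind the familiar `transformers` interface. A minimal sketch, assuming Optimum can locate the ONNX files in this repository (the prompt is illustrative):

```python
# pip install optimum[onnxruntime] transformers
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# Pass file_name=... to select a specific quantized variant, if needed
model = ORTModelForCausalLM.from_pretrained(repo_id)

prompt = "Generate an EAD/XML <did> element for a 1920s photograph collection."
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```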
## Why ONNX?
- Faster Inference: Optimized execution across different hardware.
- Cross-Platform Compatibility: Run on CPUs, GPUs, and specialized accelerators.
- Reduced Memory Usage: Quantized versions provide significant efficiency gains.
## Citation & Acknowledgments
If you use this model in research or production, please cite:
```bibtex
@misc{geoffroy2025gemini,
  author    = {Géraldine Geoffroy},
  title     = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```