# Gemini-Distill-Qwen2.5-0.5B-ead-ONNX

## Model Description

This repository contains ONNX-optimized versions of the Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead model, which was distilled from Gemini-2.0-Flash-Thinking-Exp and fine-tuned specifically for structured Encoded Archival Description (EAD/XML) reasoning and generation.

ONNX conversion enables faster inference on a variety of hardware, including CPUs, GPUs, and specialized inference accelerators.


## Available ONNX Model Versions

The following ONNX versions are provided to cover different inference needs:

| File Name | Description |
|---|---|
| `model.onnx` | Full precision (FP32) version |
| `model_fp16.onnx` | Half precision (FP16) for optimized GPU inference |
| `model_bnb4.onnx` | Bitsandbytes 4-bit quantization |
| `model_int8.onnx` | 8-bit integer quantization for efficient CPU inference |
| `model_q4.onnx` | 4-bit quantization (for low-memory scenarios) |
| `model_q4f16.onnx` | 4-bit quantization with FP16 fallback |
| `model_uint8.onnx` | Unsigned 8-bit quantization |
| `model_quantized.onnx` | General quantized model for mixed precision |
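
If you only need one of these files, you can fetch it directly with `huggingface_hub` instead of cloning the whole repository. This is a minimal sketch; the exact filename/path inside the repository (for example an `onnx/` subfolder) is an assumption to verify against the repository's file listing.

```python
# Minimal sketch: download a single ONNX file from the Hub.
# The filename/path is an assumption -- check the repository's file listing.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX",
    filename="model_int8.onnx",  # pick the variant that fits your hardware
)
print(model_path)  # local cache path to pass to onnxruntime
```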

## How to Use the ONNX Model

### 1. Install Dependencies

Ensure you have the required dependencies for ONNX inference:

```bash
pip install onnxruntime
```

For GPU acceleration, install:

```bash
pip install onnxruntime-gpu
```
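
To confirm that the GPU build is picked up, you can list the execution providers ONNX Runtime sees on your machine:

```python
import onnxruntime as ort

# "CUDAExecutionProvider" should appear here if onnxruntime-gpu is installed correctly
print(ort.get_available_providers())
```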

### 2. Load and Run Inference

You can use onnxruntime to load and run inference with the model:

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model (falls back to CPU if CUDA is unavailable)
session = ort.InferenceSession(
    "model_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the inputs the exported graph actually expects
print([inp.name for inp in session.get_inputs()])

# Prepare input data (example) -- token IDs must be int64
input_data = {"input_ids": np.array([[...]])}  # Replace with tokenized input IDs

# Run inference
outputs = session.run(None, input_data)

# Print output (logits)
print(outputs)
```
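
For end-to-end text generation, it is usually easier to let a higher-level wrapper handle tokenization and the decoder's cached key/value inputs. The sketch below uses Hugging Face Optimum's `ORTModelForCausalLM`; the chosen `file_name` (and, depending on how the repository is laid out, a `subfolder="onnx"` argument) are assumptions you may need to adjust, and the prompt is purely illustrative.

```python
# Minimal sketch with Hugging Face Optimum: pip install optimum[onnxruntime]
# file_name (and possibly subfolder="onnx") are assumptions -- match them to the repo layout.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

repo_id = "Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForCausalLM.from_pretrained(repo_id, file_name="model_quantized.onnx")

# Illustrative prompt: ask the model for a fragment of EAD/XML
prompt = "Generate an EAD <did> element describing a collection of 19th-century letters."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Optimum builds the attention mask and past key/value tensors for you, which makes this path less error-prone than hand-constructing the ONNX Runtime feed dictionary.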

## Why ONNX?

- **Faster Inference**: Optimized execution across different hardware.
- **Cross-Platform Compatibility**: Runs on CPUs, GPUs, and specialized accelerators.
- **Reduced Memory Usage**: Quantized versions provide significant efficiency gains.

## Citation & Acknowledgments

If you use this model in research or production, please cite:

```bibtex
@misc{your-citation,
  author = {Géraldine Geoffroy},
  title = {Gemini-Distill-Qwen2.5-0.5B-ead-ONNX},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/Gemini-Distill-Qwen2.5-0.5B-ead-ONNX}
}
```