# Qari-OCR-0.1-VL-2B-Instruct Model

## Model Overview
This model is a fine-tuned version of unsloth/Qwen2-VL-2B-Instruct, trained on an Arabic OCR dataset and optimized for high-accuracy Optical Character Recognition (OCR) of full-page Arabic text.

## Model Details
- Base Model: unsloth/Qwen2-VL-2B-Instruct
- Fine-tuning Dataset: Arabic OCR dataset
- Objective: Extract full-page Arabic text with high accuracy
- Language: Arabic
- Tasks: OCR (Optical Character Recognition)

## Performance Evaluation
The model has been evaluated on standard OCR metrics, including Word Error Rate (WER), Character Error Rate (CER), and BLEU score.

### Metrics
| Model | WER ↓ | CER ↓ | BLEU ↑ |
|---|---|---|---|
| Qari v0.1 Model | 0.068 | 0.019 | 0.860 |
| Qwen2 VL 2B | 1.344 | 1.191 | 0.201 |
| EasyOCR | 0.908 | 0.617 | 0.152 |
| Tesseract OCR | 0.428 | 0.226 | 0.410 |

### Key Results
- WER: 0.068 (93.2% word accuracy)
- CER: 0.019 (98.1% character accuracy)
- BLEU: 0.860
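
For reference, these metrics can be reproduced with off-the-shelf libraries. Below is a minimal sketch, assuming `jiwer` for WER/CER and the Hugging Face `evaluate` package for BLEU; the reference and prediction strings are placeholders, not data from the actual evaluation:

```python
# pip install jiwer evaluate
import jiwer
import evaluate

reference = "..."   # placeholder: ground-truth page transcription
prediction = "..."  # placeholder: model output for the same page

wer = jiwer.wer(reference, prediction)  # word error rate (lower is better)
cer = jiwer.cer(reference, prediction)  # character error rate (lower is better)

bleu = evaluate.load("bleu")
result = bleu.compute(predictions=[prediction], references=[[reference]])

print(f"WER={wer:.3f} CER={cer:.3f} BLEU={result['bleu']:.3f}")
```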

### Performance Comparison
The fine-tuned model outperforms the other solutions (the sketch after this list shows how these percentages follow from the metrics table):
- 95% reduction in WER compared to Base Model
- 98% reduction in CER compared to Base Model
- 328% improvement in BLEU score compared to Base Model
- 84% lower WER than Tesseract OCR
- 92% lower WER than EasyOCR
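
A quick arithmetic check of the figures above, using the values from the metrics table:

```python
# Relative change of an error metric versus a baseline, in percent.
def reduction(baseline, ours):
    return 100 * (baseline - ours) / baseline

print(f"WER vs. base model:  {reduction(1.344, 0.068):.1f}%")  # ~95%
print(f"CER vs. base model:  {reduction(1.191, 0.019):.1f}%")  # ~98%
print(f"WER vs. Tesseract:   {reduction(0.428, 0.068):.1f}%")  # ~84%
print(f"WER vs. EasyOCR:     {reduction(0.908, 0.068):.1f}%")  # ~92%
print(f"BLEU vs. base model: {100 * (0.860 - 0.201) / 0.201:.1f}% improvement")  # ~328%
```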

### Performance Comparison Charts
*(Charts: WER & CER Comparison; BLEU Score Comparison.)*

## How to Use
Try Qari on Google Colab.

You can load this model using the `transformers` and `qwen_vl_utils` libraries:
```
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```
```python
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import os

model_name = "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# Load the page to OCR and save a temporary copy that the message
# below references by file URI.
image = Image.open("your_page.png")  # replace with the path to your image
src = "image.png"
image.save(src)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat-formatted prompt and preprocess the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only new text is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

os.remove(src)  # remove the temporary copy
print(output_text)
```
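
The `bitsandbytes` install above also allows loading the model in 4-bit precision on low-VRAM GPUs. A minimal sketch, assuming you want to quantize at load time (the config values are illustrative, not recommendations from this model card):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch

# Illustrative 4-bit quantization settings; tune for your hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct")
# The rest of the inference code above works unchanged.
```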

## License
This model follows the licensing terms of the original Qwen2 VL model. Please review the terms before using it commercially.

## Citation
If you use this model in your research, please cite:
```bibtex
@misc{QariOCR2025,
  title={Qari-OCR: A High-Accuracy Model for Arabic Optical Character Recognition},
  author={NAMAA},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct}},
  note={Accessed: 2025-03-03}
}
```