# Qari-OCR-0.1-VL-2B-Instruct Model

## Model Overview
This model is a fine-tuned version of unsloth/Qwen2-VL-2B-Instruct, trained on an Arabic OCR dataset and optimized for high-accuracy Optical Character Recognition (OCR) of full-page Arabic text.

## Model Details
- Base Model: unsloth/Qwen2-VL-2B-Instruct
- Fine-tuning Dataset: Arabic OCR dataset
- Objective: Extract full-page Arabic text with high accuracy
- Language: Arabic
- Tasks: OCR (Optical Character Recognition)

## Performance Evaluation
The model has been evaluated on standard OCR metrics, including Word Error Rate (WER), Character Error Rate (CER), and BLEU score.

### Metrics
| Model | WER ↓ | CER ↓ | BLEU ↑ |
|---|---|---|---|
| Qari v0.1 Model | 0.068 | 0.019 | 0.860 |
| Qwen2 VL 2B | 1.344 | 1.191 | 0.201 |
| EasyOCR | 0.908 | 0.617 | 0.152 |
| Tesseract OCR | 0.428 | 0.226 | 0.410 |

### Key Results
- WER: 0.068 (93.2% word accuracy)
- CER: 0.019 (98.1% character accuracy)
- BLEU: 0.860
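
For reference, these metrics can be reproduced with off-the-shelf libraries. Below is a minimal sketch, assuming `jiwer` for WER/CER and the Hugging Face `evaluate` package for BLEU; the reference and prediction strings are placeholders, not data from the actual evaluation:

```python
# pip install jiwer evaluate
import jiwer
import evaluate

reference = "..."   # placeholder: ground-truth page transcription
prediction = "..."  # placeholder: model output for the same page

wer = jiwer.wer(reference, prediction)  # word error rate (lower is better)
cer = jiwer.cer(reference, prediction)  # character error rate (lower is better)

bleu = evaluate.load("bleu")
result = bleu.compute(predictions=[prediction], references=[[reference]])

print(f"WER={wer:.3f} CER={cer:.3f} BLEU={result['bleu']:.3f}")
```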

### Performance Comparison
The fine-tuned model outperforms the other solutions (the sketch after this list shows how these percentages follow from the metrics table):
- 95% reduction in WER compared to Base Model
- 98% reduction in CER compared to Base Model
- 328% improvement in BLEU score compared to Base Model
- 84% lower WER than Tesseract OCR
- 92% lower WER than EasyOCR
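
A quick arithmetic check of the figures above, using the values from the metrics table:

```python
# Relative change of an error metric versus a baseline, in percent.
def reduction(baseline, ours):
    return 100 * (baseline - ours) / baseline

print(f"WER vs. base model:  {reduction(1.344, 0.068):.1f}%")  # ~95%
print(f"CER vs. base model:  {reduction(1.191, 0.019):.1f}%")  # ~98%
print(f"WER vs. Tesseract:   {reduction(0.428, 0.068):.1f}%")  # ~84%
print(f"WER vs. EasyOCR:     {reduction(0.908, 0.068):.1f}%")  # ~92%
print(f"BLEU vs. base model: {100 * (0.860 - 0.201) / 0.201:.1f}% improvement")  # ~328%
```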

### Performance Comparison Charts
*(Charts: WER & CER Comparison; BLEU Score Comparison.)*

## How to Use
Try Qari on Google Colab.

You can load this model using the `transformers` and `qwen_vl_utils` libraries:
```
!pip install -U transformers qwen_vl_utils "accelerate>=0.26.0" peft
!pip install -U bitsandbytes
```
```python
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import os

model_name = "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)
max_tokens = 2000

prompt = "Below is the image of one page of a document, as well as some raw textual content that was previously extracted for it. Just return the plain text representation of this document as if you were reading it naturally. Do not hallucinate."

# Load the page to OCR and save a temporary copy that the message
# below references by file URI.
image = Image.open("your_page.png")  # replace with the path to your image
src = "image.png"
image.save(src)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"file://{src}"},
            {"type": "text", "text": prompt},
        ],
    }
]

# Build the chat-formatted prompt and preprocess the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only new text is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=max_tokens)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

os.remove(src)  # remove the temporary copy
print(output_text)
```
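
The `bitsandbytes` install above also allows loading the model in 4-bit precision on low-VRAM GPUs. A minimal sketch, assuming you want to quantize at load time (the config values are illustrative, not recommendations from this model card):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch

# Illustrative 4-bit quantization settings; tune for your hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct")
# The rest of the inference code above works unchanged.
```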

## License
This model follows the licensing terms of the original Qwen2 VL model. Please review the terms before using it commercially.

## Citation
If you use this model in your research, please cite:
```bibtex
@misc{QariOCR2025,
  title={Qari-OCR: A High-Accuracy Model for Arabic Optical Character Recognition},
  author={NAMAA},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/NAMAA-Space/Qari-OCR-0.1-VL-2B-Instruct}},
  note={Accessed: 2025-03-03}
}
```