Tamil OCR Model (ViT + Tamil RoBERTa)

Model Description

This model is a Vision Encoder-Decoder-based OCR model for recognizing Tamil text from images. The encoder uses a Vision Transformer (ViT) architecture, and the decoder is based on a pre-trained Tamil RoBERTa model. The model is capable of processing image inputs and generating corresponding text, specifically optimized for Tamil script.

Model Architecture

Encoder: google/vit-base-patch16-224-in21k
- A Vision Transformer (ViT) model pre-trained on ImageNet21k, used for encoding image inputs.
Decoder: d42kw01f/Tamil-RoBERTa
- A RoBERTa model pre-trained on Tamil text data, fine-tuned to generate text based on visual features from the encoder.
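
To make the hand-off between the two components concrete, the sketch below works out how many visual tokens the encoder produces for the decoder to attend to. It assumes the settings implied by the checkpoint name vit-base-patch16-224 (a 224x224 input split into 16x16 patches, plus one [CLS] token). The function name is illustrative and not part of any library.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of encoder output tokens: one per image patch plus the [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


print(vit_sequence_length())  # 14 * 14 patches + 1 = 197
```

Under these assumptions, every input image, whatever its original size, is reduced to the same 197 encoder tokens before the decoder generates any text.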

Use Cases

The model is designed to perform Optical Character Recognition (OCR) on images containing Tamil text. Some potential use cases include:

Extracting Tamil words from scanned documents.

How to Use

You can use this model with Hugging Face's transformers library to extract text from images. Below is a sample usage script:

from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel

# Load the model and processor
encoder_model = 'google/vit-base-patch16-224-in21k'
decoder_model = 'd42kw01f/Tamil-RoBERTa'
trained_model_path = '.model/'  # Path to the fine-tuned model

# Initialize the processor and model
# AutoImageProcessor replaces the deprecated AutoFeatureExtractor for vision models
image_processor = AutoImageProcessor.from_pretrained(encoder_model)
tokenizer = AutoTokenizer.from_pretrained(decoder_model)
processor = TrOCRProcessor(image_processor=image_processor, tokenizer=tokenizer)
model = VisionEncoderDecoderModel.from_pretrained(trained_model_path)

# Load and preprocess the image
image_path = 'path_to_your_image.jpg'
image = Image.open(image_path).convert('RGB')

# Generate text
pixel_values = processor(image, return_tensors='pt').pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Generated Text:", generated_text)
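
When several images need to be recognized, the same three calls can be wrapped in a small helper that works through the list in chunks. This is a minimal sketch, assuming `processor` and `model` are the objects created in the script above. The name `recognize_batch` and the `batch_size` default are hypothetical choices, not part of the transformers library.

```python
from typing import Any, List, Sequence


def recognize_batch(images: Sequence[Any], processor: Any, model: Any,
                    batch_size: int = 8) -> List[str]:
    """Run OCR over a list of PIL images in fixed-size chunks.

    `processor` and `model` are expected to behave like the TrOCRProcessor and
    VisionEncoderDecoderModel objects created in the script above.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    texts: List[str] = []
    for start in range(0, len(images), batch_size):
        chunk = list(images[start:start + batch_size])
        # The processor resizes and normalizes every image in the chunk.
        pixel_values = processor(images=chunk, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

A call would look like `recognize_batch([Image.open(p).convert('RGB') for p in paths], processor, model)`, where `paths` is your own list of image files. Chunking keeps memory use bounded. A suitable `batch_size` depends on the available hardware.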

Inputs

Image: The input is a single image containing Tamil text. Supported formats include .jpg, .png, and .jpeg.

Outputs

Text: The model generates a string of text in Tamil extracted from the input image.

Example

# Input: Image containing Tamil text.
# Output: Extracted text from the image.

Training

Dataset: The model was fine-tuned using a dataset of scanned Tamil text and printed documents.
Loss Function: Cross-Entropy Loss was used during training.
Optimization: The Adam optimizer was employed with a learning rate of 5e-5.
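
For readers unfamiliar with the loss, the following pure-Python sketch shows how a token-level cross-entropy is typically computed for a text decoder. It is an illustration, not the training code used for this model. The function name is hypothetical, and the masking value -100 follows the common PyTorch and transformers convention for ignoring padded label positions.

```python
import math
from typing import List, Sequence

IGNORE_INDEX = -100  # label value conventionally used to mask padded positions


def token_cross_entropy(logits: Sequence[Sequence[float]],
                        labels: Sequence[int]) -> float:
    """Mean cross-entropy over the positions whose label is not IGNORE_INDEX."""
    losses: List[float] = []
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Numerically stable log-softmax: subtract the row maximum first.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        losses.append(log_norm - row[label])
    if not losses:
        raise ValueError("no unmasked labels")
    return sum(losses) / len(losses)
```

As a sanity check, a decoder that spreads its probability uniformly over V tokens scores ln(V) under this definition, so lower values mean more probability mass on the correct token.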

Limitations

Language Specificity: This model is optimized for Tamil script recognition. Performance on other languages or mixed-language documents may not be ideal.
Image Quality: The model's performance is dependent on the quality of the input image. Images that are too blurry, noisy, or have poor lighting may produce less accurate results.
Text Length: The model is optimized for extracting text with a maximum length of 64 characters. Longer texts might be truncated or inaccurately predicted.
Small Text: The model may struggle with images containing very small or intricate fonts.
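
Given the text-length limit above, it can help to flag outputs that are long enough to have been cut off. The helper below is a rough sketch. It assumes the 64-character limit is counted in Unicode code points, which this card does not confirm, and the names `MAX_OUTPUT_CHARS` and `may_be_truncated` are hypothetical.

```python
MAX_OUTPUT_CHARS = 64  # limit quoted in the Limitations section (unit assumed)


def may_be_truncated(text: str, limit: int = MAX_OUTPUT_CHARS) -> bool:
    """Return True when the output is long enough that truncation is plausible."""
    # len() counts Unicode code points, so a Tamil consonant plus its vowel
    # sign counts as two even though it is read as one letter.
    return len(text) >= limit


print(len("செகுவேரா"), may_be_truncated("செகுவேரா"))
```

The word in this example reads as four letters but occupies eight code points, so the practical limit in visible Tamil letters may be lower than 64.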

Evaluation

The model was evaluated using standard OCR benchmarks with an emphasis on Tamil text recognition. The primary evaluation metrics were character-level accuracy and Word Error Rate (WER).

Character Accuracy: Achieved ~79% accuracy on validation sets.
Train Loss: 0.063800
Validation Loss: 0.172539
Character Error Rate (CER): 0.072717
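
The error-rate metrics named above are both edit-distance ratios. The sketch below shows one common way to compute them in plain Python. It is not the evaluation script used for this model, and the function names are illustrative. It compares Unicode code points, so a Tamil vowel sign counts as its own character, and the original evaluation may have counted characters differently.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions cost 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: the same ratio over whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Read this way, a CER of about 0.073 corresponds to roughly seven character edits per hundred reference characters.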

Ethical Considerations

This model, while useful for Tamil text extraction, should be applied with caution in contexts where incorrect text extraction could lead to harmful outcomes, such as legal or medical document analysis.

License

This model is distributed under the MIT license. Please check the Hugging Face repository for specific terms.

widget:
  src: "./samples/72.jpg"
  example_title: "Example Image"
  outputs:
  - label: "Text"
    content: "செகுவேரா"
