license: mit
language:
- ta
metrics:
- cer
Tamil OCR Model (ViT + Tamil RoBERTa)
Model Description
This model is a Vision Encoder-Decoder-based OCR model for recognizing Tamil text from images. The encoder uses a Vision Transformer (ViT) architecture, and the decoder is based on a pre-trained Tamil RoBERTa model. The model is capable of processing image inputs and generating corresponding text, specifically optimized for Tamil script.
Model Architecture
Encoder:
google/vit-base-patch16-224-in21k
- A Vision Transformer (ViT) model pre-trained on ImageNet21k, used for encoding image inputs.
Decoder:
d42kw01f/Tamil-RoBERTa
- A RoBERTa model pre-trained on Tamil text data, fine-tuned to generate text based on visual features from the encoder.
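As a rough sanity check on the encoder side, the checkpoint name `vit-base-patch16-224` implies 224x224 inputs cut into 16x16 patches. The sketch below works out how many visual tokens the decoder would cross-attend over, assuming the standard ViT layout with one extra [CLS] token. The function name is illustrative and not part of any library.

```python
def vit_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of encoder tokens for a square image: patches plus one [CLS] token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1  # +1 for the [CLS] token


print(vit_sequence_length())  # 14 * 14 patches + 1 = 197
```

Each of these tokens is a feature vector produced by the encoder; the decoder attends over all of them at every generation step.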
Use Cases
The model is designed to perform Optical Character Recognition (OCR) on images containing Tamil text. Some potential use cases include:
- Extracting Tamil words from scanned documents.
How to Use
You can use this model with Hugging Face's transformers library to extract text from images. Below is a sample usage script:
from PIL import Image
from transformers import AutoFeatureExtractor, AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel
# Load the model and processor
encoder_model = 'google/vit-base-patch16-224-in21k'
decoder_model = 'd42kw01f/Tamil-RoBERTa'
trained_model_path = '.model/' # Path to the fine-tuned model
# Initialize the processor and model
feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_model)
tokenizer = AutoTokenizer.from_pretrained(decoder_model)
processor = TrOCRProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = VisionEncoderDecoderModel.from_pretrained(trained_model_path)
# Load and preprocess the image
image_path = 'path_to_your_image.jpg'
image = Image.open(image_path).convert('RGB')
# Generate text
pixel_values = processor(image, return_tensors='pt').pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Generated Text:", generated_text)
Inputs
- Image: The input is a single image containing Tamil text. Supported formats include .jpg, .png, and .jpeg.
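A small pre-check can reject unsupported files before they reach the processor. This is a minimal sketch that only inspects the file name against the formats listed above; `is_supported_image` and `SUPPORTED_EXTENSIONS` are hypothetical helpers, not part of the model or of transformers, and the check does not verify the file contents.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # formats listed in this card


def is_supported_image(path: str) -> bool:
    """Return True if the file extension is one of the listed formats (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_image("scan_01.JPG"))   # True
print(is_supported_image("scan_01.tiff"))  # False
```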
Outputs
- Text: The model generates a string of text in Tamil extracted from the input image.
Example
# Input: Image containing Tamil text.
# Output: Extracted text from the image.
Training
- Dataset: The model was fine-tuned using a dataset of scanned Tamil text and printed documents.
- Loss Function: Cross-Entropy Loss was used during training.
- Optimization: The Adam optimizer was employed with a learning rate of 5e-5.
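To make the loss and optimizer settings concrete, here is a pure-Python sketch of a per-token cross-entropy and a single Adam update for one scalar parameter. Only the learning rate of 5e-5 comes from this card; the `beta1`, `beta2`, and `eps` values are the commonly used defaults and are assumptions, as is the absence of weight decay or a learning-rate schedule. The function names are illustrative.

```python
import math


def token_cross_entropy(logits, target_index):
    """Cross-entropy for one decoding step: -log softmax(logits)[target_index]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def adam_step(param, grad, m, v, step, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (new_param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# A uniform prediction over 4 tokens costs log(4) nats.
loss = token_cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

# On the first step the bias-corrected update has magnitude close to lr.
new_param, m, v = adam_step(param=1.0, grad=0.3, m=0.0, v=0.0, step=1)
```

In sequence training the per-token losses are typically averaged over the target tokens, and Adam's normalization means the early updates move each weight by roughly the learning rate regardless of the gradient scale.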
Limitations
- Language Specificity: This model is optimized for Tamil script recognition. Performance on other languages or mixed-language documents may not be ideal.
- Image Quality: The model's performance is dependent on the quality of the input image. Images that are too blurry, noisy, or have poor lighting may produce less accurate results.
- Text Length: The model is optimized for extracting text with a maximum length of 64 characters. Longer texts might be truncated or inaccurately predicted.
- Small Text: The model may struggle with images containing very small or intricate fonts.
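The card does not say how the 64-character limit is counted. In Tamil the distinction matters, because vowel signs are separate Unicode code points attached to a base letter. The sketch below compares the two counts for the sample word from this card; `length_report` is a hypothetical helper, and counting non-mark characters only approximates perceived characters (it is not full grapheme clustering).

```python
import unicodedata


def length_report(text: str) -> dict:
    """Count code points and non-mark characters in a string."""
    code_points = len(text)
    # Vowel signs and the pulli are combining marks (Unicode category M*).
    base_chars = sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))
    return {"code_points": code_points, "base_chars": base_chars}


# "செகுவேரா", the widget example from this card, spelled out as code points.
word = "\u0b9a\u0bc6\u0b95\u0bc1\u0bb5\u0bc7\u0bb0\u0bbe"
print(length_report(word))  # {'code_points': 8, 'base_chars': 4}
```

If the limit applies to code points or decoder tokens rather than perceived characters, the usable text length is shorter than 64 visible letters.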
Evaluation
The model was evaluated using standard OCR benchmarks with an emphasis on Tamil text recognition. The primary evaluation metrics were character-level accuracy and Word Error Rate (WER).
- Character Accuracy: Achieved ~79% accuracy on validation sets.
- Train Loss: 0.063800
- Validation Loss: 0.172539
- CER: 0.072717
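For reference, Character Error Rate is conventionally the edit distance between the predicted and reference strings divided by the reference length. The following is a minimal sketch of that definition over Unicode code points; whether the figure above was computed exactly this way (for example, with or without text normalization) is not stated in this card, and the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) over code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("tamil", "tamil"))  # 0.0
print(cer("tamil", "tamal"))  # 0.2
```

Because insertions count as errors, CER can exceed 1.0 when the prediction is much longer than the reference.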
Ethical Considerations
This model, while useful for Tamil text extraction, should be applied with caution in contexts where incorrect text extraction could lead to harmful outcomes, such as legal or medical document analysis.
License
This model is distributed under the MIT license. Please check the Hugging Face repository for specific terms.
widget:
- src: "./samples/72.jpg"
  example_title: "Example Image"
  outputs:
  - label: "Text"
    content: "செகுவேரா"