---
language:
  - en
  - zh
  - multilingual
tags:
  - Image-to-Text
  - OCR
  - Image-Captioning
  - Text-Recognition
datasets:
  - priyank-m/text_recognition_en_zh_clean
metrics:
  - cer
---

Multilingual OCR (m_OCR) is a VisionEncoderDecoder model, based on the TrOCR approach, for English and Chinese document text recognition. It combines a pre-trained vision encoder with a pre-trained language model used as the decoder.

Encoder model used: facebook/vit-mae-large

Decoder model used: xlm-roberta-base
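
As a rough sketch (not the exact training code for this checkpoint), the two models above can be paired with the VisionEncoderDecoder API from the transformers library; the special-token wiring below is a typical assumption for this kind of setup, not a documented part of this card.

```python
# Minimal sketch: pairing the two pre-trained checkpoints with the
# transformers VisionEncoderDecoder API. The token-ID wiring is a common
# assumption for ViT + XLM-RoBERTa pairs, not necessarily the exact m_OCR
# training configuration.
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-large")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/vit-mae-large",  # vision encoder
    "xlm-roberta-base",        # language model used as decoder
)

# Tell the combined model how generated sequences start, pad and end.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id
```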

Notes and observations:

  1. TrOCR starts from open-source pre-trained models, but the paper also mentions a text-recognition pre-training stage on 684 million samples, followed by a second training stage that used additional data.
  2. TrOCR was pre-trained on 32 V100 GPUs with 32 GB memory each and fine-tuned on 8 V100 GPUs, with a batch size of 2,048 and a learning rate of 5e-5.
  3. The diagram in the paper is a bit misleading: the image is first resized and then divided into 16x16 patches, but the diagram does not show the resizing step.
  4. The first idea was to use DiT, since it was trained on 41 million document images and could in theory have given a good boost in performance, but its actual performance was extremely poor, so the model was discarded.
  5. Several other models were tried, but not all of them fit together; the VisionEncoderDecoder model throws an error for incompatible pairs.
  6. Another idea was to use BLOOM, as it had recently been released at the time of writing, but it actually requires a value indicating which language is being processed, which makes it unsuitable for building a multilingual OCR.
  7. The models that worked best together for me were ViT and RoBERTa.
  8. The TrOCR paper does not mention what happens if you pair a large vision model with a base language model; m_OCR uses this configuration.
  9. A large amount of data covering a wide variety of variations is required to get good performance. Training m_OCR on a 200K-sample dataset gave very poor results; training on approximately 1.4 million samples raised performance to a good level.
  10. Using large datasets, for example close to 1 million samples, starts posing additional difficulties: downloading and uploading the data, and even cleaning it, become quite slow when relying only on free resources available on the internet.
  11. Using the set_transform function to transform samples on the fly was a good idea, as it avoids having to save the transformed dataset (a minimal sketch is included after this list).
  12. Streaming the dataset might be another good option if the dataset size were to increase any further.
  13. The free GPU on Colab does not seem sufficient for this experiment: keeping two models in GPU memory during training forces a small batch size, and the free GPUs (T4) are not fast enough.
  14. A very important data-cleaning step was simply checking whether each sample's image and text can be converted into the input format expected by the model. In particular, the text must remain non-empty when converted back from input IDs to text, because some characters are not recognized by the tokenizer and get mapped to special tokens, which are usually skipped when decoding, and the CER calculation requires non-empty references (see the cleaning sketch after this list).
  15. Resuming model training was spending almost one, and sometimes two, hours just skipping already-seen batches. One possible way to avoid this waste is to shuffle the training dataset before starting training and then skip the batch-skipping step. This would be particularly useful if the dataset size increases further.
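
Following note 14, here is a minimal sketch of the round-trip cleaning filter. The split name and the "text" column name are assumptions about the dataset layout, and the tokenizer comes from the snippet near the top of this card.

```python
# Minimal sketch of the cleaning step from note 14: keep only samples whose
# text survives an encode -> decode round trip (skipping special tokens) as a
# non-empty string, so the CER computation has non-empty references.
# The split name and the "text" column name are assumptions.
from datasets import load_dataset

raw_dataset = load_dataset("priyank-m/text_recognition_en_zh_clean", split="train")

def text_survives_round_trip(example):
    input_ids = tokenizer(example["text"]).input_ids
    decoded = tokenizer.decode(input_ids, skip_special_tokens=True)
    return len(decoded.strip()) > 0

clean_dataset = raw_dataset.filter(text_survives_round_trip)
```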
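
And for note 11, a sketch of on-the-fly preprocessing with set_transform, reusing the image processor, tokenizer and clean_dataset from the snippets above; the "image" column name and the maximum label length are assumptions.

```python
# Minimal sketch of on-the-fly preprocessing (note 11): the transform runs
# every time a sample is accessed, so the transformed dataset never has to
# be written back to disk. Column names and max_length are assumptions.
def preprocess(batch):
    pixel_values = image_processor(
        [img.convert("RGB") for img in batch["image"]],
        return_tensors="pt",
    ).pixel_values
    labels = tokenizer(
        batch["text"],
        padding="max_length",
        max_length=64,
        truncation=True,
        return_tensors="pt",
    ).input_ids
    return {"pixel_values": pixel_values, "labels": labels}

clean_dataset.set_transform(preprocess)  # applied lazily, in place
```

For note 12, load_dataset also accepts streaming=True, which returns an iterable dataset that is read on the fly instead of being downloaded in full.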