---
license: mit
datasets:
  - agomberto/FrenchCensus-handwritten-texts
language:
  - fr
pipeline_tag: image-to-text
tags:
  - pytorch
  - transformers
  - trocr
widget:
  - src: >-
      https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/4.png
    example_title: Example 1
  - src: >-
      https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/5.jpg
    example_title: Example 2
metrics:
  - cer
  - wer
---

# TrOCR base handwritten for French

## Overview

A handwritten TrOCR model has not yet been released for French, so we trained this French model as a proof of concept (PoC). Starting from this model, we recommend collecting more data to continue the first-stage pre-training, or to fine-tune it further as a second stage.

It is the French counterpart of the English large handwritten TrOCR model introduced in the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Li et al., first released as a TrOCR model fine-tuned on the IAM dataset.

We decided to fine-tune it on two datasets:

1. The [French Census dataset](https://huggingface.co/datasets/agomberto/FrenchCensus-handwritten-texts) from Constum et al., which we also published on the hub.
2. A dataset of French archives, to be released soon.

## Model description

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder, and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder. Next, the Transformer text decoder autoregressively generates tokens.
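As a quick sanity check of the patch arithmetic above — assuming the 384×384 default input resolution of the TrOCR image processor (an assumption about this checkpoint; check `processor.size` to confirm):

```python
# Sketch: how many patch tokens the image encoder sees.
# image_size = 384 is an assumption (the TrOCR processor default),
# not something stated in this card.
image_size = 384
patch_size = 16

patches_per_side = image_size // patch_size   # 24 patches per side
num_patches = patches_per_side ** 2           # 576 patch tokens in total

# each 16x16 RGB patch is flattened before the linear embedding
patch_dim = patch_size * patch_size * 3       # 768 values per patch

print(num_patches, patch_dim)
```

So the encoder processes a sequence of 576 embedded patches (plus position embeddings) for a 384×384 input.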

## Intended uses & limitations

You can use the raw model for optical character recognition (OCR) on single text-line images.

## Parameters

We used heuristic parameters without separate hyperparameter tuning.

- `learning_rate` = 4e-5
- `epochs` = 10
- `fp16` = True
- `max_length` = 32
- train/dev split: 90/10
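For illustration, the 90/10 train/dev split above can be reproduced with a simple shuffled split. This helper is a sketch (the function name, seed, and structure are ours, not taken from the actual training script):

```python
import random


def split_train_dev(examples, dev_frac=0.1, seed=42):
    """Shuffle and split examples into train/dev sets (90/10 by default).

    Illustrative only; the actual training script may split differently.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_dev = int(len(examples) * dev_frac)
    dev = [examples[i] for i in indices[:n_dev]]
    train = [examples[i] for i in indices[n_dev:]]
    return train, dev


train, dev = split_train_dev(list(range(100)))
print(len(train), len(dev))  # 90 10
```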

## Metrics

On the dev set we obtained the following results:

- size of the dev set: 1,550 examples
- CER: 0.07
- WER: 0.20

On the test set (from the French Census dataset only) we obtained:

- size of the test set: 730 examples
- CER: 0.11
- WER: 0.25
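CER and WER are edit-distance metrics: the number of character (or word) insertions, deletions, and substitutions needed to turn the prediction into the reference, divided by the reference length. A minimal pure-Python sketch of the computation (the reported numbers were produced with standard evaluation libraries, not this helper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference, prediction):
    """Character error rate: character edits / reference length."""
    return edit_distance(list(reference), list(prediction)) / len(reference)


def wer(reference, prediction):
    """Word error rate: word edits / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


print(cer("bonjour", "bonjor"))          # 1 deletion / 7 chars ≈ 0.143
print(wer("le petit chat", "le petit chien"))  # 1 substitution / 3 words ≈ 0.333
```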

## How to use

Here is how to use this model in PyTorch:

```python
from io import BytesIO

import requests
from PIL import Image
from transformers import AutoTokenizer, TrOCRProcessor, VisionEncoderDecoderModel

# load a sample image from the raw GitHub URL
url = "https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/5.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-large-handwritten')
model = VisionEncoderDecoderModel.from_pretrained('agomberto/trocr-large-handwritten-fr')
tokenizer = AutoTokenizer.from_pretrained('agomberto/trocr-large-handwritten-fr')

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```