---
license: mit
datasets:
- agomberto/FrenchCensus-handwritten-texts
language:
- fr
pipeline_tag: image-to-text
tags:
- pytorch
- transformers
- trocr
widget:
- src: >-
    https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/4.png
  example_title: Example 1
- src: >-
    https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/5.jpg
  example_title: Example 2
metrics:
- cer
- wer
---

# TrOCR base handwritten for French

## Overview

A handwritten TrOCR model has not yet been released for French, so we trained one as a proof of concept (PoC). Starting from this model, we recommend collecting more data to continue the first-stage training or to fine-tune it in a second stage.

It builds on the [English large handwritten TrOCR model](https://huggingface.co/microsoft/trocr-large-handwritten) introduced in the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Li et al., first released in [this repository](https://github.com/microsoft/unilm/tree/master/trocr) as a TrOCR model fine-tuned on the [IAM dataset](https://fki.tic.heia-fr.ch/databases/iam-handwriting-database).

We fine-tuned it on two datasets:
1. The [French Census dataset](https://zenodo.org/record/6581158) from Constum et al., which we also published as a [dataset on the hub](https://huggingface.co/datasets/agomberto/FrenchCensus-handwritten-texts) (see the loading snippet after this list).
2. A dataset of French archives, to be released soon.
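
The Hub version can be loaded directly with the `datasets` library; a minimal check (the available splits and columns are whatever the dataset card defines):

```python
from datasets import load_dataset

# French Census handwritten texts as published on the Hugging Face Hub
dataset = load_dataset("agomberto/FrenchCensus-handwritten-texts")
print(dataset)  # shows the available splits and columns
```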

## Model description

The TrOCR model is an encoder-decoder model, consisting of an image Transformer as encoder and a text Transformer as decoder. The image encoder was initialized from the weights of BEiT, while the text decoder was initialized from the weights of RoBERTa.

Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. Absolute position embeddings are added before the sequence is fed to the layers of the Transformer encoder. The text decoder then autoregressively generates tokens.
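
As a rough illustration of the patching arithmetic, here is a minimal shape check, assuming the processor's default behavior of resizing inputs to a fixed 384x384 square:

```python
from PIL import Image
from transformers import TrOCRProcessor

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")

# A blank stand-in for a text-line image, just to inspect shapes
image = Image.new("RGB", (1200, 80), "white")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

print(pixel_values.shape)  # torch.Size([1, 3, 384, 384])
print((384 // 16) ** 2)    # 576 patches of 16x16 go into the encoder
```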

## Intended uses & limitations

You can use the raw model for optical character recognition (OCR) on single text-line images.

## Parameters

We used heuristic parameters without separate hyperparameter tuning:
- learning_rate = 4e-5
- epochs = 10
- fp16 = True
- max_length = 32
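
The training script itself is not part of this card; below is a sketch of how these values could map onto `Seq2SeqTrainingArguments`, where the batch size and output directory are illustrative placeholders, not values from the card:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./trocr-large-handwritten-fr",  # placeholder
    learning_rate=4e-5,               # from the list above
    num_train_epochs=10,              # epochs = 10
    fp16=True,                        # fp16 = True
    predict_with_generate=True,
    generation_max_length=32,         # assumes max_length refers to generation
    per_device_train_batch_size=8,    # placeholder, not from the card
    evaluation_strategy="epoch",
)
```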

## Results

On the dev set we obtained the following results:
- size of the dev set: 1300 examples
- CER: 0.12
- WER: 0.37

On the test set (from French Census only) we obtained:
- size of the test set: 700 examples
- CER: 0.16
- WER: 0.81
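
The card does not state which implementation produced these scores; one standard way to compute CER/WER on your own predictions is the `evaluate` library:

```python
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

# Illustrative strings; use your model outputs and ground-truth transcriptions
predictions = ["Jean Dupont cultivateur"]
references = ["Jean Dupont, cultivateur"]

print(cer_metric.compute(predictions=predictions, references=references))
print(wer_metric.compute(predictions=predictions, references=references))
```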

### How to use

Here is how to use this model in PyTorch:

```python
from io import BytesIO

import requests
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, AutoTokenizer

# Fetch a sample line image (raw file, not the GitHub HTML page)
url = "https://raw.githubusercontent.com/agombert/trocr-base-printed-fr/main/sample_imgs/5.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("agomberto/trocr-large-handwritten-fr")
tokenizer = AutoTokenizer.from_pretrained("agomberto/trocr-large-handwritten-fr")

pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```