---
license: apache-2.0
language:
- fi
- sv
metrics:
- cer
pipeline_tag: image-to-text
---
# Model description

**Model Name:** multicentury-htr-model

**Model Type:** Transformer-based OCR (TrOCR)

**Base Model:** microsoft/trocr-large-handwritten

**Purpose:** Handwritten text recognition

**Languages:** Swedish, Finnish

**License:** Apache 2.0

This model is a fine-tuned version of the microsoft/trocr-large-handwritten model, specialized for recognizing handwritten text. It has been trained on a variety of datasets spanning the 17th to the 20th century and can be used for applications such as document digitization, form recognition, or any other task involving handwritten text extraction.

# Model Architecture

The model is based on a Transformer architecture (TrOCR) with an encoder-decoder setup:

- The encoder processes images of handwritten text.
- The decoder generates corresponding text output.
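Both halves are exposed through the standard `transformers` API, so the setup can be inspected directly after loading the checkpoint (a minimal sketch; the exact class names printed depend on your `transformers` version):

```python
from transformers import VisionEncoderDecoderModel

# Load the fine-tuned checkpoint and look at its two halves
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/multicentury-htr-model")

print(type(model.encoder).__name__)  # image encoder: processes the handwritten-text image as patches
print(type(model.decoder).__name__)  # text decoder: autoregressively generates the transcription
```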

# Intended Use

This model is designed for handwritten text recognition and is intended for use in:

- Document digitization (e.g., archival work, historical manuscripts)
- Handwritten notes transcription

# Training data

The training dataset includes more than 760 000 samples of handwritten text rows, covering a wide variety of handwriting styles and text types.

# Evaluation

The model was evaluated on a held-out test dataset. Key metrics:

**Character Error Rate (CER):** 3.2 %

**Test Dataset Description:** size ~94 900 text rows
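
CER can be computed for your own test transcriptions with, for example, the Hugging Face `evaluate` library (a sketch, not the evaluation script used for the numbers above; it assumes the `evaluate` and `jiwer` packages are installed):

```python
import evaluate

# CER = (character substitutions + insertions + deletions) / characters in the reference
cer_metric = evaluate.load("cer")

predictions = ["handwrtten text"]   # model output (hypothetical example)
references = ["handwritten text"]   # ground-truth transcription

print(cer_metric.compute(predictions=predictions, references=references))
```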

# How to Use the Model

You can use the model directly with Hugging Face’s pipeline function or by manually loading the processor and model.

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

# Load the model and processor
processor = TrOCRProcessor.from_pretrained("Kansallisarkisto/multicentury-htr-model/processor")
model = VisionEncoderDecoderModel.from_pretrained("Kansallisarkisto/multicentury-htr-model")

# Open an image of a handwritten text line and convert it to RGB (the processor expects 3-channel input)
image = Image.open("path_to_image.png").convert("RGB")

# Preprocess and predict
pixel_values = processor(image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(generated_text)

```
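
The same checkpoint can also be used through the high-level `pipeline` API (a sketch; if the processor components are not found at the repo root, load the processor manually as above and pass its tokenizer and image processor to the pipeline):

```python
from transformers import pipeline

# "image-to-text" wraps the preprocessing and generate() call shown above
ocr = pipeline("image-to-text", model="Kansallisarkisto/multicentury-htr-model")

print(ocr("path_to_image.png")[0]["generated_text"])
```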

# Limitations and Biases

The model was trained primarily on handwritten text that uses basic Latin characters (A-Z, a-z) and the Nordic special characters å, ä and ö. It has not been trained on non-Latin writing systems such as Cyrillic, Arabic, Hebrew, or Chinese script.
The model may not generalize well to languages other than Finnish, Swedish, or English.

# Future Work

Potential improvements for this model include:

- Expanding training data: Incorporating more diverse handwriting styles and languages.
- Optimizing for specific domains: Fine-tuning the model on domain-specific handwriting.

# Citation

If you use this model in your work, please cite it as:

```bibtex
@misc{multicentury_htr_model_2024,
  author       = {Kansallisarkisto},
  title        = {Multicentury HTR Model: Handwritten Text Recognition},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Kansallisarkisto/multicentury-htr-model/}}
}
```

## Model Card Authors

**Author:** Kansallisarkisto

**Contact:** riikka.marttila@kansallisarkisto.fi, ilkka.jokipii@kansallisarkisto.fi