Update README.md
README.md
CHANGED
@@ -1,102 +1,41 @@
|
|
1 |
---
|
2 |
license: mit
|
3 |
metrics:
|
4 |
-
- mean_iou
|
5 |
datasets:
|
6 |
-
- Riksarkivet/placeholder_region_segmentation
|
7 |
tags:
|
8 |
-
- mmdet
|
9 |
-
-
|
10 |
-
- instance segmentation
|
11 |
-
library_name:
|
12 |
library_version: 0.0.1
|
13 |
inference: false
|
14 |
-
language:
|
15 |
-
- sv
|
16 |
pipeline_tag: image-segmentation
|
17 |
---
|
18 |
|
19 |
-
|
20 |
## Model Description
|
21 |
-
The Swedish National Archives presents an end-to-end Handwritten Text Recognition (HTR) pipeline for running-text documents ranging from the mid 17th century to the late 19th century. The pipeline consists of the following components:
|
22 |
|
23 |
-
|
24 |
|
25 |
-
|
26 |
|
27 |
-
|
28 |
|
29 |
## Evaluation
|
30 |
|
31 |
-
|
32 |
-
|
33 |
-
The reported performance metrics are obtained on several test sets from archives that weren't included in the training set, spanning the entire time period the model was trained on. These error rates are therefore what you should expect if you run the pipeline out-of-the-box on your own documents, provided that the documents contain running text and fall within the model's time period. Note that actual performance may vary depending on the specific layout and handwriting styles encountered in the document.
|
34 |
-
|
35 |
-
| Model | train-eval | 1661-testset | 1664-testset | 1688-testset-unusual-layout | 1735-testset | 1740-1793-testset | 1777-testset | 1840-1890-testset | 1861-testset |
|
36 |
-
|----------------|-------------|--------------|--------------|-----------------------------|--------------|-------------------|--------------|-------------------|--------------|
|
37 |
-
|SATRN_1650_1900 | 0.033 | 0.096 | 0.078 | 0.215 | 0.079 | 0.066 | 0.074 | 0.037 | 0.043 |
|
38 |
-
|SATRN_1650_1800 | 0.039 | 0.109 | 0.085 | 0.243 | 0.079 | 0.079 | 0.087 | 0.239 | 0.157 |
|
39 |
-
|SATRN_1800_1900 | 0.031 | 0.455 | 0.382 | 0.381 | 0.309 | 0.252 | 0.182 | 0.046 | 0.051 |
|
40 |
-
|
41 |
-
The lower two rows are for comparison only. You can see that the model trained exclusively on the 19th century actually performed worse on 19th century testsets than the model trained on the entire time-period. This was the reason we only published the aggregated model rather than models specialized on a specific century.
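To make the numbers in the table above easier to interpret, here is a minimal sketch of how a character error rate (CER) can be computed, assuming the reported values are CERs based on edit distance. The function names `levenshtein` and `character_error_rate` are illustrative and are not taken from the pipeline's own evaluation code.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the ground-truth transcription."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under this definition, a value of 0.078 would mean roughly 8 character edits per 100 ground-truth characters.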
|
42 |
-
|
43 |
-
Regular evaluations are conducted to monitor and improve the performance of the pipeline. As new evaluation results become available, this table will be updated to reflect the most recent performance metrics.
|
44 |
-
|
45 |
-
We also did some fine-tuning experiments to give an idea of the performance benefits of finetuning the model on domain-specific material, as well as a rough estimate of how many pages one needs to transcribe to do the fine-tuning.
|
46 |
-
|
47 |
-
| Model | 16th-century-testsets-combined | 17th-century-testsets-combined | 18th-century-testsets-combined |
|
48 |
-
|---------------------|--------------------------------|--------------------------------|--------------------------------|
|
49 |
-
| SATRN_1650_1900 | 0.124 | 0.095 | 0.038 |
|
50 |
-
| SATRN_1650_1900_ft | 0.064 | 0.084 | 0.026 |
|
51 |
-
| Number of pages | 57 | 28 | 29 |
|
52 |
-
|
53 |
-
As seen, 50-60 transcribed pages are enough to halve the CER on 17th century documents. 30 pages of transcribed text give significant improvements on 18th and 19th century text, but the improvements are not as steep. Our recommendation, if you have a large domain you want to run the pipeline on, is to transcribe 50-100 pages and fine-tune the text-recognition model on this data. Guides on how to do this will be forthcoming.
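The relative gains from fine-tuning can be worked out directly from the table above. The snippet below is only a small arithmetic sketch: the values are copied from the table, the dictionary keys follow its column headers, and the helper `relative_reduction` is an illustrative name.

```python
# Error rates copied from the fine-tuning table, keyed by its column headers.
baseline = {"16th": 0.124, "17th": 0.095, "18th": 0.038}
finetuned = {"16th": 0.064, "17th": 0.084, "18th": 0.026}
pages = {"16th": 57, "17th": 28, "18th": 29}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed by fine-tuning."""
    return (before - after) / before


for column in baseline:
    reduction = relative_reduction(baseline[column], finetuned[column])
    print(f"{column}: {pages[column]} pages -> {reduction:.1%} lower error rate")
# 16th: 57 pages -> 48.4% lower error rate
# 17th: 28 pages -> 11.6% lower error rate
# 18th: 29 pages -> 31.6% lower error rate
```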
|
54 |
-
|
55 |
-
|
56 |
-
## Intended Use
|
57 |
-
The Swedish National Archives HTR pipeline is intended to be used for the following purposes:
|
58 |
-
|
59 |
-
- Handwritten Text Recognition: The pipeline enables the automatic recognition of handwritten text in running-text documents from the 17th to the 19th century. It can be utilized by researchers, historians, and archivists to efficiently transcribe and analyze historical texts.
|
60 |
-
|
61 |
-
- Document Digitization: The pipeline aids in the process of digitizing archival documents by automating the extraction and transcription of handwritten text. This facilitates broader accessibility and preservation of historical materials.
|
62 |
-
|
63 |
-
It's important to note that the pipeline is optimized for running-text documents from the specified time period and may not perform optimally for other types of documents or handwriting styles.
|
64 |
-
Additionally, it is currently more suitable for documents from books rather than complex layouts from either tables or newspapers.
|
65 |
-
|
66 |
-
## Performance and Limitations
|
67 |
-
The performance of the Swedish National Archives HTR pipeline is influenced by several factors:
|
68 |
-
|
69 |
-
- **Accuracy**: The pipeline achieves high accuracy in segmenting text regions and lines, as well as recognizing the text content accurately. However, the recognition accuracy may vary depending on the quality of the original document, handwriting style, and legibility.
|
70 |
-
|
71 |
-
- **Speed**: The pipeline aims to provide real-time or near real-time performance for efficient processing of handwritten text documents. The speed may vary depending on the hardware used for inference.
|
72 |
-
|
73 |
-
- **Document Specificity**: The pipeline is specifically trained for running-text documents from the 17th to the 19th century. It may not perform optimally for documents outside this time period or for documents with non-typical layouts.
|
74 |
-
|
75 |
-
- **Language Limitations**: The pipeline is mainly for Swedish text recognition. While it may handle other languages to some extent, Finnish for example, its performance may not be as accurate as for Swedish.
|
76 |
-
|
77 |
-
- **Handwriting Style**: The pipeline is optimized for the cursive handwriting style prevalent in the historical documents of the Swedish National Archives. It may not perform as well for other handwriting styles, such as block letters or highly stylized scripts.
|
78 |
|
79 |
## Training Data
|
80 |
-
The Swedish National Archives HTR pipeline was trained using a diverse dataset of binarized, running-text documents from the 17th to the 19th century. The training data includes various types of historical texts, such as letters, manuscripts, and official records.
|
81 |
|
82 |
-
|
83 |
-
|
84 |
-
The training data was annotated to provide ground truth for text region and line segmentation, as well as text transcription. Expert archivists and historians contributed to the annotation process to ensure accurate labeling.
|
85 |
-
|
86 |
-
The data can be found here: (WIP will be added soon)
|
87 |
-
|
88 |
-
## Caveats and Future Work
|
89 |
-
Although the Swedish National Archives HTR pipeline has been trained and optimized for running-text documents from the specified time period, there are a few caveats and considerations to keep in mind:
|
90 |
-
|
91 |
-
Continuous Improvement: The pipeline is continuously being updated and improved as new training data becomes available and advancements in OCR technology occur. With access to more training data, the models will be updated to further enhance their performance and adaptability.
|
92 |
-
|
93 |
-
User Feedback: Users are encouraged to provide feedback on the pipeline's performance, identify issues, and report any potential biases or limitations. This feedback is highly valuable in refining the pipeline, addressing concerns, and informing future updates.
|
94 |
|
95 |
## References
|
96 |
If you would like to learn more about the Swedish National Archives HTR pipeline or access the training data, please refer to the following resources:
|
97 |
|
98 |
-
- [Swedish National Archives](https://
|
99 |
- [MMDetection](https://github.com/open-mmlab/mmdetection)
|
100 |
-
- [
|
101 |
-
- [SATRN Paper](https://arxiv.org/abs/2012.05483)
|
102 |
-
- [OpenMMLab OCR Toolbox](https://openmmlab.com/mmocr/)
|
1 |
---
|
2 |
license: mit
|
3 |
metrics:
|
4 |
+
- mean_iou
|
5 |
datasets:
|
6 |
+
- Riksarkivet/placeholder_region_segmentation
|
7 |
tags:
|
8 |
+
- mmdet
|
9 |
+
- htrflow_core
|
10 |
+
- instance segmentation
|
11 |
+
library_name: htrflow_core
|
12 |
library_version: 0.0.1
|
13 |
inference: false
|
14 |
pipeline_tag: image-segmentation
|
15 |
---
|
16 |
|
17 |
## Model Description
|
18 |
|
19 |
+
**RTMDet** is an instance segmentation and object detection model from [OpenMMLab](https://mmyolo.readthedocs.io/en/latest/recommended_topics/algorithm_descriptions/rtmdet_description.html) and was trained using [MMDetection](https://mmdetection.readthedocs.io/en/latest/). This RTMDet model is fine-tuned to segment text regions within documents, which enables pre-localization of text-line regions. This is a crucial step because current text-recognition models work at the text-line level.
|
20 |
|
21 |
+
## Usage
|
22 |
|
23 |
+
```python
|
24 |
+
#WIP
|
25 |
+
```
|
26 |
|
27 |
## Evaluation
|
28 |
|
29 |
+
(WIP)
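Until results are published, the sketch below illustrates what the `mean_iou` metric listed in the model card metadata measures. It is an assumption-laden illustration, not the evaluation code used for this model: masks are plain nested lists of 0/1, predictions are assumed to be already paired with their ground-truth regions, and `mask_iou` and `mean_iou` are illustrative names.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels; both masks share one shape


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) mask pairs."""
    scores = [mask_iou(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("at least one mask pair is required")
    return sum(scores) / len(scores)
```

For example, a predicted region covering two pixels where only one of them belongs to the ground-truth region gives an IoU of 0.5.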
|
30 |
|
31 |
## Training Data
|
32 |
|
33 |
+
(WIP)
|
34 |
|
35 |
## References
|
36 |
+
|
37 |
If you would like to learn more about the Swedish National Archives HTR pipeline or access the training data, please refer to the following resources:
|
38 |
|
39 |
+
- [The AI-lab at the Swedish National Archives](https://github.com/Swedish-National-Archives-AI-lab)
|
40 |
- [MMDetection](https://github.com/open-mmlab/mmdetection)
|
41 |
+
- [RTMDet Paper](https://paperswithcode.com/paper/rtmdet-an-empirical-study-of-designing-real)