Commit
·
9bf3620
1
Parent(s):
a3c4f82
Update README.md
README.md
CHANGED
@@ -10,7 +10,7 @@ library_name: transformers
 pipeline_tag: token-classification
 ---
 
-## Finnish named entity recognition
+## Finnish named entity recognition
 
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
@@ -62,17 +62,15 @@ Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annot
 dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
 were added during the annotation process. The different data sources used in model training are listed below:
 
-Dataset|Period covered by the texts|Text type
-
-[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts
-[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd newspaper articles
-
-
-
-
+Dataset|Period covered by the texts|Text type|Percentage of the data
+-|-|-|-
+[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts|23%
+[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd digitized newspaper articles|3%
+Diverse document data from Finnish public administration|1970s - 2000s|OCR'd digitized documents|69%
+Finnish senate documents|1916|Part manually transcribed, part HTR'd digitized documents|3%
+Finnish books from [Project Gutenberg](https://www.gutenberg.org)|Early 20th century|OCR'd texts|1%
+Theses from Finnish polytechnic universities |2000s|OCR'd texts|1%
 
-In addition to this dataset, OCR'd and annotated content of
-digitized documents from Finnish public administration was also used for model training.
 The number of entities belonging to the different
 entity classes contained in training, validation and test datasets are listed below:
 
@@ -80,8 +78,8 @@ entity classes contained in training, validation and test datasets are listed be
 Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
 -|-|-|-|-|-|-|-|-|-|-
 Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068
-Val|
-Test|
+Val|2478|5360|127|2428|1202|234|2898|308|235|282
+Test|2377|5470|178|2334|1098|185|2782|273|354|355
 
 ## Training procedure
 
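Now that the Val and Test rows are filled in, the per-split totals and each split's share of all annotated entities can be derived directly from the table. The following is a minimal sketch of that arithmetic, not part of the README itself. The counts are copied from the updated table, and the names `COUNTS`, `split_totals` and `split_shares` are hypothetical helpers introduced only for this illustration.

```python
# Entity counts copied from the updated README table in this commit.
# Column order: PERSON, ORG, LOC, GPE, PRODUCT, EVENT, DATE, JON, FIBC, NORP
COUNTS = {
    "Train": [11691, 30026, 868, 12999, 7473, 1184, 14918, 1360, 1879, 2068],
    "Val": [2478, 5360, 127, 2428, 1202, 234, 2898, 308, 235, 282],
    "Test": [2377, 5470, 178, 2334, 1098, 185, 2782, 273, 354, 355],
}


def split_totals(counts):
    """Sum the entity counts of every class for each split (hypothetical helper)."""
    return {split: sum(values) for split, values in counts.items()}


def split_shares(counts):
    """Return each split's fraction of all annotated entities (hypothetical helper)."""
    totals = split_totals(counts)
    grand_total = sum(totals.values())
    return {split: total / grand_total for split, total in totals.items()}


if __name__ == "__main__":
    totals = split_totals(COUNTS)
    shares = split_shares(COUNTS)
    for split in COUNTS:
        print(f"{split}: {totals[split]} entities ({shares[split]:.1%})")
```

Running the script prints one line per split, which makes it easy to check how the validation and test sets compare in size to the training set.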