MikkoLipsanen committed on
Commit 9bf3620 · 1 Parent(s): a3c4f82

Update README.md

Files changed (1)
  1. README.md +11 -13
README.md CHANGED
@@ -10,7 +10,7 @@ library_name: transformers
  pipeline_tag: token-classification
  ---
 
- ## Finnish named entity recognition ** WORK IN PROGRESS **
+ ## Finnish named entity recognition
 
  The model performs named entity recognition from text input in Finnish.
  It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
@@ -62,17 +62,15 @@ Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annot
  dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
  were added during the annotation process. The different data sources used in model training are listed below:
 
- Dataset|Period covered by the texts|Text type
- -|-|-
- [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts
- [NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd newspaper articles
- Finnish document data digitized by the National Archives of Finland|1970s - 2000s|OCR'd
-
-
-
+ Dataset|Period covered by the texts|Text type|Percentage of the data
+ -|-|-|-
+ [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts|23%
+ [NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd digitized newspaper articles|3%
+ Diverse document data from Finnish public administration|1970s - 2000s|OCR'd digitized documents|69%
+ Finnish senate documents|1916|Part manually transcribed, part HTR'd digitized documents|3%
+ Finnish books from [Project Gutenberg](https://www.gutenberg.org)|Early 20th century|OCR'd texts|1%
+ Theses from Finnish polytechnic universities |2000s|OCR'd texts|1%
 
- In addition to this dataset, OCR'd and annotated content of
- digitized documents from Finnish public administration was also used for model training.
  The number of entities belonging to the different
  entity classes contained in training, validation and test datasets are listed below:
 
@@ -80,8 +78,8 @@ entity classes contained in training, validation and test datasets are listed be
  Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
  -|-|-|-|-|-|-|-|-|-|-
  Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068
- Val|1542|4042|108|1654|879|160|1858|177|257|299
- Test|1267|3698|86|1713|901|137|1843|174|233|260
+ Val|2478|5360|127|2428|1202|234|2898|308|235|282
+ Test|2377|5470|178|2334|1098|185|2782|273|354|355
 
  ## Training procedure
 