Kansallisarkisto/finbert-ner

Token Classification

Transformers

PyTorch

Finnish

bert

Inference Endpoints

MikkoLipsanen committed on Sep 12, 2023

Commit 9bf3620

1 Parent(s): a3c4f82

Update README.md

README.md +11 -13

@@ -10,7 +10,7 @@ library_name: transformers
 pipeline_tag: token-classification
 ---
-## Finnish named entity recognition ** WORK IN PROGRESS **
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
@@ -62,17 +62,15 @@ Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annot
 dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
 were added during the annotation process. The different data sources used in model training are listed below:
-Dataset|Period covered by the texts|Text type
--|-|-
-[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts
-[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd newspaper articles
-Finnish document data digitized by the National Archives of Finland|1970s - 2000s|OCR'd
-In addition to this dataset, OCR'd and annotated content of
-digitized documents from Finnish public administration was also used for model training.
 The number of entities belonging to the different
 entity classes contained in training, validation and test datasets are listed below:
@@ -80,8 +78,8 @@ entity classes contained in training, validation and test datasets are listed be
 Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
 -|-|-|-|-|-|-|-|-|-|-
 Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068
-Val|1542|4042|108|1654|879|160|1858|177|257|299
-Test|1267|3698|86|1713|901|137|1843|174|233|260
 ## Training procedure

 pipeline_tag: token-classification
 ---
+## Finnish named entity recognition
 The model performs named entity recognition from text input in Finnish.
 It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1),
 dataset were filtered out from the dataset used for training the model. On the other hand, entities that were missing from the [NewsEye dataset](https://zenodo.org/record/4573313)
 were added during the annotation process. The different data sources used in model training are listed below:
+Dataset|Period covered by the texts|Text type|Percentage of the data
+-|-|-|-
+[Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)|2000s|Online texts|23%
+[NewsEye dataset](https://zenodo.org/record/4573313)|1850-1950|OCR'd digitized newspaper articles|3%
+Diverse document data from Finnish public administration|1970s - 2000s|OCR'd digitized documents|69%
+Finnish senate documents|1916|Part manually transcribed, part HTR'd digitized documents|3%
+Finnish books from [Project Gutenberg](https://www.gutenberg.org)|Early 20th century|OCR'd texts|1%
+Theses from Finnish polytechnic universities |2000s|OCR'd texts|1%
 The number of entities belonging to the different
 entity classes contained in training, validation and test datasets are listed below:
 Dataset|PERSON|ORG|LOC|GPE|PRODUCT|EVENT|DATE|JON|FIBC|NORP
 -|-|-|-|-|-|-|-|-|-|-
 Train|11691|30026|868|12999|7473|1184|14918|1360|1879|2068
+Val|2478|5360|127|2428|1202|234|2898|308|235|282
+Test|2377|5470|178|2334|1098|185|2782|273|354|355
 ## Training procedure
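
The updated entity-count table can be checked with a little arithmetic. The following is a minimal sketch, not part of the README or the commit: the counts are copied from the added (`+`) rows and the unchanged `Train` row above, while the names `CLASSES`, `COUNTS`, `split_totals` and `split_shares` are hypothetical helpers introduced only for illustration. It sums the entities in each split and reports each split's share of all annotated entities.

```python
# Entity classes and counts copied from the updated table in this commit.
CLASSES = ["PERSON", "ORG", "LOC", "GPE", "PRODUCT",
           "EVENT", "DATE", "JON", "FIBC", "NORP"]

COUNTS = {
    "Train": [11691, 30026, 868, 12999, 7473, 1184, 14918, 1360, 1879, 2068],
    "Val":   [2478, 5360, 127, 2428, 1202, 234, 2898, 308, 235, 282],
    "Test":  [2377, 5470, 178, 2334, 1098, 185, 2782, 273, 354, 355],
}


def split_totals(counts):
    """Return the total number of entities in each split."""
    return {split: sum(values) for split, values in counts.items()}


def split_shares(counts):
    """Return each split's fraction of all entities across the splits."""
    totals = split_totals(counts)
    grand_total = sum(totals.values())
    return {split: total / grand_total for split, total in totals.items()}


if __name__ == "__main__":
    totals = split_totals(COUNTS)
    for split, share in split_shares(COUNTS).items():
        print(f"{split}: {totals[split]} entities ({share:.1%})")
```

Swapping in the removed (`-`) `Val` and `Test` rows gives the totals before this commit, which makes the growth of the evaluation sets easy to compare.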