MikkoLipsanen committed on
Commit
6174c03
·
1 Parent(s): 1a852b6

Update README.md

  1. README.md +65 -1
README.md CHANGED
@@ -7,4 +7,68 @@ metrics:
  - accuracy
  library_name: transformers
  pipeline_tag: token-classification
- ---
+ ---
+
+ ## Finnish named entity recognition
+
+ The model performs named entity recognition on Finnish text input.
+ It was trained by fine-tuning [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
+ on 10 named entity categories. The training data consists of the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
+ and an annotated dataset of Finnish document data from the 1970s onwards, digitized by the National Archives of Finland.
+ Because the latter dataset also contains sensitive data, it has not been made publicly available.
+
+
+ ## Intended uses & limitations
+
+ The model has been trained to recognize the following named entities in Finnish text:
+
+ - PERSON (person names)
+ - ORG (organizations)
+ - LOC (locations)
+ - GPE (geopolitical locations)
+ - PRODUCT (products)
+ - EVENT (events)
+ - DATE (dates)
+ - JON (Finnish journal numbers, *diaarinumero*)
+ - FIBC (Finnish business identity codes, *y-tunnus*)
+ - NORP (nationalities, religious and political groups)
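
Because FIBC entities follow a fixed format (seven digits, a hyphen and a check digit), a predicted FIBC span can be sanity-checked after recognition. The sketch below is optional post-processing and not part of the model. It assumes the standard Finnish business ID checksum (weights 7, 9, 10, 5, 8, 4, 2, modulo 11), and the function name `is_valid_business_id` is hypothetical.

```python
import re

# Seven digits, a hyphen, and one check digit, e.g. "0112038-9"
Y_TUNNUS_RE = re.compile(r"^([0-9]{7})-([0-9])$")
# Assumed checksum weights for the seven leading digits
WEIGHTS = (7, 9, 10, 5, 8, 4, 2)


def is_valid_business_id(text):
    """Return True if text looks like a business ID with a correct check digit."""
    match = Y_TUNNUS_RE.match(text.strip())
    if match is None:
        return False
    digits, check = match.group(1), int(match.group(2))
    remainder = sum(int(d) * w for d, w in zip(digits, WEIGHTS)) % 11
    if remainder == 1:
        # No check digit corresponds to a remainder of 1
        return False
    expected = 0 if remainder == 0 else 11 - remainder
    return check == expected


print(is_valid_business_id("0112038-9"))  # → True
print(is_valid_business_id("0112038-8"))  # → False
```

A check like this can help flag OCR errors in digitized documents, where a single misread digit usually breaks the checksum.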
+
+ Some entities, such as EVENT, LOC and JON, are less common in the training data than the others, so
+ recognition accuracy for these entities tends to be lower.
+
+ The training data is relatively recent, so the model may struggle when the input
+ contains, for example, old names or writing styles.
+
+ ## How to use
+
+ The easiest way to use the model is through the Transformers pipeline for token classification:
+
+ ```python
+ from transformers import pipeline
+
+ model_checkpoint = "Kansallisarkisto/finbert-ner"
+ # aggregation_strategy="simple" merges word pieces into whole entity spans
+ token_classifier = pipeline(
+     "token-classification", model=model_checkpoint, aggregation_strategy="simple"
+ )
+ predictions = token_classifier("Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vuonna 1812.")
+ print(predictions)
+ ```
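
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries containing the keys `entity_group`, `score`, `word`, `start` and `end`. The following is a minimal sketch of one way to post-process that list: it drops low-confidence predictions and groups the rest by entity class. The helper name `group_entities`, the 0.5 threshold and the sample predictions are illustrative assumptions, not real model output.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Group predicted entity strings by class, skipping low-confidence ones."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Hand-written sample in the pipeline's output format (not real model output)
sample = [
    {"entity_group": "GPE", "score": 0.99, "word": "Helsingistä", "start": 0, "end": 11},
    {"entity_group": "GPE", "score": 0.97, "word": "Suomen", "start": 17, "end": 23},
    {"entity_group": "DATE", "score": 0.42, "word": "vuonna 1812", "start": 56, "end": 67},
]

print(group_entities(sample))
```

A stricter threshold may be worth considering for the less common classes such as EVENT, LOC and JON, since their recognition accuracy tends to be lower.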
+
+ ## Training data
+
+ Some of the entity types annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
+ (for instance WORK_OF_ART, LAW and MONEY) were filtered out of the data used for training the model. In addition to this corpus, OCR'd and annotated content of
+ digitized documents from Finnish public administration was used for training. The number of entities in each
+ entity class in the training, validation and test datasets is listed below:
+
+ Number of entities per class in each dataset
+
+ | Dataset | O | PERSON | ORG | LOC | GPE | PRODUCT | EVENT | DATE | JON | FIBC | NORP |
+ | - | - | - | - | - | - | - | - | - | - | - | - |
+ | Train | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | Val | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+ | Test | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
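
Per-class counts of this kind can be derived from token-level labels. The sketch below assumes IOB2-style tags (`B-` and `I-` prefixes plus `O`) and counts tokens per class, which is one plausible reading of a table that includes an `O` column. The function name `count_labels` and the sample sequences are hypothetical.

```python
from collections import Counter


def count_labels(tag_sequences):
    """Count tokens per entity class across a list of tag sequences."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            # "B-PERSON" and "I-PERSON" both map to "PERSON"; "O" stays "O"
            counts[tag.split("-", 1)[-1]] += 1
    return counts


# Hypothetical tag sequences for two sentences (not from the training data)
sample_tags = [
    ["B-PERSON", "I-PERSON", "O", "B-GPE"],
    ["O", "B-DATE", "I-DATE"],
]

print(count_labels(sample_tags))
```

To count entity mentions rather than tokens, only the tags starting with `B-` would be counted.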
+
+ ## Training procedure
+
+ This model was trained on an NVIDIA RTX A6000 GPU with the following hyperparameters:
+
+ The training code and instructions are available in the [BERT_NER repository](https://github.com/DALAI-hanke/BERT_NER).