MikkoLipsanen committed commit 373cfeb (1 parent: 5afd4b8)
Update README.md

README.md CHANGED
@@ -56,8 +56,11 @@ token_classifier("'Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki vu
 ## Training data
 
 Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one)
-dataset were filtered out from the dataset used for training the model.
-
+dataset were filtered out from the dataset used for training the model.
+
+In addition to this dataset, OCR'd and annotated content of
+digitized documents from Finnish public administration was also used for model training.
+The number of entities belonging to the different
 entity classes contained in training, validation and test datasets are listed below:
 
 ### Number of entity types in the data
@@ -67,6 +70,10 @@ Train|11691|30026|868|12999|7473|1184|14918|01360|1879|2068
 Val|1542|4042|108|1654|879|160|1858|177|257|299
 Test|1267|3698|86|1713|901|137|1843|174|233|260
 
+The annotation of the data was performed as a cooperation between the National Archives of Finland
+and the [FIN-CLARIAH](https://www.kielipankki.fi/organization/fin-clariah/) research infrastructure
+for Social Sciences and Humanities.
+
 ## Training procedure
 
 This model was trained using a NVIDIA RTX A6000 GPU with the following hyperparameters:
@@ -79,4 +86,9 @@ This model was trained using a NVIDIA RTX A6000 GPU with the following hyperpara
 - maximum length of data sequence: 512
 - patience: 2 epochs
 
-
+In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens,
+in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed
+using the tokenizer for the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1)
+model.
+
+The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER).
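
The chunking step described in the added "Training procedure" text could be sketched roughly as follows. This is not the authors' training code (that is what the linked repository is for). It is a minimal illustration under some assumptions: that the "300 tokens" refers to whitespace-separated words, that chunks are cut at fixed positions, and that the helper names `split_into_chunks` and `tokenize_chunks` are hypothetical. The only names taken from the README are the limits 300 and 512 and the `TurkuNLP/bert-base-finnish-cased-v1` tokenizer.

```python
from typing import List


def split_into_chunks(words: List[str], max_words: int = 300) -> List[List[str]]:
    """Split a list of words into consecutive chunks of at most max_words items.

    Assumption: a "token" in the README means a whitespace-separated word.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


def tokenize_chunks(chunks: List[List[str]], max_length: int = 512):
    """Tokenize pre-split chunks into subword IDs (hypothetical helper).

    Requires the `transformers` package and downloads the tokenizer on first
    use, so the import is kept local to this function.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
    # truncation=True is a safety net in case a chunk still exceeds max_length
    return [
        tokenizer(chunk, is_split_into_words=True, truncation=True, max_length=max_length)
        for chunk in chunks
    ]


if __name__ == "__main__":
    text = "Helsingistä tuli Suomen suuriruhtinaskunnan pääkaupunki"
    # A tiny max_words value is used here only to make the splitting visible
    print(split_into_chunks(text.split(), max_words=3))
```

Splitting on words before subword tokenization leaves headroom: a Finnish word is often broken into several WordPiece tokens, so a 300-word chunk is less likely to exceed the 512-token limit than a chunk cut closer to that limit.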