Update README.md
README.md
CHANGED
@@ -1,3 +1,44 @@
---
license: cc-by-sa-4.0
---

### xlm-roberta-base for register labeling, specifically fine-tuned for question-answer document identification

This is `xlm-roberta-base` fine-tuned on register-annotated data in English (https://github.com/TurkuNLP/CORE-corpus) and Finnish (https://github.com/TurkuNLP/FinCORE_full), as well as unpublished Swedish and French versions (https://github.com/TurkuNLP/multilingual-register-labeling). The model is trained to predict whether or not a text contains question-and-answer (QA) content.

### Overview
Language model: xlm-roberta-base

Downstream-task: multi-class text classification

### Usage

The model can be used through a Hugging Face pipeline:
```
import transformers

# Load the fine-tuned classifier; the tokenizer comes from the base model.
model = transformers.AutoModelForSequenceClassification.from_pretrained("TurkuNLP/xlmr-qa-register")
tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-base")
pipe = transformers.pipeline(task="text-classification", model=model, tokenizer=tokenizer)
```
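
Once the pipeline is built, its predictions can be filtered to keep only the documents flagged as QA. The following is a minimal sketch rather than part of the released code: `flag_qa_documents`, `stub_classifier`, and the label names `"QA"` / `"not QA"` are hypothetical. The stub stands in for `pipe` so the example runs without downloading the model; the actual label names should be read from `model.config.id2label`.

```python
def flag_qa_documents(classify, texts, qa_label, threshold=0.5):
    """Return the texts whose predicted label is `qa_label` with score >= threshold.

    `classify` is any callable that maps a list of texts to one
    {"label": ..., "score": ...} dict per text, which is the shape a
    transformers text-classification pipeline returns for list input.
    """
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == qa_label and pred["score"] >= threshold
    ]


def stub_classifier(texts):
    """Hypothetical stand-in for `pipe`: calls a text QA if it contains '?'."""
    return [
        {"label": "QA", "score": 0.9} if "?" in text else {"label": "not QA", "score": 0.9}
        for text in texts
    ]


docs = [
    "Q: How do I reset my password? A: Use the settings page.",
    "Weather report for Tuesday.",
]

# With the real model, pass `pipe` instead of `stub_classifier` and set
# `qa_label` to the name that model.config.id2label gives the QA class.
qa_docs = flag_qa_documents(stub_classifier, docs, qa_label="QA")
print(qa_docs)
```

Raising `threshold` trades recall for precision, which may matter given that the reported recall for the QA label is already low.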

### Hyperparameters
```
batch_size = 8
epochs = 10 (trained for 4)
base_LM_model = "xlm-roberta-base"
max_seq_len = 512
learning_rate = 4e-6
```

### Performance
```
F1-micro = 0.98
F1-macro = 0.79

F1 QA label = 0.60
F1 not QA label = 0.99
Precision QA label = 0.82
Precision not QA label = 0.99
Recall QA label = 0.47
Recall not QA label = 1.00
```
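
As a sanity check on how these figures relate, the sketch below recomputes F1 for the QA label from its reported precision and recall, and the macro F1 as the unweighted mean of the two per-label F1 scores. It assumes the standard definitions (F1 as the harmonic mean of precision and recall). Because the reported values are rounded to two decimals, the recomputed numbers agree only to within rounding.

```python
def f1_score(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall for the QA label.
qa_f1 = f1_score(0.82, 0.47)

# Macro F1 is the unweighted mean of the reported per-label F1 scores.
macro_f1 = (0.60 + 0.99) / 2

print(f"F1 QA label (recomputed): {qa_f1:.3f}")
print(f"F1-macro (recomputed):    {macro_f1:.3f}")
```

The gap between F1-micro (0.98) and F1-macro (0.79) reflects the weaker QA label: the micro average is dominated by the far better-scoring not-QA label, while the macro average weights both labels equally.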