Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/iarfmoose/roberta-base-bulgarian-pos/README.md
README.md
ADDED
---
language: bg
---

# RoBERTa-base-bulgarian-POS

The RoBERTa model was originally introduced in [this paper](https://arxiv.org/abs/1907.11692). This model is a version of [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian) fine-tuned for part-of-speech tagging.

## Intended uses

The model can be used to predict part-of-speech tags in Bulgarian text. Since the tokenizer uses byte-pair encoding, each word in the text may be split into more than one token. When predicting POS tags, the last token from each word can be used. Using the last token was found to slightly outperform predictions based on the first token.

An example of this can be found [here](https://github.com/iarfmoose/bulgarian-nlp/blob/master/models/postagger.py).
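
As a rough sketch of the last-token strategy (not the repository's own `postagger.py`), assuming the standard `transformers` token-classification API and a fast tokenizer so that `word_ids()` is available:

```python
# Minimal sketch, not the repository's inference code: predict one POS tag per word
# by keeping only the prediction for the last sub-token of each word.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "iarfmoose/roberta-base-bulgarian-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

words = "Това е примерно изречение .".split()

# Tokenize word-by-word so sub-tokens can be mapped back to words.
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits[0]

# Index of the last sub-token belonging to each word.
word_ids = encoding.word_ids(0)
last_token_index = {}
for idx, word_id in enumerate(word_ids):
    if word_id is not None:
        last_token_index[word_id] = idx

for word_id, word in enumerate(words):
    pred = logits[last_token_index[word_id]].argmax(-1).item()
    print(word, model.config.id2label[pred])
```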

## Limitations and bias

The pretraining data is unfiltered text from the internet and may contain all sorts of biases.

## Training data

In addition to the pretraining data used in [RoBERTa-base-Bulgarian](https://huggingface.co/iarfmoose/roberta-base-bulgarian), the model was trained on the UPOS tags from [UD_Bulgarian-BTB](https://github.com/UniversalDependencies/UD_Bulgarian-BTB).
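
For illustration only (this is not the original preprocessing script), the UPOS labels can be read from the treebank's CoNLL-U files, for example with the `conllu` package and a locally downloaded copy of `bg_btb-ud-train.conllu`:

```python
# Sketch: read (word, UPOS) pairs from a locally downloaded UD_Bulgarian-BTB file.
from conllu import parse_incr

with open("bg_btb-ud-train.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        words = [token["form"] for token in sentence]
        upos_tags = [token["upos"] for token in sentence]
        print(list(zip(words, upos_tags)))
        break  # just show the first sentence
```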

## Training procedure

The model was trained for 5 epochs over the training set. The loss was calculated based on label predictions for the last POS-tag for each word. The model achieves 97% on the test set.
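
The training script itself is not part of the card; the sketch below shows one common way to restrict the loss to the last sub-token of each word, by assigning the ignore index -100 to every other position so that cross-entropy skips it. This is an assumption about the implementation, not the author's code.

```python
# Sketch of the label-alignment step described above: only the last sub-token of
# each word keeps a real label; all other positions get -100, which
# torch.nn.CrossEntropyLoss ignores by default.
def align_labels_to_last_subtoken(word_ids, word_labels, ignore_index=-100):
    """word_ids: output of a fast tokenizer's encoding.word_ids();
    word_labels: one POS label id per word."""
    labels = [ignore_index] * len(word_ids)
    for i, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special token ([CLS], [SEP], padding)
        # Keep the label only if this is the last sub-token of the word.
        if i + 1 == len(word_ids) or word_ids[i + 1] != word_id:
            labels[i] = word_labels[word_id]
    return labels

# Example: the second word is split into two sub-tokens, so only the last one keeps a label.
print(align_labels_to_last_subtoken(
    [None, 0, 1, 1, 2, None],  # word index for each sub-token (None = special token)
    [5, 3, 7],                 # label ids for the three words
))
# [-100, 5, -100, 3, 7, -100]
```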