---
license: mit
language:
- sk
datasets:
- oscar-corpus/OSCAR-2109
pipeline_tag: fill-mask
library_name: transformers
---
# Slovak BPE Baby Language Model (SK_BPE_BLM)
**SK_BPE_BLM** is a small pretrained language model for Slovak, based on the RoBERTa architecture. It uses standard Byte-Pair Encoding (BPE) tokenization and is case-insensitive: it operates on lowercase text. The pretrained model can be used directly for masked language modeling, but it is primarily intended for fine-tuning on downstream NLP tasks.
## How to Use the Model
The following example loads the tokenizer and model and runs a fill-mask pipeline:
```python
from transformers import pipeline, RobertaTokenizer, AutoModelForMaskedLM

# Load the custom tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained("daviddrzik/SK_BPE_BLM")
model = AutoModelForMaskedLM.from_pretrained("daviddrzik/SK_BPE_BLM")

# Create a fill-mask pipeline with the custom model and tokenizer
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict the masked token (the input is lowercase, matching the training data)
result = unmasker("včera večer sme <mask> nový film v kine, ktorý mal premiéru iba pred týždňom.")
print(result)

# Example output (top five predictions):
[{'score': 0.2665567100048065,
'token': 18599,
'token_str': ' pozreli',
'sequence': 'včera večer sme pozreli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
{'score': 0.23860174417495728,
'token': 1056,
'token_str': ' mali',
'sequence': 'včera večer sme mali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
{'score': 0.1962040513753891,
'token': 6915,
'token_str': ' videli',
'sequence': 'včera večer sme videli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
{'score': 0.03656836599111557,
'token': 26996,
'token_str': ' pozerali',
'sequence': 'včera večer sme pozerali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
{'score': 0.030735589563846588,
'token': 9058,
'token_str': ' objavili',
'sequence': 'včera večer sme objavili nový film v kine, ktorý mal premiéru iba pred týždňom.'}]
```
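
Because the model operates on lowercase text, it can help to lowercase input explicitly and to print the predictions in a compact form. The sketch below is a minimal illustration: `prepare_input` and `format_predictions` are hypothetical helper names, not part of the model or of Transformers, and explicit lowercasing is a precaution, since this card does not say whether the tokenizer lowercases on its own.

```python
MASK_TOKEN = "<mask>"  # mask token used in the example above


def prepare_input(text: str) -> str:
    """Lowercase the text and check that it contains exactly one mask token."""
    lowered = text.lower()
    if lowered.count(MASK_TOKEN) != 1:
        raise ValueError(f"expected exactly one {MASK_TOKEN} in the input")
    return lowered


def format_predictions(result: list, min_score: float = 0.0) -> list:
    """Turn fill-mask pipeline output into 'token (score)' strings, best first."""
    ranked = sorted(result, key=lambda p: p["score"], reverse=True)
    return [
        f"{p['token_str'].strip()} ({p['score']:.3f})"
        for p in ranked
        if p["score"] >= min_score
    ]


# Two entries copied from the example output above, so this runs without the model.
sample = [
    {"score": 0.23860174417495728, "token": 1056, "token_str": " mali"},
    {"score": 0.2665567100048065, "token": 18599, "token_str": " pozreli"},
]
print(prepare_input("Včera večer sme <mask> nový film."))
print(format_predictions(sample, min_score=0.25))  # ['pozreli (0.267)']
```

With the real pipeline, the same helpers would be used as `format_predictions(unmasker(prepare_input(text)))`.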
## Training Data
The `SK_BPE_BLM` model was pretrained on a Slovak-language subset of the OSCAR 2019 corpus. The corpus underwent comprehensive preprocessing to ensure the quality and relevance of the data:
- **Language Filtering:** Non-Slovak text was removed to focus solely on the Slovak language.
- **Character Normalization:** Various types of spaces, quotes, dashes, and separators were standardized (e.g., replacing different types of spaces with a single space, or dashes with hyphens). Emoticons were replaced with spaces.
- **Symbol and Unwanted Text Removal:** Sentences containing mathematical symbols, pictograms, or characters from Asian and African languages were deleted. Duplicates of punctuation, special characters, and spaces were also removed.
- **URL and Text Normalization:** All web addresses were removed, and the text was converted to lowercase to simplify tokenization.
- **Content Cleanup:** Text that included irrelevant content from web crawling, such as keywords and HTML tags, was identified and removed.
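
Several of the normalization steps above can be approximated with the Python standard library. The following is a minimal sketch, not the authors' pipeline: the regular expressions, the character sets, and the function name `normalize` are all assumptions chosen for illustration, and steps such as language filtering, emoticon replacement, and deleting sentences with unwanted symbols are left out.

```python
import re
import unicodedata

# Assumed patterns; the card does not specify how URLs or HTML tags were detected.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
REPEATED_PUNCT_RE = re.compile(r"([!?.,;:\-])\1+")

# Map typographic dashes and quotes to plain ASCII equivalents.
TRANSLATION = str.maketrans({
    **{c: "-" for c in "\u2010\u2011\u2012\u2013\u2014\u2015"},
    **{c: '"' for c in "\u201c\u201d\u201e\u00ab\u00bb"},
    **{c: "'" for c in "\u2018\u2019\u201a"},
})


def normalize(text: str) -> str:
    """Apply URL/HTML removal, character normalization, and lowercasing."""
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.translate(TRANSLATION)
    # Replace every Unicode space separator (category Zs) with a plain space.
    text = "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)
    # Collapse runs of the same punctuation mark, then runs of whitespace.
    text = REPEATED_PUNCT_RE.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


raw = "Nový\u00a0film <b>Duna</b> \u2013 \u201eskvelý\u201c!!! Viac na https://example.sk/film"
print(normalize(raw))  # nový film duna - "skvelý"! viac na
```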
Additionally, the preprocessing included further refinement steps to create the final dataset:
- **Parentheses Content Removal:** All content within parentheses was removed to reduce noise.
- **Selection of Text Segments:** Medium-length text paragraphs were selected to maintain consistency.
- **Similarity Filtering:** Paragraphs with at least 50% similarity to previous ones were removed to minimize redundancy.
- **Random Sampling:** Finally, 20% of the remaining paragraphs were randomly selected.
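
The refinement steps above can be sketched as a single filtering pass. This is an illustration under stated assumptions, not the authors' code: the card does not define "medium-length" or name the similarity measure, so the word-count bounds, the use of `difflib.SequenceMatcher`, the comparison window, and the names `strip_parentheses` and `refine` are all hypothetical choices.

```python
import random
import re
from difflib import SequenceMatcher

PAREN_RE = re.compile(r"\s*\([^()]*\)")


def strip_parentheses(text: str) -> str:
    """Remove parenthesized content, repeating to handle simple nesting."""
    previous = None
    while previous != text:
        previous = text
        text = PAREN_RE.sub("", text)
    return text.strip()


def refine(paragraphs, min_words=20, max_words=200, max_similarity=0.5,
           sample_rate=0.2, window=50, seed=0):
    """Filter paragraphs by length and similarity, then sample a fraction.

    All thresholds are assumed values; only the 50% similarity cutoff and
    the 20% sampling rate come from the description above.
    """
    kept = []
    for paragraph in paragraphs:
        paragraph = strip_parentheses(paragraph)
        if not min_words <= len(paragraph.split()) <= max_words:
            continue
        # Compare only against the most recently kept paragraphs (assumed window).
        if any(SequenceMatcher(None, paragraph, other).ratio() >= max_similarity
               for other in kept[-window:]):
            continue
        kept.append(paragraph)
    # Draw the sample without replacement, reproducibly.
    return random.Random(seed).sample(kept, round(len(kept) * sample_rate))
```

Comparing each paragraph against a bounded window keeps the pass roughly linear in corpus size; comparing against every earlier paragraph would be quadratic.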
After preprocessing, the training corpus consisted of:
- **455 MB of text**
- **895,125 paragraphs**
- **64.6 million words**
- **1.13 million unique words**
- **119 unique characters**
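
Statistics of this kind can be computed with a few lines of Python. The sketch below assumes that words are whitespace-separated tokens and that size is measured in UTF-8 bytes; the card does not state how the figures above were counted, and `corpus_stats` is a hypothetical helper name.

```python
def corpus_stats(paragraphs):
    """Summarize a list of paragraphs (assumes whitespace-separated words)."""
    words = [word for paragraph in paragraphs for word in paragraph.split()]
    return {
        "paragraphs": len(paragraphs),
        "words": len(words),
        "unique_words": len(set(words)),
        "unique_chars": len(set("".join(paragraphs))),
        "bytes": sum(len(paragraph.encode("utf-8")) for paragraph in paragraphs),
    }


print(corpus_stats(["ahoj svet", "ahoj"]))
```

For scale, the reported figures imply an average of roughly 72 words per paragraph (64.6 million words over 895,125 paragraphs).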
## Pretraining
The `SK_BPE_BLM` model was trained with the following key parameters:
- **Architecture:** Based on RoBERTa, with 6 hidden layers and 12 attention heads.
- **Hidden size:** 576
- **Vocabulary size:** 50,264 tokens
- **Sequence length:** 256 tokens
- **Dropout:** 0.1
- **Number of parameters:** 58 million
- **Optimizer:** AdamW, learning rate 1×10^(-4), weight decay 0.01
- **Training:** 30 epochs, divided into 3 phases:
  - **Phase 1:** 10 epochs on CPU (4x AMD EPYC 7542), batch size 64, 50 hours per epoch, 139,870 steps total.
  - **Phase 2:** 5 epochs on GPU (1x Nvidia A100 40GB), batch size 64, 100 minutes per epoch, 69,935 steps total.
  - **Phase 3:** 15 epochs on GPU (2x Nvidia A100 40GB), batch size 128, 60 minutes per epoch, 104,910 steps total.
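
The step counts for the three phases are consistent with the corpus size. The short check below assumes one training example per paragraph and a partial final batch in each epoch; both are assumptions, although they reproduce the reported totals exactly.

```python
import math

PARAGRAPHS = 895_125  # paragraphs reported in the Training Data section

# (name, epochs, batch size, minutes per epoch) as listed in the phases above
PHASES = [
    ("Phase 1", 10, 64, 50 * 60),
    ("Phase 2", 5, 64, 100),
    ("Phase 3", 15, 128, 60),
]


def total_steps(epochs, batch_size, examples=PARAGRAPHS):
    """Optimizer steps, assuming one example per paragraph and a partial last batch."""
    return epochs * math.ceil(examples / batch_size)


for name, epochs, batch_size, minutes in PHASES:
    hours = epochs * minutes / 60
    print(f"{name}: {total_steps(epochs, batch_size):,} steps, about {hours:.1f} hours")
# Phase 1: 139,870 steps, about 500.0 hours
# Phase 2: 69,935 steps, about 8.3 hours
# Phase 3: 104,910 steps, about 15.0 hours
```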
The model was trained with the Hugging Face Transformers library, using a native PyTorch training loop rather than the `Trainer` class.
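
As a rough cross-check of the reported 58 million parameters, the weights of a RoBERTa-style encoder can be counted from the hyperparameters listed above. Two inputs are not given in this card and are assumptions here: a feed-forward (intermediate) size of 3072 and 258 position embeddings (sequence length 256 plus 2 offset positions). The function name `roberta_param_count` is hypothetical, and the count excludes any task head.

```python
def roberta_param_count(vocab_size, hidden, layers, intermediate,
                        max_positions, type_vocab=1):
    """Count the weights of a RoBERTa-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers with biases.
    feed_forward = 2 * hidden * intermediate + intermediate + hidden + layer_norm
    return embeddings + layers * (attention + feed_forward)


total = roberta_param_count(
    vocab_size=50_264,
    hidden=576,
    layers=6,
    intermediate=3072,   # assumption: not stated in this card
    max_positions=258,   # assumption: sequence length 256 plus 2 offset positions
)
print(f"{total / 1e6:.1f}M parameters")  # 58.3M parameters
```

The number of attention heads does not affect the count, because the heads split the same projection matrices. About half of the total comes from the token embedding matrix (50,264 × 576).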
## Fine-Tuned Versions of the SK_BPE_BLM Model
The following fine-tuned versions of the `SK_BPE_BLM` model are available:
- [`SK_BPE_BLM-ner`](https://huggingface.co/daviddrzik/SK_BPE_BLM-ner): Fine-tuned for Named Entity Recognition (NER) tasks.
- [`SK_BPE_BLM-pos`](https://huggingface.co/daviddrzik/SK_BPE_BLM-pos): Fine-tuned for Part-of-Speech (POS) tagging.
- [`SK_BPE_BLM-qa`](https://huggingface.co/daviddrzik/SK_BPE_BLM-qa): Fine-tuned for Question Answering tasks.
- [`SK_BPE_BLM-sentiment-csfd`](https://huggingface.co/daviddrzik/SK_BPE_BLM-sentiment-csfd): Fine-tuned for sentiment analysis on the CSFD (movie review) dataset.
- [`SK_BPE_BLM-sentiment-multidomain`](https://huggingface.co/daviddrzik/SK_BPE_BLM-sentiment-multidomain): Fine-tuned for sentiment analysis across multiple domains.
- [`SK_BPE_BLM-sentiment-reviews`](https://huggingface.co/daviddrzik/SK_BPE_BLM-sentiment-reviews): Fine-tuned for sentiment analysis on general review datasets.
- [`SK_BPE_BLM-topic-news`](https://huggingface.co/daviddrzik/SK_BPE_BLM-topic-news): Fine-tuned for topic classification in news articles.