---
language: pl
tags:
- herbert
license: cc-by-sa-4.0
---
# HerBERT
**[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert)** is a BERT-based language model trained on Polish corpora
using two objectives, masked language modelling (MLM) and the sentence structural objective (SSO), with dynamic masking of whole words.
Model training and experiments were conducted with version 2.9 of the [transformers](https://github.com/huggingface/transformers) library.
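
To make "dynamic masking of whole words" concrete, here is a minimal pure-Python sketch. It is an illustration only, not the code used to train HerBERT: the subword split of the sample sentence, the `</w>` end-of-word marker, the `<mask>` symbol, the 15% masking rate, and the helper names `group_whole_words` and `mask_whole_words` are all assumptions. It also skips the usual BERT step of sometimes keeping or randomly replacing a selected token.

```python
import random

MASK = "<mask>"  # assumed mask symbol for this illustration


def group_whole_words(tokens, suffix="</w>"):
    """Group subword positions into words; a word ends at a token carrying the suffix."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok.endswith(suffix):
            words.append(current)
            current = []
    if current:  # trailing pieces without an end-of-word marker
        words.append(current)
    return words


def mask_whole_words(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens in which randomly chosen whole words are masked."""
    words = group_whole_words(tokens)
    n_to_mask = max(1, round(len(words) * mask_prob))
    masked = list(tokens)
    for word in rng.sample(words, n_to_mask):
        for i in word:  # every subword of a chosen word is masked together
            masked[i] = MASK
    return masked


# Hypothetical subword split of "A potem szedł środkiem drogi".
tokens = ["A</w>", "po", "tem</w>", "sze", "dł</w>", "środ", "kiem</w>", "dro", "gi</w>"]

# "Dynamic" means the mask is re-sampled each time the sentence is seen,
# instead of being fixed once during preprocessing.
rng = random.Random(0)
for epoch in range(3):
    print(epoch, mask_whole_words(tokens, rng))
```

Because the mask is drawn per pass rather than stored with the data, the same sentence can contribute different prediction targets in different epochs.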
## Tokenizer
The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoding tokenizer, with
a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
We kindly encourage you to use the **Fast** version of the tokenizer, namely ``HerbertTokenizerFast``.
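
For readers unfamiliar with character-level byte-pair encoding, the sketch below shows the basic idea in pure Python: start from single characters, then repeatedly merge the most frequent adjacent pair. It is a toy illustration under stated assumptions, not the implementation in the tokenizers library. The helper names (`learn_bpe`, `apply_merge`, `encode_word`), the `</w>` end-of-word marker, the tie-breaking rule, and the tiny corpus are all made up for the example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges, suffix="</w>"):
    """Learn an ordered list of merges from whitespace-separated text."""
    # Each word starts as a sequence of characters; the last one carries the suffix.
    vocab = Counter()
    for word in corpus.split():
        symbols = list(word)
        symbols[-1] += suffix
        vocab[tuple(symbols)] += 1

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(symbols, best))] += freq
        vocab = new_vocab
    return merges


def encode_word(word, merges, suffix="</w>"):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = list(word)
    symbols[-1] += suffix
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


merges = learn_bpe("droga drogi drodze droga", num_merges=4)
print(merges)
print(encode_word("drogi", merges))
print(encode_word("drodze", merges))
```

In a real setting the number of merges is chosen so that the final vocabulary reaches the target size, which is 50k tokens for HerBERT.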
## HerBERT usage
Example code:
```python
from transformers import AutoTokenizer, AutoModel

# Load the pretrained tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode one pair of sentences as a single input and run it through the model
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
```
## License
CC BY-SA 4.0
## Authors
The model was trained by the **Machine Learning Research Team at Allegro** and the **Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences**.
You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>