	---
	language: pl
	tags:
	- herbert
	license: cc-by-sa-4.0
	---

	# HerBERT
	[HerBERT](https://en.wikipedia.org/wiki/Zbigniew_Herbert) is a BERT-based language model trained on Polish corpora
	using the masked language modeling (MLM) and sentence structural objective (SSO) tasks, with dynamic masking of whole words.
	Model training and experiments were conducted with [transformers](https://github.com/huggingface/transformers) version 2.9.
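
To illustrate what "dynamic masking of whole words" means, here is a minimal sketch, not the code used to train HerBERT. It assumes CharBPE-style tokens in which the last piece of each word carries a `</w>` suffix, a `<mask>` token string, and a 15% masking rate; the helper names and the example segmentation are hypothetical. Every subword of a selected word is masked together, and the selection is re-sampled on each call rather than fixed once during preprocessing. Real MLM training usually also leaves some selected tokens unchanged or swaps them for random tokens, which this sketch omits.

```python
import random

MASK_TOKEN = "<mask>"   # assumed mask token string
END_OF_WORD = "</w>"    # assumed CharBPE end-of-word suffix


def group_into_words(tokens):
    """Return one list of token indices per whole word."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok.endswith(END_OF_WORD):
            words.append(current)
            current = []
    if current:  # trailing pieces with no end-of-word marker
        words.append(current)
    return words


def mask_whole_words(tokens, mask_prob=0.15, rng=None):
    """Mask every subword of randomly chosen words.

    Calling this again on the same sequence draws a new selection,
    which is what makes the masking dynamic.
    """
    rng = rng or random.Random()
    words = group_into_words(tokens)
    if not words:
        return list(tokens)
    n_to_mask = max(1, round(len(words) * mask_prob))
    masked = list(tokens)
    for word in rng.sample(words, n_to_mask):
        for i in word:
            masked[i] = MASK_TOKEN
    return masked


# Hypothetical segmentation of "ślepy dziad prowadzony"
tokens = ["śle", "py</w>", "dziad</w>", "prowa", "dzony</w>"]
epoch_1 = mask_whole_words(tokens, mask_prob=0.34, rng=random.Random(0))
epoch_2 = mask_whole_words(tokens, mask_prob=0.34, rng=random.Random(1))
```

Passing a differently seeded `rng` per epoch stands in for the fresh randomness a data collator would draw at every training step.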

	## Tokenizer
	The training dataset was tokenized into subwords using ``CharBPETokenizer``, a character-level byte-pair encoding tokenizer with
	a vocabulary size of 50k tokens. The tokenizer itself was trained with the [tokenizers](https://github.com/huggingface/tokenizers) library.
	We recommend using the fast version of the tokenizer, ``HerbertTokenizerFast``.

	## HerBERT usage


	Example code:
	```python
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
	model = AutoModel.from_pretrained("allegro/herbert-base-cased")

	output = model(
	    **tokenizer.batch_encode_plus(
	        [
	            (
	                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
	                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
	            )
	        ],
	        padding='longest',
	        add_special_tokens=True,
	        return_tensors='pt'
	    )
	)
	```
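
The model returns one hidden-state vector per token. If a single fixed-size vector per input is needed, one common option is to average the token vectors while ignoring padding. The sketch below is an assumption about how you might post-process the output, not something the model card prescribes; `mean_pool` is a hypothetical helper, and it is demonstrated on a small synthetic tensor so it runs without downloading the model.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


# Synthetic stand-in for a model output: batch of 1, 3 tokens, hidden size 2.
# The third token is padding and must not affect the average.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = mean_pool(hidden, attention_mask)  # tensor([[2., 3.]])
```

With the example above, you would keep the tokenizer result in a variable (say `encoded`), call `output = model(**encoded)`, and then compute `mean_pool(output[0], encoded["attention_mask"])`, since `output[0]` holds the last hidden states.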


	## License
	CC BY-SA 4.0


	## Authors
	The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

	You can contact us at: <a href="mailto:klejbenchmark@allegro.pl">klejbenchmark@allegro.pl</a>