---
language: pl
license: cc-by-sa-4.0
datasets:
- Polish subset of Open Subtitles
- Polish subset of ParaCrawl
- Polish Parliamentary Corpus
- Polish Wikipedia - Feb 2020
- Expert-annotated Dataset for Automatic Cyberbullying Detection in Polish Language
---

# Polbert-CB - Polish BERT trained for Automatic Cyberbullying Detection

This is a Polish BERT language model, specifically [Polbert](https://huggingface.co/dkleczek/bert-base-polish-uncased-v1), fine-tuned on a re-annotated and improved version of the Dataset for Automatic Cyberbullying Detection in Polish Language.
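
A minimal usage sketch follows, assuming the model is published on the Hugging Face Hub under the id `ptaszynski/bert-base-polish-cyberbullying`. That id is an assumption inferred from the author's handle and the repository name in the citation; this card does not state it. The helper names `load_classifier`, `top_labels`, and `main` are hypothetical. The label names come from the model's own configuration and are not documented here, so check them before interpreting the output.

```python
# Sketch: classify Polish text with the fine-tuned model via the
# Hugging Face `transformers` text-classification pipeline.
# ASSUMPTION: this Hub id is inferred, not confirmed by the model card.
MODEL_ID = "ptaszynski/bert-base-polish-cyberbullying"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily; requires `pip install transformers torch`.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def top_labels(predictions):
    """Reduce pipeline output dicts to (label, rounded score) pairs."""
    return [(p["label"], round(p["score"], 3)) for p in predictions]


def main():
    classifier = load_classifier()
    # "Good morning, have a nice day!"
    texts = ["Dzień dobry, miłego dnia!"]
    # The pipeline returns one {"label": ..., "score": ...} dict per input text.
    print(top_labels(classifier(texts)))
```

Calling `main()` downloads the weights and prints one `(label, score)` pair per input text. Which label corresponds to cyberbullying depends on the label mapping stored with the model.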

## Fine-tuning dataset

The dataset used for fine-tuning this model is based on the original [Dataset for Automatic Cyberbullying Detection in Polish Language](https://huggingface.co/datasets/poleval2019_cyberbullying), which was subsequently cleaned and re-annotated by experts from [Samurai Labs](https://www.samurailabs.ai/). The improved dataset will be released separately at a later date.

## Acknowledgements

* We would like to thank the annotators of this dataset, both the original annotators and the more recent expert annotators, for the invaluable time they spent preparing it.

## Author

Michal Ptaszynski - contact me on:

- Twitter: [@mich_ptaszynski](https://twitter.com/mich_ptaszynski)
- GitHub: [ptaszynski](https://github.com/ptaszynski)
- LinkedIn: [michalptaszynski](https://jp.linkedin.com/in/michalptaszynski)
- HuggingFace: [ptaszynski](https://huggingface.co/ptaszynski)

## Licences

The fine-tuned model, together with all attached files, is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License ([CC BY-SA 4.0](http://creativecommons.org/licenses/by-sa/4.0/)).

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>

## Citations

Please cite this model using the following citation.

Model:

```
@article{ptaszynski2022cyberbullyibng-bert-pl,
title={Polish BERT trained for Automatic Cyberbullying Detection},
author={Ptaszynski, Michal and Pieciukiewicz, Agata and Dybala, Pawel and Skrzek, Pawel and Soliwoda, Kamil and Fortuna, Marcin and Leliwa, Gniewosz and Wroczynski, Michal},
year={2022},
publisher={HuggingFace},
url={https://github.com/ptaszynski/bert-base-polish-cyberbullying}
}
```

Original dataset:

```
@article{ptaszynski2019results,
title={Results of the poleval 2019 shared task 6: First dataset and open shared task for automatic cyberbullying detection in polish twitter},
author={Ptaszynski, Michal and Pieciukiewicz, Agata and Dyba{\l}a, Pawe{\l}},
year={2019},
publisher={Warszawa: Institute of Computer Sciences. Polish Academy of Sciences}
}
```

Improved dataset:

```
TBA
```

## References

* https://github.com/google-research/bert
* https://github.com/ptaszynski/cyberbullying-Polish
* https://huggingface.co/datasets/poleval2019_cyberbullying