---
license: cc-by-sa-4.0
datasets:
- cjvt/cc_gigafida
- cjvt/solar3
- cjvt/sloleks
language:
- sl
tags:
- word spelling error annotator
---

# SloBERTa-Incorrect-Spelling-Annotator

This SloBERTa model annotates incorrectly spelled words in text. It uses the following labels:

- 1: The word is spelled incorrectly.
- 2: Two words should be written together as one.
- 3: The word should be written as separate words.

## Model Output Example

Imagine we have the following Slovenian text:

_Model vbesedilu o znači besede, v katerih se najajajo napake._

Converted to the input format accepted by the SloBERTa model, the text becomes:

_Model <mask> vbesedilu <mask> o <mask> znači <mask> besede <mask> , <mask> v <mask> katerih <mask> se <mask> najajajo <mask> napake <mask> . <mask>_

The model might return the following predictions (note: these predictions were chosen for demonstration and explanation, not reproducibility):

_Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 katerih 0 se 0 najajajo 1 napake 0 . 0_

We can observe the following:

1. The word `najajajo` is spelled incorrectly, so the model marks it with the label 1.
2. The word `vbesedilu` should be written as two words, `v` and `besedilu`, so the model marks it with the label 3.
3. The words `o` and `znači` should be written as one word, `označi`, so the model marks both of them with the label 2.

All remaining tokens receive 0, which in this example marks tokens with no detected error.

## More details

The model, along with its training and evaluation, is described in more detail in the following paper.
```
@inproceedings{neural-spell-checker,
  author = {Klemen, Matej and Bo\v{z}i\v{c}, Martin and Holdt, \v{S}pela Arhar and Robnik-\v{S}ikonja, Marko},
  title = {Neural Spell-Checker: Beyond Words with Synthetic Data Generation},
  year = {2024},
  doi = {10.1007/978-3-031-70563-2_7},
  booktitle = {Text, Speech, and Dialogue: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024, Proceedings, Part I},
  pages = {85--96},
  numpages = {12}
}
```

## Acknowledgement

The authors acknowledge the financial support from the Slovenian Research and Innovation Agency - research core funding No. P6-0411: Language Resources and Technologies for Slovene and research project No. J7-3159: Empirical foundations for digitally-supported development of writing skills.

## Authors

Thanks to Martin Božič, Marko Robnik-Šikonja and Špela Arhar Holdt for developing these models.
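
## Input formatting sketch

The string conversion shown in the Model Output Example section can be sketched in a few lines of Python. This is only an illustration of the formatting, not the authors' preprocessing code: the helper names (`tokenize`, `to_masked_input`, `parse_predictions`) are hypothetical, and the tokenization rule (words and single punctuation marks as separate tokens) is an assumption inferred from the example. Model loading and inference are not covered here.

```python
import re

MASK = "<mask>"  # placeholder used in the example input format


def tokenize(text: str) -> list[str]:
    """Split text into words and single punctuation marks.

    Assumed rule, inferred from the example; the real preprocessing may differ.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def to_masked_input(tokens: list[str]) -> str:
    """Follow every token with a mask placeholder, separated by spaces."""
    return " ".join(f"{token} {MASK}" for token in tokens)


def parse_predictions(output: str) -> list[tuple[str, int]]:
    """Turn an alternating 'token label token label ...' string into pairs."""
    parts = output.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating token/label pairs")
    return [(token, int(label)) for token, label in zip(parts[0::2], parts[1::2])]


text = "Model vbesedilu o znači besede, v katerih se najajajo napake."
masked = to_masked_input(tokenize(text))
print(masked)

# Demonstration predictions copied from the example, not real model output.
demo_output = (
    "Model 0 vbesedilu 3 o 2 znači 2 besede 0 , 0 v 0 "
    "katerih 0 se 0 najajajo 1 napake 0 . 0"
)
flagged = [(tok, lab) for tok, lab in parse_predictions(demo_output) if lab != 0]
print(flagged)
# [('vbesedilu', 3), ('o', 2), ('znači', 2), ('najajajo', 1)]
```

Keeping only non-zero labels leaves the tokens the example discusses: one misspelling (1), one word to split (3), and two words to join (2).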