Fine-tuned ByT5-small for MultiLexNorm (Danish version)

This is the official release of the fine-tuned models for the winning entry to the W-NUT 2021: Multilingual Lexical Normalization (MultiLexNorm) shared task, which evaluates lexical-normalization systems on 12 social media datasets in 11 languages.

Our system is based on ByT5, which we first pre-train on synthetic data and then fine-tune on authentic normalization data. It achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. In addition to these fine-tuned models, we also release the source files on GitHub and an interactive demo on Google Colab.

How to use

The model was not fine-tuned in a standard sentence-to-sentence setting; instead, it was tailored to the token-to-token definition of the MultiLexNorm data. Please refer to the interactive demo on Google Colab to learn how to use these models.
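To make the token-to-token idea concrete, here is a minimal sketch of how one input per token could be prepared. It assumes that each token is normalized separately while the rest of the sentence serves as context; that reading, the helper name `build_token_inputs`, and the marker strings `<<` and `>>` are illustrative assumptions, not the format the released models expect. The actual preprocessing is defined in the Colab demo and the GitHub sources.

```python
from typing import List

# Hypothetical marker strings, chosen only for illustration. The released
# models rely on their own special tokens (see the Colab demo).
OPEN_MARK = "<<"
CLOSE_MARK = ">>"


def build_token_inputs(tokens: List[str]) -> List[str]:
    """Return one input string per token, with that token marked in context."""
    inputs = []
    for i, token in enumerate(tokens):
        # Copy the sentence and wrap only the i-th token in the markers.
        marked = tokens[:i] + [f"{OPEN_MARK}{token}{CLOSE_MARK}"] + tokens[i + 1:]
        inputs.append(" ".join(marked))
    return inputs
```

Under these assumptions, a three-token sentence produces three inputs, and the model would be asked for the normalized form of the marked token in each one.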

How to cite

@inproceedings{wnut-ufal,
  title = "{ÚFAL} at {MultiLexNorm} 2021: Improving Multilingual Lexical Normalization by Fine-tuning {ByT5}",
  author = "Samuel, David and Straka, Milan",
  booktitle = "Proceedings of the 7th Workshop on Noisy User-generated Text (W-NUT 2021)",
  year = "2021",
  publisher = "Association for Computational Linguistics",
  address = "Punta Cana, Dominican Republic"
}

ByT5 - Small

ByT5 is a tokenizer-free version of Google's T5 and generally follows the architecture of mT5.
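As an illustration of what "tokenizer-free" means in practice, the following sketch maps text directly to UTF-8 byte ids without any learned vocabulary. It assumes the convention used by the Hugging Face ByT5 tokenizer, where ids 0, 1 and 2 are reserved for padding, end-of-sequence and unknown, so each byte value is shifted by 3. The function names are hypothetical, and this is not a replacement for the real tokenizer.

```python
from typing import List

# Assumption: ids 0 (pad), 1 (end-of-sequence) and 2 (unknown) are reserved,
# so every raw byte value is shifted by 3.
NUM_SPECIAL_IDS = 3
EOS_ID = 1


def bytes_to_ids(text: str) -> List[int]:
    """Encode text as shifted UTF-8 byte ids followed by the EOS id."""
    return [b + NUM_SPECIAL_IDS for b in text.encode("utf-8")] + [EOS_ID]


def ids_to_text(ids: List[int]) -> str:
    """Decode ids back to text, skipping special ids and non-byte ids."""
    raw = bytes(
        i - NUM_SPECIAL_IDS
        for i in ids
        if NUM_SPECIAL_IDS <= i < NUM_SPECIAL_IDS + 256
    )
    return raw.decode("utf-8", errors="ignore")
```

For example, `bytes_to_ids("hi")` returns `[107, 108, 1]`. A non-ASCII character such as the Danish "ø" occupies two bytes and therefore two ids, which is why byte-level sequences are longer than subword sequences for the same text.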

ByT5 was pre-trained only on mC4, without any supervised training, using an average span mask of 20 UTF-8 characters. Therefore, the pre-trained model has to be fine-tuned before it is usable on a downstream task.

ByT5 works especially well on noisy text data; e.g., google/byt5-small significantly outperforms mt5-small on TweetQA.

Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models

Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
