---
language:
- nl
tags:
- text2text generation
- spelling normalization
- 19th-century Dutch
license: apache-2.0
---
|
|
|
# 19th Century Dutch Spelling Normalization
|
|
|
This repository contains a pretrained and finetuned version of the original __google/ByT5-small__, adapted for the task of 19th-century Dutch spelling normalization.
We first pretrained the model on 2 million sentences from Dutch historical novels.
Afterward, we finetuned it on a dataset of 10k 19th-century Dutch sentences, automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).
|
|
|
The model is currently only available in the TensorFlow format but can be converted for use with PyTorch.
The pretrained-only weights are additionally available in the Flax format, in the directory _pretrained_ByT5_; note that these weights still need to be finetuned before use.
The train and validation sets used for finetuning are available in the repository.
For further information about the model, please see the [GitHub](https://github.com/Awolters123/Master-Thesis) repository.
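
For example, `from_pretrained` in the transformers library can load checkpoints across frameworks with the `from_tf` and `from_flax` flags. Below is a minimal sketch; the local path for the Flax weights is a placeholder for wherever you downloaded the _pretrained_ByT5_ directory, and each conversion requires the source framework to be installed:

```
from transformers import T5ForConditionalGeneration

# Convert the published TensorFlow weights to PyTorch
# (requires both torch and tensorflow to be installed).
model = T5ForConditionalGeneration.from_pretrained(
    'AWolters/ByT5_DutchSpellingNormalization', from_tf=True)
model.save_pretrained('ByT5_DutchSpellingNormalization_pt')  # persist as a PyTorch checkpoint

# The pretrained-only Flax weights can be converted the same way;
# replace the path with your local copy of the _pretrained_ByT5_ directory.
pretrained = T5ForConditionalGeneration.from_pretrained(
    'path/to/pretrained_ByT5', from_flax=True)
```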
|
|
|
|
|
## How to use:
|
|
|
```
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

# Load the finetuned normalization model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# A 19th-century Dutch input sentence.
text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

# Generate the normalized spelling.
prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```
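
For this example sentence, the output should be the modernized spelling, e.g. the historical form _menschen_ normalized to _mensen_.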
|
|
|
## Setup:
|
|
|
The model has been finetuned with the following (hyper)parameter values:

- _Learning rate_: 5e-5
- _Batch size_: 32
- _Optimizer_: AdamW
- _Epochs_: 30, with early stopping
|
|
|
To further finetune the model, use the _T5Trainer.py_ script; a minimal sketch of the training setup is shown below.
If you instead want to finetune starting from the pretrained-only weights, you first have to convert the Flax checkpoint to a PyTorch or TensorFlow model.
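
For orientation, here is a minimal, self-contained sketch of what finetuning with the values above could look like in TensorFlow. This is not the _T5Trainer.py_ script: the toy data and the early-stopping criterion are illustrative assumptions, and `tf.keras.optimizers.AdamW` requires TensorFlow 2.11 or newer (older versions provide it under `tf.keras.optimizers.experimental`).

```
import tensorflow as tf
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Toy parallel data; substitute the train/validation sets from this repository.
sources = ['De menschen waren aan het werk.'] * 32
targets = ['De mensen waren aan het werk.'] * 32

features = dict(tokenizer(sources, padding=True, return_tensors='tf'))
labels = tokenizer(targets, padding=True, return_tensors='tf').input_ids
# Mask padding positions so they are ignored by the loss.
features['labels'] = tf.where(labels == tokenizer.pad_token_id, -100, labels)

train_set = tf.data.Dataset.from_tensor_slices(features).batch(32)

# Compiling without an explicit loss makes the model use its internal
# seq2seq cross-entropy loss, computed from the 'labels' key.
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=5e-5))

# 30 epochs with early stopping; monitor 'val_loss' instead when
# passing a validation set to fit().
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
model.fit(train_set, epochs=30, callbacks=[early_stop])
```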