---
language:
- nl
tags:
- text2text generation
- spelling normalization
- 19th-century Dutch
license: apache-2.0
---
# 19th Century Dutch Spelling Normalization
This repository contains a model pretrained and finetuned from the original __google/ByT5-small__ for the task of 19th-century Dutch spelling normalization.
We first pretrained the model on 2 million sentences from Dutch historical novels.
Afterward, we finetuned it on a dataset of 10,000 19th-century Dutch sentences;
these sentences were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).
The finetuned model is only available in TensorFlow format but can be converted for use with PyTorch (see the sketch below).
The pretrained-only weights are available in PyTorch format in the directory __Pretrained_ByT5__; note that these weights must be finetuned before use.
The training and validation sets used for finetuning are available in the main repository.
For further information about the model, please see the [GitHub](https://github.com/Awolters123/Master-Thesis) repository.
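As a minimal sketch of the conversion mentioned above: `from_tf=True` is a standard `transformers` argument for loading a TensorFlow checkpoint into PyTorch, while the `subfolder` argument for the pretrained-only weights is an assumption based on the directory layout described above.
```python
# Sketch: load the finetuned TensorFlow checkpoint into PyTorch
# (requires both tensorflow and torch to be installed).
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    'AWolters/ByT5_DutchSpellingNormalization', from_tf=True)
model.save_pretrained('ByT5_DutchSpellingNormalization_pt')  # save a native PyTorch copy

# Assumption: the pretrained-only weights live in the Pretrained_ByT5 subdirectory.
pretrained_only = T5ForConditionalGeneration.from_pretrained(
    'AWolters/ByT5_DutchSpellingNormalization', subfolder='Pretrained_ByT5')
```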
## How to use:
```python
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Example 19th-century Dutch sentence to normalize.
text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```
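For the example sentence above, the output should be the modernized spelling _De mensen waren aan het werk._ (archaic _menschen_ normalized to modern _mensen_).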
## Setup:
The model has been finetuned with the following (hyper)parameter values:

- _Learning rate_: 5e-5
- _Batch size_: 32
- _Optimizer_: AdamW
- _Epochs_: 30, with early stopping

To further finetune the model, use the __T5Trainer.py__ script; a generic finetuning sketch follows below.
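If you prefer not to use that script, the following is a minimal sketch of a Keras training loop mirroring the hyperparameters above; the parallel files `train.src`/`train.tgt` and the data-loading details are assumptions, not part of the original setup.
```python
# Hypothetical finetuning sketch, mirroring the setup above.
import tensorflow as tf
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Assumed hypothetical files: one sentence per line, source/target aligned.
with open('train.src') as f:
    train_src = f.read().splitlines()
with open('train.tgt') as f:
    train_tgt = f.read().splitlines()

# Tokenize inputs and gold normalizations; ByT5 operates on raw bytes.
features = tokenizer(train_src, text_target=train_tgt,
                     padding=True, return_tensors='tf')
dataset = tf.data.Dataset.from_tensor_slices(dict(features)).batch(32)

# Hyperparameters from the setup above: AdamW, learning rate 5e-5, 30 epochs.
optimizer = tf.keras.optimizers.AdamW(learning_rate=5e-5)
model.compile(optimizer=optimizer)  # the model computes its own loss from `labels`
model.fit(dataset, epochs=30,
          callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)])
```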