---
language:
- nl
tags:
- text2text generation
- spelling normalization
- 19th-century Dutch
license: apache-2.0
---

# 19th Century Dutch Spelling Normalization

This repository contains a pretrained and finetuned version of the original __google/ByT5-small__ model, adapted for the task of 19th-century Dutch spelling normalization.
We first pretrained the model on 2 million sentences from Dutch historical novels.
Afterward, we finetuned it on a dataset of 10k 19th-century Dutch sentences;
these sentences were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).

The finetuned model is only available as TensorFlow weights, but these can be converted for use in a PyTorch environment (see the sketch below).
The pretrained-only weights are available in PyTorch format in the __Pretrained_ByT5__ directory; note that these weights still have to be finetuned before use.
The train and validation sets used for finetuning are available in the main repository.
For further information about the model, please see the [GitHub](https://github.com/Awolters123/Master-Thesis) repository.
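
As a minimal conversion sketch: the `from_tf=True` flag of `from_pretrained` loads TensorFlow weights into a PyTorch model (the output directory name below is illustrative, not part of this repository).

```python
# Load the TensorFlow checkpoint into a PyTorch model
# (requires both torch and tensorflow to be installed).
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    'AWolters/ByT5_DutchSpellingNormalization', from_tf=True)

# Optionally save the converted weights so future loads are pure PyTorch
# (the directory name is just an example).
model.save_pretrained('./ByT5_DutchSpellingNormalization_pt')
```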


## How to use:

```python
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# A sentence in 19th-century spelling ("menschen" is the historical form of "mensen").
text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

# Decode the generated ids, skipping ByT5's special tokens.
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```
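
For the example above, the model should produce the normalized spelling, along the lines of `De mensen waren aan het werk.`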

## Setup:

The model has been finetuned with the following (hyper)parameter values:

_Learning rate_: 5e-5  
_Batch size_: 32  
_Optimizer_: AdamW  
_Epochs_: 30, with early stopping

To further finetune the model, use the __T5Trainer.py__ script.
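
For reference, here is a minimal finetuning sketch that mirrors the hyperparameters listed above, assuming a Keras-style training loop; the training pairs, the `tf.keras.optimizers.AdamW` class (available in TF 2.11+), and the early-stopping patience are illustrative assumptions, not the exact configuration of __T5Trainer.py__.

```python
# Minimal finetuning sketch (assumption: Keras training loop, not the exact T5Trainer.py setup).
import tensorflow as tf
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Illustrative (source, target) pairs: historical spelling -> normalized spelling.
sources = ['De menschen waren aan het werk.']
targets = ['De mensen waren aan het werk.']

# Tokenize inputs and targets; `text_target` fills the `labels` field.
batch = tokenizer(sources, text_target=targets, padding=True, return_tensors='tf')

# Hyperparameters from this card: AdamW, learning rate 5e-5, batch size 32,
# up to 30 epochs with early stopping (the patience value is an assumption).
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=5e-5))  # uses the model's internal loss
model.fit(dict(batch),
          batch_size=32,
          epochs=30,
          callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)])
```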