AWolters/ByT5_DutchSpellingNormalization

AWolters committed on Jul 1, 2023

Commit 4109a12 · 1 Parent(s): 1896381

Update README.md

README.md +51 -0


@@ -1,3 +1,54 @@
 ---
 license: apache-2.0
 ---

 ---
+language:
+- nl
+tags:
+- text2text generation
+- spelling normalization
+- 19th-century Dutch
 license: apache-2.0
 ---
+# 19th Century Dutch Spelling Normalization
+This repository contains a model based on the original ByT5-small.
+It has been pretrained and then finetuned for the task of 19th-century Dutch spelling normalization.
+We first pretrained the model on 2 million sentences from Dutch historical novels.
+Afterward, we finetuned it on a 10k dataset of 19th-century Dutch sentences;
+these sentences were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).
+The model is only available in TensorFlow format but can be converted to PyTorch.
+The pretrained-only weights are also available in Flax format; note that these weights have to be finetuned before use.
+They are located in the directory _pretrained_ByT5_.
+The train and validation sets used for finetuning are available in the repository.
+For further information about the model and data, please see the [GitHub](https://github.com/Awolters123/Master-Thesis) repository.
+## How to use:
+```python
+from transformers import AutoTokenizer, TFT5ForConditionalGeneration
+
+# Load the byte-level tokenizer and the finetuned TensorFlow model from the Hub
+tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
+model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
+
+# A sentence in 19th-century spelling
+text = 'De menschen waren aan het werk.'
+tokenized = tokenizer(text, return_tensors='tf')
+
+# Generate the normalized sentence
+prediction = model.generate(input_ids=tokenized['input_ids'],
+                            attention_mask=tokenized['attention_mask'],
+                            max_new_tokens=100)
+# Decode the generated ids back to text, dropping special tokens
+print(tokenizer.decode(prediction[0], skip_special_tokens=True))
+```
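ByT5 works on UTF-8 bytes rather than subword tokens, so `max_new_tokens=100` in the usage example caps the output at roughly 100 bytes, and longer sentences may be cut off. The following is a minimal sketch of how a per-sentence budget could be estimated. The helper names `byt5_token_count` and `suggest_max_new_tokens` are hypothetical and not part of this repository, and the 25% margin is an arbitrary assumption rather than a tuned value.

```python
def byt5_token_count(text: str) -> int:
    """Number of ByT5 input ids for `text`.

    ByT5 uses one id per UTF-8 byte and appends one end-of-sequence id.
    """
    return len(text.encode("utf-8")) + 1


def suggest_max_new_tokens(text: str) -> int:
    """Estimate a generation budget for normalizing `text`.

    Assumption: the normalized sentence is about as long as the input,
    so the input length plus a 25% margin is used as the budget.
    """
    n = byt5_token_count(text)
    return n + n // 4


sentence = 'De menschen waren aan het werk.'
print(byt5_token_count(sentence))        # 31 bytes + 1 end-of-sequence id = 32
print(suggest_max_new_tokens(sentence))  # 32 + 32 // 4 = 40
```

The result could be passed as `max_new_tokens` to `model.generate` in place of the fixed value of 100. Non-ASCII characters count as more than one byte, which is why the length is taken after encoding.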
+## Setup:
+The model has been finetuned with the following (hyper)parameter values:
+- _Learning rate_: 5e-5
+- _Batch size_: 32
+- _Optimizer_: AdamW
+- _Epochs_: 30, with early stopping
+
+To further finetune the model, use the _T5Trainer.py_ script.
+If you want to finetune the pretrained-only weights yourself, you first have to convert the Flax weights to PyTorch or TensorFlow.
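The conversions mentioned in this README can be sketched with the cross-framework loading flags of `from_pretrained` in the Transformers library. This is a sketch under stated assumptions, not the author's procedure: the function names are hypothetical, and it assumes that the _pretrained_ByT5_ directory holds a loadable Flax checkpoint together with its config, which the README does not confirm. Loading with `from_tf=True` requires both PyTorch and TensorFlow to be installed, and `from_flax=True` requires both PyTorch and Flax.

```python
MODEL_ID = "AWolters/ByT5_DutchSpellingNormalization"
# Directory named in the README; its exact layout is an assumption.
PRETRAINED_SUBFOLDER = "pretrained_ByT5"


def load_finetuned_as_pytorch(model_id: str = MODEL_ID):
    """Load the finetuned TensorFlow checkpoint into a PyTorch model."""
    # Imported lazily so that defining the function does not require the libraries.
    from transformers import T5ForConditionalGeneration

    # from_tf=True converts the TensorFlow weights while loading.
    return T5ForConditionalGeneration.from_pretrained(model_id, from_tf=True)


def load_pretrained_only_as_pytorch(model_id: str = MODEL_ID,
                                    subfolder: str = PRETRAINED_SUBFOLDER):
    """Load the pretrained-only Flax weights into a PyTorch model (assumed layout)."""
    from transformers import T5ForConditionalGeneration

    # from_flax=True converts the Flax weights while loading.
    return T5ForConditionalGeneration.from_pretrained(
        model_id, subfolder=subfolder, from_flax=True
    )
```

Calling `load_finetuned_as_pytorch().save_pretrained('some_local_dir')` would then write a PyTorch checkpoint to a local directory of your choice, so the conversion only has to run once.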