---
language:
- nl
tags:
- text2text generation
- spelling normalization
- 19th-century Dutch
license: apache-2.0
---
|
|
|
# 19th Century Dutch Spelling Normalization
|
|
|
This repository contains a pretrained and finetuned version of the original __google/ByT5-small__, adapted for the task of 19th-century Dutch spelling normalization.
We first pretrained the model on 2 million sentences from Dutch historical novels.
Afterward, we finetuned it on a dataset of 10k 19th-century Dutch sentences, automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).
|
|
|
The model is currently only available in the TensorFlow format but can be converted for use with PyTorch.
The pretrained-only weights are additionally available in the Flax format, in the directory _pretrained_ByT5_; note that these weights still need to be finetuned before use.
The train and validation sets used for finetuning are available in the repository.
For further information about the model, please see the [GitHub](https://github.com/Awolters123/Master-Thesis) repository.
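
For example, `from_pretrained` in the transformers library can load checkpoints across frameworks with the `from_tf` and `from_flax` flags. Below is a minimal sketch; the local path for the Flax weights is a placeholder for wherever you downloaded the _pretrained_ByT5_ directory, and each conversion requires the source framework to be installed:

```
from transformers import T5ForConditionalGeneration

# Convert the published TensorFlow weights to PyTorch
# (requires both torch and tensorflow to be installed).
model = T5ForConditionalGeneration.from_pretrained(
    'AWolters/ByT5_DutchSpellingNormalization', from_tf=True)
model.save_pretrained('ByT5_DutchSpellingNormalization_pt')  # persist as a PyTorch checkpoint

# The pretrained-only Flax weights can be converted the same way;
# replace the path with your local copy of the _pretrained_ByT5_ directory.
pretrained = T5ForConditionalGeneration.from_pretrained(
    'path/to/pretrained_ByT5', from_flax=True)
```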
|
|
|
|
|
## How to use:
|
|
|
```
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

# Load the finetuned normalization model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# A 19th-century Dutch input sentence.
text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

# Generate the normalized spelling.
prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```
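
For this example sentence, the output should be the modernized spelling, e.g. the historical form _menschen_ normalized to _mensen_.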
|
|
|
## Setup:
|
|
|
The model has been finetuned with the following (hyper)parameter values:

- _Learning rate_: 5e-5
- _Batch size_: 32
- _Optimizer_: AdamW
- _Epochs_: 30, with early stopping
|
|
|
To further finetune the model, use the _T5Trainer.py_ script; a minimal sketch of the training setup is shown below.
If you instead want to finetune starting from the pretrained-only weights, you first have to convert the Flax checkpoint to a PyTorch or TensorFlow model.
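
For orientation, here is a minimal, self-contained sketch of what finetuning with the values above could look like in TensorFlow. This is not the _T5Trainer.py_ script: the toy data and the early-stopping criterion are illustrative assumptions, and `tf.keras.optimizers.AdamW` requires TensorFlow 2.11 or newer (older versions provide it under `tf.keras.optimizers.experimental`).

```
import tensorflow as tf
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Toy parallel data; substitute the train/validation sets from this repository.
sources = ['De menschen waren aan het werk.'] * 32
targets = ['De mensen waren aan het werk.'] * 32

features = dict(tokenizer(sources, padding=True, return_tensors='tf'))
labels = tokenizer(targets, padding=True, return_tensors='tf').input_ids
# Mask padding positions so they are ignored by the loss.
features['labels'] = tf.where(labels == tokenizer.pad_token_id, -100, labels)

train_set = tf.data.Dataset.from_tensor_slices(features).batch(32)

# Compiling without an explicit loss makes the model use its internal
# seq2seq cross-entropy loss, computed from the 'labels' key.
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=5e-5))

# 30 epochs with early stopping; monitor 'val_loss' instead when
# passing a validation set to fit().
early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
model.fit(train_set, epochs=30, callbacks=[early_stop])
```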