monsoon-nlp
/

byt5-dv

Text2Text Generation

Inference Endpoints

text-generation-inference

Model card Files Files and versions Community

byt5-dv / README.md

monsoon-nlp's picture

better finetuned run

6924d3e almost 3 years ago

|

raw history blame contribute delete

No virus

951 Bytes

	---
	language: dv
	---

	# byt5-dv

	Pretrained from scratch on Dhivei (language of the Maldives)
	with ByT5, Google's new byte-level tokenizer strategy.

	Corpus: dv.wikipedia.org as of March 2020 (TFDS)

	Notebook - Pretraining on Wikipedia: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH

	## Demo

	Notebook - Finetuning on Maldivian news classification task: https://colab.research.google.com/drive/11u5SafR4bKICmArgDl6KQ9vqfYtDpyWp

	Current performance:

	- mBERT: 52%
	- byt5-dv: 81%
	- dv-wave (ELECTRA): 89%
	- dv-muril: 90.7%
	- dv-labse: 91.3-91.5%

	Source of dataset: https://github.com/Sofwath/DhivehiDatasets

	## Work in progress - todos

	The Wikipedia corpus is too small for this language. In the future I would add
	OSCAR and Sofwath's Maldivian corpus, if I can rewrite the script to accept those
	as one TFDS dataset.

	This is based on ByT5-small ... we should try a larger model

	This needs more time for pretraining