metadata
library_name: JoeyNMT
task: Machine-translation
tags:
  - JoeyNMT
  - Machine-translation
language: rw
datasets:
  - DigitalUmuganda/kinyarwanda-english-machine-translation-dataset
widget:
  - text: Muraho neza, murakaza neza mu Rwanda.
    example_title: Muraho neza, murakaza neza mu Rwanda.

Kinyarwanda-to-English Machine Translation

This is a Kinyarwanda-to-English machine translation model, built and trained with the JoeyNMT framework. It uses a Transformer encoder-decoder architecture and was trained on a 47,211-pair English-Kinyarwanda bitext dataset prepared by Digital Umuganda.

Model architecture

Encoder & Decoder

Type: Transformer
Num_layer: 6
Num_heads: 8
Embedding_dim: 256
ff_size: 1024
Dropout: 0.1
Layer_norm: post
Initializer: xavier
Total params: 12563968

Pre-processing

Tokenizer_type: subword-nmt
num_merges: 4000
BPE encoding learned on the bitext, separate vocabularies for each language
Pretokenizer: None
No lowercase applied
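
To illustrate what "num_merges" means, here is a minimal sketch of how BPE merge operations are learned from a word list. It is a toy re-implementation for explanation only, not subword-nmt's actual code; the function name `learn_bpe` and the `</w>` end-of-word marker are choices made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge operations from a list of words (toy version)."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


toy_words = ["ab", "ab", "ab", "cb"]
print(learn_bpe(toy_words, 2))
```

With num_merges: 4000, this loop would run up to 4000 times on each language's side of the bitext, since the vocabularies are kept separate.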

Training

Optimizer: Adam
Loss: crossentropy
Epochs: 30
Batch_size: 256
Number of GPUs: 1

Evaluation

Evaluation_metrics: BLEU, chrF
Tokenization: None
Beam_width: 15
Beam_alpha: 1.0
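
The effect of Beam_alpha can be sketched as follows. This assumes a GNMT-style length penalty, ((5 + length) / 6) ** alpha; whether JoeyNMT applies exactly this formula is an assumption here, and the helper names are hypothetical. The point is only that a larger alpha makes longer hypotheses more competitive when beam candidates are ranked.

```python
def length_penalty(length, alpha):
    """GNMT-style length penalty (assumed form, see the note above)."""
    return ((5.0 + length) / 6.0) ** alpha


def rank_hypotheses(hypotheses, alpha=1.0):
    """Rank (tokens, summed_log_prob) candidates by length-normalized score, best first."""
    scored = [
        (log_prob / length_penalty(len(tokens), alpha), tokens)
        for tokens, log_prob in hypotheses
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [tokens for _, tokens in scored]


# Two made-up candidates: a short one and a longer one with a lower raw score.
candidates = [
    (["hello"], -1.0),
    (["hello", "and", "welcome", "to", "Rwanda", ".", "</s>"], -1.5),
]
print(rank_hypotheses(candidates, alpha=1.0)[0])  # longer candidate ranks first
print(rank_hypotheses(candidates, alpha=0.0)[0])  # no normalization: shorter one wins
```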

Tools

* JoeyNMT 2.0.0
* datasets
* pandas
* numpy
* transformers
* sentencepiece
* PyTorch (with CUDA)
* sacrebleu
* protobuf>=3.20.1

How to train

Use the following link for more information

Translation

To install JoeyNMT, run:

$ git clone https://github.com/joeynmt/joeynmt.git
$ cd joeynmt
$ pip install -e .

Interactive translation (stdin):

$ python -m joeynmt translate args.yaml

File translation:

$ python -m joeynmt translate args.yaml < src_lang.txt > hypothesis_trg_lang.txt

Accuracy measurement

Sacrebleu installation:

$ pip install sacrebleu

Measurement (BLEU, chrF):

$ sacrebleu reference.tsv -i hypothesis.tsv -m bleu chrf 
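
sacrebleu compares the two files line by line, so the reference and hypothesis files must contain the same number of lines. The following sketch checks this before scoring. The helper names `read_lines` and `check_aligned` are hypothetical and not part of sacrebleu or JoeyNMT.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_aligned(references, hypotheses):
    """Return (number of segment pairs, 1-based indices of empty hypotheses).

    Raises ValueError when the two sides differ in length, because the
    segments could then no longer be paired up for scoring.
    """
    if len(references) != len(hypotheses):
        raise ValueError(
            f"{len(references)} reference lines vs {len(hypotheses)} hypothesis lines"
        )
    empty = [i for i, hyp in enumerate(hypotheses, start=1) if not hyp.strip()]
    return len(references), empty


# Small in-memory example; the sentences are made up for illustration.
refs = ["Hello, welcome to Rwanda.", "Thank you."]
hyps = ["Hello, welcome to Rwanda.", ""]
print(check_aligned(refs, hyps))
```

On real data, the two lists would come from `read_lines("reference.tsv")` and `read_lines("hypothesis.tsv")`. Empty hypotheses are reported rather than rejected, since they still count as segments but usually signal a decoding problem.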

To-do

  • Test the model on other datasets, including JW300
  • Train some available state-of-the-art (SOTA) models on the Digital Umuganda dataset
  • Expand the dataset

Result

The following result was obtained using sacrebleu.

Kinyarwanda-to-English:

BLEU: 79.87
chrF: 84.40