metadata
license: cc-by-sa-4.0
datasets:
  - Mitsua/wikidata-parallel-descriptions-en-ja
language:
  - ja
  - en
metrics:
  - bleu
  - chrf
library_name: transformers
pipeline_tag: translation

ElanMT

ElanMT-BT-en-ja is an English-to-Japanese translation model developed by ELAN MITSUA Project / Abstract Engine.

  • ElanMT-base-en-ja and ElanMT-base-ja-en are trained from scratch, exclusively on openly licensed corpora such as CC0, CC BY and CC BY-SA.
  • This model is a fine-tuned checkpoint of ElanMT-base-en-ja, trained exclusively on openly licensed data and on Wikipedia data back-translated with ElanMT-base-ja-en.
  • No web-crawled corpora and no corpora machine-translated by other models are used at any point in the training of the ElanMT models.

Despite the relatively low-resource training, the model achieves performance comparable to currently available open translation models, thanks to back-translation and a newly built CC0 corpus.

Model Details

This is a translation model based on the Marian MT 6-layer encoder-decoder Transformer architecture with a SentencePiece tokenizer.

Usage

  1. Install the Python packages

pip install transformers accelerate sentencepiece

  • This model is verified on transformers==4.40.2
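  2. Run
from transformers import pipeline
# Downloads the model from the Hugging Face Hub on first use.
translator = pipeline('translation', model='Mitsua/elan-mt-bt-en-ja')
# Returns a list with one dict per input; the text is under 'translation_text'.
translator('Hello. I am an AI.')
  3. For longer inputs containing multiple sentences, splitting the text with pySBD is recommended.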

pip install transformers accelerate sentencepiece pysbd

import pysbd
seg_en = pysbd.Segmenter(language="en", clean=False)
txt = 'Hello. I am an AI. How are you doing?'
print(translator(seg_en.segment(txt)))

This idea comes from the FuguMT repo.
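As a minimal sketch of how the segmentation step and the translation step could be wrapped together, the helper below takes the segmenter and the translator as plain callables. The function name `translate_long_text` is hypothetical, and joining the outputs without a separator is an assumption based on Japanese being written without spaces between sentences.

```python
from typing import Callable, Dict, List, Sequence


def translate_long_text(
    text: str,
    segment: Callable[[str], Sequence[str]],
    translate: Callable[[List[str]], List[Dict[str, str]]],
) -> str:
    """Split text into sentences, translate them as one batch, and rejoin."""
    # Drop empty segments so the translator never receives blank input.
    sentences = [s.strip() for s in segment(text) if s.strip()]
    if not sentences:
        return ""
    results = translate(sentences)
    # A transformers translation pipeline returns one dict per input,
    # each with a "translation_text" key.
    # Assumption: Japanese output is joined without a separator.
    return "".join(r["translation_text"] for r in results)


# Intended wiring with the objects defined above (not executed here):
#   translate_long_text(txt, seg_en.segment, translator)
```

Passing the segmenter and translator in as arguments keeps the helper independent of any particular model, so it can be tried with stand-in functions before the real pipeline is loaded.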

Training Data

We relied heavily on the FuguMT author's blog post for dataset collection.

*Even if a dataset itself is CC-licensed, we did not use it if the corpus it contains is based on web crawling, on unauthorized use of copyrighted works, or on the machine translation output of other translation models.

Training Procedure

We relied heavily on "Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model" for the training process and hyperparameter tuning.

  1. Train a SentencePiece tokenizer with a 32k vocabulary on a 4M-line openly licensed corpus.
  2. Train a ja-en back-translation model on the 4M-line openly licensed corpus for 6 epochs. = ElanMT-base-ja-en
  3. Train an en-ja base translation model on the 4M-line openly licensed corpus for 6 epochs. = ElanMT-base-en-ja
  4. Translate 20M lines of Japanese Wikipedia into English using the back-translation model.
  5. Train 4 en-ja models, each fine-tuned from the ElanMT-base-en-ja checkpoint, for 6 epochs on 24M lines of training data augmented with the back-translated data.
  6. Merge the 4 trained models so as to produce the best validation score on the FLORES+ dev split.
  7. Fine-tune the merged model on a 1M-line high-quality corpus subset for 5 epochs.
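The back-translation augmentation in steps 4 and 5 can be sketched roughly as follows. This illustrates the general technique, not the project's actual pipeline: the function name `build_augmented_corpus`, the `back_translate` callable, and the batch size are all hypothetical. The key point is that each synthetic English sentence becomes the source side and the original Japanese sentence stays as the target side.

```python
from typing import Callable, Iterable, List, Tuple


def build_augmented_corpus(
    parallel_pairs: Iterable[Tuple[str, str]],
    monolingual_ja: Iterable[str],
    back_translate: Callable[[List[str]], List[str]],
    batch_size: int = 64,  # hypothetical value, not from the source
) -> List[Tuple[str, str]]:
    """Return (en, ja) pairs: the real parallel data plus synthetic pairs."""
    corpus = list(parallel_pairs)
    batch: List[str] = []

    def flush() -> None:
        # Translate the pending Japanese sentences into English and pair
        # each synthetic English source with its original Japanese target.
        if not batch:
            return
        synthetic_en = back_translate(batch)
        corpus.extend(zip(synthetic_en, batch))
        batch.clear()

    for ja in monolingual_ja:
        batch.append(ja)
        if len(batch) >= batch_size:
            flush()
    flush()  # handle the final partial batch
    return corpus
```

In the procedure above, the 4M real lines plus the 20M back-translated lines would correspond to the 24M-line training set used in step 5.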

Evaluation

Dataset

  • FLORES+ (CC BY-SA 4.0) devtest split is used for evaluation.
  • NTREX (CC BY-SA 4.0)

Result

| Model | Params | FLORES+ BLEU | FLORES+ chrf | NTREX BLEU | NTREX chrf |
|:---|---:|---:|---:|---:|---:|
| ElanMT-BT | 61M | 29.96 | 38.43 | 25.63 | 35.41 |
| ElanMT-base w/o back-translation | 61M | 26.55 | 35.28 | 23.04 | 32.94 |
| ElanMT-tiny | 15M | 25.93 | 34.69 | 22.78 | 33.00 |
| staka/fugumt-en-ja (*1) | 61M | 30.89 | 38.38 | 24.74 | 34.23 |
| facebook/mbart-large-50-many-to-many-mmt | 610M | 26.31 | 34.37 | 23.35 | 32.66 |
| facebook/nllb-200-distilled-600M | 615M | 17.09 | 27.32 | 14.92 | 26.26 |
| facebook/nllb-200-3.3B | 3B | 20.04 | 30.33 | 17.07 | 28.46 |
| google/madlad400-3b-mt | 3B | 24.62 | 33.89 | 23.64 | 33.48 |
| google/madlad400-7b-mt | 7B | 25.57 | 34.59 | 24.60 | 34.43 |
  • *1 tested on transformers==4.29.2 and num_beams=4
  • *2 BLEU score is calculated by sacreBLEU with tokenize=ja-mecab

Disclaimer

  • Translation results may be highly inaccurate, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small, licensed corpus, and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.
  • 免責事項:翻訳結果は不正確で、有害であったりバイアスがかかっている可能性があります。本モデルは比較的小規模でライセンスされたコーパスのみで達成可能な性能を調査するために開発されたモデルであり、翻訳の正確性が必要なユースケースでの使用には適していません。絵藍ミツアプロジェクト及び株式会社アブストラクトエンジンはCC BY-SA 4.0ライセンス第5条に基づき、本モデルの使用によって生じた直接的または間接的な損失に対して、一切の責任を負いません。