license: cc-by-sa-4.0
datasets:
  - Mitsua/wikidata-parallel-descriptions-en-ja
language:
  - ja
  - en
metrics:
  - bleu
  - chrf
library_name: transformers
pipeline_tag: translation

ElanMT

ElanMT-BT-en-ja is an English-to-Japanese translation model developed by ELAN MITSUA Project / Abstract Engine.

ElanMT-base-en-ja and ElanMT-base-ja-en are trained from scratch, exclusively on openly licensed corpora such as CC0, CC BY and CC BY-SA.
This model is a fine-tuned checkpoint of ElanMT-base-en-ja, trained exclusively on openly licensed data and on Wikipedia data back-translated with ElanMT-base-ja-en.
No web-crawled corpora or corpora machine-translated by other models are used at any point in training the ElanMT models.

Despite the relatively low-resource training, thanks to back-translation and a newly built CC0 corpus, the model achieves performance comparable to currently available open translation models.

Model Details

This is a translation model based on the Marian MT 6-layer encoder-decoder Transformer architecture with a SentencePiece tokenizer.

Developed by: ELAN MITSUA Project / Abstract Engine
Model type: Translation
Source Language: English
Target Language: Japanese
License: CC BY-SA 4.0

Usage

Install the Python packages:

pip install transformers accelerate sentencepiece

This model has been verified on transformers==4.40.2.

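from transformers import pipeline

# Load the translation pipeline (the model is downloaded on first use).
translator = pipeline('translation', model='Mitsua/elan-mt-bt-en-ja')
# The pipeline returns a list of dicts, each with a 'translation_text' key.
print(translator('Hello. I am an AI.'))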

For longer inputs that contain multiple sentences, splitting the text into sentences with pySBD before translation is recommended.

pip install transformers accelerate sentencepiece pysbd

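import pysbd

# Split the English input into sentences, then translate them as a batch.
# `translator` is the pipeline created in the previous example.
seg_en = pysbd.Segmenter(language="en", clean=False)
txt = 'Hello. I am an AI. How are you doing?'
print(translator(seg_en.segment(txt)))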

This idea comes from the FuguMT repository.
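
The split-then-translate pattern can be wrapped in a small helper. The following is a minimal sketch and not part of the model repository: `translate_long_text` and `naive_split` are hypothetical names, and the regex splitter is only a stand-in so the example runs without extra dependencies. In practice, `split_fn` would be a pySBD segmenter's `segment` method and `translate_fn` a thin wrapper around the translation pipeline.

```python
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace.

    pySBD handles abbreviations and other edge cases far better.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_long_text(
    text: str,
    translate_fn: Callable[[List[str]], List[str]],
    split_fn: Callable[[str], List[str]] = naive_split,
    joiner: str = "",
) -> str:
    """Split text into sentences, translate them as one batch, and rejoin."""
    sentences = [s.strip() for s in split_fn(text) if s.strip()]
    if not sentences:
        return ""
    translated = translate_fn(sentences)
    # Japanese output needs no separator, so the default joiner is empty.
    return joiner.join(translated)


# Example wiring (assumes `translator` and `seg_en` already exist):
# translate_fn = lambda batch: [r["translation_text"] for r in translator(batch)]
# print(translate_long_text(txt, translate_fn, split_fn=seg_en.segment))
```

Passing the translator in as a function keeps the helper independent of any particular model, which also makes it easy to check with a dummy translator.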

Training Data

We relied heavily on the FuguMT author's blog post for dataset collection.

Mitsua/wikidata-parallel-descriptions-en-ja (CC0 1.0)
- We built this new 1.5M-line Wikidata parallel corpus to augment the training data. It greatly improved the model's word-level vocabulary.
The Kyoto Free Translation Task (KFTT) (CC BY-SA 3.0)
- Graham Neubig, "The Kyoto Free Translation Task," http://www.phontron.com/kftt, 2011.
Tatoeba (CC BY 2.0 FR / CC0 1.0)
- https://tatoeba.org/
wikipedia-interlanguage-titles (The MIT License / CC BY-SA 4.0)
- We built parallel titles from the 2024-05-06 Wikipedia dump.
WikiMatrix (CC BY-SA 4.0)
- Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Francisco Guzmán, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia"
MDN Web Docs (The MIT License / CC0 1.0 / CC BY-SA 2.5)
- https://github.com/mdn/translated-content
Wikimedia contenttranslation dump (CC BY-SA 4.0)
- The 2024-05-10 dump is used.

*Even if a dataset itself is CC-licensed, we did not use it if the corpus it contains is based on web crawling, on unauthorized use of copyrighted works, or on the machine translation output of other translation models.

Training Procedure

We relied heavily on "Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model" for the training process and hyperparameter tuning.

Trains a SentencePiece tokenizer with a 32k vocabulary on a 4M-line openly licensed corpus.
Trains a ja-en back-translation model on a 4M-line openly licensed corpus for 6 epochs. (= ElanMT-base-ja-en)
Trains an en-ja base translation model on a 4M-line openly licensed corpus for 6 epochs. (= ElanMT-base-en-ja)
Translates 20M lines of Japanese Wikipedia into English using the back-translation model.
Trains 4 en-ja models, each fine-tuned from the ElanMT-base-en-ja checkpoint, on 24M lines of training data augmented with the back-translated data, for 6 epochs.
Merges the 4 trained models that produce the best validation score on the FLORES+ dev split.
Fine-tunes the merged model on a 1M-line high-quality corpus subset for 5 epochs.
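
The steps above do not say how the four models are merged. One common approach is uniform averaging of the checkpoints' parameters, sketched below purely as an assumption about what "merging" could look like. `average_checkpoints` is a hypothetical helper, and plain Python lists stand in for real weight tensors.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of weights (assumed layout).
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of several checkpoints with identical shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    merged: Checkpoint = {}
    for name in names:
        if len({len(ckpt[name]) for ckpt in checkpoints}) != 1:
            raise ValueError(f"parameter {name!r} has mismatched sizes")
        # zip(*...) groups the i-th weight of every checkpoint together.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(values) / len(checkpoints) for values in columns]
    return merged
```

With such a helper, each candidate merge would be scored on the validation set and the best-scoring one kept before the final fine-tuning step.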

Evaluation

Dataset

FLORES+ (CC BY-SA 4.0) devtest split is used for evaluation.
NTREX (CC BY-SA 4.0)

Result

Model	Params	FLORES+ BLEU	FLORES+ chrf	NTREX BLEU	NTREX chrf
ElanMT-BT	61M	29.96	38.43	25.63	35.41
ElanMT-base w/o back-translation	61M	26.55	35.28	23.04	32.94
ElanMT-tiny	15M	25.93	34.69	22.78	33.00
staka/fugumt-en-ja (*1)	61M	30.89	38.38	24.74	34.23
facebook/mbart-large-50-many-to-many-mmt	610M	26.31	34.37	23.35	32.66
facebook/nllb-200-distilled-600M	615M	17.09	27.32	14.92	26.26
facebook/nllb-200-3.3B	3B	20.04	30.33	17.07	28.46
google/madlad400-3b-mt	3B	24.62	33.89	23.64	33.48
google/madlad400-7b-mt	7B	25.57	34.59	24.60	34.43

*1 Tested on transformers==4.29.2 with num_beams=4.
*2 BLEU scores are calculated with sacreBLEU using tokenize=ja-mecab.
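
To give a sense of what the chrf columns measure, here is a simplified sentence-level sketch of a character n-gram F-score. It is an illustration only: `simple_chrf` is a hypothetical function, the defaults (n-gram orders 1 to 6, beta = 2) are assumptions chosen to mirror common chrF settings, and its output is not guaranteed to match the scores in the table, which come from the authors' own evaluation setup.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over precision and recall averaged across n-gram orders."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100.0 * (1 + b2) * p * r / (b2 * p + r)
```

Because it works on characters rather than words, a score of this kind needs no word segmentation, which is convenient for Japanese output.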

Disclaimer

The translated result may be highly inaccurate, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small, licensed corpus, and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.
免責事項：翻訳結果は不正確で、有害であったりバイアスがかかっている可能性があります。本モデルは比較的小規模でライセンスされたコーパスのみで達成可能な性能を調査するために開発されたモデルであり、翻訳の正確性が必要なユースケースでの使用には適していません。絵藍ミツアプロジェクト及び株式会社アブストラクトエンジンはCC BY-SA 4.0ライセンス第5条に基づき、本モデルの使用によって生じた直接的または間接的な損失に対して、一切の責任を負いません。