license: cc-by-sa-4.0
datasets:
- Mitsua/wikidata-parallel-descriptions-en-ja
language:
- ja
- en
metrics:
- bleu
- chrf
library_name: transformers
pipeline_tag: translation
ElanMT
ElanMT-BT-en-ja is an English-to-Japanese translation model developed by ELAN MITSUA Project / Abstract Engine.
- ElanMT-base-en-ja and ElanMT-base-ja-en are trained from scratch, exclusively on openly licensed corpora such as CC0, CC BY and CC BY-SA.
- This model is a fine-tuned checkpoint of ElanMT-base-en-ja. It is trained exclusively on openly licensed data and on Wikipedia data back-translated with ElanMT-base-ja-en.
- Web-crawled corpora and corpora machine-translated by other models are not used at any point in the training procedure for the ElanMT models.
Despite the relatively low-resource training, back-translation and a newly built CC0 corpus allowed the model to achieve performance comparable to currently available open translation models.
Model Details
This is a translation model based on the Marian MT 6-layer encoder-decoder Transformer architecture with a SentencePiece tokenizer.
- Developed by: ELAN MITSUA Project / Abstract Engine
- Model type: Translation
- Source Language: English
- Target Language: Japanese
- License: CC BY-SA 4.0
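As a quick way to check the architecture described above, the sketch below reads a few fields from a Marian-style configuration object. It assumes the checkpoint exposes a standard `MarianConfig` (with `encoder_layers`, `decoder_layers`, `d_model`, and `vocab_size`); the helper names `summarize_marian_config` and `load_config` are hypothetical and not part of this model card.

```python
from typing import Any, Dict


def summarize_marian_config(config: Any) -> Dict[str, int]:
    """Collect the size-related fields of a Marian-style config object."""
    return {
        "encoder_layers": config.encoder_layers,
        "decoder_layers": config.decoder_layers,
        "d_model": config.d_model,
        "vocab_size": config.vocab_size,
    }


def load_config(model_id: str = "Mitsua/elan-mt-bt-en-ja") -> Any:
    """Download the config from the Hub (needs network access and transformers)."""
    from transformers import AutoConfig  # imported lazily so the helper above has no dependencies

    return AutoConfig.from_pretrained(model_id)
```

Calling `summarize_marian_config(load_config())` should show whether the encoder and decoder depths match the "6-layer" description; no values are quoted here because they were not verified for this sketch.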
Usage
- Install the python packages
pip install transformers accelerate sentencepiece
- This model is verified on
transformers==4.40.2
- Run
from transformers import pipeline
translator = pipeline('translation', model='Mitsua/elan-mt-bt-en-ja')
translator('Hello. I am an AI.')
- For longer inputs containing multiple sentences, splitting the text into sentences with pySBD before translation is recommended.
pip install transformers accelerate sentencepiece pysbd
import pysbd
seg_en = pysbd.Segmenter(language="en", clean=False)
txt = 'Hello. I am an AI. How are you doing?'
print(translator(seg_en.segment(txt)))
This idea is from the FuguMT repo.
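The segment-then-translate pattern above can be wrapped in a small helper. This is a minimal sketch, not part of the released model: the function names `translate_long_text` and `build_default_components` are hypothetical, and joining the Japanese output without separators is an assumption that suits Japanese text but may not suit every use case.

```python
from typing import Callable, Dict, List, Tuple

Translate = Callable[[List[str]], List[Dict[str, str]]]
Split = Callable[[str], List[str]]


def translate_long_text(text: str, translate: Translate, split: Split) -> str:
    """Split text into sentences, translate them as one batch, and join the results."""
    sentences = [s.strip() for s in split(text) if s.strip()]
    if not sentences:
        return ""
    results = translate(sentences)
    # The translation pipeline returns one dict per input with a "translation_text" key.
    return "".join(r["translation_text"] for r in results)


def build_default_components() -> Tuple[Translate, Split]:
    """Create the real translator and sentence splitter (downloads the model)."""
    import pysbd
    from transformers import pipeline

    translator = pipeline("translation", model="Mitsua/elan-mt-bt-en-ja")
    seg_en = pysbd.Segmenter(language="en", clean=False)
    return translator, seg_en.segment
```

With the packages installed, `translate_long_text(txt, *build_default_components())` would return a single Japanese string. Passing the translator and splitter as arguments keeps the helper easy to test with stand-ins.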
Training Data
We relied heavily on the FuguMT author's blog post for dataset collection.
- Mitsua/wikidata-parallel-descriptions-en-ja (CC0 1.0)
- We built this new 1.5M-line Wikidata parallel corpus to augment the training data. This greatly improved the vocabulary at the word level.
- The Kyoto Free Translation Task (KFTT) (CC BY-SA 3.0)
- Graham Neubig, "The Kyoto Free Translation Task," http://www.phontron.com/kftt, 2011.
- Tatoeba (CC BY 2.0 FR / CC0 1.0)
- wikipedia-interlanguage-titles (The MIT License / CC BY-SA 4.0)
- We built parallel titles based on the 2024-05-06 Wikipedia dump.
- WikiMatrix (CC BY-SA 4.0)
- Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Francisco Guzmán, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia"
- MDN Web Docs (The MIT License / CC0 1.0 / CC BY-SA 2.5)
- Wikimedia contenttranslation dump (CC BY-SA 4.0)
- The 2024-05-10 dump is used.
*Even if the dataset itself is CC-licensed, we did not use it if the corpus contained in the dataset is based on web crawling, is based on unauthorized use of copyrighted works, or is based on the machine translation output of other translation models.
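The exclusion rule in the note above can be expressed as a simple provenance filter. This is only an illustrative sketch: the `Corpus` fields and the function names are hypothetical, the flags would have to be filled in by manual review of each dataset, and the actual selection process used for ElanMT is not documented beyond the note.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Corpus:
    name: str
    license: str
    web_crawled: bool = False               # built from web crawling
    unauthorized_copyrighted: bool = False  # built from unauthorized use of copyrighted works
    machine_translated: bool = False        # contains output of other translation models


def is_usable(corpus: Corpus) -> bool:
    """A corpus is rejected if any provenance flag is set, whatever its license."""
    return not (
        corpus.web_crawled
        or corpus.unauthorized_copyrighted
        or corpus.machine_translated
    )


def select_corpora(candidates: Iterable[Corpus]) -> List[Corpus]:
    """Keep only the candidates that pass the provenance check, in input order."""
    return [c for c in candidates if is_usable(c)]
```

The license string is carried along for reference only; checking that a license counts as "openly licensed" is a separate step that this sketch does not attempt.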
Training Procedure
We relied heavily on "Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model" for the training process and hyperparameter tuning.
- Trains a sentencepiece tokenizer with a 32k vocabulary on the 4M-line openly licensed corpus.
- Trains a ja-en back-translation model on the 4M-line openly licensed corpus for 6 epochs. (= ElanMT-base-ja-en)
- Trains an en-ja base translation model on the 4M-line openly licensed corpus for 6 epochs. (= ElanMT-base-en-ja)
- Translates 20M lines of ja Wikipedia to en using the back-translation model.
- Trains 4 en-ja models, each fine-tuned from the ElanMT-base-en-ja checkpoint, for 6 epochs on 24M lines of training data augmented with the back-translated data.
- Merges the 4 trained models into a model that produces the best validation score on the FLORES+ dev split.
- Fine-tunes the merged model on a 1M-line high-quality corpus subset for 5 epochs.
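The back-translation and merge steps above can be sketched as follows. Two points are assumptions rather than facts from this card: the merge method is not specified here, so uniform parameter averaging is used as one common choice, and plain Python lists stand in for real weight tensors. The function names are hypothetical, and selecting the merge by FLORES+ dev score is not shown.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Weights = Dict[str, List[float]]  # stand-in for a model state dict


def make_back_translated_pairs(
    ja_sentences: Sequence[str],
    ja_to_en: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Pair synthetic English sources with the original Japanese targets."""
    en_synthetic = ja_to_en(list(ja_sentences))
    # The human-written Japanese stays on the target side of the en-ja training pair.
    return list(zip(en_synthetic, ja_sentences))


def average_checkpoints(checkpoints: Sequence[Weights]) -> Weights:
    """Uniformly average parameters with matching names across checkpoints."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    merged: Weights = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(col) / len(checkpoints) for col in columns]
    return merged
```

Keeping the original Japanese as the target means the en-ja model always learns to produce human-written output, even when the source side is machine-generated.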
Evaluation
Dataset
Result
Model | Params | FLORES+ BLEU | FLORES+ chrf | NTREX BLEU | NTREX chrf |
---|---|---|---|---|---|
ElanMT-BT | 61M | 29.96 | 38.43 | 25.63 | 35.41 |
ElanMT-base w/o back-translation | 61M | 26.55 | 35.28 | 23.04 | 32.94 |
ElanMT-tiny | 15M | 25.93 | 34.69 | 22.78 | 33.00 |
staka/fugumt-en-ja (*1) | 61M | 30.89 | 38.38 | 24.74 | 34.23 |
facebook/mbart-large-50-many-to-many-mmt | 610M | 26.31 | 34.37 | 23.35 | 32.66 |
facebook/nllb-200-distilled-600M | 615M | 17.09 | 27.32 | 14.92 | 26.26 |
facebook/nllb-200-3.3B | 3B | 20.04 | 30.33 | 17.07 | 28.46 |
google/madlad400-3b-mt | 3B | 24.62 | 33.89 | 23.64 | 33.48 |
google/madlad400-7b-mt | 7B | 25.57 | 34.59 | 24.60 | 34.43 |
- *1 tested on transformers==4.29.2 and num_beams=4
- *2 BLEU score is calculated by sacreBLEU with tokenize=ja-mecab
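The BLEU setting in the note above can be reproduced roughly as below. This is a sketch under assumptions: it uses sacreBLEU's Python API with one reference per sentence, the chrF call uses sacreBLEU's defaults because the card does not state the chrF settings, and the function name `score_translations` is hypothetical. The `ja-mecab` tokenizer needs the MeCab extra (`pip install "sacrebleu[ja]"`).

```python
from typing import Dict, List


def score_translations(hypotheses: List[str], references: List[str]) -> Dict[str, float]:
    """Compute corpus-level BLEU (MeCab-tokenized) and chrF for Japanese output."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")

    import sacrebleu  # imported lazily; requires sacrebleu[ja] for tokenize="ja-mecab"

    # sacreBLEU expects a list of reference streams, hence the extra list around references.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="ja-mecab")
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"bleu": bleu.score, "chrf": chrf.score}
```

Scores from this sketch are only comparable to the table if the same test split, decoding settings, and sacreBLEU version are used.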
Disclaimer
- The translated result may be very incorrect, harmful or biased. The model was developed to investigate achievable performance with only a relatively small, licensed corpus, and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.
- 免責事項:翻訳結果は不正確で、有害であったりバイアスがかかっている可能性があります。本モデルは比較的小規模でライセンスされたコーパスのみで達成可能な性能を調査するために開発されたモデルであり、翻訳の正確性が必要なユースケースでの使用には適していません。絵藍ミツアプロジェクト及び株式会社アブストラクトエンジンはCC BY-SA 4.0ライセンス第5条に基づき、本モデルの使用によって生じた直接的または間接的な損失に対して、一切の責任を負いません。