---
license: cc-by-sa-4.0
datasets:
- Mitsua/wikidata-parallel-descriptions-en-ja
language:
- ja
- en
metrics:
- bleu
- chrf
library_name: transformers
pipeline_tag: translation
---
# ElanMT
[**ElanMT-BT-ja-en**](https://huggingface.co/Mitsua/elan-mt-bt-ja-en) is a Japanese-to-English translation model developed by the [ELAN MITSUA Project](https://elanmitsua.com/en/) / Abstract Engine.
- [**ElanMT-base-ja-en**](https://huggingface.co/Mitsua/elan-mt-base-ja-en) and [**ElanMT-base-en-ja**](https://huggingface.co/Mitsua/elan-mt-base-en-ja) are trained from scratch, exclusively on openly licensed corpora such as CC0, CC BY, and CC BY-SA.
- This model is a fine-tuned checkpoint of **ElanMT-base-ja-en**, trained exclusively on openly licensed data plus Wikipedia data back-translated with **ElanMT-base-en-ja**.
- No web-crawled or machine-translated corpora were used at any stage of training the **ElanMT** models.

Despite the relatively low-resource training setup, thanks to back-translation and [a newly built CC0 corpus](https://huggingface.co/datasets/Mitsua/wikidata-parallel-descriptions-en-ja), the model achieves performance comparable to currently available open translation models.

## Model Details
This is a translation model based on the [Marian MT](https://marian-nmt.github.io/) 6-layer encoder-decoder Transformer architecture with a SentencePiece tokenizer.
- **Developed by**: [ELAN MITSUA Project](https://elanmitsua.com/en/) / Abstract Engine
- **Model type**: Translation
- **Source Language**: Japanese
- **Target Language**: English
- **License**: [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

## Usage
1. Install the Python packages

`pip install transformers accelerate sentencepiece`

* This model has been verified on `transformers==4.40.2`.

2. Run

```python
from transformers import pipeline
translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
translator('こんにちは。私はAIです。')
```

3. For longer text with multiple sentences, using [pySBD](https://github.com/nipunsadvilkar/pySBD) for sentence segmentation is recommended.

`pip install transformers accelerate sentencepiece pysbd`
```python
import pysbd
from transformers import pipeline

translator = pipeline('translation', model='Mitsua/elan-mt-bt-ja-en')
# Segment the input into sentences, then translate sentence by sentence.
seg = pysbd.Segmenter(language="ja", clean=False)
txt = 'こんにちは。私はAIです。お元気ですか?'
print(translator(seg.segment(txt)))
```
This idea comes from the [FuguMT](https://huggingface.co/staka/fugumt-ja-en) repo.

## Training Data
We referred heavily to the [FuguMT author's blog post](https://staka.jp/wordpress/?p=413) for dataset collection.

- [Mitsua/wikidata-parallel-descriptions-en-ja](https://huggingface.co/datasets/Mitsua/wikidata-parallel-descriptions-en-ja) (CC0 1.0)
  - We newly built this 1.5M-line Wikidata parallel corpus to augment the training data. This greatly improved word-level vocabulary coverage.
- [The Kyoto Free Translation Task (KFTT)](https://www.phontron.com/kftt/) (CC BY-SA 3.0)
  - Graham Neubig, "The Kyoto Free Translation Task," http://www.phontron.com/kftt, 2011.
- [Tatoeba](https://tatoeba.org/en/downloads) (CC BY 2.0 FR / CC0 1.0)
  - https://tatoeba.org/
- [wikipedia-interlanguage-titles](https://github.com/bhaddow/wikipedia-interlanguage-titles) (The MIT License / CC BY-SA 4.0)
  - We built parallel titles based on the 2024-05-06 Wikipedia dump.
- [WikiMatrix](https://github.com/facebookresearch/LASER/tree/main/tasks/WikiMatrix) (CC BY-SA 4.0)
  - Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Francisco Guzmán, "WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia"
- [MDN Web Docs](https://github.com/mdn/translated-content) (The MIT License / CC0 1.0 / CC BY-SA 2.5)
  - https://github.com/mdn/translated-content
- [Wikimedia contenttranslation dump](https://dumps.wikimedia.org/other/contenttranslation/) (CC BY-SA 4.0)
  - The 2024-05-10 dump is used.

*Even if a dataset itself is CC-licensed, we did not use it if the corpus it contains is based on web crawling, on unauthorized use of copyrighted works, or on the machine translation output of other translation models.

## Training Procedure
We referred heavily to "[Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model](https://github.com/marian-nmt/marian-examples/tree/master/wmt2017-transformer)" for the training process and hyperparameter tuning.

1. Train a SentencePiece tokenizer with a 32k vocabulary on a 4M-line openly licensed corpus.
2. Train an `en-ja` back-translation model on the 4M-line openly licensed corpus for 6 epochs. = **ElanMT-base-en-ja**
3. Train a `ja-en` base translation model on the 4M-line openly licensed corpus for 6 epochs. = **ElanMT-base-ja-en**
4. Translate 20M lines of `en` Wikipedia to `ja` using the back-translation model.
5. Train 4 `ja-en` models, each fine-tuned from the **ElanMT-base-ja-en** checkpoint, on 24M lines of training data augmented with the back-translated data, for 6 epochs.
6. Merge the 4 trained models that produce the best validation score on the FLORES+ dev split.
7. Fine-tune the merged model on a 1M-line high-quality corpus subset for 5 epochs.

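Step 6 above does not spell out the merging method; uniform parameter averaging of the fine-tuned checkpoints is one common approach. The sketch below is a hypothetical illustration under that assumption (the `average_checkpoints` helper and toy tensors are ours, not part of the ElanMT codebase):

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average parameter tensors across checkpoints with identical keys."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Toy example with two "checkpoints" holding a single parameter each.
a = {'w': torch.tensor([1.0, 3.0])}
b = {'w': torch.tensor([3.0, 5.0])}
merged = average_checkpoints([a, b])  # w -> [2.0, 4.0]
```

In practice the averaged state dict would then be loaded back into a Marian MT model via `load_state_dict` before the final fine-tuning step.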
## Evaluation
### Dataset
- [FLORES+](https://github.com/openlanguagedata/flores) (CC BY-SA 4.0): the devtest split is used for evaluation.
- [NTREX](https://github.com/MicrosoftTranslator/NTREX) (CC BY-SA 4.0)

### Result
| **Model** | **Params** | **FLORES+ BLEU** | **FLORES+ chrF** | **NTREX BLEU** | **NTREX chrF** |
|:---|---:|---:|---:|---:|---:|
| [**ElanMT-BT**](https://huggingface.co/Mitsua/elan-mt-bt-ja-en) | 61M | 24.87 | 55.02 | 22.57 | 52.48 |
| [**ElanMT-base**](https://huggingface.co/Mitsua/elan-mt-base-ja-en) | 61M | 21.61 | 52.53 | 18.43 | 49.09 |
| [**ElanMT-tiny**](https://huggingface.co/Mitsua/elan-mt-tiny-ja-en) | 15M | 20.40 | 51.81 | 18.43 | 49.39 |
| [staka/fugumt-ja-en](https://huggingface.co/staka/fugumt-ja-en) | 61M | 24.10 | 54.97 | 22.33 | 51.84 |
| [facebook/mbart-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt) | 610M | 23.88 | 53.98 | 22.59 | 51.57 |
| [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 615M | 22.92 | 52.13 | 22.59 | 51.36 |
| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3B | 28.13 | 56.86 | 27.65 | 55.60 |
| [google/madlad400-3b-mt](https://huggingface.co/google/madlad400-3b-mt) | 3B | 26.95 | 56.62 | 26.11 | 54.61 |
| [google/madlad400-7b-mt](https://huggingface.co/google/madlad400-7b-mt) | 7B | 28.84 | 57.46 | 28.19 | 55.85 |

- *1 Tested on `transformers==4.29.2` with `num_beams=4`.
- *2 BLEU scores are calculated with `sacreBLEU`.

## Disclaimer
- Translated results may be incorrect, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small, openly licensed corpus, and is not suitable for use cases requiring high translation accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of this model.
- 免責事項:翻訳結果は不正確で、有害であったりバイアスがかかっている可能性があります。本モデルは比較的小規模でライセンスされたコーパスのみで達成可能な性能を調査するために開発されたモデルであり、翻訳の正確性が必要なユースケースでの使用には適していません。絵藍ミツアプロジェクト及び株式会社アブストラクトエンジンはCC BY-SA 4.0ライセンス第5条に基づき、本モデルの使用によって生じた直接的または間接的な損失に対して、一切の責任を負いません。