jramompichel committed on
Commit
ef4a603
1 Parent(s): 38da8a5

Update README_English.md

Files changed (1)
  1. README_English.md +85 -0
README_English.md CHANGED
@@ -0,0 +1,85 @@
+ ---
+ license: mit
+ language:
+ - gl
+ metrics:
+ - bleu (Gold1): 36.8
+ - bleu (Gold2): 47.1
+ - bleu (Flores): 32.3
+ - bleu (Test-suite): 42.7
+ ---
+
+ **Model Description**
+
+ An OpenNMT model for the English-Galician pair, built on a transformer architecture.
+
+ **How to translate**
+
+ + Open a bash terminal
+ + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
+ + Install the [OpenNMT toolkit v.2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Translate an input_text using the NOS-MT-en-gl model with the following command:
+
+ ```bash
+ onmt_translate -src input_text -model NOS-MT-en-gl -output ./output_file.txt -replace_unk -gpu 0
+ ```
+ + The translation result is written to the path given by the `-output` flag.
+
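The command above pins the run to GPU 0. As a minimal sketch, assuming that `onmt_translate` falls back to the CPU when `-gpu` is left out, the hypothetical helper `translate_cmd` below prints the command for either case. The helper name and its arguments are illustrative and not part of this repository.

```bash
# Hypothetical helper, not part of the repository: prints the onmt_translate
# command from the step above, with or without the GPU flag.
# Assumption: when -gpu is omitted, onmt_translate runs on the CPU.
# Paths containing spaces are not handled by this sketch.
translate_cmd() {
    src="$1"            # file with the English input text
    model="$2"          # model file, e.g. NOS-MT-en-gl
    out="$3"            # file where the Galician output is written
    device="${4:-cpu}"  # "gpu" appends -gpu 0, anything else leaves it out

    cmd="onmt_translate -src $src -model $model -output $out -replace_unk"
    if [ "$device" = "gpu" ]; then
        cmd="$cmd -gpu 0"
    fi
    printf '%s\n' "$cmd"
}

# Print the CPU and GPU variants of the README command.
translate_cmd input_text NOS-MT-en-gl ./output_file.txt cpu
translate_cmd input_text NOS-MT-en-gl ./output_file.txt gpu
```

Printing the command instead of running it keeps the sketch safe to try before the model and input files are in place; the printed line can then be pasted into the terminal.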
+ **Training**
+
+ For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations produced directly by human translators. The latter are corpora of English-Portuguese translations, which we converted into English-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration of out-of-vocabulary words.
+
+ **Training process**
+
+ + The datasets were tokenized with the [Linguakit](https://github.com/citiususc/Linguakit) tokenizer
+ + The vocabulary for the models was generated with the OpenNMT script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py)
+ + Using the .yaml file in this repository, you can replicate the training process as follows:
+
+ ```bash
+ onmt_build_vocab -config bpe-en-gl_emb.yaml -n_sample 100000
+ onmt_train -config bpe-en-gl_emb.yaml
+ ```
+
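The two commands above have to run in order, and training is only worth starting if the vocabulary step succeeded. A minimal sketch of that sequencing follows. `run_step`, `replicate_training` and the `DRY_RUN` variable are hypothetical names introduced for illustration; only the two `onmt_*` commands and their flags come from this README.

```bash
# run_step prints a command, then executes it unless DRY_RUN is set to 1.
run_step() {
    echo "running: $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

# replicate_training runs the vocabulary step and then training,
# stopping at the first step that fails.
replicate_training() {
    config="$1"              # e.g. bpe-en-gl_emb.yaml
    n_sample="${2:-100000}"  # sample size used in the README command

    run_step onmt_build_vocab -config "$config" -n_sample "$n_sample" &&
    run_step onmt_train -config "$config"
}

# Dry run: only show what would be executed.
DRY_RUN=1 replicate_training bpe-en-gl_emb.yaml
```

Calling `replicate_training bpe-en-gl_emb.yaml` without `DRY_RUN=1` runs both steps for real, which requires the OpenNMT toolkit to be installed and the corpora referenced in the .yaml file to be available.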
+ **Hyper-parameters**
+
+ The parameters used to develop the model can be consulted directly in the same .yaml file, bpe-en-gl_emb.yaml.
+
+ **Evaluation**
+
+ BLEU was computed on a mix of internally developed test sets (gold1, gold2, test-suite) and another dataset available for Galician (Flores).
+
+ | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE |
+ | ------------- |:-------------:| -------:|----------:|
+ | 36.8 | 47.1 | 32.3 | 42.7 |
+
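For readers unfamiliar with the metric, the standard corpus-level definition of BLEU is sketched below. This README does not state which implementation, tokenization or n-gram order produced the scores above, so the usual choice of N = 4 with uniform weights is an assumption here, not a description of the authors' setup.

```latex
% p_n : modified n-gram precision of the system output against the references
% w_n : weight of each n-gram order (assumed uniform, w_n = 1/N with N = 4)
% c   : total length of the system output, r : effective reference length
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
  1             & \text{if } c > r \\
  e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

The formula yields a value between 0 and 1; scores such as 36.8 are that value multiplied by 100.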
+ **Licensing information**
+
+ MIT License
+
+ Copyright (c) 2023 Proxecto Nós
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ **Funding**
+
+ This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between the Xunta de Galicia and the University of Santiago de Compostela, by grant ED431G2019/04 from the Galician Ministry of Education, University and Professional Training and the European Regional Development Fund (ERDF/FEDER program), and by Groups of Reference: ED431C 2020/21.
+
+ **Citation Information**