jramompichel committed on
Commit
98b7114
1 Parent(s): a19ed28

Update README_English.md

  1. README_English.md +85 -0
README_English.md CHANGED
@@ -0,0 +1,85 @@
+ ---
+ license: mit
+ language:
+ - gl
+ metrics:
+ - bleu (Gold1): 79.6
+ - bleu (Gold2): 43.3
+ - bleu (Flores): 21.8
+ - bleu (Test-suite): 74.3
+ ---
+
+ **Model description**
+
+ Model developed with OpenNMT for the Spanish-Galician pair using a transformer architecture.
+
+ **How to translate**
+
+ + Open a bash terminal
+ + Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
+ + Install the [OpenNMT-py toolkit v2.2](https://github.com/OpenNMT/OpenNMT-py)
+ + Translate the file input_text with the NOS-MT-es-gl model using the following command:
+
+ ```bash
+ onmt_translate -src input_text -model NOS-MT-es-gl -output ./output_file.txt -replace_unk -phrase_table phrase_table-es-gl.txt -gpu 0
+ ```
+ + The translation is written to the path given by the -output flag.
+
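To translate several files, the documented command can be wrapped in a small shell function. This is a minimal sketch, not part of the original README: the function name `translate_file`, the `DRY_RUN` switch, and the `.gl.txt` output suffix are illustrative assumptions. The flags themselves are copied from the `onmt_translate` call in the translation step.

```bash
# Sketch: wrap the documented onmt_translate call so it can be reused per file.
# Assumptions (not from the README): the name translate_file, the DRY_RUN
# switch, and the ".gl.txt" output suffix are illustrative choices.
translate_file() {
  local src
  local out
  src="$1"
  out="${src%.txt}.gl.txt"   # e.g. news.txt -> news.gl.txt

  # Reuse the function's positional parameters to hold the full command line;
  # the flags are the ones used in the README's translation command.
  set -- onmt_translate -src "$src" -model NOS-MT-es-gl -output "$out" \
         -replace_unk -phrase_table phrase_table-es-gl.txt -gpu 0

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"   # only print the command that would run
  else
    "$@"        # run the translation
  fi
}
```

A loop such as `for f in corpus/*.txt; do translate_file "$f"; done` (where `corpus/` is a hypothetical directory) would then translate each file in turn. Setting `DRY_RUN=1` first prints the commands without running them.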
+ **Training**
+
+ For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations produced directly by human translators. The latter are corpora of Spanish-Portuguese translations, which we converted into Spanish-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
+
+
+ **Training process**
+
+ + The datasets were tokenized with the [Linguakit](https://github.com/citiususc/Linguakit) tokenizer
+ + The vocabulary for the models was created with the [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script from OpenNMT
+ + Using the .yaml in this repository, you can replicate the training process as follows:
+
+ ```bash
+ onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 100000
+ onmt_train -config bpe-es-gl_emb.yaml
+ ```
+
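The two training commands can also be chained so that training starts only if the vocabulary step succeeds. This is a minimal sketch under stated assumptions: the function name `run_training` and the check that the config file exists are illustrative additions, not part of the original instructions.

```bash
# Sketch: run both documented training steps in order, stopping at the first failure.
# Assumptions (not from the README): the name run_training and the config-file
# existence check are illustrative additions.
run_training() {
  local config
  config="${1:-bpe-es-gl_emb.yaml}"   # default: the .yaml named in this README

  if [ ! -f "$config" ]; then
    echo "config not found: $config" >&2
    return 1
  fi

  # Step 1: build the vocabulary from a sample of the corpus.
  onmt_build_vocab -config "$config" -n_sample 100000 || return 1

  # Step 2: train; reached only if the vocabulary step succeeded.
  onmt_train -config "$config"
}
```

Calling `run_training` with no argument uses `bpe-es-gl_emb.yaml` from the current directory. Passing another path lets the same steps run against a modified copy of the configuration.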
+ **Hyper-parameters**
+
+ The hyper-parameters used to train the model are listed in the same .yaml file, bpe-es-gl_emb.yaml.
+
+ **Evaluation**
+
+ BLEU is evaluated on a mix of internally developed test sets (gold1, gold2, test-suite) and other datasets available for Galician (Flores).
+
+ | GOLD 1 | GOLD 2 | FLORES | TEST-SUITE |
+ | ------:| ------:| ------:| ----------:|
+ | 79.6 | 43.3 | 21.8 | 74.3 |
+
+ **Licensing information**
+
+ MIT License
+
+ Copyright (c) 2023 Proxecto Nós
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
+
+ **Funding**
+
+ This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between the Xunta de Galicia and the University of Santiago de Compostela, and by grant ED431G2019/04 from the Galician Ministry of Education, University and Professional Training, the European Regional Development Fund (ERDF/FEDER program), and Groups of Reference: ED431C 2020/21.
+
+ **Citation Information**
+