Create README_English.md
Browse files- README_English.md +90 -0
README_English.md
ADDED
@@ -0,0 +1,90 @@
---
license: mit
language:
- gl
metrics:
- bleu (Gold1): 82.6
- bleu (Gold2): 49.9
- bleu (Flores): 23.8
- bleu (Test-suite): 77.2
---

---
License: MIT
---

**Model Description**

OpenNMT model for English-Galician translation, built on a transformer architecture.

**How to translate**

+ Open a bash terminal
+ Install [Python 3.9](https://www.python.org/downloads/release/python-390/)
+ Install [OpenNMT-py toolkit v2.2](https://github.com/OpenNMT/OpenNMT-py)
+ Translate an input file (here `input_text`) using the NOS-MT-gl-es model with the following command:

```bash
onmt_translate -src input_text -model NOS-MT-gl-es.pt -output ./output_file.txt -replace_unk -gpu 0
```
+ The translation is written to the path given by the `-output` flag.
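
The translation step above can be wrapped in a small helper. This is a minimal sketch, not part of OpenNMT-py: the function name `translate_file` and the `DRY_RUN` variable are hypothetical, while the flags are the ones from the command above. It checks that the input file exists before calling `onmt_translate`, and prints the command instead of running it when `DRY_RUN=1`.

```bash
# translate_file SRC OUT [MODEL]
# Hypothetical helper around the onmt_translate call shown above.
translate_file() {
    src="$1"
    out="$2"
    model="${3:-NOS-MT-gl-es.pt}"   # default model file, as in the command above

    # Fail early with a clear message when the input file is missing.
    if [ ! -f "$src" ]; then
        echo "input file not found: $src" >&2
        return 1
    fi

    # With DRY_RUN=1 the command is only printed (assumed convention).
    runner=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        runner="echo"
    fi

    $runner onmt_translate -src "$src" -model "$model" -output "$out" -replace_unk -gpu 0
}
```

For instance, `DRY_RUN=1 translate_file input_text ./output_file.txt` prints the command that would be run, which is a cheap way to check paths before starting a GPU job.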

**Training**

For training we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are translations produced directly by human translators. The latter are English-Portuguese translations that we converted into English-Galician by means of Portuguese-Galician translation with Opentrad/Apertium, with transliteration for out-of-vocabulary words.

**Training process**

+ The datasets were tokenized with the [Linguakit](https://github.com/citiususc/Linguakit) tokenizer
+ The vocabulary for the models was generated with the OpenNMT script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py)
+ Using the .yaml file in this repository, you can replicate the training process as follows:

```bash
onmt_build_vocab -config bpe-en-gl_emb.yaml -n_sample 100000
onmt_train -config bpe-en-gl_emb.yaml
```
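
The two training commands above depend on each other, so it can help to chain them. The following is a minimal sketch under stated assumptions: `replicate_training` and `DRY_RUN` are hypothetical names, and the only OpenNMT-py options used are the ones already shown above. Training starts only if the vocabulary step succeeds.

```bash
# replicate_training CONFIG
# Hypothetical helper that runs the two steps above in order.
replicate_training() {
    config="$1"

    # Stop early if the .yaml configuration is not where we expect it.
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 1
    fi

    # With DRY_RUN=1 the commands are only printed (assumed convention).
    runner=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        runner="echo"
    fi

    # Build the vocabulary first; train only if that step succeeds.
    $runner onmt_build_vocab -config "$config" -n_sample 100000 &&
        $runner onmt_train -config "$config"
}
```

Call it with the path to the .yaml file from this repository; setting `DRY_RUN=1` first prints both commands without running them.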

**Hyper-parameters**

The parameters used to develop the model can be consulted directly in the same .yaml file, bpe-en-gl_emb.yaml.

**Evaluation**

The models are evaluated with BLEU on a mixture of internally developed test sets (gold1, gold2, test-suite) and other datasets available in Galician (Flores).

| GOLD 1 | GOLD 2 | FLORES | TEST-SUITE |
| ------------- |:-------------:| -------:|----------:|
| 82.6 | 49.9 | 23.8 | 77.2 |


**Licensing information**

MIT License

Copyright (c) 2023 Proxecto Nós

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

**Funding**

This research was funded by the project "Nós: Galician in the society and economy of artificial intelligence", an agreement between the Xunta de Galicia and the University of Santiago de Compostela; by grant ED431G2019/04 from the Galician Ministry of Education, University and Professional Training and the European Regional Development Fund (ERDF/FEDER program); and by Groups of Reference grant ED431C 2020/21.

**Citation Information**

Gamallo, Pablo; Bardanca, Daniel; Pichel, José Ramom; García, Marcos; Rodríguez-Rey, Sandra; de-Dios-Flores, Iria. 2023. NOS-MT-OpenNMT-gl-es. URL: https://huggingface.co/proxectonos/NOS-MT-OpenNMT-gl-es