imdbo committed on
Commit f203cdd
1 Parent(s): 8bc12a0

Update README_English.md

Files changed (1)
  1. README_English.md +8 -6
README_English.md CHANGED
@@ -11,7 +11,7 @@ metrics:
 
 **Model description**
 
- Model developed with OpenNMT for the Spanish-Galician pair using a transformer architecture.
+ Model developed with OpenNMT for the Spanish-Galician pair using the transformer architecture.
 
 **How to translate**
 
@@ -23,18 +23,20 @@ Model developed with OpenNMT for the Spanish-Galician pair using a transformer architecture.
 ```bash
 onmt_translate -src input_text -model NOS-MT-es-gl -output ./output_file.txt -replace_unk -phrase_table phrase_table-es-gl.txt -gpu 0
 ```
- + The result of the translation will be in the PATH indicated by the -output flag.
+ + The resulting translation will be in the PATH indicated by the -output flag.
 
 **Training**
 
- In the training we have used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations directly produced by human translators. The latter are corpora of Spanish-Portuguese translations, which we have converted into Spanish-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
+ To train this model, we have used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora).
+ 
+ Authentic corpora are corpora produced by human translators. Synthetic corpora are Spanish-Portuguese translations, which have been converted to Spanish-Galician by means of Portuguese-Galician translation with Opentrad/Apertium and transliteration for out-of-vocabulary words.
 
 
 **Training process**
 
- + Tokenisation of the datasets made with LinguaKit tokeniser https://github.com/citiususc/Linguakit
- + Vocabulary for the models was created by the script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) of OpenNMT
- + Using the .yaml in this repository you can replicate the training process as follows
+ + Tokenisation was performed with a modified version of the [linguakit](https://github.com/citiususc/Linguakit) tokeniser (tokenizer.pl) that does not append a new line after each token.
+ + All BPE models were generated with the script [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py)
+ + Using the .yaml in this repository it is possible to replicate the original training process. Before training the model, please verify that the path to each target (tgt) and source (src) file is correct. Once this is done, proceed as follows:
 
 ```bash
 onmt_build_vocab -config bpe-es-gl_emb.yaml -n_sample 100000
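
# (Sketch, not from the commit above.) The hunk is cut inside this
# code block, so for orientation here is a hedged outline of the
# remaining replication steps, assuming the standard OpenNMT-py
# workflow. Only onmt_build_vocab, onmt_train, onmt_translate,
# bpe-es-gl_emb.yaml, learn_bpe.py and the phrase table are named in
# the README; the corpus file names and the 32000 merge count are
# illustrative assumptions. learn_bpe.py (subword-nmt interface)
# would normally be run before the onmt_build_vocab call above:
#   python learn_bpe.py -i corpus.tok.es -o bpe.es.codes -s 32000
#   python learn_bpe.py -i corpus.tok.gl -o bpe.gl.codes -s 32000

# Train using the same yaml config as the vocabulary step:
onmt_train -config bpe-es-gl_emb.yaml

# Translate with the trained model (command as documented above):
onmt_translate -src input_text -model NOS-MT-es-gl -output ./output_file.txt \
  -replace_unk -phrase_table phrase_table-es-gl.txt -gpu 0
```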