n8rob committed
Commit 79a6da4
Parent: c7765e0

Update README.md

Files changed (1)
  1. README.md +77 -5
README.md CHANGED
@@ -1,21 +1,93 @@
  ---
  license: mit
  ---

- This is a many-to-many model for Creole-English, English-Creole and Creole-Creole MT, trained from scratch on public data.

  ```
  from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
  from transformers import AlbertTokenizer, AutoTokenizer

- tokenizer = AutoTokenizer.from_pretrained("n8rob/kreyol-mt-scratch-pubtrain", do_lower_case=False, use_fast=False, keep_accents=True)

  # The tokenizer we use is based on the AlbertTokenizer class, which is essentially sentencepiece. We train this sentencepiece model from scratch.
- # Or use tokenizer = AlbertTokenizer.from_pretrained("n8rob/kreyol-mt-scratch-pubtrain", do_lower_case=False, use_fast=False, keep_accents=True)

- model = AutoModelForSeq2SeqLM.from_pretrained("n8rob/kreyol-mt-scratch-pubtrain")

- # Or use model = MBartForConditionalGeneration.from_pretrained("n8rob/kreyol-mt-scratch-pubtrain")

  # Some initial mapping
  bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
  ---
  license: mit
+ language:
+ - acf
+ - aoa
+ - bah
+ - bzj
+ - bzk
+ - cri
+ - crs
+ - dcr
+ - djk
+ - fab
+ - fng
+ - fpe
+ - gcf
+ - gcr
+ - gpe
+ - gul
+ - gyn
+ - hat
+ - icr
+ - jam
+ - kea
+ - kri
+ - ktu
+ - lou
+ - mfe
+ - mue
+ - pap
+ - pcm
+ - pov
+ - pre
+ - rcf
+ - sag
+ - srm
+ - srn
+ - svc
+ - tpi
+ - trf
+ - wes
+ - ara
+ - aze
+ - ceb
+ - deu
+ - eng
+ - fra
+ - nep
+ - por
+ - spa
+ - zho
+ task_categories:
+ - translation
  ---

+ # Kreyòl-MT
+
+ Welcome to the repository for our **from-scratch**, **public-data** model.
+
+ Please see our paper: 📄 ["Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"](https://arxiv.org/abs/2405.05376)
+
+ And our GitHub repository: 💻 [Kreyòl-MT](https://github.com/JHU-CLSP/Kreyol-MT/tree/main)
+
+ And cite our work:
+
+ ```
+ @article{robinson2024krey,
+   title={Krey{\`o}l-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages},
+   author={Robinson, Nathaniel R and Dabre, Raj and Shurtz, Ammon and Dent, Rasul and Onesi, Onenamiyi and Monroc, Claire Bizon and Grobol, Lo{\"\i}c and Muhammad, Hasan and Garg, Ashi and Etori, Naome A and others},
+   journal={arXiv preprint arXiv:2405.05376},
+   year={2024}
+ }
+ ```
+
+ ## Model hosted here
+
+ This is a many-to-many model for MT into and out of Creole languages, trained from scratch on public data.

  ```
  from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
  from transformers import AlbertTokenizer, AutoTokenizer

+ tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain", do_lower_case=False, use_fast=False, keep_accents=True)

  # The tokenizer we use is based on the AlbertTokenizer class, which is essentially sentencepiece. We train this sentencepiece model from scratch.
+ # Or use tokenizer = AlbertTokenizer.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain", do_lower_case=False, use_fast=False, keep_accents=True)

+ model = AutoModelForSeq2SeqLM.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain")

+ # Or use model = MBartForConditionalGeneration.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain")

  # Some initial mapping
  bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
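# Sketch of the next step (an editor's assumption, not from the original
# README): mBART-style many-to-many recipes like this one typically mark
# the source sentence with "</s>" plus a source-language tag and prime the
# decoder with a target-language tag. The "<2xxx>" tag format below is
# borrowed from the IndicBART-style recipe this snippet resembles and may
# differ for this model -- check the Kreyòl-MT GitHub README.
def tag_example(src_text, src_lang, tgt_lang):
    src = f"{src_text} </s> <2{src_lang}>"  # source text, EOS, source tag
    tgt_prompt = f"<2{tgt_lang}>"           # decoder is primed with the target tag
    return src, tgt_prompt

src, prompt = tag_example("Hello, how are you?", "eng", "hat")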