---
license: mit
language:
- acf
- aoa
- bah
- bzj
- bzk
- cri
- crs
- dcr
- djk
- fab
- fng
- fpe
- gcf
- gcr
- gpe
- gul
- gyn
- hat
- icr
- jam
- kea
- kri
- ktu
- lou
- mfe
- mue
- pap
- pcm
- pov
- pre
- rcf
- sag
- srm
- srn
- svc
- tpi
- trf
- wes
- ara
- aze
- ceb
- deu
- eng
- fra
- nep
- por
- spa
- zho
task_categories:
- translation
---

# Kreyòl-MT

Welcome to the repository for our **from-scratch**, **public-data** model.

Please see our paper: 📄 ["Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"](https://arxiv.org/abs/2405.05376)

And our GitHub repository: 💻 [Kreyòl-MT](https://github.com/JHU-CLSP/Kreyol-MT/tree/main)

And cite our work:

```
@article{robinson2024krey,
  title={Krey{\`o}l-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages},
  author={Robinson, Nathaniel R and Dabre, Raj and Shurtz, Ammon and Dent, Rasul and Onesi, Onenamiyi and Monroc, Claire Bizon and Grobol, Lo{\"\i}c and Muhammad, Hasan and Garg, Ashi and Etori, Naome A and others},
  journal={arXiv preprint arXiv:2405.05376},
  year={2024}
}
```

## Model hosted here

This is a many-to-many model for MT into and out of Creole languages, trained from scratch on public data.

```
from transformers import MBartForConditionalGeneration, AutoModelForSeq2SeqLM
from transformers import AlbertTokenizer, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain", do_lower_case=False, use_fast=False, keep_accents=True)

# The tokenizer we use is based on the AlbertTokenizer class, which is essentially sentencepiece. We train this sentencepiece model from scratch.
# Or use tokenizer = AlbertTokenizer.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain", do_lower_case=False, use_fast=False, keep_accents=True)

model = AutoModelForSeq2SeqLM.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain")

# Or use model = MBartForConditionalGeneration.from_pretrained("jhu-clsp/kreyol-mt-scratch-pubtrain")

# Some initial mapping
bos_id = tokenizer._convert_token_to_id_with_added_voc("<s>")
```
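
The snippet above stops at the BOS-token lookup. The following is a minimal, hypothetical sketch of how decoding might continue. It assumes the tokenizer defines mBART-style special tokens (`</s>`, `<pad>`) and language tags of the form `<2eng>` / `<2hat>`, with the source sentence tagged in the input and the target tag used as the decoder start token; none of these names is confirmed above, so check the tokenizer's vocabulary (and the GitHub repository linked earlier) before relying on them.

```
# Hypothetical continuation: </s>, <pad>, <2eng>, and <2hat> are assumed
# token names, not confirmed by the snippet above.
eos_id = tokenizer._convert_token_to_id_with_added_voc("</s>")
pad_id = tokenizer._convert_token_to_id_with_added_voc("<pad>")

# English source sentence followed by the end-of-sequence token and its language tag.
inp = tokenizer("I am a boy </s> <2eng>", add_special_tokens=False, return_tensors="pt", padding=True).input_ids

# Starting decoding from the target-language tag (<2hat>, Haitian Creole here)
# would select the output language.
out = model.generate(
    inp,
    use_cache=True,
    num_beams=4,
    max_length=60,
    early_stopping=True,
    pad_token_id=pad_id,
    bos_token_id=bos_id,
    eos_token_id=eos_id,
    decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2hat>"),
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If generation echoes the source or produces noise, the tag convention likely differs from this sketch; the exact input format is documented in the GitHub repository linked above.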