cifope committed on
Commit
3c89035
1 Parent(s): 248fd4e

Update README.md

Files changed (1)
  1. README.md +110 -0
README.md CHANGED
@@ -1,3 +1,113 @@
  ---
  license: mit
+ language:
+ - wo
+ - fr
+ metrics:
+ - bleu
+ pipeline_tag: translation
+ tags:
+ - text-generation-inference
  ---
+
+ # Model Documentation: Wolof to French Translation with NLLB-200
+
+ ## Model Overview
+
+ This document describes a machine translation model fine-tuned from Meta's NLLB-200 for translation between Wolof and French. The model, hosted at `cifope/nllb-200-wo-fr-distilled-600M`, is based on the distilled 600M-parameter NLLB-200 checkpoint.
+
+ ## Dependencies
+
+ The model requires the Hugging Face `transformers` library. The snippets in this document also need PyTorch (for `return_tensors='pt'`) and `sentencepiece` (used by `NllbTokenizer`). Install them with:
+
+ ```bash
+ pip install transformers sentencepiece torch
+ ```
+
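Before loading the model, it can help to confirm that the environment has what the snippets in this README import. The following is a minimal sketch using only the standard library. The package list is an assumption drawn from the code in this document (`transformers` for the model classes, `sentencepiece` for `NllbTokenizer`, `torch` for `return_tensors='pt'`), and the helper names `missing_packages` and `installed_version` are hypothetical.

```python
from importlib import metadata, util

# Assumed requirements, inferred from the snippets in this README.
REQUIRED = ("transformers", "sentencepiece", "torch")


def missing_packages(names=REQUIRED):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if util.find_spec(name) is None]


def installed_version(name):
    """Return the installed version of distribution `name`, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in REQUIRED:
        print(name, "->", installed_version(name) or "not installed")
    print("missing:", missing_packages())
```

Printing the `transformers` version is useful here because the tokenizer patch in the next section touches internal attributes that can differ between releases.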
+ ## Setup
+
+ Import the required classes from the `transformers` library:
+
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, NllbTokenizer
+ ```
+
+ Initialize the model and tokenizer. The fine-tuned weights come from `cifope/nllb-200-wo-fr-distilled-600M`, while the tokenizer is loaded from the base `facebook/nllb-200-distilled-600M` checkpoint:
+
+ ```python
+ model = AutoModelForSeq2SeqLM.from_pretrained('cifope/nllb-200-wo-fr-distilled-600M')
+ tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')
+ ```
+
+ ## Tokenizer Customization
+
+ The base tokenizer does not include the `wol_Wol` language code used by this model. The `fix_tokenizer` function registers it and should be called after the tokenizer is loaded. It relies on internal tokenizer attributes (such as `lang_code_to_id` and `_additional_special_tokens`), which may not be available in every `transformers` release:
+
+ ```python
+ def fix_tokenizer(tokenizer, new_lang='wol_Wol'):
+     # Vocabulary size, not counting new_lang if it was already added
+     old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
+     # Register the new language code under the last id
+     tokenizer.lang_code_to_id[new_lang] = old_len-1
+     tokenizer.id_to_lang_code[old_len-1] = new_lang
+     # Move "<mask>" to the position after all language codes
+     tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
+     tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
+     # Rebuild the reverse (id -> token) mapping
+     tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
+     if new_lang not in tokenizer._additional_special_tokens:
+         tokenizer._additional_special_tokens.append(new_lang)
+     # Clear the added-token maps so new_lang is not registered twice
+     tokenizer.added_tokens_encoder = {}
+     tokenizer.added_tokens_decoder = {}
+
+ fix_tokenizer(tokenizer)
+ ```
+
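To make the id bookkeeping in `fix_tokenizer` easier to follow, here is a minimal sketch that runs the same function on a toy object. `ToyTokenizer` is a hypothetical stand-in, not the real `NllbTokenizer`: it exposes only the attributes the function touches, and its sizes (10 SentencePiece pieces, two language codes) are made up for illustration. The vocabulary layout it assumes is pieces, then language codes, then `<mask>`.

```python
class ToyTokenizer:
    """Hypothetical stand-in with only the attributes fix_tokenizer uses."""

    def __init__(self, n_pieces=10, lang_codes=("eng_Latn", "fra_Latn")):
        self.sp_model = ["piece"] * n_pieces  # fake SentencePiece vocabulary
        self.fairseq_offset = 1
        first = n_pieces + self.fairseq_offset  # id of the first language code
        self.lang_code_to_id = {c: first + i for i, c in enumerate(lang_codes)}
        self.id_to_lang_code = {i: c for c, i in self.lang_code_to_id.items()}
        self.fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
        self.fairseq_tokens_to_ids.update(self.lang_code_to_id)
        self.fairseq_tokens_to_ids["<mask>"] = first + len(lang_codes)
        self.fairseq_ids_to_tokens = {
            v: k for k, v in self.fairseq_tokens_to_ids.items()
        }
        self._additional_special_tokens = list(lang_codes)
        self.added_tokens_encoder = {}
        self.added_tokens_decoder = {}

    def __len__(self):
        # pieces + offset + language codes + one slot for <mask>
        return len(self.sp_model) + self.fairseq_offset + len(self.lang_code_to_id) + 1


def fix_tokenizer(tokenizer, new_lang="wol_Wol"):
    # Same logic as the function in this README, repeated so the sketch runs alone.
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len - 1
    tokenizer.id_to_lang_code[old_len - 1] = new_lang
    tokenizer.fairseq_tokens_to_ids["<mask>"] = (
        len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset
    )
    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {
        v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()
    }
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}


tok = ToyTokenizer()
mask_before = tok.fairseq_tokens_to_ids["<mask>"]
fix_tokenizer(tok)
mask_after = tok.fairseq_tokens_to_ids["<mask>"]

print("wol_Wol id:", tok.lang_code_to_id["wol_Wol"])
print("<mask> id before/after:", mask_before, mask_after)
```

In this toy layout the new code takes over the id that `<mask>` held, and `<mask>` moves one position further, so it stays last. Whether the real tokenizer follows the same layout depends on the installed `transformers` version.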
+ ## Translation Functions
+
+ ### Translate from French to Wolof
+
+ The `translate` function translates text from French to Wolof by default. `text` can be a single string or a list of strings, and the result is a list of translations. Output length is capped at `a + b * n` new tokens, where `n` is the tokenized input length:
+
+ ```python
+ def translate(text, src_lang='fra_Latn', tgt_lang='wol_Wol', a=16, b=1.5, max_input_length=1024, **kwargs):
+     # Tell the tokenizer which language codes to use
+     tokenizer.src_lang = src_lang
+     tokenizer.tgt_lang = tgt_lang
+     inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
+     result = model.generate(
+         **inputs.to(model.device),
+         # Force the first generated token to be the target language code
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
+         # Cap the output length relative to the input length
+         max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
+         **kwargs
+     )
+     return tokenizer.batch_decode(result, skip_special_tokens=True)
+ ```
+
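The `max_new_tokens` argument above is computed as `int(a + b * n)`, where `n` is `inputs.input_ids.shape[1]`, the padded token length of the input batch. A small sketch of that arithmetic, with the defaults `a=16` and `b=1.5` taken from the function signature. The helper name `token_budget` is hypothetical and not part of the model's code.

```python
def token_budget(input_length, a=16, b=1.5):
    """Upper bound on generated tokens: a constant plus a multiple of the input length."""
    return int(a + b * input_length)


for n in (4, 32, 1024):
    print(f"{n} input tokens -> at most {token_budget(n)} new tokens")
```

The constant `a` leaves room for very short inputs, while `b` lets the output be longer than the input. Both can be overridden per call through the `a` and `b` parameters of `translate`.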
+ ### Translate from Wolof to French
+
+ The `reversed_translate` function works the same way but translates from Wolof to French by default:
+
+ ```python
+ def reversed_translate(text, src_lang='wol_Wol', tgt_lang='fra_Latn', a=16, b=1.5, max_input_length=1024, **kwargs):
+     # Tell the tokenizer which language codes to use
+     tokenizer.src_lang = src_lang
+     tokenizer.tgt_lang = tgt_lang
+     inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
+     result = model.generate(
+         **inputs.to(model.device),
+         # Force the first generated token to be the target language code
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
+         # Cap the output length relative to the input length
+         max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
+         **kwargs
+     )
+     return tokenizer.batch_decode(result, skip_special_tokens=True)
+ ```
+
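The two functions differ only in their default `src_lang` and `tgt_lang`. One possible way to avoid the duplicated body is `functools.partial`, sketched below. The `translate` defined here is a stand-in that only reports the direction, so the example runs without downloading the model. The name `wolof_to_french` is hypothetical.

```python
from functools import partial


def translate(text, src_lang="fra_Latn", tgt_lang="wol_Wol", **kwargs):
    # Stand-in for the README's translate(): returns a tagged copy of the
    # input instead of calling the model.
    return [f"[{src_lang}->{tgt_lang}] {text}"]


# Same function with the default language codes swapped.
wolof_to_french = partial(translate, src_lang="wol_Wol", tgt_lang="fra_Latn")

print(translate("L'argent peut être échangé"))
print(wolof_to_french("alkaati yi tàmbali nañu xàll"))
```

With the real `translate`, the same `partial` call would give a Wolof to French function that still accepts `a`, `b`, and any extra `generate` keyword arguments.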
+ ## Usage
+
+ To translate text, call `translate` or `reversed_translate` with the input text and any optional parameters. Both functions return a list of strings, one per input sentence. For example:
+
+ ```python
+ french_text = "L'argent peut être échangé à la seule banque des îles située à Stanley"
+ wolof_translation = translate(french_text)
+ print(wolof_translation)
+
+ wolof_text = "alkaati yi tàmbali nañu xàll léegi kilifa gi ñów"
+ french_translation = reversed_translate(wolof_text)
+ print(french_translation)
+ ```
+
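Because the tokenizer is called with `padding=True` and the output goes through `batch_decode`, the translation functions also accept a list of sentences. For larger inputs, one option is to translate in fixed-size batches. The sketch below shows that pattern. The helpers `batched` and `translate_all`, the default batch size of 8, and the `fake_translate` stand-in are all assumptions for illustration. In practice `translate` or `reversed_translate` would be passed as `translate_fn`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(sentences, translate_fn, batch_size=8):
    """Apply `translate_fn` batch by batch and return one flat list of outputs."""
    outputs = []
    for batch in batched(sentences, batch_size):
        outputs.extend(translate_fn(batch))
    return outputs


def fake_translate(batch):
    # Stand-in for translate(): upper-cases each sentence instead of calling the model.
    return [sentence.upper() for sentence in batch]


print(translate_all(["un", "deux", "trois"], fake_translate, batch_size=2))
```

Smaller batches reduce memory use at the cost of more `generate` calls. A suitable batch size depends on the hardware and sentence lengths.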