losca committed on
Commit 6cd5273 · 1 Parent(s): 2850c1a

Update README.md

Added usage example

Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -4,4 +4,28 @@ tags:
  ---
  This model translates from English to Khmer.
  It is the pure fine-tuned version of MarianMT model en-zh.
- This is the result after 30 epochs of pure fine-tuning of khmer language.
+ This is the result after 30 epochs of pure fine-tuning of khmer language.
+
+ ### Example
+ ```
+ %%capture
+ !pip install transformers transformers[sentencepiece]
+
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+ # Download the pretrained English-Khmer model available on the hub
+ model = AutoModelForSeq2SeqLM.from_pretrained("CLAck/en-km")
+
+ tokenizer = AutoTokenizer.from_pretrained("CLAck/en-km")
+ # Download a tokenizer that can tokenize English, since the model's own tokenizer can no longer do so
+ # We use the tokenizer from the initial model
+ # It is used to tokenize the input sentence
+ tokenizer_en = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-zh')
+ # These special tokens are needed to reproduce the original tokenizer
+ tokenizer_en.add_tokens(["<2zh>", "<2khm>"], special_tokens=True)
+
+ sentence = "The cat is on the table"
+ # This token identifies the target language
+ input_sentence = "<2khm> " + sentence
+ translated = model.generate(**tokenizer_en(input_sentence, return_tensors="pt", padding=True))
+ output_sentence = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
+ ```
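
The added example prefixes a single sentence with the `<2khm>` target-language token. For several sentences, the same step might be wrapped in a small helper. This is a minimal sketch, not part of the commit: `add_target_token` and `TARGET_TOKEN` are hypothetical names introduced here for illustration, and the token string is taken from the README example.

```python
from typing import List

# Target-language token as used in the README example (assumed unchanged).
TARGET_TOKEN = "<2khm>"


def add_target_token(sentences: List[str], token: str = TARGET_TOKEN) -> List[str]:
    """Prefix each sentence with the target-language token and a single space."""
    return [f"{token} {sentence.strip()}" for sentence in sentences]


# Hypothetical helper applied to a small batch of English sentences.
batch = add_target_token(["The cat is on the table", "Good morning"])
print(batch)
```

The resulting list could then be passed to `tokenizer_en(batch, return_tensors="pt", padding=True)` in place of the single `input_sentence`, since the tokenizer call in the example already enables padding.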