Commit 8e578c9 · Ayaka committed · 1 parent: d355ad0

Update README.md

---
language:
- yue
tags:
- bart
- cantonese
- fill-mask
license: other
library_name: bart-base-jax
co2_eq_emissions:
  emissions: 6.29
  source: estimated by using ML CO2 Calculator
  training_type: second-stage pre-training
  hardware_used: Google Cloud TPU v4-16
---

# bart-base-cantonese

This is the Cantonese version of BART base. It was obtained by second-stage pre-training on the [LIHKG dataset](https://github.com/ayaka14732/lihkg-scraper), starting from the [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) model.

**Note**: This model is not the final version and training is still in progress. In addition, to avoid any copyright issues, please do not use this model for any purpose.

## GitHub Links

- Tokeniser: [ayaka14732/bert-tokenizer-cantonese](https://github.com/ayaka14732/bert-tokenizer-cantonese)
- Model: [ayaka14732/bart-base-jax](https://github.com/ayaka14732/bart-base-jax)

## Usage

```python
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

# Use BertTokenizer, not BartTokenizer, to match the model vocabulary
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)

output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
# The pipeline decodes with spaces between tokens; strip them for CJK text
print(output[0]['generated_text'].replace(' ', ''))
# output: 聽日就要返香港,我激動到瞓唔着
```

**Note**: Please use `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
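Because the model uses `BertTokenizer`, decoded output contains spaces between tokens; for pure-CJK Cantonese text these spaces carry no information and can simply be removed, which is what the `.replace(' ', '')` call above does. A minimal sketch of that post-processing step as a standalone helper (the function name is illustrative, not part of the model's API):

```python
def strip_token_spaces(generated_text: str) -> str:
    """Remove the inter-token spaces that BertTokenizer-style decoding
    inserts between characters. Intended for pure-CJK output; it would
    also remove meaningful spaces in mixed Latin-script text."""
    return generated_text.replace(' ', '')

# Example: a decoded string with per-token spaces
decoded = '聽 日 就 要 返 香 港'
print(strip_token_spaces(decoded))  # 聽日就要返香港
```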