Update README.md
README.md CHANGED
@@ -1,3 +1,40 @@
 ---
+language:
+- yue
+tags:
+- bart
+- cantonese
+- fill-mask
 license: other
+library_name: bart-base-jax
+co2_eq_emissions:
+  emissions: 6.29
+  source: estimated by using ML CO2 Calculator
+  training_type: second-stage pre-training
+  hardware_used: Google Cloud TPU v4-16
 ---
+
+# bart-base-cantonese
+
+This is the Cantonese model of BART base. It was obtained by second-stage pre-training on the [LIHKG dataset](https://github.com/ayaka14732/lihkg-scraper), starting from the [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) model.
+
+**Note**: This model is not the final version, and training is still in progress. In addition, to avoid any copyright issues, please do not use this model for any purpose.
22 |
+
|
23 |
+
## GitHub Links
|
24 |
+
|
25 |
+
- Tokeniser: [ayaka14732/bert-tokenizer-cantonese](https://github.com/ayaka14732/bert-tokenizer-cantonese)
|
26 |
+
- Model: [ayaka14732/bart-base-jax](https://github.com/ayaka14732/bart-base-jax)
|
27 |
+
|
28 |
+
## Usage
|
29 |
+
|
30 |
+
+```python
+from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
+
+# The checkpoint ships a BERT-style Chinese vocabulary, so it must be loaded
+# with BertTokenizer rather than the original BartTokenizer.
+tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
+model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
+text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
+
+# Input: 'I am going back to Hong Kong tomorrow; I am so excited I cannot [MASK]'
+output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
+print(output[0]['generated_text'].replace(' ', ''))
+# output: 聽日就要返香港,我激動到瞓唔着
+# ('... so excited I cannot sleep': the model fills [MASK] with 瞓, 'sleep')
+```
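+
+If you prefer not to use the pipeline wrapper, the same mask filling can be done by calling `generate` directly. This is a minimal sketch of the equivalent call, assuming the same checkpoint and greedy decoding (matching `do_sample=False` above):
+
+```python
+from transformers import BertTokenizer, BartForConditionalGeneration
+
+tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
+model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
+
+# Tokenise the masked sentence, decode greedily, and strip the spaces that
+# the BERT-style tokenizer inserts between characters.
+inputs = tokenizer('聽日就要返香港,我激動到[MASK]唔着', return_tensors='pt')
+ids = model.generate(inputs['input_ids'], max_length=50, do_sample=False)
+print(tokenizer.decode(ids[0], skip_special_tokens=True).replace(' ', ''))
+```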
39 |
+
|
40 |
+
**Note**: Please use the `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
|