---
language:
- yue
tags:
- bart
- cantonese
- fill-mask
license: other
library_name: bart-base-jax
co2_eq_emissions:
  emissions: 6.29
  source: estimated by using ML CO2 Calculator
  training_type: second-stage pre-training
  hardware_used: Google Cloud TPU v4-16
---

# bart-base-cantonese

This is the Cantonese model of BART base. It is obtained by second-stage pre-training on the [LIHKG dataset](https://github.com/ayaka14732/lihkg-scraper), starting from the [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) model.

**Note**: This model is not the final version and training is still in progress. In addition, to avoid any copyright issues, please do not use this model for any purpose.

## GitHub Links

- Tokeniser: [ayaka14732/bert-tokenizer-cantonese](https://github.com/ayaka14732/bert-tokenizer-cantonese)
- Model: [ayaka14732/bart-base-jax](https://github.com/ayaka14732/bart-base-jax)

## Usage

```python
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))
# output: 聽日就要返香港,我激動到瞓唔着
```

**Note**: Please use `BertTokenizer` with this model's vocabulary. DO NOT use the original `BartTokenizer`.
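The `replace(' ', '')` call in the usage example is needed because the pipeline's decoded output places spaces between character-level tokens. A minimal sketch of that post-processing step, using the output string shown above (no model download required):

```python
# The pipeline returns generated text with spaces between tokens, e.g.:
generated = '聽 日 就 要 返 香 港 , 我 激 動 到 瞓 唔 着'

# Removing the spaces recovers the natural Cantonese sentence:
sentence = generated.replace(' ', '')
print(sentence)  # 聽日就要返香港,我激動到瞓唔着
```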