---
language:
- yue
tags:
- cantonese
license: other
library_name: transformers
co2_eq_emissions:
  emissions: 6.29
  source: estimated by using ML CO2 Calculator
  training_type: second-stage pre-training
  hardware_used: Google Cloud TPU v4-16
pipeline_tag: fill-mask
---
# bart-base-cantonese
This is the Cantonese version of the BART base model, obtained by a second-stage pre-training on the [LIHKG dataset](https://github.com/ayaka14732/lihkg-scraper), starting from the [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) model.
This project is supported by Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
**Note**: To avoid any copyright issues, please do not use this model for any purpose.
## GitHub Links
- Dataset: [ayaka14732/lihkg-scraper](https://github.com/ayaka14732/lihkg-scraper)
- Tokeniser: [ayaka14732/bert-tokenizer-cantonese](https://github.com/ayaka14732/bert-tokenizer-cantonese)
- Base model: [ayaka14732/bart-base-jax](https://github.com/ayaka14732/bart-base-jax)
- Pre-training: [ayaka14732/bart-base-cantonese](https://github.com/ayaka14732/bart-base-cantonese)
## Usage
```python
from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

# Load the tokeniser and model; the checkpoint ships a BERT-style vocabulary,
# so BertTokenizer (not BartTokenizer) must be used.
tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')
model = BartForConditionalGeneration.from_pretrained('Ayaka/bart-base-cantonese')
text2text_generator = Text2TextGenerationPipeline(model, tokenizer)

# Fill in the [MASK] token with greedy decoding (do_sample=False).
output = text2text_generator('聽日就要返香港,我激動到[MASK]唔着', max_length=50, do_sample=False)
print(output[0]['generated_text'].replace(' ', ''))
# output: 聽日就要返香港,我激動到瞓唔着
# ("I'm going back to Hong Kong tomorrow; I'm so excited I can't sleep")
```
**Note**: Please use `BertTokenizer` to load the model vocabulary, as illustrated below. DO NOT use the original `BartTokenizer`.
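As a quick check (an illustrative sketch, not part of the original card), you can inspect how `BertTokenizer` segments a Cantonese sentence under the model's vocabulary; the exact token boundaries depend on the vocabulary shipped with the checkpoint:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('Ayaka/bart-base-cantonese')

# Inspect how a Cantonese sentence is segmented under the model's vocabulary.
tokens = tokenizer.tokenize('聽日就要返香港')
print(tokens)
# The exact segmentation depends on the shipped vocabulary, e.g. mostly
# single characters with some multi-character Cantonese entries.
```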
## Training Details
- Optimiser: SGD with learning rate 0.03, plus adaptive gradient clipping at 0.1 (see the sketch after this list)
- Dataset: 172,937,863 sentences, each padded or truncated to 64 tokens
- Batch size: 640
- Number of epochs: 7 full epochs plus 61,440 additional steps
- Time: 44.0 hours on a Google Cloud TPU v4-16
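Since the pre-training code is JAX-based, the optimiser setting above can be expressed with `optax`. The following is a minimal sketch of that configuration under those assumptions, not the authors' actual training code; `optax.adaptive_grad_clip` implements adaptive gradient clipping:
```python
import optax

# Sketch only: SGD at learning rate 0.03 chained with adaptive gradient
# clipping at 0.1, matching the hyperparameters listed above.
optimizer = optax.chain(
    optax.adaptive_grad_clip(0.1),  # clip updates relative to parameter norms
    optax.sgd(learning_rate=0.03),
)
```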
WandB link: [`1j7zs802`](https://wandb.ai/ayaka/bart-base-cantonese/runs/1j7zs802)