Update README.md
README.md CHANGED
@@ -18,12 +18,15 @@ co2_eq_emissions:
 
 This is the Cantonese model of BART base. It is obtained by a second-stage pre-training on the [LIHKG dataset](https://github.com/ayaka14732/lihkg-scraper) based on the [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) model.
 
-
+This project is supported by Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
+
+**Note**: To avoid any copyright issues, please do not use this model for any purpose.
 
 ## GitHub Links
 
+- Dataset: [ayaka14732/lihkg-scraper](https://github.com/ayaka14732/lihkg-scraper)
 - Tokeniser: [ayaka14732/bert-tokenizer-cantonese](https://github.com/ayaka14732/bert-tokenizer-cantonese)
-- Model: [ayaka14732/bart-base-jax](https://github.com/ayaka14732/bart-base-jax)
+- Model: [ayaka14732/bart-base-jax#cantonese-pretrain](https://github.com/ayaka14732/bart-base-jax/tree/cantonese-pretrain)
 
 ## Usage
 
@@ -38,3 +41,13 @@ print(output[0]['generated_text'].replace(' ', ''))
 ```
 
 **Note**: Please use the `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
+
+## Training Details
+
+- Optimiser: SGD 0.03 + Adaptive Gradient Clipping 0.1
+- Dataset: 172937863 sentences, pad or truncate to 64 tokens
+- Batch size: 640
+- Number of epochs: 7 epochs + 61440 steps
+- Time: 44.0 hours on Google Cloud TPU v4-16
+
+WandB link: [`1j7zs802`](https://wandb.ai/ayaka/bart-base-cantonese/runs/1j7zs802)
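The usage example itself falls outside this hunk; only its closing fence and final `print` line are visible. As a point of reference, here is a minimal sketch of how such usage might look with the `BertTokenizer` note applied. The model ID `Ayaka/bart-base-cantonese`, the pipeline choice, and the `[MASK]` infilling input are assumptions for illustration, not taken from the diff.

```python
# Hedged usage sketch -- not the README's own (truncated) example.
# Assumptions: the checkpoint is published as "Ayaka/bart-base-cantonese" and supports
# [MASK] infilling in the same way as its parent model, fnlp/bart-base-chinese.
from transformers import BartForConditionalGeneration, BertTokenizer, Text2TextGenerationPipeline

model_id = "Ayaka/bart-base-cantonese"  # assumed model ID
tokenizer = BertTokenizer.from_pretrained(model_id)  # per the note: BertTokenizer, not BartTokenizer
model = BartForConditionalGeneration.from_pretrained(model_id)
pipe = Text2TextGenerationPipeline(model=model, tokenizer=tokenizer)

output = pipe("今日天氣好[MASK]", max_length=20, do_sample=False)  # illustrative Cantonese input
# The tokeniser emits space-separated characters, hence the .replace(' ', '') in the hunk header above.
print(output[0]["generated_text"].replace(" ", ""))
```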
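The optimiser line in the training details is terse. Below is a small sketch of how "SGD 0.03 + Adaptive Gradient Clipping 0.1" could be composed with optax, a common optimiser library for JAX training loops; the actual training code lives in the `cantonese-pretrain` branch of ayaka14732/bart-base-jax and may be implemented differently.

```python
# Hedged sketch of an optimiser matching "SGD 0.03 + Adaptive Gradient Clipping 0.1",
# written with optax; this is an illustration, not the project's training code.
import jax.numpy as jnp
import optax

# Toy parameter pytree standing in for the BART-base parameters.
params = {"dense": {"kernel": jnp.ones((768, 768)), "bias": jnp.zeros((768,))}}

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.1),   # clip each unit's gradient to at most 0.1 x its parameter norm
    optax.sgd(learning_rate=0.03),   # plain SGD at learning rate 0.03
)
opt_state = optimizer.init(params)

# One illustrative update step with dummy gradients.
grads = {"dense": {"kernel": jnp.ones((768, 768)), "bias": jnp.ones((768,))}}
updates, opt_state = optimizer.update(grads, opt_state, params)  # AGC needs params to scale the clip
params = optax.apply_updates(params, updates)
```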