Update README.md
README.md CHANGED
@@ -18,12 +18,15 @@ co2_eq_emissions:
 
 This is the Cantonese model of BART base. It is obtained by a second-stage pre-training on the [LIHKG dataset](https://github.com/ayaka14732/lihkg-scraper) based on the [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) model.
 
-
+This project is supported by Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
+
+**Note**: To avoid any copyright issues, please do not use this model for any purpose.
 
 ## GitHub Links
 
+- Dataset: [ayaka14732/lihkg-scraper](https://github.com/ayaka14732/lihkg-scraper)
 - Tokeniser: [ayaka14732/bert-tokenizer-cantonese](https://github.com/ayaka14732/bert-tokenizer-cantonese)
-- Model: [ayaka14732/bart-base-jax](https://github.com/ayaka14732/bart-base-jax)
+- Model: [ayaka14732/bart-base-jax#cantonese-pretrain](https://github.com/ayaka14732/bart-base-jax/tree/cantonese-pretrain)
 
 ## Usage
 
@@ -38,3 +41,13 @@ print(output[0]['generated_text'].replace(' ', ''))
 ```
 
 **Note**: Please use the `BertTokenizer` for the model vocabulary. DO NOT use the original `BartTokenizer`.
+
+## Training Details
+
+- Optimiser: SGD 0.03 + Adaptive Gradient Clipping 0.1
+- Dataset: 172937863 sentences, pad or truncate to 64 tokens
+- Batch size: 640
+- Number of epochs: 7 epochs + 61440 steps
+- Time: 44.0 hours on Google Cloud TPU v4-16
+
+WandB link: [`1j7zs802`](https://wandb.ai/ayaka/bart-base-cantonese/runs/1j7zs802)
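The usage example itself falls outside this hunk; only its closing fence and final `print` line are visible. As a point of reference, here is a minimal sketch of how such usage might look with the `BertTokenizer` note applied. The model ID `Ayaka/bart-base-cantonese`, the pipeline choice, and the `[MASK]` infilling input are assumptions for illustration, not taken from the diff.

```python
# Hedged usage sketch -- not the README's own (truncated) example.
# Assumptions: the checkpoint is published as "Ayaka/bart-base-cantonese" and supports
# [MASK] infilling in the same way as its parent model, fnlp/bart-base-chinese.
from transformers import BartForConditionalGeneration, BertTokenizer, Text2TextGenerationPipeline

model_id = "Ayaka/bart-base-cantonese"  # assumed model ID
tokenizer = BertTokenizer.from_pretrained(model_id)  # per the note: BertTokenizer, not BartTokenizer
model = BartForConditionalGeneration.from_pretrained(model_id)
pipe = Text2TextGenerationPipeline(model=model, tokenizer=tokenizer)

output = pipe("今日天氣好[MASK]", max_length=20, do_sample=False)  # illustrative Cantonese input
# The tokeniser emits space-separated characters, hence the .replace(' ', '') in the hunk header above.
print(output[0]["generated_text"].replace(" ", ""))
```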
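The optimiser line in the training details is terse. Below is a small sketch of how "SGD 0.03 + Adaptive Gradient Clipping 0.1" could be composed with optax, a common optimiser library for JAX training loops; the actual training code lives in the `cantonese-pretrain` branch of ayaka14732/bart-base-jax and may be implemented differently.

```python
# Hedged sketch of an optimiser matching "SGD 0.03 + Adaptive Gradient Clipping 0.1",
# written with optax; this is an illustration, not the project's training code.
import jax.numpy as jnp
import optax

# Toy parameter pytree standing in for the BART-base parameters.
params = {"dense": {"kernel": jnp.ones((768, 768)), "bias": jnp.zeros((768,))}}

optimizer = optax.chain(
    optax.adaptive_grad_clip(0.1),   # clip each unit's gradient to at most 0.1 x its parameter norm
    optax.sgd(learning_rate=0.03),   # plain SGD at learning rate 0.03
)
opt_state = optimizer.init(params)

# One illustrative update step with dummy gradients.
grads = {"dense": {"kernel": jnp.ones((768, 768)), "bias": jnp.ones((768,))}}
updates, opt_state = optimizer.update(grads, opt_state, params)  # AGC needs params to scale the clip
params = optax.apply_updates(params, updates)
```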