indiejoseph committed
Commit 886c759 · verified · 1 Parent(s): 67b9a88

Upload folder using huggingface_hub

README.md CHANGED
@@ -20,8 +20,42 @@ should probably proofread and complete it, then remove this comment. -->
 
 This model is a continuation of [indiejoseph/bert-base-cantonese](https://huggingface.co/indiejoseph/bert-base-cantonese), a BERT-based model pre-trained on a substantial corpus of Cantonese text. The dataset was sourced from a variety of platforms, including news articles, social media posts, and web pages. The text was segmented into sentences containing 11 to 460 tokens per line. To ensure data quality, Minhash LSH was employed to eliminate near-duplicate sentences, resulting in a final dataset comprising 161,338,273 tokens. Training was conducted using the `run_mlm.py` script from the `transformers` library.
 
+This continued pre-training aims to expand the model's knowledge with more up-to-date Hong Kong and Cantonese text, so we deliberately overfit the model slightly by using a higher learning rate and more epochs.
+
 [WandB](https://wandb.ai/indiejoseph/public/runs/wy2ja88z/workspace?nw=nwuserindiejoseph)
 
+## Usage
+
+```python
+from transformers import pipeline
+
+# Note: this is a local training checkpoint path; substitute the published model ID from the Hugging Face Hub.
+pipe = pipeline("fill-mask", model="/home/pj24001684/ku40000295/jc/projects/bert-pretrain/models/20241120")
+
+pipe("香港特首係李[MASK]超")
+
+# [{'score': 0.3057154417037964,
+#   'token': 2157,
+#   'token_str': '家',
+#   'sequence': '香 港 特 首 係 李 家 超'},
+#  {'score': 0.08251259475946426,
+#   'token': 6631,
+#   'token_str': '超',
+#   'sequence': '香 港 特 首 係 李 超 超'},
+#  ...
+
+pipe("我睇到由治及興帶嚟[MASK]好處")
+
+# [{'score': 0.9563464522361755,
+#   'token': 1646,
+#   'token_str': '嘅',
+#   'sequence': '我 睇 到 由 治 及 興 帶 嚟 嘅 好 處'},
+#  {'score': 0.00982475932687521,
+#   'token': 4638,
+#   'token_str': '的',
+#   'sequence': '我 睇 到 由 治 及 興 帶 嚟 的 好 處'},
+#  ...
+```
 
 ## Intended uses & limitations
 
@@ -32,7 +66,7 @@ This model is intended to be used for further fine-tuning on Cantonese downstrea
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- learning_rate: 5e-05
+- learning_rate: 0.0001
 - train_batch_size: 180
 - eval_batch_size: 8
 - seed: 42
@@ -41,8 +75,7 @@ The following hyperparameters were used during training:
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: cosine
 - lr_scheduler_warmup_ratio: 0.1
-- num_epochs: 5.0
-
+- num_epochs: 10.0
 
 ### Framework versions
 
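The README above states that near-duplicate sentences were removed with MinHash LSH before pre-training, but the deduplication code is not part of this commit. The following is a minimal sketch of such a filter using the `datasketch` library; the library choice, character 3-gram shingling, 128 permutations, and the 0.8 similarity threshold are assumptions for illustration, not values taken from the repository.

```python
from datasketch import MinHash, MinHashLSH

def char_shingles(text, n=3):
    # Character n-grams work reasonably for Cantonese, which is not space-delimited.
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def dedupe(sentences, threshold=0.8, num_perm=128):
    # Keep only sentences whose MinHash signature has no near-duplicate among those already kept.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, sent in enumerate(sentences):
        mh = MinHash(num_perm=num_perm)
        for shingle in char_shingles(sent):
            mh.update(shingle.encode("utf-8"))
        if lsh.query(mh):      # an approximate near-duplicate was already kept
            continue
        lsh.insert(str(idx), mh)
        kept.append(sent)
    return kept

print(dedupe(["香港特首係李家超", "香港特首係李家超。", "我睇到由治及興帶嚟嘅好處"]))
```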
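The hyperparameter diff above only lists the values passed to `run_mlm.py`. As a rough reconstruction, the new settings correspond approximately to the `transformers` `TrainingArguments` below; the output directory is a placeholder, and treating `train_batch_size: 180` as a per-device value is an assumption (the auto-generated README does not say whether it is per device or total).

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters listed in the README diff.
args = TrainingArguments(
    output_dir="bert-base-cantonese-mlm",  # placeholder, not from the commit
    learning_rate=1e-4,                    # was 5e-05 before this commit
    per_device_train_batch_size=180,       # assumed per-device
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=10.0,                 # was 5.0 before this commit
)
print(args.learning_rate, args.num_train_epochs)
```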
all_results.json CHANGED
@@ -1,9 +1,9 @@
 {
-    "epoch": 5.0,
-    "total_flos": 3.940359181120512e+17,
-    "train_loss": 1.6082196723956328,
-    "train_runtime": 3331.9519,
-    "train_samples": 299445,
-    "train_samples_per_second": 449.354,
-    "train_steps_per_second": 0.312
+    "epoch": 9.97074312463429,
+    "total_flos": 8.067940197159444e+17,
+    "train_loss": 1.4460852847972385,
+    "train_runtime": 7405.4046,
+    "train_samples": 307441,
+    "train_samples_per_second": 415.158,
+    "train_steps_per_second": 0.288
 }
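As a quick sanity check, the throughput figures in the new `all_results.json` are mutually consistent: total samples processed divided by optimizer steps implies an effective batch of roughly 1,440 sequences per step, which would match about 8 devices at the per-device batch size of 180 listed in the README (the device count is an inference, not stated anywhere in the commit).

```python
# Figures copied from the new all_results.json
runtime_s = 7405.4046
samples_per_s = 415.158
steps_per_s = 0.288
train_samples = 307441

total_samples = samples_per_s * runtime_s   # ~3.07M sequences seen in total
total_steps = steps_per_s * runtime_s       # ~2133 optimizer steps
print(total_samples / train_samples)        # ~10.0 passes over the data, matching num_epochs
print(total_samples / total_steps)          # ~1441 sequences per step (~8 x 180)
```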
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7a35accdd748bd776054dae1c3b9ec5e6c8ab34c387cfd5a08b5d30a9a4ca7b9
+oid sha256:e71b775601c84098ad6748573aaf5917d953cbbc3ebb83405c10e0169acd638b
 size 410722920
train_results.json CHANGED
@@ -1,9 +1,9 @@
 {
-    "epoch": 5.0,
-    "total_flos": 3.940359181120512e+17,
-    "train_loss": 1.6082196723956328,
-    "train_runtime": 3331.9519,
-    "train_samples": 299445,
-    "train_samples_per_second": 449.354,
-    "train_steps_per_second": 0.312
+    "epoch": 9.97074312463429,
+    "total_flos": 8.067940197159444e+17,
+    "train_loss": 1.4460852847972385,
+    "train_runtime": 7405.4046,
+    "train_samples": 307441,
+    "train_samples_per_second": 415.158,
+    "train_steps_per_second": 0.288
 }
trainer_state.json CHANGED
The diff for this file is too large to render. See raw diff