indiejoseph committed
Commit 886c759 · verified · 1 Parent(s): 67b9a88

Upload folder using huggingface_hub

README.md CHANGED
@@ -20,8 +20,42 @@ should probably proofread and complete it, then remove this comment. -->
 
 This model is a continuation of [indiejoseph/bert-base-cantonese](https://huggingface.co/indiejoseph/bert-base-cantonese), a BERT-based model pre-trained on a substantial corpus of Cantonese text. The dataset was sourced from a variety of platforms, including news articles, social media posts, and web pages. The text was segmented into sentences containing 11 to 460 tokens per line. To ensure data quality, Minhash LSH was employed to eliminate near-duplicate sentences, resulting in a final dataset comprising 161,338,273 tokens. Training was conducted using the `run_mlm.py` script from the `transformers` library.
 
+This continued pre-training aims to expand the model's knowledge with more up-to-date Hong Kong and Cantonese text, so we deliberately overfit the model slightly by using a higher learning rate and more epochs.
+
 [WandB](https://wandb.ai/indiejoseph/public/runs/wy2ja88z/workspace?nw=nwuserindiejoseph)
 
+## Usage
+
+```python
+from transformers import pipeline
+
+# Note: this is a local training checkpoint path; substitute the published model ID from the Hugging Face Hub.
+pipe = pipeline("fill-mask", model="/home/pj24001684/ku40000295/jc/projects/bert-pretrain/models/20241120")
+
+pipe("香港特首係李[MASK]超")
+
+# [{'score': 0.3057154417037964,
+#   'token': 2157,
+#   'token_str': '家',
+#   'sequence': '香 港 特 首 係 李 家 超'},
+#  {'score': 0.08251259475946426,
+#   'token': 6631,
+#   'token_str': '超',
+#   'sequence': '香 港 特 首 係 李 超 超'},
+#  ...
+
+pipe("我睇到由治及興帶嚟[MASK]好處")
+
+# [{'score': 0.9563464522361755,
+#   'token': 1646,
+#   'token_str': '嘅',
+#   'sequence': '我 睇 到 由 治 及 興 帶 嚟 嘅 好 處'},
+#  {'score': 0.00982475932687521,
+#   'token': 4638,
+#   'token_str': '的',
+#   'sequence': '我 睇 到 由 治 及 興 帶 嚟 的 好 處'},
+#  ...
+```
 
 ## Intended uses & limitations
 
@@ -32,7 +66,7 @@ This model is intended to be used for further fine-tuning on Cantonese downstrea
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
-- learning_rate: 5e-05
+- learning_rate: 0.0001
 - train_batch_size: 180
 - eval_batch_size: 8
 - seed: 42
@@ -41,8 +75,7 @@ The following hyperparameters were used during training:
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: cosine
 - lr_scheduler_warmup_ratio: 0.1
-- num_epochs: 5.0
-
+- num_epochs: 10.0
 
 ### Framework versions
 
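The README above states that near-duplicate sentences were removed with MinHash LSH before pre-training, but the deduplication code is not part of this commit. The following is a minimal sketch of such a filter using the `datasketch` library; the library choice, character 3-gram shingling, 128 permutations, and the 0.8 similarity threshold are assumptions for illustration, not values taken from the repository.

```python
from datasketch import MinHash, MinHashLSH

def char_shingles(text, n=3):
    # Character n-grams work reasonably for Cantonese, which is not space-delimited.
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def dedupe(sentences, threshold=0.8, num_perm=128):
    # Keep only sentences whose MinHash signature has no near-duplicate among those already kept.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, sent in enumerate(sentences):
        mh = MinHash(num_perm=num_perm)
        for shingle in char_shingles(sent):
            mh.update(shingle.encode("utf-8"))
        if lsh.query(mh):      # an approximate near-duplicate was already kept
            continue
        lsh.insert(str(idx), mh)
        kept.append(sent)
    return kept

print(dedupe(["香港特首係李家超", "香港特首係李家超。", "我睇到由治及興帶嚟嘅好處"]))
```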
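The hyperparameter diff above only lists the values passed to `run_mlm.py`. As a rough reconstruction, the new settings correspond approximately to the `transformers` `TrainingArguments` below; the output directory is a placeholder, and treating `train_batch_size: 180` as a per-device value is an assumption (the auto-generated README does not say whether it is per device or total).

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters listed in the README diff.
args = TrainingArguments(
    output_dir="bert-base-cantonese-mlm",  # placeholder, not from the commit
    learning_rate=1e-4,                    # was 5e-05 before this commit
    per_device_train_batch_size=180,       # assumed per-device
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=10.0,                 # was 5.0 before this commit
)
print(args.learning_rate, args.num_train_epochs)
```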
all_results.json CHANGED
@@ -1,9 +1,9 @@
 {
-    "epoch": 5.0,
-    "total_flos": 3.940359181120512e+17,
-    "train_loss": 1.6082196723956328,
-    "train_runtime": 3331.9519,
-    "train_samples": 299445,
-    "train_samples_per_second": 449.354,
-    "train_steps_per_second": 0.312
+    "epoch": 9.97074312463429,
+    "total_flos": 8.067940197159444e+17,
+    "train_loss": 1.4460852847972385,
+    "train_runtime": 7405.4046,
+    "train_samples": 307441,
+    "train_samples_per_second": 415.158,
+    "train_steps_per_second": 0.288
 }
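As a quick sanity check, the throughput figures in the new `all_results.json` are mutually consistent: total samples processed divided by optimizer steps implies an effective batch of roughly 1,440 sequences per step, which would match about 8 devices at the per-device batch size of 180 listed in the README (the device count is an inference, not stated anywhere in the commit).

```python
# Figures copied from the new all_results.json
runtime_s = 7405.4046
samples_per_s = 415.158
steps_per_s = 0.288
train_samples = 307441

total_samples = samples_per_s * runtime_s   # ~3.07M sequences seen in total
total_steps = steps_per_s * runtime_s       # ~2133 optimizer steps
print(total_samples / train_samples)        # ~10.0 passes over the data, matching num_epochs
print(total_samples / total_steps)          # ~1441 sequences per step (~8 x 180)
```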
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7a35accdd748bd776054dae1c3b9ec5e6c8ab34c387cfd5a08b5d30a9a4ca7b9
+oid sha256:e71b775601c84098ad6748573aaf5917d953cbbc3ebb83405c10e0169acd638b
 size 410722920
train_results.json CHANGED
@@ -1,9 +1,9 @@
 {
-    "epoch": 5.0,
-    "total_flos": 3.940359181120512e+17,
-    "train_loss": 1.6082196723956328,
-    "train_runtime": 3331.9519,
-    "train_samples": 299445,
-    "train_samples_per_second": 449.354,
-    "train_steps_per_second": 0.312
+    "epoch": 9.97074312463429,
+    "total_flos": 8.067940197159444e+17,
+    "train_loss": 1.4460852847972385,
+    "train_runtime": 7405.4046,
+    "train_samples": 307441,
+    "train_samples_per_second": 415.158,
+    "train_steps_per_second": 0.288
 }
trainer_state.json CHANGED
The diff for this file is too large to render. See raw diff