fzmnm committed
Commit
fb67790
1 Parent(s): 6befd01

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -29,7 +29,7 @@ Inspired by the TinyStories research, which explores the effectiveness of small
  For detailed training procedures and configurations, please refer to [this GitHub repository](https://github.com/jia-zhuang/chinese-llama2.c).
  - **Hardware:** Trained on an NVIDIA RTX 2080 Super with 8 GB RAM—a modest gaming rig.
  - **Duration:** 87 hours (just over 3.5 days), covering 20k iterations and processing 2G tokens.
- - **Optimizer:** AdamW, with a learning rate (lr) of 5e-4, weight decay of 0.1, and gradient clipping at 1.0. The model underwent 1000 warm-up iterations without any dropout.
+ - **Optimizer:** AdamW, with a learning rate (lr) of 5e-4, 1000 warm-up iterations, and gradient clipping at 1.0.
  - **Dropout:** no
  - **Batch Size:** 4, configured to fit within the 8GB RAM of the 2080; gradient accumulation steps set at 128, achieving an effective 524,288 tokens per iteration as suggested by the Chinchilla paper ([Chinchilla study](https://arxiv.org/abs/2203.15556)).
  - **Training Iterations:** 20k, including a warm-up phase of 1k steps.
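
For reference, the configuration described in the diff above maps onto a standard PyTorch training loop roughly as sketched below. This is a minimal sketch, not the script from the linked chinese-llama2.c repository: the model, vocabulary size, data loader, and the cosine decay after warm-up are placeholders or assumptions, and `seq_len = 1024` is only inferred from 4 × 128 × 1024 = 524,288 tokens per iteration.

```python
import math
import torch
import torch.nn as nn

# Hyperparameters as listed in the README
learning_rate = 5e-4      # peak AdamW learning rate
warmup_iters = 1_000      # warm-up phase
max_iters = 20_000        # total training iterations
grad_clip = 1.0           # gradient-norm clipping threshold
batch_size = 4            # fits in the 8 GB of a 2080 Super
grad_accum_steps = 128    # 4 * 128 * 1024 = 524,288 tokens per iteration
seq_len = 1024            # assumed context length (inferred, not stated)
vocab_size = 4096         # placeholder vocabulary size

# Placeholder model and data loader; the real ones live in chinese-llama2.c.
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

def get_batch():
    x = torch.randint(0, vocab_size, (batch_size, seq_len))
    return x, x  # dummy inputs and targets

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

def lr_at(it):
    """Linear warm-up to the peak lr, then cosine decay to zero (assumed schedule)."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    progress = (it - warmup_iters) / (max_iters - warmup_iters)
    return 0.5 * learning_rate * (1.0 + math.cos(math.pi * progress))

for it in range(max_iters):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(it)
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):                 # gradient accumulation
        x, y = get_batch()
        logits = model(x)
        loss = nn.functional.cross_entropy(
            logits.view(-1, vocab_size), y.view(-1)
        ) / grad_accum_steps                          # scale so accumulated grads average
        loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
```

The gradient-accumulation loop is what lets a per-device batch size of 4 behave like the much larger effective batch of ~524k tokens per optimizer step cited from the Chinchilla study.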