Update README.md
README.md CHANGED
@@ -15,6 +15,9 @@ In addition to sharing the model weights, we provide the core designs, engineeri
- **Language(s):** English; Chinese; Other languages
- **License:** Apache 2.0

+## Tech report
+
+[Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645)


## Bias, Risks, and Limitations

@@ -68,7 +71,7 @@ We adopt the architecture of FLM-101B as the backbone for Tele-FLM, with several
Consequently, Tele-FLM is largely compatible with Llama architecturally.
To maximize convenience for the community, we made minimal adjustments to Llama's code to adapt it to Tele-FLM and released it as open source.

-In the pre-training stage, we employ μP for optimal hyperparameter search. The μP model (Tele-FLM_μP) is architecturally identical to Tele-FLM except for the model width
+In the pre-training stage, we employ μP for optimal hyperparameter search. The μP model (Tele-FLM_μP) is architecturally identical to Tele-FLM except for the model width.
The architecture of Tele-FLM and Tele-FLM_μP is listed below.
For more details of μP, please refer to our technical report and the original Tensor Program papers.

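Since the hunk above notes that Tele-FLM reuses Llama's code path with only minimal adjustments, the checkpoint can be loaded through the standard Hugging Face Transformers interface. The following is a minimal usage sketch, not part of this diff: the repo id `CofeAI/Tele-FLM`, the dtype, and the generation settings are assumptions, and `trust_remote_code=True` is included on the assumption that the adapted Llama-style modeling code ships with the checkpoint.

```python
# Minimal usage sketch (assumptions: repo id, dtype, generation settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CofeAI/Tele-FLM"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the full model is large; use several GPUs or offloading
    device_map="auto",           # requires the accelerate package
    trust_remote_code=True,      # loads the adapted Llama-style modeling code from the repo
)

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```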
@@ -83,7 +86,7 @@ For more details of μP, please refer to our technical report and the original T
### Training Hyperparameters

Due to the smaller size, Tele-FLM_μP allows for significantly more experimental runs within fixed time and resource constraints.
-We searched
+We searched seven hyperparameters for pretraining. All the hyperparameters are shown below.


| Searched Hyperparameters ||| Non-Searched Hyperparameters ||
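The hunk above describes searching hyperparameters on the small Tele-FLM_μP proxy so they can be transferred to the full-width model. As a rough illustration of what that width-based transfer usually involves under the Tensor Programs (μP) conventions, the sketch below rescales the width-dependent hyperparameters by the width ratio; the rules and the example widths are generic assumptions, not values or code from the Tele-FLM repo.

```python
# Generic muP-style hyperparameter transfer sketch (assumed conventions,
# not the Tele-FLM implementation).

def mup_transfer(base_width: int, target_width: int,
                 base_hidden_lr: float, base_init_std: float) -> dict:
    """Rescale width-dependent hyperparameters tuned on a narrow proxy model
    so they can be reused on a wider target model."""
    ratio = target_width / base_width
    return {
        # Adam learning rate for hidden (matrix-like) weights scales as 1/width.
        "hidden_lr": base_hidden_lr / ratio,
        # Initialization std of hidden weights scales as 1/sqrt(width).
        "hidden_init_std": base_init_std / ratio ** 0.5,
        # Width-independent settings (e.g. warmup, weight decay, batch size
        # schedule) carry over from the proxy run unchanged.
    }

# Hypothetical example: transfer from a 512-wide proxy to an 8192-wide model.
print(mup_transfer(base_width=512, target_width=8192,
                   base_hidden_lr=1e-2, base_init_std=0.02))
```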
@@ -146,9 +149,8 @@ The parallel training setup for Tele-FLM is configured as follows: tensor parall
| Tele-FLM | 71.13 | 65.48 | 66.98 | 66.25 | 92.57 | 64.38 |


-## Tech report
-For more detailed capabilities of Tele-FLM, see [Tele-FLM Technical Report](https://arxiv.org/pdf/2404.16645)

+## Citation
If you find our work helpful, please consider citing it.
```
@misc{li2024teleflm,