Commit b325caa: Update README.md
Author: GregorZiegltrumAA
Parent(s): edc895f
Changed file: README.md
@@ -18,10 +18,16 @@ pipeline_tag: text-generation
 
 This Repository holds the model weights for the 7B u-μP models trained at Aleph Alpha Research, in collaboration with Graphcore, for 72k steps (300B tokens). Please note, that the released checkpoints are not fully converged models and are intended for research use only.
 
-You can find all model weights
+You can find all model weights at the following links:
 - [umup-research-7b-bf16](https://huggingface.co/Aleph-Alpha/umup-research-7b-bf16)
 - [umup-research-7b-fp8](https://huggingface.co/Aleph-Alpha/umup-research-7b-fp8)
 - [sp-baseline-research-7b-bf16](https://huggingface.co/Aleph-Alpha/sp-baseline-research-7b-bf16)
+- [umup-research-3b-bf16](https://huggingface.co/Aleph-Alpha/umup-research-3b-bf16)
+- [umup-research-3b-fp8](https://huggingface.co/Aleph-Alpha/umup-research-3b-fp8)
+- [sp-baseline-research-3b-bf16](https://huggingface.co/Aleph-Alpha/sp-baseline-research-3b-bf16)
+- [umup-research-1b-bf16](https://huggingface.co/Aleph-Alpha/umup-research-1b-bf16)
+- [umup-research-1b-fp8](https://huggingface.co/Aleph-Alpha/umup-research-1b-fp8)
+- [sp-baseline-research-1b-bf16](https://huggingface.co/Aleph-Alpha/sp-baseline-research-1b-bf16)
 
 The Maximal Update Parametrization (μP) aims to make the optimal hyperparameters (HPs) of a model-independent of its size, allowing them to be swept using a cheap proxy model rather than the full-size target model. We present a new scheme, u-μP, which improves upon μP by combining it with Unit Scaling, a method for designing models that makes them easy to train in low precision. The two techniques have a natural affinity: μP ensures that the scale of activations is independent of model size, and Unit Scaling ensures that activations, weights, and gradients begin training with a scale of one. This synthesis opens the door to a simpler scheme, whose default values are near-optimal. This in turn facilitates a more efficient sweeping strategy, with u-μP models reaching a lower loss than comparable μP models and working out-of-the-box in FP8.
 
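
The README text in this diff states that Unit Scaling makes activations begin training with a scale of one, independent of model size. The following is a minimal NumPy sketch of that idea for a single matrix multiplication, assuming unit-variance inputs and weights and a 1/sqrt(fan_in) output scale. It is an illustration only, not the u-μP implementation used to train these checkpoints, and the helper names `unit_scaled_matmul` and `rms` are hypothetical.

```python
import numpy as np


def rms(a):
    """Root-mean-square of an array, used here as a measure of its scale."""
    return float(np.sqrt(np.mean(np.square(a))))


def unit_scaled_matmul(x, w):
    """Compute x @ w, then divide by sqrt(fan_in).

    Assumption for this sketch: x and w both have unit-variance entries,
    so the rescaled output also has roughly unit variance.
    """
    fan_in = w.shape[0]
    return (x @ w) / np.sqrt(fan_in)


rng = np.random.default_rng(0)
scales = {}
for width in (64, 256, 1024):
    x = rng.standard_normal((256, width))    # unit-scale activations
    w = rng.standard_normal((width, width))  # unit-scale weights at init
    unscaled = rms(x @ w)                    # grows like sqrt(width)
    scales[width] = rms(unit_scaled_matmul(x, w))  # stays close to 1
    print(f"width={width}: unscaled={unscaled:.2f}, unit-scaled={scales[width]:.2f}")
```

Without the rescaling, the output scale grows with the square root of the width. With it, the scale stays near one at every width, which is the property that keeps values inside the narrow range of low-precision formats such as FP8.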