Crystalcareai committed
Commit f4651f0 · Parent(s): bb09a79
Update README.md

README.md CHANGED
@@ -6,9 +6,9 @@

 # GemMoE: A New MoE Method via Branch Train Mix

-I would like to introduce GemMoE, a new Mixture of Experts (MoE) method that utilizes a custom implementation of Meta's Branch Train Mix MoE method. This approach, detailed in the research paper ["Branch Train Mix: A Novel Mixture-of-Experts Training Procedure"](https://arxiv.org/abs/2403.07816), has allowed me to create an efficient model that
+I would like to introduce GemMoE, a new Mixture of Experts (MoE) method that uses a custom implementation of Meta's Branch Train Mix MoE method. This approach, detailed in the research paper ["Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM"](https://arxiv.org/abs/2403.07816), has allowed me to create an efficient model that overcomes the limitations of previous GemMoE models.

-GemMoE is a 4x8.5b MoE model, consisting of 4 experts that were trained separately and then combined using a custom fork of axolotl. This fork enabled me to freeze all experts and focus on training the router mechanism. The router was trained on 4 epochs of
+GemMoE is a 4x8.5B MoE model consisting of 4 experts that were trained separately and then combined using a custom fork of axolotl. This fork enabled me to freeze all experts and focus on training the router mechanism. The router was trained on 4 epochs of my Self-Discover-MM dataset and 2 epochs of TruthyDPO from Jon Durbin.

 One of the main differences between GemMoE and previous versions is the use of token-level routing instead of semantic routing. This approach, similar to the one used in Mixtral, results in improved VRAM usage and competitive performance for its size.

@@ -24,7 +24,7 @@ By utilizing this training procedure, GemMoE aims to achieve competitive results

 ## Collaboration and Open-Source Development

-GemMoE builds upon the work of researchers and developers from various organizations, including Meta, Hugging Face, and the broader open-source community. I would like to thank the researchers behind the Branch Train Mix method, as well as the developers of axolotl and
+GemMoE builds upon the work of researchers and developers from various organizations, including Meta, Hugging Face, and the broader open-source community. I would like to thank the researchers behind the Branch Train Mix method, the developers of axolotl, and Jon Durbin for creating TruthyDPO.

 ## Future Development
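The "freeze all experts and train only the router" step described in the updated paragraph above can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the actual axolotl fork: the assumption that routing parameters carry "router" or "gate" in their names, and the optimizer settings, are hypothetical.

```python
import torch

def freeze_experts_train_router(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Freeze everything except router/gating parameters and return an optimizer for them."""
    for name, param in model.named_parameters():
        # Keep only the routing/gating weights trainable; the merged experts,
        # attention blocks, embeddings, etc. stay frozen.
        param.requires_grad = ("router" in name) or ("gate" in name)

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)  # lr is an illustrative value
```

With a setup like this, only the gating weights receive gradient updates during the 4 + 2 epochs of router training, while the four separately trained experts stay fixed.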
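Likewise, the token-level routing mentioned in the README (each token independently picks its top-k experts, as in Mixtral, rather than routing whole inputs by semantic similarity) can be sketched like this. The hidden size, the top-2 selection, and the `TokenRouter` class name are illustrative assumptions; only the 4-expert count comes from the model description above.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TokenRouter(nn.Module):
    """Per-token top-k gate: every token independently selects a few experts."""

    def __init__(self, hidden_size: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size)
        logits = self.gate(hidden_states)                 # (batch, seq_len, num_experts)
        top_logits, expert_ids = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)           # renormalize over the chosen experts
        # Each token is dispatched only to its top-k experts, so only a
        # fraction of the expert parameters is active per token.
        return expert_ids, weights

# Illustrative usage with made-up shapes:
router = TokenRouter(hidden_size=4096, num_experts=4, top_k=2)
ids, w = router(torch.randn(1, 8, 4096))  # ids and w have shape (1, 8, 2)
```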