NeuralNovel committed on
Commit 326146d
1 Parent(s): 904465d

Update README.md

Files changed (1)
  1. README.md +3 -5
README.md CHANGED
@@ -22,6 +22,9 @@ Mini-Mixtral-v0.2 is a Mixture of Experts (MoE) made with the following models u
 * [unsloth/mistral-7b-v0.2](https://huggingface.co/unsloth/mistral-7b-v0.2)
 * [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
 
+<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
+<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
+
 ## 🧩 Configuration
 
 ```yaml
@@ -77,10 +80,6 @@ print(outputs[0]["generated_text"])
 
 ## "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"
 
-
-<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
-<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
-
 ### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
 
 The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
@@ -113,4 +112,3 @@ If all our tokens are sent to just a few popular experts, that will make trainin
 ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/43v7GezlOGg2BYljbU5ge.gif)
 ## "Wait...but you called this a frankenMoE?"
 The difference between MoE and "frankenMoE" lies in the fact that the router layer in a model like the one on this repo is not trained simultaneously.
-```
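The header of the second hunk above quotes `print(outputs[0]["generated_text"])`, the last line of the README's inference snippet, which this commit leaves untouched. As a point of reference only, here is a minimal sketch of one common way to run such a checkpoint with the 🤗 Transformers `pipeline` API; the repo id `NeuralNovel/Mini-Mixtral-v0.2` and the sampling settings are assumptions, not values taken from this commit.

```python
# Minimal sketch only; the model id and sampling parameters are assumptions,
# not values confirmed by this commit.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="NeuralNovel/Mini-Mixtral-v0.2",  # assumed repo id (commit author + model name)
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [{"role": "user", "content": "What is a Mixture of Experts (MoE)?"}]
prompt = pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(outputs[0]["generated_text"])
```

`device_map="auto"` hands layer placement to Accelerate, which is convenient for a 2x7B MoE that may not fit on a single consumer GPU.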
 
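The quoted MoE excerpts in the last two hunks turn on two points: each token is routed to a small number of experts (and routing collapse onto a few popular experts makes training inefficient), and in a frankenMoE the router is not trained jointly with the experts. Purely as an illustration, and not this repository's build code, the PyTorch sketch below shows top-2 gating: a router scores each token, the two best-scoring experts process it, and their outputs are mixed with the renormalized router weights. Initializing `gate_weight` at random rather than learning it end to end corresponds to the untrained-router situation the README describes.

```python
# Illustrative top-2 MoE routing; not the code used to build this repo's model.
import torch
import torch.nn.functional as F

def top2_moe(hidden, gate_weight, experts):
    """hidden: (tokens, dim); gate_weight: (num_experts, dim); experts: list of callables."""
    logits = hidden @ gate_weight.t()                # router score per (token, expert)
    probs = F.softmax(logits, dim=-1)
    top_w, top_i = probs.topk(2, dim=-1)             # each token keeps its 2 best experts
    top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize the pair of weights
    out = torch.zeros_like(hidden)
    for slot in range(2):
        for e, expert in enumerate(experts):
            mask = top_i[:, slot] == e               # tokens whose slot-th choice is expert e
            if mask.any():
                out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(hidden[mask])
    return out

# Toy usage: 4 "experts" over 8 tokens of dimension 16, with router weights drawn
# at random (the untrained-gate case discussed above).
experts = [torch.nn.Linear(16, 16) for _ in range(4)]
gate_weight = torch.randn(4, 16)
tokens = torch.randn(8, 16)
print(top2_moe(tokens, gate_weight, experts).shape)  # torch.Size([8, 16])
```

If routing concentrates on one or two experts, the remaining experts see almost no tokens, which is the training inefficiency the quoted passage warns about.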