Fizzarolli committed dcfa5f5 (parent: 235090d): write readme (benchmarks are running as we speak)

README.md (new file):
---
license: mit
tags:
- phi3
- nlp
- moe
datasets:
- BEE-spoke-data/gutenberg-en-v1-clean
- NeelNanda/pile-10k
---
# phi 3 4x4b
a continually pretrained phi3-mini sparse moe upcycle

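to make "sparse moe upcycle" concrete: each moe layer starts out as several identical copies of the dense phi-3-mini MLP plus a freshly initialized router, and the whole thing is then continually pretrained so the experts can diverge. the sketch below is a generic, plain-pytorch illustration of that idea, not the actual phi-3 moe code; the class and module names are made up.

```python
# generic illustration of MoE upcycling: duplicate a pretrained dense MLP into
# several experts and add a new top-k router. not the actual phi-3 implementation.
import copy
import torch
import torch.nn as nn

class UpcycledSparseMoE(nn.Module):
    def __init__(self, dense_mlp: nn.Module, hidden_size: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # every expert starts as an exact copy of the pretrained dense MLP
        self.experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
        # the router ("gate") is the only newly initialized component
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, hidden)
        probs = self.gate(x).softmax(dim=-1)                   # (b, s, E) routing probabilities
        weights, idx = torch.topk(probs, self.top_k, dim=-1)   # (b, s, k) top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen experts
        # dense reference implementation: run every expert, then keep only the top-k mix
        expert_out = torch.stack([e(x) for e in self.experts], dim=2)  # (b, s, E, h)
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))  # (b, s, k, h)
        picked = expert_out.gather(2, gather_idx)                      # (b, s, k, h)
        return (weights.unsqueeze(-1) * picked).sum(dim=2)             # (b, s, h)
```
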
## support me on ko-fi!
[~~please i need money to stay alive and keep making models~~](https://ko-fi.com/fizzai)

## notes
*not trained on instruct data.* if you prompt it with an instruct format it will probably behave about the same as phi 3, if not worse, since the continued pretraining likely caused some forgetting of the instruct formats.
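in other words, treat it as a base/completion model. below is a minimal sketch of plain-text completion with transformers; the repo id is a placeholder (substitute the actual model id), and `trust_remote_code` is an assumption in case your transformers version doesn't ship the moe architecture.

```python
# minimal sketch: plain-text completion, no chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/phi3-4x4b"  # placeholder; substitute the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

prompt = "The Project Gutenberg eBook of"  # plain continuation prompt, no instruct formatting
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
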
## future experiments
- the datasets for this were literally chosen on a whim. perhaps experiment with a highly filtered [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)?
- actually freeze the gate layers next time (see [Chen et al., 2023](https://arxiv.org/abs/2303.01610)), oops. a rough sketch of what that looks like is below the list.
- MOAR TRAINING, this only went up to ~0.2 of an epoch because i ran out of money
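for reference, a rough sketch of freezing the routers during continued training. `freeze_gates` is a hypothetical helper, and the `.gate` name match is an assumption about how the routers are named in this architecture; check `model.named_parameters()` and adjust the filter.

```python
# rough sketch: freeze the MoE routers so only the experts/attention keep training.
# the ".gate" name match is an assumption; inspect model.named_parameters() first.
def freeze_gates(model):
    frozen = []
    for name, param in model.named_parameters():
        if ".gate." in name or name.endswith(".gate.weight"):
            param.requires_grad = False
            frozen.append(name)
    print(f"froze {len(frozen)} gate parameters")
    return frozen

freeze_gates(model)
# then build the optimizer over only the trainable parameters, e.g.
# optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-5)
```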