trollek committed
Commit d203bbf
1 Parent(s): 547f5f0

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -32,7 +32,7 @@ tags:
 - art
 ---
 
-This is NinjaMouse extended even further. Instead of Cosmopedia I used different coding datasets.
+This is [NinjaMouse](https://huggingface.co/trollek/NinjaMouse-2.4B-32L-danube) extended even further. Instead of Cosmopedia I used different coding datasets.
 
 I have learned a lot during this process, and if you have a GPU capable of training your own model you should try it. I made some mistakes, like using pure_bf16 at one point among other things, but the second version will slap the leaderboard for its weight class.
 
@@ -40,7 +40,7 @@ I don't know if it will be able to write textbook quality articles from finetuni
 
 The model is expanded depth-wise by copying the middle and last layers and inserting them as the new middle and new last layers. This means the two layers that have just been trained are the ones that get copied in the next expansion step. In theory each expansion step keeps some of the trained parameters, which may be used to optimize the order in which datasets are assigned to each expansion (see the sketch after this diff).
 
-Due to some of the issues with Unsloth I'm waiting patiently for a bug fix on the tokenizer (it seems), while I watch lectures and podcasts for guidance and inspiration. With Unsloth I can get through 10k samples/h on a 16GB 4060Ti, and without it I can expect 4x the training time/electricity. There's also a bug with batch responses, like with LLM leaderboard evals, where [Daniel](https://github.com/danielhanchen) continues to be a champ and addresses the problems.
+Due to some of the issues with Unsloth I'm waiting patiently for a bug fix on the tokenizer (it seems), while I watch lectures and podcasts for guidance and inspiration. With Unsloth I can get through 10k samples/h on a 16GB 4060Ti, and without it I can expect 4x the training time/electricity. There's also a bug with batched responses, like with LLM leaderboard evals.
 
 I've been testing the Stable Diffusion abilities, and they seem to work. The output actually seems reasonable.
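To make the depth-wise expansion described in the diff concrete, here is a minimal sketch of one expansion step in PyTorch with `transformers`. The commit itself contains no code, so the details are assumptions: the base checkpoint name is illustrative, and the sketch presumes a Mistral-style model that exposes its decoder stack as `model.model.layers`, as the danube models do in `transformers`.

```python
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def expand_depthwise(model):
    """One expansion step: duplicate the middle and the last decoder layer,
    inserting the copies as the new middle and new last layer."""
    layers = model.model.layers  # Mistral-style decoder stack (nn.ModuleList)
    mid = len(layers) // 2
    expanded = []
    for i, layer in enumerate(layers):
        expanded.append(layer)
        if i == mid:  # the copy goes right after the original, as the new middle layer
            expanded.append(copy.deepcopy(layer))
    expanded.append(copy.deepcopy(layers[-1]))  # the copy becomes the new last layer
    model.model.layers = nn.ModuleList(expanded)
    model.config.num_hidden_layers = len(expanded)
    # NOTE: recent transformers versions tag each decoder layer with a layer_idx
    # used by the KV cache; a full implementation would renumber those after
    # inserting the copies.
    return model

# Illustrative usage: grow the stack by two layers, then train so that the two
# freshly copied layers are the ones duplicated in the following expansion step.
model = AutoModelForCausalLM.from_pretrained("h2oai/h2o-danube-1.8b-base")  # assumed base
model = expand_depthwise(model)
print(model.config.num_hidden_layers)
```

Because the copies are inserted exactly where the most recently trained layers sit, each round of training feeds directly into the next expansion, which is what the README suggests exploiting when choosing the order of datasets.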