trollek committed
Commit d203bbf
1 Parent(s): 547f5f0

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -32,7 +32,7 @@ tags:
 - art
 ---
 
-This is NinjaMouse extended even further. Instead of Cosmopedia I used different coding datasets.
+This is [NinjaMouse](https://huggingface.co/trollek/NinjaMouse-2.4B-32L-danube) extended even further. Instead of Cosmopedia I used different coding datasets.
 
 I have learned a lot during this process, and if you have a GPU capable of training your own model you should try it. I made some mistakes, like using pure_bf16 at one point among other things, but the second version will slap the leaderboard for its weight class.
 
@@ -40,7 +40,7 @@ I don't know if it will be able to write textbook quality articles from finetuni
 
 The model is expanded depth-wise by copying the middle and last layers and inserting them as the new middle and new last layers. This means the two layers that have just been trained are the ones that get copied in the next expansion step. In theory each expansion step keeps some of the trained parameters, which may be used to optimize the order in which datasets are assigned to each expansion (see the sketch after this diff).
 
-Due to some of the issues with Unsloth I'm waiting patiently for a bug fix on the tokenizer (it seems), while I watch lectures and podcasts for guidance and inspiration. With Unsloth I can get through 10k samples/h on a 16GB 4060Ti, and without it I can expect 4x the training time/electricity. There's also a bug with batch responses, like with LLM leaderboard evals, where [Daniel](https://github.com/danielhanchen) continues to be a champ and addresses the problems.
+Due to some of the issues with Unsloth I'm waiting patiently for a bug fix on the tokenizer (it seems), while I watch lectures and podcasts for guidance and inspiration. With Unsloth I can get through 10k samples/h on a 16GB 4060Ti, and without it I can expect 4x the training time/electricity. There's also a bug with batched responses, like with LLM leaderboard evals.
 
 I've been testing the Stable Diffusion abilities, and they seem to work. The output actually seems reasonable.
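To make the depth-wise expansion described in the diff concrete, here is a minimal sketch of one expansion step in PyTorch with `transformers`. The commit itself contains no code, so the details are assumptions: the base checkpoint name is illustrative, and the sketch presumes a Mistral-style model that exposes its decoder stack as `model.model.layers`, as the danube models do in `transformers`.

```python
import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def expand_depthwise(model):
    """One expansion step: duplicate the middle and the last decoder layer,
    inserting the copies as the new middle and new last layer."""
    layers = model.model.layers  # Mistral-style decoder stack (nn.ModuleList)
    mid = len(layers) // 2
    expanded = []
    for i, layer in enumerate(layers):
        expanded.append(layer)
        if i == mid:  # the copy goes right after the original, as the new middle layer
            expanded.append(copy.deepcopy(layer))
    expanded.append(copy.deepcopy(layers[-1]))  # the copy becomes the new last layer
    model.model.layers = nn.ModuleList(expanded)
    model.config.num_hidden_layers = len(expanded)
    # NOTE: recent transformers versions tag each decoder layer with a layer_idx
    # used by the KV cache; a full implementation would renumber those after
    # inserting the copies.
    return model

# Illustrative usage: grow the stack by two layers, then train so that the two
# freshly copied layers are the ones duplicated in the following expansion step.
model = AutoModelForCausalLM.from_pretrained("h2oai/h2o-danube-1.8b-base")  # assumed base
model = expand_depthwise(model)
print(model.config.num_hidden_layers)
```

Because the copies are inserted exactly where the most recently trained layers sit, each round of training feeds directly into the next expansion, which is what the README suggests exploiting when choosing the order of datasets.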