Crystalcareai committed on
Commit 2e3cea0
1 Parent(s): 341c466

Update README.md

Files changed (1): README.md (+2 -2)
README.md CHANGED
@@ -24,7 +24,7 @@ The development of GemMoE was challenging yet incredibly rewarding. Although I e

 One of the most significant challenges I faced was the lack of experience in developing a custom MoE architecture. As a self-taught developer, I had to learn many hard lessons from my previous project, Qwen1.5-8x7b. Although the methodology was there, the execution was lacking. I soon realized that I would need to create a completely new model class within the transformers library and implement a custom MoE architecture tailored specifically for Gemma. Through countless iterations and a tremendous amount of trial and error, I was able to refine the architecture and optimize its performance.

-I cannot stress enough how grateful I am for the support and contributions of the AI community throughout this journey. Hugging Face, in particular, played a crucial role in making GemMoE a reality. Their generous compute grant allowed me to refine the architecture and optimize the model before committing resources to the full training. Victor and Apolinaro from the Hugging Face team were instrumental in getting this project off the ground, and their swift support and dedication were truly inspiring.
+I cannot stress enough how grateful I am for the support and contributions of the AI community throughout this journey. Hugging Face, in particular, played a crucial role in making GemMoE a reality. Their generous compute grant allowed me to refine the architecture and optimize the model before committing resources to the full training. Victor M and Apolinaro from the Hugging Face team were instrumental in getting this project off the ground, and their swift support and dedication were truly inspiring.

 I also want to express my heartfelt thanks to Eric Hartford, who provided invaluable assistance in troubleshooting Gemma's unstable loss curve during the early stages of development. Daniel Han from Unsloth deserves a special mention for his tireless work in identifying and fixing many of the bugs in Gemma's transformers implementation. His fixes enabled me to fine-tune 8 different versions of Gemma and combine them using a hidden gate with a heavily modified version of mergekit, a tool developed by the brilliant Charles Goddard.

@@ -32,7 +32,7 @@ Adapting Mergekit to support Gemma was no small feat, and I had to make signific

 I want to extend my gratitude to Maxime Labonne for his incredibly useful LLM course and colab notebooks, which helped me level up my fine-tuning skills. Jon Durbin's bagel GitHub repository was a crash course in what makes good data, and it played a crucial role in informing my data selection process. The transparency and example set by Teknium inspired me to turn my AI side hustle into a full-time gig, and for that, I am deeply grateful.

-Locustique's datasets served as a prime example of how to aggregate data like a pro. Justin Lin from Alibaba research supported my previous project, Qwen1.5 - 8x7b, which laid the foundation for GemMoE. I also want to thank Deepmind for releasing Gemma and acknowledge the hard work and dedication of everyone who contributed to its development.
+Locutusque's datasets served as a prime example of how to aggregate data like a pro. Justin Lin from Alibaba research supported my previous project, Qwen1.5 - 8x7b, which laid the foundation for GemMoE. I also want to thank Deepmind for releasing Gemma and acknowledge the hard work and dedication of everyone who contributed to its development.

 The Deepseek team's DeepseekMoE paper was a game-changer for me, providing critical insights into what makes an MoE as good as possible. I am also incredibly grateful to the entire Perplexity team, whose answer engine accelerated my education and understanding of AI by a factor of five (source: vibes).