Minimal inference code
Hi, can you provide a minimal example for running inference? I keep running into tensor length mismatch errors, and I haven't found an inference example in any of the GemMoE repos. Thanks
Can you share the error you're receiving? But yes, I can update the readme.
Sorry, keyboard just spazzed on me. Check the readme for some code, and I'll update with some proper examples once I can spin up an instance.
Thanks, that worked for me. The issue was that I was not passing attn_implementation="flash_attention_2" when loading the model.
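For anyone else hitting the same mismatch, here is a rough sketch of what the fix looks like with the standard transformers loading API. It is not the official example from the readme. MODEL_ID is a hypothetical placeholder, the helper names are made up for illustration, and whether a given checkpoint needs trust_remote_code=True is an assumption you should check against the repo. It also assumes a CUDA GPU with the flash-attn package installed.

```python
# Hypothetical placeholder: substitute the GemMoE checkpoint you are using.
MODEL_ID = "your-org/your-gemmoe-checkpoint"


def build_load_kwargs(trust_remote_code=False):
    """Keyword arguments for from_pretrained that select FlashAttention-2."""
    return {
        "attn_implementation": "flash_attention_2",
        # Assumption: set to True only if the repo ships custom modeling code
        # and you trust it, since this executes code from the repo.
        "trust_remote_code": trust_remote_code,
    }


def generate(prompt, model_id=MODEL_ID, max_new_tokens=64, trust_remote_code=False):
    # Imported lazily so build_load_kwargs can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=trust_remote_code
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # FlashAttention-2 requires fp16 or bf16
        **build_load_kwargs(trust_remote_code),
    ).to("cuda")

    # Tokenize the prompt and move the tensors to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate("some prompt") with a real checkpoint ID should then return the decoded completion as a string. The attention setting lives in its own helper only so it is easy to see which argument made the difference.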
Yeah, only flash-attn is supported atm. Tbh I'm getting pretty disillusioned with Gemma as a whole, and will likely move my future MoE experiments to Yi or Mistral. Gemma is too unpredictable, even with bug fixes.
Interesting, that is good context to have. I'll keep up to date on your work, thanks again.