My understanding is that the MoE model uses the same LM loss as previous transformers. Are any other auxiliary losses used?
Please clarify or point me to the right file in the megablocks src. Thank you!

