---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- moe
- moah
- mod
datasets:
- Locutusque/UltraTextbooks
---

# Model Card for MoM: Mixture of Mixture

## Model Details

### Model Description

MoM: Mixture of Mixture

This model is a first test of combining the [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) architecture with mixture of attention heads (MoAH) and mixture of depths (MoD).

The Mamba and attention layers are in bf16 precision; the rest of the model is in 1.58-bit precision. 107M of the 1025M total parameters are in bf16, i.e. roughly 10% of the parameters.

The goal is to develop this kind of architecture and to test whether it allows fast inference without too much quality loss.

- **Model type:** Mixture of attention heads, mixture of depths, and mixture of experts, with 1.58-bit linear layers for the **MLP**
- **License:** Apache License 2.0

### Model Sources

- **Repository:** https://github.com/ostix360/optimized-LLM

## How to Get Started with the Model

To test this model, see the repository at this [commit](https://github.com/ostix360/optimized-LLM/tree/d266bc404346b71ea237c0744be0f8928f6b3217).

## Training Details

- **wandb**: [training details](https://wandb.ai/ostix360/Mixture%20of%20mixture%20(mod,%20moah%20moe)/runs/wtoujazq)

### Training Data

We train the model on the first 100k samples of Locutusque/UltraTextbooks.

### Training Procedure

We use 8-bit Adam with the default betas and epsilon values.

#### Preprocessing

The data are made to fit the model's maximum length, i.e. 512 tokens.

#### Training Hyperparameters

See the wandb metadata or `train.py` in the repository for the hyperparameters.

## Technical Specifications

### Compute Infrastructure

#### Hardware

- One NVIDIA RTX 4070 Ti GPU

#### Software

- PyTorch, transformers, etc.
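
As a rough illustration of the precision split described under Model Description (107M bf16 parameters out of 1025M, the rest at 1.58 bits), the sketch below estimates the weight footprint. It is only a back-of-the-envelope calculation: the function name `weight_footprint` is hypothetical, and it assumes the low-precision weights are stored at their theoretical density of 1.58 bits each, which a real checkpoint may not achieve.

```python
# Back-of-the-envelope weight footprint from the figures quoted in this card.
# Assumption: 1.58-bit weights are counted at their theoretical density;
# real checkpoints typically pack or store them less tightly.

BF16_PARAMS = 107e6     # parameters kept in bf16 (from the card)
TOTAL_PARAMS = 1025e6   # total parameters (from the card)


def weight_footprint(bf16_params: float, total_params: float,
                     low_bits: float = 1.58) -> dict:
    """Return the bf16 share and approximate weight sizes in megabytes."""
    low_params = total_params - bf16_params          # parameters at low precision
    bf16_mb = bf16_params * 16 / 8 / 1e6             # 16 bits per bf16 weight
    low_bit_mb = low_params * low_bits / 8 / 1e6     # low_bits per remaining weight
    all_bf16_mb = total_params * 16 / 8 / 1e6        # baseline: everything in bf16
    return {
        "bf16_share": bf16_params / total_params,
        "bf16_mb": bf16_mb,
        "low_bit_mb": low_bit_mb,
        "mixed_total_mb": bf16_mb + low_bit_mb,
        "all_bf16_mb": all_bf16_mb,
    }


if __name__ == "__main__":
    for name, value in weight_footprint(BF16_PARAMS, TOTAL_PARAMS).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the bf16 share comes out at about 10.4%, and the mixed-precision weights take roughly 395 MB compared with 2050 MB if every parameter were stored in bf16. Activations, optimizer state, and any packing overhead are not included.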