---
license: apache-2.0
---

# Rodimus*

## Introduction

Rodimus* is a new series of efficient large language models designed to address the computational complexity challenges of Transformer-based architectures. The Rodimus* series includes the base Rodimus model and its enhanced version, Rodimus+. Rodimus leverages a novel Data-Dependent Tempered Selection (DDTS) mechanism within a purely recurrent, linear attention-based framework, achieving high performance. Building on this, Rodimus+ combines the strengths of Rodimus with the innovative Sliding Window Shared-Key Attention (SW-SKA) in a hybrid approach. This combination effectively integrates semantic, token, and head compression techniques, striking a balance between accuracy and efficiency.

For more details, please refer to our [Paper](https://openreview.net/forum?id=IIVYiJ1ggK) and [GitHub](https://github.com/codefuse-ai/rodimus).

> This repository contains the **latest checkpoint** of Rodimus+ 1.6B, trained on continuously updated data with a focus on code and math performance.

## Usage

We do not recommend using base language models directly for text generation. Instead, consider applying post-training techniques such as SFT, RLHF, or continued pretraining to enhance the model's performance.

**Installation**

1. The latest version of [transformers](https://github.com/huggingface/transformers) is recommended (at least 4.42.0).
2. We evaluate our models with `python=3.8` and `torch==2.1.2`.
3. If you use Rodimus, you need to install [flash-linear-attention](https://github.com/sustcsonglin/flash-linear-attention) and [triton>=2.2.0](https://github.com/triton-lang/triton). If you use Rodimus+, you also need to install [flash-attention](https://github.com/Dao-AILab/flash-attention).

## Generation

`generate` API

```python
import torch

from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model and tokenizer
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# inference
input_prompt = "你好!你是谁?"  # "Hello! Who are you?"
model_inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=32)

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```

## Performance

**Code Tasks**: HumanEval (0-shot), MBPP (0-shot)

**Math Tasks**: GSM8K (4-shot), MATH (5-shot)

**NLP Tasks**: C-Eval (5-shot), CMMLU (5-shot), MMLU (5-shot), BBH (3-shot)

> Last updated: 2025/02/15

| Dataset | Rodimus+ 1.6B (20250215) |
| --- | :---: |
| HumanEval | 24.39 |
| MBPP | 26.60 |
| GSM8K | 50.19 |
| MATH | 15.06 |
| C-Eval | 47.19 |
| CMMLU | 43.76 |
| MMLU | 45.52 |
| BBH | 35.28 |

## Citation

If you find our work helpful, please consider citing us:

```bibtex
@inproceedings{he2025rodimus,
  title={Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions},
  author={Zhihao He and Hang Yu and Zi Gong and Shizhan Liu and Jianguo Li and Weiyao Lin},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=IIVYiJ1ggK}
}
```
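
## Streaming Generation (Sketch)

As a minimal sketch beyond the `generate` example above, output can also be streamed token by token. This assumes `RodimusForCausalLM` follows the standard Hugging Face `GenerationMixin` interface (so `generate` accepts a `streamer` argument) and uses `transformers.TextStreamer`; `model_path` is a placeholder for your local checkpoint directory.

```python
import torch
from transformers import TextStreamer

from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# load model and tokenizer (same setup as the example above)
ckpt_dir = "model_path"  # placeholder: path to the downloaded checkpoint
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda"
).eval()

# stream decoded tokens to stdout as they are produced,
# assuming generate() supports the standard `streamer` argument
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "你好!你是谁?"  # "Hello! Who are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_length=64, streamer=streamer)
```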