AaryanK/Qwen_2.5_3B_GRPO_Reasoning_XIOSERV

A fine-tuned variant of Qwen 2.5 3B Instruct designed specifically for improved toggleable reasoning and instruction-following capabilities. This model has been built by engineers at xioserv.com and incorporates specialized modifications to enhance performance for structured reasoning tasks.

Overview

The AaryanK/Qwen_2.5_3B_Instruct_GRPO_Reasoning_XIOSERV model is a refined version of the Qwen 2.5 3B Instruct model. It is optimized to provide responses in a structured format, making it particularly useful for tasks requiring clear separation between reasoning steps and final answers.

Toggleable Reasoning Mode

If you include the system prompt, the model will explicitly separate reasoning and the final answer.
If you omit the system prompt, the model will respond naturally without structured reasoning.

This makes the model highly versatile, allowing users to choose between structured reasoning and direct responses based on their specific use case.

System Prompt

To enable structured reasoning, use the following system prompt:

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

If you do not include this prompt, the model will respond in a standard, conversational manner without explicitly separating reasoning from the final answer.

Methodology

To replicate the 'aha moment,' we employed Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), which enhances reasoning capabilities while optimizing memory usage. This approach aligns with the techniques outlined in the DeepSeekMath paper, where GRPO was instrumental in advancing reasoning in language models. By integrating GRPO with reinforcement learning, our model autonomously refines its problem-solving strategies, mirroring the self-reflective behavior observed in DeepSeek's R1.

Usage

We have provided GGUF files that can be run with llama.cpp for efficient inference.

To run the model with llama.cpp, follow the instructions in the llama.cpp repository.

Ensure that you include the system prompt in your input if you want structured reasoning output. Otherwise, the model will function like a standard instruct model.

Acknowledgements

xioserv.com – For the engineering efforts in fine-tuning this model.
Hugging Face – For providing an accessible platform to share and deploy models.

For any questions or contributions, please open an issue or submit a pull request on our GitHub repository.

Happy coding!