thellumi/LLuMi_Think_8B · Hugging Face

Model Information

The LLuMi multilingual large language model (LLM) is an instruction-tuned, text-in/text-out generative model at the 70B scale, built on Llama 3.3. LLuMi builds upon this foundation by incorporating additional refinements and distillation techniques inspired by the DeepSeek-R1 framework. The result is a model that retains the original strengths of Llama 3.3 while delivering improved performance and efficiency for real-world applications. LLuMi exhibits sophisticated chain-of-thought behaviors, improved self-verification, and reduced issues such as repetition and language mixing.

research@llumi.tech

Model Release Date:

LLuMi Think LLM Family: February 24, 2025

1. Introduction

We introduce LLuMi, a state-of-the-art multilingual large language model (LLM) built on the robust Llama 3.3 70B architecture. LLuMi is instruction tuned to excel in real-world applications, particularly in multilingual dialogue and complex reasoning tasks.

Leveraging advanced refinements and distillation techniques inspired by the DeepSeek-R1 framework, LLuMi not only retains the core strengths of its Llama 3.3 foundation but also delivers enhanced performance and efficiency. By integrating large-scale reinforcement learning directly on the base model, LLuMi exhibits sophisticated chain-of-thought behaviors, improved self-verification, and reduced issues such as repetition and language mixing.

To support the research community and foster further innovation, we are releasing the full LLuMi model along with a range of distilled checkpoints across various sizes. This initiative empowers researchers to deploy both the complete model and resource-efficient distilled versions for diverse applications.

NOTE: Before deploying LLuMi locally, please review the How to use and Usage Recommendations sections for detailed guidelines and best practices.

Distillation: Unlocking the Power of Smaller Models

We demonstrate that the advanced reasoning patterns of larger models can be distilled into smaller, more efficient models. This approach yields better performance than the reasoning strategies obtained by applying reinforcement learning directly to smaller models. The open-source DeepSeek-R1 framework and its API play a crucial role in enabling the research community to distill and develop even more powerful smaller models in the future.
Leveraging the rich reasoning data generated by DeepSeek-R1, we fine-tuned LLuMi—a dense, instruction-tuned model built upon the Llama 3.3 70B architecture. Our evaluation results show that the distilled LLuMi model performs exceptionally well on various benchmarks, often matching or even surpassing the performance of larger models.
Furthermore, we are excited to open-source the full LLuMi model along with a series of distilled checkpoints across multiple sizes—including 3B, 8B, and 70B—based on the LLuMi framework. This initiative provides the research community with access to both the complete model and its distilled versions, enabling a wide range of applications with varying computational needs.

Post-Training: Large-Scale Reinforcement Learning on the Base Model

We directly apply reinforcement learning (RL) to the base LLuMi model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach enables LLuMi to explore advanced chain-of-thought (CoT) capabilities for tackling complex problems, leading to enhanced self-verification, reflective reasoning, and the generation of extended CoTs. Notably, LLuMi is among the first open research initiatives to demonstrate that the reasoning capabilities of large language models can be effectively incentivized purely through RL, without the need for an initial SFT phase. This breakthrough paves the way for future advancements in scalable reinforcement learning strategies for LLMs.

We introduce our comprehensive pipeline for developing LLuMi, inspired by DeepSeek-R1, which includes:

Two RL Stages: Designed to discover improved reasoning patterns and align the model with human preferences.
Two SFT Stages: Serving as the foundational seed for both the model’s reasoning and non-reasoning capabilities.

We believe this innovative pipeline will not only enhance LLuMi's performance but also benefit the industry by inspiring the creation of more robust and efficient models.

2. Model Distillation and GRPO-Based Thinking Enhancement

The LLuMi 70B model has been meticulously developed using the advanced techniques of DeepSeek-R1 Distill Llama 3.3 70B. By leveraging state-of-the-art distillation methods, LLuMi 70B not only retains the powerful multilingual and instruction-tuned capabilities of its foundation but also achieves enhanced performance and efficiency for diverse real-world applications.

Furthermore, inspired by the successes of DeepSeek-R1, we have given our smaller LLuMi 8B and 3B models a thinking capability through the use of GRPO (Group Relative Policy Optimization). This approach endows the smaller models with sophisticated chain-of-thought reasoning and reflective problem-solving abilities, so that even with fewer parameters they can deliver agile and context-aware responses.
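
This card does not describe the GRPO training setup. As a rough illustration only, and assuming GRPO here refers to the group-relative scheme used in the DeepSeek-R1 work, the central step scores several sampled responses to the same prompt and normalizes each reward against the group's mean and standard deviation. The function name, the reward values, and the choice of the population standard deviation below are assumptions made for illustration, not LLuMi's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one group of samples.

    Hypothetical helper: each advantage is (reward - group mean) / group std.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; this choice is an assumption
    if sigma < eps:
        # All responses scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Four sampled answers to one prompt, rewarded 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group average receive a positive advantage and the rest a negative one, which is why this style of training needs no separate value model.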

Together, these advancements underscore our commitment to creating a versatile family of models that scale seamlessly from 3B to 70B, providing powerful solutions tailored to various computational and application needs.

3. Model Downloads

LLuMi Think Models

Model	Base Model	Download
LLuMi Think 3B	Qwen2.5-3B-Instruct	🤗 HuggingFace
LLuMi Think 8B	Llama-3.1-8B-Instruct	🤗 HuggingFace
LLuMi Think 70B	Llama-3.3-70B-Instruct	🤗 HuggingFace

4. How to use

This repository contains three versions of the LLuMi Think models, for use with transformers and with bitsandbytes.

Use with transformers

With transformers >= 4.48.3, you can run conversational inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.

Make sure to update your transformers installation via pip install --upgrade transformers.

See the snippet below for usage with Transformers:

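import transformers
import torch

model_id = "thellumi/LLuMi_Think_70B"

# Build a text-generation pipeline; the weights are loaded in bfloat16 and
# device_map="auto" places them on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input with a single user turn.
messages = [
    {"role": "user", "content": "Why are tomatoes red?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
# The last entry of generated_text is the assistant's reply message.
print(outputs[0]["generated_text"][-1])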

Use bitsandbytes

The model checkpoints can be loaded in 8-bit or 4-bit precision for further memory savings using bitsandbytes and transformers.

See the snippet below for usage:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "thellumi/LLuMi_Think_70B"
# Quantize the weights to 8-bit at load time.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

quantized_model = AutoModelForCausalLM.from_pretrained(
  model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)

tokenizer = AutoTokenizer.from_pretrained(model_id)
input_text = "Why are tomatoes red?"
# Tokenize the prompt and move the tensors to the GPU.
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)

print(tokenizer.decode(output[0], skip_special_tokens=True))

To load in 4-bit, pass load_in_4bit=True to BitsAndBytesConfig instead.
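
To give a sense of what the lower-precision options save, here is a back-of-the-envelope sketch of the memory needed for the weights alone. It assumes the nominal parameter counts implied by the model names and ignores activations, the KV cache, and quantization overhead, so real usage will be higher. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter counts taken from the model names (assumed, not measured).
NOMINAL_PARAMS = {"3B": 3e9, "8B": 8e9, "70B": 70e9}

for name, n in NOMINAL_PARAMS.items():
    row = {bits: weight_memory_gb(n, bits) for bits in (16, 8, 4)}
    print(f"{name}: bf16 ~{row[16]:.1f} GB, 8-bit ~{row[8]:.1f} GB, 4-bit ~{row[4]:.1f} GB")
```

Under these assumptions the 70B weights take roughly 140 GB in bfloat16, 70 GB in 8-bit, and 35 GB in 4-bit.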

5. Usage Recommendations

We recommend adhering to the following configurations, originally recommended for the DeepSeek-R1 series models, when using LLuMi Think models (including for benchmarking) to achieve the expected performance:

Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
Avoid adding a system prompt; all instructions should be contained within the user prompt.
For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, DeepSeek has observed that the DeepSeek-R1 series models tend to bypass the thinking pattern (i.e., output "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend forcing the model to begin every response with "<think>\n".
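
The recommendations in this section can be sketched as a few small helpers. This is a minimal illustration, not an official LLuMi utility: the function and constant names are hypothetical, and the placeholder prompt string stands in for whatever your chat template produces.

```python
MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."
THINK_PREFIX = "<think>\n"

# Sampling settings from the recommendations (0.6 lies within 0.5-0.7).
GENERATION_KWARGS = {"do_sample": True, "temperature": 0.6}


def build_messages(question, is_math=False):
    """Return a chat with a single user turn and no system prompt."""
    content = f"{question}\n{MATH_DIRECTIVE}" if is_math else question
    return [{"role": "user", "content": content}]


def force_think_prefix(prompt_text):
    """Append the opening think tag to an already-templated prompt string."""
    if prompt_text.endswith(THINK_PREFIX):
        return prompt_text
    return prompt_text + THINK_PREFIX


messages = build_messages("What is 12 * 13?", is_math=True)
print(messages)
print(repr(force_think_prefix("<templated prompt>")))
```

With transformers, the templated string would typically come from tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True). This card does not say whether the model's chat template already appends the think tag, so it is worth inspecting the template output before adding the prefix.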

6. Training Data

Overview:
LLuMi is built upon the robust Llama 3.3 architecture, which was pretrained on approximately 15 trillion tokens sourced from publicly available datasets. For fine-tuning, LLuMi leverages a combination of publicly available instruction datasets and over 10 million examples sourced from Hugging Face. This comprehensive training corpus has been curated to ensure high performance across various languages, with dedicated support for Turkish and other languages.

Data Freshness:
The pretraining data includes content up to a cutoff date of Aug. 2024, ensuring that LLuMi is aligned with recent language trends and developments.

7. Benchmarks

Model	AIME 2024 pass@1	AIME 2024 cons@64	MATH-500 pass@1	GPQA Diamond pass@1	LiveCodeBench pass@1	CodeForces rating
Claude-3.5-Sonnet-1022	16.0	26.7	78.3	65.0	38.9	717
OpenAI o1-1217	79.2	-	96.4	75.7	63.4	2061
OpenAI o1-mini	63.6	80.0	90.0	60.0	53.8	1820
OpenAI GPT-4o-0513	9.3	13.4	74.6	49.9	32.9	759
QwQ-32B-Preview	44.0	60.0	90.6	54.5	41.9	1316
DeepSeek R1	79.8	-	97.3	71.5	65.9	2209
LLuMi Think 70B	69.3	86.4	94.1	64.8	56.9	1603

Note on Benchmark Results: Due to hardware limitations, full-scale benchmark tests could not be performed, and the results may vary. We remain fully transparent about these constraints and are actively working towards securing the necessary resources to conduct comprehensive evaluations in the near future.
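
The card does not define the pass@1 and cons@64 columns. Assuming they follow the convention used in the DeepSeek-R1 evaluation (pass@1 as the average correctness over several sampled responses per problem, cons@64 as the correctness of the majority-vote answer over 64 samples), the per-problem scores could be computed as in the sketch below. The function names and the sample data are made up for illustration.

```python
from collections import Counter


def pass_at_1(sample_correct):
    """Fraction of sampled responses that are correct for one problem."""
    return sum(sample_correct) / len(sample_correct)


def cons_at_k(sample_answers, reference):
    """1.0 if the most common sampled answer equals the reference, else 0.0."""
    majority_answer, _ = Counter(sample_answers).most_common(1)[0]
    return 1.0 if majority_answer == reference else 0.0


# Made-up data: four sampled answers to a problem whose reference answer is "42".
answers = ["42", "41", "42", "42"]
correct = [a == "42" for a in answers]
print(pass_at_1(correct))        # 0.75
print(cons_at_k(answers, "42"))  # 1.0
```

A benchmark-level score would then be the mean of these per-problem values across the test set, expressed as a percentage.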

8. Responsibility & Safety

At LLuMi, we are committed to promoting responsible and ethical use of our technology. We recognize that large language models carry inherent risks and potential for misuse, and we have taken several measures to mitigate these challenges:

Bias Mitigation: We have implemented various strategies during training to minimize biases in model outputs. However, users should be aware that, despite these efforts, occasional biases or unintended outputs may still occur.
Usage Guidelines: LLuMi is designed for research and responsible deployment. We strongly encourage users to adhere to ethical guidelines, applicable laws, and best practices when using the model. Generating harmful, misleading, or offensive content is strictly prohibited.
Safety Measures: Users deploying LLuMi in real-world applications should implement additional safety filters and monitoring mechanisms. We recommend regular audits and evaluations to ensure that the model’s outputs remain within acceptable ethical boundaries.
Community Engagement: We invite the community to provide feedback on any safety or ethical issues encountered during usage. This collaborative approach is vital for continuously refining the model and addressing potential risks.
Transparency and Accountability: By open-sourcing LLuMi, we aim to foster transparency and accountability. We commit to ongoing research and updates focused on improving the model's safety and ethical performance.

By using LLuMi, you agree to follow these guidelines and contribute to a safer, more responsible AI ecosystem.

9. License

This code repository and the model weights are licensed under the MIT License. The LLuMi Think series supports commercial use and allows any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that:

LLuMi Think 3B is derived from Qwen-2.5-3B, which is originally licensed under the Apache 2.0 License.
LLuMi Think 8B is derived from Llama3.1-8B-Instruct, which is originally licensed under the llama3.1 license.
LLuMi Think 70B is derived from Llama3.3-70B-Instruct, which is originally licensed under the llama3.3 license.

10. Citation

@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
      title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning}, 
      author={DeepSeek-AI},
      year={2025},
      eprint={2501.12948},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.12948}, 
}

@misc{thellumi,
  author = {The Lucy},
  month = feb,
  title = {{LLuMi Think}},
  howpublished = {https://llumi.tech},
  year = {2025}
}

11. Contact

If you have any questions, please raise an issue or contact us at research@llumi.tech.
