Confucius-o1-14B

Introduction

Confucius-o1-14B is a o1-like reasoning model developed by the NetEase Youdao Team, it can be easily deployed on a single GPU without quantization. This model is based on the Qwen2.5-14B-Instruct model and adopts a two-stage learning strategy, enabling the lightweight 14B model to possess thinking abilities similar to those of o1. What sets it apart is that after generating the chain of thought, it can summarize a step-by-step problem-solving process from the chain of thought on its own. This can prevent users from getting bogged down in the complex chain of thought and allows them to easily obtain the correct problem-solving ideas and answers. Experience the Confucius-o1-14B demo right now!

Optimization Methods

Selection of the Base Model: Our open-source model is based on the Qwen2.5-14B-Instruct model. We chose this model as the starting point for optimization because it can be deployed on a single GPU and has powerful basic capabilities. Our internal experiments show that starting from a base model with stronger mathematical capabilities and going through the same optimization process can result in an o1-like model with stronger reasoning capabilities.

Two-stage Learning: Our optimization process is divided into two stages. In the first stage, the model learns from a larger o1-like teacher model. This is the most effective way to enable a small model to efficiently master the o1 thinking pattern. In the second stage, the model conducts self-iterative learning to further enhance its reasoning ability. On our internal evaluation dataset, these two stages bring about a performance improvement of approximately 10 points and 6 points respectively.

Data Formatting: Different from general o1-like models, our model is designed for applications in the education field. Therefore, we expect the model not only to output the final answer but also to provide a step-by-step problem-solving process based on the correct thinking process in the chain of thought. To this end, we standardize the output format of the model as follows: the chain-of-thought process is output in the <thinking></thinking> block, and then the step-by-step problem-solving process is summarized in the <summary></summary> block.

More Stringent Data filtering: To ensure the quality of the learning data, we not only evaluate the correctness of the final answer but also examine the accuracy of the explanation process in the entire summary. This is achieved through an automated evaluation methods developed internally, which can effectively prevent the model from learning false positives.

Selection of Training Instructions: The training instruction data we used was sampled from an internal training dataset. We made this data selection because our optimization is mainly targeted at applications in the education field. To efficiently verify the effectiveness, the sample size is only 6,000 records, mainly covering non-graphical mathematical problems in K12 scenarios, and there is no overlap with the training data of the benchmark test set. It has been proven that with just a small amount of data, it is possible to transform a general-purpose model into a Chain of Thought model with reasoning capabilities similar to those of o1.

Evaluation and Results

alt text

Note: The results marked with * are directly obtained from the data provided by the respective model/interface provider, and the other results are from our evaluation.

Limitations

However, there are some limitations that must be stated in advance:

  1. Scenario Limitations: Our optimization is only carried out on data from the K12 mathematics scenario, and the effectiveness has only been verified in math-related benchmark tests. The performance of the model in non-mathematical scenarios has not been tested, so we cannot guarantee its quality and effectiveness in other fields.
  2. Language-related Issues: In the โ€œsummaryโ€ block, the model has a stronger tendency to generate Chinese content. In the โ€œthinkingโ€ block, the model may reason in an unexpected language environment or even present a mixture of languages. However, this does not affect the actual reasoning ability of the model. This indicates that the chain of thought itself may not have independent value, it is merely an easier-to-learn path leading to a correct summary.
  3. Invalid Results: The model may sometimes fall into circular reasoning. Since we use explicit identifiers to divide the thinking and summary parts, when the model enters this mode, it may generate invalid results that cannot be parsed.
  4. Safety and Ethics: This model has not undergone optimization and testing for alignment at the safety and ethical levels. Any output generated by the model does not represent the official positions, views, or attitudes of our company. When using this model, users should independently judge and evaluate the rationality and applicability of the output content and comply with relevant laws, regulations, and social ethics.

Quickstart

The environmental requirements for running it are exactly the same as those of the Qwen2.5-14B-Instruct model. Therefore, you can easily use Transformers or vLLM to load and run the model for inference, and deploy your services.

The only thing you need to pay attention to is to use the predefined system message and user message templates provided below to request the model. Other templates may also be usable, but we haven't tested them yet.

SYSTEM_PROMPT_TEMPLATE = """ไฝ ๅซ"ๅฐP่€ๅธˆ"๏ผŒๆ˜ฏไธ€ไฝ็”ฑ็ฝ‘ๆ˜“ๆœ‰้“ใ€Œๅญๆ›ฐใ€ๆ•™่‚ฒๅคงๆจกๅž‹ๅˆ›ๅปบ็š„AIๅฎถๅบญๆ•™ๅธˆใ€‚
ๅฐฝไฝ ๆ‰€่ƒฝๅ›ž็ญ”ๆ•ฐๅญฆ้—ฎ้ข˜ใ€‚

!!! ่ฏท่ฎฐไฝ๏ผš
- ไฝ ๅบ”่ฏฅๅ…ˆ้€š่ฟ‡ๆ€่€ƒๆŽข็ดขๆญฃ็กฎ็š„่งฃ้ข˜ๆ€่ทฏ๏ผŒ็„ถๅŽๆŒ‰็…งไฝ ๆ€่€ƒ่ฟ‡็จ‹้‡Œๆญฃ็กฎ็š„่งฃ้ข˜ๆ€่ทฏๆ€ป็ป“ๅ‡บไธ€ไธชๅŒ…ๅซ3-5ๆญฅ่งฃ้ข˜่ฟ‡็จ‹็š„ๅ›ž็ญ”ใ€‚

ๆ€่€ƒ่ฟ‡็จ‹็š„ไธ€ไบ›ๅ‡†ๅˆ™๏ผš
- ่ฟ™ไธชๆ€่€ƒ่ฟ‡็จ‹ๅบ”่ฏฅๅ‘ˆ็Žฐๅ‡บไธ€็งๅŽŸๅง‹ใ€่‡ช็„ถไธ”ๆ„่ฏ†ๆต็š„็Šถๆ€๏ผŒๅฐฑๅฆ‚ๅŒไฝ ๅœจ่งฃ้ข˜ๆ—ถๅ†…ๅฟƒ็š„็‹ฌ็™ฝไธ€ๆ ท๏ผŒๅ› ๆญคๅฏไปฅๅŒ…ๅซไธ€ไบ›ๅ–ƒๅ–ƒ่‡ช่ฏญใ€‚
- ๅœจๆ€่€ƒๅˆๆœŸ๏ผŒไฝ ๅบ”่ฏฅๅ…ˆๆŒ‰่‡ชๅทฑ็š„็†่งฃ้‡่ฟฐ้—ฎ้ข˜๏ผŒ่€ƒ่™‘้—ฎ้ข˜ๆš—ๅซ็š„ๆ›ดๅนฟๆณ›็š„่ƒŒๆ™ฏไฟกๆฏ๏ผŒๅนถๆขณ็†ๅ‡บๅทฒ็Ÿฅๅ’Œๆœช็Ÿฅ็š„ๅ…ƒ็ด ๏ผŒๅŠๅ…ถไธŽไฝ ๆ‰€ๅญฆ็Ÿฅ่ฏ†็š„ไธ€ไบ›ๅ…ณ่”็‚น๏ผŒๅนถๅ‘ๆ•ฃๆ€็ปด่€ƒ่™‘ๅฏ่ƒฝๆœ‰ๅ‡ ็งๆฝœๅœจ็š„่งฃ้ข˜ๆ€่ทฏใ€‚
- ๅฝ“ไฝ ็กฎๅฎšไบ†ไธ€ไธช่งฃ้ข˜ๆ€่ทฏๆ—ถ๏ผŒไฝ ๅบ”่ฏฅๅ…ˆ้€ๆญฅๆŒ‰้ข„ๆƒณ็š„ๆ€่ทฏๆŽจ่ฟ›๏ผŒไฝ†ๆ˜ฏไธ€ๆ—ฆไฝ ๅ‘็Žฐ็Ÿ›็›พๆˆ–่€…ไธ็ฌฆๅˆ้ข„ๆœŸ็š„ๅœฐๆ–น๏ผŒไฝ ๅบ”่ฏฅๅŠๆ—ถๅœไธ‹ๆฅ๏ผŒๆๅ‡บไฝ ็š„่ดจ็–‘๏ผŒ่ฎค็œŸ้ชŒ่ฏ่ฏฅๆ€่ทฏๆ˜ฏๅฆ่ฟ˜ๅฏไปฅ็ปง็ปญใ€‚
- ๅฝ“ไฝ ๅ‘็Žฐไธ€ไธชๆ€่ทฏๅทฒ็ปไธๅฏ่กŒๆ—ถ๏ผŒไฝ ๅบ”่ฏฅ็ตๆดปๅˆ‡ๆขๅˆฐๅ…ถไป–ๆ€่ทฏไธŠ็ปง็ปญๆŽจ่ฟ›ไฝ ็š„ๆ€่€ƒใ€‚
- ๅฝ“ไฝ ๆŒ‰็…งไธ€ไธชๆ€่ทฏ็ป™ๅ‡บ็ญ”ๆกˆๅŽ๏ผŒๅˆ‡่ฎฐ่ฆไป”็ป†้ชŒ่ฏไฝ ็š„ๆฏไธ€ไธชๆŽจ็†ๅ’Œ่ฎก็ฎ—็ป†่Š‚๏ผŒ่ฟ™ๆ—ถๅ€™้€†ๅ‘ๆ€็ปดๅฏ่ƒฝๆœ‰ๅŠฉไบŽไฝ ๅ‘็Žฐๆฝœๅœจ็š„้—ฎ้ข˜ใ€‚
- ไฝ ็š„ๆ€่€ƒๅบ”่ฏฅๆ˜ฏ็ป†ๅŒ–็š„๏ผŒ้œ€่ฆๅŒ…ๆ‹ฌ่ฏฆ็ป†็š„่ฎก็ฎ—ๅ’ŒๆŽจ็†็š„็ป†่Š‚ใ€‚
- ๅŒ…ๅซ็š„ๅ–ƒๅ–ƒ่‡ช่ฏญๅบ”่ฏฅๆ˜ฏไธ€ไธชๅฃ่ฏญๅŒ–็š„่กจ่พพ๏ผŒ้œ€่ฆๅ’ŒไธŠไธ‹ๆ–‡่ฏญๅขƒๅŒน้…๏ผŒๅนถไธ”ๅฐฝ้‡ๅคšๆ ทๅŒ–ใ€‚

ๆ€ป็ป“็š„่งฃ้ข˜่ฟ‡็จ‹็š„ๆ ผๅผ่ฆๆฑ‚๏ผš
- ๆฑ‚่งฃ่ฟ‡็จ‹ๅบ”่ฏฅๅˆ†ไธบ3-5ๆญฅ๏ผŒๆฏไธชๆญฅ้ชคๅ‰้ข้ƒฝๆ˜Ž็กฎ็ป™ๅ‡บๆญฅ้ชคๅบๅท๏ผˆๆฏ”ๅฆ‚๏ผšโ€œๆญฅ้ชค1โ€๏ผ‰ๅŠๅ…ถๅฐๆ ‡้ข˜
- ๆฏไธชๆญฅ้ชค้‡Œๅช็ป™ๅ‡บๆ ธๅฟƒ็š„ๆฑ‚่งฃ่ฟ‡็จ‹ๅ’Œ้˜ถๆฎตๆ€ง็ญ”ๆกˆใ€‚
- ๅœจๆœ€ๅŽไธ€ไธชๆญฅ้ชค้‡Œ๏ผŒไฝ ๅบ”่ฏฅๆ€ป็ป“ไธ€ไธ‹ๆœ€็ปˆ็š„็ญ”ๆกˆใ€‚

่ฏทไฝฟ็”จไปฅไธ‹ๆจกๆฟใ€‚

<question>ๅพ…่งฃ็ญ”็š„ๆ•ฐๅญฆ้—ฎ้ข˜</question>

<thinking>
่ฟ™้‡Œ่ฎฐๅฝ•ไฝ ่ฏฆ็ป†็š„ๆ€่€ƒ่ฟ‡็จ‹
</thinking>
<summary>
ๆ นๆฎๆ€่€ƒ่ฟ‡็จ‹้‡Œๆญฃ็กฎ็š„่งฃ้ข˜่ทฏๅพ„ๆ€ป็ป“ๅ‡บ็š„๏ผŒๅŒ…ๅซ3-5ๆญฅ่งฃ้ข˜่ฟ‡็จ‹็š„ๅ›ž็ญ”ใ€‚
</summary>"""

USER_PROMPT_TEMPLATE = """็Žฐๅœจ๏ผŒ่ฎฉๆˆ‘ไปฌๅผ€ๅง‹ๅง๏ผ

<question>{question}</question>"""

Then you can create your messages as follows and use them to request model results. You just need to fill in your instructions in the "question" field.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "netease-youdao/Confucius-o1-14B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {'role': 'system', 'content': SYSTEM_PROMPT_TEMPLATE},
    {'role': 'user', 'content': USER_PROMPT_TEMPLATE.format(question=question)},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

After obtaining the model results, you can parse out the "thinking" and "summary" parts as follows.

def parse_result_nostep(result):
    thinking_pattern = r"<thinking>(.*?)</thinking>"
    summary_pattern = r"<summary>(.*?)</summary>"

    thinking_list = re.findall(thinking_pattern, result, re.DOTALL)
    summary_list = re.findall(summary_pattern, result, re.DOTALL)

    assert len(thinking_list) == 1 and len(summary_list) == 1, \
        f"The parsing results do not meet the expectations.\n{result}"

    thinking = thinking_list[0].strip()
    summary = summary_list[0].strip()
    return thinking, summary
    
thinking, summary = parse_result_nostep(response)

Citation

If you find our work helpful, feel free to give us a cite.

@misc{confucius-o1,
   author = {NetEase Youdao Team},
   title = {Confucius-o1: Open-Source Lightweight Large Models to Achieve Excellent Chain-of-Thought Reasoning on Consumer-Grade Graphics Cards.},
   url = {https://huggingface.co/netease-youdao/Confucius-o1-14B},
   month = {January},
   year = {2025}
 }
Downloads last month
222
Safetensors
Model size
14.8B params
Tensor type
BF16
ยท
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Model tree for netease-youdao/Confucius-o1-14B

Base model

Qwen/Qwen2.5-14B
Finetuned
(117)
this model
Merges
3 models
Quantizations
12 models

Spaces using netease-youdao/Confucius-o1-14B 4