🧙🏼WISDOM
WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONER
🤗Datasets&Models@HF | 🐱 Code@GitHub
Figure 1: The overall workflow of WISDOM, which leverages Progressive Curriculum Synthesis to generate questions and responses with DeepSeek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving.
Main Results on Smaller Models
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
Mathstral | Mistral-7B | 83.3 | 54.3 | 36.7 | 22.4 | 82.8 | 26.3 | 12/40 | 1/30 |
KPMath-Plus | Mistral-7B | 82.1 | 46.8 | – | – | 66.4 | – | – | – |
DART-Math | Mistral-7B | 81.3 | 45.0 | 28.3 | 14.5 | 65.8 | 20.5 | 7/40 | 0/30 |
MAmmoTH2 | Mistral-7B | 67.4 | 34.2 | 31.0 | 9.8 | 26.8 | 26.7 | 6/40 | 1/30 |
MathScale | Mistral-7B | 58.5 | 33.2 | 22.0 | 7.8 | 73.3 | 18.1 | 6/40 | 1/30 |
WISDOM | Mistral-7B | 80.0 | 56.4 | 41.6 | 21.9 | 72.3 | 27.6 | 15/40 | 1/30 |
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
Llama3-instruct | Llama3-8B | 78.2 | 27.2 | 22.8 | 5.6 | 75.3 | 18.9 | 5/40 | 0/30 |
MetaMath | Llama3-8B | 80.5 | 32.6 | 19.3 | 6.7 | 54.1 | 13.3 | 6/40 | 0/30 |
DART-Math | Llama3-8B | 81.8 | 46.9 | 28.4 | 15.9 | 66.3 | 20.5 | 8/40 | 1/30 |
MAmmoTH2 | Llama3-8B | 69.6 | 33.4 | 32.3 | 8.1 | 43.8 | 29.7 | 7/40 | 0/30 |
MathScale | Llama3-8B | 70.8 | 34.6 | 22.5 | 9.0 | 74.3 | 18.9 | 2/40 | 1/30 |
WISDOM | Llama3-8B | 83.2 | 59.7 | 42.2 | 25.6 | 83.0 | 28.6 | 17/40 | 1/30 |
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
DSMath-instruct | DSMath-7B | 82.0 | 46.3 | 38.1 | 13.6 | 76.7 | 31.9 | 7/40 | 1/30 |
MetaMath | DSMath-7B | 76.5 | 37.2 | 27.3 | 10.7 | 67.1 | 13.9 | 10/40 | 0/30 |
KPMath-Plus | DSMath-7B | 83.9 | 48.8 | – | – | 78.7 | – | – | – |
DART-Math | DSMath-7B | 87.5 | 53.9 | 40.7 | 20.0 | 82.9 | 31.5 | 8/40 | 0/30 |
NuminaMath | DSMath-7B | 77.1 | 53.7 | 32.4 | 24.0 | 77.7 | 29.4 | 12/40 | 1/30 |
MathScale | DSMath-7B | 62.7 | 33.4 | 23.0 | 8.1 | 71.3 | 24.5 | 4/40 | 0/30 |
WISDOM | DSMath-7B | 83.3 | 62.4 | 45.0 | 28.9 | 85.7 | 34.9 | 11/40 | 2/30 |
Main Results on Larger Models
Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
---|---|---|---|---|---|---|---|---|---|
GPT-4o-0513 | – | 95.8 | 76.6 | – | – | – | – | – | 2/30 |
GPT-4-1106-preview | – | 91.4 | 64.3 | – | – | – | – | – | 1/30 |
Claude-3-Opus | – | 95.0 | 60.1 | – | – | – | – | – | 2/30 |
DeepSeek Coder V2 | – | 94.9 | 75.7 | – | – | – | – | – | 4/30 |
Llama3-instruct | Llama3-70B | 93.1 | 50.4 | 40.3 | 17.6 | 89.9 | 34.1 | 8/40 | 2/30 |
Qwen2-instruct | Qwen2-72B | 93.6 | 69.3 | 46.8 | 35.3 | 92.4 | 42.0 | 17/40 | 4/30 |
DART-Math | Llama3-70B | 89.8 | 55.7 | 37.9 | 21.0 | 80.9 | 28.2 | 13/40 | 1/30 |
KPMath-Plus | Qwen1.5-72B | 87.0 | 58.3 | – | – | 76.7 | – | – | – |
MetaMath | Llama3-70B | 88.0 | 44.9 | 31.9 | 11.6 | – | 21.9 | – | – |
NuminaMath | Qwen2-72B | 91.5 | 66.9 | 42.1 | 33.6 | 86.7 | 29.0 | 13/40 | 4/30 |
WISDOM | Llama3-70B | 94.1 | 68.2 | 43.4 | 34.4 | 91.8 | 41.4 | 22/40 | 3/30 |
WISDOM | Qwen2-72B | 94.2 | 76.1 | 47.6 | 39.1 | 94.5 | 45.4 | 23/40 | 2/30 |
† Short for College MATH.
Table 1: Main results on the in-domain benchmarks (GSM8K and MATH) and out-of-domain benchmarks (College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024). We select currently well-performing LLMs and evaluate their test accuracy on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.
Introduction
We introduce WISDOM, which draws inspiration from the human learning process and employs curriculum learning to progressively synthesize high-quality chain-of-thought (CoT) data, moving from easy problems to hard ones.
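The four stages named in Figure 1 can be pictured roughly as the loop below. This is a minimal, non-authoritative sketch: the teacher functions are hypothetical placeholders for API calls to the teacher models (GPT-4o, DeepSeek Coder V2), and the acceptance/routing logic is our assumption, not the paper's exact algorithm.

```python
from collections import Counter

# Hypothetical stand-ins for teacher-model calls; in the real pipeline these
# are API requests that return chain-of-thought solutions.
def weak_teacher(question: str) -> str:
    return "weak CoT answer"        # placeholder

def expert_teacher(question: str) -> str:
    return "expert CoT answer"      # placeholder

def evolve(question: str) -> str:
    return question + " [harder variant]"  # placeholder

def curriculum_round(questions, n_experts=3):
    """One easy-to-hard round: keep agreed-upon CoT pairs, evolve the rest."""
    kept, still_hard = [], []
    for q in questions:
        draft = weak_teacher(q)                               # weak teacher guiding
        votes = Counter(expert_teacher(q) for _ in range(n_experts))
        consensus, n_votes = votes.most_common(1)[0]          # experts consistency voting
        if draft == consensus or n_votes == n_experts:
            kept.append((q, consensus))                       # critical expert teaching accepted
        else:
            still_hard.append(q)
    evolved = [evolve(q) for q in still_hard]                 # hard instruction evolving
    return kept, evolved
```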
Template
All models were trained with the Alpaca prompt template:
Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
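For illustration, a minimal Python helper that wraps a question in this template before tokenization (the constant and function names are ours, not from the paper):

```python
# Build a prompt from the Alpaca template quoted above.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{question}\n\n### Response:"
)

def build_prompt(question: str) -> str:
    return ALPACA_TEMPLATE.format(question=question)

print(build_prompt("What is 12 * 13?"))
```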
Training Setup
Data Contamination
We applied a 10-gram hash deduplication method against the questions in both our in-domain and out-of-domain benchmarks, with the additional condition that the longest-common-sequence ratio must exceed 0.6. Any detected duplicates were removed.
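A minimal sketch of this check, assuming word-level 10-grams and using difflib's match ratio as a proxy for the longest-common-sequence ratio; the paper's exact tokenization and matching may differ:

```python
from difflib import SequenceMatcher

def ngrams(text: str, n: int = 10):
    """Set of word-level n-grams (tuples hash natively, giving the 10-gram hashes)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def match_ratio(a: str, b: str) -> float:
    """Similarity based on longest matching blocks (proxy for LCS ratio)."""
    return SequenceMatcher(None, a, b).ratio()

def is_contaminated(train_q: str, bench_q: str, threshold: float = 0.6) -> bool:
    # Flag a pair only if it shares a 10-gram AND the match ratio exceeds 0.6.
    return bool(ngrams(train_q) & ngrams(bench_q)) and match_ratio(train_q, bench_q) > threshold

def decontaminate(train_questions, benchmark_questions):
    return [q for q in train_questions
            if not any(is_contaminated(q, b) for b in benchmark_questions)]
```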
Training details
We employed LLaMA-Factory to fine-tune the entire suite of models and used sequence packing to accelerate training.
Training was conducted on 88 NVIDIA A800 GPUs with a per-device batch size of 1, gradient accumulation of 2, a sequence length of 8192, and bf16 precision. We optimized the models with the AdamW optimizer under a cosine learning-rate schedule with a warmup ratio of 0.03, training each model for 3 epochs. Learning rates were adjusted slightly per model: 1e-5 for Mistral-7B, 5e-5 for DeepSeekMath-7B, 4e-5 for Llama3-8B, and 2e-5 for both Llama3-70B and Qwen2-72B.
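For reference, the same hyperparameters expressed as Hugging Face TrainingArguments. This is only a sketch: the actual runs use a LLaMA-Factory config (with sequence packing handled on the data side), and the output path here is illustrative.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="wisdom-mistral-7b",      # illustrative path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=1e-5,                  # 5e-5 for DeepSeekMath-7B, 4e-5 for Llama3-8B,
                                         # 2e-5 for Llama3-70B / Qwen2-72B
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    optim="adamw_torch",                 # AdamW, as reported above
)
```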