
๐Ÿง™๐ŸผWISDOM

WISDOM: PROGRESSIVE CURRICULUM SYNTHESIS MAKES LLMS BETTER MATHEMATICAL REASONER

🤗 Datasets & Models @ HF | 🐱 Code @ GitHub

Figure 1: The overall workflow of WISDOM, which leverages Progressive Curriculum Synthesis to generate questions and responses with DeepSeek Coder V2 and GPT-4o, including weak teacher guiding, critical expert teaching, experts consistency voting, and hard instruction evolving.

Main Results on Smaller Models

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mathstral | Mistral-7B | 83.3 | 54.3 | 36.7 | 22.4 | 82.8 | 26.3 | 12/40 | 1/30 |
| KPMath-Plus | Mistral-7B | 82.1 | 46.8 | – | – | 66.4 | – | – | – |
| DART-Math | Mistral-7B | 81.3 | 45.0 | 28.3 | 14.5 | 65.8 | 20.5 | 7/40 | 0/30 |
| MAmmoTH2 | Mistral-7B | 67.4 | 34.2 | 31.0 | 9.8 | 26.8 | 26.7 | 6/40 | 1/30 |
| MathScale | Mistral-7B | 58.5 | 33.2 | 22.0 | 7.8 | 73.3 | 18.1 | 6/40 | 1/30 |
| WISDOM | Mistral-7B | 80.0 | 56.4 | 41.6 | 21.9 | 72.3 | 27.6 | 15/40 | 1/30 |

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama3-instruct | Llama3-8B | 78.2 | 27.2 | 22.8 | 5.6 | 75.3 | 18.9 | 5/40 | 0/30 |
| MetaMath | Llama3-8B | 80.5 | 32.6 | 19.3 | 6.7 | 54.1 | 13.3 | 6/40 | 0/30 |
| DART-Math | Llama3-8B | 81.8 | 46.9 | 28.4 | 15.9 | 66.3 | 20.5 | 8/40 | 1/30 |
| MAmmoTH2 | Llama3-8B | 69.6 | 33.4 | 32.3 | 8.1 | 43.8 | 29.7 | 7/40 | 0/30 |
| MathScale | Llama3-8B | 70.8 | 34.6 | 22.5 | 9.0 | 74.3 | 18.9 | 2/40 | 1/30 |
| WISDOM | Llama3-8B | 83.2 | 59.7 | 42.2 | 25.6 | 83.0 | 28.6 | 17/40 | 1/30 |

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DSMath-instruct | DSMath-7B | 82.0 | 46.3 | 38.1 | 13.6 | 76.7 | 31.9 | 7/40 | 1/30 |
| MetaMath | DSMath-7B | 76.5 | 37.2 | 27.3 | 10.7 | 67.1 | 13.9 | 10/40 | 0/30 |
| KPMath-Plus | DSMath-7B | 83.9 | 48.8 | – | – | 78.7 | – | – | – |
| DART-Math | DSMath-7B | 87.5 | 53.9 | 40.7 | 20.0 | 82.9 | 31.5 | 8/30 | 0/30 |
| NuminaMath | DSMath-7B | 77.1 | 53.7 | 32.4 | 24.0 | 77.7 | 29.4 | 12/40 | 1/30 |
| MathScale | DSMath-7B | 62.7 | 33.4 | 23.0 | 8.1 | 71.3 | 24.5 | 4/40 | 0/30 |
| WISDOM | DSMath-7B | 83.3 | 62.4 | 45.0 | 28.9 | 85.7 | 34.9 | 11/40 | 2/30 |

Main Results on Larger Models

| Method | Base | GSM8K | MATH | College† | Olympiad | TabMWP | TheoremQA | AMC2023 | AIME2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-0513 | – | 95.8 | 76.6 | – | – | – | – | – | 2/30 |
| GPT-4-1106-preview | – | 91.4 | 64.3 | – | – | – | – | – | 1/30 |
| Claude-3-Opus | – | 95.0 | 60.1 | – | – | – | – | – | 2/30 |
| DeepSeek Coder V2 | – | 94.9 | 75.7 | – | – | – | – | – | 4/30 |
| Llama3-instruct | Llama3-70B | 93.1 | 50.4 | 40.3 | 17.6 | 89.9 | 34.1 | 8/40 | 2/30 |
| Qwen2-instruct | Qwen2-72B | 93.6 | 69.3 | 46.8 | 35.3 | 92.4 | 42.0 | 17/40 | 4/30 |
| DART-Math | Llama3-70B | 89.8 | 55.7 | 37.9 | 21.0 | 80.9 | 28.2 | 13/40 | 1/30 |
| KPMath-Plus | Qwen1.5-72B | 87.0 | 58.3 | – | – | 76.7 | – | – | – |
| MetaMath | Llama3-70B | 88.0 | 44.9 | 31.9 | 11.6 | – | 21.9 | – | – |
| NuminaMath | Qwen2-72B | 91.5 | 66.9 | 42.1 | 33.6 | 86.7 | 29.0 | 13/40 | 4/30 |
| WISDOM | Llama3-70B | 94.1 | 68.2 | 43.4 | 34.4 | 91.8 | 41.4 | 22/40 | 3/30 |
| WISDOM | Qwen2-72B | 94.2 | 76.1 | 47.6 | 39.1 | 94.5 | 45.4 | 23/40 | 2/30 |

† Short for College MATH.

Table 1: Main results on the in-domain benchmarks (GSM8K and MATH) and on out-of-domain benchmarks, including College MATH, Olympiad, TabMWP, TheoremQA, AMC2023, and AIME2024. We select currently well-performing LLMs and evaluate their test accuracy on these benchmarks. Since KPMath-Plus is not open-sourced, its results are quoted from the corresponding paper.

Paper Introduction

We introduce WISDOM, which draws inspiration from the human learning process and employs curriculum learning to progressively synthesize high-quality chain-of-thought (CoT) data from easy to hard.

Template

All models were trained using the Alpaca template:

Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{question}\n\n### Response:
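For reference, here is a minimal sketch of how a question could be wrapped in this template before tokenization. The `build_prompt` helper and the sample question are illustrative only and are not part of the released code; the template string itself is copied verbatim from above.

```python
# Illustrative helper (not part of the release) for applying the Alpaca-style
# template shown above to a single math question.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{question}\n\n### Response:"
)

def build_prompt(question: str) -> str:
    """Wrap a question in the training/inference prompt format."""
    return ALPACA_TEMPLATE.format(question=question)

if __name__ == "__main__":
    # Hypothetical example question, just to show the resulting prompt layout.
    print(build_prompt("What is the remainder when 2^10 is divided by 7?"))
```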

Training Setup

Data Contamination

We applied a 10-gram hash deduplication method to the questions in both our in-domain and out-of-domain benchmarks, with the condition that the ratio of the longest common subsequence must exceed 0.6. Any detected duplicates were removed.
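A minimal sketch of this kind of decontamination check is shown below, assuming word-level 10-grams and normalizing the LCS length by the shorter question; the function names and these details are assumptions, and the actual pipeline may differ.

```python
# Hedged sketch of the 10-gram + LCS-ratio decontamination filter described
# above. Tokenization, the LCS normalization, and function names are assumptions.

def word_ngrams(text: str, n: int = 10) -> set:
    """Hashed word-level 10-grams, used as a fast candidate-overlap check."""
    tokens = text.lower().split()
    return {hash(tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)}

def lcs_ratio(a: str, b: str) -> float:
    """Word-level longest-common-subsequence length divided by the length of
    the shorter question (this normalization is an assumption)."""
    x, y = a.lower().split(), b.lower().split()
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x[i] == y[j] else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[m][n] / max(min(m, n), 1)

def is_contaminated(question: str, benchmark_questions: list[str]) -> bool:
    """Flag a synthesized question that shares a 10-gram with any benchmark
    question and has an LCS ratio above 0.6 against it; flagged questions
    would be removed from the training data."""
    grams = word_ngrams(question)
    for bench in benchmark_questions:
        if grams & word_ngrams(bench) and lcs_ratio(question, bench) > 0.6:
            return True
    return False
```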

Training details

We employed LLaMA-Factory to fine-tune the entire suite of models and used sequence packing to accelerate training.

Training was conducted on 88 NVIDIA A800 GPUs with a batch size of 1, gradient accumulation of 2, a sequence length of 8192, and bf16 precision. We optimized the models with the AdamW optimizer using a cosine learning rate schedule with a warmup ratio of 0.03, and trained each model for 3 epochs. The learning rate was adjusted slightly per model: 1e-5 for Mistral-7B, 5e-5 for DeepSeekMath-7B, 4e-5 for Llama3-8B, and 2e-5 for both Llama3-70B and Qwen2-72B.
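As a rough reference, the reported hyperparameters map onto Hugging Face `TrainingArguments` as sketched below for a Llama3-8B-style run. This is not the authors' actual LLaMA-Factory configuration: sequence packing and the 8192-token cutoff are handled by LLaMA-Factory rather than by these arguments, the output path is a placeholder, and "batch size 1" is interpreted here as per-device.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters (Llama3-8B values shown);
# packing and the 8192-token cutoff are configured in LLaMA-Factory itself.
training_args = TrainingArguments(
    output_dir="wisdom-llama3-8b",     # placeholder path
    per_device_train_batch_size=1,     # reported batch size of 1 (per-device assumed)
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=4e-5,                # 1e-5 / 5e-5 / 2e-5 for the other base models
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
)
```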

Model size: 8.03B params · Tensor type: BF16 (Safetensors)