GSM8K Evaluation Result: 84.5 vs. 76.95

#81
by tanliboy - opened

In the Llama 3.1 technical report, Llama-3.1-8B was evaluated with an 84.5 score on the GSM8K benchmark. However, when I evaluate with lm-evaluation-harness

accelerate launch -m lm_eval --model hf     --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct   --tasks gsm8k  --batch_size auto

I got the following result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7695|±  |0.0116|
|     |       |strict-match    |     5|exact_match|↑  |0.7521|±  |0.0119|

There seems to be a significant discrepancy. Am I missing something in the evaluation settings?

@tanliboy Did you ensure that it is set to num_fewshot=8

@Orenguteng thanks for pointing it out! The above result was actually 5-shot.

I corrected it with a new run and got the result below:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.7779|±  |0.0114|
|     |       |strict-match    |     8|exact_match|↑  |0.7672|±  |0.0116|

It is better than 5-shot, but there is still a wide gap.

@wukaixingxp Any thoughts on the difference?

Please check my readme about reproducing the huggingface leaderboard evaluation. Basically, you need to checkout the right branch under their fork and use --apply_chat_template --fewshot_as_multiturn for the instruct model. I used the command accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --apply_chat_template --fewshot_as_multiturn --log_samples --output_path eval_results --tasks gsm8k --batch_size 4, and got this result which is closer to our reported number 84.5:

hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8234|±  |0.0105|
|     |       |strict-match    |     5|exact_match|↑  |0.7968|±  |0.0111|

I think the difference (0.79 vs 0.85) can come from different prompting style and n-shot (5 vs 8). I just found there is a gsm8k-cot-llama.yaml created by community user that follows our style. While this is not an official Meta implementation, but I got a closer result, my command was accelerate launch -m lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16 --apply_chat_template --fewshot_as_multiturn --log_samples --output_path eval_results --tasks gsm8k_cot_llama --batch_size 4

hf (pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 4
|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      3|flexible-extract|     8|exact_match|↑  |0.8544|±  |0.0097|
|               |       |strict-match    |     8|exact_match|↑  |0.8514|±  |0.0098|

Let me know if you have any more questions!

Thank you, @wukaixingxp !

I think the difference (0.79 vs 0.85) can come from different prompting style and n-shot (5 vs 8).

Would you mind elaborating more on the different prompting style? Which prompting style should we use for Llama and how would the requirement different from other models?

I tested with your above commands and ran into an error

ValueError: If fewshot_as_multiturn is set, num_fewshot must be greater than 0.

But after setting it with 8 shots, I can see a similar result in your report. It seems --apply_chat_template --fewshot_as_multiturn is critical here.

Another significant gap I saw was about the ifeval evaluation result:

accelerate launch -m lm_eval --model hf     --model_args pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct   --tasks ifeval  --batch_size 32

|Tasks |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|------|------:|------|-----:|-----------------------|---|-----:|---|------|
|ifeval|      2|none  |     0|inst_level_loose_acc   |↑  |0.6223|±  |   N/A|
|      |       |none  |     0|inst_level_strict_acc  |↑  |0.5935|±  |   N/A|
|      |       |none  |     0|prompt_level_loose_acc |↑  |0.4843|±  |0.0215|
|      |       |none  |     0|prompt_level_strict_acc|↑  |0.4455|±  |0.0214|

The score is very low compared to the reported result (80.4), whereas the Gemma-2-9b-it can achieve 76 with the same setting.

Please follow the open_llm_leaderboard reproducibility section to install the correct version and this will solve your ValueError: If fewshot_as_multiturn is set, num_fewshot must be greater than 0. error. --apply_chat_template --fewshot_as_multiturn is required as the instruct model needs the chat_template to work. I think main difference between gsm8k and gsm8k-cot-llama is the doc_to_text config, which defines the prompt style, as show here: gsm8k-cot-llama VS gsm8k. Please compare those two yaml files to understand diff details.

Thank you, @wukaixingxp !

@wukaixingxp The result, 84.5, (Table 2) in the technical report is for Llama 3 8B, not for Llama 3 8B Instruct. I think there is still discrepancy with the numbers in the above reply here: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/81#66ccde9250c670b0ff5d49d6

Meta Llama org

I believe gsm8k 84.5 is for Llama 3.1 8B Instruct, 80.6 is for Llama 3 8B Instruct, please check our model card

I've done quite a lot of fine-tuning attempts with this model, but one issue that keeps troubling me is the significant drop in IFEVAL scores each time I fine-tune.
So far, I haven’t found a dataset or method that allows me to retain the IFEVAL score while fine-tuning.
Do you have any suggestions or insights on how to address this?

@tanliboy It's very very hard to tune on top of a instruct model and retain its intelligence, all about tuning parameters, dataset end methods used - but its doable. You can see in my research and check https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard for comparison :

image.png

Keep lookout I will post more about results in the future as im currently researching this area.

@Orenguteng it is great to know that you retained the IFEVAL score while improving GPQA.
Any suggestions/insights on this dimension? Have you incorporated the FLAN collections (https://huggingface.co/datasets/Open-Orca/FLAN)?

Also, do you know how I can reproduce these score in the leaderboard dashboard?

I tried with

git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout adding_all_changess
accelerate launch -m lm_eval --model_args pretrained=<model>,dtype=bfloat16  --log_samples --output_path eval_results --tasks leaderboard  --batch_size 4 --apply_chat_template --fewshot_as_multiturn

following the instruction in the page, but I got.

Base:

|         Groups         |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|------------------------|-------|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard             |N/A    |none  |     0|acc                    |↑  |0.3782|±  |0.0044|
|                        |       |none  |     0|acc_norm               |↑  |0.4617|±  |0.0054|
|                        |       |none  |     0|exact_match            |↑  |0.1707|±  |0.0098|
|                        |       |none  |     0|inst_level_loose_acc   |↑  |0.8441|±  |N/A   |
|                        |       |none  |     0|inst_level_strict_acc  |↑  |0.8106|±  |N/A   |
|                        |       |none  |     0|prompt_level_loose_acc |↑  |0.7782|±  |0.0179|
|                        |       |none  |     0|prompt_level_strict_acc|↑  |0.7320|±  |0.0191|
| - leaderboard_bbh      |N/A    |none  |     3|acc_norm               |↑  |0.5070|±  |0.0063|
| - leaderboard_gpqa     |N/A    |none  |     0|acc_norm               |↑  |0.2894|±  |0.0131|
| - leaderboard_math_hard|N/A    |none  |     4|exact_match            |↑  |0.1707|±  |0.0098|
| - leaderboard_musr     |N/A    |none  |     0|acc_norm               |↑  |0.3876|±  |0.0171|

My Fine-tuning:

|         Groups         |Version|Filter|n-shot|        Metric         |   |Value |   |Stderr|
|------------------------|-------|------|-----:|-----------------------|---|-----:|---|------|
|leaderboard             |N/A    |none  |     0|acc                    |↑  |0.3526|±  |0.0044|
|                        |       |none  |     0|acc_norm               |↑  |0.4573|±  |0.0054|
|                        |       |none  |     0|exact_match            |↑  |0.1110|±  |0.0084|
|                        |       |none  |     0|inst_level_loose_acc   |↑  |0.7326|±  |N/A   |
|                        |       |none  |     0|inst_level_strict_acc  |↑  |0.7050|±  |N/A   |
|                        |       |none  |     0|prompt_level_loose_acc |↑  |0.6322|±  |0.0208|
|                        |       |none  |     0|prompt_level_strict_acc|↑  |0.6026|±  |0.0211|
| - leaderboard_bbh      |N/A    |none  |     3|acc_norm               |↑  |0.4956|±  |0.0063|
| - leaderboard_gpqa     |N/A    |none  |     0|acc_norm               |↑  |0.2919|±  |0.0132|
| - leaderboard_math_hard|N/A    |none  |     4|exact_match            |↑  |0.1110|±  |0.0084|
| - leaderboard_musr     |N/A    |none  |     0|acc_norm               |↑  |0.4259|±  |0.0176

The score is quite different from the score reporeted in the leadership board page.

@tanliboy Just showcasing that it is possible, I'm unfortunately not able to provide details around my training, but I can give you 2 insights:

1: No additional knowledge was further trained upon it in my case, it was a custom made dataset only made for alignment research purposes. Biases and alignments in LLM is proven to "dumb" down the model and this is what I'm showcasing. Which will even be better for future releases. No contamination in the dataset for eval results neither whatsoever therefore.

2: It's all about parameter tuning as well as the quality of your dataset etc. Aim for high quality, not quantity. One bad entry can leave traces and contaminate the whole outcome if you are "unlucky" and the model catches on it.

Also what i've noticed is that most tunes get most affected in the math evals, unless they contaminate their training with eval data.

Thanks, @Orenguteng !
Could I know if your fine-tuning result is SFT only or does it include preference alignment (like DPO or RLHF/RLAIF)?

@tanliboy I'm using a custom built framework, with different methods combined, all customized. Can't provide more details unfortunately but it's not impossible, my initial tune had worse results for math (3.1 V1) the V2 improved. V3 will be better.

@Orenguteng I understand. Looking forward to your V3 result.

Sign up or log in to comment