MMLU-Pro

#2
by Marjovl - opened

I've been looking at this model for a while, and I must say, I've been impressed!

I'm wondering why this model is not as popular as I suspect it should be.

Would you mind also adding it to a few Leaderboards?

Like:

Thank you for your interest and advice!

The primary focus of this work is to demonstrate that the model trained with iLR-DPO improves over iterations without significantly altering response length, indicating better alignment with human values without length bias. Additionally, we observe no significant degradation in traditional NLP tasks. It is important to note that MMLU-Pro evaluates general ability and is not a benchmark for alignment.
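For context, a length-regularized DPO loss has roughly the following form (an illustrative, R-DPO-style sketch with a length-penalty coefficient $\alpha$; please refer to the paper for the exact iLR-DPO objective):

$$
\mathcal{L} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \alpha\,(|y_w| - |y_l|)\right)
$$

where $y_w$ and $y_l$ are the chosen and rejected responses and $|y|$ is response length in tokens, so the model cannot win preference comparisons simply by producing longer outputs.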

However, I will also test Storm-7B on MMLU-Pro. Thank you again.

We compute the MMLU-Pro accuracy of our Storm-7B and its base model. MMLU-Pro measures general ability, especially for pretrained models, using a 5-shot setting. However, for our model, the 5-shot examples are too long and sometimes exceed our context window. Therefore, we report the zero-shot accuracy below. The table shows that the "alignment tax" (i.e., the performance drop on traditional NLP tasks with ground-truth answers) for iLR-DPO on MMLU-Pro is still minor (34.7 -> 34.0). For reference, the zero-shot accuracy of Llama-3-8B is 31.5.

| Model | MMLU-Pro accuracy (0-shot) |
| --- | --- |
| Base model | 34.7 |
| Storm-7B | 34.0 |
| Llama-3-8B (reference) | 31.5 |
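To give a rough sense of the context-length issue, here is a minimal sketch that builds one 5-shot and one zero-shot MMLU-Pro prompt and compares their token counts. The dataset fields follow the public TIGER-Lab/MMLU-Pro release; the Storm-7B repo id and the prompt template are illustrative assumptions, not our exact evaluation harness:

```python
# Sketch: compare the token length of a 5-shot vs. zero-shot MMLU-Pro prompt.
# Repo id and prompt template are assumptions for illustration only.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("jieliu/Storm-7B")   # assumed HF repo id
mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro")
shots = mmlu_pro["validation"].select(range(5))           # 5 in-context examples
target = mmlu_pro["test"][0]

def fmt(ex, with_answer=True):
    letters = "ABCDEFGHIJ"                                 # MMLU-Pro uses up to 10 options
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(ex["options"]))
    ans = f"\nAnswer: {ex['answer']}" if with_answer else "\nAnswer:"
    return f"Question: {ex['question']}\n{opts}{ans}"

five_shot = "\n\n".join(fmt(s) for s in shots) + "\n\n" + fmt(target, with_answer=False)
zero_shot = fmt(target, with_answer=False)

print("5-shot tokens:", len(tok(five_shot).input_ids))    # often several thousand
print("0-shot tokens:", len(tok(zero_shot).input_ids))    # typically a few hundred
```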

Hi, I also submitted your model here https://huggingface.co/spaces/occiglot/euro-llm-leaderboard

I'm actually surprised it did not do as well as I thought it would. In my own personal testing I was quite impressed with its summarization results: when I asked Claude 3 Haiku what it thought of about 500 summaries, Haiku scored Storm-7B at the top on average.

Thank you for your submission. It appears that the euro-llm-leaderboard consists of traditional NLP tasks, which are not benchmarks for alignment. Additionally, we did not adopt Llama-3-8B as our base model because it did not exist when we started this project. Our base model may not be optimized for French, Italian, German, Spanish, and Dutch.

So, let me share some info about why I'm surprised.

I have been going through some data and quantised a few models to translate and summarise pieces of text. The text was in Afrikaans (a small language from South Africa, where I'm from), and the source data is already a mix of Afrikaans and English. The summary needed to be in English. (I'm playing and learning, nothing major.)

I passed all the text in chunks to each LLM and then asked Claude 3 Haiku to rank the summaries.
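For reference, the pipeline looked roughly like the sketch below; I'm assuming an Ollama-style local setup for the quantised models and the official anthropic SDK for the Haiku judge. Model tags, prompts, and the scoring format are illustrative, not my exact scripts:

```python
# Hypothetical sketch of the chunk -> summarise -> judge pipeline described above.
import ollama
import anthropic

CHUNK_CHARS = 4000                                     # rough chunk size for small context windows
LOCAL_MODELS = ["storm-7b:q5_K_M", "llama3:8b-instruct-q5_K_M"]   # placeholder tags

def chunk(text: str, size: int = CHUNK_CHARS) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarise(model: str, source_text: str) -> str:
    """Ask a local model for an English summary of a (mostly Afrikaans) chunk."""
    prompt = f"Summarise the following text in English:\n\n{source_text}"
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

def judge(source: str, summaries: dict[str, str]) -> str:
    """Ask Claude 3 Haiku to score each summary against the source chunk."""
    client = anthropic.Anthropic()                     # reads ANTHROPIC_API_KEY from the environment
    listing = "\n\n".join(f"[{name}]\n{s}" for name, s in summaries.items())
    msg = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Source text:\n{source}\n\nSummaries:\n{listing}\n\n"
                       "Score each summary from 0-100 for faithfulness and coverage.",
        }],
    )
    return msg.content[0].text

if __name__ == "__main__":
    source = open("afrikaans_article.txt", encoding="utf-8").read()   # placeholder file
    for piece in chunk(source):
        summaries = {m: summarise(m, piece) for m in LOCAL_MODELS}
        print(judge(piece, summaries))
```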

The results:

  1. Storm-7B Q5 KM (89.38% correct)
  2. eramax/salesforce-iterative-llama3-8b-dpo-r Q5 KM (82.06% correct)
  3. r3m8/llama3-simpo:8b-instruct Q5 KM (79.68% correct)
  4. starling-lm:7b-beta Q5 KM (79.48% correct)
  5. llama3_8b-instruct Q5 KM (77.21% correct)
  6. Llama-3-Smaug-8B Q5 KM (76.81% correct)
  7. solar_10.7b-instruct Q5 KM (75.92% correct)
  8. llama-3-instruct-8b-simpo-expo_8B Q5 KM (74.54% correct)
  9. phi3_14b-medium-4k-instruct Q5 KM (73.31% correct)
  10. qwen_14b-chat-v1.5 Q5 KM (73.16% correct)

Along with all these summaries, I also asked Haiku, in the same prompt after the rating, to create its own summary. I then passed everything again to Claude 3 Sonnet (not 3.5) to rank:

  1. Claude 3 Haiku (87.4% correct)
  2. Storm-7B Q5 KM (73.81% correct)
  3. solar_10.7b-instruct Q5 KM (73.23% correct)
  4. r3m8/llama3-simpo:8b-instruct Q5 KM (72.97% correct)
  5. llama-3-instruct-8b-simpo-expo_8B Q5 KM (72.71% correct)
  6. eramax/salesforce-iterative-llama3-8b-dpo-r Q5 KM (72.5% correct)
  7. Llama-3-Smaug-8B Q5 KM (70.82% correct)
  8. llama3_8b-instruct Q5 KM (68.21% correct)
  9. starling-lm:7b-beta Q5 KM (63.41% correct)
  10. phi3_14b-medium-4k-instruct Q5 KM (62.73% correct)

Now, there is no way that most of these models have been trained on Afrikaans directly (maybe on some Dutch), and Afrikaans has its own writing rules, which suggests some emergent skill is at work.

I think your model is onto something.

Disclaimer: I'm not a data science or LLM expert; I'm merely very interested in learning (mostly playing).

Wow, your experiments on the summary task are very impressive. I think these rankings are consistent with the AlpacaEval 2.0 Leaderboard, where our Storm-7B is the best open-source model. The summary task is an open-ended QA task, and Claude 3's judgement reflects human preference over responses. However, MMLU-Pro and the Euro-LLM-Leaderboard are multiple-choice benchmarks, scored by matching logits or extracted answers against ground truth. These are traditional NLP tasks with ground-truth answers, so such benchmarks are not directly comparable to your summary task, which is open-ended and creative.
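To illustrate the difference, multiple-choice benchmarks such as MMLU-Pro are typically scored roughly like the sketch below: the prediction is whichever option letter receives the highest next-token logit, matched against a single ground-truth answer. The repo id and prompt format are assumptions for illustration, not the exact harness behind the numbers above:

```python
# Sketch: logit-based scoring of a multiple-choice question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jieliu/Storm-7B"                          # assumed HF repo id for Storm-7B
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Zero-shot multiple-choice prompt (MMLU-Pro actually uses up to ten options, A-J;
# four are shown here for brevity).
prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

options = ["A", "B", "C", "D"]
option_ids = [tok(f" {o}", add_special_tokens=False).input_ids[-1] for o in options]
pred = options[int(torch.argmax(next_token_logits[option_ids]))]
print(pred)                                            # matched against the ground-truth letter, here "B"
```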
