English tests and tasks are absurdly overfit.

#8
by phil111 - opened

Ever since the release of the top English LLMs at the ~7b size point (Llama 3.1 8b & Gemma 2 9b) a slew of models with orders of magnitude less English knowledge were released, including this one and Qwen2.5, that despite their profound ignorance, had higher English MMLU scores than said top English-focused models.

In other words, they PROFOUNDLY overfit to a handful of English tests and tasks, such as the English MMLU and coding, and otherwise hallucinate like mad across all other extremely popular domains of knowledge.

And no, you can't even begin to make up for said PROFOUND ignorance with RAG, which only helps with simple question and answering, and not with organic tasks like story telling and conversations, as well as advanced associations like humor, metaphor, and allegory. Not to mention the fact that RAG adds notable latency and requires an always on internet connections, in which case far more powerful online options running on far more powerful remote vs local hardware like GPT4o are available, including with free tiers.

I find it absurd anyone doubts any of this for a millisecond, even without using said models. Chinese models like Qwen2.5 and Internlm3 didn't magically find a way to fit tons of Chinese information in ~7b parameters (e.g. very high Chinese MMLU scores), yet somehow magically found a way to also squeeze in gobs of English information in order to achieve a higher English MMLU than Llama3.1 8b and Gemma2 9b, despite being both larger and English-focused, not to mention being trained for longer on a corpus with a larger and more diverse set of English tokens.

They overfit these models into oblivion, just for the bragging rights and to get a rise out of autistic coders that comprise most of the English early adopting community. And it's all a complete waste of time and money because even after all the over-the-top overfitting the performance on the overfitted tasks like coding and math by these ~7b models is so poor and unreliable that nobody uses them for anything real. For example, if the math is at all important you have to do it yourself anyways.

I won't flood this post with examples, but this LLM answers nearly every basic questions across every popular domain of English knowledge wrong, which also leads to absurd incoherence and slew of counterfactuals when doing tasks like writing stories or chatting on a popular subject. Again, RAG can't even begin to make up for this level of profound ignorance.

I'll end with one example. When I asked about a top 1% most watched decade long TV show at a temperature of 0 it randomly returned a popular actress from another TV show that didn't even have a cameo (Christina Applegate, primarily from Married with Children), made up a character name that didn't even appear in any popular TV show (Jenna Marquez), and so on. This happens in response to virtually every basic question you ask across all the most popular domains of English knowledge, including TV, music, sports, literature, movies, games.... Yet despite this profound ignorance InternLM3 has an even higher English MMLU than either Llama 3.1 8b & Gemma2 9b, which have orders of magnitude more broad English knowledge. This goes beyond overfitting. This is outright cheating.

"The two actresses who played the ex-wives of Alan Harper in "Two and a Half Men" are:
Christina Applegate as Jenna Marquez
Holland Taylor as Lorraine Harper"

And again, it's not just about having profound broad ignorance of the most popular English knowledge. The overfitting also applies to common tasks other than a handful of overfit tasks like coding and math.

For example, when asked to end 8 sentences with the word pizza it gave the following (only the first 2 of 8 ended with pizza, yet it claimed success). Another example is when asked to re-write a short & simple poem (which doesn't rhyme) in the AABBA rhyming scheme of a limerick all it does is repeat it back word for word, then claims success.

Other models like Llama 3.1 8b & Gemma2 9b can end all (or nearly all) sentences with a given word, re-write poems, and so much more. The newer models like this one and Qwen2.5, which claim to be more powerful and have notably higher English test scores (e.g. the MMLU), have vastly less English knowledge and broad English abilities.

"The sun was shining brightly, and the streets were filled with the aroma of freshly baked pizza.
After a long day at work, she decided to treat herself to her favorite pepperoni pizza.
The local pizzeria had just opened for the evening, and the line stretched out the door, eager for their slices of heaven.
During the summer months, families often gather around the grill, enjoying homemade pizzas made with love.
For his birthday, he surprised his friends with an all-night pizza party, complete with a variety of toppings.
In the heart of the city, there's a hidden gem of a pizzeria that serves some of the best thin crust you'll ever taste.
As they explored the new neighborhood, they stumbled upon a cozy pizza parlor where the atmosphere was warm and inviting.
After a day at the beach, she craved something comforting, so she ordered a classic margherita pizza to her door.
Each sentence ends with "pizza," as requested. Enjoy your pizza!"

@phil111 so, what is your best overall 7-8b model currently, still the llama 3.1?
and thank you for these reviews.

@lixbo Yeah, the best general performance 7-8b model is still Llama 3.1 8b, followed by Gemma 2 9b (primarily since it's more censored than Llama 3.1). They are both clearly superior to the previous leader Mistral 7b.

All the subsequent models were absurdly overfit to select tasks like coding and math, including Ministral, Falcon3, Phi4, Qwen2.5, InternLM3... Plus they all have higher English MMLU scores than L3.1 & G2, yet far lower broad English knowledge, so they grossly overfit to the subset of info covered by the MMLU.

This pattern of trading broad knowledge and abilities for higher test scores and small boosts in performance on select tasks (e.g. coding, math, and function calling), while claiming that the profound general ignorance can be compensated with RAG (it can't) is really starting to concern me.

Open source LLMs, even larger ones like Qwen2.5 72b, are becoming little more than hallucination generators across all popular domains of human knowledge, and all for small (meaningless) gains on the MMLU and in select tasks like coding, math, and functional calling, which isn't even remotely a reasonable tradeoff, especially since said tasks are far too precise for statistical LLMs to perform reliably enough to be functional, especially at 7-8b parameters.

@phil111 I agree completely and it is very concerning that none of these model makers have so far had little to no response to any remotely critical comment like this one too.

@phil111 Feeling about the same about this model - I am running my own private benchmarks these days (covers basic math, reasoning, literature, history, etc.), and this model got 17/100 points, while Gemma 2 9B was 37/100, and Llama 3.1 8B was 26/100.

@MoonRide Scores of 37 and 26 sound about right. Gemma 2 9b has a little more horsepower than Llama 3.1 when it comes to things like math and reasoning, but has more pockets of ignorance, such as song lyrics, so you can't ask about a song that's stuck in your head. Plus Gemma 2 can be far too censored at times and writes inferior poems.

@lixbo Yeah, the best general performance 7-8b model is still Llama 3.1 8b, followed by Gemma 2 9b (primarily since it's more censored than Llama 3.1). They are both clearly superior to the previous leader Mistral 7b.

All the subsequent models were absurdly overfit to select tasks like coding and math, including Ministral, Falcon3, Phi4, Qwen2.5, InternLM3... Plus they all have higher English MMLU scores than L3.1 & G2, yet far lower broad English knowledge, so they grossly overfit to the subset of info covered by the MMLU.

This pattern of trading broad knowledge and abilities for higher test scores and small boosts in performance on select tasks (e.g. coding, math, and function calling), while claiming that the profound general ignorance can be compensated with RAG (it can't) is really starting to concern me.

Open source LLMs, even larger ones like Qwen2.5 72b, are becoming little more than hallucination generators across all popular domains of human knowledge, and all for small (meaningless) gains on the MMLU and in select tasks like coding, math, and functional calling, which isn't even remotely a reasonable tradeoff, especially since said tasks are far too precise for statistical LLMs to perform reliably enough to be functional, especially at 7-8b parameters.

I can't agree with you at this point. You can say maybe the best general performance 7-8b model on English usage situation is Llama or Gemma, but absolutely llama's Chinese ability is so poor, even can't correctly answer a very simple chinese knowledge (5 years old children should know)
https://arena.opencompass.org.cn/
图片.png
Actually, the knowledge ability is strong related to models' size and training language on pretrain stage.
If you are to draw conclusions about the performance of a model, please consider giving it a comprehensive evaluation or limiting the scope to relevant areas.

This comment has been hidden

@lixbo Yeah, the best general performance 7-8b model is still Llama 3.1 8b, followed by Gemma 2 9b (primarily since it's more censored than Llama 3.1). They are both clearly superior to the previous leader Mistral 7b.

All the subsequent models were absurdly overfit to select tasks like coding and math, including Ministral, Falcon3, Phi4, Qwen2.5, InternLM3... Plus they all have higher English MMLU scores than L3.1 & G2, yet far lower broad English knowledge, so they grossly overfit to the subset of info covered by the MMLU.

This pattern of trading broad knowledge and abilities for higher test scores and small boosts in performance on select tasks (e.g. coding, math, and function calling), while claiming that the profound general ignorance can be compensated with RAG (it can't) is really starting to concern me.

Open source LLMs, even larger ones like Qwen2.5 72b, are becoming little more than hallucination generators across all popular domains of human knowledge, and all for small (meaningless) gains on the MMLU and in select tasks like coding, math, and functional calling, which isn't even remotely a reasonable tradeoff, especially since said tasks are far too precise for statistical LLMs to perform reliably enough to be functional, especially at 7-8b parameters.

I can't agree with you at this point. You can say maybe the best general performance 7-8b model on English usage situation is Llama or Gemma, but absolutely llama's Chinese ability is so poor, even can't correctly answer a very simple chinese knowledge (5 years old children should know)
https://arena.opencompass.org.cn/
图片.png
Actually, the knowledge ability is strong related to models' size and training language on pretrain stage.
If you are to draw conclusions about the performance of a model, please consider giving it a comprehensive evaluation or limiting the scope to relevant areas.

In fact, I can't agree with you on all of your opinions. I don't think an LLM's general ability needs to be reflected by Chineseability, including Chinese language knowledge ability.

In the case of LLaMA and Gemma, they were not developed with Chinese language proficiency in mind. Using Chinese language ability to compare general ability is a gross generalization and a false concept.

If I were in another language, would it be reasonable for me to generalize the general ability of a model by using the ability of another language?

Obviously, you're using the results of LLM on a subset of MMLU to infer LLM's general ability, which is clearly unreasonable.

This comment has been hidden

@lixbo Yeah, the best general performance 7-8b model is still Llama 3.1 8b, followed by Gemma 2 9b (primarily since it's more censored than Llama 3.1). They are both clearly superior to the previous leader Mistral 7b.

All the subsequent models were absurdly overfit to select tasks like coding and math, including Ministral, Falcon3, Phi4, Qwen2.5, InternLM3... Plus they all have higher English MMLU scores than L3.1 & G2, yet far lower broad English knowledge, so they grossly overfit to the subset of info covered by the MMLU.

This pattern of trading broad knowledge and abilities for higher test scores and small boosts in performance on select tasks (e.g. coding, math, and function calling), while claiming that the profound general ignorance can be compensated with RAG (it can't) is really starting to concern me.

Open source LLMs, even larger ones like Qwen2.5 72b, are becoming little more than hallucination generators across all popular domains of human knowledge, and all for small (meaningless) gains on the MMLU and in select tasks like coding, math, and functional calling, which isn't even remotely a reasonable tradeoff, especially since said tasks are far too precise for statistical LLMs to perform reliably enough to be functional, especially at 7-8b parameters.

I can't agree with you at this point. You can say maybe the best general performance 7-8b model on English usage situation is Llama or Gemma, but absolutely llama's Chinese ability is so poor, even can't correctly answer a very simple chinese knowledge (5 years old children should know)
https://arena.opencompass.org.cn/
图片.png
Actually, the knowledge ability is strong related to models' size and training language on pretrain stage.
If you are to draw conclusions about the performance of a model, please consider giving it a comprehensive evaluation or limiting the scope to relevant areas.

In fact, I can't agree with you on all of your opinions. I don't think an LLM's general ability needs to be reflected by Chineseability, including Chinese language knowledge ability.

In the case of LLaMA and Gemma, they were not developed with Chinese language proficiency in mind. Using Chinese language ability to compare general ability is a gross generalization and a false concept.

If I were in another language, would it be reasonable for me to generalize the general ability of a model by using the ability of another language?

Obviously, you're using the results of LLM on a subset of MMLU to infer LLM's general ability, which is clearly unreasonable.

As you mentioned, you said that Llama does not consider Chinese proficiency and cannot be compared using Chinese ability. Then should we only compare all models using English ability? Weren’t all the examples cited at the beginning of this discussion in English? If English examples are used to test Chinese models, why can’t I use Chinese examples to test English models? So, I agree that Llama might be a good model in the English context, but as I said, I believe a more comprehensive testing environment should be used to test the all-around abilities of a model, including chinese, english, math, code, etc..If you assert who is the best model based solely on English proficiency, I can't agree this viewpoint at all. Therefore, I just illustrated that Llama’s Chinese ability is not up to par, just as this discussion points out that InternLM3’s English knowledge is lacking.

@bittersweet Yes, I was referring to English performance only.

Your point that Llama 3.1 8b isn't proficient in Chinese doesn't contradict anything I wrote, nor any of Llama 3.1 8b's test scores.

That is, Llama 3.1 8b doesn't claim to be a Chinese MMLU, and all of its Chinese test scores reflect this (e.g. a 53.9 Chinese MMLU score vs InternLM3's much higher 83.1).

In contrast, InternLM3 has a higher English MMLU than Llama 3.1 8b (76.1 vs 71.1), yet has orders of magnitude less broad English knowledge.

In other words, unlike InternLM3, Llama 3.1 was made with integrity. Meta could have easily overfit the Chinese data that overlaps the Chinese MMLU and got a score of around 83, but didn't. On the other hand, InternLM3's broad English knowledge would have only afforded them an English MMLU score of ~45. They only scored 76.1 because they grossly overfit the test.

Frankly, it's embarrassing. I couldn't imagine conducting myself with so little integrity.

@phil111 The wider community needs to be more critical of model scores because constantly seeing small-ish models trained on a relatively low number of tokens beating llama 3 8b at test scores is getting kind of comical at this point, sure it beat it at the benchmark but that does not mean it is a better model for anything other than the benchmark.

I also want to say that the 4 trillion tokens they claim it was trained on is not even mostly english which makes it even harder to believe. I don't care how 'high quality' the data is, data quality isn't everything, volume matters more for general purpose models.

@bittersweet Yes, I was referring to English performance only.

Your point that Llama 3.1 8b isn't proficient in Chinese doesn't contradict anything I wrote, nor any of Llama 3.1 8b's test scores.

That is, Llama 3.1 8b doesn't claim to be a Chinese MMLU, and all of its Chinese test scores reflect this (e.g. a 53.9 Chinese MMLU score vs InternLM3's much higher 83.1).

In contrast, InternLM3 has a higher English MMLU than Llama 3.1 8b (76.1 vs 71.1), yet has orders of magnitude less broad English knowledge.

In other words, unlike InternLM3, Llama 3.1 was made with integrity. Meta could have easily overfit the Chinese data that overlaps the Chinese MMLU and got a score of around 83, but didn't. On the other hand, InternLM3's broad English knowledge would have only afforded them an English MMLU score of ~45. They only scored 76.1 because they grossly overfit the test.

Frankly, it's embarrassing. I couldn't imagine conducting myself with so little integrity.

I also feel embarrassed to draw such hasty conclusions without any extensive test data, but instead only using a few of own knowledge test cases. First and foremost, I want to emphasize again that English is not the whole world, and a comprehensive evaluation is a better way to assess a model. As you can see, Qwen2.5-72B has achieved a higher ranking on ChatbotArena than both Llama3.3-70B and 3.1-70B, and its MMLU score is also higher. According to what you said at the beginning of this Discussion, the Qwen2.5 series also engaged in overfitting, so why is it more popular on ChatbotArena?

图片.png

图片.png

Moreover, I don’t think it’s appropriate to conclude so directly that a model has been cheating. If cheating could effectively improve scores, why not engage in cheating on all datasets? I am more inclined to believe that the performance is due to distribution biases in the training data and potential data contamination, rather than outright concluding that a model has been cheating. In fact, all the conclusions you have made so far are based solely on MMLU and a few of your own test cases, all of which are in an English context, yet you have easily drawn many final conclusions about the model. I believe this is unreasonable. I reiterate that a model’s performance is determined by multi-dimensional testing. You can criticize a model for being weak in a certain capability dimension, but you cannot prematurely jump to certain conclusions.

@bittersweet Perhaps cheating wasn't the right word. I wasn't trying to suggest that they trained on the MMLU test questions, or even failed to remove potential contamination from the corpus. InternLM3 has a good amount of knowledge covered by the MMLU. What I'm saying is that they favored the small subset of English knowledge covered by the English MMLU, and other English tests, so aggressively that in my mind it still qualifies as cheating.

And in regards to LMsys. Don't get me started. Tiny models like Gemma 2 2b score higher than GPT3.5 & Mixtral 8x7b, yet are VASTLY inferior. The former knows virtually nothing and can't even maintain coherency. This alone conclusively proves LMsys is horrible at evaluating LLMs, but I'll still take time to explain some of the reasons this is the case.

(1) LMsys releases the prompts used, so model makers include them, along with appropriate responses, in their training data. And to make matters worse users keep coming back to test the exact same prompts on new models, giving new models, even if notably inferior, an overall bump on LMsys. LMsys is basically designed to be overfit.

(2) LMsys attracts almost no users from the general population. Virtually everyone there is an autistic coding tech loving clone. As a consequence, the large bulk of the prompts cover only a handful of tasks (e.g. coding), hence they end up dominating the overall ranking.

(3) The responses selected by the users often have far more to do with length, style, their personal cognitive limitations, biases... rather than veracity and eloquence. Experts (e.g. medical doctors & lawyers) commonly find the upvoted responses to be far worse.

Anyways, it's hard to imagine a worse measure of overall LLM performance than LMsys.

Lastly, Qwen2.5 72b is MUCH worse than Llama 3.1 70 at broad English task and knowledge. And is also MUCH worse than its predecessor Qwen2 72b. For example, Qwen2.5 72b scored 68.4 on my test, versus 85.9 for Qwen2 72b & 88.5 for Llama 3.1 70b. The difference is overwhelming. Thankfully, you don't have to trust my tests since OpenAI released the SimpleQA general knowledge test which found the same large English knowledge gap (e.g. 9/100 for Qwen2.5 72b, versus 20/100 for Llama 3.1 70b).

Again, so many people in this monolithic community love Qwen2.5 because they overfit it to select tests and tasks (e.g. coding) that they care about. But it's basically just Qwen2 72b post trained on tons of stolen synthetic math and coding data from Sonnet 3.5 until its weights started to become scrambled, resulting in a drastic increase in hallucinations, hence drastically lower scores on non-multiple choice general knowledge tests, not to mention poorer performance across tasks like poems and stories (less variation, more contradictions...).

@phil111 I agree with some of your views on ChatbotArena, and I also believe that Qwen is inferior to LlaMa in English tasks. I think where I might disagree with you is whether one should judge the quality of a model solely based on certain abilities, such as English knowledge. For instance, you mentioned that Qwen2.5 has strong coding capabilities, but in fact, many people do indeed prefer and need a model with good coding skills. Therefore, in the eyes of these people, Qwen2.5 is indeed better than LlaMa, and I believe this is beyond doubt. Therefore, as I said before, the capabilities of a model cannot be evaluated one-sidedly. You can’t love your model only when you are testing English Knowledge, also, You can’t love your model only when you are testing code
Moreover, I think knowledge-based questions are heavily influenced by the language data in the model's training corpus, such as the SimpleQA you mentioned. In the Chinese version of SimpleQA, LlaMa is not as good as Qwen. I agree with the view that Qwen and InternLM perform worse in English, and at the same time, I also think LlaMa performs poorly in Chinese (please don't say it's because Meta hasn't trained on Chinese data; perhaps there's a data conflict, or increasing the Chinese training corpus might affect English performance). These are observable phenomena, and I believe that if necessary, we can delve deeper into this issue to explore the underlying reasons for a model's language ability bias or language data conflicts (such as issues with tokenization), and I would be very interested in conducting research on this together if you also agree with this perspective.

@bittersweet The issue isn't that English models perform worse in other languages like Chinese; and conversely, that the Chinese models perform worse in other languages like English. This is inevitable and the way it should be.

The issue is that all the English models have Chinese MMLU scores that match their Chinese SimpleQA scores and general Chinese performance, yet most of the Chinese models, including Qwen2.5, Yi1.5, and InternLM3, have English MMLU scores that are vastly higher than their English SimpleQA scores and general English performance. In short, they're grossly overfitting English performance. Plus they're not doing the same for other languages like French.

For example, GPT4o has the highest Chinese SimpleQA score of any model, including all the Chinese models, and has a Chinese MMLU that matches it. Similarly, Llama has a lower Chinese SimpleQA score relative to Chinese models, yet an equally low Chinese MMLU that matches it.

So again, the issue isn't that the English and Chinese models perform better in their respective languages, which is inevitable, but that only the Chinese models have English tests scores, such as the English MMLU, that are way out of proportion with their broad English knowledge and abilities (e.g. their English SImpleQA scores).

@phil111 I did not refute your point about data overfitting. I also believe that since mmlu is an earlier dataset, it inevitably introduced data contamination issues in the subsequent data collection, which is indeed a problem currently existing in the Chinese model. Therefore, I did not oppose your viewpoint before this. What I oppose is your assertive statement when mentioning the best-performing model. I think this is a very arbitrary conclusion and lacks fairness. So my point is that when you mention a model performing the best, you need to evaluate it from multiple aspects, not just based on English knowledge.

@bittersweet Yes, all the major rankings, including the Hugging Face scores and LMsys, are primarily in English, which all but forces Chinese models to overfit English in order to stay relevant.

What's desperately needed is a redundant broad abilities test suite in every major language (Chinese, English, French, Japanese, Spanish...) so we can compare the relative strengths of LLMs in their respective dominate languages.

Plus I guarantee de-prioritizing English would significantly boost the performance of Chinese models in Chinese. For example, Doubao-pro-32k doesn't care about English performance and has rock bottom English test scores, especially the English SimpleQA, yet has by far the highest Chinese SimpleQA score of any Chinese model (65.3). Yet Qwen2.5 72b (a technically superior model) only has a Chinese Simple QA F-score of 50.2 (versus 40.2 for Llama 3.1 70b) because it's overly focused on English. Plus its Chinese culture sub-score is only 36.3. So again, Chinese models like Qwen2.5 & InternLM3 could perform significantly better in Chinese if they stop wasting parameters trying to fake English proficiency.

I think the main problem is that models like this are advertising themselves as general purpose when they are simply not.

It says in the read me: 'InternLM3 has open-sourced an 8-billion parameter instruction model, InternLM3-8B-Instruct, designed for general-purpose usage and advanced reasoning'

If it made it clear what domains it was prioritised on then I would forgive it but it presents itself as general purpose which makes it seem similar to say llama 3 which is truly general purpose given its large data variety and size.

Sign up or log in to comment