💬 Discussion thread: Model scores and model performances 💬
Hi!
From time to time, people open discussions to talk about their favorite models' scores, the evolution of different model families over time, etc.
We love seeing these threads, and very cool insights often emerge from them, but they are not actionable for us (as they are discussions, not issues).
As such, it would be simpler for the Open LLM Leaderboard team to have all these discussions centralized in one place.
This is what this discussion thread will be for. It will be kept pinned and open to allow users to debate and discuss.
TLDR:
- If you want to discuss models and scores with the people who frequent the leaderboard, do so here.
- If it's about something we can fix, open a new issue.
Thank you so much for reading :)
Ok, I will put it here:
On the leaderboard, the difference between the average scores of OPT-66B and a good fine-tuned 3B model (acrastt/Marx-3B-V2) is quite small.
That seems odd to me. OPT is older and has seen less, and possibly lower-quality, text data than this OpenLLaMA-3B fine-tune.
But it is still weird: the model is 22 times larger. What a waste. Imagine getting Llama-2-70B performance out of a model with 1.5 trillion parameters.
That's not good. It's disproportionate.
It's not just lower quality data, it's an old design. That will obviously happen, and newer models will perform better.
I'd guess the data is the main reason, or something else is not correct. There are, for example, NVIDIA-made GPT-2 versions which perform surprisingly well just by being trained on more tokens.
That's not a change to the architecture. You could use OPT architecture and train a competitive high performance model with the right data and enough compute.
While the 3B model has seen more tokens than OPT-66B, the difference is not so large that one ought to expect a 66B model to look bad next to it.
This is weird.
If it is indeed due to OPT's lower pretraining data quality, they must have put a lot of low quality data in it.
That can happen. Meta also made a megatron gpt2 with 11b parameters. Total trash. Sometimes they don't get it right.
But it is really not good. OPT-175B was supposed to have roughly the performance of the original GPT-3. I still have my doubts that it is just a worse dataset.
Though check out the opt-66b MMLU score. It's really bad. That must have been some dataset.
Also, OPT was only trained on 180B tokens, while OpenLLaMA-3B was trained on 1 trillion tokens. The number of pretraining tokens should also make a massive difference.
I would have thought they at least replicated the size of GPT-3's pretraining data. Apparently that's not the case. A scientific artefact. Good that it brought us to LLaMA.
I freed some space on my HD as a result of this.
Even GPT-3 was trained on just a few hundred billion tokens (around 500 billion, I think). But of course it's 175B parameters.
You don't really need 1 trillion tokens for a useful LLM, but it becomes easier if you put more properly good-quality data in there. Put in fewer tokens of lower-quality data and you get a suboptimal result. GPT-3's data was probably better curated, too.
The difference between 400 billion tokens of well-curated data and 180B tokens of "let's see what happens if we put raw Common Crawl in there" is not just dataset size and parameter count.
Data quality is very important.
The leaderboard data drove me to dive deeper into MMLU's Moral Scenarios task. I found that MMLU's Moral Scenarios benchmark isn't a useful measure of moral judgement. Given that the leaderboard reports overall accuracy, I don't think it is something that needs to be addressed as an issue. Just wanted to share my findings for anyone curious: https://medium.com/@coreymorrisdata/is-it-really-about-morality-74fd6e512521
The mustache and lawn questions sound absurd as tests of moral judgement, because they describe absurd human behaviors. It's like asking: is this absurd behavior moral or immoral? I don't know if they are moral or immoral, but they are surely absurd! They are stupid questions!
I was expecting MMLU to test more specific, real-world moral scenarios, like judging excessive self-defence, hidden racial/sexual discrimination, conflicts of interest, commonly shared humanity, and so on.
The scenarios are of a pretty wide variety and yeah there are a number of them that are a bit odd. I would say that most that I looked at are relatively clear, but it wasn't something I looked at in depth. If I had to guess I would say that at least 90% or so are either clearly morally wrong or clearly not morally wrong.
does anyone know why BLOOM isn't on here?
The 176B was a bit above what we could evaluate until recently (multi-node evaluation is tricky) - however, all smaller sizes should be on the leaderboard :)
Since we found a nice way to do this for the Falcon release (kudos to @thomwolf), I'll probably add it soon.
Perhaps a GPTQ model is an option now, too: TheBloke/BLOOMChat-176B-v1-GPTQ
Though more generally, I would love to see 2-bit integrated in huggingface. Folks at @GreenBitAI seem to have cracked 2-bit quantization with only a small performance loss.
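For reference, a rough sketch of what loading that GPTQ checkpoint might look like with transformers. This is untested at this scale and assumes `auto-gptq`, `optimum`, and `accelerate` are installed; depending on the repo's config, a `GPTQConfig` may need to be passed explicitly, or AutoGPTQ used directly:

```python
# Rough sketch (untested at this scale): loading a pre-quantized GPTQ checkpoint
# through transformers. Assumes auto-gptq, optimum, and accelerate are installed
# and that there is enough GPU memory for the sharded 176B weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/BLOOMChat-176B-v1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # shard the quantized weights across available GPUs
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```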
Why is there no T5 in the leaderboard?
Re the T5 model, it's because it's an AutoModelForSeq2SeqLM model, not an AutoModelForCausalLM, and we only support the latter at the moment.
(And since we got almost no Seq2SeqLM submitted to the leaderboard during its history, updating the code base to fit this other architecture is not a priority).
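For reference, a minimal illustration of the split between the two auto classes (the model names below are just small public examples, not leaderboard submissions):

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Decoder-only checkpoints resolve through the causal-LM auto class...
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# ...while encoder-decoder checkpoints like T5 resolve through the seq2seq class.
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Trying to load T5 as a causal LM fails, which is why the harness can't
# evaluate it as-is:
# AutoModelForCausalLM.from_pretrained("t5-small")  # raises ValueError
```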
@dadavbaser45 Hello, please try to avoid spamming links without context in a specific conversation. Thanks!
Hey, I was wondering: maybe it could be really useful and informative to test the models on Needle In A Haystack - Context Recall:
Source : https://github.com/gkamradt/LLMTest_NeedleInAHaystack
As you can see, between GPT4 and Claude2.1, the difference is huge for context recall. The process may not be perfect but it does give a good insight on the performance of a model.
A lot of people, me included, tend to use LLM with huge context window to parse a lot of messy data and order or interpret it. I don't really know if the context recall is affected by fine-tuning or changes brought to a base model (for example, if any model based on Llama 2 have the same context recall efficiency, it might be useless to test all of them).
That's my first contribution, feel free to tell me if I did something wrong! Thanks
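For reference, a minimal sketch of the needle-in-a-haystack setup described above. `query_model` is a hypothetical stand-in for whatever chat/completion API is being evaluated, not a real library call:

```python
NEEDLE = "The secret code for this document is 4815162342."
QUESTION = "What is the secret code for this document?"

def build_haystack(filler_sentences, needle, depth_fraction):
    """Insert the needle at a given relative depth inside the filler text."""
    cut = int(len(filler_sentences) * depth_fraction)
    return " ".join(filler_sentences[:cut] + [needle] + filler_sentences[cut:])

def run_recall_test(query_model, filler_sentences, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return, per insertion depth, whether the model retrieved the needle."""
    results = {}
    for depth in depths:
        context = build_haystack(filler_sentences, NEEDLE, depth)
        answer = query_model(f"{context}\n\n{QUESTION}")
        results[depth] = "4815162342" in answer  # crude recall check
    return results
```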
@zakichan Normal Llama 2 models have a measly 4096-token context, compared to GPT-4's 128k and Claude's 200k. Mistral models, however, have an 8k context.
You can easily fine-tune any Mistral or Llama 2 model with YaRN and get 128k context. There are only a few YaRN models right now, but you could do it to any of them if you really wanted.
However, there are two models, Yi-34B-200K and Yi-6B-200K, which, as the names say, have a massive 200k context, the same as Claude's. They are very high on the Open LLM Leaderboard as well.
@zakichan Also check out the Qwen 32K models; they managed to achieve an impressive needle-in-a-haystack score: https://github.com/QwenLM/Qwen#long-context-understanding
Why are there so many 7B models in top 20? What is going on? I downloaded https://huggingface.co/v1olet/v1olet_marcoroni-go-bruins-merge-7B which is currently #7 on the leaderboard, just to see how it performs. That thing is awful. It fails to follow the most basic instructions. Maybe for 7b it's okay, but it is in no way beating 70b models. I think the whole leaderboard is suffering from the phenomenon described in this paper https://arxiv.org/abs/2309.08632 . The leaderboard is becoming a cheater competition instead of being a legitimate performance comparison overview. Please do something about it.
Agreed. I feel that even without cheating, models start to become overoptimized for the scores, diminishing their meaningfulness. It's possible to have a model that scores well after DPO or other emerging optimization procedures, but which fails at multi-turn conversations.
Hi!
These are legitimate concerns, and why we 1) try to add new evals regularly, though it's insanely costly to do so, and 2) are also looking at adding contamination detection mechanisms.
To see if a model is going to be a "good" chat model, I think the chatbot arena is likely more relevant - the leaderboard contains chat-FT/IFT/RLHF models alongside normal pretrained models, and not all of these are good in chat format.
Imo the use cases of the Open LLM Leaderboard are 1) to compare the results of OSS releases in a comparable framework (scores for pretrained models are very relevant), 2) to provide an estimate of specific capabilities for models, and 3) to allow the community to test their own models when they don't have the compute to do so.
It's challenging! Great that we have this.
I was thinking, perhaps you could have two versions of the openLLM leaderboard:
- the view that comes from all submissions
- a curated leaderboard, where trustworthy people have looked at the submissions in some way, before they are shown
Otherwise I think the credibility of the ranking might suffer too much.
There are already people doing some more qualitative testing of models.
The purpose I guess is to signal to people what models are worth a try?
If they are doubtful and without trust, then they will look for other ways to spot new models.
https://www.reddit.com/r/LocalLLaMA/comments/17fhp9k/huge_llm_comparisontest_39_models_tested_7b70b/
If users (such as the one above) want to set up their own leaderboards with their detailed analyses, I'll be glad to give them a hand - thanks for the link, it was an interesting read btw
Regarding the above discussion, thanks for the leaderboard - I use it to eval my models and to get an understanding of whether they perform better than the model I used as a base. :)
On a different note, there are these 2 pairs of models that have the exact same eval scores - are they duplicates with different names or do they perform differently? Anyone tried both models in a pair on custom prompts?
Hi @Mihaiii,
Thanks for the ping about this! You can check the details for each model pair (or even just the results - to which decimal are the results equivalent?), and if they are identical, it's very likely that there's been a copy.
I was more thinking about a community-driven effort, yet I think it would be of interest to HF, too, to get that started. There are some names that come to mind. Everyone is free to open their own, but I think some bridging function by HF would be a good idea.
A crude filter, which would not really measure the right thing, would be requiring a minimum number of likes or downloads.
In the end it is just a filter, with more or less human involvement.
A more involved filter based on qualitative methods would require coordination, and I feel HF is in the position to do that. It's HF's leaderboard and most eyeballs go to yours, even if someone were to clone it.
It's also desirable to maintain evaluative authority over the models on your Hub, if HF does this itself.
It is now centrally administered, also due to the compute required. Maybe there could also be a regime of validated local runs, but that might be further away. Maybe launch a HF coin for that, validating local evals :)
The demand for evals might be growing exponentially; perhaps that would be in HF's own interest.
An alternative idea would be to simply add a thumbs up & down system (with no one in control to make final judgement) for each model in the leaderboard and to allow custom filters based on that.
You'd avoid a lot of drama this way and - indeed - it would be useful for people that want to find good models.
Later edit: the number of likes a model receives is not a good indicator. I use the like button for bookmarking models I want to use in the future. Also, you can't differentiate between "I didn't like this model because I haven't tried it" and "I didn't like this model because I tried it and it's much worse than the leaderboard would imply".
The like button on the model page just indicates someone found it interesting. As suggested, it does not signal model quality, but even that would already be a useful filter. It's already built in, it only needs a condition.
A dedicated like button for the leaderboard would be a useful addition.
But if one adds that, one could also instead ask for a non-anonymous quality judgement,
like a rating. Making it a bit effortful might even be a good idea here.
Unrelated to the above:
I have a custom reasoning task I try different models on.
To me it looks like models that are trained using the MetaMath dataset have higher GSM8K scores than they should.
I noticed in another thread you used https://github.com/swj0419/detect-pretrain-code-contamination on a model and concluded it has a 99% chance of being contaminated on GSM8K.
Any chance you'd run it on at least one top model trained on MetaMath?
@Mihaiii This is an interesting question, do you want to run it and tell us what you get?
@KnutJaegersberg It's an interesting idea, but I don't think we have the bandwidth to run such a community effort atm
The likes are here to serve as the voting system you mentioned, and you can already display them in the table (though they are only updated every x, since fetching them dynamically would add too much load to the leaderboard) - however, I'll ask internally if it would be relevant to have a bookmark system which is different from the likes.
It would be nice to have a threshold filter for the likes in the table, e.g. more than 10 likes, or more than 10 likes gained over the past week. Personally I also use them primarily as bookmarks, but I see them as inspectable bookmarks.
But likes/popularity are not a great quality indicator, of course. They are probably correlated, though.
Maybe curation with qualitative research needs to happen elsewhere. But that stuff could use a platform; it's now on Reddit and Discord servers, informally.
I feel HF would want to stage that.
"Vibes is a new LLM benchmark" some say. Why not collect the vibes?
@Mihaiii This is an interesting question, do you want to run it and tell us what you get?
@clefourrier I'd be happy to! :)
To make sure this can be replicated by other people and also in case I made mistakes in my code changes, I'll write all the steps I took in detail.
Step 1:
Use the code from my custom branch (PR here), since the current master code doesn't have an option for GSM8K:
git clone https://github.com/Mihaiii/detect-pretrain-code-contamination.git
git checkout feature/add-gsm8k
Step 2:
Follow the example from the current readme:
DATASET='gsm8k'
python src/run.py --target_model meta-math/MetaMath-Mistral-7B --ref_model mistralai/Mistral-7B-v0.1 --data $DATASET --output_dir out/$DATASET --ratio_gen 0.4
Result: result < 0.1, %: 0.96
The result can be interpreted using a note from the current readme: "If #the result < 0.1# with a percentage greater than 0.85, it is highly likely that the dataset has been trained.".
Since meta-math/MetaMath-Mistral-7B is the initial model that was trained on the MetaMath dataset, it's fair to conclude that, according to the code from detect-pretrain-code-contamination and the interpretation from the readme, the MetaMath dataset is contaminated and therefore all the models that were trained on it are too.
Later edit: I also ran the same test on a model I didn't suspect of being trained on a contaminated dataset, and it got a 0.8 score, which is below the threshold (0.85) but still high. It could be that for GSM8K the threshold should be higher than usual for some reason.
DATASET='gsm8k'
python src/run.py --target_model migtissera/Tess-XS-v1.2 --ref_model mistralai/Mistral-7B-v0.1 --data $DATASET --output_dir out/$DATASET --ratio_gen 0.4
Result: result < 0.1, %: 0.8
Thanks a lot for this detailed analysis @Mihaiii, this is great!
I dug a bit into MetaMath, and according to their paper, one of their methods to bootstrap a new math dataset was rephrasing the GSM8K and MATH prompts.
However, a cool analysis by the LMSYS collective showed that using rephrases to bootstrap is a form of contamination.
I would therefore assume that all models SFT/IFT on MetaMath are contaminated for GSM8K.
Edit: regarding the threshold, we are discussing with the method's author, to better understand how it must be set - the current threshold is quite empirical, and @SaylorTwift is leading an analysis on using this contamination detection method that will discuss it in more detail.
Presumably, they were supposed to use rephrases of the GSM8K train set, not the GSM8K test set
@euclaise That would make sense! However, the LMSYS paper above showed that there is contamination between the train and test sets of GSM8K (= some of the test set questions are already rephrases of the train set :/ ) - so I think it's still likely it induced contamination.
There are currently tons of models using the MetaMath dataset, fine-tuning on MetaMath and also merging with MetaMath models (like I do).
IMO all models like this should be flagged.
What is the plan on this, @clefourrier?
Thanks.
@Weyaxi we are going to verify, using contamination detection techniques, whether this hypothesis is confirmed, and if so, it's likely we'll create a special flag for this.
Thanks for sharing these results!
One question I have is on what reference models to use for this particular test.
It seems the paper the repo is associated with is about the Min-K% Prob method for detecting contamination, which is reference-model-free, but the actual code in the repo needs a reference model.
If the model lineage is obvious (e.g. like the meta-math/MetaMath-Mistral-7B), it's straightforward to set the reference model.
However, for models with unclear lineage, setting the correct reference model seems to be pretty difficult, and from my personal testing, changing the reference model also changes the results quite a bit.
Furthermore, how to set the reference model for testing pretrained models is also not clear.
I have been personally testing the aforementioned contamination detection technique and would love to hear what others have to say about the points I made above!
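For reference, a rough, untested sketch of the reference-free Min-K% Prob idea from the paper behind that repo: average the k% lowest token log-probs of a candidate text under the target model; a higher (less negative) average suggests the text is more likely to have been seen during training. The choice of k and any decision threshold below are illustrative, not the repo's actual implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text, model, tokenizer, k=0.2):
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**enc).logits
    # log-probability the model assigned to each actual next token
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(
        1, enc.input_ids[0, 1:].unsqueeze(-1)
    ).squeeze(-1)
    n_lowest = max(1, int(len(token_log_probs) * k))
    lowest = torch.topk(token_log_probs, n_lowest, largest=False).values
    return lowest.mean().item()

# Usage sketch:
# model = AutoModelForCausalLM.from_pretrained("meta-math/MetaMath-Mistral-7B", device_map="auto")
# tokenizer = AutoTokenizer.from_pretrained("meta-math/MetaMath-Mistral-7B")
# score = min_k_prob(some_gsm8k_test_question, model, tokenizer)
```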
One question I have is on what reference models to use for this particular test.
Furthermore, how to set the reference model for testing pretrained models is also not clear.
I might misinterpret what you're asking, but I set it as a command argument when calling run.py: --ref_model mistralai/Mistral-7B-v0.1
To clarify, I'm asking about testing other models than meta-math/MetaMath-Mistral-7B, perhaps ones with less clear lineage + testing pretrained models.
However, for models with unclear lineage, setting the correct reference model seems to be pretty difficult, and from my personal testing, changing the reference model also changes the results quite a bit.
In hindsight, it might also be a good idea to force model cards to contain information on the model lineage.
I think all models should have to share what data they were trained with, because if a dataset is contaminated, for example, its merges and fine-tunes need to be flagged too. So I think everyone should have to explain their data and techniques. @clefourrier, for all models, no matter whose.
Hi everyone!
I decided to create my own proprietary benchmark which measures models' ability to follow commands and their creative writing. Feel free to check it out and suggest models you want to get additionally tested.
Hi!
@ChuckMcSneed, sounds cool! Would you want to make it a leaderboard? (If yes, you can send me an email in DM on Twitter so I add you to our Slack.)
@Q-bert - I agree, it's a bit hard to select this in model cards atm, but we are working on improving this.
@killawhale2 Regarding model references, I don't think it's actually mandatory for the code base - but @SaylorTwift is going to publish a tool to make it usable more easily during the week.
Since the rest of the discussion is about model lineages which are being discussed in other issues, I'll close this one if that works for you
I look forward to the tool!
I just wanted to point out that without setting a proper reference model, the contamination detection results may not be accurate.
For example, taking migtissera/Tess-XS-v1.2 which is thought to be not contaminated on GSM8K,
DATASET='gsm8k'
python src/run.py --target_model migtissera/Tess-XS-v1.2 --ref_model mistralai/Mistral-7B-v0.1 --data $DATASET --output_dir out/$DATASET --ratio_gen 0.4
will result in result < 0.1, %: 0.8, as @Mihaiii tested.
However,
DATASET='gsm8k'
python src/run.py --target_model migtissera/Tess-XS-v1.2 --data $DATASET --output_dir out/$DATASET --ratio_gen 0.4
results in result < 0.1, %: 0.98 by my local testing.
I suspect this is due to the reference model being too different from the model being tested. cc @clefourrier @SaylorTwift
Thank you very much for this input! @SaylorTwift is working with the author of the tool who will be able to give us insights on this, so we'll ask them what they think given your comment :)
Edit: going to reopen the discussion till the EOY, since it was originally to discuss model performances in general, and maybe people will want to discuss more models 😅
I'll create a new one for Q1 of 2024.
@clefourrier Yeah make it a leaderboard or whatever you want. Data is provided under WTFPL 🙂. I don't use twitter because I don't like Elon.
I think it is worth determining if MetaMath is more contaminated than the train set of GSM8K is. The point of the train set is to be trained on - contamination beyond the potential existing contamination between the train and test sets seems more interesting to me.
Hey everyone, I think claiming that the MetaMath data is contaminated is an unwarranted allegation. All the MetaMathQA data is sourced from the GSM8K and MATH train sets. Our data augmentation process does not involve any data from the test set.
We have transparently disclosed the source of each data point, and you can verify it here: https://huggingface.co/datasets/meta-math/MetaMathQA
Additionally, we have disclosed the code for obtaining MetaMathQA data, which can be checked at: https://github.com/meta-math/MetaMath/tree/main/code_for_generating_data
The MetaMath project is entirely transparent and open-source, encompassing code (for both data augmentation and model training), models (a range of models), and data (comprising all our data along with its sources). Anyone interested in contributing is welcome to join us on Hugging Face. Again, allegations of contamination in MetaMathQA data are detrimental to us (I personally feel quite disheartened). We have never utilized any test data, and all our data and models are transparently and openly available: https://huggingface.co/meta-math
Hi, my friends @Mihaiii @clefourrier @Weyaxi @euclaise @killawhale2 @Q-bert @ChuckMcSneed,
The MetaMath project is fully transparent and open-source, including all the data, models, and code.
MetaMath is always eager to make more contributions to open-source LLMs; if you have any questions, we would be more than happy to help!
One aspect I am suspicious of is the accuracy of this leak-detection code: for example, if you train solely on the GSM8K train set and compare the scores of models trained for 1 epoch versus 20 epochs, the results may differ, even though the data itself is unchanged.
https://github.com/swj0419/detect-pretrain-code-contamination is an excellent repository. However, its data-contamination detection might not be precise, which could explain why MetaMath, which only trained on the train set, was mistakenly flagged as contaminated.
Hi @Longhui98,
Super cool to get this feedback from you, the transparency is great! :)
Side question, if you have the time: did you account for self-contamination in MATH when building your dataset? It's not a trivial thing to anticipate, so I was wondering if you had taken it into account.
(Like lmsys reported in their contamination blog post)
Hey all, just wanted to make sure I understood the main takeaways from this thread in regard to the MetaMathQA dataset:
- There have been concerns that MetaMathQA may be contaminated with GSM8K
- Tests using a public contamination detection tool indicate that this may be the case
- This tool is not 100% accurate, and it's possible that both the reference model and the threshold may be playing significant roles here
- LMSys reported that rephrasing the test set of a dataset is still a form of contamination
- The MetaMathQA developer has made it clear that the dataset was constructed using the train split of GSM8K, and not the test set
Given that only the train set was used and not the test set (meaning the LMSYS report isn't necessarily relevant to this situation, unless GSM8K itself has train/test contamination I am unaware of), the hesitancy toward the results of the contamination detection tool, and the transparency from the MetaMathQA developer, is the current consensus that MetaMathQA is not contaminated and that we are safe to train models using this dataset?
I may have missed something so please let me know if I have misread or misinterpreted any of the information here! Thanks :)
Hi @OxxoCodes!
It's an accurate summary, thank you :)
Just to give you an example of how rephrasing the test set can be a big contamination factor (example with MATH, from the LMSYS report)
I think the main problem is that it's unclear how much of GSM8K is cross-contaminated between its train and test sets, and it would need someone to go look at each sample (and I have not had the time so far; I'll have the bandwidth at the end of Feb, I think). There are examples of rephrases between train and test of GSM8K in the LMSYS paper, but they are not as bad as the above example (which would probably be the only kind of rephrase I would consider contamination).
So to answer your question, I think you can fine-tune with MetaMathQA, and once we have the time to go back over GSM8K, if we find that some test examples are contaminated from the train set, we'll remove them a posteriori from the score computations and recompute the scores for every model, which shouldn't be too costly as we store all the predictions.
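For reference, a crude first-pass sketch of how one might surface candidate train/test rephrases in GSM8K with simple word-overlap. It's brute force (roughly ten million pairwise comparisons), the 0.6 threshold is arbitrary, and anything it flags would still need the manual review described above:

```python
from datasets import load_dataset

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

gsm8k = load_dataset("gsm8k", "main")
train_questions = gsm8k["train"]["question"]
test_questions = gsm8k["test"]["question"]

# flag test/train question pairs with high word overlap as possible rephrases
suspects = [
    (i, j)
    for i, test_q in enumerate(test_questions)
    for j, train_q in enumerate(train_questions)
    if jaccard(test_q, train_q) > 0.6
]
print(f"{len(suspects)} test/train pairs above the overlap threshold")
```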
I question whether the MATH train/test contamination is actually "rephrasing"; it seems more likely to me that it is just coincidentally semantically identical. It's a relatively simple question, so two people independently coming up with the same one doesn't seem implausible to me. Further, some extent of semantic similarity is necessary - there needs to be a cutoff chosen with some principle.
A.I. models score high on gaokao language tests, low in math
https://m.youtube.com/watch?v=dQfFsRyYwM8