A few model submissions that didn't work

#2
opened by KnutJaegersberg

Qwen/Qwen1.5-110B-Chat (strong multilingual skills in European languages)
seedboxai/KafkaLM-8x7B-German-V0.1 (neat German fine-tune)
dbmdz/german-gpt2 (first German GPT-2 available)
GroNLP/gpt2-small-italian
THUMT/mGPT
google/mt5-xxl
CohereForAI/aya-23-35B (expected, since you have to request access)
CohereForAI/aya-23-8B (expected, since you have to request access)
CohereForAI/c4ai-command-r-plus (expected, since you have to request access)
vilm/vulture-40b (needs trust_remote_code=True)
RWKV/v5-EagleX-v2-7B-HF (needs trust_remote_code=True)
xverse/XVERSE-65B-Chat (needs trust_remote_code=True)
xverse/XVERSE-MoE-A4.2B-Chat (needs trust_remote_code=True)
lightonai/alfred-40b-1023 (needs trust_remote_code=True)

In all cases, the error message says: "Model not found on the hub!"
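For what it's worth, here is a rough way to check whether a repo is visible or merely gated, assuming the huggingface_hub client; this is only a sketch, not how the leaderboard backend actually validates submissions:

```python
# Rough visibility check with huggingface_hub (my assumption, not the
# leaderboard's actual logic): gated repos like the CohereForAI ones are
# still visible, they just require approved access.
from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

for repo in ["CohereForAI/aya-23-35B", "Qwen/Qwen1.5-110B-Chat"]:
    try:
        info = model_info(repo)
        # For gated repos the metadata is public; info.gated is "auto" or "manual".
        print(repo, "is on the hub, gated =", info.gated)
    except RepositoryNotFoundError:
        print(repo, "is really not found (typo, private, or removed)")
```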

Both the Cohere and Qwen models should be quite good; I'm also curious about Eagle.

I've submitted CohereForAI/aya-101 too, but perhaps the eval fails because it is T5-based.
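If it helps with debugging: aya-101 is an encoder-decoder (mT5-style) model, so a harness that only instantiates causal LMs would choke on it. A minimal loading sketch with transformers (just my guess at the failure mode):

```python
# aya-101 is mT5-based, so it loads via the seq2seq class rather than
# AutoModelForCausalLM; a causal-LM-only evaluation harness would fail here.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
```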

Oh, I hope I didn't crash the board.

Occiglot org

I'll add them manually to the queue.

I got the same issue for our latest KafkaLM Mixtral version:
seedboxai/KafkaLM-Mixtral-8x7B-V0.2

The new GLM-4 is said to have good multilingual abilities, too, but it requires trust_remote_code=True.

https://github.com/THUDM/GLM-4
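For anyone unfamiliar with the flag: loading such a repo executes its custom modeling code, which is why it has to be opted into explicitly. A minimal sketch with transformers (the repo id is just an example):

```python
# trust_remote_code=True runs Python files shipped inside the model repo,
# so only enable it for repositories you trust.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "THUDM/glm-4-9b-chat"  # example; any repo with custom modeling code

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```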


;P

Occiglot org

I can't auto-enable trust_remote_code=True for safety reasons, but I'll manually check all of the models. We currently have some internal updates on the leaderboard; that's why so many models are in the queue.

Tried to submit the now-released Qwen/Qwen2-72B-Instruct
and a few other Qwen2 models, but it could not find the models on the hub.

This is another interesting model lineup because they added support for 27 languages!

VAGOsolutions/SauerkrautLM-1.5b
not found on the hub

Wishlist for inclusion, as these work fairly well for non-English languages:
CohereForAI/c4ai-command-r-v01
CohereForAI/c4ai-command-r-plus
microsoft/Phi-3-medium-4k-instruct
microsoft/Phi-3-medium-128k-instruct

Looks like this leaderboard is ghosted.

Occiglot org

Hey!

Thanks for contributing to the leaderboard. Currently, I'm working on this leaderboard alone, which is why there are sometimes delays. I'm very sorry about that.

I just added some of the models to the queue.

I would also like to inform you that we decided in the last Occiglot meeting on Discord to limit the model size to 10B parameters.

Best wishes
Fabio

I think 10B is too small for true multilingual models right now, unless we get a radical improvement in architecture.
This kind of gives the impression that you simply want to hide Mixtral, Phi-3-medium and Command-R from your leaderboard, and makes this leaderboard quite useless IMHO.
Even if one might not have the resources to run Mixtral, Command-R-plus, etc. oneself, including them as baselines in the leaderboard would be meaningful for judging whether it would pay off to increase hardware capacity.
I would rather omit merges such as Spaetzle that do not represent a true advancement but likely only overfit to the leaderboard.

Occiglot org

Hey @kno10 ,

Maybe I should have been more precise about the reason behind the parameter limit. The limit is simply due to hardware constraints. As Occiglot is for now an open research project, we can only provide so much compute. However, we are working on improving that. We are not trying to "hide" any model. We already did some initial evaluations with Mixtral-8x22B-v0.1 (see the leaderboard). There is no reason to exclude any LLM (except if we can't evaluate it on our GPUs).

I would also like to add that with your statement "10B is too small for true multilingual models" you undermine the great work of all the researchers who trained and contributed the <10B models to the leaderboard.

Regarding the merged models, I'd like to point out that you are raising a valid concern. We will discuss this topic in our next Occiglot meeting and will most likely find a solution (maybe a flag that lets you exclude/hide the merged models from the leaderboard).

I do not mean to "undermine" the work; this is simply a sober observation of what we measure here, not one driven by hype.
The Mixtral model (which is not optimized for German or this benchmark) scores 68.30 on average; the best 8B model scores 64.49 and performs best on English.
This is (A) a significant difference, and (B) these scores are still too low IMHO for practical applications (obviously they do not translate directly into error rates for real applications, but trying these small models in real multilingual applications has so far been very disappointing).

I would expect both Phi-3-medium and Command-R to perform quite well. In the subjective assessment of users, these work much better than any 7B/8B model tested so far.
Hence, without these baselines, such a benchmark is of little use to me.

Also, there IS substantial (over-)fitting to the prompting used by the leaderboard benchmark:
Pure Llama-3 8B scores 63.08 for English, Sauerkraut Llama-3 8B scores 74.71 for English.
I dare say this is NOT because the model is significantly more powerful, but because it is much more tuned to the evaluation benchmark.

The classic Open LLM Leaderboard had become useless when it was flooded with merges that overfit the benchmark, often even with data contamination.

Thank you for adding Phi-3-medium and Command-R. They perform as expected, between the 8B and the 8x22B models.
