Is this "large" or "large2"?
On the Mistral website there are 3 models:
Large 2, Codestral and NeMo.
Is this large2?
This model is GREAT. I tried it on the Mistral website!
I am mainly interested in sheer reasoning, and it's really good.
I wonder how many parameters are the minimum needed to have "good" reasoning.
Mistral proved more than once that it's about training and datasets and not just the number of parameters.
P.S.
If there is any way to contact Mistral directly, I would like to explain a few of my ideas in that regard.
There's no such thing as a "minimum parameter count", at least not that we know of. I think it has been shown recently that just about any passable test can be passed by a relatively small model too, if it was trained extensively on huge amounts of data. This even includes base64 encoding/decoding tasks and uncommon foreign languages.
GPT-4o-mini is rumored/said to be very small (think Gemma 2 size). The only obvious drawbacks I've personally noticed from models becoming too small relate to capacity and instruction following. Meaning, the model cannot effectively account for many things at the same time when outputting a response (think response style or structure in addition to the answer itself being correct/functional), and it will adhere to your instructions less if they are too specific or intricate.
@ID0M
Parameter count matters for sure. No 7B model of today, even trained extensively, can "beat" a 70B (unless the 70B is really bad).
@mistral-ai
is great in this regard, but I doubt they can make a 7B-14B model as clever as Claude is today (or as clever as their 123B model).
@ID0M Parameter count matters for sure. No 7B model of today, even trained extensively, can "beat" a 70B (unless the 70B is really bad).
I wouldn't be so sure about that. Qwen2-7B will comfortably beat Llama2-70B. And we don't know the size of some closed-source, cost-efficient models; those are unlikely to be made public for obvious reasons. It's all about the progress and your perspective on it. On the surface it might look like Meta has caught up with GPT-4o (which is Turbo further optimized for cost), but if their model is 3+ times bigger, that isn't exactly the case...
@ID0M
Interesting perspective. I ran a test recently by adding noise to model weights. Even on a 7B model, adding a lot of noise did not impact the model's abilities much (and past a certain threshold it degraded reasoning first, then knowledge, and then even speech).
Check this out (and the related colab): https://huggingface.co/ZeroWw/Mistral-7B-Instruct-v0.3-SILLY
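For anyone curious how such a perturbation experiment can be set up, here is a minimal sketch using PyTorch and transformers. This is not the code from the linked repo or colab: the model id, the noise scale, and the per-tensor scaling are assumptions made for illustration; the SILLY repo may inject noise differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed base model for the sketch (the linked repo is derived from Mistral-7B-Instruct-v0.3).
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

noise_scale = 0.01  # assumed value; sweep it to find the threshold where quality degrades

with torch.no_grad():
    for param in model.parameters():
        # Scale the Gaussian noise by each tensor's own standard deviation so
        # every layer is perturbed proportionally rather than by a fixed amount.
        param.add_(torch.randn_like(param) * param.std() * noise_scale)

# Quick qualitative check: does the perturbed model still reason and write coherently?
prompt = "If Alice is taller than Bob and Bob is taller than Carol, who is the shortest?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Re-running the same prompts at increasing noise scales is one simple way to observe the kind of staged degradation described above (reasoning, then knowledge, then fluency).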