Reflection-Llama-3.1-70B burst onto the scene, surprising everyone! With its novel Reflection-Tuning technique and 70 billion parameters, it promised not just to match but to surpass the likes of Claude 3.5 and GPT-4o, redefining what open-source could achieve.
And now, everything is crumbling!
The model's performance metrics, especially its 99.2% accuracy on the high school math dataset GSM8K, have raised eyebrows. It looked like a valedictorian on paper, but based on the open weights, it hardly performs like one.
The model's config in Transformers matches the Llama 3 architecture, not Llama 3.1.
The weights were released publicly, but their performance doesn't line up with the claims. The tuning has been restarted, and the author says updated weights will be uploaded soon!
And the big one: the black-boxed API shared is not at all like the open weights. Even more, when pushed hard, the API endpoint claims to be an LLM by Anthropic!
But you might ask: didn't this model beat Anthropic's Claude 3.5? On paper, yes.
So did Claude 3.5 lose to itself? No. The benchmark is supposed to be zero-shot, but the reported results were apparently obtained with CoT/few-shot prompting, so the comparison isn't apples to apples!
And to top it all off, the reflecting back idea is not new. But I don't think that's a big deal.
I took some time to look through everything, and once tested, this model appears to perform worse than the base Llama 3.1 70B.
I still believe the Reflection-Tuning technique is promising. These are the papers discussing its efficacy:
- "Think Before You Speak: Training Language Models With Pause Tokens"
- "Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning"
PS: Matt Shumer/@mattshumer_ (Twitter Handle) (Reflection-Llama-3.1-70B creator) is a great researcher. Let's wait for his updated weights!
Great YT video: https://youtu.be/Xtr_Ll_A9ms
Hugging Face Clem Delangue 🤗?
Can you please help here if possible? This will be the pinnacle of open-source!