Llama 3.1 405B Instruct beats GPT-4o on MixEval-Hard
Just ran MixEval for 405B, Sonnet-3.5 and 4o, with 405B landing right between the other two at 66.19.
The GPT-4o result of 64.7 replicated locally, but Sonnet-3.5 actually scored 70.25 and 69.45 in my replications. It's still well ahead of the other two, though.