
MMLU of ChatGPT/GPT3.5-turbo is 69~70, GSM8K 78.2

#1
by JosephusCheung - opened

See MMLU 69.1 and GSM8K 78.2 on https://opencompass.org.cn/leaderboard-llm (updated 2023/9/1); other sources report an MMLU score of 70.

JosephusCheung changed discussion title from MMLU of ChatGPT/GPT3.5-turbo is 69~70 to MMLU of ChatGPT/GPT3.5-turbo is 69~70, GSM8K 78.2
OpenChat org

Our MMLU and GSM8K results come from Chain-of-Thought Hub.

We use the same prompts and answer matching as Chain-of-Thought Hub, so the comparison should be fair.

| Model | # Params | Average | MT-Bench | AGIEval | BBH MC | TruthfulQA | MMLU | HumanEval | BBH CoT | GSM8K |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenChat-3.5 | 7B | 61.6 | 7.81 | 47.4 | 47.6 | 59.1 | 64.3 | 55.5 | 63.5 | 77.3 |
| ChatGPT (Yours) | ? | 61.5 | 7.94 | 47.1 | 47.6 | 57.7 | 67.3 | 48.1 | 70.1 | 74.9 |
| ChatGPT (Other Sources*) | ? | 65.3 | 7.94 | 47.1 | 47.6 | 57.7 | 69.1* | 73.2* | 70.1 | 78.2* |
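
For anyone checking the Average column: the thread does not say how it is computed, but the reported values are reproduced if MT-Bench (a 1-10 scale) is multiplied by 10 and then averaged with the other seven benchmarks. The sketch below rests on that assumption; the `average_score` helper and the `ROWS` dictionary are illustrative names, not part of any OpenChat tooling.

```python
# Scores copied from the table above. MT-Bench is on a 1-10 scale,
# the other benchmarks are percentages.
ROWS = {
    "OpenChat-3.5": {
        "MT-Bench": 7.81, "AGIEval": 47.4, "BBH MC": 47.6, "TruthfulQA": 59.1,
        "MMLU": 64.3, "HumanEval": 55.5, "BBH CoT": 63.5, "GSM8K": 77.3,
    },
    "ChatGPT (Yours)": {
        "MT-Bench": 7.94, "AGIEval": 47.1, "BBH MC": 47.6, "TruthfulQA": 57.7,
        "MMLU": 67.3, "HumanEval": 48.1, "BBH CoT": 70.1, "GSM8K": 74.9,
    },
    "ChatGPT (Other Sources*)": {
        "MT-Bench": 7.94, "AGIEval": 47.1, "BBH MC": 47.6, "TruthfulQA": 57.7,
        "MMLU": 69.1, "HumanEval": 73.2, "BBH CoT": 70.1, "GSM8K": 78.2,
    },
}


def average_score(scores):
    """Mean of all benchmarks, with MT-Bench rescaled to 0-100.

    Assumption: the x10 rescaling is inferred from the numbers,
    not stated anywhere in the thread.
    """
    values = [v * 10 if name == "MT-Bench" else v for name, v in scores.items()]
    return round(sum(values) / len(values), 1)


for model, scores in ROWS.items():
    print(model, average_score(scores))
```

Under this assumption, swapping in the higher MMLU, HumanEval, and GSM8K figures is the only thing that separates the two ChatGPT rows, which is why the choice of source matters for the headline comparison.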
OpenChat org

Thank you for your interest in our results. As you've rightly pointed out, the performance of ChatGPT has evolved over time, and there are numerous reports from different time periods. For a clearer comparison, our reported results are based on the data available around March, which we label as ChatGPT (March), sourced from Chain-of-Thought Hub and OpenAI's technical report.

imone changed discussion status to closed
