index,model,run,accuracy,precision,recall,f1,ratio_valid_classifications
1,internlm2_5-7b-chat,internlm2_5-7b-chat,0.7496666666666667,0.8041871978859686,0.7496666666666667,0.7660159670998776,1.0
2,internlm2_5-7b-chat-1m,internlm2_5-7b-chat-1m,0.803,0.8031411888150441,0.803,0.8028064320197301,1.0
3,Mistral-7B-v0.3-Chinese-Chat,Mistral-7B-v0.3-Chinese-Chat,0.75,0.7885868317699068,0.75,0.7648234347578796,1.0
4,Qwen2-7B-Instruct,Qwen2-7B-Instruct,0.759,0.8005303465799652,0.759,0.7748745026535183,1.0
5,Llama3.1-8B-Chinese-Chat,Llama3.1-8B-Chinese-Chat,0.78,0.810582723471486,0.78,0.7924651054056209,1.0
6,Llama3.1-70B-Chinese-Chat,Llama3.1-70B-Chinese-Chat,0.7963333333333333,0.8248972880055918,0.7963333333333333,0.8076868978089201,1.0
7,Qwen2-72B-Instruct,Qwen2-72B-Instruct,0.784,0.8354349234761956,0.784,0.804194683154365,1.0
8,Ensemble Model,Ensemble Model,0.8193333333333334,0.8407464756633664,0.8193333333333334,0.828054127213081,1.0
9,gpt-4o-mini (0-shot),gpt-4o-mini (0-shot),0.7176666666666667,0.785706730193659,0.7176666666666667,0.7296061848734905,1.0
10,gpt-4o (10-shot),gpt-4o (10-shot),0.7916666666666666,0.8227707658360168,0.7916666666666666,0.803614688453356,0.9996666666666667