LiyuanLucasLiu committed
Commit e87098c
Parent: c4cfe3e

Update README.md


added phi3.5 result

Files changed (1):
  README.md (+23, -22)
README.md CHANGED
@@ -69,29 +69,30 @@ To understand the capabilities, we compare GRIN MoE with a set of models over a
 
 ### Popular Benchmarks
 
-| | GRIN MoE (16x3.8B) | Mixtral (8x7B) | Mixtral (8x22B) | Llama3 (8B) | Llama3 (70B) | GPT3.5 | GPT4o |
-|---------------|-----------|---------|---------|--------|--------|--------|-------|
-| MMLU | 79.4 | 70.5 | 76.2 | 66.5 | 80.2 | 71.4 | 86.9 |
-| HellaSwag | 83.7 | 70.4 | 79.0 | 71.1 | 82.6 | 78.8 | 91.7 |
-| ANLI | 60.6 | 55.2 | 65.2 | 57.3 | 68.3 | 58.1 | 75.7 |
-| GSM-8K | 90.4 | 64.7 | 83.8 | 77.4 | 93.5 | 78.1 | 93.8 |
-| Math | 58.9 | 11.1 | 41.8 | 28.2 | 51.2 | 45.3 | 67.8 |
+Note that a different version of mid-training and post-training, emphasizing long-context and multilingual ability, has been conducted and released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct.
+
+| | GRIN MoE (16x3.8B) | Phi-3.5-MoE (16x3.8B) | Mixtral (8x7B) | Mixtral (8x22B) | Llama3 (8B) | Llama3 (70B) | GPT3.5 | GPT4o |
+|---------------|-----------|---------|---------|---------|--------|--------|--------|-------|
+| MMLU | 79.4 | 78.9 | 70.5 | 76.2 | 66.5 | 80.2 | 71.4 | 86.9 |
+| HellaSwag | 83.7 | 83.8 | 70.4 | 79.0 | 71.1 | 82.6 | 78.8 | 91.7 |
+| ANLI | 60.6 | 59.8 | 55.2 | 65.2 | 57.3 | 68.3 | 58.1 | 75.7 |
+| GSM-8K | 90.4 | 88.7 | 64.7 | 83.8 | 77.4 | 93.5 | 78.1 | 93.8 |
 | MedQA | 70.4 | 62.2 | 67.9 | 60.5 | 78.5 | 63.4 | 88.9 |
-| AGIEval | 48.2 | 45.2 | 54.0 | 42.0 | 56.9 | 48.4 | 37.6 |
-| TriviaQA | 73.9 | 78.5 | 82.2 | 67.7 | 84.5 | 85.8 | 66.0 |
-| Arc-C | 92.0 | 87.3 | 91.3 | 82.8 | 93.0 | 87.4 | 97.0 |
-| Arc-E | 98.0 | 95.6 | 96.9 | 93.4 | 98.2 | 96.3 | 99.0 |
-| PIQA | 89.0 | 86.0 | 85.0 | 75.7 | 85.3 | 86.6 | 92.9 |
-| SociQA | 79.5 | 75.9 | 78.2 | 73.9 | 81.1 | 68.3 | 81.4 |
-| BigBench-Hard | 81.4 | 69.7 | 81.8 | 51.5 | 80.2 | 68.3 | 81.2 |
-| WinoGrande | 81.4 | 62.0 | 75.3 | 65.0 | 83.3 | 68.8 | 89.3 |
-| OpenBookQA | 89.8 | 85.8 | 88.6 | 82.6 | 91.8 | 86.0 | 95.2 |
-| BoolQ | 83.4 | 77.6 | 82.7 | 80.9 | 89.1 | 79.1 | 90.6 |
-| CommonSenseQA | 81.8 | 78.1 | 82.0 | 79.0 | 84.4 | 79.6 | 88.5 |
-| TruthfulQA | 74.5 | 60.1 | 67.4 | 63.2 | 81.9 | 85.8 | 85.6 |
-| HumanEval | 74.4 | 37.8 | 39.6 | 60.4 | 78.7 | 62.2 | 92.1 |
-| MBPP | 80.3 | 60.2 | 70.7 | 67.7 | 81.3 | 77.8 | 90.4 |
-| Average | 78.6 | 66.7 | 74.5 | 67.3 | 81.2 | 73.8 | 84.8 |
+| AGIEval | 48.2 | 50.3 | 45.2 | 54.0 | 42.0 | 56.9 | 48.4 | 37.6 |
+| TriviaQA | 73.9 | 71.6 | 78.5 | 82.2 | 67.7 | 84.5 | 85.8 | 66.0 |
+| Arc-C | 92.0 | 91.0 | 87.3 | 91.3 | 82.8 | 93.0 | 87.4 | 97.0 |
+| Arc-E | 98.0 | 97.1 | 95.6 | 96.9 | 93.4 | 98.2 | 96.3 | 99.0 |
+| PIQA | 89.0 | 88.6 | 86.0 | 85.0 | 75.7 | 85.3 | 86.6 | 92.9 |
+| SociQA | 79.5 | 78.0 | 75.9 | 78.2 | 73.9 | 81.1 | 68.3 | 81.4 |
+| BigBench-Hard | 81.4 | 79.1 | 69.7 | 81.8 | 51.5 | 80.2 | 68.3 | 81.2 |
+| WinoGrande | 81.4 | 81.3 | 62.0 | 75.3 | 65.0 | 83.3 | 68.8 | 89.3 |
+| OpenBookQA | 89.8 | 89.6 | 85.8 | 88.6 | 82.6 | 91.8 | 86.0 | 95.2 |
+| BoolQ | 83.4 | 84.5 | 77.6 | 82.7 | 80.9 | 89.1 | 79.1 | 90.6 |
+| CommonSenseQA | 81.8 | 83.5 | 78.1 | 82.0 | 79.0 | 84.4 | 79.6 | 88.5 |
+| TruthfulQA | 74.5 | 77.5 | 60.1 | 67.4 | 63.2 | 81.9 | 85.8 | 85.6 |
+| HumanEval | 74.4 | 70.7 | 37.8 | 39.6 | 60.4 | 78.7 | 62.2 | 92.1 |
+| MBPP | 80.3 | 80.8 | 60.2 | 70.7 | 67.7 | 81.3 | 77.8 | 90.4 |
+| Average | 79.6 | 79.2 | 69.6 | 76.2 | 69.4 | 82.8 | 75.2 | 85.7 |
 
 ### Livebench
 Performance on LiveBench-2024-07-25. Models are ranked by their average score (AVG). *Baseline results are referenced from the official benchmark.
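
One detail of the diff above that is easy to miss: the old table's Math row is removed, so every Average is recomputed over 19 benchmarks instead of 20. That is why the Average row changes even for models whose per-benchmark scores are identical in both tables (e.g., Mixtral 8x7B moves from 66.7 to 69.6, since (66.7 × 20 − 11.1) / 19 ≈ 69.6). A minimal Python sketch that reproduces the new GRIN MoE average from the scores in the table:

```python
# Recompute the "Average" row of the new table for the GRIN MoE column.
# The 19 scores below are GRIN MoE's per-benchmark results, MMLU through
# MBPP, with the dropped "Math" row (58.9) excluded.
grin_scores = [
    79.4, 83.7, 60.6, 90.4, 70.4, 48.2, 73.9, 92.0, 98.0, 89.0,
    79.5, 81.4, 81.4, 89.8, 83.4, 81.8, 74.5, 74.4, 80.3,
]
assert len(grin_scores) == 19  # 20 benchmarks minus the removed Math row

average = sum(grin_scores) / len(grin_scores)
print(f"{average:.1f}")  # 79.6, matching the new table's Average row
```

The same recomputation reproduces the other columns as well, e.g. (84.8 × 20 − 67.8) / 19 ≈ 85.7 for GPT4o.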
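The note added by this commit points to the separately released Phi-3.5-MoE-instruct checkpoint. Below is a minimal sketch for trying that checkpoint with the standard transformers API; the specific flags (dtype, device map, trust_remote_code) are common Hugging Face usage assumed here, not something this commit specifies, so check the model card for the recommended settings.

```python
# Hedged sketch: load the Phi-3.5-MoE-instruct checkpoint referenced in the
# note above and run a short generation. Flags are typical transformers
# usage, not taken from this commit.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # place shards on available devices
    trust_remote_code=True,  # assumption: the repo may ship custom modeling code
)

prompt = "Question: What is 12 * 7? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```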