stjohn2007 committed on
Commit fec59af
1 Parent(s): 6c7f395

Update README.md

Add scores, explanations of MTBench

Files changed (1)
  1. README.md +33 -3
README.md CHANGED
@@ -53,7 +53,11 @@ This repository provides large language models developed by [TokyoTech-LLM](http
  ### MT-Bench JA
 
  * NOTE that the models with the `v0.1` suffix are newer versions compared to their original counterparts with the `hf` suffix.
- * We will add the scores of `Swallow-70b-instruct-hf` and existing models soon.
+ * We report overall (i.e., average over scores of the first and second turns), first, and second turn scores.
+
+
+ #### Overall
+
 
  |Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
  |---|---|---|---|---|---|---|---|---|---|
@@ -62,19 +66,45 @@ This repository provides large language models developed by [TokyoTech-LLM](http
  | Swallow-13b-instruct-v0.1 |0.3669|0.4816|0.5562|0.2769|0.1020|0.1505|0.4179|0.4347|0.5150|
  | Swallow-13b-instruct-hf |0.2004|0.1932|0.2552|0.1507|0.1184|0.1285|0.2641|0.2434|0.2500|
  | Swallow-70b-instruct-v0.1 |0.4513|0.4822|0.5353|0.3497|0.3492|0.2668|0.5553|0.4955|0.5767|
- | Swallow-70b-instruct-hf |N/A|N/A|N/A|N/A|N/A|N/A|N/A|N/A|N/A|
+ | Swallow-70b-instruct-hf |0.3259|0.2925|0.4283|0.3447|0.1562|0.1856|0.5634|0.3315|0.3071|
+
+ #### First Turn
+
+ |Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+ |---|---|---|---|---|---|---|---|---|---|
+ | Swallow-7b-instruct-v0.1 |0.3829|0.4960|0.4800|0.2220|0.2820|0.2164|0.3220|0.5440|0.4980|
+ | Swallow-7b-instruct-hf |0.2216|0.2830|0.2150|0.1590|0.1080|0.1470|0.3542|0.2450|0.2650|
+ | Swallow-13b-instruct-v0.1 |0.3948|0.5400|0.5220|0.3020|0.1040|0.1760|0.5040|0.5180|0.4920|
+ | Swallow-13b-instruct-hf |0.2304|0.2460|0.2640|0.1610|0.1360|0.1330|0.3070|0.3010|0.2950|
+ | Swallow-70b-instruct-v0.1 |0.4849|0.5720|0.5020|0.4780|0.3680|0.2467|0.5400|0.5720|0.5960|
+ | Swallow-70b-instruct-hf |0.3631|0.3420|0.4007|0.4220|0.1580|0.2044|0.6120|0.4280|0.3360|
+
+ #### Second Turn
+
+ |Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+ |---|---|---|---|---|---|---|---|---|---|
+ | Swallow-7b-instruct-v0.1 |0.3059|0.3940|0.4640|0.1441|0.1000|0.2253|0.2811|0.3724|0.4449|
+ | Swallow-7b-instruct-hf |0.1432|0.1567|0.1798|0.1603|0.1010|0.1085|0.1767|0.1343|0.1295|
+ | Swallow-13b-instruct-v0.1 |0.3353|0.4213|0.5911|0.2516|0.1000|0.1244|0.3194|0.3473|0.5394|
+ | Swallow-13b-instruct-hf |0.1692|0.1364|0.2453|0.1401|0.1000|0.1237|0.2199|0.1850|0.2050|
+ | Swallow-70b-instruct-v0.1 |0.4179|0.3913|0.5689|0.2184|0.3280|0.2884|0.5711|0.4171|0.5562|
+ | Swallow-70b-instruct-hf |0.2872|0.2398|0.4564|0.2647|0.1540|0.1676|0.5118|0.2311|0.2762|
+
 
  ## Evaluation Benchmarks
 
  ### MT-Bench JA
 
  We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
- We utilized the following artifacts:
+ We utilized the following settings:
 
  - Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
  - Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
  - Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
  - Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+ - Judge: `gpt-4-1106-preview`
+ - Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
+
 
 
  ## Usage
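
The scoring setting added in this diff (absolute judge scores normalized to a 0-1 range and averaged over five runs, with the overall score taken as the average of the first and second turns) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' evaluation code: the 1-10 judge scale, the divide-by-10 normalization, and the helper names `normalize`, `run_average`, and `overall` are all hypothetical and do not come from the README or from FastChat.

```python
from statistics import mean

# Assumption: the judge rates each answer on a 1-10 absolute scale,
# so dividing by 10 maps scores into the 0-1 range.
JUDGE_SCALE_MAX = 10


def normalize(score: float) -> float:
    """Map an absolute judge score onto the 0-1 range (assumed: score / 10)."""
    return score / JUDGE_SCALE_MAX


def run_average(scores_per_run: list[list[float]]) -> float:
    """Average normalized scores within each run, then across runs."""
    per_run = [mean(normalize(s) for s in run) for run in scores_per_run]
    return mean(per_run)


def overall(first_turn: float, second_turn: float) -> float:
    """Simple reading of the note: mean of first-turn and second-turn scores."""
    return (first_turn + second_turn) / 2


if __name__ == "__main__":
    # Hypothetical judge scores for one category: five runs, three questions each.
    runs = [[7, 5, 3], [6, 5, 4], [7, 4, 3], [8, 5, 3], [6, 6, 4]]
    print(f"category score: {run_average(runs):.4f}")
    print(f"overall: {overall(0.4849, 0.4179):.4f}")
```

Averaging within each run before averaging across runs is one possible choice; pooling all judgments and averaging once would give the same value only when every run contains the same number of scored answers.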