stjohn2007
committed on
Commit fec59af
1 Parent(s): 6c7f395
Update README.md
Add scores, explanations of MTBench
README.md
CHANGED
@@ -53,7 +53,11 @@ This repository provides large language models developed by [TokyoTech-LLM](http
### MT-Bench JA

* NOTE that the models with the `v0.1` suffix are newer versions compared to their original counterparts with the `hf` suffix.
-* We
+* We report the overall score (i.e., the average over the first- and second-turn scores) together with the first-turn and second-turn scores.
+
+
+#### Overall
+

|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
|---|---|---|---|---|---|---|---|---|---|
@@ -62,19 +66,45 @@ This repository provides large language models developed by [TokyoTech-LLM](http
| Swallow-13b-instruct-v0.1 |0.3669|0.4816|0.5562|0.2769|0.1020|0.1505|0.4179|0.4347|0.5150|
| Swallow-13b-instruct-hf |0.2004|0.1932|0.2552|0.1507|0.1184|0.1285|0.2641|0.2434|0.2500|
| Swallow-70b-instruct-v0.1 |0.4513|0.4822|0.5353|0.3497|0.3492|0.2668|0.5553|0.4955|0.5767|
-| Swallow-70b-instruct-hf |
+| Swallow-70b-instruct-hf |0.3259|0.2925|0.4283|0.3447|0.1562|0.1856|0.5634|0.3315|0.3071|
+
+#### First Turn
+
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-7b-instruct-v0.1 |0.3829|0.4960|0.4800|0.2220|0.2820|0.2164|0.3220|0.5440|0.4980|
+| Swallow-7b-instruct-hf |0.2216|0.2830|0.2150|0.1590|0.1080|0.1470|0.3542|0.2450|0.2650|
+| Swallow-13b-instruct-v0.1 |0.3948|0.5400|0.5220|0.3020|0.1040|0.1760|0.5040|0.5180|0.4920|
+| Swallow-13b-instruct-hf |0.2304|0.2460|0.2640|0.1610|0.1360|0.1330|0.3070|0.3010|0.2950|
+| Swallow-70b-instruct-v0.1 |0.4849|0.5720|0.5020|0.4780|0.3680|0.2467|0.5400|0.5720|0.5960|
+| Swallow-70b-instruct-hf |0.3631|0.3420|0.4007|0.4220|0.1580|0.2044|0.6120|0.4280|0.3360|
+
+#### Second Turn
+
+|Model|Average|Writing|Roleplay|Reasoning|Math|Coding|Extraction|STEM|Humanities|
+|---|---|---|---|---|---|---|---|---|---|
+| Swallow-7b-instruct-v0.1 |0.3059|0.3940|0.4640|0.1441|0.1000|0.2253|0.2811|0.3724|0.4449|
+| Swallow-7b-instruct-hf |0.1432|0.1567|0.1798|0.1603|0.1010|0.1085|0.1767|0.1343|0.1295|
+| Swallow-13b-instruct-v0.1 |0.3353|0.4213|0.5911|0.2516|0.1000|0.1244|0.3194|0.3473|0.5394|
+| Swallow-13b-instruct-hf |0.1692|0.1364|0.2453|0.1401|0.1000|0.1237|0.2199|0.1850|0.2050|
+| Swallow-70b-instruct-v0.1 |0.4179|0.3913|0.5689|0.2184|0.3280|0.2884|0.5711|0.4171|0.5562|
+| Swallow-70b-instruct-hf |0.2872|0.2398|0.4564|0.2647|0.1540|0.1676|0.5118|0.2311|0.2762|
+

## Evaluation Benchmarks

### MT-Bench JA

We used [Japanese MT-Bench](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question) to assess the instruction-following capabilities of models.
-We utilized the following
+We utilized the following settings:

- Implementation: FastChat [Zheng+, 2023] (commit #e86e70d0)
- Question: [Nejumi LLM-Leaderboard NEO, mtbench_ja_question_v3](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_question/v3)
- Reference Answer: [Nejumi LLM-Leaderboard NEO, mtbench_ja_referenceanswer_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_referenceanswer/v1)
- Prompt for Judge: [Nejumi LLM-Leaderboard NEO, mtbench_ja_prompt_v1](https://wandb.ai/wandb-japan/llm-leaderboard/artifacts/dataset/mtbench_ja_prompt/v1)
+- Judge: `gpt-4-1106-preview`
+- Scoring: Absolute scale normalized to a 0-1 range, averaged over five runs.
+

## Usage
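A note on how the scores above fit together: the added bullet describes the overall score as the average over first- and second-turn scores. The README does not include the aggregation code, so the snippet below is only a minimal sketch of that reading, reusing a few per-category values from the Swallow-70b-instruct-v0.1 rows; the exact averaging order behind the published numbers (per question versus per category, across five runs) may differ, so the result need not reproduce the Overall table cell for cell.

```python
# Minimal sketch (not the authors' script): combine first- and second-turn
# MT-Bench JA scores into an "overall" score by averaging the two turns.
from statistics import mean

# A few real per-category values from the Swallow-70b-instruct-v0.1 rows above,
# already on the normalized 0-1 scale used in the tables.
first_turn = {"Writing": 0.5720, "Roleplay": 0.5020, "Math": 0.3680}
second_turn = {"Writing": 0.3913, "Roleplay": 0.5689, "Math": 0.3280}

# Per-category overall score = mean of the two turns for that category.
overall = {cat: mean([first_turn[cat], second_turn[cat]]) for cat in first_turn}
print({cat: round(v, 4) for cat, v in overall.items()})

# The "Average" column would then be this mean taken over all eight categories,
# not just the three shown here.
print("overall average over shown categories:", round(mean(overall.values()), 4))
```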
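On the Scoring bullet: judge scores are reported on an absolute scale normalized to a 0-1 range and averaged over five runs, but the mapping itself is not spelled out. MT-Bench judges typically rate answers from 1 to 10, and the repeated 0.1000 cells in the Math columns are consistent with simply dividing the judge score by 10, so the sketch below assumes that mapping; it is an illustration, not the evaluation script behind these numbers.

```python
# Illustrative only: one plausible reading of "absolute scale normalized to a
# 0-1 range, averaged over five runs" for MT-Bench-style 1-10 judge scores.
from statistics import mean

def normalize(judge_score: float) -> float:
    """Assumed mapping: divide a 1-10 judge score by 10 to land in the 0-1 range."""
    return judge_score / 10.0

# Hypothetical raw judge scores for one (model, category, turn) cell
# across five independent evaluation runs.
runs = [3.0, 4.0, 3.0, 5.0, 4.0]

# Normalize each run's score, then average over the five runs.
cell_score = mean(normalize(s) for s in runs)
print(round(cell_score, 4))  # 0.38
```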