YC-Chen committed
Commit 29e7be5 · verified · 1 Parent(s): 88c0b10

Update README.md

Files changed (1): README.md (+59 -38)
README.md CHANGED
@@ -18,7 +18,7 @@ This achievement marks a significant milestone as it is the first instance of vo
  [Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) derives from the base model Breeze-7B-Base-v0.1
  and has undergone supervised fine-tuning with over 1 million instances to
  sharpen its capabilities. This fine-tuned model demonstrates impressive performance in benchmarks for both English and Traditional Chinese, surpassing the results of
- Taiwan-LLM-7B-v2.1-Chat, Taiwan-LLM-13B-v2.0-Chat and Qwen-7B-Chat in Traditional Chinese assessments. It also excels in some benchmarks against Yi-6B-Chat.
+ Taiwan-LLM-7B-v2.1-chat, Taiwan-LLM-13B-v2.0-chat and Qwen-7B-chat in Traditional Chinese assessments. It also excels in some benchmarks against Yi-6B-Chat.
  In English evaluations, Breeze-7B-Instruct-v0.1 shows comparable results to Mistral-7B-Instruct-v0.1 on the MMLU and MT-Bench benchmarks. [See [Chat Model Performance](#chat-model-performance).]


@@ -61,6 +61,12 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa

  ## Base Model Performance

+ **TMMLU+**, **DRCD**, and **Table** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
+ [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
+ and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
+ We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+
+
  | Models | | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
  |----------------------------------------------|--------|--------------|-------------|-------------|------------|
  | | |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Knowledge|
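
The hunk above points the MMLU numbers at code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). A minimal sketch of a comparable run with the upstream harness's Python API (v0.4+) follows; the TCEval-v2 tasks (TMMLU+, DRCD, Table) exist only in the revised fork, so only the standard `mmlu` task is shown, and the exact settings behind the reported scores are assumptions.

```python
# Minimal sketch (assumption): upstream lm-evaluation-harness v0.4+ Python API.
# The TCEval-v2 tasks (TMMLU+, DRCD, Table) live in MediaTek-Research's revised
# fork and are not upstream task names, so only MMLU is evaluated here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MediaTek-Research/Breeze-7B-Base-v0.1,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,        # illustrative; shot counts vary across tables in this README
    batch_size="auto",
)
print(results["results"])  # per-task and aggregate accuracies
```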
@@ -74,8 +80,10 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa


  \* Few-shot learning cannot effectively guide the model to generate the proper answer.
+
+ **Category ACC of TMMLU+ (5 shot)**

- | Category ACC of TMMLU+ (5 shot) | STEM | Social Science | Humanities | Other |
+ | Models | STEM | Social Science | Humanities | Other |
  |-----------------------------------------------------|--------------|----------------|------------|------------|
  | Yi-34B | 56.03 | 73.06 | 61.12 | 62.19 |
  | Qwen-14B | 46.51 | 58.20 | 51.12 | 49.38 |
@@ -85,42 +93,9 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  | Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 |


- **TMMLU+**, **DRCD**, and **Table** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
- [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
- and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
- We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
-
-
- ## Chat Model Performance
-
- | Models | | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench-tw (Score) | MMLU (ACC) | MMLU (ACC) | MT-Bench (Score) |
- |--------------------------------------------|--------|--------------|--------------|-----------|-------------|--------|------------|------------|------------------|
- | | |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|TC, Chat |EN, Knowledge|EN, Knowledge|EN, Chat |
- | | | 0 shot | 5 shot | 3 shot | 0 shot | 0 shot | 0 shot | 5 shot | 0 shot |
- | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B | 54.87 | | | 36.81 | 6.9 | 71.04 | | 7.6 |
- | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B | 48.41 | | | 41.67 | 6.4 | 64.91 | | 7.2 |
- | [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) | 6B | 44.79 | | | 25.69 | 5.0 | 59.45 | | 6.0 |
- | [gpt-3.5-turbo](https://openai.com) | | 41.76 | | | | 7.1 | 70.00 | | 7.9 |
- | [**Breeze-7B-Instruct-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) | 7B | 41.61 | | | 45.83 | 5.7 | 63.26 | | 7.1 |
- | [**Breeze-7B-Instruct-64k-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) | 7B | 40.99 | | | 36.11 | 5.5 | 63.68 | | 7.1 |
- | [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) | 7B | 40.02 | | | 33.33 | 5.4 | 55.94 | | 6.2 |
- | [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) | 13B | 29.47 | | | 23.61 | 5.0 | 50.50 | | -* |
- | [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) | 7B | 28.08 | | | 31.25 | 4.2 | 42.72 | | -* |
-
-
- \* Taiwan-LLM models responds to multi-turn questions (English) in Traditional Chinese.
-
- | Category ACC of TMMLU+ (0 shot) | STEM | Social Science | Humanities | Other |
- |-----------------------------------------------------|--------------|----------------|------------|------------|
- | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 |
- | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 |
- | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 |
- | gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 |
- | **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 |
- | **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 |
- | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 |
- | Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 |
- | Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 |
+
+
+ ## Chat Model Performance

  **TMMLU+**, **DRCD**, **Table**, and **MT-Bench-tw** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
  [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
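
Both performance sections source the Traditional Chinese benchmarks from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2). A hedged sketch of inspecting that dataset with the `datasets` library follows; the subset (config) names are not listed in this diff, so the snippet discovers them rather than assuming any.

```python
# Hedged sketch: assumes only that MediaTek-Research/TCEval-v2 is a public
# Hugging Face dataset. Its config (subset) names are not shown in this diff,
# so they are listed at runtime instead of being hard-coded.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("MediaTek-Research/TCEval-v2")
print(configs)

# Load the first discovered subset to inspect its splits and fields.
# (Add trust_remote_code=True if the repo turns out to use a loading script.)
ds = load_dataset("MediaTek-Research/TCEval-v2", configs[0])
print(ds)
```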
@@ -130,13 +105,59 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  We use the code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to evaluate **MT-Bench-tw** and **MT-Bench**.


+ | Models | |MT-Bench-tw (Score) | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
+ |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
+ | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
+ | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
+ | [gpt-3.5-turbo](https://openai.com) | |7.1 | 41.76 | | | |7.9 | 70.00 | |
+ | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
+ | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
+ | [**Breeze-7B-Instruct-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
+ | [**Breeze-7B-Instruct-64k-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) | 7B |5.5 | 40.99 | | | 36.11 |7.1 | 63.68 | |
+ | [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) | 7B |5.4 | 40.02 | | | 33.33 |6.2 | 55.94 | |
+ | [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) | 6B |5.0 | 44.79 | | | 25.69 |6.0 | 59.45 | |
+ | [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) | 13B |5.0 | 29.47 | | | 23.61 |-* | 50.50 | |
+ | [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) | 7B |4.2 | 28.08 | | | 31.25 | -* | 42.72 | |
+
+ \* Taiwan-LLM models respond to multi-turn questions (English) in Traditional Chinese.
+
+ **Category Score of MT-Bench-tw (0 shot)**
+
+ | Models | STEM |Extraction|Reasoning| Math | Coding | Roleplay| Writing |Humanities|Average|
+ |-----------------------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|--------|
+ | gpt-3.5-turbo | | | | | | | | | |
+ | Yi-34B-Chat | | | | | | | | | |
+ | Qwen-14B-Chat | | | | | | | | | |
+ | **Breeze-7B-Instruct-v0.1** | | | | | | | | | |
+ | **Breeze-7B-Instruct-64k-v0.1** | | | | | | | | | |
+ | Qwen-7B-Chat | | | | | | | | | |
+ | Yi-6B-Chat | | | | | | | | | |
+ | Taiwan-LLM-13B-v2.0-chat | | | | | | | | | |
+ | Taiwan-LLM-7B-v2.1-chat | | | | | | | | | |
+
+ **Category ACC of TMMLU+ (0 shot)**
+
+ | Model | STEM | Social Science | Humanities | Other | Average |
+ |-----------------------------------------------------|--------------|----------------|------------|------------|---------|
+ | gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | |
+ | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | |
+ | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | |
+ | **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 | |
+ | **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 | |
+ | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | |
+ | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | |
+ | Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 | |
+ | Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 | |
+
+
+
  ## Inference Performance
  In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
  All inferences run on 2 RTX A6000 GPUs (using `vllm`, with a tensor-parallel size of 2).

  | Models | Inference Time (sec)|Estimated Max Input Length (Char)|
  |--------------------------------------------------------------------|-------------------|--------------------------|
- | Yi-6B | 10.62 | 4.5k |
+ | Yi-6B | 10.62 | 5.2k |
  | **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |
  | **Breeze-7B-Instruct-64k-v0.1** | 10.74 | 88.8k |
  | Qwen-7B | 10.86 | 9.8k |
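
The inference test above (700-character excerpt, 2× RTX A6000, `vllm` with a tensor-parallel size of 2) can be approximated with vLLM's offline API. The sketch below is illustrative only: the article text is a placeholder, and the prompt wording and sampling settings are assumptions rather than the configuration behind the reported timings.

```python
# Illustrative sketch of the measurement setup described above. Assumptions:
# greedy decoding, a plain English instruction, and a placeholder excerpt.
import time

from vllm import LLM, SamplingParams

article_excerpt = "..."  # placeholder: first 700 characters of the cited article

llm = LLM(
    model="MediaTek-Research/Breeze-7B-Instruct-v0.1",
    tensor_parallel_size=2,  # matches the 2x RTX A6000 setup
)
params = SamplingParams(temperature=0.0, max_tokens=1024)  # illustrative values

start = time.time()
outputs = llm.generate([f"Rewrite the following article:\n{article_excerpt}"], params)
print(f"inference time: {time.time() - start:.2f} sec")
print(outputs[0].outputs[0].text)
```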
@@ -187,4 +208,4 @@ where `SYS_PROMPT`, `QUERY1`, `RESPONSE1`, and `QUERY2` can be provided by the u
  The suggested default `SYS_PROMPT` is
  ```txt
  You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.
- ```
+ ```
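
The final hunk only touches the closing fence of the default `SYS_PROMPT` block; the prompt template itself sits outside this diff. As a hedged illustration of combining `SYS_PROMPT` with the user turns, the sketch below relies on `transformers`' `apply_chat_template`, assuming the model repo's tokenizer ships a chat template that accepts a system message; the README's own template section remains the authoritative format.

```python
# Hedged sketch: assumes the Breeze-7B-Instruct tokenizer provides a chat
# template that folds a system message into the prompt. The exact template
# text is defined in the README/tokenizer config, not in this hunk.
from transformers import AutoTokenizer

SYS_PROMPT = (
    "You are a helpful AI assistant built by MediaTek Research. "
    "The user you are helping speaks Traditional Chinese and comes from Taiwan."
)

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0.1")
messages = [
    {"role": "system", "content": SYS_PROMPT},
    {"role": "user", "content": "QUERY1"},         # placeholders, as in the README
    {"role": "assistant", "content": "RESPONSE1"},
    {"role": "user", "content": "QUERY2"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```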
 