Update README.md
README.md
CHANGED
@@ -18,7 +18,7 @@ This achievement marks a significant milestone as it is the first instance of vo
[Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) derives from the base model Breeze-7B-Base-v0.1
and has undergone supervised fine-tuning with over 1 million instances to
sharpen its capabilities. This fine-tuned model demonstrates impressive performance in benchmarks for both English and Traditional Chinese, surpassing the results of
-Taiwan-LLM-7B-v2.1-
+Taiwan-LLM-7B-v2.1-chat, Taiwan-LLM-13B-v2.0-chat and Qwen-7B-chat in Traditional Chinese assessments. It also excels in some benchmarks against Yi-6B-Chat.
In English evaluations, Breeze-7B-Instruct-v0.1 shows comparable results to Mistral-7B-Instruct-v0.1 on the MMLU and MT-Bench benchmarks. [See [Chat Model Performance](#chat-model-performance).]

@@ -61,6 +61,12 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa

## Base Model Performance

+**TMMLU+**, **DRCD**, and **Table** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
+[MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
+and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
+We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+
+
| Models | | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
|----------------------------------------------|--------|--------------|-------------|-------------|------------|
| | |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Knowledge|
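The revised evaluation code referenced above is not included in this README. Purely as a rough illustration of how the upstream `lm-evaluation-harness` (v0.4-style Python API) is typically driven, a few-shot MMLU run on the base model might look like the sketch below; the TMMLU+, DRCD, and Table task definitions live only in the revised fork, so their task names are omitted here and the few-shot count is an assumption.

```python
# Illustrative sketch only, assuming lm-evaluation-harness >= 0.4 and its
# Python entry point; the MediaTek-revised fork that defines the TC tasks
# (TMMLU+, DRCD, Table) is not shown in this README.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=MediaTek-Research/Breeze-7B-Base-v0.1,dtype=bfloat16",
    tasks=["mmlu"],   # EN, Knowledge; the few-shot count below is illustrative
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```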
@@ -74,8 +80,10 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa


\* Few-shot learning cannot effectively guide the model to generate the proper answer.

-
+**Category ACC of TMMLU+ (5 shot)**
+
+| Models | STEM | Social Science | Humanities | Other |
|-----------------------------------------------------|--------------|----------------|------------|------------|
| Yi-34B | 56.03 | 73.06 | 61.12 | 62.19 |
| Qwen-14B | 46.51 | 58.20 | 51.12 | 49.38 |
@@ -85,42 +93,9 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
| Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 |


-**TMMLU+**, **DRCD**, and **Table** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
-[MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
-and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
-We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
-
-
-## Chat Model Performance
-
-| Models | | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench-tw (Score) | MMLU (ACC) | MMLU (ACC) | MT-Bench (Score) |
-|--------------------------------------------|--------|--------------|--------------|-----------|-------------|--------|------------|------------|------------------|
-| | |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|TC, Chat |EN, Knowledge|EN, Knowledge|EN, Chat |
-| | | 0 shot | 5 shot | 3 shot | 0 shot | 0 shot | 0 shot | 5 shot | 0 shot |
-| [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B | 54.87 | | | 36.81 | 6.9 | 71.04 | | 7.6 |
-| [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B | 48.41 | | | 41.67 | 6.4 | 64.91 | | 7.2 |
-| [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) | 6B | 44.79 | | | 25.69 | 5.0 | 59.45 | | 6.0 |
-| [gpt-3.5-turbo](https://openai.com) | | 41.76 | | | | 7.1 | 70.00 | | 7.9 |
-| [**Breeze-7B-Instruct-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) | 7B | 41.61 | | | 45.83 | 5.7 | 63.26 | | 7.1 |
-| [**Breeze-7B-Instruct-64k-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) | 7B | 40.99 | | | 36.11 | 5.5 | 63.68 | | 7.1 |
-| [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) | 7B | 40.02 | | | 33.33 | 5.4 | 55.94 | | 6.2 |
-| [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) | 13B | 29.47 | | | 23.61 | 5.0 | 50.50 | | -* |
-| [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) | 7B | 28.08 | | | 31.25 | 4.2 | 42.72 | | -* |


-
-
-| Category ACC of TMMLU+ (0 shot) | STEM | Social Science | Humanities | Other |
-|-----------------------------------------------------|--------------|----------------|------------|------------|
-| Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 |
-| Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 |
-| Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 |
-| gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 |
-| **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 |
-| **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 |
-| Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 |
-| Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 |
-| Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 |
+## Chat Model Performance

**TMMLU+**, **DRCD**, **Table**, and **MT-Bench-tw** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
[MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
@@ -130,13 +105,59 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
We use the code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to evaluate **MT-Bench-tw** and **MT-Bench**.


+| Models | |MT-Bench-tw (Score) | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
+|---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
+| | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
+| | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
+| [gpt-3.5-turbo](https://openai.com) | |7.1 | 41.76 | | | |7.9 | 70.00 | |
+| [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
+| [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
+| [**Breeze-7B-Instruct-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
+| [**Breeze-7B-Instruct-64k-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) | 7B |5.5 | 40.99 | | | 36.11 |7.1 | 63.68 | |
+| [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) | 7B |5.4 | 40.02 | | | 33.33 |6.2 | 55.94 | |
+| [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) | 6B |5.0 | 44.79 | | | 25.69 |6.0 | 59.45 | |
+| [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) | 13B |5.0 | 29.47 | | | 23.61 |-* | 50.50 | |
+| [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) | 7B |4.2 | 28.08 | | | 31.25 | -* | 42.72 | |
+
+\* Taiwan-LLM models respond to multi-turn questions (English) in Traditional Chinese.
+
+**Category Score of MT-Bench-tw (0 shot)**
+
+| Models | STEM |Extraction|Reasoning| Math | Coding | Roleplay| Writing |Humanities|Average|
+|-----------------------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|--------|
+| gpt-3.5-turbo | | | | | | | | | |
+| Yi-34B-Chat | | | | | | | | | |
+| Qwen-14B-Chat | | | | | | | | | |
+| **Breeze-7B-Instruct-v0.1** | | | | | | | | | |
+| **Breeze-7B-Instruct-64k-v0.1** | | | | | | | | | |
+| Qwen-7B-Chat | | | | | | | | | |
+| Yi-6B-Chat | | | | | | | | | |
+| Taiwan-LLM-13B-v2.0-chat | | | | | | | | | |
+| Taiwan-LLM-7B-v2.1-chat | | | | | | | | | |
+
+**Category ACC of TMMLU+ (0 shot)**
+
+| Model | STEM | Social Science | Humanities | Other | Average |
+|-----------------------------------------------------|--------------|----------------|------------|------------|---------|
+| gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | |
+| Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | |
+| Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | |
+| **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 | |
+| **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 | |
+| Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | |
+| Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | |
+| Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 | |
+| Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 | |
+
+
+
## Inference Performance
In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
All inferences run on 2 RTX A6000 GPUs (using `vllm`, with a tensor-parallel size of 2).

| Models | Inference Time (sec)|Estimated Max Input Length (Char)|
|--------------------------------------------------------------------|-------------------|--------------------------|
-| Yi-6B | 10.62 |
+| Yi-6B | 10.62 | 5.2k |
| **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |
| **Breeze-7B-Instruct-64k-v0.1** | 10.74 | 88.8k |
| Qwen-7B | 10.86 | 9.8k |
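The timing script for this test is not included in the README. A minimal sketch of the stated setup (vLLM with a tensor-parallel size of 2) might look like the following; the article text, prompt wording, and output length are placeholders rather than the original benchmark code.

```python
# Minimal sketch, not the original benchmark script: vLLM with the stated
# 2-way tensor parallelism on 2 GPUs. `article` stands in for the first
# 700 characters of the linked web article.
from vllm import LLM, SamplingParams

article = "..."  # placeholder: first 700 characters of the source article
prompt = article  # the card says the model is asked to write the same article again; exact wording not given

llm = LLM(model="MediaTek-Research/Breeze-7B-Instruct-v0.1", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=2048)  # output cap is an arbitrary choice

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```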
@@ -187,4 +208,4 @@ where `SYS_PROMPT`, `QUERY1`, `RESPONSE1`, and `QUERY2` can be provided by the u
The suggested default `SYS_PROMPT` is
```txt
You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.
-```
+```
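The full prompt template that arranges `SYS_PROMPT`, `QUERY1`, `RESPONSE1`, and `QUERY2` sits earlier in the README and is not shown in this diff. As a hedged sketch of feeding the suggested default `SYS_PROMPT` into a single-turn query with `transformers`, something like the following could work; `format_prompt` is a hypothetical stand-in for the card's actual template.

```python
# Hedged sketch: `format_prompt` is a hypothetical helper standing in for the
# prompt template defined elsewhere in the README; replace its body with the
# exact SYS_PROMPT / QUERY layout the card specifies.
from transformers import AutoModelForCausalLM, AutoTokenizer

SYS_PROMPT = (
    "You are a helpful AI assistant built by MediaTek Research. "
    "The user you are helping speaks Traditional Chinese and comes from Taiwan."
)

def format_prompt(sys_prompt: str, query: str) -> str:
    # Placeholder layout only; consult the template section of the README.
    return f"{sys_prompt} {query}"

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "MediaTek-Research/Breeze-7B-Instruct-v0.1", device_map="auto"
)

inputs = tokenizer(
    format_prompt(SYS_PROMPT, "Introduce Taiwan's night market culture."),
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```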