YC-Chen committed
Commit 29e7be5 · verified · 1 Parent(s): 88c0b10

Update README.md

Files changed (1): README.md (+59 -38)
README.md CHANGED
@@ -18,7 +18,7 @@ This achievement marks a significant milestone as it is the first instance of vo
  [Breeze-7B-Instruct-v0.1](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) derives from the base model Breeze-7B-Base-v0.1
  and has undergone supervised fine-tuning with over 1 million instances to
  sharpen its capabilities. This fine-tuned model demonstrates impressive performance in benchmarks for both English and Traditional Chinese, surpassing the results of
- Taiwan-LLM-7B-v2.1-Chat, Taiwan-LLM-13B-v2.0-Chat and Qwen-7B-Chat in Traditional Chinese assessments. It also excels in some benchmarks against Yi-6B-Chat.
+ Taiwan-LLM-7B-v2.1-chat, Taiwan-LLM-13B-v2.0-chat and Qwen-7B-chat in Traditional Chinese assessments. It also excels in some benchmarks against Yi-6B-Chat.
  In English evaluations, Breeze-7B-Instruct-v0.1 shows comparable results to Mistral-7B-Instruct-v0.1 on the MMLU and MT-Bench benchmarks. [See [Chat Model Performance](#chat-model-performance).]


@@ -61,6 +61,12 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa

  ## Base Model Performance

+ **TMMLU+**, **DRCD**, and **Table** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
+ [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
+ and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
+ We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
+
+
  | Models | | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MMLU (ACC) |
  |----------------------------------------------|--------|--------------|-------------|-------------|------------|
  | | |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Knowledge|
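
The hunk above points the MMLU numbers at code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). A minimal sketch of a comparable run with the upstream harness's Python API (v0.4+) follows; the TCEval-v2 tasks (TMMLU+, DRCD, Table) exist only in the revised fork, so only the standard `mmlu` task is shown, and the exact settings behind the reported scores are assumptions.

```python
# Minimal sketch (assumption): upstream lm-evaluation-harness v0.4+ Python API.
# The TCEval-v2 tasks (TMMLU+, DRCD, Table) live in MediaTek-Research's revised
# fork and are not upstream task names, so only MMLU is evaluated here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=MediaTek-Research/Breeze-7B-Base-v0.1,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,        # illustrative; shot counts vary across tables in this README
    batch_size="auto",
)
print(results["results"])  # per-task and aggregate accuracies
```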
@@ -74,8 +80,10 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa


  \* Few-shot learning cannot effectively guide the model to generate the proper answer.
+
+ **Category ACC of TMMLU+ (5 shot)**

- | Category ACC of TMMLU+ (5 shot) | STEM | Social Science | Humanities | Other |
+ | Models | STEM | Social Science | Humanities | Other |
  |-----------------------------------------------------|--------------|----------------|------------|------------|
  | Yi-34B | 56.03 | 73.06 | 61.12 | 62.19 |
  | Qwen-14B | 46.51 | 58.20 | 51.12 | 49.38 |
@@ -85,42 +93,9 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  | Mistral-7B-v0.1 | 33.01 | 42.23 | 35.86 | 37.63 |


- **TMMLU+**, **DRCD**, and **Table** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
- [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
- and [ikala/tmmluplus](https://huggingface.co/datasets/ikala/tmmluplus). **MMLU** sources from [hails/mmlu_no_train](https://huggingface.co/datasets/hails/mmlu_no_train).
- We use the code revised from [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate **TMMLU+**, **DRCD**, **Table**, and **MMLU**.
-
-
- ## Chat Model Performance
-
- | Models | | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench-tw (Score) | MMLU (ACC) | MMLU (ACC) | MT-Bench (Score) |
- |--------------------------------------------|--------|--------------|--------------|-----------|-------------|--------|------------|------------|------------------|
- | | |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|TC, Chat |EN, Knowledge|EN, Knowledge|EN, Chat |
- | | | 0 shot | 5 shot | 3 shot | 0 shot | 0 shot | 0 shot | 5 shot | 0 shot |
- | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B | 54.87 | | | 36.81 | 6.9 | 71.04 | | 7.6 |
- | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B | 48.41 | | | 41.67 | 6.4 | 64.91 | | 7.2 |
- | [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) | 6B | 44.79 | | | 25.69 | 5.0 | 59.45 | | 6.0 |
- | [gpt-3.5-turbo](https://openai.com) | | 41.76 | | | | 7.1 | 70.00 | | 7.9 |
- | [**Breeze-7B-Instruct-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) | 7B | 41.61 | | | 45.83 | 5.7 | 63.26 | | 7.1 |
- | [**Breeze-7B-Instruct-64k-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) | 7B | 40.99 | | | 36.11 | 5.5 | 63.68 | | 7.1 |
- | [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) | 7B | 40.02 | | | 33.33 | 5.4 | 55.94 | | 6.2 |
- | [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) | 13B | 29.47 | | | 23.61 | 5.0 | 50.50 | | -* |
- | [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) | 7B | 28.08 | | | 31.25 | 4.2 | 42.72 | | -* |
-
-
- \* Taiwan-LLM models responds to multi-turn questions (English) in Traditional Chinese.
-
- | Category ACC of TMMLU+ (0 shot) | STEM | Social Science | Humanities | Other |
- |-----------------------------------------------------|--------------|----------------|------------|------------|
- | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 |
- | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 |
- | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 |
- | gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 |
- | **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 |
- | **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 |
- | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 |
- | Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 |
- | Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 |
+
+
+ ## Chat Model Performance

  **TMMLU+**, **DRCD**, **Table**, and **MT-Bench-tw** source from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2).
  [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2) derives from [TCEval-v1](https://github.com/mtkresearch/MR-Models/tree/main/TC-Eval)
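
Both performance sections source the Traditional Chinese benchmarks from [MediaTek-Research/TCEval-v2](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2). A hedged sketch of inspecting that dataset with the `datasets` library follows; the subset (config) names are not listed in this diff, so the snippet discovers them rather than assuming any.

```python
# Hedged sketch: assumes only that MediaTek-Research/TCEval-v2 is a public
# Hugging Face dataset. Its config (subset) names are not shown in this diff,
# so they are listed at runtime instead of being hard-coded.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("MediaTek-Research/TCEval-v2")
print(configs)

# Load the first discovered subset to inspect its splits and fields.
# (Add trust_remote_code=True if the repo turns out to use a loading script.)
ds = load_dataset("MediaTek-Research/TCEval-v2", configs[0])
print(ds)
```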
@@ -130,13 +105,59 @@ Breeze-7B-Instruct-64k-v0.1 can solve tasks such as question answering and summa
  We use the code revised from [fastchat llm_judge](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) to evaluate **MT-Bench-tw** and **MT-Bench**.


+ | Models | |MT-Bench-tw (Score) | TMMLU+ (ACC) | TMMLU+ (ACC) | DRCD (EM) | Table (ACC) | MT-Bench (Score) | MMLU (ACC) | MMLU (ACC) |
+ |---------------------------------------------------------------------------------------------------------|--------|--------------------|--------------|--------------|-------------|-------------|------------------|-------------|-------------|
+ | | |TC, Chat |TC, Knowledge |TC, Knowledge |TC, Reasoning|TC, Reasoning|EN, Chat |EN, Knowledge|EN, Knowledge|
+ | | |0 shot | 0 shot | 5 shot | 3 shot | 0 shot |0 shot | 0 shot | 5 shot |
+ | [gpt-3.5-turbo](https://openai.com) | |7.1 | 41.76 | | | |7.9 | 70.00 | |
+ | [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) | 34B |6.9 | 54.87 | | | 36.81 |7.6 | 71.04 | |
+ | [Qwen-14B-Chat](https://huggingface.co/Qwen/Qwen-14B-Chat) | 14B |6.4 | 48.41 | | | 41.67 |7.2 | 64.91 | |
+ | [**Breeze-7B-Instruct-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-v0.1) | 7B |5.7 | 41.61 | | | 45.83 |7.1 | 63.26 | |
+ | [**Breeze-7B-Instruct-64k-v0.1**](https://huggingface.co/MediaTek-Research/Breeze-7B-Instruct-64k-v0.1) | 7B |5.5 | 40.99 | | | 36.11 |7.1 | 63.68 | |
+ | [Qwen-7B-Chat](https://huggingface.co/Qwen/Qwen-7B-Chat) | 7B |5.4 | 40.02 | | | 33.33 |6.2 | 55.94 | |
+ | [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) | 6B |5.0 | 44.79 | | | 25.69 |6.0 | 59.45 | |
+ | [Taiwan-LLM-13B-v2.0-chat](https://huggingface.co/yentinglin/Taiwan-LLM-13B-v2.0-chat) | 13B |5.0 | 29.47 | | | 23.61 |-* | 50.50 | |
+ | [Taiwan-LLM-7B-v2.1-chat](https://huggingface.co/yentinglin/Taiwan-LLM-7B-v2.1-chat) | 7B |4.2 | 28.08 | | | 31.25 | -* | 42.72 | |
+
+ \* Taiwan-LLM models respond to multi-turn questions (English) in Traditional Chinese.
+
+ **Category Score of MT-Bench-tw (0 shot)**
+
+ | Models | STEM |Extraction|Reasoning| Math | Coding | Roleplay| Writing |Humanities|Average|
+ |-----------------------------------------------------|---------|---------|---------|---------|---------|---------|---------|---------|--------|
+ | gpt-3.5-turbo | | | | | | | | | |
+ | Yi-34B-Chat | | | | | | | | | |
+ | Qwen-14B-Chat | | | | | | | | | |
+ | **Breeze-7B-Instruct-v0.1** | | | | | | | | | |
+ | **Breeze-7B-Instruct-64k-v0.1** | | | | | | | | | |
+ | Qwen-7B-Chat | | | | | | | | | |
+ | Yi-6B-Chat | | | | | | | | | |
+ | Taiwan-LLM-13B-v2.0-chat | | | | | | | | | |
+ | Taiwan-LLM-7B-v2.1-chat | | | | | | | | | |
+
+ **Category ACC of TMMLU+ (0 shot)**
+
+ | Model | STEM | Social Science | Humanities | Other | Average |
+ |-----------------------------------------------------|--------------|----------------|------------|------------|---------|
+ | gpt-3.5-turbo | 41.56 | 46.72 | 36.73 | 42.03 | |
+ | Yi-34B-Chat | 47.65 | 64.25 | 52.73 | 54.91 | |
+ | Qwen-14B-Chat | 43.83 | 55.00 | 48.55 | 46.22 | |
+ | **Breeze-7B-Instruct-v0.1** | 37.41 | 46.81 | 42.06 | 40.16 | |
+ | **Breeze-7B-Instruct-64k-v0.1** | 37.88 | 46.35 | 40.31 | 39.40 | |
+ | Qwen-7B-Chat | 35.44 | 46.22 | 38.35 | 40.06 | |
+ | Yi-6B-Chat | 37.80 | 51.74 | 45.36 | 44.25 | |
+ | Taiwan-LLM-13B-v2.0-chat | 27.74 | 33.69 | 27.03 | 29.43 | |
+ | Taiwan-LLM-7B-v2.1-chat | 25.58 | 31.76 | 27.36 | 27.61 | |
+
+
+
  ## Inference Performance
  In this test, we use the first 700 characters of the [web article](https://health.udn.com/health/story/5976/7699252?from=udn_ch1005_main_index) as the input and ask the model to write the same article again.
  All inferences run on 2 RTX A6000 GPUs (using `vllm`, with a tensor-parallel size of 2).

  | Models | Inference Time (sec)|Estimated Max Input Length (Char)|
  |--------------------------------------------------------------------|-------------------|--------------------------|
- | Yi-6B | 10.62 | 4.5k |
+ | Yi-6B | 10.62 | 5.2k |
  | **Breeze-7B-Instruct-v0.1** | 10.74 | 11.1k |
  | **Breeze-7B-Instruct-64k-v0.1** | 10.74 | 88.8k |
  | Qwen-7B | 10.86 | 9.8k |
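
The inference test above (700-character excerpt, 2× RTX A6000, `vllm` with a tensor-parallel size of 2) can be approximated with vLLM's offline API. The sketch below is illustrative only: the article text is a placeholder, and the prompt wording and sampling settings are assumptions rather than the configuration behind the reported timings.

```python
# Illustrative sketch of the measurement setup described above. Assumptions:
# greedy decoding, a plain English instruction, and a placeholder excerpt.
import time

from vllm import LLM, SamplingParams

article_excerpt = "..."  # placeholder: first 700 characters of the cited article

llm = LLM(
    model="MediaTek-Research/Breeze-7B-Instruct-v0.1",
    tensor_parallel_size=2,  # matches the 2x RTX A6000 setup
)
params = SamplingParams(temperature=0.0, max_tokens=1024)  # illustrative values

start = time.time()
outputs = llm.generate([f"Rewrite the following article:\n{article_excerpt}"], params)
print(f"inference time: {time.time() - start:.2f} sec")
print(outputs[0].outputs[0].text)
```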
@@ -187,4 +208,4 @@ where `SYS_PROMPT`, `QUERY1`, `RESPONSE1`, and `QUERY2` can be provided by the u
  The suggested default `SYS_PROMPT` is
  ```txt
  You are a helpful AI assistant built by MediaTek Research. The user you are helping speaks Traditional Chinese and comes from Taiwan.
- ```
+ ```
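
The final hunk only touches the closing fence of the default `SYS_PROMPT` block; the prompt template itself sits outside this diff. As a hedged illustration of combining `SYS_PROMPT` with the user turns, the sketch below relies on `transformers`' `apply_chat_template`, assuming the model repo's tokenizer ships a chat template that accepts a system message; the README's own template section remains the authoritative format.

```python
# Hedged sketch: assumes the Breeze-7B-Instruct tokenizer provides a chat
# template that folds a system message into the prompt. The exact template
# text is defined in the README/tokenizer config, not in this hunk.
from transformers import AutoTokenizer

SYS_PROMPT = (
    "You are a helpful AI assistant built by MediaTek Research. "
    "The user you are helping speaks Traditional Chinese and comes from Taiwan."
)

tokenizer = AutoTokenizer.from_pretrained("MediaTek-Research/Breeze-7B-Instruct-v0.1")
messages = [
    {"role": "system", "content": SYS_PROMPT},
    {"role": "user", "content": "QUERY1"},         # placeholders, as in the README
    {"role": "assistant", "content": "RESPONSE1"},
    {"role": "user", "content": "QUERY2"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```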
 