JustinLin610 committed
Commit ad78953 • 1 parent: 33be903
Update README.md

README.md CHANGED
@@ -108,10 +108,10 @@ For more information, please refer to our [GitHub repo](https://github.com/QwenL
 
 We illustrate the zero-shot performance of both BF16 and Int4 models on the benchmark, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:
 
-
-
-| BF16
-| Int4
+| Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
+|--------------|:----:|:-----------:|:-----:|:---------:|
+| BF16         | 55.8 | 59.7        | 50.3  | 37.2      |
+| Int4         | 55.1 | 59.2        | 49.7  | 35.4      |
 
 ### 推理速度 (Inference Speed)
 
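For context, the Int4 checkpoint is used through the same API as the BF16 one. A minimal sketch, assuming the Hugging Face Transformers API, the `Qwen/Qwen-7B-Chat-Int4` checkpoint, and the `auto-gptq`/`optimum` dependencies it requires:

```python
# Minimal sketch: load and query the Int4-quantized chat model.
# model.chat is provided by the remote Qwen code pulled in via trust_remote_code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4", device_map="auto", trust_remote_code=True
).eval()

response, _history = model.chat(tokenizer, "你好", history=None)
print(response)
```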
@@ -119,10 +119,10 @@ We illustrate the zero-shot performance of both BF16 and Int4 models on the benc
 
 We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and the Int4 quantization level, respectively.
 
-
-
-
-
+| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
+|--------------|:-------------------:|:-------------------:|
+| BF16         | 30.53               | 28.51               |
+| Int4         | 45.60               | 33.83               |
 
 In detail, we profile the speed of generating 8192 tokens with a context of length 1. The evaluation runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is the average over generating 8192 tokens.
 
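Tokens/s figures like these can be approximated with a simple timing loop. A rough sketch, assuming a single CUDA device and the `model`/`tokenizer` loaded as in the earlier sketch (not the exact profiling script behind the table):

```python
import time
import torch

@torch.inference_mode()
def tokens_per_second(model, tokenizer, prompt: str = "你", max_new_tokens: int = 8192) -> float:
    # Encode a (here: single-token) context, matching the length-1 setting above.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    return n_new / (time.time() - start)

print(f"{tokens_per_second(model, tokenizer):.2f} tokens/s")
```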
@@ -135,7 +135,7 @@ In detail, the setting of profiling is generating 8192 new tokens with 1 context
 We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under the BF16 and Int4 quantization levels, respectively. The results are shown below.
 
 | Quantization Level | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
-
+|--------------------|:-----------------------------------:|:-------------------------------------:|
 | BF16 | 18.99GB | 24.40GB |
 | Int4 | 10.20GB | 15.61GB |
 
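Peak-memory numbers of this kind can be read from PyTorch's CUDA allocator counters. A minimal sketch under the same single-GPU assumption (`prompt_of_2048_tokens` is a placeholder for a suitably long input):

```python
import torch

@torch.inference_mode()
def peak_memory_gb(model, tokenizer, prompt: str, max_new_tokens: int) -> float:
    # Reset the allocator's high-water mark, run one generation, read it back.
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return torch.cuda.max_memory_allocated() / 1024**3

# Encoding a long context vs. generating a long continuation, as in the table:
# peak_memory_gb(model, tokenizer, prompt_of_2048_tokens, max_new_tokens=1)
# peak_memory_gb(model, tokenizer, "你", max_new_tokens=8192)
```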
@@ -162,12 +162,12 @@ Our tokenizer based on tiktoken is different from other tokenizers, e.g., senten
 The details of the model architecture of Qwen-7B-Chat are listed as follows:
 
 | Hyperparameter | Value |
-
-| n_layers |
-| n_heads |
-| d_model |
+|:----------------|:------:|
+| n_layers        | 32     |
+| n_heads         | 32     |
+| d_model         | 4096   |
 | vocab size | 151851 |
-| sequence length |
+| sequence length | 8192   |
 
 For positional encoding, the FFN activation function, and normalization, we also adopt the most popular practices:
 RoPE relative positional encoding, the SwiGLU activation function, and RMSNorm (with optional flash-attention acceleration).
 
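These values can be cross-checked against the released config. A small sketch, assuming the checkpoint's config exposes the usual Qwen field names (`num_hidden_layers`, `num_attention_heads`, `hidden_size`, `seq_length`, `vocab_size`):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
print("n_layers:", cfg.num_hidden_layers)    # expected: 32
print("n_heads: ", cfg.num_attention_heads)  # expected: 32
print("d_model: ", cfg.hidden_size)          # expected: 4096
print("seq len: ", cfg.seq_length)           # expected: 8192
print("vocab:   ", cfg.vocab_size)           # may be padded above the tokenizer's 151851
```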
@@ -204,7 +204,7 @@ Note: Due to rounding errors caused by hardware and framework, differences in re
 We demonstrate the 0-shot & 5-shot accuracy of Qwen-7B-Chat on the C-Eval validation set.
 
 | Model | Avg. Acc. |
-
+|:--------------------------------:|:---------:|
 | LLaMA2-7B-Chat | 31.9 |
 | LLaMA2-13B-Chat | 36.2 |
 | LLaMA2-70B-Chat | 44.3 |
@@ -246,7 +246,7 @@ The 0-shot & 5-shot accuracy of Qwen-7B-Chat on MMLU is provided below.
 The performance of Qwen-7B-Chat remains among the best of human-aligned models of comparable size.
 
 | Model | Avg. Acc. |
-
+|:--------------------------------:|:---------:|
 | ChatGLM2-6B-Chat | 46.0 |
 | LLaMA2-7B-Chat | 46.2 |
 | InternLM-7B-Chat | 51.1 |
@@ -266,18 +266,18 @@ Qwen-7B-Chat在[HumanEval](https://github.com/openai/human-eval)的zero-shot Pas
 
 The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/human-eval) is demonstrated below.
 
-| Model | Pass@1
-
-| ChatGLM2-6B-Chat | 11.0
-| LLaMA2-7B-Chat | 12.2
-| InternLM-7B-Chat | 14.6
-| Baichuan2-7B-Chat | 13.4
-| LLaMA2-13B-Chat | 18.9
-| Baichuan2-13B-Chat | 17.7
-| LLaMA2-70B-Chat | 32.3
-| Qwen-7B-Chat (original) | 24.4
-| **Qwen-7B-Chat** | 37.2
-| **Qwen-14B-Chat** | **43.9**
+| Model                   | Pass@1   |
+|:-----------------------:|:--------:|
+| ChatGLM2-6B-Chat        | 11.0     |
+| LLaMA2-7B-Chat          | 12.2     |
+| InternLM-7B-Chat        | 14.6     |
+| Baichuan2-7B-Chat       | 13.4     |
+| LLaMA2-13B-Chat         | 18.9     |
+| Baichuan2-13B-Chat      | 17.7     |
+| LLaMA2-70B-Chat         | 32.3     |
+| Qwen-7B-Chat (original) | 24.4     |
+| **Qwen-7B-Chat**        | 37.2     |
+| **Qwen-14B-Chat**       | **43.9** |
 
 ### 数学评测(Mathematics Evaluation)
 
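For reference, Pass@k on HumanEval is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021); with a single sample per problem, Pass@1 reduces to the fraction of problems whose sample passes the unit tests. A minimal sketch of the estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=10, c=3, k=1))  # 0.3
```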
@@ -285,20 +285,20 @@ The zero-shot Pass@1 of Qwen-7B-Chat on [HumanEval](https://github.com/openai/hu
 
 The accuracy of Qwen-7B-Chat on GSM8K is shown below.
 
-| Model | Acc.
-
-| LLaMA2-7B-Chat | 26.3
-| ChatGLM2-6B-Chat | 28.8
-| Baichuan2-7B-Chat | 32.8
-| InternLM-7B-Chat | 33.0
-| LLaMA2-13B-Chat | 37.1
-| Baichuan2-13B-Chat | 55.3
-| LLaMA2-70B-Chat | 59.3
-| Qwen-7B-Chat (original) (0-shot) | 41.1
-| **Qwen-7B-Chat (0-shot)** | 50.3
-| **Qwen-7B-Chat (8-shot)** | 54.1
-| **Qwen-14B-Chat (0-shot)** | **60.1**
-| **Qwen-14B-Chat (8-shot)** | 59.3
+| Model                            | Acc.     |
+|:--------------------------------:|:--------:|
+| LLaMA2-7B-Chat                   | 26.3     |
+| ChatGLM2-6B-Chat                 | 28.8     |
+| Baichuan2-7B-Chat                | 32.8     |
+| InternLM-7B-Chat                 | 33.0     |
+| LLaMA2-13B-Chat                  | 37.1     |
+| Baichuan2-13B-Chat               | 55.3     |
+| LLaMA2-70B-Chat                  | 59.3     |
+| Qwen-7B-Chat (original) (0-shot) | 41.1     |
+| **Qwen-7B-Chat (0-shot)**        | 50.3     |
+| **Qwen-7B-Chat (8-shot)**        | 54.1     |
+| **Qwen-14B-Chat (0-shot)**       | **60.1** |
+| **Qwen-14B-Chat (8-shot)**       | 59.3     |
 
 ### 长序列评测(Long-Context Understanding)
 
@@ -311,7 +311,7 @@ We introduce NTK-aware interpolation, LogN attention scaling to extend the conte
 **(To use these tricks, please set `use_dynamic_ntk` and `use_logn_attn` to true in config.json.)**
 
 | Model | VCSUM (zh) |
-
+|:------------------|:----------:|
 | GPT-3.5-Turbo-16k | 16.0 |
 | LLaMA2-7B-Chat | 0.2 |
 | InternLM-7B-Chat | 13.0 |
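The flags in the hunk above can also be set programmatically instead of editing config.json by hand. A minimal sketch, assuming the Qwen config exposes `use_dynamic_ntk` (NTK-aware interpolation) and `use_logn_attn` (LogN attention scaling):

```python
from transformers import AutoConfig, AutoModelForCausalLM

cfg = AutoConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
cfg.use_dynamic_ntk = True  # NTK-aware interpolation for longer contexts
cfg.use_logn_attn = True    # LogN attention scaling

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", config=cfg, device_map="auto", trust_remote_code=True
).eval()
```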