Tags: Text Generation · Transformers · Safetensors · English · deberta · reward_model · reward-model · RLHF · evaluation · llm · instruction · reranking · Inference Endpoints
Dongfu Jiang committed
Commit 14d4a72
1 Parent(s): a2f8211

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -59,8 +59,8 @@ We test the pairwise comparison on
 | Vicuna-13B-v1.5 | 30.6 | 23.6 | 35 | 28.3 | 36.1 | 37.5 | 45.5 | 39.8 | 37.3 |
 | WizardLM-13B-v1.2 | 22.2 | 20.8 | 32.5 | 19.2 | 28.7 | 25.4 | 29.2 | 33 | 27.8 |
 | LLaMA-2-chat-70B | 34.7 | 33.3 | 36.7 | 35.8 | 51.4 | 54.2 | 47.2 | 47.7 | 45.9 |
-| AUTO-J 1 | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
-| PairRM | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
+| AUTO-J (13b) | 45.8 | 38.9 | 59.2 | 47.5 | 54.6 | 57.1 | **58** | 57.6 | 54.8 |
+| **PairRM (0.4b)** | **56.94** | **52.78** | **58.33** | **55.83** | **61.57** | **59.17** | 57.64 | **62.5** | **59.05** |
 
 #### HHH-Alignment and MT-bench human judgements
 
@@ -77,7 +77,7 @@ We test the pairwise comparison on
 | GPT-3.5-TURBO-0613 | 76.27 | 87.93 | 67.21 | 86.05 | 78.73 | 57.12 |
 | PROMETHEUS 7B | 69.49 | 84.48 | 78.69 | 90.7 | 80.09 | 55.14 |
 | PROMETHEUS 13B | 81.36 | 82.76 | 75.41 | 76.74 | 79.19 | 57.72 |
-| PairRM | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
+| **PairRM (0.4b)** | **84.75** | 84.48 | **80.33** | **90.7** | **84.62** | **59** |
 | GPT-4-0613 | 91.53 | 93.1 | 85.25 | 83.72 | 88.69 | 63.87 |
 
 **While PairRM is an extremely small model (0.4B) based on DeBERTa, its pairwise comparison agreement approaches GPT-4's performance!**
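
For context on the pairwise-comparison claim above, here is a minimal sketch of running PairRM through the llm-blender package, following the usage pattern the model card describes; the instruction and candidate strings are made-up examples, and argument names may vary across llm-blender versions:

```python
# pip install llm-blender
import llm_blender

# Load PairRM (the 0.4B DeBERTa-based pairwise ranker) from the Hub.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

# Hypothetical instruction with two candidate responses to compare.
inputs = ["Explain what a reward model is in RLHF."]
candidates_A = ["A reward model scores responses so RL can steer the policy toward human preferences."]
candidates_B = ["It is a model."]

# compare() returns, per input, whether candidate A is preferred over B.
results = blender.compare(inputs, candidates_A, candidates_B)
print(results)  # e.g. [True] when A wins the pairwise comparison
```

Because PairRM scores the two candidates jointly rather than independently, a single forward pass per pair suffices, which is what keeps a 0.4B model competitive on these agreement benchmarks.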