Wauplin (HF staff) committed
Commit 6581330 · 1 Parent(s): 8d44a71

[WIP] Add evaluation results to model card metadata

This is a work in progress. The goal is to list evaluation results in the model card metadata, especially the results from the Open LLM Leaderboard. This PR has **not** been created automatically.
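As a rough illustration of the structure being proposed, here is a minimal sketch that builds one `results` entry as a plain Python dict, using the ARC values from the README.md diff in this PR. The helper name `make_result` is hypothetical, and the entry is printed as JSON only to keep the sketch free of third-party dependencies; the actual metadata lives in the YAML front matter of the model card.

```python
import json

# Source URL copied from the suggested metadata in this PR.
LEADERBOARD_URL = (
    "https://huggingface.co/datasets/open-llm-leaderboard/"
    "details_HuggingFaceH4__zephyr-7b-beta_public"
)


def make_result(dataset_name, dataset_type, split, num_few_shot, metrics, config=None):
    """Build one `results` entry shaped like the entries suggested in this PR.

    `make_result` is a hypothetical helper, not part of any library.
    """
    dataset = {"name": dataset_name, "type": dataset_type}
    if config is not None:
        dataset["config"] = config
    dataset["split"] = split
    dataset["args"] = {"num_few_shot": num_few_shot}
    return {
        "task": {"type": "text-generation", "name": "Text Generation"},
        "dataset": dataset,
        "metrics": [{"type": t, "value": v} for t, v in metrics.items()],
        "source": {"name": "Open LLM Leaderboard", "url": LEADERBOARD_URL},
    }


arc = make_result(
    dataset_name="AI2 Reasoning Challenge (25-Shot)",
    dataset_type="ai2_arc",
    config="ARC-Challenge",
    split="test",
    num_few_shot=25,
    metrics={"acc": 0.590443686006826, "acc_norm": 0.6203071672354948},
)
model_index = [{"name": "zephyr-7b-beta", "results": [arc]}]
print(json.dumps({"model-index": model_index}, indent=2))
```

Keeping `args.num_few_shot` under `dataset` mirrors the layout in the suggested changes; whether that is the right place for it is part of the pending questions below.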

#### Pending questions:
1. Should we report all metrics for each task? (especially the `_stderr` ones?) Or only the one that is displayed in the LLM Leaderboard?
2. Are the dataset `type`/`name`/`config`/`split`/`num_few_shot` accurate in the suggested changes?
3. How should we report the MMLU results? There are 57 different `hendrycksTest` datasets, for a total of 228 metrics (4 per dataset). 😵
4. How to report MT-Bench results? (asking since they are reported in the model card but not in the metadata)
5. How to report AlpacaEval results? (asking since they are reported in the model card but not in the metadata)
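
To make the options behind questions 1 and 3 concrete, here is a small sketch of what each choice would do to the numbers. The raw values are copied from the suggested metadata in this PR. The mapping of which metric the leaderboard displays per task is an assumption, as is the idea of collapsing MMLU into one unweighted mean; the per-subject MMLU accuracies are made-up placeholders, and all function names are hypothetical.

```python
from statistics import mean

# Assumption: the one metric the Open LLM Leaderboard displays for each task.
HEADLINE_METRIC = {
    "arc": "acc_norm",
    "hellaswag": "acc_norm",
    "drop": "f1",
    "truthfulqa": "mc2",
    "gsm8k": "acc",
}

# Raw values copied from the metadata suggested in this PR.
RAW_RESULTS = {
    "arc": {
        "acc": 0.590443686006826,
        "acc_stderr": 0.014370358632472437,
        "acc_norm": 0.6203071672354948,
        "acc_norm_stderr": 0.01418211986697487,
    },
    "hellaswag": {
        "acc": 0.6491734714200359,
        "acc_stderr": 0.004762534245488399,
        "acc_norm": 0.8435570603465445,
        "acc_norm_stderr": 0.003625323221166244,
    },
    "drop": {
        "em": 0.004928691275167785,
        "em_stderr": 0.0007171872517059793,
        "f1": 0.09662437080536909,
        "f1_stderr": 0.0018807376338089597,
    },
    "truthfulqa": {
        "mc1": 0.40636474908200737,
        "mc1_stderr": 0.017193835812093893,
        "mc2": 0.5744916942762855,
        "mc2_stderr": 0.015742095840959796,
    },
    "gsm8k": {
        "acc": 0.12736921910538287,
        "acc_stderr": 0.009183110326737829,
    },
}


def drop_stderr(metrics):
    """Question 1, option A: keep every metric except the `_stderr` ones."""
    return {k: v for k, v in metrics.items() if not k.endswith("_stderr")}


def headline_only(results, headline_metric):
    """Question 1, option B: keep only the displayed metric, as a percentage."""
    return {
        task: round(100 * metrics[headline_metric[task]], 2)
        for task, metrics in results.items()
    }


def aggregate_mmlu(per_subject_acc):
    """Question 3: collapse per-subject accuracies into one unweighted mean."""
    return mean(per_subject_acc.values())


# Hypothetical per-subject accuracies; the real run has 57 subjects.
EXAMPLE_MMLU = {
    "hendrycksTest-abstract_algebra": 0.25,
    "hendrycksTest-anatomy": 0.5,
    "hendrycksTest-astronomy": 0.75,
}

if __name__ == "__main__":
    print(drop_stderr(RAW_RESULTS["arc"]))
    print(headline_only(RAW_RESULTS, HEADLINE_METRIC))
    print(aggregate_mmlu(EXAMPLE_MMLU))
```

Option B would shrink the metadata to one value per task, and a single aggregated MMLU entry would replace the 228 per-subject metrics, at the cost of dropping detail that the linked details dataset still holds.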

Related thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/370#65663f60589e212284db2ffc.
Related PR in the Hub docs: https://github.com/huggingface/hub-docs/pull/1144.

cc @clefourrier @julien-c @lewtun @Weyaxi

Files changed (1)
  1. README.md +154 -18
README.md CHANGED
@@ -1,9 +1,6 @@
 ---
 tags:
 - generated_from_trainer
-model-index:
-- name: zephyr-7b-beta
-  results: []
 license: mit
 datasets:
 - HuggingFaceH4/ultrachat_200k
@@ -16,8 +13,161 @@ widget:
   output:
     text: "Arr! 'Tis a puzzlin' matter, me hearty! A llama on yer lawn be a rare sight, but I've got a plan that might help ye get rid of 'im. Ye'll need to gather some carrots and hay, and then lure the llama away with the promise of a tasty treat. Once he's gone, ye can clean up yer lawn and enjoy the peace and quiet once again. But beware, me hearty, for there may be more llamas where that one came from! Arr!"
 pipeline_tag: text-generation
+model-index:
+- name: zephyr-7b-beta
+  results:
+
+  # AI2 Reasoning Challenge (25-Shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AI2 Reasoning Challenge (25-Shot)
+      type: ai2_arc
+      config: ARC-Challenge
+      split: test
+      args:
+        num_few_shot: 25
+    metrics:
+    - type: acc
+      name: accuracy
+      value: 0.590443686006826
+    - type: acc_stderr
+      value: 0.014370358632472437
+    - type: acc_norm
+      name: normalized accuracy
+      value: 0.6203071672354948
+    - type: acc_norm_stderr
+      value: 0.01418211986697487
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # HellaSwag (10-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: HellaSwag (10-Shot)
+      type: Rowan/hellaswag
+      split: test # or validation?
+      args:
+        num_few_shot: 10
+    metrics:
+    - type: acc
+      name: accuracy
+      value: 0.6491734714200359
+    - type: acc_stderr
+      value: 0.004762534245488399
+    - type: acc_norm
+      name: normalized accuracy
+      value: 0.8435570603465445
+    - type: acc_norm_stderr
+      value: 0.003625323221166244
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # DROP (3-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: Drop (3-Shot)
+      type: drop
+      split: test
+      args:
+        num_few_shot: 3
+    metrics:
+    - type: em
+      name: exact match
+      value: 0.004928691275167785
+    - type: em_stderr
+      value: 0.0007171872517059793
+    - type: f1
+      name: f1 score
+      value: 0.09662437080536909
+    - type: f1_stderr
+      value: 0.0018807376338089597
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # TruthfulQA (0-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: TruthfulQA (0-shot)
+      type: truthful_qa
+      config: multiple_choice
+      split: validation
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: mc1
+      value: 0.40636474908200737
+    - type: mc1_stderr
+      value: 0.017193835812093893
+    - type: mc2
+      value: 0.5744916942762855
+    - type: mc2_stderr
+      value: 0.015742095840959796
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # GSM8k (5-shot) (Open LLM Leaderboard)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GSM8k (5-shot)
+      type: gsm8k
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      name: accuracy
+      value: 0.12736921910538287
+    - type: acc_stderr
+      value: 0.009183110326737829
+    source:
+      name: Open LLM Leaderboard
+      url: https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta_public
+
+  # MMLU (5-Shot) (Open LLM Leaderboard)
+  # ???
+
+  # AlpacaEval (taken from model card)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AlpacaEval
+      type: unknown
+    metrics:
+    - type: unknown
+      name: win rate
+      value: 0.9060
+    source:
+      url: https://tatsu-lab.github.io/alpaca_eval/
+
+  # MT-Bench (taken from model card)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MT-Bench
+      type: unknown
+    metrics:
+    - type: unknown
+      name: score
+      value: 7.34
+    source:
+      url: https://huggingface.co/spaces/lmsys/mt-bench
 ---
-
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 
@@ -86,12 +236,9 @@ Here's how you can run the model using the `pipeline()` function from 🤗 Trans
 # Install transformers from source - only needed for versions <= v4.34
 # pip install git+https://github.com/huggingface/transformers.git
 # pip install accelerate
-
 import torch
 from transformers import pipeline
-
 pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta", torch_dtype=torch.bfloat16, device_map="auto")
-
 # We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
 messages = [
     {
@@ -149,12 +296,8 @@ The following hyperparameters were used during training:
 - lr_scheduler_type: linear
 - lr_scheduler_warmup_ratio: 0.1
 - num_epochs: 3.0
-
 ### Training results
-
 The table below shows the full set of DPO training metrics:
-
-
 | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
 |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
 | 0.6284 | 0.05 | 100 | 0.6098 | 0.0425 | -0.1872 | 0.7344 | 0.2297 | -258.8416 | -253.8099 | -2.7976 | -2.8234 |
@@ -215,19 +358,13 @@ The table below shows the full set of DPO training metrics:
 | 0.0077 | 2.89 | 5600 | 0.7520 | -4.5586 | -8.3485 | 0.7969 | 3.7899 | -340.4545 | -299.8206 | -2.3078 | -2.3517 |
 | 0.0094 | 2.94 | 5700 | 0.7527 | -4.5542 | -8.3509 | 0.7812 | 3.7967 | -340.4790 | -299.7773 | -2.3062 | -2.3510 |
 | 0.0054 | 2.99 | 5800 | 0.7520 | -4.5169 | -8.3079 | 0.7812 | 3.7911 | -340.0493 | -299.4038 | -2.3081 | -2.3530 |
-
-
 ### Framework versions
-
 - Transformers 4.35.0.dev0
 - Pytorch 2.0.1+cu118
 - Datasets 2.12.0
 - Tokenizers 0.14.0
-
 ## Citation
-
 If you find Zephyr-7B-β is useful in your work, please cite it with:
-
 ```
 @misc{tunstall2023zephyr,
       title={Zephyr: Direct Distillation of LM Alignment},
@@ -240,7 +377,6 @@ If you find Zephyr-7B-β is useful in your work, please cite it with:
 ```
 # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
 Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_HuggingFaceH4__zephyr-7b-beta)
-
 | Metric | Value |
 |-----------------------|---------------------------|
 | Avg. | 52.15 |