Wenboz committed on
Commit
71e95f7
1 Parent(s): 914e2f1

Model save

Files changed (4)
  1. README.md +80 -0
  2. all_results.json +9 -0
  3. train_results.json +9 -0
  4. trainer_state.json +977 -0
README.md ADDED
@@ -0,0 +1,80 @@
+ ---
+ license: mit
+ library_name: peft
+ tags:
+ - trl
+ - dpo
+ - generated_from_trainer
+ base_model: microsoft/Phi-3-mini-4k-instruct
+ model-index:
+ - name: phi3-offline-dpo-lora-noise-0.0-5e-6-42
+ results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/causal/huggingface/runs/ds27l9yx)
+ # phi3-offline-dpo-lora-noise-0.0-5e-6-42
+
+ This model is a fine-tuned version of [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on an unspecified dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.6633
+ - Rewards/chosen: -0.1262
+ - Rewards/rejected: -0.1959
+ - Rewards/accuracies: 0.7540
+ - Rewards/margins: 0.0697
+ - Logps/rejected: -403.3280
+ - Logps/chosen: -421.1901
+ - Logits/rejected: 12.0952
+ - Logits/chosen: 13.8997
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-06
+ - train_batch_size: 4
+ - eval_batch_size: 4
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 64
+ - total_eval_batch_size: 16
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 1
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
+ |:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
+ | 0.6931 | 0.1778 | 100 | 0.6835 | -0.0511 | -0.0728 | 0.6905 | 0.0218 | -391.0186 | -413.6780 | 12.3764 | 14.1803 |
+ | 0.689 | 0.3556 | 200 | 0.6682 | -0.1441 | -0.2014 | 0.7460 | 0.0573 | -403.8743 | -422.9761 | 12.1803 | 13.9841 |
+ | 0.6923 | 0.5333 | 300 | 0.6673 | -0.1140 | -0.1749 | 0.7897 | 0.0609 | -401.2295 | -419.9747 | 12.1769 | 13.9748 |
+ | 0.6914 | 0.7111 | 400 | 0.6655 | -0.1195 | -0.1839 | 0.7698 | 0.0644 | -402.1236 | -420.5240 | 12.1267 | 13.9317 |
+ | 0.696 | 0.8889 | 500 | 0.6633 | -0.1262 | -0.1959 | 0.7540 | 0.0697 | -403.3280 | -421.1901 | 12.0952 | 13.8997 |
+
+
+ ### Framework versions
+
+ - PEFT 0.7.1
+ - Transformers 4.42.3
+ - Pytorch 2.3.0+cu121
+ - Datasets 2.14.6
+ - Tokenizers 0.19.1
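
The card above describes a LoRA adapter trained with DPO on top of `microsoft/Phi-3-mini-4k-instruct`. As a rough illustration, the sketch below shows one way such an adapter could be loaded for inference with the PEFT and Transformers versions listed in the card; the adapter repo id used here is an assumption inferred from the committer and model name, not something this commit confirms.

```python
# Hedged sketch: load the Phi-3 base model and attach a DPO-trained LoRA adapter.
# The adapter id below is an assumption (committer + model name); adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "microsoft/Phi-3-mini-4k-instruct"
adapter_id = "Wenboz/phi3-offline-dpo-lora-noise-0.0-5e-6-42"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attaches the LoRA weights
model.eval()

prompt = "Explain direct preference optimization in one sentence."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the card itself was auto-generated and marked incomplete, treat this purely as a starting point rather than documented usage.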
all_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 0.9991111111111111,
+ "total_flos": 0.0,
+ "train_loss": 0.6916975294143703,
+ "train_runtime": 7520.1237,
+ "train_samples": 36000,
+ "train_samples_per_second": 4.787,
+ "train_steps_per_second": 0.075
+ }
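
As a quick consistency check, the throughput fields in this summary follow, approximately, from the other values, assuming the Trainer's usual definitions: samples per second is roughly `train_samples / train_runtime`, and steps per second is roughly the total optimizer steps (562, from `trainer_state.json`) divided by `train_runtime`. A small sketch:

```python
# Sanity-check sketch: recompute the reported throughput from all_results.json.
# Assumes the file is available locally; 562 is global_step from trainer_state.json.
import json

with open("all_results.json") as f:
    r = json.load(f)

steps = 562
print(round(r["train_samples"] / r["train_runtime"], 3))  # ~4.787 samples/s
print(round(steps / r["train_runtime"], 3))               # ~0.075 steps/s
```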
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+ "epoch": 0.9991111111111111,
+ "total_flos": 0.0,
+ "train_loss": 0.6916975294143703,
+ "train_runtime": 7520.1237,
+ "train_samples": 36000,
+ "train_samples_per_second": 4.787,
+ "train_steps_per_second": 0.075
+ }
trainer_state.json ADDED
@@ -0,0 +1,977 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 0.9991111111111111,
5
+ "eval_steps": 100,
6
+ "global_step": 562,
7
+ "is_hyper_param_search": false,
8
+ "is_local_process_zero": true,
9
+ "is_world_process_zero": true,
10
+ "log_history": [
11
+ {
12
+ "epoch": 0.0017777777777777779,
13
+ "grad_norm": 0.22311078073034826,
14
+ "learning_rate": 8.771929824561404e-08,
15
+ "logits/chosen": 14.539060592651367,
16
+ "logits/rejected": 15.870795249938965,
17
+ "logps/chosen": -470.04345703125,
18
+ "logps/rejected": -509.49163818359375,
19
+ "loss": 0.6931,
20
+ "rewards/accuracies": 0.0,
21
+ "rewards/chosen": 0.0,
22
+ "rewards/margins": 0.0,
23
+ "rewards/rejected": 0.0,
24
+ "step": 1
25
+ },
26
+ {
27
+ "epoch": 0.017777777777777778,
28
+ "grad_norm": 0.1976540584216542,
29
+ "learning_rate": 8.771929824561404e-07,
30
+ "logits/chosen": 12.96641731262207,
31
+ "logits/rejected": 13.155448913574219,
32
+ "logps/chosen": -400.9219665527344,
33
+ "logps/rejected": -399.60699462890625,
34
+ "loss": 0.693,
35
+ "rewards/accuracies": 0.4513888955116272,
36
+ "rewards/chosen": -0.00211474671959877,
37
+ "rewards/margins": -0.001011626678518951,
38
+ "rewards/rejected": -0.0011031196918338537,
39
+ "step": 10
40
+ },
41
+ {
42
+ "epoch": 0.035555555555555556,
43
+ "grad_norm": 0.21429385122960512,
44
+ "learning_rate": 1.7543859649122807e-06,
45
+ "logits/chosen": 13.817936897277832,
46
+ "logits/rejected": 13.85405158996582,
47
+ "logps/chosen": -420.9461975097656,
48
+ "logps/rejected": -404.7037658691406,
49
+ "loss": 0.6938,
50
+ "rewards/accuracies": 0.4625000059604645,
51
+ "rewards/chosen": -6.537516310345381e-05,
52
+ "rewards/margins": -0.0014555875677615404,
53
+ "rewards/rejected": 0.0013902120990678668,
54
+ "step": 20
55
+ },
56
+ {
57
+ "epoch": 0.05333333333333334,
58
+ "grad_norm": 0.1898289423532557,
59
+ "learning_rate": 2.631578947368421e-06,
60
+ "logits/chosen": 13.255971908569336,
61
+ "logits/rejected": 13.399693489074707,
62
+ "logps/chosen": -402.4463806152344,
63
+ "logps/rejected": -412.36480712890625,
64
+ "loss": 0.6937,
65
+ "rewards/accuracies": 0.5,
66
+ "rewards/chosen": -0.0020085538271814585,
67
+ "rewards/margins": -0.0013788806973025203,
68
+ "rewards/rejected": -0.00062967324629426,
69
+ "step": 30
70
+ },
71
+ {
72
+ "epoch": 0.07111111111111111,
73
+ "grad_norm": 0.2162185808879369,
74
+ "learning_rate": 3.5087719298245615e-06,
75
+ "logits/chosen": 13.550105094909668,
76
+ "logits/rejected": 13.091300964355469,
77
+ "logps/chosen": -379.90777587890625,
78
+ "logps/rejected": -373.92138671875,
79
+ "loss": 0.6935,
80
+ "rewards/accuracies": 0.4937500059604645,
81
+ "rewards/chosen": 0.0008380211656913161,
82
+ "rewards/margins": -0.00024208976537920535,
83
+ "rewards/rejected": 0.0010801110183820128,
84
+ "step": 40
85
+ },
86
+ {
87
+ "epoch": 0.08888888888888889,
88
+ "grad_norm": 0.22810063623939997,
89
+ "learning_rate": 4.385964912280702e-06,
90
+ "logits/chosen": 13.239973068237305,
91
+ "logits/rejected": 13.241312980651855,
92
+ "logps/chosen": -396.4941101074219,
93
+ "logps/rejected": -430.57049560546875,
94
+ "loss": 0.6932,
95
+ "rewards/accuracies": 0.5249999761581421,
96
+ "rewards/chosen": -0.0004129047447349876,
97
+ "rewards/margins": 4.834262654185295e-05,
98
+ "rewards/rejected": -0.00046124737127684057,
99
+ "step": 50
100
+ },
101
+ {
102
+ "epoch": 0.10666666666666667,
103
+ "grad_norm": 0.2512545993725567,
104
+ "learning_rate": 4.999564631597802e-06,
105
+ "logits/chosen": 13.0819673538208,
106
+ "logits/rejected": 13.337594985961914,
107
+ "logps/chosen": -388.62628173828125,
108
+ "logps/rejected": -414.85382080078125,
109
+ "loss": 0.6928,
110
+ "rewards/accuracies": 0.518750011920929,
111
+ "rewards/chosen": -0.002572793047875166,
112
+ "rewards/margins": 0.0024341598618775606,
113
+ "rewards/rejected": -0.005006952676922083,
114
+ "step": 60
115
+ },
116
+ {
117
+ "epoch": 0.12444444444444444,
118
+ "grad_norm": 0.22761734037588918,
119
+ "learning_rate": 4.991828966534002e-06,
120
+ "logits/chosen": 13.887951850891113,
121
+ "logits/rejected": 13.595651626586914,
122
+ "logps/chosen": -450.8296813964844,
123
+ "logps/rejected": -444.7730407714844,
124
+ "loss": 0.6936,
125
+ "rewards/accuracies": 0.48750001192092896,
126
+ "rewards/chosen": -0.006924263201653957,
127
+ "rewards/margins": -0.001168354763649404,
128
+ "rewards/rejected": -0.005755907855927944,
129
+ "step": 70
130
+ },
131
+ {
132
+ "epoch": 0.14222222222222222,
133
+ "grad_norm": 0.21722031386603705,
134
+ "learning_rate": 4.974452899279292e-06,
135
+ "logits/chosen": 13.351213455200195,
136
+ "logits/rejected": 12.537083625793457,
137
+ "logps/chosen": -425.934814453125,
138
+ "logps/rejected": -383.2563781738281,
139
+ "loss": 0.6932,
140
+ "rewards/accuracies": 0.5375000238418579,
141
+ "rewards/chosen": -0.015285441651940346,
142
+ "rewards/margins": -0.0011091658379882574,
143
+ "rewards/rejected": -0.014176277443766594,
144
+ "step": 80
145
+ },
146
+ {
147
+ "epoch": 0.16,
148
+ "grad_norm": 0.27601757784168,
149
+ "learning_rate": 4.947503654462277e-06,
150
+ "logits/chosen": 13.753236770629883,
151
+ "logits/rejected": 13.5679292678833,
152
+ "logps/chosen": -435.3922424316406,
153
+ "logps/rejected": -426.5663146972656,
154
+ "loss": 0.693,
155
+ "rewards/accuracies": 0.4437499940395355,
156
+ "rewards/chosen": -0.03359255567193031,
157
+ "rewards/margins": -0.0016514979070052505,
158
+ "rewards/rejected": -0.03194105625152588,
159
+ "step": 90
160
+ },
161
+ {
162
+ "epoch": 0.17777777777777778,
163
+ "grad_norm": 0.23435867570612137,
164
+ "learning_rate": 4.911085493475802e-06,
165
+ "logits/chosen": 12.843754768371582,
166
+ "logits/rejected": 12.69409465789795,
167
+ "logps/chosen": -412.6481018066406,
168
+ "logps/rejected": -412.96405029296875,
169
+ "loss": 0.6931,
170
+ "rewards/accuracies": 0.518750011920929,
171
+ "rewards/chosen": -0.0513535737991333,
172
+ "rewards/margins": 0.002654131967574358,
173
+ "rewards/rejected": -0.05400770157575607,
174
+ "step": 100
175
+ },
176
+ {
177
+ "epoch": 0.17777777777777778,
178
+ "eval_logits/chosen": 14.180258750915527,
179
+ "eval_logits/rejected": 12.376383781433105,
180
+ "eval_logps/chosen": -413.67803955078125,
181
+ "eval_logps/rejected": -391.0186462402344,
182
+ "eval_loss": 0.6835460066795349,
183
+ "eval_rewards/accuracies": 0.6904761791229248,
184
+ "eval_rewards/chosen": -0.05108103156089783,
185
+ "eval_rewards/margins": 0.021760080009698868,
186
+ "eval_rewards/rejected": -0.072841115295887,
187
+ "eval_runtime": 90.1605,
188
+ "eval_samples_per_second": 11.091,
189
+ "eval_steps_per_second": 0.699,
190
+ "step": 100
191
+ },
192
+ {
193
+ "epoch": 0.19555555555555557,
194
+ "grad_norm": 0.2803874830493494,
195
+ "learning_rate": 4.86533931110987e-06,
196
+ "logits/chosen": 13.455083847045898,
197
+ "logits/rejected": 13.341041564941406,
198
+ "logps/chosen": -415.30401611328125,
199
+ "logps/rejected": -425.7816467285156,
200
+ "loss": 0.6933,
201
+ "rewards/accuracies": 0.48124998807907104,
202
+ "rewards/chosen": -0.06350487470626831,
203
+ "rewards/margins": 0.00028126564575359225,
204
+ "rewards/rejected": -0.06378613412380219,
205
+ "step": 110
206
+ },
207
+ {
208
+ "epoch": 0.21333333333333335,
209
+ "grad_norm": 0.35378680708306215,
210
+ "learning_rate": 4.810442090457072e-06,
211
+ "logits/chosen": 12.907583236694336,
212
+ "logits/rejected": 13.067194938659668,
213
+ "logps/chosen": -389.30303955078125,
214
+ "logps/rejected": -398.48333740234375,
215
+ "loss": 0.6925,
216
+ "rewards/accuracies": 0.5,
217
+ "rewards/chosen": -0.06600853055715561,
218
+ "rewards/margins": -0.003265662584453821,
219
+ "rewards/rejected": -0.06274287402629852,
220
+ "step": 120
221
+ },
222
+ {
223
+ "epoch": 0.2311111111111111,
224
+ "grad_norm": 0.2869743496766919,
225
+ "learning_rate": 4.7466062181993855e-06,
226
+ "logits/chosen": 13.347814559936523,
227
+ "logits/rejected": 13.254618644714355,
228
+ "logps/chosen": -396.9159851074219,
229
+ "logps/rejected": -428.38861083984375,
230
+ "loss": 0.6921,
231
+ "rewards/accuracies": 0.5375000238418579,
232
+ "rewards/chosen": -0.08628938347101212,
233
+ "rewards/margins": 0.005706036929041147,
234
+ "rewards/rejected": -0.09199541807174683,
235
+ "step": 130
236
+ },
237
+ {
238
+ "epoch": 0.24888888888888888,
239
+ "grad_norm": 0.3379629862483891,
240
+ "learning_rate": 4.6740786629253595e-06,
241
+ "logits/chosen": 13.342312812805176,
242
+ "logits/rejected": 13.019983291625977,
243
+ "logps/chosen": -384.1279602050781,
244
+ "logps/rejected": -393.2745666503906,
245
+ "loss": 0.6912,
246
+ "rewards/accuracies": 0.5874999761581421,
247
+ "rewards/chosen": -0.0947018712759018,
248
+ "rewards/margins": 0.0077432007528841496,
249
+ "rewards/rejected": -0.10244506597518921,
250
+ "step": 140
251
+ },
252
+ {
253
+ "epoch": 0.26666666666666666,
254
+ "grad_norm": 0.350219971213616,
255
+ "learning_rate": 4.5931400196566256e-06,
256
+ "logits/chosen": 13.074136734008789,
257
+ "logits/rejected": 13.148015022277832,
258
+ "logps/chosen": -425.1592712402344,
259
+ "logps/rejected": -443.909423828125,
260
+ "loss": 0.6903,
261
+ "rewards/accuracies": 0.512499988079071,
262
+ "rewards/chosen": -0.11653944104909897,
263
+ "rewards/margins": 0.003865143982693553,
264
+ "rewards/rejected": -0.12040458619594574,
265
+ "step": 150
266
+ },
267
+ {
268
+ "epoch": 0.28444444444444444,
269
+ "grad_norm": 0.37627729798456006,
270
+ "learning_rate": 4.504103424280267e-06,
271
+ "logits/chosen": 13.34211540222168,
272
+ "logits/rejected": 13.010579109191895,
273
+ "logps/chosen": -439.13916015625,
274
+ "logps/rejected": -433.8727111816406,
275
+ "loss": 0.6908,
276
+ "rewards/accuracies": 0.5687500238418579,
277
+ "rewards/chosen": -0.16217225790023804,
278
+ "rewards/margins": 0.007592835463583469,
279
+ "rewards/rejected": -0.16976511478424072,
280
+ "step": 160
281
+ },
282
+ {
283
+ "epoch": 0.3022222222222222,
284
+ "grad_norm": 0.3180727458224856,
285
+ "learning_rate": 4.407313342086906e-06,
286
+ "logits/chosen": 13.196157455444336,
287
+ "logits/rejected": 13.456674575805664,
288
+ "logps/chosen": -426.7943420410156,
289
+ "logps/rejected": -437.723388671875,
290
+ "loss": 0.6937,
291
+ "rewards/accuracies": 0.5625,
292
+ "rewards/chosen": -0.16415511071681976,
293
+ "rewards/margins": 0.009541703388094902,
294
+ "rewards/rejected": -0.17369681596755981,
295
+ "step": 170
296
+ },
297
+ {
298
+ "epoch": 0.32,
299
+ "grad_norm": 0.31672781627835284,
300
+ "learning_rate": 4.303144235101412e-06,
301
+ "logits/chosen": 13.912513732910156,
302
+ "logits/rejected": 13.92699146270752,
303
+ "logps/chosen": -400.5416564941406,
304
+ "logps/rejected": -418.99554443359375,
305
+ "loss": 0.6907,
306
+ "rewards/accuracies": 0.5,
307
+ "rewards/chosen": -0.18776951730251312,
308
+ "rewards/margins": 0.00795517023652792,
309
+ "rewards/rejected": -0.1957246959209442,
310
+ "step": 180
311
+ },
312
+ {
313
+ "epoch": 0.3377777777777778,
314
+ "grad_norm": 0.3187670525496018,
315
+ "learning_rate": 4.1919991133620705e-06,
316
+ "logits/chosen": 12.957951545715332,
317
+ "logits/rejected": 12.835040092468262,
318
+ "logps/chosen": -428.75921630859375,
319
+ "logps/rejected": -429.77801513671875,
320
+ "loss": 0.6943,
321
+ "rewards/accuracies": 0.4437499940395355,
322
+ "rewards/chosen": -0.18785884976387024,
323
+ "rewards/margins": -0.00687809195369482,
324
+ "rewards/rejected": -0.18098074197769165,
325
+ "step": 190
326
+ },
327
+ {
328
+ "epoch": 0.35555555555555557,
329
+ "grad_norm": 0.31966290081913296,
330
+ "learning_rate": 4.074307975753044e-06,
331
+ "logits/chosen": 13.527238845825195,
332
+ "logits/rejected": 13.133882522583008,
333
+ "logps/chosen": -411.04052734375,
334
+ "logps/rejected": -405.96820068359375,
335
+ "loss": 0.689,
336
+ "rewards/accuracies": 0.550000011920929,
337
+ "rewards/chosen": -0.1447705328464508,
338
+ "rewards/margins": 0.013265645131468773,
339
+ "rewards/rejected": -0.15803618729114532,
340
+ "step": 200
341
+ },
342
+ {
343
+ "epoch": 0.35555555555555557,
344
+ "eval_logits/chosen": 13.984086036682129,
345
+ "eval_logits/rejected": 12.180322647094727,
346
+ "eval_logps/chosen": -422.97613525390625,
347
+ "eval_logps/rejected": -403.87432861328125,
348
+ "eval_loss": 0.6681899428367615,
349
+ "eval_rewards/accuracies": 0.7460317611694336,
350
+ "eval_rewards/chosen": -0.1440620720386505,
351
+ "eval_rewards/margins": 0.05733573064208031,
352
+ "eval_rewards/rejected": -0.20139780640602112,
353
+ "eval_runtime": 90.195,
354
+ "eval_samples_per_second": 11.087,
355
+ "eval_steps_per_second": 0.698,
356
+ "step": 200
357
+ },
358
+ {
359
+ "epoch": 0.37333333333333335,
360
+ "grad_norm": 0.2899334647869614,
361
+ "learning_rate": 3.950526146422213e-06,
362
+ "logits/chosen": 13.182365417480469,
363
+ "logits/rejected": 13.274526596069336,
364
+ "logps/chosen": -412.0303649902344,
365
+ "logps/rejected": -435.66754150390625,
366
+ "loss": 0.6921,
367
+ "rewards/accuracies": 0.48124998807907104,
368
+ "rewards/chosen": -0.1916799396276474,
369
+ "rewards/margins": -0.0004939109203405678,
370
+ "rewards/rejected": -0.19118604063987732,
371
+ "step": 210
372
+ },
373
+ {
374
+ "epoch": 0.39111111111111113,
375
+ "grad_norm": 0.3113343433252108,
376
+ "learning_rate": 3.821132513220511e-06,
377
+ "logits/chosen": 13.472271919250488,
378
+ "logits/rejected": 12.98419189453125,
379
+ "logps/chosen": -407.59857177734375,
380
+ "logps/rejected": -413.1893615722656,
381
+ "loss": 0.6852,
382
+ "rewards/accuracies": 0.543749988079071,
383
+ "rewards/chosen": -0.18176871538162231,
384
+ "rewards/margins": 0.02464126981794834,
385
+ "rewards/rejected": -0.2064099758863449,
386
+ "step": 220
387
+ },
388
+ {
389
+ "epoch": 0.4088888888888889,
390
+ "grad_norm": 0.37212373445120756,
391
+ "learning_rate": 3.686627674977858e-06,
392
+ "logits/chosen": 12.922555923461914,
393
+ "logits/rejected": 13.052223205566406,
394
+ "logps/chosen": -420.0254821777344,
395
+ "logps/rejected": -438.8111267089844,
396
+ "loss": 0.6927,
397
+ "rewards/accuracies": 0.5249999761581421,
398
+ "rewards/chosen": -0.2267444133758545,
399
+ "rewards/margins": 0.00632152333855629,
400
+ "rewards/rejected": -0.23306593298912048,
401
+ "step": 230
402
+ },
403
+ {
404
+ "epoch": 0.4266666666666667,
405
+ "grad_norm": 0.4194286554865179,
406
+ "learning_rate": 3.547532004783539e-06,
407
+ "logits/chosen": 13.442713737487793,
408
+ "logits/rejected": 12.382542610168457,
409
+ "logps/chosen": -448.96044921875,
410
+ "logps/rejected": -424.40631103515625,
411
+ "loss": 0.6883,
412
+ "rewards/accuracies": 0.643750011920929,
413
+ "rewards/chosen": -0.2144048511981964,
414
+ "rewards/margins": 0.04138387367129326,
415
+ "rewards/rejected": -0.2557887136936188,
416
+ "step": 240
417
+ },
418
+ {
419
+ "epoch": 0.4444444444444444,
420
+ "grad_norm": 0.2831313864511856,
421
+ "learning_rate": 3.404383636763809e-06,
422
+ "logits/chosen": 13.222406387329102,
423
+ "logits/rejected": 12.874628067016602,
424
+ "logps/chosen": -460.45782470703125,
425
+ "logps/rejected": -457.3854064941406,
426
+ "loss": 0.6915,
427
+ "rewards/accuracies": 0.5562499761581421,
428
+ "rewards/chosen": -0.22632627189159393,
429
+ "rewards/margins": 0.018462661653757095,
430
+ "rewards/rejected": -0.24478892982006073,
431
+ "step": 250
432
+ },
433
+ {
434
+ "epoch": 0.4622222222222222,
435
+ "grad_norm": 0.3260671824484102,
436
+ "learning_rate": 3.2577363841455063e-06,
437
+ "logits/chosen": 13.391107559204102,
438
+ "logits/rejected": 13.3804931640625,
439
+ "logps/chosen": -380.6063537597656,
440
+ "logps/rejected": -394.7472839355469,
441
+ "loss": 0.6954,
442
+ "rewards/accuracies": 0.53125,
443
+ "rewards/chosen": -0.20835450291633606,
444
+ "rewards/margins": 0.013799709267914295,
445
+ "rewards/rejected": -0.22215421497821808,
446
+ "step": 260
447
+ },
448
+ {
449
+ "epoch": 0.48,
450
+ "grad_norm": 0.3777555315785377,
451
+ "learning_rate": 3.1081575966602627e-06,
452
+ "logits/chosen": 12.585054397583008,
453
+ "logits/rejected": 12.490537643432617,
454
+ "logps/chosen": -439.46844482421875,
455
+ "logps/rejected": -421.29962158203125,
456
+ "loss": 0.6929,
457
+ "rewards/accuracies": 0.4124999940395355,
458
+ "rewards/chosen": -0.19497910141944885,
459
+ "rewards/margins": -0.00843017641454935,
460
+ "rewards/rejected": -0.18654890358448029,
461
+ "step": 270
462
+ },
463
+ {
464
+ "epoch": 0.49777777777777776,
465
+ "grad_norm": 0.2554633592042652,
466
+ "learning_rate": 2.9562259655786067e-06,
467
+ "logits/chosen": 13.010714530944824,
468
+ "logits/rejected": 12.371983528137207,
469
+ "logps/chosen": -389.00665283203125,
470
+ "logps/rejected": -389.6372985839844,
471
+ "loss": 0.6933,
472
+ "rewards/accuracies": 0.518750011920929,
473
+ "rewards/chosen": -0.14177796244621277,
474
+ "rewards/margins": 0.008302886970341206,
475
+ "rewards/rejected": -0.1500808447599411,
476
+ "step": 280
477
+ },
478
+ {
479
+ "epoch": 0.5155555555555555,
480
+ "grad_norm": 0.27965221608690755,
481
+ "learning_rate": 2.802529284865863e-06,
482
+ "logits/chosen": 12.02336597442627,
483
+ "logits/rejected": 12.071900367736816,
484
+ "logps/chosen": -402.3577575683594,
485
+ "logps/rejected": -393.9205017089844,
486
+ "loss": 0.6895,
487
+ "rewards/accuracies": 0.48750001192092896,
488
+ "rewards/chosen": -0.13531939685344696,
489
+ "rewards/margins": 0.004990140907466412,
490
+ "rewards/rejected": -0.14030954241752625,
491
+ "step": 290
492
+ },
493
+ {
494
+ "epoch": 0.5333333333333333,
495
+ "grad_norm": 0.2913462405762709,
496
+ "learning_rate": 2.6476621771214865e-06,
497
+ "logits/chosen": 13.3645601272583,
498
+ "logits/rejected": 13.140283584594727,
499
+ "logps/chosen": -407.8977966308594,
500
+ "logps/rejected": -416.1576232910156,
501
+ "loss": 0.6923,
502
+ "rewards/accuracies": 0.543749988079071,
503
+ "rewards/chosen": -0.1288231909275055,
504
+ "rewards/margins": 0.010681845247745514,
505
+ "rewards/rejected": -0.1395050287246704,
506
+ "step": 300
507
+ },
508
+ {
509
+ "epoch": 0.5333333333333333,
510
+ "eval_logits/chosen": 13.974791526794434,
511
+ "eval_logits/rejected": 12.176884651184082,
512
+ "eval_logps/chosen": -419.9747009277344,
513
+ "eval_logps/rejected": -401.2295227050781,
514
+ "eval_loss": 0.6673460006713867,
515
+ "eval_rewards/accuracies": 0.7896825671195984,
516
+ "eval_rewards/chosen": -0.11404754966497421,
517
+ "eval_rewards/margins": 0.06090213730931282,
518
+ "eval_rewards/rejected": -0.17494967579841614,
519
+ "eval_runtime": 90.2773,
520
+ "eval_samples_per_second": 11.077,
521
+ "eval_steps_per_second": 0.698,
522
+ "step": 300
523
+ },
524
+ {
525
+ "epoch": 0.5511111111111111,
526
+ "grad_norm": 0.340661457280574,
527
+ "learning_rate": 2.4922237930997435e-06,
528
+ "logits/chosen": 12.930778503417969,
529
+ "logits/rejected": 12.75629711151123,
530
+ "logps/chosen": -408.3033142089844,
531
+ "logps/rejected": -429.09954833984375,
532
+ "loss": 0.6891,
533
+ "rewards/accuracies": 0.5,
534
+ "rewards/chosen": -0.1441982239484787,
535
+ "rewards/margins": 0.008970534428954124,
536
+ "rewards/rejected": -0.15316873788833618,
537
+ "step": 310
538
+ },
539
+ {
540
+ "epoch": 0.5688888888888889,
541
+ "grad_norm": 0.3045715209107194,
542
+ "learning_rate": 2.3368154937118355e-06,
543
+ "logits/chosen": 12.338762283325195,
544
+ "logits/rejected": 12.408100128173828,
545
+ "logps/chosen": -417.6708984375,
546
+ "logps/rejected": -436.1878356933594,
547
+ "loss": 0.6906,
548
+ "rewards/accuracies": 0.53125,
549
+ "rewards/chosen": -0.1610310822725296,
550
+ "rewards/margins": 0.003636928740888834,
551
+ "rewards/rejected": -0.164668008685112,
552
+ "step": 320
553
+ },
554
+ {
555
+ "epoch": 0.5866666666666667,
556
+ "grad_norm": 0.3261096004039151,
557
+ "learning_rate": 2.1820385234773604e-06,
558
+ "logits/chosen": 12.432132720947266,
559
+ "logits/rejected": 12.333971977233887,
560
+ "logps/chosen": -371.9541015625,
561
+ "logps/rejected": -377.29327392578125,
562
+ "loss": 0.6938,
563
+ "rewards/accuracies": 0.5249999761581421,
564
+ "rewards/chosen": -0.14137257635593414,
565
+ "rewards/margins": 0.012616041116416454,
566
+ "rewards/rejected": -0.15398862957954407,
567
+ "step": 330
568
+ },
569
+ {
570
+ "epoch": 0.6044444444444445,
571
+ "grad_norm": 0.3167841738260838,
572
+ "learning_rate": 2.02849168442607e-06,
573
+ "logits/chosen": 13.544441223144531,
574
+ "logits/rejected": 13.638467788696289,
575
+ "logps/chosen": -407.0322265625,
576
+ "logps/rejected": -412.5428771972656,
577
+ "loss": 0.6934,
578
+ "rewards/accuracies": 0.543749988079071,
579
+ "rewards/chosen": -0.1435524821281433,
580
+ "rewards/margins": 0.008516514673829079,
581
+ "rewards/rejected": -0.15206900238990784,
582
+ "step": 340
583
+ },
584
+ {
585
+ "epoch": 0.6222222222222222,
586
+ "grad_norm": 0.30772925863397405,
587
+ "learning_rate": 1.876769019449141e-06,
588
+ "logits/chosen": 12.576923370361328,
589
+ "logits/rejected": 12.56715202331543,
590
+ "logps/chosen": -379.2666015625,
591
+ "logps/rejected": -408.4591979980469,
592
+ "loss": 0.6904,
593
+ "rewards/accuracies": 0.550000011920929,
594
+ "rewards/chosen": -0.12999220192432404,
595
+ "rewards/margins": 0.009667792357504368,
596
+ "rewards/rejected": -0.13966000080108643,
597
+ "step": 350
598
+ },
599
+ {
600
+ "epoch": 0.64,
601
+ "grad_norm": 0.3510251282815541,
602
+ "learning_rate": 1.7274575140626318e-06,
603
+ "logits/chosen": 12.569269180297852,
604
+ "logits/rejected": 12.474992752075195,
605
+ "logps/chosen": -426.98651123046875,
606
+ "logps/rejected": -420.2184143066406,
607
+ "loss": 0.693,
608
+ "rewards/accuracies": 0.46875,
609
+ "rewards/chosen": -0.14840544760227203,
610
+ "rewards/margins": -0.0038787845987826586,
611
+ "rewards/rejected": -0.1445266604423523,
612
+ "step": 360
613
+ },
614
+ {
615
+ "epoch": 0.6577777777777778,
616
+ "grad_norm": 0.29359728465406504,
617
+ "learning_rate": 1.5811348254745574e-06,
618
+ "logits/chosen": 13.278701782226562,
619
+ "logits/rejected": 13.138870239257812,
620
+ "logps/chosen": -419.22552490234375,
621
+ "logps/rejected": -429.7049865722656,
622
+ "loss": 0.6907,
623
+ "rewards/accuracies": 0.550000011920929,
624
+ "rewards/chosen": -0.1437017023563385,
625
+ "rewards/margins": 0.006204875651746988,
626
+ "rewards/rejected": -0.14990659058094025,
627
+ "step": 370
628
+ },
629
+ {
630
+ "epoch": 0.6755555555555556,
631
+ "grad_norm": 0.3041311221496272,
632
+ "learning_rate": 1.4383670477413676e-06,
633
+ "logits/chosen": 12.825994491577148,
634
+ "logits/rejected": 12.180424690246582,
635
+ "logps/chosen": -393.3042297363281,
636
+ "logps/rejected": -379.05279541015625,
637
+ "loss": 0.6881,
638
+ "rewards/accuracies": 0.5375000238418579,
639
+ "rewards/chosen": -0.1392853856086731,
640
+ "rewards/margins": 0.007530958391726017,
641
+ "rewards/rejected": -0.14681634306907654,
642
+ "step": 380
643
+ },
644
+ {
645
+ "epoch": 0.6933333333333334,
646
+ "grad_norm": 0.3072757460893545,
647
+ "learning_rate": 1.2997065216600179e-06,
648
+ "logits/chosen": 12.89039134979248,
649
+ "logits/rejected": 13.162782669067383,
650
+ "logps/chosen": -427.39910888671875,
651
+ "logps/rejected": -431.7848205566406,
652
+ "loss": 0.6928,
653
+ "rewards/accuracies": 0.4937500059604645,
654
+ "rewards/chosen": -0.14786246418952942,
655
+ "rewards/margins": 0.0003430729848332703,
656
+ "rewards/rejected": -0.14820551872253418,
657
+ "step": 390
658
+ },
659
+ {
660
+ "epoch": 0.7111111111111111,
661
+ "grad_norm": 0.27986512697276633,
662
+ "learning_rate": 1.165689697868726e-06,
663
+ "logits/chosen": 12.779977798461914,
664
+ "logits/rejected": 12.16929817199707,
665
+ "logps/chosen": -420.75653076171875,
666
+ "logps/rejected": -415.25244140625,
667
+ "loss": 0.6914,
668
+ "rewards/accuracies": 0.53125,
669
+ "rewards/chosen": -0.14449290931224823,
670
+ "rewards/margins": 0.02024017833173275,
671
+ "rewards/rejected": -0.16473311185836792,
672
+ "step": 400
673
+ },
674
+ {
675
+ "epoch": 0.7111111111111111,
676
+ "eval_logits/chosen": 13.931650161743164,
677
+ "eval_logits/rejected": 12.126729011535645,
678
+ "eval_logps/chosen": -420.5239562988281,
679
+ "eval_logps/rejected": -402.1236267089844,
680
+ "eval_loss": 0.665495753288269,
681
+ "eval_rewards/accuracies": 0.7698412537574768,
682
+ "eval_rewards/chosen": -0.11954014003276825,
683
+ "eval_rewards/margins": 0.06435071676969528,
684
+ "eval_rewards/rejected": -0.18389087915420532,
685
+ "eval_runtime": 90.069,
686
+ "eval_samples_per_second": 11.103,
687
+ "eval_steps_per_second": 0.699,
688
+ "step": 400
689
+ },
690
+ {
691
+ "epoch": 0.7288888888888889,
692
+ "grad_norm": 0.3450004634992278,
693
+ "learning_rate": 1.0368350614236685e-06,
694
+ "logits/chosen": 12.588386535644531,
695
+ "logits/rejected": 12.996353149414062,
696
+ "logps/chosen": -403.56207275390625,
697
+ "logps/rejected": -424.3435974121094,
698
+ "loss": 0.6911,
699
+ "rewards/accuracies": 0.5,
700
+ "rewards/chosen": -0.15337641537189484,
701
+ "rewards/margins": 0.0022048167884349823,
702
+ "rewards/rejected": -0.15558123588562012,
703
+ "step": 410
704
+ },
705
+ {
706
+ "epoch": 0.7466666666666667,
707
+ "grad_norm": 0.2947231406284332,
708
+ "learning_rate": 9.136411258810229e-07,
709
+ "logits/chosen": 13.508552551269531,
710
+ "logits/rejected": 13.856636047363281,
711
+ "logps/chosen": -397.31103515625,
712
+ "logps/rejected": -415.14794921875,
713
+ "loss": 0.6916,
714
+ "rewards/accuracies": 0.53125,
715
+ "rewards/chosen": -0.14091812074184418,
716
+ "rewards/margins": 0.009603964164853096,
717
+ "rewards/rejected": -0.15052208304405212,
718
+ "step": 420
719
+ },
720
+ {
721
+ "epoch": 0.7644444444444445,
722
+ "grad_norm": 0.2949706409885573,
723
+ "learning_rate": 7.965845046448659e-07,
724
+ "logits/chosen": 12.716641426086426,
725
+ "logits/rejected": 12.739425659179688,
726
+ "logps/chosen": -413.5958557128906,
727
+ "logps/rejected": -418.07379150390625,
728
+ "loss": 0.6924,
729
+ "rewards/accuracies": 0.543749988079071,
730
+ "rewards/chosen": -0.13801425695419312,
731
+ "rewards/margins": 0.0007665277225896716,
732
+ "rewards/rejected": -0.13878078758716583,
733
+ "step": 430
734
+ },
735
+ {
736
+ "epoch": 0.7822222222222223,
737
+ "grad_norm": 0.29714148256405104,
738
+ "learning_rate": 6.861180670424983e-07,
739
+ "logits/chosen": 13.228073120117188,
740
+ "logits/rejected": 12.513433456420898,
741
+ "logps/chosen": -436.03326416015625,
742
+ "logps/rejected": -432.7162170410156,
743
+ "loss": 0.6882,
744
+ "rewards/accuracies": 0.5375000238418579,
745
+ "rewards/chosen": -0.14261861145496368,
746
+ "rewards/margins": 0.018722299486398697,
747
+ "rewards/rejected": -0.16134092211723328,
748
+ "step": 440
749
+ },
750
+ {
751
+ "epoch": 0.8,
752
+ "grad_norm": 0.35583565190086214,
753
+ "learning_rate": 5.826691862609987e-07,
754
+ "logits/chosen": 13.008562088012695,
755
+ "logits/rejected": 12.5421724319458,
756
+ "logps/chosen": -393.10614013671875,
757
+ "logps/rejected": -396.7547607421875,
758
+ "loss": 0.6858,
759
+ "rewards/accuracies": 0.543749988079071,
760
+ "rewards/chosen": -0.14102980494499207,
761
+ "rewards/margins": 0.006347469985485077,
762
+ "rewards/rejected": -0.14737728238105774,
763
+ "step": 450
764
+ },
765
+ {
766
+ "epoch": 0.8177777777777778,
767
+ "grad_norm": 1.2849378777119373,
768
+ "learning_rate": 4.866380859233891e-07,
769
+ "logits/chosen": 12.925387382507324,
770
+ "logits/rejected": 13.122329711914062,
771
+ "logps/chosen": -406.62255859375,
772
+ "logps/rejected": -432.58575439453125,
773
+ "loss": 0.6911,
774
+ "rewards/accuracies": 0.543749988079071,
775
+ "rewards/chosen": -0.13780517876148224,
776
+ "rewards/margins": 0.009602868929505348,
777
+ "rewards/rejected": -0.14740803837776184,
778
+ "step": 460
779
+ },
780
+ {
781
+ "epoch": 0.8355555555555556,
782
+ "grad_norm": 0.3135496868277975,
783
+ "learning_rate": 3.98396291701183e-07,
784
+ "logits/chosen": 13.169825553894043,
785
+ "logits/rejected": 12.919093132019043,
786
+ "logps/chosen": -420.40594482421875,
787
+ "logps/rejected": -424.626220703125,
788
+ "loss": 0.6887,
789
+ "rewards/accuracies": 0.4937500059604645,
790
+ "rewards/chosen": -0.14023754000663757,
791
+ "rewards/margins": 0.012833138927817345,
792
+ "rewards/rejected": -0.15307065844535828,
793
+ "step": 470
794
+ },
795
+ {
796
+ "epoch": 0.8533333333333334,
797
+ "grad_norm": 0.3340559430991029,
798
+ "learning_rate": 3.1828519395374095e-07,
799
+ "logits/chosen": 13.121709823608398,
800
+ "logits/rejected": 13.285209655761719,
801
+ "logps/chosen": -428.4640197753906,
802
+ "logps/rejected": -445.868408203125,
803
+ "loss": 0.6934,
804
+ "rewards/accuracies": 0.53125,
805
+ "rewards/chosen": -0.1603536158800125,
806
+ "rewards/margins": 0.00828765518963337,
807
+ "rewards/rejected": -0.16864125430583954,
808
+ "step": 480
809
+ },
810
+ {
811
+ "epoch": 0.8711111111111111,
812
+ "grad_norm": 0.32372602686992025,
813
+ "learning_rate": 2.466147269552893e-07,
814
+ "logits/chosen": 13.549275398254395,
815
+ "logits/rejected": 13.303945541381836,
816
+ "logps/chosen": -398.70245361328125,
817
+ "logps/rejected": -403.9294128417969,
818
+ "loss": 0.6932,
819
+ "rewards/accuracies": 0.59375,
820
+ "rewards/chosen": -0.1653539538383484,
821
+ "rewards/margins": 0.015345364809036255,
822
+ "rewards/rejected": -0.18069931864738464,
823
+ "step": 490
824
+ },
825
+ {
826
+ "epoch": 0.8888888888888888,
827
+ "grad_norm": 0.28909373758975987,
828
+ "learning_rate": 1.8366216981942632e-07,
829
+ "logits/chosen": 12.993377685546875,
830
+ "logits/rejected": 13.172945976257324,
831
+ "logps/chosen": -441.27288818359375,
832
+ "logps/rejected": -444.38983154296875,
833
+ "loss": 0.696,
834
+ "rewards/accuracies": 0.543749988079071,
835
+ "rewards/chosen": -0.17415288090705872,
836
+ "rewards/margins": -0.0011280607432126999,
837
+ "rewards/rejected": -0.17302480340003967,
838
+ "step": 500
839
+ },
840
+ {
841
+ "epoch": 0.8888888888888888,
842
+ "eval_logits/chosen": 13.899726867675781,
843
+ "eval_logits/rejected": 12.095231056213379,
844
+ "eval_logps/chosen": -421.1900939941406,
845
+ "eval_logps/rejected": -403.3279724121094,
846
+ "eval_loss": 0.6633419990539551,
847
+ "eval_rewards/accuracies": 0.7539682388305664,
848
+ "eval_rewards/chosen": -0.1262015700340271,
849
+ "eval_rewards/margins": 0.06973244994878769,
850
+ "eval_rewards/rejected": -0.19593402743339539,
851
+ "eval_runtime": 90.1435,
852
+ "eval_samples_per_second": 11.093,
853
+ "eval_steps_per_second": 0.699,
854
+ "step": 500
855
+ },
856
+ {
857
+ "epoch": 0.9066666666666666,
858
+ "grad_norm": 0.44905140288995754,
859
+ "learning_rate": 1.296710737600934e-07,
860
+ "logits/chosen": 12.890707015991211,
861
+ "logits/rejected": 12.480443000793457,
862
+ "logps/chosen": -397.1539611816406,
863
+ "logps/rejected": -404.51470947265625,
864
+ "loss": 0.6926,
865
+ "rewards/accuracies": 0.53125,
866
+ "rewards/chosen": -0.14948078989982605,
867
+ "rewards/margins": 0.0065328641794621944,
868
+ "rewards/rejected": -0.1560136377811432,
869
+ "step": 510
870
+ },
871
+ {
872
+ "epoch": 0.9244444444444444,
873
+ "grad_norm": 0.3012451953425647,
874
+ "learning_rate": 8.485031983924558e-08,
875
+ "logits/chosen": 13.72374153137207,
876
+ "logits/rejected": 14.290632247924805,
877
+ "logps/chosen": -409.4125061035156,
878
+ "logps/rejected": -427.1560974121094,
879
+ "loss": 0.6908,
880
+ "rewards/accuracies": 0.5062500238418579,
881
+ "rewards/chosen": -0.15699708461761475,
882
+ "rewards/margins": -0.00222613662481308,
883
+ "rewards/rejected": -0.15477094054222107,
884
+ "step": 520
885
+ },
886
+ {
887
+ "epoch": 0.9422222222222222,
888
+ "grad_norm": 0.3269177206871498,
889
+ "learning_rate": 4.93733108466013e-08,
890
+ "logits/chosen": 12.78927993774414,
891
+ "logits/rejected": 13.388028144836426,
892
+ "logps/chosen": -428.2759704589844,
893
+ "logps/rejected": -456.5814514160156,
894
+ "loss": 0.6895,
895
+ "rewards/accuracies": 0.48124998807907104,
896
+ "rewards/chosen": -0.15250369906425476,
897
+ "rewards/margins": 0.005445868708193302,
898
+ "rewards/rejected": -0.15794958174228668,
899
+ "step": 530
900
+ },
901
+ {
902
+ "epoch": 0.96,
903
+ "grad_norm": 0.33934624748600184,
904
+ "learning_rate": 2.3377300437934236e-08,
905
+ "logits/chosen": 13.640138626098633,
906
+ "logits/rejected": 13.247881889343262,
907
+ "logps/chosen": -384.22723388671875,
908
+ "logps/rejected": -377.16546630859375,
909
+ "loss": 0.6933,
910
+ "rewards/accuracies": 0.5625,
911
+ "rewards/chosen": -0.14129753410816193,
912
+ "rewards/margins": 0.019017567858099937,
913
+ "rewards/rejected": -0.16031508147716522,
914
+ "step": 540
915
+ },
916
+ {
917
+ "epoch": 0.9777777777777777,
918
+ "grad_norm": 0.31794002037631625,
919
+ "learning_rate": 6.962862127343206e-09,
920
+ "logits/chosen": 13.490945816040039,
921
+ "logits/rejected": 13.803362846374512,
922
+ "logps/chosen": -422.62939453125,
923
+ "logps/rejected": -427.26959228515625,
924
+ "loss": 0.6894,
925
+ "rewards/accuracies": 0.512499988079071,
926
+ "rewards/chosen": -0.167506605386734,
927
+ "rewards/margins": 0.010436911135911942,
928
+ "rewards/rejected": -0.17794351279735565,
929
+ "step": 550
930
+ },
931
+ {
932
+ "epoch": 0.9955555555555555,
933
+ "grad_norm": 0.29176173477304573,
934
+ "learning_rate": 1.9350018786556956e-10,
935
+ "logits/chosen": 13.073209762573242,
936
+ "logits/rejected": 13.310659408569336,
937
+ "logps/chosen": -431.8447265625,
938
+ "logps/rejected": -418.0799865722656,
939
+ "loss": 0.693,
940
+ "rewards/accuracies": 0.44999998807907104,
941
+ "rewards/chosen": -0.1620911806821823,
942
+ "rewards/margins": -0.006803811527788639,
943
+ "rewards/rejected": -0.15528738498687744,
944
+ "step": 560
945
+ },
946
+ {
947
+ "epoch": 0.9991111111111111,
948
+ "step": 562,
949
+ "total_flos": 0.0,
950
+ "train_loss": 0.6916975294143703,
951
+ "train_runtime": 7520.1237,
952
+ "train_samples_per_second": 4.787,
953
+ "train_steps_per_second": 0.075
954
+ }
955
+ ],
956
+ "logging_steps": 10,
957
+ "max_steps": 562,
958
+ "num_input_tokens_seen": 0,
959
+ "num_train_epochs": 1,
960
+ "save_steps": 100,
961
+ "stateful_callbacks": {
962
+ "TrainerControl": {
963
+ "args": {
964
+ "should_epoch_stop": false,
965
+ "should_evaluate": false,
966
+ "should_log": false,
967
+ "should_save": true,
968
+ "should_training_stop": true
969
+ },
970
+ "attributes": {}
971
+ }
972
+ },
973
+ "total_flos": 0.0,
974
+ "train_batch_size": 4,
975
+ "trial_name": null,
976
+ "trial_params": null
977
+ }
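
The `log_history` array above interleaves training logs (every 10 steps, per `logging_steps`) with evaluation logs (every 100 steps, per `eval_steps`), the latter recognizable by their `eval_`-prefixed keys. A minimal sketch for pulling the evaluation curve out of a local copy of this file:

```python
# Sketch: extract the evaluation curve from trainer_state.json's log_history.
# Eval entries are the ones carrying "eval_"-prefixed keys (logged every 100 steps here).
import json

with open("trainer_state.json") as f:
    state = json.load(f)

eval_logs = [e for e in state["log_history"] if "eval_loss" in e]
for e in eval_logs:
    print(f'step {e["step"]:>4}  eval_loss {e["eval_loss"]:.4f}  '
          f'acc {e["eval_rewards/accuracies"]:.3f}  margin {e["eval_rewards/margins"]:.4f}')
```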