Wenboz committed
Commit becb441 · verified · 1 Parent(s): 5cc3804

Model save

Files changed (4)
  1. README.md +83 -0
  2. all_results.json +9 -0
  3. train_results.json +9 -0
  4. trainer_state.json +1731 -0
README.md ADDED
@@ -0,0 +1,83 @@
---
base_model: princeton-nlp/Llama-3-Base-8B-SFT
library_name: peft
tags:
- trl
- dpo
- generated_from_trainer
model-index:
- name: llama3-wpo-lora
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# llama3-wpo-lora

This model is a fine-tuned version of [princeton-nlp/Llama-3-Base-8B-SFT](https://huggingface.co/princeton-nlp/Llama-3-Base-8B-SFT) on an unspecified dataset.
It achieves the following results on the evaluation set:
- Loss: 0.5134
- Rewards/chosen: -0.2023
- Rewards/rejected: -1.1119
- Rewards/accuracies: 0.7480
- Rewards/margins: 0.9095
- Logps/rejected: -287.7953
- Logps/chosen: -294.5704
- Logps/ref Response: -0.5364
- Logits/rejected: -0.1602
- Logits/chosen: -0.2100
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-06
- train_batch_size: 1
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
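The hyperparameters above can be sanity-checked with a small sketch. The warmup length (96 steps) is an inference from the logged learning rates in trainer_state.json (955 total steps, ratio 0.1 rounded up), not a value stated in the card; the schedule itself mirrors the standard cosine-with-linear-warmup used by `transformers`.

```python
import math

# Sketch of the cosine-with-warmup schedule listed above. The warmup
# length is inferred from this run's logs: 955 total optimizer steps,
# warmup = ceil(0.1 * 955) = 96 (an assumption, not stated in the card).
PEAK_LR = 5e-06
TOTAL_STEPS = 955
WARMUP_STEPS = 96

def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer steps."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS  # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

# The effective batch size in the list above:
# 1 per device x 4 GPUs x 16 gradient-accumulation steps = 64.
effective_batch = 1 * 4 * 16
```

With these constants, `lr_at(1)` reproduces the first logged learning rate (5.208333e-08) and `lr_at(100)` the rate logged at step 100 (4.99973e-06), which is what motivated the warmup-step guess.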
### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logps/ref Response | Logits/rejected | Logits/chosen |
|:-------------:|:------:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:------------------:|:---------------:|:-------------:|
| 0.6044        | 0.1047 | 100  | 0.5889          | 0.1186         | -0.2672          | 0.6840             | 0.3859          | -279.3490      | -291.3607    | -0.5364            | -0.5369         | -0.5447       |
| 0.5438        | 0.2094 | 200  | 0.5452          | 0.0540         | -0.6279          | 0.7180             | 0.6819          | -282.9556      | -292.0069    | -0.5364            | -0.4631         | -0.4851       |
| 0.5367        | 0.3141 | 300  | 0.5323          | -0.0871        | -0.8542          | 0.7240             | 0.7671          | -285.2182      | -293.4178    | -0.5364            | -0.3777         | -0.4077       |
| 0.5196        | 0.4187 | 400  | 0.5236          | -0.0378        | -0.8614          | 0.7320             | 0.8235          | -285.2903      | -292.9255    | -0.5364            | -0.2899         | -0.3281       |
| 0.509         | 0.5234 | 500  | 0.5185          | -0.2693        | -1.1302          | 0.7360             | 0.8610          | -287.9790      | -295.2397    | -0.5364            | -0.2296         | -0.2739       |
| 0.5012        | 0.6281 | 600  | 0.5152          | -0.3520        | -1.2471          | 0.7480             | 0.8951          | -289.1475      | -296.0675    | -0.5364            | -0.1926         | -0.2397       |
| 0.5168        | 0.7328 | 700  | 0.5139          | -0.2521        | -1.1562          | 0.7440             | 0.9041          | -288.2387      | -295.0681    | -0.5364            | -0.1665         | -0.2158       |
| 0.5156        | 0.8375 | 800  | 0.5135          | -0.2204        | -1.1304          | 0.7520             | 0.9099          | -287.9801      | -294.7516    | -0.5364            | -0.1603         | -0.2103       |
| 0.506         | 0.9422 | 900  | 0.5134          | -0.2023        | -1.1119          | 0.7480             | 0.9095          | -287.7953      | -294.5704    | -0.5364            | -0.1602         | -0.2100       |

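One way to read the table: the Rewards/margins column is simply Rewards/chosen minus Rewards/rejected, so the margin growing while both rewards drift negative means the gap between chosen and rejected responses is widening. A quick check on the final eval row (values rounded to 4 d.p. in the table):

```python
# "Rewards/margins" is chosen minus rejected; final eval row (step 900).
rewards_chosen = -0.2023
rewards_rejected = -1.1119
margin = rewards_chosen - rewards_rejected
# Agrees with the reported margin of 0.9095 up to table rounding.
```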
### Framework versions

- PEFT 0.7.1
- Transformers 4.44.2
- Pytorch 2.2.1+cu121
- Datasets 2.14.6
- Tokenizers 0.19.1
all_results.json ADDED
@@ -0,0 +1,9 @@
{
    "epoch": 0.9997382884061764,
    "total_flos": 0.0,
    "train_loss": 0.5342667900454936,
    "train_runtime": 19113.1655,
    "train_samples": 61135,
    "train_samples_per_second": 3.199,
    "train_steps_per_second": 0.05
}
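The throughput fields in all_results.json are derived from the others: samples per second is train_samples over train_runtime, and steps per second is total optimizer steps over train_runtime (955 steps, taken from the global_step in trainer_state.json below):

```python
# Consistency check on the throughput figures in all_results.json.
train_runtime = 19113.1655   # seconds
train_samples = 61135
total_steps = 955            # global_step from trainer_state.json

samples_per_second = train_samples / train_runtime  # reported as 3.199
steps_per_second = total_steps / train_runtime      # reported as 0.05
```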
train_results.json ADDED
@@ -0,0 +1,9 @@
{
    "epoch": 0.9997382884061764,
    "total_flos": 0.0,
    "train_loss": 0.5342667900454936,
    "train_runtime": 19113.1655,
    "train_samples": 61135,
    "train_samples_per_second": 3.199,
    "train_steps_per_second": 0.05
}
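The "trl"/"dpo" tags suggest a DPO-style objective, where the per-example loss is the negative log-sigmoid of the reward margin (the rewards/* quantities in the logs below are already β-scaled log-probability differences against the reference model). Before any update the policy equals the reference, the margin is zero, and the loss starts at ln 2 ≈ 0.6931 — the first loss logged in trainer_state.json. This is a hedged sketch: the WPO variant this repo name implies adds a weighting term on top, and β is not stated in the card.

```python
import math

def dpo_loss(reward_margin: float) -> float:
    """DPO-style per-example loss: -log(sigmoid(margin)).

    `reward_margin` is rewards/chosen - rewards/rejected, where each
    reward is beta * (logp_policy - logp_reference) for that response.
    A sketch of the plain DPO term only; the WPO weighting is omitted.
    """
    return -math.log(1.0 / (1.0 + math.exp(-reward_margin)))

# At initialization the margin is 0, so the loss is ln 2 ~= 0.6931,
# matching the first "loss" entry in trainer_state.json.
```

Note that the logged batch losses are averages of per-example losses, so they need not equal `dpo_loss` evaluated at the logged average margin.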
trainer_state.json ADDED
@@ -0,0 +1,1731 @@
{
  "best_metric": null,
  "best_model_checkpoint": null,
  "epoch": 0.9997382884061764,
  "eval_steps": 100,
  "global_step": 955,
  "is_hyper_param_search": false,
  "is_local_process_zero": true,
  "is_world_process_zero": true,
  "log_history": [
    {
      "epoch": 0.0010468463752944255,
      "grad_norm": 4.03125,
      "learning_rate": 5.208333333333333e-08,
      "logits/chosen": -0.3494967222213745,
      "logits/rejected": -0.3728627860546112,
      "logps/chosen": -285.8127136230469,
      "logps/ref_response": -0.3494967222213745,
      "logps/rejected": -212.7957000732422,
      "loss": 0.6931,
      "rewards/accuracies": 0.0,
      "rewards/chosen": 0.0,
      "rewards/margins": 0.0,
      "rewards/rejected": 0.0,
      "step": 1
    },
    {
      "epoch": 0.010468463752944255,
      "grad_norm": 3.8125,
      "learning_rate": 5.208333333333334e-07,
      "logits/chosen": -0.5401131510734558,
      "logits/rejected": -0.5498467683792114,
      "logps/chosen": -315.3433532714844,
      "logps/ref_response": -0.5399107336997986,
      "logps/rejected": -278.06756591796875,
      "loss": 0.6924,
      "rewards/accuracies": 0.4444444477558136,
      "rewards/chosen": -0.0011721360497176647,
      "rewards/margins": 0.004719285294413567,
      "rewards/rejected": -0.005891421809792519,
      "step": 10
    },
    {
      "epoch": 0.02093692750588851,
      "grad_norm": 3.859375,
      "learning_rate": 1.0416666666666667e-06,
      "logits/chosen": -0.5040869116783142,
      "logits/rejected": -0.5244153738021851,
      "logps/chosen": -306.72930908203125,
      "logps/ref_response": -0.5032420754432678,
      "logps/rejected": -271.22784423828125,
      "loss": 0.6921,
      "rewards/accuracies": 0.512499988079071,
      "rewards/chosen": 0.004430481232702732,
      "rewards/margins": 0.005479422397911549,
      "rewards/rejected": -0.0010489404667168856,
      "step": 20
    },
    {
      "epoch": 0.031405391258832765,
      "grad_norm": 3.875,
      "learning_rate": 1.5625e-06,
      "logits/chosen": -0.5105286240577698,
      "logits/rejected": -0.5181563496589661,
      "logps/chosen": -290.9847717285156,
      "logps/ref_response": -0.5080639123916626,
      "logps/rejected": -252.4471435546875,
      "loss": 0.6875,
      "rewards/accuracies": 0.59375,
      "rewards/chosen": 0.018009770661592484,
      "rewards/margins": 0.021275093778967857,
      "rewards/rejected": -0.0032653254456818104,
      "step": 30
    },
    {
      "epoch": 0.04187385501177702,
      "grad_norm": 3.25,
      "learning_rate": 2.0833333333333334e-06,
      "logits/chosen": -0.48318833112716675,
      "logits/rejected": -0.5184761881828308,
      "logps/chosen": -305.87347412109375,
      "logps/ref_response": -0.47757530212402344,
      "logps/rejected": -244.558349609375,
      "loss": 0.6771,
      "rewards/accuracies": 0.612500011920929,
      "rewards/chosen": 0.04270617291331291,
      "rewards/margins": 0.042043447494506836,
      "rewards/rejected": 0.0006627263501286507,
      "step": 40
    },
    {
      "epoch": 0.05234231876472128,
      "grad_norm": 2.09375,
      "learning_rate": 2.604166666666667e-06,
      "logits/chosen": -0.5472795963287354,
      "logits/rejected": -0.575782060623169,
      "logps/chosen": -304.7160339355469,
      "logps/ref_response": -0.5367640256881714,
      "logps/rejected": -282.7024841308594,
      "loss": 0.6697,
      "rewards/accuracies": 0.6312500238418579,
      "rewards/chosen": 0.0860854759812355,
      "rewards/margins": 0.049294885247945786,
      "rewards/rejected": 0.03679059445858002,
      "step": 50
    },
    {
      "epoch": 0.06281078251766553,
      "grad_norm": 2.875,
      "learning_rate": 3.125e-06,
      "logits/chosen": -0.5696572661399841,
      "logits/rejected": -0.5703103542327881,
      "logps/chosen": -290.3211975097656,
      "logps/ref_response": -0.5527787804603577,
      "logps/rejected": -254.42190551757812,
      "loss": 0.6511,
      "rewards/accuracies": 0.574999988079071,
      "rewards/chosen": 0.13902851939201355,
      "rewards/margins": 0.06114862486720085,
      "rewards/rejected": 0.0778798907995224,
      "step": 60
    },
    {
      "epoch": 0.07327924627060979,
      "grad_norm": 2.9375,
      "learning_rate": 3.6458333333333333e-06,
      "logits/chosen": -0.559634268283844,
      "logits/rejected": -0.5745820999145508,
      "logps/chosen": -285.9539489746094,
      "logps/ref_response": -0.5369429588317871,
      "logps/rejected": -263.0733947753906,
      "loss": 0.6325,
      "rewards/accuracies": 0.6875,
      "rewards/chosen": 0.234585240483284,
      "rewards/margins": 0.16143682599067688,
      "rewards/rejected": 0.07314838469028473,
      "step": 70
    },
    {
      "epoch": 0.08374771002355404,
      "grad_norm": 2.765625,
      "learning_rate": 4.166666666666667e-06,
      "logits/chosen": -0.4994782507419586,
      "logits/rejected": -0.5257306694984436,
      "logps/chosen": -287.14178466796875,
      "logps/ref_response": -0.46965378522872925,
      "logps/rejected": -273.8600158691406,
      "loss": 0.6093,
      "rewards/accuracies": 0.7250000238418579,
      "rewards/chosen": 0.3439296782016754,
      "rewards/margins": 0.2807127833366394,
      "rewards/rejected": 0.06321687251329422,
      "step": 80
    },
    {
      "epoch": 0.0942161737764983,
      "grad_norm": 2.65625,
      "learning_rate": 4.6875000000000004e-06,
      "logits/chosen": -0.530420184135437,
      "logits/rejected": -0.5506101846694946,
      "logps/chosen": -330.380615234375,
      "logps/ref_response": -0.4922845959663391,
      "logps/rejected": -296.19439697265625,
      "loss": 0.5874,
      "rewards/accuracies": 0.75,
      "rewards/chosen": 0.27289265394210815,
      "rewards/margins": 0.35931333899497986,
      "rewards/rejected": -0.0864206999540329,
      "step": 90
    },
    {
      "epoch": 0.10468463752944256,
      "grad_norm": 2.671875,
      "learning_rate": 4.9997324926814375e-06,
      "logits/chosen": -0.5689066648483276,
      "logits/rejected": -0.5622434020042419,
      "logps/chosen": -276.24310302734375,
      "logps/ref_response": -0.533843994140625,
      "logps/rejected": -291.2969970703125,
      "loss": 0.6044,
      "rewards/accuracies": 0.699999988079071,
      "rewards/chosen": 0.2586072087287903,
      "rewards/margins": 0.3670012056827545,
      "rewards/rejected": -0.10839401185512543,
      "step": 100
    },
    {
      "epoch": 0.10468463752944256,
      "eval_logits/chosen": -0.5446628332138062,
      "eval_logits/rejected": -0.5368726253509521,
      "eval_logps/chosen": -291.36065673828125,
      "eval_logps/ref_response": -0.536393404006958,
      "eval_logps/rejected": -279.3489685058594,
      "eval_loss": 0.5888689160346985,
      "eval_rewards/accuracies": 0.6840000152587891,
      "eval_rewards/chosen": 0.11864880472421646,
      "eval_rewards/margins": 0.38588646054267883,
      "eval_rewards/rejected": -0.26723766326904297,
      "eval_runtime": 351.4888,
      "eval_samples_per_second": 5.69,
      "eval_steps_per_second": 0.356,
      "step": 100
    },
    {
      "epoch": 0.11515310128238682,
      "grad_norm": 2.1875,
      "learning_rate": 4.996723692767927e-06,
      "logits/chosen": -0.6064401865005493,
      "logits/rejected": -0.6292127966880798,
      "logps/chosen": -290.3874206542969,
      "logps/ref_response": -0.5667906999588013,
      "logps/rejected": -279.1947937011719,
      "loss": 0.5769,
      "rewards/accuracies": 0.75,
      "rewards/chosen": 0.12466597557067871,
      "rewards/margins": 0.48810654878616333,
      "rewards/rejected": -0.3634406626224518,
      "step": 110
    },
    {
      "epoch": 0.12562156503533106,
      "grad_norm": 2.328125,
      "learning_rate": 4.9903757462135984e-06,
      "logits/chosen": -0.5538973808288574,
      "logits/rejected": -0.563797652721405,
      "logps/chosen": -263.09698486328125,
      "logps/ref_response": -0.5169209837913513,
      "logps/rejected": -255.43795776367188,
      "loss": 0.5649,
      "rewards/accuracies": 0.6812499761581421,
      "rewards/chosen": -0.05361238867044449,
      "rewards/margins": 0.42452582716941833,
      "rewards/rejected": -0.47813814878463745,
      "step": 120
    },
    {
      "epoch": 0.1360900287882753,
      "grad_norm": 2.390625,
      "learning_rate": 4.980697142834315e-06,
      "logits/chosen": -0.5228760838508606,
      "logits/rejected": -0.539161205291748,
      "logps/chosen": -303.0,
      "logps/ref_response": -0.4790240228176117,
      "logps/rejected": -340.1643981933594,
      "loss": 0.5679,
      "rewards/accuracies": 0.699999988079071,
      "rewards/chosen": -0.004994773771613836,
      "rewards/margins": 0.45376062393188477,
      "rewards/rejected": -0.45875534415245056,
      "step": 130
    },
    {
      "epoch": 0.14655849254121958,
      "grad_norm": 2.09375,
      "learning_rate": 4.967700826904229e-06,
      "logits/chosen": -0.5957541465759277,
      "logits/rejected": -0.6013976335525513,
      "logps/chosen": -283.49285888671875,
      "logps/ref_response": -0.5482783913612366,
      "logps/rejected": -278.0264587402344,
      "loss": 0.5477,
      "rewards/accuracies": 0.699999988079071,
      "rewards/chosen": -0.040642134845256805,
      "rewards/margins": 0.6047846674919128,
      "rewards/rejected": -0.6454268097877502,
      "step": 140
    },
    {
      "epoch": 0.15702695629416383,
      "grad_norm": 2.484375,
      "learning_rate": 4.951404179843963e-06,
      "logits/chosen": -0.5976008176803589,
      "logits/rejected": -0.5590274930000305,
      "logps/chosen": -308.1587829589844,
      "logps/ref_response": -0.5423828363418579,
      "logps/rejected": -281.4222412109375,
      "loss": 0.5468,
      "rewards/accuracies": 0.71875,
      "rewards/chosen": 0.2197073996067047,
      "rewards/margins": 0.6635927557945251,
      "rewards/rejected": -0.4438853859901428,
      "step": 150
    },
    {
      "epoch": 0.16749542004710807,
      "grad_norm": 2.109375,
      "learning_rate": 4.931828996974498e-06,
      "logits/chosen": -0.5375654697418213,
      "logits/rejected": -0.5237765908241272,
      "logps/chosen": -296.62542724609375,
      "logps/ref_response": -0.4895528256893158,
      "logps/rejected": -272.4287414550781,
      "loss": 0.5462,
      "rewards/accuracies": 0.762499988079071,
      "rewards/chosen": 0.23916181921958923,
      "rewards/margins": 0.6912265419960022,
      "rewards/rejected": -0.4520646631717682,
      "step": 160
    },
    {
      "epoch": 0.17796388380005235,
      "grad_norm": 2.46875,
      "learning_rate": 4.909001458367867e-06,
      "logits/chosen": -0.6169471740722656,
      "logits/rejected": -0.5980030298233032,
      "logps/chosen": -289.12432861328125,
      "logps/ref_response": -0.5753272771835327,
      "logps/rejected": -278.28826904296875,
      "loss": 0.551,
      "rewards/accuracies": 0.737500011920929,
      "rewards/chosen": -0.07445995509624481,
      "rewards/margins": 0.5914220213890076,
      "rewards/rejected": -0.6658819913864136,
      "step": 170
    },
    {
      "epoch": 0.1884323475529966,
      "grad_norm": 2.109375,
      "learning_rate": 4.882952093833628e-06,
      "logits/chosen": -0.6207358241081238,
      "logits/rejected": -0.590207040309906,
      "logps/chosen": -304.12164306640625,
      "logps/ref_response": -0.5761692523956299,
      "logps/rejected": -268.29193115234375,
      "loss": 0.5446,
      "rewards/accuracies": 0.7250000238418579,
      "rewards/chosen": -0.1806192249059677,
      "rewards/margins": 0.5938898324966431,
      "rewards/rejected": -0.774509072303772,
      "step": 180
    },
    {
      "epoch": 0.19890081130594087,
      "grad_norm": 2.96875,
      "learning_rate": 4.853715742087947e-06,
      "logits/chosen": -0.5569428205490112,
      "logits/rejected": -0.5339682102203369,
      "logps/chosen": -276.7432556152344,
      "logps/ref_response": -0.5028859972953796,
      "logps/rejected": -284.9825134277344,
      "loss": 0.5467,
      "rewards/accuracies": 0.699999988079071,
      "rewards/chosen": 0.025777459144592285,
      "rewards/margins": 0.5920640230178833,
      "rewards/rejected": -0.5662865042686462,
      "step": 190
    },
    {
      "epoch": 0.2093692750588851,
      "grad_norm": 3.78125,
      "learning_rate": 4.821331504159906e-06,
      "logits/chosen": -0.561359167098999,
      "logits/rejected": -0.5725305676460266,
      "logps/chosen": -297.00811767578125,
      "logps/ref_response": -0.5163358449935913,
      "logps/rejected": -256.9476013183594,
      "loss": 0.5438,
      "rewards/accuracies": 0.7250000238418579,
      "rewards/chosen": 0.04708755761384964,
      "rewards/margins": 0.6549776196479797,
      "rewards/rejected": -0.6078900098800659,
      "step": 200
    },
    {
      "epoch": 0.2093692750588851,
      "eval_logits/chosen": -0.4850742220878601,
      "eval_logits/rejected": -0.4631035327911377,
      "eval_logps/chosen": -292.0068664550781,
      "eval_logps/ref_response": -0.5363935232162476,
      "eval_logps/rejected": -282.95562744140625,
      "eval_loss": 0.545185923576355,
      "eval_rewards/accuracies": 0.7179999947547913,
      "eval_rewards/chosen": 0.05402619019150734,
      "eval_rewards/margins": 0.6819319725036621,
      "eval_rewards/rejected": -0.6279057860374451,
      "eval_runtime": 349.4927,
      "eval_samples_per_second": 5.723,
      "eval_steps_per_second": 0.358,
      "step": 200
    },
    {
      "epoch": 0.21983773881182936,
      "grad_norm": 3.5625,
      "learning_rate": 4.7858426910973435e-06,
      "logits/chosen": -0.6022263169288635,
      "logits/rejected": -0.6005181074142456,
      "logps/chosen": -279.6543273925781,
      "logps/ref_response": -0.5563252568244934,
      "logps/rejected": -274.255126953125,
      "loss": 0.5428,
      "rewards/accuracies": 0.7250000238418579,
      "rewards/chosen": 0.05802757292985916,
      "rewards/margins": 0.6256381869316101,
      "rewards/rejected": -0.5676106214523315,
      "step": 210
    },
    {
      "epoch": 0.23030620256477363,
      "grad_norm": 2.265625,
      "learning_rate": 4.747296766042161e-06,
      "logits/chosen": -0.5727165937423706,
      "logits/rejected": -0.5475823879241943,
      "logps/chosen": -319.9413146972656,
      "logps/ref_response": -0.525614857673645,
      "logps/rejected": -273.38226318359375,
      "loss": 0.5473,
      "rewards/accuracies": 0.731249988079071,
      "rewards/chosen": 0.04127366468310356,
      "rewards/margins": 0.6884390711784363,
      "rewards/rejected": -0.6471654176712036,
      "step": 220
    },
    {
      "epoch": 0.24077466631771788,
      "grad_norm": 2.765625,
      "learning_rate": 4.705745280752586e-06,
      "logits/chosen": -0.6053592562675476,
      "logits/rejected": -0.5618892312049866,
      "logps/chosen": -293.00775146484375,
      "logps/ref_response": -0.5675605535507202,
      "logps/rejected": -291.1396179199219,
      "loss": 0.5482,
      "rewards/accuracies": 0.75,
      "rewards/chosen": -0.07024725526571274,
      "rewards/margins": 0.6758569478988647,
      "rewards/rejected": -0.7461041212081909,
      "step": 230
    },
    {
      "epoch": 0.2512431300706621,
      "grad_norm": 1.9453125,
      "learning_rate": 4.661243806657256e-06,
      "logits/chosen": -0.5776160955429077,
      "logits/rejected": -0.5337271690368652,
      "logps/chosen": -300.6571044921875,
      "logps/ref_response": -0.5330287218093872,
      "logps/rejected": -265.2437744140625,
      "loss": 0.544,
      "rewards/accuracies": 0.6812499761581421,
      "rewards/chosen": -0.01873040571808815,
      "rewards/margins": 0.5816887617111206,
      "rewards/rejected": -0.6004191637039185,
      "step": 240
    },
    {
      "epoch": 0.26171159382360637,
      "grad_norm": 2.25,
      "learning_rate": 4.613851860533367e-06,
      "logits/chosen": -0.5833350419998169,
      "logits/rejected": -0.5385982394218445,
      "logps/chosen": -294.57745361328125,
      "logps/ref_response": -0.5492520928382874,
      "logps/rejected": -261.46356201171875,
      "loss": 0.5475,
      "rewards/accuracies": 0.706250011920929,
      "rewards/chosen": 0.042012982070446014,
      "rewards/margins": 0.5611510276794434,
      "rewards/rejected": -0.5191380381584167,
      "step": 250
    },
    {
      "epoch": 0.2721800575765506,
      "grad_norm": 2.140625,
      "learning_rate": 4.563632824908252e-06,
      "logits/chosen": -0.5507840514183044,
      "logits/rejected": -0.5077590346336365,
      "logps/chosen": -294.02032470703125,
      "logps/ref_response": -0.5089389085769653,
      "logps/rejected": -281.4302978515625,
      "loss": 0.5203,
      "rewards/accuracies": 0.7749999761581421,
      "rewards/chosen": 0.23180679976940155,
      "rewards/margins": 0.9116535186767578,
      "rewards/rejected": -0.6798466444015503,
      "step": 260
    },
    {
      "epoch": 0.2826485213294949,
      "grad_norm": 3.15625,
      "learning_rate": 4.510653863290871e-06,
      "logits/chosen": -0.5404945611953735,
      "logits/rejected": -0.5136505961418152,
      "logps/chosen": -297.4787292480469,
      "logps/ref_response": -0.5091123580932617,
      "logps/rejected": -306.3871765136719,
      "loss": 0.5336,
      "rewards/accuracies": 0.6937500238418579,
      "rewards/chosen": -0.061466194689273834,
      "rewards/margins": 0.6878874897956848,
      "rewards/rejected": -0.7493537068367004,
      "step": 270
    },
    {
      "epoch": 0.29311698508243916,
      "grad_norm": 2.078125,
      "learning_rate": 4.454985830346574e-06,
      "logits/chosen": -0.6031721830368042,
      "logits/rejected": -0.5645710229873657,
      "logps/chosen": -303.2027282714844,
      "logps/ref_response": -0.5748014450073242,
      "logps/rejected": -287.49359130859375,
      "loss": 0.5502,
      "rewards/accuracies": 0.643750011920929,
      "rewards/chosen": -0.08863335102796555,
      "rewards/margins": 0.5962889194488525,
      "rewards/rejected": -0.6849222183227539,
      "step": 280
    },
    {
      "epoch": 0.3035854488353834,
      "grad_norm": 2.34375,
      "learning_rate": 4.396703177135262e-06,
      "logits/chosen": -0.5598694086074829,
      "logits/rejected": -0.5239233374595642,
      "logps/chosen": -287.7163391113281,
      "logps/ref_response": -0.5320878624916077,
      "logps/rejected": -259.9132385253906,
      "loss": 0.5259,
      "rewards/accuracies": 0.75,
      "rewards/chosen": 0.1362595558166504,
      "rewards/margins": 0.7511795163154602,
      "rewards/rejected": -0.6149200201034546,
      "step": 290
    },
    {
      "epoch": 0.31405391258832765,
      "grad_norm": 2.78125,
      "learning_rate": 4.335883851539693e-06,
      "logits/chosen": -0.5759841799736023,
      "logits/rejected": -0.5336117148399353,
      "logps/chosen": -296.90283203125,
      "logps/ref_response": -0.5529105067253113,
      "logps/rejected": -294.5677185058594,
      "loss": 0.5367,
      "rewards/accuracies": 0.706250011920929,
      "rewards/chosen": -0.07880517095327377,
      "rewards/margins": 0.7312062978744507,
      "rewards/rejected": -0.8100113868713379,
      "step": 300
    },
    {
      "epoch": 0.31405391258832765,
      "eval_logits/chosen": -0.4076842665672302,
      "eval_logits/rejected": -0.37772703170776367,
      "eval_logps/chosen": -293.4178161621094,
      "eval_logps/ref_response": -0.5363935232162476,
      "eval_logps/rejected": -285.21820068359375,
      "eval_loss": 0.5322972536087036,
      "eval_rewards/accuracies": 0.7239999771118164,
      "eval_rewards/chosen": -0.08706536889076233,
      "eval_rewards/margins": 0.7670957446098328,
      "eval_rewards/rejected": -0.8541611433029175,
      "eval_runtime": 349.5592,
      "eval_samples_per_second": 5.721,
      "eval_steps_per_second": 0.358,
      "step": 300
    },
    {
      "epoch": 0.3245223763412719,
      "grad_norm": 2.71875,
      "learning_rate": 4.2726091940171055e-06,
      "logits/chosen": -0.5224083065986633,
      "logits/rejected": -0.5524710416793823,
      "logps/chosen": -295.6260681152344,
      "logps/ref_response": -0.5006662607192993,
      "logps/rejected": -342.54486083984375,
      "loss": 0.517,
      "rewards/accuracies": 0.800000011920929,
      "rewards/chosen": 0.12224096059799194,
      "rewards/margins": 0.872395396232605,
      "rewards/rejected": -0.7501543760299683,
      "step": 310
    },
    {
      "epoch": 0.33499084009421615,
      "grad_norm": 1.796875,
      "learning_rate": 4.206963828813555e-06,
      "logits/chosen": -0.5745652914047241,
      "logits/rejected": -0.5253760814666748,
      "logps/chosen": -296.783447265625,
      "logps/ref_response": -0.5563712120056152,
      "logps/rejected": -281.023681640625,
      "loss": 0.5186,
      "rewards/accuracies": 0.7437499761581421,
      "rewards/chosen": 0.0009771495824679732,
      "rewards/margins": 0.8379068374633789,
      "rewards/rejected": -0.8369296193122864,
      "step": 320
    },
    {
      "epoch": 0.34545930384716045,
      "grad_norm": 2.03125,
      "learning_rate": 4.139035550786495e-06,
      "logits/chosen": -0.6098914742469788,
      "logits/rejected": -0.5377870798110962,
      "logps/chosen": -290.60955810546875,
      "logps/ref_response": -0.5800708532333374,
      "logps/rejected": -262.45794677734375,
      "loss": 0.5298,
      "rewards/accuracies": 0.6937500238418579,
      "rewards/chosen": -0.1382167935371399,
      "rewards/margins": 0.6818909645080566,
      "rewards/rejected": -0.820107638835907,
      "step": 330
    },
    {
      "epoch": 0.3559277676001047,
      "grad_norm": 2.40625,
      "learning_rate": 4.068915207986931e-06,
      "logits/chosen": -0.5661717653274536,
      "logits/rejected": -0.49055665731430054,
      "logps/chosen": -298.7618713378906,
      "logps/ref_response": -0.5407181978225708,
      "logps/rejected": -259.991943359375,
      "loss": 0.5262,
      "rewards/accuracies": 0.75,
      "rewards/chosen": -0.3091699182987213,
      "rewards/margins": 0.8509536981582642,
      "rewards/rejected": -1.1601234674453735,
      "step": 340
    },
    {
      "epoch": 0.36639623135304894,
      "grad_norm": 2.25,
      "learning_rate": 3.996696580158211e-06,
      "logits/chosen": -0.5107800364494324,
      "logits/rejected": -0.477796733379364,
      "logps/chosen": -338.06329345703125,
      "logps/ref_response": -0.486247718334198,
      "logps/rejected": -293.3652648925781,
      "loss": 0.5389,
      "rewards/accuracies": 0.71875,
      "rewards/chosen": -0.1124631017446518,
      "rewards/margins": 0.7731461524963379,
      "rewards/rejected": -0.8856091499328613,
      "step": 350
    },
    {
      "epoch": 0.3768646951059932,
      "grad_norm": 2.0,
      "learning_rate": 3.922476253313921e-06,
      "logits/chosen": -0.48775357007980347,
      "logits/rejected": -0.4903165400028229,
      "logps/chosen": -275.6280822753906,
      "logps/ref_response": -0.48780474066734314,
      "logps/rejected": -299.45269775390625,
      "loss": 0.523,
      "rewards/accuracies": 0.6499999761581421,
      "rewards/chosen": -0.24213580787181854,
      "rewards/margins": 0.7627736330032349,
      "rewards/rejected": -1.0049093961715698,
      "step": 360
    },
    {
      "epoch": 0.38733315885893743,
      "grad_norm": 3.171875,
      "learning_rate": 3.846353490562664e-06,
      "logits/chosen": -0.5095352530479431,
      "logits/rejected": -0.5028492212295532,
      "logps/chosen": -289.87359619140625,
      "logps/ref_response": -0.491685152053833,
      "logps/rejected": -264.93988037109375,
      "loss": 0.508,
      "rewards/accuracies": 0.762499988079071,
      "rewards/chosen": 0.03148346394300461,
      "rewards/margins": 0.9142478704452515,
      "rewards/rejected": -0.8827645182609558,
      "step": 370
    },
    {
      "epoch": 0.39780162261188173,
      "grad_norm": 2.171875,
      "learning_rate": 3.768430099352445e-06,
      "logits/chosen": -0.528128981590271,
      "logits/rejected": -0.5178142786026001,
      "logps/chosen": -307.6114501953125,
      "logps/ref_response": -0.5215914845466614,
      "logps/rejected": -281.41888427734375,
      "loss": 0.5227,
      "rewards/accuracies": 0.768750011920929,
      "rewards/chosen": -0.09869714826345444,
      "rewards/margins": 1.0406019687652588,
      "rewards/rejected": -1.1392991542816162,
      "step": 380
    },
    {
      "epoch": 0.408270086364826,
      "grad_norm": 3.234375,
      "learning_rate": 3.6888102953122307e-06,
      "logits/chosen": -0.5722348093986511,
      "logits/rejected": -0.525412380695343,
      "logps/chosen": -266.93212890625,
      "logps/ref_response": -0.5661150813102722,
      "logps/rejected": -268.66094970703125,
      "loss": 0.5332,
      "rewards/accuracies": 0.768750011920929,
      "rewards/chosen": -0.3913359045982361,
      "rewards/margins": 0.8005081415176392,
      "rewards/rejected": -1.1918439865112305,
      "step": 390
    },
    {
      "epoch": 0.4187385501177702,
      "grad_norm": 2.828125,
      "learning_rate": 3.607600562872785e-06,
      "logits/chosen": -0.5220564603805542,
      "logits/rejected": -0.4830014705657959,
      "logps/chosen": -286.83056640625,
      "logps/ref_response": -0.5258094072341919,
      "logps/rejected": -278.11456298828125,
      "loss": 0.5196,
      "rewards/accuracies": 0.6812499761581421,
      "rewards/chosen": -0.10876519978046417,
      "rewards/margins": 0.8259785771369934,
      "rewards/rejected": -0.9347437024116516,
      "step": 400
    },
    {
      "epoch": 0.4187385501177702,
      "eval_logits/chosen": -0.3280640244483948,
      "eval_logits/rejected": -0.2898561656475067,
      "eval_logps/chosen": -292.925537109375,
      "eval_logps/ref_response": -0.5363935232162476,
+ "eval_logps/rejected": -285.290283203125,
725
+ "eval_loss": 0.5235576629638672,
726
+ "eval_rewards/accuracies": 0.7319999933242798,
727
+ "eval_rewards/chosen": -0.03783903643488884,
728
+ "eval_rewards/margins": 0.823529839515686,
729
+ "eval_rewards/rejected": -0.8613688349723816,
730
+ "eval_runtime": 349.5237,
731
+ "eval_samples_per_second": 5.722,
732
+ "eval_steps_per_second": 0.358,
733
+ "step": 400
734
+ },
735
+ {
736
+ "epoch": 0.42920701387071447,
737
+ "grad_norm": 2.53125,
738
+ "learning_rate": 3.5249095128531863e-06,
739
+ "logits/chosen": -0.54010409116745,
740
+ "logits/rejected": -0.485682874917984,
741
+ "logps/chosen": -278.5982360839844,
742
+ "logps/ref_response": -0.5564926862716675,
743
+ "logps/rejected": -277.0348205566406,
744
+ "loss": 0.5119,
745
+ "rewards/accuracies": 0.768750011920929,
746
+ "rewards/chosen": -0.04259440302848816,
747
+ "rewards/margins": 0.8908407092094421,
748
+ "rewards/rejected": -0.9334350824356079,
749
+ "step": 410
750
+ },
751
+ {
752
+ "epoch": 0.4396754776236587,
753
+ "grad_norm": 2.25,
754
+ "learning_rate": 3.4408477372034743e-06,
755
+ "logits/chosen": -0.5260821580886841,
756
+ "logits/rejected": -0.4941217303276062,
757
+ "logps/chosen": -310.8683776855469,
758
+ "logps/ref_response": -0.5361344218254089,
759
+ "logps/rejected": -299.2928161621094,
760
+ "loss": 0.5345,
761
+ "rewards/accuracies": 0.762499988079071,
762
+ "rewards/chosen": -0.19507813453674316,
763
+ "rewards/margins": 0.7008574604988098,
764
+ "rewards/rejected": -0.8959355354309082,
765
+ "step": 420
766
+ },
767
+ {
768
+ "epoch": 0.45014394137660296,
769
+ "grad_norm": 3.484375,
770
+ "learning_rate": 3.355527661097728e-06,
771
+ "logits/chosen": -0.5269330739974976,
772
+ "logits/rejected": -0.5198745727539062,
773
+ "logps/chosen": -282.8704833984375,
774
+ "logps/ref_response": -0.5477866530418396,
775
+ "logps/rejected": -284.66448974609375,
776
+ "loss": 0.5363,
777
+ "rewards/accuracies": 0.75,
778
+ "rewards/chosen": -0.3844759464263916,
779
+ "rewards/margins": 0.7087138295173645,
780
+ "rewards/rejected": -1.0931897163391113,
781
+ "step": 430
782
+ },
783
+ {
784
+ "epoch": 0.46061240512954726,
785
+ "grad_norm": 2.265625,
786
+ "learning_rate": 3.269063392575352e-06,
787
+ "logits/chosen": -0.4928794503211975,
788
+ "logits/rejected": -0.4792874753475189,
789
+ "logps/chosen": -329.7843017578125,
790
+ "logps/ref_response": -0.5050511360168457,
791
+ "logps/rejected": -308.486572265625,
792
+ "loss": 0.5184,
793
+ "rewards/accuracies": 0.7250000238418579,
794
+ "rewards/chosen": -0.10263900458812714,
795
+ "rewards/margins": 0.7831138372421265,
796
+ "rewards/rejected": -0.8857528567314148,
797
+ "step": 440
798
+ },
799
+ {
800
+ "epoch": 0.4710808688824915,
801
+ "grad_norm": 1.9453125,
802
+ "learning_rate": 3.181570569931697e-06,
803
+ "logits/chosen": -0.5173367857933044,
804
+ "logits/rejected": -0.49148645997047424,
805
+ "logps/chosen": -287.93524169921875,
806
+ "logps/ref_response": -0.5224987864494324,
807
+ "logps/rejected": -284.7456970214844,
808
+ "loss": 0.5167,
809
+ "rewards/accuracies": 0.7875000238418579,
810
+ "rewards/chosen": -0.2799225449562073,
811
+ "rewards/margins": 0.795501708984375,
812
+ "rewards/rejected": -1.0754241943359375,
813
+ "step": 450
814
+ },
815
+ {
816
+ "epoch": 0.48154933263543576,
817
+ "grad_norm": 1.7109375,
818
+ "learning_rate": 3.09316620706208e-06,
819
+ "logits/chosen": -0.4698413014411926,
820
+ "logits/rejected": -0.4720715582370758,
821
+ "logps/chosen": -310.1413879394531,
822
+ "logps/ref_response": -0.4874509274959564,
823
+ "logps/rejected": -292.0880126953125,
824
+ "loss": 0.5003,
825
+ "rewards/accuracies": 0.793749988079071,
826
+ "rewards/chosen": -0.2300359308719635,
827
+ "rewards/margins": 1.019820213317871,
828
+ "rewards/rejected": -1.2498562335968018,
829
+ "step": 460
830
+ },
831
+ {
832
+ "epoch": 0.49201779638838,
833
+ "grad_norm": 2.28125,
834
+ "learning_rate": 3.0039685369660785e-06,
835
+ "logits/chosen": -0.4714682102203369,
836
+ "logits/rejected": -0.4186578392982483,
837
+ "logps/chosen": -284.99407958984375,
838
+ "logps/ref_response": -0.4861488938331604,
839
+ "logps/rejected": -270.0592041015625,
840
+ "loss": 0.5301,
841
+ "rewards/accuracies": 0.71875,
842
+ "rewards/chosen": -0.1522492617368698,
843
+ "rewards/margins": 0.8750549554824829,
844
+ "rewards/rejected": -1.0273042917251587,
845
+ "step": 470
846
+ },
847
+ {
848
+ "epoch": 0.5024862601413242,
849
+ "grad_norm": 2.15625,
850
+ "learning_rate": 2.91409685362137e-06,
851
+ "logits/chosen": -0.4800891876220703,
852
+ "logits/rejected": -0.4620879590511322,
853
+ "logps/chosen": -282.891845703125,
854
+ "logps/ref_response": -0.5061747431755066,
855
+ "logps/rejected": -280.4508972167969,
856
+ "loss": 0.506,
857
+ "rewards/accuracies": 0.731249988079071,
858
+ "rewards/chosen": -0.3358300030231476,
859
+ "rewards/margins": 0.7954305410385132,
860
+ "rewards/rejected": -1.1312605142593384,
861
+ "step": 480
862
+ },
863
+ {
864
+ "epoch": 0.5129547238942685,
865
+ "grad_norm": 1.4921875,
866
+ "learning_rate": 2.8236713524386085e-06,
867
+ "logits/chosen": -0.5464354753494263,
868
+ "logits/rejected": -0.4956323504447937,
869
+ "logps/chosen": -283.87847900390625,
870
+ "logps/ref_response": -0.5583964586257935,
871
+ "logps/rejected": -262.64935302734375,
872
+ "loss": 0.5042,
873
+ "rewards/accuracies": 0.78125,
874
+ "rewards/chosen": -0.2944543957710266,
875
+ "rewards/margins": 0.9288061857223511,
876
+ "rewards/rejected": -1.223260760307312,
877
+ "step": 490
878
+ },
879
+ {
880
+ "epoch": 0.5234231876472127,
881
+ "grad_norm": 2.28125,
882
+ "learning_rate": 2.7328129695107205e-06,
883
+ "logits/chosen": -0.45227164030075073,
884
+ "logits/rejected": -0.4433063864707947,
885
+ "logps/chosen": -268.2086181640625,
886
+ "logps/ref_response": -0.46682921051979065,
887
+ "logps/rejected": -278.5611572265625,
888
+ "loss": 0.509,
889
+ "rewards/accuracies": 0.8125,
890
+ "rewards/chosen": -0.18779410421848297,
891
+ "rewards/margins": 1.08889639377594,
892
+ "rewards/rejected": -1.2766902446746826,
893
+ "step": 500
894
+ },
895
+ {
896
+ "epoch": 0.5234231876472127,
897
+ "eval_logits/chosen": -0.2739206552505493,
898
+ "eval_logits/rejected": -0.22961243987083435,
899
+ "eval_logps/chosen": -295.23968505859375,
900
+ "eval_logps/ref_response": -0.536393404006958,
901
+ "eval_logps/rejected": -287.97900390625,
902
+ "eval_loss": 0.5184563994407654,
903
+ "eval_rewards/accuracies": 0.7360000014305115,
904
+ "eval_rewards/chosen": -0.2692505419254303,
905
+ "eval_rewards/margins": 0.8609901070594788,
906
+ "eval_rewards/rejected": -1.1302406787872314,
907
+ "eval_runtime": 349.5683,
908
+ "eval_samples_per_second": 5.721,
909
+ "eval_steps_per_second": 0.358,
910
+ "step": 500
911
+ },
912
+ {
913
+ "epoch": 0.533891651400157,
914
+ "grad_norm": 1.8515625,
915
+ "learning_rate": 2.641643219871597e-06,
916
+ "logits/chosen": -0.4826118052005768,
917
+ "logits/rejected": -0.4357355237007141,
918
+ "logps/chosen": -315.3731689453125,
919
+ "logps/ref_response": -0.5090646743774414,
920
+ "logps/rejected": -300.0047607421875,
921
+ "loss": 0.5249,
922
+ "rewards/accuracies": 0.7562500238418579,
923
+ "rewards/chosen": -0.24246744811534882,
924
+ "rewards/margins": 0.7875919342041016,
925
+ "rewards/rejected": -1.0300593376159668,
926
+ "step": 510
927
+ },
928
+ {
929
+ "epoch": 0.5443601151531012,
930
+ "grad_norm": 2.65625,
931
+ "learning_rate": 2.5502840349805074e-06,
932
+ "logits/chosen": -0.4724315106868744,
933
+ "logits/rejected": -0.4490731656551361,
934
+ "logps/chosen": -312.1684265136719,
935
+ "logps/ref_response": -0.5057616829872131,
936
+ "logps/rejected": -300.3685302734375,
937
+ "loss": 0.531,
938
+ "rewards/accuracies": 0.7437499761581421,
939
+ "rewards/chosen": -0.21491758525371552,
940
+ "rewards/margins": 0.9349055290222168,
941
+ "rewards/rejected": -1.1498230695724487,
942
+ "step": 520
943
+ },
944
+ {
945
+ "epoch": 0.5548285789060455,
946
+ "grad_norm": 2.6875,
947
+ "learning_rate": 2.4588575996495797e-06,
948
+ "logits/chosen": -0.43103843927383423,
949
+ "logits/rejected": -0.4253179430961609,
950
+ "logps/chosen": -275.5965881347656,
951
+ "logps/ref_response": -0.45075368881225586,
952
+ "logps/rejected": -266.48785400390625,
953
+ "loss": 0.5277,
954
+ "rewards/accuracies": 0.762499988079071,
955
+ "rewards/chosen": -0.5141997337341309,
956
+ "rewards/margins": 0.9270822405815125,
957
+ "rewards/rejected": -1.441282033920288,
958
+ "step": 530
959
+ },
960
+ {
961
+ "epoch": 0.5652970426589898,
962
+ "grad_norm": 2.984375,
963
+ "learning_rate": 2.367486188632446e-06,
964
+ "logits/chosen": -0.47084522247314453,
965
+ "logits/rejected": -0.46598607301712036,
966
+ "logps/chosen": -288.1656188964844,
967
+ "logps/ref_response": -0.5035119652748108,
968
+ "logps/rejected": -328.1705322265625,
969
+ "loss": 0.5151,
970
+ "rewards/accuracies": 0.731249988079071,
971
+ "rewards/chosen": -0.2480819970369339,
972
+ "rewards/margins": 0.912807822227478,
973
+ "rewards/rejected": -1.1608898639678955,
974
+ "step": 540
975
+ },
976
+ {
977
+ "epoch": 0.575765506411934,
978
+ "grad_norm": 2.234375,
979
+ "learning_rate": 2.276292003092593e-06,
980
+ "logits/chosen": -0.4974418580532074,
981
+ "logits/rejected": -0.46066489815711975,
982
+ "logps/chosen": -259.3627624511719,
983
+ "logps/ref_response": -0.5067554712295532,
984
+ "logps/rejected": -267.8301086425781,
985
+ "loss": 0.5027,
986
+ "rewards/accuracies": 0.7749999761581421,
987
+ "rewards/chosen": -0.1483170986175537,
988
+ "rewards/margins": 0.9989498257637024,
989
+ "rewards/rejected": -1.1472669839859009,
990
+ "step": 550
991
+ },
992
+ {
993
+ "epoch": 0.5862339701648783,
994
+ "grad_norm": 2.0625,
995
+ "learning_rate": 2.1853970071701415e-06,
996
+ "logits/chosen": -0.4845849871635437,
997
+ "logits/rejected": -0.4430512487888336,
998
+ "logps/chosen": -281.133544921875,
999
+ "logps/ref_response": -0.5059608817100525,
1000
+ "logps/rejected": -283.225830078125,
1001
+ "loss": 0.5108,
1002
+ "rewards/accuracies": 0.824999988079071,
1003
+ "rewards/chosen": -0.17830374836921692,
1004
+ "rewards/margins": 0.9193024635314941,
1005
+ "rewards/rejected": -1.0976061820983887,
1006
+ "step": 560
1007
+ },
1008
+ {
1009
+ "epoch": 0.5967024339178225,
1010
+ "grad_norm": 2.671875,
1011
+ "learning_rate": 2.0949227648656194e-06,
1012
+ "logits/chosen": -0.5085529088973999,
1013
+ "logits/rejected": -0.45733585953712463,
1014
+ "logps/chosen": -298.2741394042969,
1015
+ "logps/ref_response": -0.5283219218254089,
1016
+ "logps/rejected": -266.26641845703125,
1017
+ "loss": 0.5149,
1018
+ "rewards/accuracies": 0.7437499761581421,
1019
+ "rewards/chosen": -0.31216269731521606,
1020
+ "rewards/margins": 0.9785518646240234,
1021
+ "rewards/rejected": -1.2907145023345947,
1022
+ "step": 570
1023
+ },
1024
+ {
1025
+ "epoch": 0.6071708976707668,
1026
+ "grad_norm": 1.765625,
1027
+ "learning_rate": 2.00499027745888e-06,
1028
+ "logits/chosen": -0.4733489453792572,
1029
+ "logits/rejected": -0.4443192481994629,
1030
+ "logps/chosen": -302.96319580078125,
1031
+ "logps/ref_response": -0.5130476355552673,
1032
+ "logps/rejected": -302.777099609375,
1033
+ "loss": 0.524,
1034
+ "rewards/accuracies": 0.7250000238418579,
1035
+ "rewards/chosen": -0.31352347135543823,
1036
+ "rewards/margins": 0.8776998519897461,
1037
+ "rewards/rejected": -1.191223382949829,
1038
+ "step": 580
1039
+ },
1040
+ {
1041
+ "epoch": 0.6176393614237111,
1042
+ "grad_norm": 1.953125,
1043
+ "learning_rate": 1.915719821680624e-06,
1044
+ "logits/chosen": -0.5032647848129272,
1045
+ "logits/rejected": -0.425483763217926,
1046
+ "logps/chosen": -291.1716613769531,
1047
+ "logps/ref_response": -0.5210384130477905,
1048
+ "logps/rejected": -287.4841003417969,
1049
+ "loss": 0.5085,
1050
+ "rewards/accuracies": 0.78125,
1051
+ "rewards/chosen": -0.12319432199001312,
1052
+ "rewards/margins": 0.930014967918396,
1053
+ "rewards/rejected": -1.0532093048095703,
1054
+ "step": 590
1055
+ },
1056
+ {
1057
+ "epoch": 0.6281078251766553,
1058
+ "grad_norm": 1.734375,
1059
+ "learning_rate": 1.8272307888529276e-06,
1060
+ "logits/chosen": -0.4218737483024597,
1061
+ "logits/rejected": -0.3671064078807831,
1062
+ "logps/chosen": -267.94598388671875,
1063
+ "logps/ref_response": -0.4653666913509369,
1064
+ "logps/rejected": -285.668212890625,
1065
+ "loss": 0.5012,
1066
+ "rewards/accuracies": 0.768750011920929,
1067
+ "rewards/chosen": -0.35736727714538574,
1068
+ "rewards/margins": 0.9875767827033997,
1069
+ "rewards/rejected": -1.3449440002441406,
1070
+ "step": 600
1071
+ },
1072
+ {
1073
+ "epoch": 0.6281078251766553,
1074
+ "eval_logits/chosen": -0.23965981602668762,
1075
+ "eval_logits/rejected": -0.19264473021030426,
1076
+ "eval_logps/chosen": -296.0675354003906,
1077
+ "eval_logps/ref_response": -0.5363935232162476,
1078
+ "eval_logps/rejected": -289.1474914550781,
1079
+ "eval_loss": 0.5152395963668823,
1080
+ "eval_rewards/accuracies": 0.7480000257492065,
1081
+ "eval_rewards/chosen": -0.3520371615886688,
1082
+ "eval_rewards/margins": 0.895053505897522,
1083
+ "eval_rewards/rejected": -1.2470906972885132,
1084
+ "eval_runtime": 349.5867,
1085
+ "eval_samples_per_second": 5.721,
1086
+ "eval_steps_per_second": 0.358,
1087
+ "step": 600
1088
+ },
1089
+ {
1090
+ "epoch": 0.6385762889295996,
1091
+ "grad_norm": 1.9140625,
1092
+ "learning_rate": 1.739641525213929e-06,
1093
+ "logits/chosen": -0.4473685324192047,
1094
+ "logits/rejected": -0.4271601736545563,
1095
+ "logps/chosen": -269.72998046875,
1096
+ "logps/ref_response": -0.500705897808075,
1097
+ "logps/rejected": -275.93499755859375,
1098
+ "loss": 0.5048,
1099
+ "rewards/accuracies": 0.75,
1100
+ "rewards/chosen": -0.3331184685230255,
1101
+ "rewards/margins": 0.9662030339241028,
1102
+ "rewards/rejected": -1.2993214130401611,
1103
+ "step": 610
1104
+ },
1105
+ {
1106
+ "epoch": 0.6490447526825438,
1107
+ "grad_norm": 2.09375,
1108
+ "learning_rate": 1.6530691736402317e-06,
1109
+ "logits/chosen": -0.4652767777442932,
1110
+ "logits/rejected": -0.41162675619125366,
1111
+ "logps/chosen": -296.17120361328125,
1112
+ "logps/ref_response": -0.502620279788971,
1113
+ "logps/rejected": -286.85284423828125,
1114
+ "loss": 0.5152,
1115
+ "rewards/accuracies": 0.800000011920929,
1116
+ "rewards/chosen": -0.32981282472610474,
1117
+ "rewards/margins": 1.0166393518447876,
1118
+ "rewards/rejected": -1.3464521169662476,
1119
+ "step": 620
1120
+ },
1121
+ {
1122
+ "epoch": 0.6595132164354881,
1123
+ "grad_norm": 2.15625,
1124
+ "learning_rate": 1.5676295169786864e-06,
1125
+ "logits/chosen": -0.48260921239852905,
1126
+ "logits/rejected": -0.43052974343299866,
1127
+ "logps/chosen": -288.948486328125,
1128
+ "logps/ref_response": -0.522256076335907,
1129
+ "logps/rejected": -275.0885009765625,
1130
+ "loss": 0.5087,
1131
+ "rewards/accuracies": 0.7124999761581421,
1132
+ "rewards/chosen": -0.2533746659755707,
1133
+ "rewards/margins": 0.9306432604789734,
1134
+ "rewards/rejected": -1.1840178966522217,
1135
+ "step": 630
1136
+ },
1137
+ {
1138
+ "epoch": 0.6699816801884323,
1139
+ "grad_norm": 1.7109375,
1140
+ "learning_rate": 1.4834368231970922e-06,
1141
+ "logits/chosen": -0.5107460618019104,
1142
+ "logits/rejected": -0.4472557604312897,
1143
+ "logps/chosen": -287.7373352050781,
1144
+ "logps/ref_response": -0.5478745698928833,
1145
+ "logps/rejected": -274.25653076171875,
1146
+ "loss": 0.4972,
1147
+ "rewards/accuracies": 0.7437499761581421,
1148
+ "rewards/chosen": -0.11735512316226959,
1149
+ "rewards/margins": 0.8366876840591431,
1150
+ "rewards/rejected": -0.9540427327156067,
1151
+ "step": 640
1152
+ },
1153
+ {
1154
+ "epoch": 0.6804501439413766,
1155
+ "grad_norm": 2.046875,
1156
+ "learning_rate": 1.4006036925609245e-06,
1157
+ "logits/chosen": -0.48332634568214417,
1158
+ "logits/rejected": -0.41887950897216797,
1159
+ "logps/chosen": -300.3069152832031,
1160
+ "logps/ref_response": -0.5103051662445068,
1161
+ "logps/rejected": -251.2999725341797,
1162
+ "loss": 0.5163,
1163
+ "rewards/accuracies": 0.768750011920929,
1164
+ "rewards/chosen": -0.16883328557014465,
1165
+ "rewards/margins": 0.9550157785415649,
1166
+ "rewards/rejected": -1.1238490343093872,
1167
+ "step": 650
1168
+ },
1169
+ {
1170
+ "epoch": 0.6909186076943209,
1171
+ "grad_norm": 2.125,
1172
+ "learning_rate": 1.3192409070404582e-06,
1173
+ "logits/chosen": -0.4920195937156677,
1174
+ "logits/rejected": -0.45738130807876587,
1175
+ "logps/chosen": -304.8662414550781,
1176
+ "logps/ref_response": -0.5286127328872681,
1177
+ "logps/rejected": -307.9287109375,
1178
+ "loss": 0.5109,
1179
+ "rewards/accuracies": 0.71875,
1180
+ "rewards/chosen": -0.044912584125995636,
1181
+ "rewards/margins": 0.9029264450073242,
1182
+ "rewards/rejected": -0.9478389620780945,
1183
+ "step": 660
1184
+ },
1185
+ {
1186
+ "epoch": 0.7013870714472651,
1187
+ "grad_norm": 2.640625,
1188
+ "learning_rate": 1.2394572821496953e-06,
1189
+ "logits/chosen": -0.4847448468208313,
1190
+ "logits/rejected": -0.44459033012390137,
1191
+ "logps/chosen": -278.811279296875,
1192
+ "logps/ref_response": -0.5491371154785156,
1193
+ "logps/rejected": -260.7911682128906,
1194
+ "loss": 0.5158,
1195
+ "rewards/accuracies": 0.71875,
1196
+ "rewards/chosen": -0.17895063757896423,
1197
+ "rewards/margins": 0.8761111497879028,
1198
+ "rewards/rejected": -1.0550616979599,
1199
+ "step": 670
1200
+ },
1201
+ {
1202
+ "epoch": 0.7118555352002094,
1203
+ "grad_norm": 1.703125,
1204
+ "learning_rate": 1.1613595214152713e-06,
1205
+ "logits/chosen": -0.5145822763442993,
1206
+ "logits/rejected": -0.45838356018066406,
1207
+ "logps/chosen": -288.93072509765625,
1208
+ "logps/ref_response": -0.5694643259048462,
1209
+ "logps/rejected": -278.52642822265625,
1210
+ "loss": 0.5018,
1211
+ "rewards/accuracies": 0.737500011920929,
1212
+ "rewards/chosen": -0.2978869080543518,
1213
+ "rewards/margins": 0.903253436088562,
1214
+ "rewards/rejected": -1.2011405229568481,
1215
+ "step": 680
1216
+ },
1217
+ {
1218
+ "epoch": 0.7223239989531536,
1219
+ "grad_norm": 1.7578125,
1220
+ "learning_rate": 1.0850520736699362e-06,
1221
+ "logits/chosen": -0.46343159675598145,
1222
+ "logits/rejected": -0.40929287672042847,
1223
+ "logps/chosen": -342.8109130859375,
1224
+ "logps/ref_response": -0.4945286810398102,
1225
+ "logps/rejected": -318.6613464355469,
1226
+ "loss": 0.516,
1227
+ "rewards/accuracies": 0.7562500238418579,
1228
+ "rewards/chosen": -0.27149656414985657,
1229
+ "rewards/margins": 0.9835799336433411,
1230
+ "rewards/rejected": -1.25507652759552,
1231
+ "step": 690
1232
+ },
1233
+ {
1234
+ "epoch": 0.7327924627060979,
1235
+ "grad_norm": 1.96875,
1236
+ "learning_rate": 1.0106369933615043e-06,
1237
+ "logits/chosen": -0.5036773681640625,
1238
+ "logits/rejected": -0.42831772565841675,
1239
+ "logps/chosen": -317.4853515625,
1240
+ "logps/ref_response": -0.5506774187088013,
1241
+ "logps/rejected": -265.49896240234375,
1242
+ "loss": 0.5168,
1243
+ "rewards/accuracies": 0.7437499761581421,
1244
+ "rewards/chosen": -0.30581605434417725,
1245
+ "rewards/margins": 0.7802382707595825,
1246
+ "rewards/rejected": -1.0860543251037598,
1247
+ "step": 700
1248
+ },
1249
+ {
1250
+ "epoch": 0.7327924627060979,
1251
+ "eval_logits/chosen": -0.2157570868730545,
1252
+ "eval_logits/rejected": -0.16649317741394043,
1253
+ "eval_logps/chosen": -295.0681457519531,
1254
+ "eval_logps/ref_response": -0.5363935232162476,
1255
+ "eval_logps/rejected": -288.23870849609375,
1256
+ "eval_loss": 0.5139358639717102,
1257
+ "eval_rewards/accuracies": 0.7440000176429749,
1258
+ "eval_rewards/chosen": -0.2521001994609833,
1259
+ "eval_rewards/margins": 0.9041155576705933,
1260
+ "eval_rewards/rejected": -1.156215786933899,
1261
+ "eval_runtime": 349.6297,
1262
+ "eval_samples_per_second": 5.72,
1263
+ "eval_steps_per_second": 0.358,
1264
+ "step": 700
1265
+ },
1266
+ {
1267
+ "epoch": 0.7432609264590422,
1268
+ "grad_norm": 2.265625,
1269
+ "learning_rate": 9.382138040640714e-07,
1270
+ "logits/chosen": -0.509280800819397,
1271
+ "logits/rejected": -0.45220834016799927,
1272
+ "logps/chosen": -266.75555419921875,
1273
+ "logps/ref_response": -0.5634459257125854,
1274
+ "logps/rejected": -281.2605285644531,
1275
+ "loss": 0.5042,
1276
+ "rewards/accuracies": 0.78125,
1277
+ "rewards/chosen": -0.24380390346050262,
1278
+ "rewards/margins": 0.9118436574935913,
1279
+ "rewards/rejected": -1.1556475162506104,
1280
+ "step": 710
1281
+ },
1282
+ {
1283
+ "epoch": 0.7537293902119864,
1284
+ "grad_norm": 1.890625,
1285
+ "learning_rate": 8.678793653740633e-07,
1286
+ "logits/chosen": -0.43401581048965454,
1287
+ "logits/rejected": -0.41436678171157837,
1288
+ "logps/chosen": -264.42425537109375,
1289
+ "logps/ref_response": -0.49243393540382385,
1290
+ "logps/rejected": -265.80517578125,
1291
+ "loss": 0.5067,
1292
+ "rewards/accuracies": 0.762499988079071,
1293
+ "rewards/chosen": -0.12135882675647736,
1294
+ "rewards/margins": 1.0111067295074463,
1295
+ "rewards/rejected": -1.1324656009674072,
1296
+ "step": 720
1297
+ },
1298
+ {
1299
+ "epoch": 0.7641978539649307,
1300
+ "grad_norm": 1.5078125,
1301
+ "learning_rate": 7.997277433690984e-07,
1302
+ "logits/chosen": -0.4582897126674652,
1303
+ "logits/rejected": -0.38019412755966187,
1304
+ "logps/chosen": -303.3408203125,
1305
+ "logps/ref_response": -0.4944031834602356,
1306
+ "logps/rejected": -290.0057067871094,
1307
+ "loss": 0.5008,
1308
+ "rewards/accuracies": 0.7749999761581421,
1309
+ "rewards/chosen": -0.24817109107971191,
1310
+ "rewards/margins": 0.8901812434196472,
1311
+ "rewards/rejected": -1.1383522748947144,
1312
+ "step": 730
1313
+ },
1314
+ {
1315
+ "epoch": 0.7746663177178749,
1316
+ "grad_norm": 1.53125,
1317
+ "learning_rate": 7.338500848029603e-07,
1318
+ "logits/chosen": -0.4196249544620514,
1319
+ "logits/rejected": -0.41058415174484253,
1320
+ "logps/chosen": -293.5699157714844,
1321
+ "logps/ref_response": -0.4282347559928894,
1322
+ "logps/rejected": -278.59942626953125,
1323
+ "loss": 0.4953,
1324
+ "rewards/accuracies": 0.7875000238418579,
1325
+ "rewards/chosen": -0.13130240142345428,
1326
+ "rewards/margins": 0.9627996683120728,
1327
+ "rewards/rejected": -1.0941020250320435,
1328
+ "step": 740
1329
+ },
1330
+ {
1331
+ "epoch": 0.7851347814708192,
1332
+ "grad_norm": 1.9296875,
1333
+ "learning_rate": 6.70334495204884e-07,
1334
+ "logits/chosen": -0.4480054974555969,
1335
+ "logits/rejected": -0.40563225746154785,
1336
+ "logps/chosen": -326.13360595703125,
1337
+ "logps/ref_response": -0.49645256996154785,
1338
+ "logps/rejected": -289.4101257324219,
1339
+ "loss": 0.5131,
1340
+ "rewards/accuracies": 0.800000011920929,
1341
+ "rewards/chosen": -0.11767721176147461,
1342
+ "rewards/margins": 0.9364310503005981,
1343
+ "rewards/rejected": -1.0541083812713623,
1344
+ "step": 750
1345
+ },
1346
+ {
1347
+ "epoch": 0.7956032452237635,
1348
+ "grad_norm": 2.234375,
1349
+ "learning_rate": 6.092659210462232e-07,
1350
+ "logits/chosen": -0.47132453322410583,
1351
+ "logits/rejected": -0.44373002648353577,
1352
+ "logps/chosen": -271.53887939453125,
1353
+ "logps/ref_response": -0.5222411751747131,
1354
+ "logps/rejected": -271.93829345703125,
1355
+ "loss": 0.5203,
1356
+ "rewards/accuracies": 0.7437499761581421,
1357
+ "rewards/chosen": -0.2481314241886139,
1358
+ "rewards/margins": 0.8226064443588257,
1359
+ "rewards/rejected": -1.0707378387451172,
1360
+ "step": 760
1361
+ },
1362
+ {
1363
+ "epoch": 0.8060717089767077,
1364
+ "grad_norm": 1.4609375,
1365
+ "learning_rate": 5.507260361320738e-07,
1366
+ "logits/chosen": -0.45327791571617126,
1367
+ "logits/rejected": -0.4443967342376709,
1368
+ "logps/chosen": -287.787109375,
1369
+ "logps/ref_response": -0.50932776927948,
1370
+ "logps/rejected": -282.6838073730469,
1371
+ "loss": 0.5151,
1372
+ "rewards/accuracies": 0.6937500238418579,
1373
+ "rewards/chosen": -0.15161371231079102,
1374
+ "rewards/margins": 0.7541275024414062,
1375
+ "rewards/rejected": -0.9057412147521973,
1376
+ "step": 770
1377
+ },
1378
+ {
1379
+ "epoch": 0.816540172729652,
1380
+ "grad_norm": 1.5,
1381
+ "learning_rate": 4.947931323697983e-07,
1382
+ "logits/chosen": -0.4503496289253235,
1383
+ "logits/rejected": -0.4044824540615082,
1384
+ "logps/chosen": -289.29718017578125,
1385
+ "logps/ref_response": -0.49121037125587463,
1386
+ "logps/rejected": -283.48651123046875,
1387
+ "loss": 0.5147,
1388
+ "rewards/accuracies": 0.675000011920929,
1389
+ "rewards/chosen": -0.2892809808254242,
1390
+ "rewards/margins": 0.7692901492118835,
1391
+ "rewards/rejected": -1.0585711002349854,
1392
+ "step": 780
1393
+ },
1394
+ {
1395
+ "epoch": 0.8270086364825961,
1396
+ "grad_norm": 2.46875,
1397
+ "learning_rate": 4.4154201506053985e-07,
1398
+ "logits/chosen": -0.4808201789855957,
1399
+ "logits/rejected": -0.437244176864624,
1400
+ "logps/chosen": -303.23907470703125,
1401
+ "logps/ref_response": -0.5042006373405457,
1402
+ "logps/rejected": -268.26165771484375,
1403
+ "loss": 0.5086,
1404
+ "rewards/accuracies": 0.7749999761581421,
1405
+ "rewards/chosen": -0.2500911056995392,
1406
+ "rewards/margins": 0.948900043964386,
1407
+ "rewards/rejected": -1.198991060256958,
1408
+ "step": 790
1409
+ },
1410
+ {
1411
+ "epoch": 0.8374771002355405,
1412
+ "grad_norm": 1.109375,
1413
+ "learning_rate": 3.910439028537638e-07,
1414
+ "logits/chosen": -0.4850529730319977,
1415
+ "logits/rejected": -0.4150736927986145,
1416
+ "logps/chosen": -351.47967529296875,
1417
+ "logps/ref_response": -0.5149141550064087,
1418
+ "logps/rejected": -306.39044189453125,
1419
+ "loss": 0.5156,
1420
+ "rewards/accuracies": 0.699999988079071,
1421
+ "rewards/chosen": -0.20133157074451447,
1422
+ "rewards/margins": 0.7480217218399048,
1423
+ "rewards/rejected": -0.9493532180786133,
1424
+ "step": 800
1425
+ },
1426
+ {
1427
+ "epoch": 0.8374771002355405,
1428
+ "eval_logits/chosen": -0.21027056872844696,
1429
+ "eval_logits/rejected": -0.16033047437667847,
1430
+ "eval_logps/chosen": -294.7516174316406,
1431
+ "eval_logps/ref_response": -0.5363935232162476,
1432
+ "eval_logps/rejected": -287.9801330566406,
1433
+ "eval_loss": 0.5134991407394409,
1434
+ "eval_rewards/accuracies": 0.7519999742507935,
1435
+ "eval_rewards/chosen": -0.22044621407985687,
1436
+ "eval_rewards/margins": 0.9099085927009583,
1437
+ "eval_rewards/rejected": -1.1303547620773315,
1438
+ "eval_runtime": 349.5707,
1439
+ "eval_samples_per_second": 5.721,
1440
+ "eval_steps_per_second": 0.358,
1441
+ "step": 800
1442
+ },
1443
+ {
1444
+ "epoch": 0.8479455639884846,
1445
+ "grad_norm": 1.59375,
1446
+ "learning_rate": 3.4336633249862084e-07,
1447
+ "logits/chosen": -0.5064218640327454,
1448
+ "logits/rejected": -0.40463584661483765,
1449
+ "logps/chosen": -323.1934814453125,
1450
+ "logps/ref_response": -0.5519742369651794,
1451
+ "logps/rejected": -292.6800231933594,
1452
+ "loss": 0.4987,
1453
+ "rewards/accuracies": 0.731249988079071,
1454
+ "rewards/chosen": -0.28738316893577576,
1455
+ "rewards/margins": 0.8722078204154968,
1456
+ "rewards/rejected": -1.1595909595489502,
1457
+ "step": 810
1458
+ },
1459
+ {
1460
+ "epoch": 0.8584140277414289,
1461
+ "grad_norm": 1.78125,
1462
+ "learning_rate": 2.98573068519539e-07,
1463
+ "logits/chosen": -0.4821909964084625,
1464
+ "logits/rejected": -0.44719791412353516,
1465
+ "logps/chosen": -310.1927185058594,
1466
+ "logps/ref_response": -0.5307375192642212,
1467
+ "logps/rejected": -297.6319274902344,
1468
+ "loss": 0.5104,
1469
+ "rewards/accuracies": 0.7250000238418579,
1470
+ "rewards/chosen": -0.15440870821475983,
1471
+ "rewards/margins": 0.9249560236930847,
1472
+ "rewards/rejected": -1.0793647766113281,
1473
+ "step": 820
1474
+ },
1475
+ {
1476
+ "epoch": 0.8688824914943732,
1477
+ "grad_norm": 2.234375,
1478
+ "learning_rate": 2.5672401793681854e-07,
1479
+ "logits/chosen": -0.49778860807418823,
1480
+ "logits/rejected": -0.4640674591064453,
1481
+ "logps/chosen": -277.7762145996094,
1482
+ "logps/ref_response": -0.5466696619987488,
+ "logps/rejected": -273.833740234375,
+ "loss": 0.5,
+ "rewards/accuracies": 0.78125,
+ "rewards/chosen": -0.24376177787780762,
+ "rewards/margins": 0.9780392646789551,
+ "rewards/rejected": -1.2218010425567627,
+ "step": 830
+ },
+ {
+ "epoch": 0.8793509552473174,
+ "grad_norm": 1.828125,
+ "learning_rate": 2.178751501463036e-07,
+ "logits/chosen": -0.46538400650024414,
+ "logits/rejected": -0.4340798258781433,
+ "logps/chosen": -317.69232177734375,
+ "logps/ref_response": -0.5086795091629028,
+ "logps/rejected": -311.81475830078125,
+ "loss": 0.4914,
+ "rewards/accuracies": 0.75,
+ "rewards/chosen": -0.2690751850605011,
+ "rewards/margins": 0.9315555691719055,
+ "rewards/rejected": -1.200630784034729,
+ "step": 840
+ },
+ {
+ "epoch": 0.8898194190002617,
+ "grad_norm": 2.015625,
+ "learning_rate": 1.820784220652766e-07,
+ "logits/chosen": -0.5234388113021851,
+ "logits/rejected": -0.4476381838321686,
+ "logps/chosen": -348.9575500488281,
+ "logps/ref_response": -0.5546728372573853,
+ "logps/rejected": -283.19366455078125,
+ "loss": 0.5109,
+ "rewards/accuracies": 0.7250000238418579,
+ "rewards/chosen": -0.0391000397503376,
+ "rewards/margins": 0.9498162269592285,
+ "rewards/rejected": -0.9889162182807922,
+ "step": 850
+ },
+ {
+ "epoch": 0.9002878827532059,
+ "grad_norm": 1.9453125,
+ "learning_rate": 1.4938170864468636e-07,
+ "logits/chosen": -0.44146886467933655,
+ "logits/rejected": -0.38905996084213257,
+ "logps/chosen": -292.7279968261719,
+ "logps/ref_response": -0.4814940392971039,
+ "logps/rejected": -273.334228515625,
+ "loss": 0.4883,
+ "rewards/accuracies": 0.7562500238418579,
+ "rewards/chosen": -0.21229040622711182,
+ "rewards/margins": 1.0407958030700684,
+ "rewards/rejected": -1.2530862092971802,
+ "step": 860
+ },
+ {
+ "epoch": 0.9107563465061502,
+ "grad_norm": 2.46875,
+ "learning_rate": 1.1982873884064466e-07,
+ "logits/chosen": -0.40739497542381287,
+ "logits/rejected": -0.3892499804496765,
+ "logps/chosen": -290.3547668457031,
+ "logps/ref_response": -0.463235467672348,
+ "logps/rejected": -281.25054931640625,
+ "loss": 0.5111,
+ "rewards/accuracies": 0.731249988079071,
+ "rewards/chosen": -0.24277010560035706,
+ "rewards/margins": 0.7852509617805481,
+ "rewards/rejected": -1.0280208587646484,
+ "step": 870
+ },
+ {
+ "epoch": 0.9212248102590945,
+ "grad_norm": 1.84375,
+ "learning_rate": 9.345903713082305e-08,
+ "logits/chosen": -0.4908636510372162,
+ "logits/rejected": -0.4692970812320709,
+ "logps/chosen": -317.8591613769531,
+ "logps/ref_response": -0.5406745672225952,
+ "logps/rejected": -284.82958984375,
+ "loss": 0.5206,
+ "rewards/accuracies": 0.637499988079071,
+ "rewards/chosen": -0.23101326823234558,
+ "rewards/margins": 0.7203452587127686,
+ "rewards/rejected": -0.9513584971427917,
+ "step": 880
+ },
+ {
+ "epoch": 0.9316932740120387,
+ "grad_norm": 2.765625,
+ "learning_rate": 7.030787065396866e-08,
+ "logits/chosen": -0.4572540819644928,
+ "logits/rejected": -0.4023072123527527,
+ "logps/chosen": -322.2685546875,
+ "logps/ref_response": -0.5117658376693726,
+ "logps/rejected": -297.341064453125,
+ "loss": 0.5061,
+ "rewards/accuracies": 0.7437499761581421,
+ "rewards/chosen": -0.18995711207389832,
+ "rewards/margins": 0.8081440925598145,
+ "rewards/rejected": -0.9981012344360352,
+ "step": 890
+ },
+ {
+ "epoch": 0.942161737764983,
+ "grad_norm": 2.078125,
+ "learning_rate": 5.0406202043228604e-08,
+ "logits/chosen": -0.4801081120967865,
+ "logits/rejected": -0.4477986693382263,
+ "logps/chosen": -336.6103820800781,
+ "logps/ref_response": -0.5195636749267578,
+ "logps/rejected": -278.45306396484375,
+ "loss": 0.506,
+ "rewards/accuracies": 0.7562500238418579,
+ "rewards/chosen": -0.05415695160627365,
+ "rewards/margins": 1.0482077598571777,
+ "rewards/rejected": -1.1023646593093872,
+ "step": 900
+ },
+ {
+ "epoch": 0.942161737764983,
+ "eval_logits/chosen": -0.20995385944843292,
+ "eval_logits/rejected": -0.1602318435907364,
+ "eval_logps/chosen": -294.57037353515625,
+ "eval_logps/ref_response": -0.5363935232162476,
+ "eval_logps/rejected": -287.7952575683594,
+ "eval_loss": 0.5134302377700806,
+ "eval_rewards/accuracies": 0.7480000257492065,
+ "eval_rewards/chosen": -0.20232149958610535,
+ "eval_rewards/margins": 0.9095419049263,
+ "eval_rewards/rejected": -1.111863374710083,
+ "eval_runtime": 349.7136,
+ "eval_samples_per_second": 5.719,
+ "eval_steps_per_second": 0.357,
+ "step": 900
+ },
+ {
+ "epoch": 0.9526302015179272,
+ "grad_norm": 1.546875,
+ "learning_rate": 3.378064801637687e-08,
+ "logits/chosen": -0.5161974430084229,
+ "logits/rejected": -0.4518052935600281,
+ "logps/chosen": -317.8857421875,
+ "logps/ref_response": -0.561827540397644,
+ "logps/rejected": -317.4599609375,
+ "loss": 0.5156,
+ "rewards/accuracies": 0.737500011920929,
+ "rewards/chosen": -0.11832026392221451,
+ "rewards/margins": 0.8348624110221863,
+ "rewards/rejected": -0.9531826972961426,
+ "step": 910
+ },
+ {
+ "epoch": 0.9630986652708715,
+ "grad_norm": 1.8046875,
+ "learning_rate": 2.0453443778310766e-08,
+ "logits/chosen": -0.4482875466346741,
+ "logits/rejected": -0.3840841054916382,
+ "logps/chosen": -332.5438232421875,
+ "logps/ref_response": -0.4732615351676941,
+ "logps/rejected": -309.3650207519531,
+ "loss": 0.5037,
+ "rewards/accuracies": 0.7875000238418579,
+ "rewards/chosen": -0.15293975174427032,
+ "rewards/margins": 1.0318092107772827,
+ "rewards/rejected": -1.184748888015747,
+ "step": 920
+ },
+ {
+ "epoch": 0.9735671290238157,
+ "grad_norm": 1.7109375,
+ "learning_rate": 1.0442413283435759e-08,
+ "logits/chosen": -0.4494338929653168,
+ "logits/rejected": -0.37092915177345276,
+ "logps/chosen": -321.01531982421875,
+ "logps/ref_response": -0.4792579114437103,
+ "logps/rejected": -282.218017578125,
+ "loss": 0.4975,
+ "rewards/accuracies": 0.762499988079071,
+ "rewards/chosen": -0.13543161749839783,
+ "rewards/margins": 1.1757858991622925,
+ "rewards/rejected": -1.3112175464630127,
+ "step": 930
+ },
+ {
+ "epoch": 0.98403559277676,
+ "grad_norm": 1.453125,
+ "learning_rate": 3.760945397705828e-09,
+ "logits/chosen": -0.4677162170410156,
+ "logits/rejected": -0.39422911405563354,
+ "logps/chosen": -293.9225769042969,
+ "logps/ref_response": -0.5234506726264954,
+ "logps/rejected": -264.69964599609375,
+ "loss": 0.4975,
+ "rewards/accuracies": 0.78125,
+ "rewards/chosen": -0.05600558966398239,
+ "rewards/margins": 1.1023480892181396,
+ "rewards/rejected": -1.1583536863327026,
+ "step": 940
+ },
+ {
+ "epoch": 0.9945040565297043,
+ "grad_norm": 2.015625,
+ "learning_rate": 4.1797599220405605e-10,
+ "logits/chosen": -0.48652324080467224,
+ "logits/rejected": -0.4518989622592926,
+ "logps/chosen": -298.553955078125,
+ "logps/ref_response": -0.5367287397384644,
+ "logps/rejected": -282.85626220703125,
+ "loss": 0.5151,
+ "rewards/accuracies": 0.75,
+ "rewards/chosen": -0.14170411229133606,
+ "rewards/margins": 0.9237847328186035,
+ "rewards/rejected": -1.0654886960983276,
+ "step": 950
+ },
+ {
+ "epoch": 0.9997382884061764,
+ "step": 955,
+ "total_flos": 0.0,
+ "train_loss": 0.5342667900454936,
+ "train_runtime": 19113.1655,
+ "train_samples_per_second": 3.199,
+ "train_steps_per_second": 0.05
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 955,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 100000,
+ "stateful_callbacks": {
+ "TrainerControl": {
+ "args": {
+ "should_epoch_stop": false,
+ "should_evaluate": false,
+ "should_log": false,
+ "should_save": true,
+ "should_training_stop": true
+ },
+ "attributes": {}
+ }
+ },
+ "total_flos": 0.0,
+ "train_batch_size": 1,
+ "trial_name": null,
+ "trial_params": null
+ }