This model was trained with Iterative DPO in OpenRLHF (https://github.com/OpenRLHF/OpenRLHF).

Datasets and Hyperparameters

- Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
- SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1

```
Max Prompt Length: 2048
Max Response Length: 2048
best_of_n: 2 (2 samples for each prompt)
Learning Rate: 5e-7
Beta: 0.1
Scheduler: Cosine with Warmup (0.03) and MinLR (0.1 * init_lr)
Rollout Batch Size: 20000
Training Batch Size: 256
Number of Iterations: 9
```
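The objective behind these settings can be sketched in plain Python. This is a generic DPO loss plus the best_of_n = 2 pairing rule (higher-reward sample becomes "chosen"), not OpenRLHF's actual implementation; the log-probabilities and rewards below are illustrative scalars:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))),
    where each argument is a sequence log-probability under the
    policy (pi_*) or the frozen reference model (ref_*)."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

def make_pair(samples):
    """With best_of_n = 2, the higher-reward response is 'chosen',
    the other 'rejected'. samples: list of (response, reward) tuples."""
    (a, ra), (b, rb) = samples
    return (a, b) if ra >= rb else (b, a)

# Illustrative values: the policy already prefers the chosen response,
# so the implied logits are positive and the loss is below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

Each of the 9 iterations repeats this cycle: sample 2 responses per prompt from the current policy, score them with the reward model, build chosen/rejected pairs, and run DPO against the previous iteration's policy as the reference.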

Evaluation
```
########## First turn ##########
                      score
model           turn
Llama3-iter-dpo 1      8.55
########## Second turn ##########
                        score
model           turn
Llama3-iter-dpo 2     7.95625
########## Average ##########
                        score
model
Llama3-iter-dpo      8.253125
Llama3-sft-baseline      7.69
```
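The Average row for Llama3-iter-dpo is the mean of the two per-turn scores, which can be checked directly:

```python
first_turn = 8.55
second_turn = 7.95625
average = (first_turn + second_turn) / 2
print(average)  # 8.253125
```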