---
library_name: transformers
tags: []
---

# Model Card for PPO-C OpenHermes-2.5-Mistral-7B

**PPO-C** (PPO with Calibrated Reward Calculation) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained Large Language Models.
PPO-C adjusts standard reward model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold for
classifying responses, and adjusts each reward score based on the model's verbalized confidence.
Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details.
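
As a rough picture of the mechanism, here is a minimal Python sketch. It is illustrative only: the adjustment strength `alpha`, the window size, and the exact update rule are our assumptions for illustration, not the formula from the paper.

```python
from collections import deque


class CalibratedReward:
    """Running-average reward calibration in the spirit of PPO-C (illustrative sketch)."""

    def __init__(self, alpha: float = 0.5, window: int = 1000):
        self.alpha = alpha                   # strength of the confidence adjustment (assumed)
        self.history = deque(maxlen=window)  # past reward scores for the running average

    def adjust(self, reward: float, confidence: float) -> float:
        """Adjust a reward score given the model's verbalized confidence in [0, 1]."""
        # Dynamic threshold: running average of past reward scores.
        threshold = sum(self.history) / len(self.history) if self.history else 0.0
        self.history.append(reward)
        # Above the threshold the response is treated as high quality, so
        # confidence raises the reward; below it, confidence is penalized.
        sign = 1.0 if reward >= threshold else -1.0
        return reward + sign * self.alpha * confidence


calibrator = CalibratedReward()
print(calibrator.adjust(reward=0.8, confidence=0.9))   # above threshold: boosted
print(calibrator.adjust(reward=-0.5, confidence=0.9))  # below threshold: penalized
```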



## Model Details


### Model Description


We train [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) with PPO-C on our [HINT-lab/prompt-collections-final-v0.3](https://huggingface.co/datasets/HINT-lab/prompt-collections-final-v0.3) prompt dataset,
using the vanilla reward model [HINT-lab/mistral-7b-hermes-rm-skywork](https://huggingface.co/HINT-lab/mistral-7b-hermes-rm-skywork).

- **Developed by:** Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- **Finetuned from model:** [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B)
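
Because the checkpoint is a standard `transformers` causal LM fine-tuned from OpenHermes-2.5-Mistral-7B, it should load with the usual API. A minimal sketch follows; the repo id is a placeholder for this checkpoint's Hub id, and the confidence-eliciting prompt is only an illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HINT-lab/<this-checkpoint>"  # placeholder: substitute this model's Hub repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The base model ships a chat template, so format the prompt through it.
messages = [{
    "role": "user",
    "content": "What is the capital of France? State your confidence from 0 to 10.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```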

### Model Sources

- **Repository:** [Our repo](https://github.com/SeanLeng1/Reward-Calibration)
- **Paper:** [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)