|
--- |
|
library_name: transformers |
|
tags: [] |
|
--- |
|
|
|
# Model Card for PPO-M (Llama-3-8B)
|
|
|
|
**PPO-M** (PPO with Calibrated Reward Modeling) is an RLHF algorithm designed to mitigate verbalized overconfidence in RLHF-trained Large Language Models. PPO-M calibrates the reward modeling process by augmenting the binary pairwise ranking dataset with explicit confidence scores, and encourages the reward model to align confidence levels with response quality.
|
Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and [repo](https://github.com/SeanLeng1/Reward-Calibration) for more details. |
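
As a purely illustrative sketch of the idea (not the actual dataset schema), a confidence-augmented pairwise ranking example might look like the following; the field names and the confidence format are assumptions, and the real construction is described in the paper and repo:

```python
# Hypothetical illustration of a confidence-augmented pairwise ranking example.
# Field names and the confidence format are assumptions, not the actual schema
# used to train the calibrated reward model; see the paper/repo for details.
example = {
    "prompt": "What is the capital of Australia?",
    # preferred response, paired with an explicit verbalized confidence score
    "chosen": "The capital of Australia is Canberra. Confidence: 9/10.",
    # dispreferred response, also paired with a verbalized confidence score
    "rejected": "The capital of Australia is Sydney. Confidence: 8/10.",
}
```

A calibrated reward model trained on such data should reward confident high-quality responses and penalize confident low-quality ones, aligning confidence with response quality.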
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
|
We train [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture) with PPO-M on our prompt dataset [HINT-lab/prompt-collections-final-v0.3](https://huggingface.co/datasets/HINT-lab/prompt-collections-final-v0.3), using our calibrated reward model [HINT-lab/llama3-8b-crm-final-v0.1](https://huggingface.co/HINT-lab/llama3-8b-crm-final-v0.1).
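
The resulting model can be loaded like any causal LM in 🤗 Transformers. Below is a minimal generation sketch; the repository ID is a placeholder for this model's Hub ID, and the dtype, device settings, and prompt are just example assumptions:

```python
# Minimal generation sketch with 🤗 Transformers.
# Assumptions: a CUDA-capable device, bfloat16 support, and a tokenizer that
# ships a chat template (inherited from the Llama-3 SFT base). Adjust as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-model>"  # replace with this model's Hub repository ID or a local path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of Australia? "
                   "Answer and give a confidence score from 0 to 10.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(response)
```

Since PPO-M targets verbalized overconfidence, prompting for a confidence score (as above) is one natural way to inspect its calibration behavior.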
|
|
|
- **Developed by:** Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang |
|
- **Finetuned from model:** [OpenRLHF/Llama-3-8b-sft-mixture](https://huggingface.co/OpenRLHF/Llama-3-8b-sft-mixture) |
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** [SeanLeng1/Reward-Calibration](https://github.com/SeanLeng1/Reward-Calibration)
|
- **Paper:** [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724) |
|
|
|
|
|