---
library_name: transformers
tags: []
---

# Model Card for PPO-C

**PPO-C** (PPO with Calibrated Reward Calculation) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained Large Language Models. PPO-C adjusts standard reward model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold to classify responses, and adjusts each reward score based on the model's expressed verbalized confidence.

Please refer to our preprint ([Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)) and our [repository](https://github.com/SeanLeng1/Reward-Calibration) for more details.

## Model Details

### Model Description

We train [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) on our [HINT-lab/prompt-collections-final-v0.3](https://huggingface.co/datasets/HINT-lab/prompt-collections-final-v0.3) dataset with a vanilla reward model, [HINT-lab/mistral-7b-hermes-rm-skywork](https://huggingface.co/HINT-lab/mistral-7b-hermes-rm-skywork).

- **Developed by:** Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- **Finetuned from model:** [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B)

### Model Sources

- **Repository:** [Our repo](https://github.com/SeanLeng1/Reward-Calibration)
- **Paper:** [Taming Overconfidence in LLMs: Reward Calibration in RLHF](https://arxiv.org/abs/2410.09724)
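
### Reward Calibration Sketch

For intuition, below is a minimal sketch of a PPO-C-style reward adjustment. The class name, hyperparameters (`alpha`, `momentum`), and the exact adjustment rule are illustrative assumptions, not the implementation used to train this model; see the paper and repository above for the actual formulation.

```python
# Illustrative sketch of PPO-C-style reward calibration. The update rule and
# hyperparameters below are placeholders, not the paper's exact formulation
# (see https://arxiv.org/abs/2410.09724).

class CalibratedRewardAdjuster:
    """Keeps a running average of reward scores and adjusts each new
    reward using the model's verbalized confidence."""

    def __init__(self, alpha: float = 1.0, momentum: float = 0.99):
        self.alpha = alpha          # strength of the confidence adjustment (assumed)
        self.momentum = momentum    # EMA momentum for the dynamic threshold (assumed)
        self.threshold = None       # running average of past reward scores

    def __call__(self, reward: float, confidence: float) -> float:
        """`reward` is the raw reward-model score; `confidence` is the
        verbalized confidence extracted from the response, scaled to [0, 1]."""
        if self.threshold is None:
            self.threshold = reward
        # Responses scoring above the running average are treated as "good":
        # higher confidence raises their reward. For responses below the
        # threshold, higher confidence lowers the reward instead.
        direction = 1.0 if reward >= self.threshold else -1.0
        calibrated = reward + direction * self.alpha * (confidence - 0.5)
        # Update the dynamic threshold with the raw (uncalibrated) score.
        self.threshold = self.momentum * self.threshold + (1 - self.momentum) * reward
        return calibrated


# Example: a confident response below the running threshold is penalized.
adjuster = CalibratedRewardAdjuster()
print(adjuster(reward=0.8, confidence=0.9))  # at/above threshold -> boosted
print(adjuster(reward=0.2, confidence=0.9))  # below threshold -> penalized
```

The calibrated score then replaces the raw reward-model score in the standard PPO objective, so no change to the PPO update itself is required.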