library_name: transformers
tags: []

Model Card for Model ID

PPO-C (PPO with Calibrated Reward Calculation) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained large language models. It adjusts the standard reward model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold for classifying responses, then adjusts each reward score according to the confidence the model verbalizes. Please refer to our preprint (Taming Overconfidence in LLMs: Reward Calibration in RLHF) and our repo for more details.
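The reward adjustment described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `CalibratedReward`, the exponential form of the running average, the linear adjustment rule, and the default values of `alpha` and `weight` are all assumptions made for this example. Confidence is assumed to be a number in [0, 1] parsed from the response, or `None` when the response states no confidence. See the preprint and repo for the exact formulation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CalibratedReward:
    """Illustrative PPO-C-style reward adjustment (assumed form, not the official code)."""

    alpha: float = 0.1   # assumed smoothing factor for the running average
    weight: float = 0.5  # assumed scale of the confidence-based adjustment
    threshold: Optional[float] = None  # running average of past reward scores

    def update_threshold(self, rewards: List[float]) -> float:
        """Fold the current batch mean into the running average."""
        batch_mean = sum(rewards) / len(rewards)
        if self.threshold is None:
            # First batch: initialize the threshold with the batch mean.
            self.threshold = batch_mean
        else:
            self.threshold = (
                self.alpha * batch_mean + (1 - self.alpha) * self.threshold
            )
        return self.threshold

    def adjust(
        self, rewards: List[float], confidences: List[Optional[float]]
    ) -> List[float]:
        """Shift each reward according to the verbalized confidence.

        Responses scoring above the threshold gain reward for high confidence
        and lose reward for low confidence; responses below the threshold get
        the opposite treatment.
        """
        threshold = self.update_threshold(rewards)
        adjusted = []
        for reward, confidence in zip(rewards, confidences):
            if confidence is None:
                # No verbalized confidence: keep the reward model score as is.
                adjusted.append(reward)
            else:
                delta = self.weight * (reward - threshold) * (confidence - 0.5)
                adjusted.append(reward + delta)
        return adjusted
```

Under these assumptions, a confident response that scores above the running average is rewarded slightly more, while an equally confident response that scores below it is penalized, which is the pressure that discourages overconfident low-quality answers.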

Model Details

Model Description

We train teknium/OpenHermes-2.5-Mistral-7B with PPO-C on our prompt collection HINT-lab/prompt-collections-final-v0.3, using the vanilla reward model HINT-lab/mistral-7b-hermes-rm-skywork.

Developed by: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
Finetuned from model: teknium/OpenHermes-2.5-Mistral-7B

Model Sources

Repository: Our repo
Paper: Taming Overconfidence in LLMs: Reward Calibration in RLHF