Model Card for Model ID
PPO-C (PPO with Calibrated Reward Calculation) is an RLHF algorithm that mitigates verbalized overconfidence in RLHF-trained large language models. It adjusts standard reward model scores during PPO training: it maintains a running average of past reward scores as a dynamic threshold to classify responses, then adjusts each reward score according to the model's verbalized confidence. Please refer to our preprint (Taming Overconfidence in LLMs: Reward Calibration in RLHF) and repo for more details.
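The reward adjustment described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `CalibratedReward`, the exponential-moving-average update, the linear adjustment formula, the [0, 1] confidence scale, and the default values of `alpha` and `weight` are all hypothetical. The preprint and repo give the exact rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CalibratedReward:
    """Hypothetical helper illustrating a calibrated reward calculation."""

    alpha: float = 0.1      # assumed smoothing factor for the running average
    weight: float = 0.5     # assumed scale of the confidence-based adjustment
    threshold: float = 0.0  # running average of past reward scores

    def adjust(self, rewards: List[float], confidences: List[float]) -> List[float]:
        """Return adjusted rewards; confidences are assumed to lie in [0, 1]."""
        if not rewards or len(rewards) != len(confidences):
            raise ValueError("rewards and confidences must be non-empty and equal in length")

        # Update the dynamic threshold with the mean reward of this batch.
        batch_mean = sum(rewards) / len(rewards)
        self.threshold = self.alpha * batch_mean + (1 - self.alpha) * self.threshold

        # Responses scoring above the threshold gain reward for high confidence;
        # responses scoring below it lose reward for high confidence.
        return [
            r + self.weight * (r - self.threshold) * (c - 0.5)
            for r, c in zip(rewards, confidences)
        ]
```

Under these assumptions, a response whose reward sits above the threshold is pushed further up when the model states high confidence, while a response below the threshold is pushed further down. A confidence of 0.5 leaves the reward unchanged.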
Model Details
Model Description
We train OpenRLHF/Llama-3-8b-sft-mixture on our HINT-lab/prompt-collections-final-v0.3 dataset, using the vanilla reward model OpenRLHF/Llama-3-8b-rm-mixture.
- Developed by: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
- Finetuned from model: OpenRLHF/Llama-3-8b-sft-mixture
Model Sources
- Repository: Our repo
- Paper: Taming Overconfidence in LLMs: Reward Calibration in RLHF