Quantile Regression for Distributional Reward Models in RLHF
Abstract
Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation than a point estimate. This distributional approach better captures the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at https://github.com/Nicolinho/QRM.
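To make the quantile-regression idea concrete, the sketch below shows a minimal reward head that predicts a fixed grid of reward quantiles and is trained with the pinball (quantile-regression) loss, with a lower quantile usable as a conservative reward for risk-aware RL. This is an illustrative sketch, not the released implementation; names such as `QuantileRewardHead`, the hidden dimension, and the quantile grid are assumptions for the example.

```python
# Illustrative sketch (not the authors' released code) of quantile regression
# for a distributional reward model.
import torch
import torch.nn as nn


class QuantileRewardHead(nn.Module):
    """Maps a prompt-response embedding to K reward quantile estimates."""

    def __init__(self, hidden_dim: int, num_quantiles: int = 19):
        super().__init__()
        # Evenly spaced quantile levels in (0, 1): 0.05, 0.10, ..., 0.95.
        self.register_buffer("taus", torch.linspace(0.05, 0.95, num_quantiles))
        self.proj = nn.Linear(hidden_dim, num_quantiles)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) -> quantile estimates: (batch, K)
        return self.proj(hidden)


def pinball_loss(pred_quantiles: torch.Tensor,
                 target_reward: torch.Tensor,
                 taus: torch.Tensor) -> torch.Tensor:
    """Quantile-regression (pinball) loss.

    pred_quantiles: (batch, K), target_reward: (batch,), taus: (K,)
    """
    diff = target_reward.unsqueeze(-1) - pred_quantiles      # (batch, K)
    loss = torch.maximum(taus * diff, (taus - 1.0) * diff)   # pinball check function
    return loss.mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    head = QuantileRewardHead(hidden_dim=16)
    hidden = torch.randn(4, 16)   # stand-in for LLM embeddings of prompt-response pairs
    target = torch.randn(4)       # stand-in scalar reward labels
    preds = head(hidden)
    loss = pinball_loss(preds, target, head.taus)
    loss.backward()
    print("pinball loss:", loss.item())

    # Distributional outputs allow risk-aware aggregation, e.g. taking a low
    # quantile as a conservative reward signal during RL fine-tuning.
    conservative_reward = preds[:, 0]  # the 0.05-quantile estimate
```

Under these assumptions, the full set of predicted quantiles approximates the reward distribution, and the choice of aggregation (mean, median, or a low quantile) determines how risk-averse the downstream RL policy is.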