arxiv:2409.10164

Quantile Regression for Distributional Reward Models in RLHF

Published on Sep 16, 2024
Abstract

Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at https://github.com/Nicolinho/QRM.
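As a rough illustration of the quantile-regression idea behind QRM, the minimal PyTorch sketch below trains a head that predicts several reward quantiles with the pinball (quantile) loss, and shows how a low quantile could serve as a conservative, risk-aware score. The class name, layer sizes, quantile levels, and training targets here are illustrative assumptions, not the paper's released implementation (see the linked repository for that).

```python
# Minimal sketch: a quantile-regression reward head trained with the pinball loss.
# All names and sizes are hypothetical; this is not the released QRM code.
import torch
import torch.nn as nn


class QuantileRewardHead(nn.Module):
    """Maps a hidden representation to K quantile estimates of the reward."""

    def __init__(self, hidden_size: int, quantiles: torch.Tensor):
        super().__init__()
        self.register_buffer("quantiles", quantiles)  # e.g. [0.05, 0.25, 0.5, 0.75, 0.95]
        self.proj = nn.Linear(hidden_size, quantiles.numel())

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) -> (batch, K) quantile estimates
        return self.proj(hidden)


def pinball_loss(pred_quantiles: torch.Tensor,
                 target: torch.Tensor,
                 quantiles: torch.Tensor) -> torch.Tensor:
    """Pinball loss averaged over batch and quantile levels.

    pred_quantiles: (batch, K), target: (batch,), quantiles: (K,)
    """
    diff = target.unsqueeze(-1) - pred_quantiles            # (batch, K)
    loss = torch.maximum(quantiles * diff, (quantiles - 1.0) * diff)
    return loss.mean()


# Illustrative usage with random tensors standing in for reward-model
# features and scalar reward targets.
quantiles = torch.tensor([0.05, 0.25, 0.5, 0.75, 0.95])
head = QuantileRewardHead(hidden_size=16, quantiles=quantiles)
hidden = torch.randn(8, 16)
target = torch.randn(8)

pred = head(hidden)
loss = pinball_loss(pred, target, quantiles)
loss.backward()

# For risk-aware use, one could score a response by a low quantile
# (here the 5th percentile) rather than the mean, penalizing responses
# whose predicted reward distribution has a heavy lower tail.
conservative_score = pred[:, 0]
```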
