Quantile Regression for Distributional Reward Models in RLHF
Abstract
Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation than a point estimate. This distributional approach better captures the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model are released at https://github.com/Nicolinho/QRM.
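To make the quantile-regression idea concrete, the sketch below shows a minimal reward head that predicts a fixed grid of reward quantiles and is trained with the pinball (quantile-regression) loss, with a lower quantile usable as a conservative reward for risk-aware RL. This is an illustrative sketch, not the released implementation; names such as `QuantileRewardHead`, the hidden dimension, and the quantile grid are assumptions for the example.

```python
# Illustrative sketch (not the authors' released code) of quantile regression
# for a distributional reward model.
import torch
import torch.nn as nn


class QuantileRewardHead(nn.Module):
    """Maps a prompt-response embedding to K reward quantile estimates."""

    def __init__(self, hidden_dim: int, num_quantiles: int = 19):
        super().__init__()
        # Evenly spaced quantile levels in (0, 1): 0.05, 0.10, ..., 0.95.
        self.register_buffer("taus", torch.linspace(0.05, 0.95, num_quantiles))
        self.proj = nn.Linear(hidden_dim, num_quantiles)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) -> quantile estimates: (batch, K)
        return self.proj(hidden)


def pinball_loss(pred_quantiles: torch.Tensor,
                 target_reward: torch.Tensor,
                 taus: torch.Tensor) -> torch.Tensor:
    """Quantile-regression (pinball) loss.

    pred_quantiles: (batch, K), target_reward: (batch,), taus: (K,)
    """
    diff = target_reward.unsqueeze(-1) - pred_quantiles      # (batch, K)
    loss = torch.maximum(taus * diff, (taus - 1.0) * diff)   # pinball check function
    return loss.mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    head = QuantileRewardHead(hidden_dim=16)
    hidden = torch.randn(4, 16)   # stand-in for LLM embeddings of prompt-response pairs
    target = torch.randn(4)       # stand-in scalar reward labels
    preds = head(hidden)
    loss = pinball_loss(preds, target, head.taus)
    loss.backward()
    print("pinball loss:", loss.item())

    # Distributional outputs allow risk-aware aggregation, e.g. taking a low
    # quantile as a conservative reward signal during RL fine-tuning.
    conservative_reward = preds[:, 0]  # the 0.05-quantile estimate
```

Under these assumptions, the full set of predicted quantiles approximates the reward distribution, and the choice of aggregation (mean, median, or a low quantile) determines how risk-averse the downstream RL policy is.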