arxiv:2407.01470

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Published on Jul 1 · Submitted by hank0316 on Jul 2

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis showcasing the effects of model merging, highlighting its great potential for facilitating model alignment.

Community

Paper author · Paper submitter

Our DogeRM framework merges the transformer layers and input embeddings of a general reward model with those of a domain-specific SFT language model. We conducted experiments in the math and coding domains. The results demonstrate the potential of our method across various benchmarks, including RewardBench, Auto-J Eval, and Best-of-N sampling on GSM8K/MBPP. A minimal sketch of this kind of merge is given below.
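For a concrete picture of the merge step, here is a minimal PyTorch/Transformers sketch that linearly interpolates the parameters shared by the two models (the transformer layers and input embeddings) while keeping the reward model's scalar head untouched. The checkpoint paths, the `alpha` value, and the use of `AutoModelForSequenceClassification` for the reward model are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Hypothetical checkpoint paths; the paper's actual models may differ.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/general-reward-model", num_labels=1
)
domain_lm = AutoModelForCausalLM.from_pretrained("path/to/math-sft-model")

alpha = 0.5  # interpolation weight for the domain SFT model (assumed value)

rm_state = reward_model.state_dict()
lm_state = domain_lm.state_dict()

merged_state = {}
with torch.no_grad():
    for name, rm_param in rm_state.items():
        lm_param = lm_state.get(name)
        if lm_param is not None and lm_param.shape == rm_param.shape:
            # Transformer layers and input embeddings share names and shapes
            # across the two models, so they are linearly interpolated.
            merged_state[name] = (1 - alpha) * rm_param + alpha * lm_param
        else:
            # Parameters unique to the reward model (e.g., the scalar reward
            # head) are kept unchanged.
            merged_state[name] = rm_param

reward_model.load_state_dict(merged_state)
reward_model.save_pretrained("dogerm-merged")  # hypothetical output directory
```

Matching parameters by name and shape means the reward head and the SFT model's LM head are simply left out of the interpolation, which mirrors the idea of merging only the shared backbone.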

