arxiv:2407.01470

DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging

Published on Jul 1 · Submitted by hank0316 on Jul 2

Abstract

Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors. Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model through model merging. Our experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis showcasing the effects of model merging, highlighting its great potential for facilitating model alignment.

Community

Paper author · Paper submitter

Our DogeRM framework merges the transformer layers and input embeddings of a general reward model with those of a domain-specific SFT language model. We conducted experiments in the math and coding domains. The results demonstrate the potential of our method across various benchmarks, including RewardBench, Auto-J Eval, and Best-of-N sampling on GSM8K/MBPP. A minimal sketch of this kind of merge is given below.
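For a concrete picture of the merge step, here is a minimal PyTorch/Transformers sketch that linearly interpolates the parameters shared by the two models (the transformer layers and input embeddings) while keeping the reward model's scalar head untouched. The checkpoint paths, the `alpha` value, and the use of `AutoModelForSequenceClassification` for the reward model are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

# Hypothetical checkpoint paths; the paper's actual models may differ.
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/general-reward-model", num_labels=1
)
domain_lm = AutoModelForCausalLM.from_pretrained("path/to/math-sft-model")

alpha = 0.5  # interpolation weight for the domain SFT model (assumed value)

rm_state = reward_model.state_dict()
lm_state = domain_lm.state_dict()

merged_state = {}
with torch.no_grad():
    for name, rm_param in rm_state.items():
        lm_param = lm_state.get(name)
        if lm_param is not None and lm_param.shape == rm_param.shape:
            # Transformer layers and input embeddings share names and shapes
            # across the two models, so they are linearly interpolated.
            merged_state[name] = (1 - alpha) * rm_param + alpha * lm_param
        else:
            # Parameters unique to the reward model (e.g., the scalar reward
            # head) are kept unchanged.
            merged_state[name] = rm_param

reward_model.load_state_dict(merged_state)
reward_model.save_pretrained("dogerm-merged")  # hypothetical output directory
```

Matching parameters by name and shape means the reward head and the SFT model's LM head are simply left out of the interpolation, which mirrors the idea of merging only the shared backbone.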

