---
license: apache-2.0
datasets:
- Skywork/Skywork-Reward-Preference-80K-v0.1
base_model:
- Qwen/Qwen2-7B-Instruct
---
## Introduction
Con-J-Qwen2-7B (learning the generative ***J***udge using self-generated ***Con***trastive judgments) is a generative judge built on the Qwen2-7B-Instruct model and trained on the Skywork/Skywork-Reward-Preference-80K-v0.1 dataset.
Con-J-Qwen2-7B is trained from preference data: we prompt the pre-trained Qwen2-7B-Instruct model to generate positive and negative judgments, each supported by a rationale in natural language, and then use these self-generated contrastive judgment pairs to train the generative judge with Direct Preference Optimization (DPO). In this way, Con-J learns to act as a generative judge that produces accurate judgments together with supporting rationales. A minimal, illustrative sketch of the DPO stage is shown below.
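To give a sense of how such contrastive judgment pairs can be consumed, the sketch below shows a minimal DPO setup using the `trl` library. It is an illustration only, under assumed column names (`prompt`, `chosen`, `rejected`) and hyperparameters; it is not the authors' released training code, and argument names may differ slightly across `trl` versions (older releases take `tokenizer=` instead of `processing_class=`).
```python
# Illustrative only: a minimal DPO stage over self-generated contrastive judgments.
# Assumes a dataset with "prompt", "chosen" (correct judgment + rationale) and
# "rejected" (flawed judgment + rationale) columns; not the authors' training code.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen2-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Each row pairs one judging prompt with a positive and a negative self-generated judgment.
pairs = Dataset.from_list([
    {
        "prompt": "...judging prompt built from a question and two candidate answers...",
        "chosen": '{"原因": "...correct rationale...", "更好的回答": 2}',
        "rejected": '{"原因": "...flawed rationale...", "更好的回答": 1}',
    },
])

args = DPOConfig(output_dir="con-j-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=pairs, processing_class=tokenizer)
trainer.train()
```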
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "ZiyiYe/Con-J-Qwen2-7B"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
question = "What is the range of the numeric output of a sigmoid node in a neural network?"
answer1 = "The output of a sigmoid node is bounded between -1 and 1."
answer2 = "The output of a sigmoid node is bounded between 0 and 1."
# Format and tokenize the conversations
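# CON_J_PROMPT below is the Chinese judging prompt used during training; keep it verbatim.
# Rough English translation: "As an evaluation expert, given a question and two candidate
# answers, choose the answer that is best in terms of coherence, accuracy, coverage, and
# overall quality. Output your judgment as JSON, where '原因' (reason) is your explanation
# and '更好的回答' (better answer) is the integer 1 or 2, e.g. {"原因": "your explanation",
# "更好的回答": 1}. The question and candidate answers follow."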
CON_J_PROMPT = """作为一个评价专家,给定一个问题和它的两个可能的回答,请选出哪一个回答在连贯性、准确性、覆盖度和上述定义的整体质量方面最为符合。请用JSON格式输出你的判断, 其中"原因"是你提供的解释,"更好的回答"是整数类型的1或2,例如{{"原因": "你的解释", "更好的回答": 1}}。以下是问题和候选回答的内容:
\n问题:{instruction}
回答1:{output_1}
回答2:{output_2}"""
user_prompt = CON_J_PROMPT.format(instruction=question, output_1=answer1, output_2=answer2)
system_prompt = ""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt = tokenizer([prompt], return_tensors="pt").to(model.device)
# Generate judgment for the given prompt
with torch.no_grad():
    generated_ids = model.generate(prompt.input_ids, do_sample=False, max_new_tokens=2048)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(prompt.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
# response: {"原因": "回答1中的-1是错误的,因为sigmoid函数的实际输出范围是0到1,而不是包括-1。回答2准确地描述了sigmoid函数的输出范围是0到1。",\n "更好的回答": 2}
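# Rough English translation of the response:
# {"reason": "The -1 in answer 1 is wrong, because the actual output range of the sigmoid
#  function is 0 to 1 and does not include -1. Answer 2 correctly describes the sigmoid
#  output range as 0 to 1.", "better answer": 2}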
```
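Since the judge is prompted to answer in JSON, the response can usually be parsed directly. Below is a small illustrative sketch (field names as in the prompt above); generation is not guaranteed to be valid JSON, so a fallback is included.
```python
import json

# The judgment uses Chinese keys: "原因" = rationale, "更好的回答" = preferred answer (1 or 2).
try:
    judgment = json.loads(response)
    rationale = judgment["原因"]
    preferred = int(judgment["更好的回答"])
except (json.JSONDecodeError, KeyError, ValueError):
    # Fall back to inspecting the raw text if the output is not well-formed JSON.
    rationale, preferred = response, None

print(f"Preferred answer: {preferred}")
print(f"Rationale: {rationale}")
```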
## Performance
In each column, bold marks the best result and underline the second best.
<table>
<tr>
<th rowspan="2">Model</th>
<th rowspan="2">Infinity-<br>Preference</th>
<th rowspan="2">Ultra-<br>Feedback</th>
<th rowspan="2">PKU-<br>SafeRLHF</th>
<th colspan="4">Reward-Bench</th>
</tr>
<tr>
<th>Chat</th>
<th>Chat-H</th>
<th>Safety</th>
<th>Reasoning</th>
</tr>
<tr>
<td>Llama3.1-8B</td>
<td>59.0</td>
<td>62.9</td>
<td>66.4</td>
<td>80.7</td>
<td>49.8</td>
<td>64.0</td>
<td>68.1</td>
</tr>
<tr>
<td>Llama3.1-70B</td>
<td>64.0</td>
<td>71.4</td>
<td>67.6</td>
<td><b>97.2</b></td>
<td>70.2</td>
<td>82.8</td>
<td>86.0</td>
</tr>
<tr>
<td>Qwen2-7B</td>
<td>59.0</td>
<td>64.5</td>
<td>67.2</td>
<td>91.3</td>
<td>44.8</td>
<td>73.6</td>
<td>69.0</td>
</tr>
<tr>
<td>Qwen2.5-72B</td>
<td>70.0</td>
<td>66.0</td>
<td>58.7</td>
<td>86.6</td>
<td>61.4</td>
<td>74.5</td>
<td><b>90.7</b></td>
</tr>
<tr>
<td>Auto-J</td>
<td>69.0</td>
<td>63.9</td>
<td>66.9</td>
<td>93.0</td>
<td>40.0</td>
<td>65.5</td>
<td>50.5</td>
</tr>
<tr>
<td>Prometheus 2</td>
<td>68.0</td>
<td>63.3</td>
<td>63.0</td>
<td>85.5</td>
<td>49.1</td>
<td>77.1</td>
<td>76.5</td>
</tr>
<tr>
<td>GPT-4o</td>
<td><u>75.0</u></td>
<td><u>72.2</u></td>
<td><b>69.6</b></td>
<td><u>95.3</u></td>
<td><u>74.3</u></td>
<td><u>87.6</u></td>
<td>86.9</td>
</tr>
<tr>
<td>Con-J (ours)</td>
<td><b>81.0</b></td>
<td><b>73.0</b></td>
<td><u>68.4</u></td>
<td>91.3</td>
<td><b>79.6</b></td>
<td><b>88.0</b></td>
<td><u>87.1</u></td>
</tr>
</table>
## Reference
```bibtex
@misc{ye2024scalarrewardmodellearning,
  title={Beyond Scalar Reward Model: Learning Generative Judge from Preference Data},
  author={Ziyi Ye and Xiangsheng Li and Qiuchi Li and Qingyao Ai and Yujia Zhou and Wei Shen and Dong Yan and Yiqun Liu},
  year={2024},
  eprint={2410.03742},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.03742},
}
```