---
license: apache-2.0
datasets:
- Skywork/Skywork-Reward-Preference-80K-v0.1
base_model:
- Qwen/Qwen2-7B-Instruct
---

## Introduction

Con-J-Qwen2-7B (learning the generative ***J***udge using self-generated ***Con***trastive judgments) is an advanced generative judge built on the Qwen2-7B-Instruct model and trained on the Skywork/Skywork-Reward-Preference-80K-v0.1 dataset.

Con-J-Qwen2-7B is trained from preference data. We prompt the pre-trained Qwen2-7B-Instruct model to generate positive and negative judgments, both supported by natural-language rationales. The self-generated contrastive judgment pairs are then used to train the generative judge with Direct Preference Optimization (DPO). In this way, Con-J learns to act as a generative judge that provides accurate judgments together with supporting rationales.

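For intuition, here is a minimal, illustrative sketch (not the authors' training code) of how a self-generated contrastive judgment pair could be packed into a DPO-style preference example. The field names (`prompt`, `chosen`, `rejected`) follow the convention commonly used by DPO trainers such as TRL, and the `JUDGE_TEMPLATE` and `build_dpo_example` names are hypothetical; the released model uses the Chinese prompt template shown in the Usage section below.

```python
# Illustrative sketch only: packing one self-generated contrastive judgment
# pair into a DPO-style preference example. Names and field layout are
# assumptions, not the authors' training code.

# Simplified English stand-in for the judging prompt; the released model
# uses the Chinese CON_J_PROMPT template shown in the Usage section below.
JUDGE_TEMPLATE = (
    "Given a question and two candidate answers, decide which answer is better "
    "and explain why in JSON.\n"
    "Question: {instruction}\nAnswer 1: {output_1}\nAnswer 2: {output_2}"
)

def build_dpo_example(question: str, answer1: str, answer2: str,
                      positive_judgment: str, negative_judgment: str) -> dict:
    """Return one DPO training example from a contrastive judgment pair.

    `positive_judgment` agrees with the human preference label in the source
    preference data; `negative_judgment` contradicts it. Both contain a
    natural-language rationale generated by the pre-trained model itself.
    """
    judge_prompt = JUDGE_TEMPLATE.format(
        instruction=question, output_1=answer1, output_2=answer2
    )
    return {
        "prompt": judge_prompt,
        "chosen": positive_judgment,    # judgment consistent with the preference label
        "rejected": negative_judgment,  # judgment contradicting the preference label
    }
```
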
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZiyiYe/Con-J-Qwen2-7B"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "What is the range of the numeric output of a sigmoid node in a neural network?"
answer1 = "The output of a sigmoid node is bounded between -1 and 1."
answer2 = "The output of a sigmoid node is bounded between 0 and 1."

# Judging prompt template (in Chinese). It asks the model, acting as an
# evaluation expert, to choose which of the two candidate answers is better in
# terms of coherence, accuracy, coverage, and overall quality, and to reply in
# JSON as {"原因": "<your explanation>", "更好的回答": 1 or 2}
# ("原因" = "reason", "更好的回答" = "better answer").
CON_J_PROMPT = """作为一个评价专家,给定一个问题和它的两个可能的回答,请选出哪一个回答在连贯性、准确性、覆盖度和上述定义的整体质量方面最为符合。请用JSON格式输出你的判断, 其中"原因"是你提供的解释,"更好的回答"是整数类型的1或2,例如{{"原因": "你的解释", "更好的回答": 1}}。以下是问题和候选回答的内容:
    \n问题:{instruction}
回答1:{output_1}
回答2:{output_2}"""
user_prompt = CON_J_PROMPT.format(instruction=question, output_1=answer1, output_2=answer2)
system_prompt = ""
messages = [
    {"role": "system", "content": system_prompt,},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt = tokenizer([prompt], return_tensors="pt").to(model.device)

# Generate judgment for the given prompt
with torch.no_grad():
    generated_ids = model.generate(**prompt, do_sample=False, max_new_tokens=2048)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(prompt.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Example response:
# {"原因": "回答1中的-1是错误的,因为sigmoid函数的实际输出范围是0到1,而不是包括-1。回答2准确地描述了sigmoid函数的输出范围是0到1。",
#  "更好的回答": 2}
# Translation: {"reason": "The -1 in Answer 1 is wrong, because the actual
# output range of the sigmoid function is 0 to 1 and does not include -1.
# Answer 2 correctly describes the sigmoid output range as 0 to 1.",
#  "better answer": 2}

```
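
The judgment is returned as a JSON string, so downstream code typically parses it to recover the preferred answer index. Below is a minimal sketch (not part of the released code) that continues the snippet above and assumes the model emits well-formed JSON with the keys "原因" (reason) and "更好的回答" (better answer):

```python
import json

# Parse the JSON judgment emitted by Con-J.
# "原因" = reason (rationale), "更好的回答" = better answer (1 or 2).
# Greedy decoding usually yields valid JSON, but fall back gracefully if not.
try:
    judgment = json.loads(response)
    better = int(judgment["更好的回答"])  # 1 -> answer1 preferred, 2 -> answer2 preferred
    rationale = judgment["原因"]
    print(f"Preferred answer: {better}\nRationale: {rationale}")
except (json.JSONDecodeError, KeyError, ValueError):
    # Keep the raw text if the output is not the expected JSON.
    print("Could not parse judgment as JSON:", response)
```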


## Performance

Results on Infinity-Preference, Ultra-Feedback, PKU-SafeRLHF, and the four Reward-Bench subsets (higher is better). **Bold** marks the best result in each column; <u>underline</u> marks the second best.

<table>
  <tr>
    <th rowspan="2">Model</th>
    <th rowspan="2">Infinity-<br>Preference</th>
    <th rowspan="2">Ultra-<br>Feedback</th>
    <th rowspan="2">PKU-<br>SafeRLHF</th>
    <th colspan="4">Reward-Bench</th>
  </tr>
  <tr>
    <th>Chat</th>
    <th>Chat Hard</th>
    <th>Safety</th>
    <th>Reasoning</th>
  </tr>
  <tr>
    <td>Llama3.1-8B</td>
    <td>59.0</td>
    <td>62.9</td>
    <td>66.4</td>
    <td>80.7</td>
    <td>49.8</td>
    <td>64.0</td>
    <td>68.1</td>
  </tr>
  <tr>
    <td>Llama3.1-70B</td>
    <td>64.0</td>
    <td>71.4</td>
    <td>67.6</td>
    <td><b>97.2</b></td>
    <td>70.2</td>
    <td>82.8</td>
    <td>86.0</td>
  </tr>
  <tr>
    <td>Qwen2-7B</td>
    <td>59.0</td>
    <td>64.5</td>
    <td>67.2</td>
    <td>91.3</td>
    <td>44.8</td>
    <td>73.6</td>
    <td>69.0</td>
  </tr>
  <tr>
    <td>Qwen2.5-72B</td>
    <td>70.0</td>
    <td>66.0</td>
    <td>58.7</td>
    <td>86.6</td>
    <td>61.4</td>
    <td>74.5</td>
    <td><b>90.7</b></td>
  </tr>
  <tr>
    <td>Auto-J</td>
    <td>69.0</td>
    <td>63.9</td>
    <td>66.9</td>
    <td>93.0</td>
    <td>40.0</td>
    <td>65.5</td>
    <td>50.5</td>
  </tr>
  <tr>
    <td>Prometheus 2</td>
    <td>68.0</td>
    <td>63.3</td>
    <td>63.0</td>
    <td>85.5</td>
    <td>49.1</td>
    <td>77.1</td>
    <td>76.5</td>
  </tr>
  <tr>
    <td>GPT-4o</td>
    <td><u>75.0</u></td>
    <td><u>72.2</u></td>
    <td><b>69.6</b></td>
    <td><u>95.3</u></td>
    <td><u>74.3</u></td>
    <td><u>87.6</u></td>
    <td>86.9</td>
  </tr>
  <tr>
    <td>Con-J (ours)</td>
    <td><b>81.0</b></td>
    <td><b>73.0</b></td>
    <td><u>68.4</u></td>
    <td>91.3</td>
    <td><b>79.6</b></td>
    <td><b>88.0</b></td>
    <td><u>87.1</u></td>
  </tr>
</table>



## Reference
```bibtex
@misc{ye2024scalarrewardmodellearning,
      title={Beyond Scalar Reward Model: Learning Generative Judge from Preference Data}, 
      author={Ziyi Ye and Xiangsheng Li and Qiuchi Li and Qingyao Ai and Yujia Zhou and Wei Shen and Dong Yan and Yiqun Liu},
      year={2024},
      eprint={2410.03742},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03742}, 
}
```