---
license: apache-2.0
datasets:
- Skywork/Skywork-Reward-Preference-80K-v0.1
base_model:
- Qwen/Qwen2-7B-Instruct
---

## Introduction

Con-J-Qwen2-7B (learning the generative ***J***udge using self-generated ***Con***trastive judgments) is a generative judge built on Qwen2-7B-Instruct and trained on preference data from Skywork/Skywork-Reward-Preference-80K-v0.1.

To train it, we prompt the pre-trained Qwen2-7B-Instruct model to generate positive and negative judgments, each supported by a natural-language rationale. These self-generated contrastive judgment pairs are then used to train the generative judge with Direct Preference Optimization (DPO). As a result, Con-J learns to act as a generative judge that provides accurate judgments backed by supporting rationales.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZiyiYe/Con-J-Qwen2-7B"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "What is the range of the numeric output of a sigmoid node in a neural network?"
answer1 = "The output of a sigmoid node is bounded between -1 and 1."
answer2 = "The output of a sigmoid node is bounded between 0 and 1."

# Format and tokenize the conversations.
# The judge prompt is written in Chinese. Rough English translation:
# "As an evaluation expert, given a question and two candidate answers, choose the
# answer that best satisfies coherence, accuracy, coverage, and the overall quality
# defined above. Output your judgment as JSON, where "原因" (reason) is your
# explanation and "更好的回答" (better answer) is the integer 1 or 2."
CON_J_PROMPT = """作为一个评价专家，给定一个问题和它的两个可能的回答，请选出哪一个回答在连贯性、准确性、覆盖度和上述定义的整体质量方面最为符合。请用JSON格式输出你的判断, 其中"原因"是你提供的解释，"更好的回答"是整数类型的1或2，例如{{"原因": "你的解释", "更好的回答": 1}}。以下是问题和候选回答的内容：
\n问题：{instruction}
回答1：{output_1}
回答2：{output_2}"""
user_prompt = CON_J_PROMPT.format(instruction=question, output_1=answer1, output_2=answer2)
system_prompt = ""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move the tokenized inputs to the model's device (needed with device_map='auto')
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Generate judgment for the given prompt (greedy decoding)
with torch.no_grad():
    generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=2048)

# Drop the prompt tokens so only the newly generated judgment is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

# response: {"原因": "回答1中的-1是错误的，因为sigmoid函数的实际输出范围是0到1，而不是包括-1。回答2准确地描述了sigmoid函数的输出范围是0到1。",\n "更好的回答": 2}
# In English: the reason says answer 1's -1 is wrong because the sigmoid output
# range is 0 to 1, and answer 2 describes that range correctly; the better answer is 2.
```

## Performance

| Model | Infinity-Preference | Ultra-Feedback | PKU-SafeRLHF | Reward-Bench Chat | Reward-Bench Chat-H | Reward-Bench Safety | Reward-Bench Reasoning |
|:--|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| Llama3.1-8B | 59.0 | 62.9 | 66.4 | 80.7 | 49.8 | 64.0 | 68.1 |
| Llama3.1-70B | 64.0 | 71.4 | 67.6 | 97.2 | 70.2 | 82.8 | 86.0 |
| Qwen2-7B | 59.0 | 64.5 | 67.2 | 91.3 | 44.8 | 73.6 | 69.0 |
| Qwen2.5-72B | 70.0 | 66.0 | 58.7 | 86.6 | 61.4 | 74.5 | 90.7 |
| Auto-J | 69.0 | 63.9 | 66.9 | 93.0 | 40.0 | 65.5 | 50.5 |
| Prometheus 2 | 68.0 | 63.3 | 63.0 | 85.5 | 49.1 | 77.1 | 76.5 |
| GPT-4o | 75.0 | 72.2 | 69.6 | 95.3 | 74.3 | 87.6 | 86.9 |
| Con-J (ours) | 81.0 | 73.0 | 68.4 | 91.3 | 79.6 | 88.0 | 87.1 |
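
To score a judge like this on your own pairwise preference data, one plausible approach is to parse the `"更好的回答"` field from each JSON judgment and compare it with the labelled preference. The sketch below is an illustration, not the authors' evaluation script: `parse_preference` and `pairwise_accuracy` are hypothetical helper names, and counting unparseable outputs as wrong is an assumption.

```python
import json
import re
from typing import List, Optional


def parse_preference(response: str) -> Optional[int]:
    """Return 1 or 2 from a Con-J style JSON judgment, or None if it cannot be parsed."""
    # Take the span from the first "{" to the last "}" in case extra text surrounds the JSON.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        judgment = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    choice = judgment.get("更好的回答")  # "better answer" key from the prompt format
    # Accept only the integers 1 or 2 (bool is a subclass of int, so exclude it explicitly).
    if isinstance(choice, bool) or not isinstance(choice, int):
        return None
    return choice if choice in (1, 2) else None


def pairwise_accuracy(responses: List[str], labels: List[int]) -> float:
    """Fraction of judgments whose parsed choice equals the labelled preference.

    Assumption: outputs that cannot be parsed are counted as incorrect.
    """
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(parse_preference(r) == y for r, y in zip(responses, labels))
    return correct / len(labels)
```

For example, `parse_preference(response)` applied to the `response` string from the Usage section would return the integer chosen by the judge, which can then be compared against a human preference label.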

## Reference

```
@misc{ye2024scalarrewardmodellearning,
      title={Beyond Scalar Reward Model: Learning Generative Judge from Preference Data},
      author={Ziyi Ye and Xiangsheng Li and Qiuchi Li and Qingyao Ai and Yujia Zhou and Wei Shen and Dong Yan and Yiqun Liu},
      year={2024},
      eprint={2410.03742},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03742},
}
```