---
license: mit
---

The v1 reward model does not distinguish between a good and a harmful response; in the example below it actually scores the rude reply higher than the helpful one.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'OpenAssistant/reward-model-deberta-v3-base'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "I just got out of prison, any suggestion?"
good_helpful = "I am sorry to hear about it, it must be a hard time inside"
bad_text = "Stay away from me, you scumbag convict"

# Score each (prompt, response) pair with the reward model
pos = tokenizer(prompt, good_helpful, return_tensors='pt')
neg = tokenizer(prompt, bad_text, return_tensors='pt')
pos_score = model(**pos).logits[0]
neg_score = model(**neg).logits[0]
print(pos_score, neg_score)
# >> tensor([-4.1652], grad_fn=<AddmmBackward0>) tensor([-1.5923], grad_fn=<AddmmBackward0>)
```

This new version adds the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset to the training data, which allows the resulting model to rank the rude response lower than the helpful one:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'theblackcat102/reward-model-deberta-v3-base-v2'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "I just got out of prison, any suggestion?"
good_helpful = "I am sorry to hear about it, it must be a hard time inside"
bad_text = "Stay away from me, you scumbag convict"

pos = tokenizer(prompt, good_helpful, return_tensors='pt')
neg = tokenizer(prompt, bad_text, return_tensors='pt')
pos_score = model(**pos).logits[0]
neg_score = model(**neg).logits[0]
print(pos_score, neg_score)
# >> tensor([-1.3449], grad_fn=<AddmmBackward0>) tensor([-2.0942], grad_fn=<AddmmBackward0>)
```
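Because the model emits a single scalar logit per (prompt, response) pair, scores for different responses to the same prompt can be compared directly. The snippet below is a minimal sketch (not part of the original card) of one way to turn two scores into a pairwise preference probability via the sigmoid of the score difference, in the Bradley-Terry style commonly used for reward models; the helper names `reward_score` and `preference_probability` are made up for illustration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'theblackcat102/reward-model-deberta-v3-base-v2'
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def reward_score(prompt: str, response: str) -> torch.Tensor:
    # Scalar reward assigned to a (prompt, response) pair
    inputs = tokenizer(prompt, response, return_tensors='pt')
    with torch.no_grad():
        return model(**inputs).logits[0, 0]

def preference_probability(prompt: str, response_a: str, response_b: str) -> float:
    # Sigmoid of the score difference: probability that response_a
    # is preferred over response_b (Bradley-Terry style comparison)
    diff = reward_score(prompt, response_a) - reward_score(prompt, response_b)
    return torch.sigmoid(diff).item()

prompt = "I just got out of prison, any suggestion?"
helpful = "I am sorry to hear about it, it must be a hard time inside"
rude = "Stay away from me, you scumbag convict"

# With the v2 model the helpful reply should win, i.e. probability > 0.5
print(preference_probability(prompt, helpful, rude))
```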