Poor Performance - Always seems to return the label 0 (Entailment)

#1
by lcahill - opened

Hi there, I love the idea of this, but I am not seeing good performance: the model returns entailment (0) for everything. I am wondering if I am doing something wrong, and if not, what accuracy do you get when you evaluate this adapter on the MNLI test set?

from transformers import AutoModelForCausalLM, AutoTokenizer, MistralForCausalLM, LlamaTokenizerFast
import torch

device = torch.device('cuda')

model_id = "mistralai/Mistral-7B-v0.1"
peft_model_id = "predibase/glue_mnli"

model: MistralForCausalLM = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=torch.bfloat16)
model.load_adapter(peft_model_id)


input_string = """You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise contradicts the hypothesis, return 2. Otherwise, if the premise does neither, return 1.

### Premise: The customer's feedback was "I went to Bank of America and while I think their personal loans are absolute garbage, I was very impressed with their customer service. I was also impressed with the cleanliness of the branch. I will definitely be returning to ANZ in the future. Might get a credit card too, they had good rates." 

### Hypothesis: The customer's feedback includes mention of a white unicorn.

### Label:"""

tokenizer: LlamaTokenizerFast = AutoTokenizer.from_pretrained(model_id)
input_tokens = tokenizer(input_string, return_tensors="pt").to(device)
generation_output = model.generate(**input_tokens, max_length=4000, pad_token_id=tokenizer.eos_token_id)
output = tokenizer.decode(generation_output[0])
print(output)

Output is:

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:07<00:00,  3.55s/it]
<s> You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise contradicts the hypothesis, return 2. Otherwise, if the premise does neither, return 1.

### Premise: The customer's feedback was "I went to Bank of America and while I think their personal loans are absolute garbage, I was very impressed with their customer service. I was also impressed with the cleanliness of the branch. I will definitely be returning to ANZ in the future. Might get a credit card too, they had good rates." 

### Hypothesis: The customer's feedback includes mention of a white unicorn.

### Label:<s> 0</s>

This is not an abnormal example. I have tried this with multiple (fictitious) tests, and every time it returns 0.
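For what it's worth, a more direct check than free-form generation is to look at the model's next-token probabilities over just the three label strings, rather than decoding the whole sequence. Below is a minimal sketch of that idea; the `label_probs` helper is my own illustration (not part of the adapter or the transformers API), and the commented-out usage assumes the `model`, `tokenizer`, and `input_tokens` variables from the snippet above (it needs the GPU and the downloaded weights to actually run):

```python
import torch

def label_probs(next_token_logits: torch.Tensor, label_token_ids: list[int]) -> torch.Tensor:
    """Renormalize the next-token distribution over only the candidate label
    token ids (e.g. the ids for '0', '1', '2')."""
    return torch.softmax(next_token_logits[label_token_ids], dim=-1)

# Hypothetical usage with the model loaded above (requires GPU + weights):
# with torch.no_grad():
#     logits = model(**input_tokens).logits[0, -1]  # logits for the next token
# ids = [tokenizer(lbl, add_special_tokens=False).input_ids[-1] for lbl in ["0", "1", "2"]]
# print(label_probs(logits, ids))  # probability mass over the three labels
```

If the probability for "0" dominates even on nonsense hypotheses like the unicorn one, that would confirm the issue is in the weights rather than in the decoding settings.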

Predibase org

Hi! I can definitely take a look at this -- there might be an issue with how the weights were uploaded. The accuracy we received on this dataset was about 87%.

Please feel free to check out the adapter's performance here: https://predibase.com/lora-land

Predibase org

Also, if you want to join our Discord, we are likely to respond more quickly to comments there: https://discord.gg/CBgdrGnZjy
