Possible bug that decreases performance and (maybe) prevents the model from generating the stop token
Hi, first of all thanks for the great tutorial (https://huggingface.co/blog/mlabonne/orpo-llama-3).
When I ran inference with the trained model, I noticed that it sometimes did not stop generating. I found that the prompt used during inference differs from the prompt used during training. More precisely, in `format_chat_template` we apply the chat template to `row["chosen"]`, which produces the following prompt template:

```
<|im_start|>user\n [...] <|im_end|>\n<|im_start|>assistant\n [...] <|im_end|>
```

--> so it not only contains the answer, but also the user's question!
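A minimal sketch to make this concrete (the checkpoint name and the messages are placeholders; any tokenizer with a ChatML-style chat template shows the same behaviour):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint: any tokenizer with a ChatML-style chat template.
tokenizer = AutoTokenizer.from_pretrained("your/chatml-model")

chosen = [
    {"role": "user", "content": "And we let them have a great time!"},
    {"role": "assistant", "content": "We ensure our children enjoy themselves!"},
]

# "chosen" is a full conversation, so the rendered string contains
# the user turn *and* the assistant turn, not just the answer.
print(tokenizer.apply_chat_template(chosen, tokenize=False))
```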
Later on, the `orpo_trainer` uses not just this "answer" but `prompt + answer`:
```python
def build_tokenized_answer(self, prompt, answer):
    """
    Llama tokenizer does satisfy `enc(a + b) = enc(a) + enc(b)`.
    It does ensure `enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]`.
    Reference:
        https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
    """
    full_tokenized = self.tokenizer(prompt + answer, add_special_tokens=False)
    [...]
```
Therefore, during training it sees the user's question twice (debug output):

```
And we let them have a great time! [...] <|im_start|>user\n And we let them have a great time! [...] <|im_end|>\n<|im_start|>assistant\n We ensure our children enjoy themselves and have a great time! [...] <|im_end|>\n
```
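To spell out where the duplication comes from, here is a toy reconstruction of the string that ends up in `prompt + answer` (texts shortened, purely illustrative):

```python
# prompt column: the raw user question (shortened here)
prompt = "And we let them have a great time! [...] "

# chosen column after format_chat_template: the *full* templated conversation
chosen = (
    "<|im_start|>user\n And we let them have a great time! [...] <|im_end|>\n"
    "<|im_start|>assistant\n We ensure our children enjoy themselves and have a great time! [...] <|im_end|>\n"
)

# what build_tokenized_answer effectively tokenizes:
full_text = prompt + chosen
print(full_text)  # the user question appears twice, matching the debug output above
```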
During inference, we build the prompt correctly (at least IMO):

```
<|im_start|>user\n And we let them have a great time! [...] <|im_end|>\n<|im_start|>assistant\n We ensure our children enjoy themselves and have a great time! [...] <|im_end|>\n
```
But now our training distribution and test distribution differ, which may decrease performance; in my case, the model did not generate the stop token and instead produced some weird patterns. I verified this by overfitting on a single sample and running inference on it afterwards.
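A minimal sketch of such a check (model path, sample text, and generation settings are placeholders, not the exact values I used):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: the checkpoint overfitted on the single sample.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/finetuned-model", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-model")

question = "And we let them have a great time! [...]"  # the overfitted sample's user message

# Build the inference prompt the usual way: user turn + generation prompt.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:]))
# With the question duplicated during training, generation ran past <|im_end|>;
# with the split columns it stops at the stop token as expected.
```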
To fix this, I created my own dataset and made sure that the "chosen" element contains only the AI answer and "prompt" only the user's question. Here is an example:

```python
{
    "chosen": [{"content": "Hi, I'm an AI", "role": "assistant"}],
    "rejected": [{"content": "", "role": "assistant"}],
    "prompt": [{"content": question, "role": "user"}],
    "question": [{"content": question, "role": "user"}],
}
```
Therefore, `format_chat_template` is adapted to:

```python
def format_chat_template(row):
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    row["prompt"] = tokenizer.apply_chat_template(row["prompt"], tokenize=False)
    row["question"] = tokenizer.apply_chat_template(row["question"], tokenize=False)
    return row
```
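For datasets like orpo-dpo-mix-40k, where "chosen" is the full conversation, the same split could probably be produced with a mapping like this (untested sketch; it assumes "chosen"/"rejected" are message lists ending with the assistant reply, `dataset` is the loaded dataset, and `split_answer_from_prompt` is just an illustrative name):

```python
def split_answer_from_prompt(row):
    # Keep everything before the assistant reply as the prompt/question,
    # and only the final assistant message as the answer.
    row["prompt"] = row["chosen"][:-1]
    row["question"] = row["chosen"][:-1]
    row["rejected"] = row["rejected"][-1:]
    row["chosen"] = row["chosen"][-1:]
    return row

dataset = dataset.map(split_answer_from_prompt)
dataset = dataset.map(format_chat_template)
```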
Honestly, I don't know why the orpo-dpo-mix-40k dataset includes both the question and the answer in "chosen", but splitting them fixed the problem for me.