Possible bug that decreases performance and (maybe) prevents the stop token from being generated

#6
by davidpeer - opened

Hi, first of all thanks for the great tutorial (https://huggingface.co/blog/mlabonne/orpo-llama-3).

When I ran inference with the trained model, I noticed that it sometimes did not stop. I found that the prompt at inference time differs from the prompt during training. More precisely, in format_chat_template we apply the chat template to row["chosen"], which produces the following prompt template:

<|im_start|>user\n [...] <|im_end|>\n<|im_start|>assistant\n [...] <|im_end|>

--> so it not only contains the answer, but also the user's question!
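
For context, the formatting function from the tutorial looks roughly like this (paraphrased from the blog post; apply_chat_template renders the whole conversation, user turn included):

    def format_chat_template(row):
        # "chosen" and "rejected" are full conversations (user + assistant turns),
        # so the templated strings also contain the user's question.
        row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
        return row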

Later on, the ORPOTrainer does not use this "answer" alone, but prompt + answer:

    def build_tokenized_answer(self, prompt, answer):
        """
        Llama tokenizer does satisfy `enc(a + b) = enc(a) + enc(b)`.
        It does ensure `enc(a + b) = enc(a) + enc(a + b)[len(enc(a)):]`.
        Reference:
            https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257
        """

        full_tokenized = self.tokenizer(prompt + answer, add_special_tokens=False)
        [...]

Therefore, during training the model sees the user's question twice (debug output):

And we let them have a great time! [...] <|im_start|>user\n And we let them have a great time! [...] <|im_end|>\n<|im_start|>assistant\n We ensure our children enjoy themselves and have a great time! [...] <|im_end|>\n

During inference, we build the prompt (at least IMO) correctly:

<|im_start|>user\n And we let them have a great time! [...] <|im_end|>\n<|im_start|>assistant\n We ensure our children enjoy themselves and have a great time! [...] <|im_end|>\n

But now our training distribution and test distribution differ, which may lead to decreased performance. In my case, the model did not generate the stop token, i.e. it produced some weird patterns. I verified this by overfitting on a single sample and running inference on it afterwards.
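
For reference, the inference-side check looks roughly like this (a sketch; the checkpoint path and generation settings are placeholders, not the tutorial's exact values):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder path to the ORPO-trained (single-sample overfitted) checkpoint
    model_id = "./orpo-llama-3-single-sample"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Build the inference prompt the "correct" way: only the user turn plus the generation prompt
    messages = [{"role": "user", "content": "And we let them have a great time!"}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0]))  # check whether <|im_end|> ever shows up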

To fix this, I created my own dataset and ensured that the "chosen" element contains only the AI answer and "prompt" only the user's question. Here is an example:

    {
        "chosen": [{"content": "Hi, I'm an AI", "role": "assistant"}],
        "rejected": [{"content": "", "role": "assistant"}],
        "prompt": [{"content": question, "role": "user"}],
        "question": [{"content": question, "role": "user"}],
    }
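
One way to derive such rows from the original dataset (a sketch, assuming "chosen" and "rejected" are lists of chat messages as in mlabonne/orpo-dpo-mix-40k; not my exact code):

    from datasets import load_dataset

    dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="all")

    def split_conversation(row):
        # Keep only the final assistant turn as the answer;
        # everything before it becomes the prompt/question.
        return {
            "prompt": row["chosen"][:-1],
            "question": row["chosen"][:-1],
            "chosen": row["chosen"][-1:],
            "rejected": row["rejected"][-1:],
        }

    # Drop the original columns so the new ones define the schema
    dataset = dataset.map(split_conversation, remove_columns=dataset.column_names)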

Accordingly, format_chat_template is adapted to:

    def format_chat_template(row):
        row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
        row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
        row["prompt"] = tokenizer.apply_chat_template(row["prompt"], tokenize=False)
        row["question"] = tokenizer.apply_chat_template(row["question"], tokenize=False)
        return row
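
As in the tutorial, the formatting is then applied over the whole dataset (usage sketch):

    import os

    dataset = dataset.map(format_chat_template, num_proc=os.cpu_count())

With the split columns, the string the ORPOTrainer builds internally (prompt + chosen) now matches the prompt we build at inference time.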

Honestly, I don't know why the orpo-dpo-mix-40k dataset contains both the question and the answer in "chosen", but splitting them fixed the problem for me.

