Results validation with other benchmarks?
Hello!
Nice work! Very interesting results!
Did you try to validate with other benchmarks? I tried to check with MMLU (lm-eval-harness), and it looks like the MMLU results degrade a bit compared with the original suzume_multilingual. Wondering if the MT-Bench score is preferable...
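For what it's worth, the check was roughly along these lines (a rough sketch using the lm-eval-harness Python API; exact arguments may differ between harness versions, and the batch size and model list here are just illustrative):

```python
# Rough sketch of the MMLU comparison with lm-eval-harness (v0.4-style Python API);
# exact arguments may differ between harness versions.
import lm_eval

for model_id in [
    "lightblue/suzume-llama-3-8B-multilingual",  # original model
    # ... plus the ORPO-trained model being compared
]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=["mmlu"],
        num_fewshot=5,   # standard 5-shot MMLU setting
        batch_size=8,    # adjust for available GPU memory
    )
    print(model_id, results["results"])
```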
Yeah, as I found in the paper, the Belebele scores drop when doing ORPO training while the MT-Bench scores increase. I think this is because they measure different things: MT-Bench measures the chat ability of the generated output, while Belebele and MMLU measure the logit scores of the "correct" answer. So I think this ORPO-trained model will be better at chatting, but worse at logit-based knowledge-testing tasks. We found in the paper that lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 did better at Belebele than the base model, so that one might also be better at MMLU?
Thanks. Yes, you are right: I've checked https://huggingface.co/lightblue/suzume-llama-3-8B-multilingual-orpo-borda-top25 and MMLU is better for it, as are the other logit-based benchmarks. BTW, when you did the MT-Bench scoring, did you have any kind of length control? As mentioned in some papers (e.g. https://arxiv.org/html/2404.04475v1), OpenAI's GPT models typically prefer lengthy answers, so maybe that is also the case with ORPO-trained models?
Yes, there is a preference for long answers. And in this version of the model, the answers are just huge. The training dataset should contain long answers among both the chosen (positive) and rejected (negative) examples, otherwise the model will simply learn that it should write a long answer. Hence, you need to carefully validate the training dataset, both in terms of answer lengths and in terms of which examples are accepted and rejected.
Hey, yeah, I agree that that is something I need to work on for the next iteration of this model. If you just say "Hi" to the model, it gives this loooong answer about how it is here to help and how useful it will be. Ironically, not very helpful haha.
The idea of training using long negatives is a good one - I have not checked whether the positives are substantially longer than the negatives, but I would wager they are.
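Something like the quick check below should settle it. This is only a sketch: it assumes a preference dataset with plain-text `chosen` and `rejected` columns, and the dataset ID is a placeholder rather than the exact training set used here.

```python
# Rough sketch for comparing answer lengths of chosen vs. rejected examples.
# Assumes "chosen" and "rejected" text columns; the dataset ID is a placeholder.
from datasets import load_dataset
from statistics import mean

ds = load_dataset("lightblue/mitsu_full_borda", split="train")  # placeholder ID

chosen_lens = [len(str(ex["chosen"])) for ex in ds]
rejected_lens = [len(str(ex["rejected"])) for ex in ds]

print(f"mean chosen length:   {mean(chosen_lens):.0f} chars")
print(f"mean rejected length: {mean(rejected_lens):.0f} chars")
print(f"ratio (chosen / rejected): {mean(chosen_lens) / mean(rejected_lens):.2f}")
```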
However, I think I will probably focus on training with a method like SimPO (https://arxiv.org/pdf/2405.14734), as it naturally includes a length penalty, which would (I think) mean that I could use answers of any length for both positives and negatives.
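As I understand the paper, the length control comes from dividing each response's log-probability by its token length before taking the preference margin. A rough sketch of that objective (the beta and gamma values here are just illustrative, not tuned):

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lens, rejected_lens,
               beta=2.0, gamma=1.0):
    """Rough sketch of the SimPO objective (https://arxiv.org/pdf/2405.14734).

    chosen_logps / rejected_logps: summed token log-probs of each response
    under the policy model; chosen_lens / rejected_lens: response lengths
    in tokens. beta and gamma here are illustrative values only.
    """
    # Length-normalized implicit reward: average per-token log-prob, scaled by beta.
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Prefer the chosen response over the rejected one by at least margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```

Because each response's reward is normalized by its own length, neither side should win just for being longer, which is exactly the failure mode we are seeing here.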