Idea: How about improving the quality of the answers using GPT-4?
First of all, Daniel, I love how you improved the efficiency by using less data, and the output quality by using the best data. But I already applauded you for that on Twitter/X.
Picking random examples from the Orca DPO dataset, I always find places where it could be made better.
I wonder if it could be improved by telling the judge LLM to rate and then improve the answers along certain dimensions.
Examples:
1) Instruction following
"Rate how much the length of the response follows the user's explicit (stated) or implicit (intended) needs.
Implicit: simple questions need shorter answers, more complex questions need longer ones"
I've seen multiple times that the system prompt and the user prompt were at odds regarding length (in the Intel DPO dataset).
This could definitely be improved: system prompt adherence strengthened, and user prompt following along with it.
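As a rough sketch of that rate-and-improve loop for the length dimension (the prompt wording and the `rate_and_improve` helper name are my own, built on the OpenAI Python SDK):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical judge prompt built around the length rubric above.
JUDGE_PROMPT = """\
Rate from 1 to 5 how well the length of the response matches the user's
explicit (stated) or implicit (intended) needs. Implicit: simple questions
need shorter answers, more complex questions need longer ones.

System prompt: {system}
User prompt: {user}
Response: {response}

Give the rating first, then rewrite the response so it would score 5."""

def rate_and_improve(system: str, user: str, response: str) -> str:
    """Ask GPT-4 to rate a response on length-appropriateness and rewrite it."""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(system=system, user=user, response=response),
        }],
    )
    return result.choices[0].message.content
```

The rewritten answer could then replace the 'chosen' in the pair, or be added as a new, stronger 'chosen'.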
2) Summarizing (entity density)
I would choose the 'chosen' answer too, given the two!
But it's actually clear that the 'chosen' answer could be improved as well (and in this case the rejected one gives hints how).
The chosen answer follows the "one sentence" instruction well, but the rejected one is actually better at entity extraction & information density.
It's possible to improve it by prompting GPT-4:
https://sharegpt.com/c/UQwU79R
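In the same spirit, here is a minimal Chain-of-Density-style prompt sketch (my paraphrase of the technique, not the paper's exact wording) that raises entity density while keeping the one-sentence constraint:

```python
from openai import OpenAI

client = OpenAI()

# Paraphrased Chain-of-Density loop: densify the summary without lengthening it.
COD_PROMPT = """\
Summarize the article below in ONE sentence.
Then repeat 3 times: identify 1-3 informative entities from the article
that are missing from your latest summary, and rewrite the summary to
include them WITHOUT increasing its length.

Article: {article}"""

def dense_one_sentence_summary(article: str) -> str:
    """Produce progressively denser one-sentence summaries; the last is densest."""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": COD_PROMPT.format(article=article)}],
    )
    return result.choices[0].message.content
```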
A relevant article on the topic of summarization & entity density improvements:
[Article: Smarter Summaries w/ Finetuning GPT-3.5 and Chain of Density]
A fine-tuned GPT-3.5 can match GPT-4 performance (with only 20 examples, 5 epochs)!
This gave me the idea that there might be low-hanging fruit here in fine-tuning smaller models with quality examples.
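If that 20-example, 5-epoch recipe holds up, the fine-tuning step itself is tiny; a sketch with the OpenAI fine-tuning API (the JSONL file name is hypothetical):

```python
from openai import OpenAI

client = OpenAI()

# Upload ~20 curated examples in JSONL chat format ({"messages": [...]} per line).
train_file = client.files.create(
    file=open("dense_summaries.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# 5 epochs over a small, high-quality set, per the article's recipe.
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 5},
)
print(job.id)
```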
3) More?
And this doesn't just apply to summarization.
We simply need to ensure quality examples for the different use cases in the dataset.
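To make that concrete, a sketch that applies the rate_and_improve helper from above across the dataset using the Hugging Face datasets library (I'm assuming the system/question/chosen column names of Intel/orca_dpo_pairs; routing different dimensions per use case would slot in here):

```python
from datasets import load_dataset

ds = load_dataset("Intel/orca_dpo_pairs", split="train")

def improve_chosen(example):
    # Reuses the rate_and_improve sketch from above; a per-use-case router
    # (length, entity density, ...) could pick the judge prompt here.
    example["chosen"] = rate_and_improve(
        example["system"], example["question"], example["chosen"]
    )
    return example

# Start with a small slice to sanity-check quality (and GPT-4 cost).
improved = ds.select(range(100)).map(improve_chosen)
```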