More truthful, but a lot more denials.

#9 opened by deleted

I like how this model performs across a broad spectrum of tasks and am curious if this has something to do with UNA (looking forward to the paper).

My primary criticism is that it tries too hard to prevent hallucinations, as indicated by its unusually good TruthfulQA score of 64.6. That is, for every false thing it correctly labels as false, it claims thousands of true things are false, and it continues to do so even after I correct it.

This also makes prompted storytelling effectively impossible. It will ignore key directives in the user prompt in favor of standard storytelling elements, littering the story with contradictions (e.g. he heard footsteps and a knock on the door but was still startled and caught red-handed grabbing the money off the counter). And when re-prompted to fix the errors, it will just make them again.

Sadly, Mistral has an above-average hallucination rate (e.g. a 42.1 TruthfulQA score), and without an advanced technique like self-RAG there's nothing within reason (besides lowering the temperature) that can be done about it without making an LLM falsely deny countless truths and ignore prompted storytelling directives.
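To illustrate what I mean by lowering the temperature, here's a minimal sketch, assuming a standard Hugging Face transformers setup and a placeholder Mistral checkpoint (nothing specific to cybertron). A lower sampling temperature just sharpens the token distribution toward high-probability continuations, which tends to trade creativity for fewer fabrications; it doesn't add knowledge the model lacks.

```python
# Minimal sketch (assumed setup, not a fix specific to this model):
# reduce the sampling temperature to make generations more conservative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Answer factually: who discovered penicillin?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Lower temperature sharpens the distribution toward high-probability
# (usually better-grounded) tokens; it does not add new knowledge.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.3,  # down from the common default of ~0.7-1.0
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```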

  1. It has something to do with UNA; this is not primarily about the data.
  2. Interesting. How large is the context you are talking about?
  3. It's not a storytelling model.
  4. How do the others perform? Is this something unique to cybertron, or is it mostly latent in all the others as well?
deleted

@fblgit I make a hobby out of testing LLMs by throwing the same set of trick questions, logic puzzles, coding tasks, Q&As, story prompts, and so on at them (currently >30 Mistral and Llama 2 13b fine-tunes).

This LLM performed with the leaders except in two areas: denying truths (though it also hallucinated less) and storytelling.

To answer your questions: since I want to put the LLMs to the test, my story prompts are deliberately long, convoluted five-sentence paragraphs (about 1,200 characters) filled with include/exclude directives, some of them chosen to break the standard storytelling mold in order to force contradictions. The generated stories are usually about 6-10 short paragraphs long. More than two-thirds of the LLMs performed poorly, so this LLM is in good company, and only three did a reasonably good job (Xwin 0.2 13b Llama 2 did the best, followed by loyal piano m7 and openhermes 2.5).

To clarify, this LLM is parsing the story prompts properly: it tries to force the included directives into the story, but in a way that causes contradictions with standard storytelling elements (see the example in my previous post). Many other LLMs have comprehension issues, such as completely ignoring a directive or making mistakes like swapping character names. That's not an issue with this LLM.

In regard to denying facts, this was by far the most pronounced case I've come across. My questions were selected to probe the fringes of the foundation model's knowledge, so I'm blowing this out of proportion. But Mistral does contain some of the answers, and when re-prompted they can be found, including with this LLM. All Mistrals deny truths rather than responding with 'I don't know'. What makes this LLM unique is the sheer number of denials, including after I correct it. In a word, it's too cynical, throwing the baby out with the bathwater.

But again, in order to limit the time it takes to test LLMs, my questions are deliberately fringe, and my story prompts deliberately long, convoluted, and against the grain of standard storytelling elements. When I fed it simple story prompts without directives and let it do its thing, the stories were on par with other Mistrals. This appears to be about being too cynical, even to the point of not respecting the user's prompt. I would say Xwin 0.2 and this LLM are polar opposites: Xwin will always do its best to give the user the response he/she desires, often at the expense of things like the truth, while cybertron is obsessed with the truth, or what it thinks is the truth, often at the expense of the user's desired response. But both have good comprehension of the user's prompts and are overall top performers.
