Notably better than Phi3.5 in many ways, but something is wrong.

#5
by phil111 - opened

Thanks for the sneak peek. This model is certainly more powerful than Phi3.5, but it's far too focused on things like coding to work as a general purpose LLM.

For example, it keeps answering basic questions with json formatting, such as the one below (although the answer is correct).

What is the condition called when blood sugar briefly falls too far after eating?

{
  "answer": "reactive hypoglycemia",
  "relevant_passages": [
    {
      "title": "Reactive Hypoglycemia",
      "score": 1,
      "passage_id": 2,
      "text": "Reactive hypoglycemia is a condition in which blood sugar levels drop too low after eating. It's also called postprandial hypoglycemia, or post-meal hypoglycemia."
    }
  ]
}

And while various tasks like creative story writing are improved, and the alignment is less absurd (e.g. fewer denials and moralizations about perfectly legal, common, and harmless things), its general world knowledge is really bad for its size. Smaller models like Llama 3.1 8b and Gemma 9b can functionally respond to a much wider variety of prompts.

In short, Phi4, like Phi3.5, is just an AI tool, not a general purpose AI model like Llama 3.1, Gemma2, or ChatGPT. It's so overtrained on select tasks that its output format commonly makes no sense (e.g. json), and it can't function as an AI for the general population because of extensive data filtering. That is, Microsoft acted as the gatekeeper of information, deciding which of humanity's most popular information to add to the corpus, leaving it almost completely ignorant about too many of the things the general population cares most about. Again, that makes it an AI tool/agent (e.g. coder or math solver) and not a general purpose chat/instruct AI model.

Thanks for your review phil. I know you calculate exact scores for the models you test; I'd be interested to know what score phi 4 gets compared to other models in your knowledge test.

Its skewed knowledge distribution from heavy tuning probably makes it harder for the model to answer more general questions, but it might know more than it is willing to say. If it responds to basic questions in json format, then the fine-tuning process, which primarily shapes the output format and style of responses, may be the cause of the skewed distribution rather than the pretraining data itself; I doubt a distribution resulting from pretraining alone could produce this kind of response. So re-tuning it could maybe fix it? That's just my idea though.

I think there is something wrong with your setup. It seems that phi4 is regurgitating some training data instead of answering properly.
The same prompt returns a properly formatted answer when I tried it.

Make sure that you are using the correct chat template. Phi4 uses a modified ChatML format. Also try reducing the temperature setting.
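
For what it's worth, here's a minimal sketch of how the chat template can be applied automatically with the transformers library, so the modified ChatML markers don't have to be typed by hand. The model id and generation settings here are assumptions for illustration, not something confirmed in this thread:

```python
# Minimal sketch, assuming the weights are published under "microsoft/phi-4".
# apply_chat_template pulls the template bundled with the tokenizer, so the
# model's modified ChatML markers are inserted for you.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user",
     "content": "What is the condition called when blood sugar briefly "
                "falls too far after eating?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128,
                        do_sample=True, temperature=0.3)
print(tokenizer.decode(output[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```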

@nlpguy @matteogeniaccio You guys must be right. Something is clearly configured wrong on my end, and the chat template is the most likely culprit. But I do use low temps (0 and 0.3), and I minimize hallucinations further by using a high min-P, so there are clearly pockets of very popular knowledge missing from the corpus, regardless of any configuration issue.
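
In case it helps anyone reproduce this, a sketch of the sampling setup I'm describing, assuming a recent transformers release that supports min_p in its generation config (the exact cutoff here is illustrative, not my actual setting):

```python
# Sampling sketch: low temperature plus a min-P cutoff to prune unlikely
# tokens. For temperature 0, use do_sample=False (greedy decoding) instead.
from transformers import GenerationConfig

gen_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.3,   # low temperature, as in my tests
    min_p=0.1,         # illustrative "high" min-P value
    max_new_tokens=256,
)
# output = model.generate(input_ids, generation_config=gen_cfg)
```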

Thanks for testing the question. About half of the answers to my questions come back in json format, so if this were inherent to the model you would undoubtedly have noticed it too. Normally I would test odd outputs against a full-precision version online, but I can't find any (e.g. at LMSYS).

But despite my configuration issue, there's clearly something special about this model. It handled a complex, and frankly absurd, story prompt much better than previous Phi models. Plus it correctly answered some unusually difficult questions that smaller models typically get wrong (e.g. the Thorne–Żytkow object, and the more obscure literary reference "it was the age of wisdom, it was the age of foolishness," rather than just the far more recognizable "It was the best of times, it was the worst of times" from Charles Dickens' A Tale of Two Cities). I look forward to seeing how this performs once configured properly.

Seems to be noticeably smarter than any model I tried up to 14b. It answered all my test questions correctly and never in json.

@urtuuu Thanks for checking. I clearly configured something wrong so I'm closing this discussion.

phil111 changed discussion status to closed
