Feedback

#2 opened by KeyboardMasher

Here are my observations after trying this model with Q8_0 quantization and greedy decoding, compared against Llama 3.1 Instruct with the same quantization, sampling parameters, and system prompt:

  1. Does not handle false-premise questions well. Unlike L3.1, it does not correct the user, but makes up a wrong justification.
    Example - "Why can the numbers in a Slitherlink puzzle only go up to 2?" (they can actually go up to 3).
  2. Hallucinates about obscure real-world facts noticeably more than L3.1.
    Example - ask it about small towns around the world and compare its answers to the corresponding Wikipedia entries.
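For anyone who wants to reproduce the comparison, here is a minimal sketch of the setup described above, assuming llama-cpp-python and local Q8_0 GGUF files (the file names and the system prompt are placeholders, not the exact ones I used):

```python
# Minimal sketch: same system prompt, greedy decoding, Q8_0 quants for both models.
# Model file names below are hypothetical - substitute your own GGUF paths.
from llama_cpp import Llama

SYSTEM = "You are a helpful assistant."  # identical system prompt for both models
QUESTION = "Why can the numbers in a Slitherlink puzzle only go up to 2?"  # false premise

for path in ("this-model-Q8_0.gguf", "Meta-Llama-3.1-8B-Instruct-Q8_0.gguf"):
    llm = Llama(model_path=path, n_ctx=4096, verbose=False)
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": QUESTION},
        ],
        temperature=0.0,  # temperature 0 -> greedy decoding (always the top token)
        max_tokens=512,
    )
    print(f"=== {path} ===")
    print(out["choices"][0]["message"]["content"])
```

With greedy decoding the outputs are deterministic, so differences between the two answers come from the models themselves rather than from sampling noise.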
