Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance Paper • 2410.18889 • Published Oct 24, 2024 • 15
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments Paper • 2410.05254 • Published Oct 7, 2024 • 80
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations Paper • 2410.02707 • Published Oct 3, 2024 • 47
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? Paper • 2405.05904 • Published May 9, 2024 • 6
SEAHORSE release Collection The SEAHORSE metrics (as described in https://arxiv.org/abs/2305.13194). • 12 items • Updated 9 days ago • 17
On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method Paper • 2206.14796 • Published Jun 29, 2022 • 1
RED-ACE: Robust Error Detection for ASR using Confidence Embeddings Paper • 2203.07172 • Published Mar 14, 2022 • 1
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models Paper • 2305.11171 • Published May 18, 2023 • 2