ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities • Paper • 2412.06745 • Published 25 days ago • 6
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? • Paper • 2411.05000 • Published Nov 7, 2024 • 21
On scalable oversight with weak LLMs judging strong LLMs • Paper • 2407.04622 • Published Jul 5, 2024 • 11
InstructVideo: Instructing Video Diffusion Models with Human Feedback • Paper • 2312.12490 • Published Dec 19, 2023 • 17
arXiVeri: Automatic table verification with GPT • Paper • 2306.07968 • Published Jun 13, 2023 • 6
Crosslingual Generalization through Multitask Finetuning • Paper • 2211.01786 • Published Nov 3, 2022 • 2