On scalable oversight with weak LLMs judging strong LLMs Paper • 2407.04622 • Published Jul 5, 2024 • 11
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text Paper • 2406.08418 • Published Jun 12, 2024 • 29
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning Paper • 2406.04520 • Published Jun 6, 2024 • 11
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark Paper • 2406.01574 • Published Jun 3, 2024 • 44
view article Article CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models May 24, 2024 • 21
Imp: Highly Capable Large Multimodal Models for Mobile Devices Paper • 2405.12107 • Published May 20, 2024 • 26