Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models Paper • 2312.04724 • Published Dec 7, 2023 • 20
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution Paper • 2401.03065 • Published Jan 5, 2024 • 11
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers Paper • 2401.04695 • Published Jan 9, 2024 • 11
TravelPlanner: A Benchmark for Real-World Planning with Language Agents Paper • 2402.01622 • Published Feb 2, 2024 • 33
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios Paper • 2401.17167 • Published Jan 30, 2024 • 1
Language Models, Agent Models, and World Models: The LAW for Machine Reasoning and Planning Paper • 2312.05230 • Published Dec 8, 2023
LongAlign: A Recipe for Long Context Alignment of Large Language Models Paper • 2401.18058 • Published Jan 31, 2024 • 20
Premise Order Matters in Reasoning with Large Language Models Paper • 2402.08939 • Published Feb 14, 2024 • 27
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss Paper • 2402.10790 • Published Feb 16, 2024 • 41
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks Paper • 2412.15204 • Published Dec 19, 2024 • 27
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios Paper • 2412.08972 • Published Dec 12, 2024 • 9