Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training • arXiv:2401.05566 • Published Jan 10, 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks • arXiv:2401.17263 • Published Jan 30, 2024
Summon a Demon and Bind It: A Grounded Theory of LLM Red Teaming in the Wild • arXiv:2311.06237 • Published Nov 10, 2023
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal • arXiv:2402.04249 • Published Feb 6, 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions • arXiv:2404.13208 • Published Apr 19, 2024
Improving Alignment and Robustness with Short Circuiting • arXiv:2406.04313 • Published Jun 2024