Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. arXiv:2401.05566, published Jan 10, 2024.
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. arXiv:2401.17263, published Jan 30, 2024.
Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild. arXiv:2311.06237, published Nov 10, 2023.
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249, published Feb 6, 2024.
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions. arXiv:2404.13208, published Apr 19, 2024.