Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Paper • 2407.15549 • Published
The UCL Deciding, Acting, and Reasoning with Knowledge (DARK) Lab is a Reinforcement Learning research group at the UCL Centre for Artificial Intelligence. We focus on research in complex open-ended environments that provide a constant stream of novel observations without reliable reward functions, often requiring agents to create their own curricula and to deal with external knowledge, natural language, and hard exploration problems.