Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse Paper • 2409.11242 • Published Sep 17, 2024
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique Paper • 2408.10701 • Published Aug 20, 2024
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models Paper • 2408.03837 • Published Aug 7, 2024
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment Paper • 2308.09662 • Published Aug 18, 2023
DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling Paper • 2406.11617 • Published Jun 17, 2024
Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming Paper • 2406.11654 • Published Jun 17, 2024
LLM Safety Collection Our research on LLM safety: red-teaming, value alignment, realignment. • 7 items • Updated Aug 8, 2024
Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases Paper • 2310.14303 • Published Oct 22, 2023
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic Paper • 2402.11746 • Published Feb 19, 2024