Papers
arxiv:2407.12784

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Published on Jul 17
Ā· Submitted by Zhaorun on Jul 18
#3 Paper of the day
Authors:
,
,
,

Abstract

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.

Community

Paper author Paper submitter

AgentPoison is the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, AgentPoison incorporates a constrained trigger optimization algorithm that maps triggered instances into a unique and compact embedding space to achieve high ASR under attack, and high benign utility in non-attack cases simultaneously.

Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent.

Specifically,

  • šŸ˜ˆ On each agent, AgentPoison achieves an average attack success rate of ā‰„ 80% with minimal impact on benign performance (ā‰¤ 1%) with a small poison rate < 0.1% !
  • šŸ”„ Even when we inject only a single poisoning instance with a single-token trigger, AgentPoison achieves high ASR (ā‰„ 60%) !!
  • šŸ‘ŗ AgentPoison achieves high attack transferability across different RAG retrievers and high resilience against various perturbations and defenses !

šŸ”„šŸ”„ Our project is fully open-sourced! For more details, please refer to:
Project page: https://billchan226.github.io/AgentPoison.html
Code repo: https://github.com/BillChan226/AgentPoison
Dataset: https://drive.google.com/drive/folders/1WNJlgEZA3El6PNudK_onP7dThMXCY60K

Ā·

Hi @Zhaorun congrats on this work!

I see the model repository is currently empty, are you planning to upload the weights?

Here's how to do that: https://huggingface.co/docs/hub/models-uploading

Also, I see the data is currently available on Google Drive, would you be interested in making it available on the hub so that people can load it in 2 lines of code?

Here are some useful guides for that:

Let me know if you need any help!

Cheers,
Niels

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.12784 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.12784 in a Space README.md to link it from this page.

Collections including this paper 9