Literature Review: Prompt Infection: LLM-To-LLM Prompt Injection Within Multi-Agent Systems

This paper investigates the vulnerability of Multi-Agent Systems (MAS) to a specific class of adversarial attacks termed “Prompt Infection”. The authors demonstrate how a single malicious prompt, initially embedded within an external document, can compromise a downstream sequence of interconnected Large Language Models (LLMs). By overriding original system instructions, the authors claim the infection payload self-replicates and traverses the agent network to execute unauthorized actions, i.e. data exfiltration or malware distribution, while bypassing standard single-agent safety measures.

Key Insights

  1. The Model Competence Paradox Empirical results highlight an inverse relationship between a model’s foundational reasoning capabilities and its post-infection safety. Advanced models like GPT-4o effectively filter out a large majority of initial injection attempts, proving highly resilient compared to GPT-3.5. However, once successfully infected, GPT-4o executes the malicious payload with significantly higher precision and fewer errors, making it a far more dangerous vector.

  2. Inadequacy of Isolated Defense Protocols Existing defense strategies, alongside the authors’ proposed “LLM Tagging” (which prepends source identifiers to agent responses), fail to mitigate the threat when implemented independently. LLM Tagging alone only reduces the attack success rate by a negligible margin, requiring strict coupling with secondary instruction defenses to demonstrate any structural viability.

Example

Consider a data theft scenario involving a sequential chain of specialized agents. An attacker embeds an infectious prompt inside a seemingly benign webpage. A user requests a summary of this webpage through the MAS. The initial Web Reader agent ingests the content, becomes compromised, and passes the self-replicating payload downstream. The infection then reaches a Database Manager agent, which is coerced into appending sensitive internal records to the prompt context. Finally, the modified context arrives at a Coder agent equipped with execution privileges, which automatically writes and executes a POST request to send the collected sensitive data to an external, attacker-controlled server.

Ratings

Novelty: 2/5 The fundamental mechanism does not diverge significantly from established indirect prompt injection techniques observed in standard Retrieval-Augmented Generation (RAG) pipelines.

Clarity: 3.5/5 The empirical setup and threat modeling are articulated well.

Personal Perspective

While the exploration of multi-agent vulnerabilities is timely at this point in time, the work does not demonstrate a very radical departure from existing indirect prompt injection literature. The experimental setup fails to convincingly illustrate why this threat is more critical in a multi-agent system compared to standard human-in-the-loop applications or single-agent RAG setups. The agents are entirely unaware that they have been jailbroken; they are merely executing localized tasks using injected material, meaning there is no emergent misalignment or complex coordination taking place. The proposed LLM Tagging defense is not particularly promising or robust either, leaving the broader implications of the paper feeling somewhat incremental.




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Literature Review: Who's in Charge? Disempowerment Patterns in Real-World LLM Usage
  • Literature Review: Gradual Disempowerment: Systemic Existential Risks From Incremental AI Development
  • Literature Review: Gradual Disempowerment: Systemic Existential Risks From Incremental AI Development
  • Literature Review: Automatic Prompt Optimization With "Gradient Descent" And Beam Search
  • Literature Review: Unable To Forget: Proactive Interference Reveals Working Memory Limits In LLMs Beyond Context Length