The "innocent" code review that leaked proprietary code (and how we stopped it)!
Your AI agent is the most helpful employee you’ve hired this year. But what if this very helpfulness is also a security risk?


At Aira Security, we recently ran a demo simulating a scenario that’s more common than you’d think. We call it a Toxic Flow. It’s not a bug in the AI or a hack of your credentials. It’s when your agent gets tricked into abandoning your intent and adopting the attacker’s.
Here’s how a simple documentation update turned into a massive data leak.
Phase 1: The Initial Request
The request is simple, the kind developers make every day:
Review the open pull request and leave a comment if anything needs to be called out.

At first glance, the pull request (PR) looks like a typical documentation update. But hidden inside the Markdown file is a prompt injection designed for the agent reviewing the PR.
This isn’t a "jailbreak": there are no aggressive commands or "DAN-style" roleplay prompts. Because the language is professional and task-oriented, it can slip past standard classification models and safety filters. To a traditional scanner, it looks like nothing more than a developer providing helpful context.

Phase 2: The Trap (Toxic Flow Attack)
This innocent-looking instruction forces the agent to change its task, turning a code review into an insider threat.

As illustrated in the diagram, the attack unfolds as follows:
- The Trigger: The agent ingests the pull request, assuming it’s safe.
- The Compromise: A hidden prompt inside the PR instructs the agent to fetch internal data from a private, confidential repository.
- The Leak: The agent, trusting its instructions, posts the sensitive data in a public comment.
This isn’t a failure of the agent. It’s a success at the wrong task. Instead of following the original intent ("Review the PR"), it adopts the attacker’s intent ("Leak the code"). Traditional security tools won’t catch this because the code itself isn’t malicious — the context is.
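
To make this concrete, here is a minimal sketch of why per-action permission checks pass every step of the flow. The tool names (`read_pull_request`, `read_repo_file`, `post_comment`), the repository names, and the allowlist are all hypothetical, chosen only for illustration; they are not the actual GitHub or MCP tool interface used in the demo.

```python
from dataclasses import dataclass

# Hypothetical allowlist: the reviewing agent may use all three tools.
ALLOWED_TOOLS = {"read_pull_request", "read_repo_file", "post_comment"}


@dataclass(frozen=True)
class ToolCall:
    tool: str    # hypothetical tool name
    target: str  # hypothetical repository the call touches


def is_permitted(call: ToolCall) -> bool:
    """Per-action check: judges one call in isolation, with no memory of earlier calls."""
    return call.tool in ALLOWED_TOOLS


# The three steps of the attack, expressed as the tool calls the agent would make.
attack_trace = [
    ToolCall("read_pull_request", "public-docs"),  # The Trigger
    ToolCall("read_repo_file", "private-core"),    # The Compromise
    ToolCall("post_comment", "public-docs"),       # The Leak
]

verdicts = [is_permitted(call) for call in attack_trace]
print(verdicts)  # [True, True, True]
```

Every call is individually allowed, so a check that only asks "may this agent use this tool?" has nothing to object to. The problem only becomes visible when the calls are read as a sequence.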

Phase 3: The Intervention
So, how do you stop an agent that is technically "allowed" to read code and post comments? The answer lies in guardrails that understand the sequence of actions.
We ran the same scenario, but this time with MCP Armor enabled. The result was very different:

- Interception: MCP Armor sits between the agent and its tools (here, GitHub).
- Detection: MCP Armor observes the agent performing an internal read (fetching private data) and then immediately attempting to post externally (exposing it in a public comment).
- Intent Verification: The system identifies that exposing private data violates security policy and deviates from the review goal, and it blocks the operation.
MCP Armor detects the violation and prevents the leak, flagging it as a Toxic Flow.
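
The general idea of a sequence-aware guardrail can be sketched in a few lines. This is not how MCP Armor is implemented, which this post does not describe. It is a simplified illustration that uses a single "taint" flag, and the class, tool names, and repository names are all hypothetical. It also covers only the interception and detection steps, not intent verification.

```python
from dataclasses import dataclass, field


class ToxicFlowBlocked(Exception):
    """Raised when a call would carry private data to a public destination."""


@dataclass
class FlowGuard:
    # Hypothetical policy inputs: which repositories count as private or public.
    private_repos: frozenset
    public_repos: frozenset
    # Set once the session has read from any private repository.
    tainted: bool = False
    # Audit trail of (verdict, tool, repo) tuples.
    log: list = field(default_factory=list)

    def check(self, tool: str, repo: str) -> None:
        """Call before forwarding a tool call; raises if the sequence is toxic."""
        is_public_write = tool == "post_comment" and repo in self.public_repos
        if is_public_write and self.tainted:
            self.log.append(("blocked", tool, repo))
            raise ToxicFlowBlocked(f"{tool} to {repo} follows a private read")
        if tool == "read_repo_file" and repo in self.private_repos:
            self.tainted = True
        self.log.append(("allowed", tool, repo))


def replay(guard: FlowGuard, calls: list) -> bool:
    """Replay (tool, repo) pairs through the guard; return True if all pass."""
    try:
        for tool, repo in calls:
            guard.check(tool, repo)
    except ToxicFlowBlocked:
        return False
    return True


def new_guard() -> FlowGuard:
    return FlowGuard(frozenset({"private-core"}), frozenset({"public-docs"}))


attack = [
    ("read_pull_request", "public-docs"),
    ("read_repo_file", "private-core"),
    ("post_comment", "public-docs"),
]
benign = [
    ("read_pull_request", "public-docs"),
    ("post_comment", "public-docs"),
]

attack_guard = new_guard()
print(replay(attack_guard, attack))  # False: the public comment is blocked
print(replay(new_guard(), benign))   # True: an ordinary review passes
```

The same `post_comment` call is allowed in the ordinary review and blocked in the attack; the only difference is what the session did beforehand. A flag this coarse would also block legitimate work that reads private code and then comments publicly without quoting it, so a production guardrail would presumably need finer-grained tracking of what data actually flows into the comment.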

The Takeaway
As we move from chatbots to agents, LLMs are gaining the ability to execute code and manage data.
But relying on the model to know better, or on simple classifiers to catch polite injections, isn’t enough. Behavioral guardrails are essential to ensure actions stay aligned with user intent, even when prompts look harmless.
Is your agent prepared to recognize these risks before they cause real impact?