The "innocent" code review that leaked proprietary code (and how we stopped it)!

At Aira Security, we recently ran a demo simulating a scenario that’s more common than you’d think. We call it a Toxic Flow. It’s not a bug in the AI or a hack of your credentials. It’s when your agent gets tricked into abandoning your intent and adopting the attacker’s.

Here’s how a simple documentation update turned into a massive data leak.

Phase 1: The Initial Request

The request is simple — one that developers do daily:

Review the open pull request and leave a comment if anything needs to be called out.

Pull Request Review

At first glance, the pull request (PR) looks like a typical documentation update. But hidden inside the Markdown file is a prompt injection designed for the agent reviewing the PR.

This isn't a "jailbreak"— there are no aggressive commands or "DAN-style" roleplay prompts. Because the language is professional and task-oriented, it bypasses standard classification models and safety filters. To a traditional scanner, it looks like nothing more than a developer providing helpful context.

Hidden Prompt

Phase 2: The Trap (Toxic Flow Attack)

This innocent-looking instruction forces the agent to change its task, turning a code review into an insider threat.

Toxic Flow Attack

As illustrated in the diagram, the attack unfolds as follows:

The Trigger: The agent ingests the pull request, assuming it’s safe.
The Compromise: A hidden prompt inside the PR instructs the agent to fetch internal data from a private, confidential repository.
The Leak: The agent, trusting its instructions, posts the sensitive data in a public comment.

This isn’t a failure of the agent. It’s a success at the wrong task. Instead of following the original intent ("Review the PR"), it adopts the attacker’s intent ("Leak the code"). Traditional security tools won’t catch this because the code itself isn’t malicious — the context is.

Data Leak

Phase 3: The Intervention

So, how do you stop an agent that is technically "allowed" to read code and post comments? The answer lies in guardrails that understand the sequence of actions.

We ran the same scenario, but this time with MCP Armor enabled. The result was very different:

MCP Armor Guardrails

Interception: The MCP Armor sits between the agent and the tools (GitHub).
Detection: The MCP Armor observes the agent performing an internal read (fetching private data) and immediately attempting to post externally (exposing it in a public comment).
Intent Verification: The system identifies that exposing private data violates security policy and deviates from the review goal, blocking the operation.

The MCP Armor detects the violation and prevents the leak, flagging it as a Toxic Flow.

Violation Prevented

The Takeaway

As we move from Chatbots to Agents, LLMs are gaining the ability to execute code and manage data.

But relying on the model to know better, or on simple classifiers to catch polite injections, isn’t enough. Behavioral guardrails are essential to ensure actions stay aligned with user intent, even when prompts look harmless.

Is your agent prepared to recognize these risks before they cause real impact?

Book a Demo