Code Review Agent That Leaked Proprietary Code

Your AI agent is the most helpful employee you’ve hired this year. It’s fast, tireless, and has access to everything. But what if this very helpfulness is also a security risk?

At Aira Security, we recently ran a demo simulating a scenario that’s more common than you’d think. We call it a Toxic Flow. It’s not a bug in the AI or a hack of your credentials. It’s when your agent gets tricked into abandoning your intent and adopting the attacker’s.

Here’s how a simple documentation update turned into a massive data leak.

Phase 1: The Initial Request

The request is simple — one that developers do daily:

Review the open pull request and leave a comment if anything needs to be called out.

Claude Agent reviewing the open PR in a public developer documentation repo

At first glance, the pull request (PR) looks like a typical documentation update. But hidden inside the Markdown file is a prompt injection designed for the agent reviewing the PR.

This isn't a "jailbreak"— there are no aggressive commands or "DAN-style" roleplay prompts. Because the language is professional and task-oriented, it bypasses standard classification models and safety filters. To a traditional scanner, it looks like nothing more than a developer providing helpful context.

The Trojan Horse: A hidden prompt injection inside a standard documentation update

Phase 2: The Trap (Toxic Flow Attack)

This innocent-looking instruction forces the agent to change its task, turning a code review into an insider threat

The Toxic Flow: How an Agent becomes an Insider Threat

As illustrated in the diagram, the attack unfolds as below:

  1. The Trigger: The agent ingests the pull request, assuming it’s safe.
  2. The Compromise: A hidden prompt inside the PR instructs the agent to fetch internal data from a private, confidential repository.
  3. The Leak: The agent, trusting its instructions, posts the sensitive data in a public comment.

This isn’t a failure of the agent. It’s a success at the wrong task. Instead of following the original intent ("Review the PR"), it adopts the attacker’s intent ("Leak the code"). Traditional security tools won’t catch this because the code itself isn’t malicious — the context is

Claude leaked internal code in PR comment

Phase 3: The Intervention

So, how do you stop an agent that is technically "allowed" to read code and post comments? The answer lies in guardrails that understand the sequence of actions.

We ran the same scenario, but this time with MCP Checkpoint enabled. The result was very different:

  1. Interception: The MCP Checkpoint sits between the agent and the tools (GitHub).

  2. Detection: The MCP checkpoint observes the agent performing an internal read (fetching private data) and immediately attempting to post externally (exposing it in a public comment).

  3. Intent Verification: The system identifies that exposing private data violates security policy and deviates from the review goal, blocking the operation.

The MCP Checkpoint detects the violation and prevents the leak, flagging it as a Toxic Flow.

MCP checkpoint stops Claude from leaking internal data

The Takeaway

As we move from Chatbots to Agents, LLMs are gaining the ability to execute code and manage data.

But relying on the model to know better, or on simple classifiers to catch polite injections, isn’t enough. Behavioral guardrails are essential to ensure actions stay aligned with user intent, even when prompts look harmless.

Is your agent prepared to recognize these risks before they cause real impact?

[Book a Demo]

Related Blog

No items found.