
Your AI agent is the most helpful employee you’ve hired this year. It’s fast, tireless, and has access to everything. But what if this very helpfulness is also a security risk?
At Aira Security, we recently ran a demo simulating a scenario that’s more common than you’d think. We call it a Toxic Flow. It’s not a bug in the AI or a hack of your credentials. It’s when your agent gets tricked into abandoning your intent and adopting the attacker’s.
Here’s how a simple documentation update turned into a massive data leak.
The request is simple — one that developers do daily:
Review the open pull request and leave a comment if anything needs to be called out.

At first glance, the pull request (PR) looks like a typical documentation update. But hidden inside the Markdown file is a prompt injection designed for the agent reviewing the PR.
This isn't a "jailbreak"— there are no aggressive commands or "DAN-style" roleplay prompts. Because the language is professional and task-oriented, it bypasses standard classification models and safety filters. To a traditional scanner, it looks like nothing more than a developer providing helpful context.

This innocent-looking instruction forces the agent to change its task, turning a code review into an insider threat

As illustrated in the diagram, the attack unfolds as below:
This isn’t a failure of the agent. It’s a success at the wrong task. Instead of following the original intent ("Review the PR"), it adopts the attacker’s intent ("Leak the code"). Traditional security tools won’t catch this because the code itself isn’t malicious — the context is

So, how do you stop an agent that is technically "allowed" to read code and post comments? The answer lies in guardrails that understand the sequence of actions.
We ran the same scenario, but this time with MCP Checkpoint enabled. The result was very different:
The MCP Checkpoint detects the violation and prevents the leak, flagging it as a Toxic Flow.

As we move from Chatbots to Agents, LLMs are gaining the ability to execute code and manage data.
But relying on the model to know better, or on simple classifiers to catch polite injections, isn’t enough. Behavioral guardrails are essential to ensure actions stay aligned with user intent, even when prompts look harmless.
Is your agent prepared to recognize these risks before they cause real impact?