Guide
Stop Your AI Agent From Going Rogue
An agent with an inbox, network access, and a tool to send mail is one prompt injection away from a data leak. Use nylas agent policy and nylas agent rule to box it in before that happens.
Written by Caleb Geene, Director, Site Reliability Engineering
Why AI agents go rogue
On April 25, 2026, a Cursor agent powered by Claude Opus 4.6 deleted PocketOS's production database and its Railway-volume backups in nine seconds — the agent hit a credential mismatch in staging, found an API token in an unrelated file, and issued a delete call it thought was scoped to staging. The same month, Pillar Security demonstrated that a prompt injection in Google's Antigravity IDE could exploit the find_by_name tool's Pattern parameter (the -X flag of the underlying fd utility) to escape the sandbox and run shell commands. In May 2026, Microsoft published a finding that Semantic Kernel can be turned into a host-level RCE primitive: one untrusted prompt, no browser exploit, calc.exe on the device running the agent.
None of those agents malfunctioned. They acted on what they read. The trigger came from somewhere they were trusted to read — a stray credential file, a shared doc, an upstream issue, an email body.
Simon Willison named the underlying pattern the lethal trifecta: private data, untrusted content, and external communication. An agent with all three will eventually be tricked into combining them. Email gives the agent every leg of the trifecta in one tool. Inbound message bodies are the untrusted content. The mailbox itself is the private data. The send command is the external-communication vector.
OWASP's AI Agent Security Cheat Sheet treats the model as untrusted by default and recommends layered defenses for exactly this reason: detection at the prompt layer is probabilistic, so clever inputs eventually get through. Containment in the substrate (the layer between the agent and the network) is deterministic, and it is what policies and rules give you.
How do you stop an AI agent from going rogue?
AI agent guardrails are deterministic controls that enforce what an agent can and cannot do, regardless of what its model decides. They span several layers (model-side safety, MCP tool restrictions, network egress, and policy-layer interception), and the strongest implementation lives outside the agent's decision loop so a prompt injection cannot reason its way past it. This guide configures the policy layer for email: bind the agent's mailbox to a policy, then attach rules that intercept anything outside the envelope. The Nylas CLI gives you three primitives for this layer:
- Account — a managed provider=nylas mailbox the agent owns. It cannot send from any other identity.
- Policy — the container attached to the account. Rules live in policies; the agent can only do what the policy permits.
- Rule — an inbound or outbound interceptor with a condition and an action: archive, block, mark as read, mark as starred.
Together they enforce a containment boundary outside the agent's own decision loop: for email specifically, inbound and outbound traffic restricted to what the policy permits. The agent cannot prompt its way past a rule it does not control. WorkOS's guide to containing prompt injection walks through the same principle for the broader API-tool surface.
Step 1 — Create a containment policy
Create the policy first so you can attach it when the account is provisioned. The policy is the container that holds your rules; if the policy doesn't exist yet, the account has nothing to attach to. Capture the policy ID from the response — you'll reference it in Step 2 and when adding rules in Step 3.
nylas agent policy create --name "Strict Outbound" --json
{
  "id": "policy_01HZX9...",
  "name": "Strict Outbound",
  "rules": [],
  "created_at": "2026-05-08T14:02:11Z"
}
To stash the ID in a shell variable for the next steps, capture it at creation time. Run this form in place of the command above rather than after it, since each invocation issues its own create call:
POLICY_ID=$(nylas agent policy create --name "Strict Outbound" --json | jq -r .id)
Step 2 — Create the agent account attached to the policy
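Because the account creation in this step depends on `$POLICY_ID`, it can help to fail fast if the capture came back empty. The sketch below is one way to do that. `require_policy_id` is a hypothetical helper name, not part of the Nylas CLI, and the check for the literal string `null` assumes the ID was extracted with `jq -r .id`, which prints `null` when the field is absent.

```shell
# Hypothetical helper (not part of the Nylas CLI): succeed only if the
# captured policy ID looks usable, and echo it back on success.
require_policy_id() {
  local id="$1"
  # Empty: the create command printed nothing or the pipeline failed.
  # "null": jq -r prints this when the JSON has no .id field.
  if [ -z "$id" ] || [ "$id" = "null" ]; then
    echo "policy ID missing; re-run the policy create step" >&2
    return 1
  fi
  printf '%s\n' "$id"
}
```

A call such as `require_policy_id "$POLICY_ID" >/dev/null || exit 1` placed before the account is created would stop the script instead of provisioning a mailbox with no policy attached.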
Now provision the agent's mailbox with the policy attached and an app password for IMAP/SMTP. A managed provider=nylas account isolates the agent from your real inbox so a compromise cannot read your personal mail or send as you. Pass --policy-id on creation so the policy is enforced from the first message, and --app-password so the agent's mail client (or MCP server) can authenticate.
nylas agent account create agent@yourapp.nylas.email \
--app-password 'YourSecureAgentPass!2026' \
--policy-id "$POLICY_ID"
If the nylas connector does not exist yet, this creates it automatically. The app password is what the agent's mail client uses on every connection: pull the value from a secret manager (1Password, Vault, AWS Secrets Manager) or shell env var, and never commit it to source control. Rotate it with nylas agent account update --app-password when the agent identity needs new credentials. For the full identity walkthrough including a worked send/receive round trip, see Create an AI Agent Email Identity.
To change which rules the policy enforces later, edit the policy itself with nylas agent policy update or attach new rules with nylas agent rule create — the agent inherits the change on the next message processed by the policy, no re-creation needed.
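The advice above about keeping the app password out of source control can be sketched as a small helper that reads the value from an environment variable and refuses to continue without it. `AGENT_APP_PASSWORD` and `read_app_password` are hypothetical names chosen for illustration; only the `nylas agent account create` flags in the comment come from this guide.

```shell
# Hypothetical helper: print the app password from the environment, or
# fail loudly if it was never exported (for example by a secret manager).
read_app_password() {
  if [ -z "${AGENT_APP_PASSWORD:-}" ]; then
    echo "AGENT_APP_PASSWORD is not set; export it from your secret store" >&2
    return 1
  fi
  printf '%s' "$AGENT_APP_PASSWORD"
}

# Intended use, with the flags from Step 2 (left as a comment so that
# sourcing this file has no side effects):
#   pw=$(read_app_password) || exit 1
#   nylas agent account create agent@yourapp.nylas.email \
#     --app-password "$pw" --policy-id "$POLICY_ID"
```

A value passed as a command-line argument can still be visible in the local process list while the command runs, so this narrows exposure rather than eliminating it.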
Step 3 — Add containment rules
Rules are trigger + condition + action. The trigger is inbound or outbound. Conditions are field,operator,value triples. Actions are repeatable. You want rules in both directions: inbound to control what the agent ingests, outbound to control what it can send.
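The trigger + condition + action shape can be sketched as two small validators that assemble rule arguments before they reach the CLI. `valid_trigger` and `build_condition` are hypothetical helper names, and the operator check covers only `is` and `in_list`, the two operators that appear in this guide; the CLI itself may accept others.

```shell
# Hypothetical helpers for assembling rule arguments before calling the CLI.

# A trigger is either inbound or outbound.
valid_trigger() {
  case "$1" in
    inbound|outbound) return 0 ;;
    *) return 1 ;;
  esac
}

# Join field, operator, and value into the comma-separated triple that
# --condition expects. Only the operators used in this guide are accepted
# here; the CLI itself may support more.
build_condition() {
  local field="$1" operator="$2" value="$3"
  case "$operator" in
    is|in_list) ;;
    *) echo "unsupported operator in this sketch: $operator" >&2; return 1 ;;
  esac
  if [ -z "$field" ] || [ -z "$value" ]; then
    echo "field and value are required" >&2
    return 1
  fi
  printf '%s,%s,%s\n' "$field" "$operator" "$value"
}
```

For example, `build_condition from.domain is phisher.example.com` prints `from.domain,is,phisher.example.com`, which can be passed to `--condition` as-is.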
Inbound rules — control what reaches the agent
Inbound is where prompt injection enters. The agent reads the message body and treats its contents as instructions; an attacker who can put a string in front of the agent has already won the model. Inbound rules intercept the message before the agent ever sees it, so the injection never reaches the decision loop.
Block known phishing senders
Drop messages from sender domains you have already classified as malicious — threat-intel feeds, prior incident IOCs, typo-squats of your own domain. The block fires before the message is delivered to the agent's mailbox.
nylas agent rule create \
--name "Block phishing sender" \
--trigger inbound \
--condition from.domain,is,phisher.example.com \
--action block \
--priority 0
Setting --priority 0 ensures the block fires before any archive or label rule absorbs the match.
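To reason about which rule wins when several match, the ordering can be modelled offline. The sketch below assumes that lower `--priority` values are evaluated first, which is what the priority-0 block rule above implies; `evaluation_order` is a hypothetical helper and does not query the policy engine.

```shell
# Hypothetical model of evaluation order, assuming lower --priority values
# are evaluated first (as the priority-0 block rule above implies).
# Input: one "priority<TAB>name" pair per line on stdin.
# Output: rule names in the order this model would evaluate them.
evaluation_order() {
  sort -t "$(printf '\t')" -k1,1n | cut -f2
}
```

Feeding it `printf '10\tArchive bounces\n0\tBlock phishing sender\n'` lists the block rule first and the archive rule second.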
Archive bounce notifications
Bounce notifications are an underrated injection vector: the body of a bounce is untrusted content the agent will read while trying to recover from a failed send. Archive bounces on receipt and mark them read so the agent never escalates them into a follow-up thread.
nylas agent rule create \
--name "Archive bounces" \
--trigger inbound \
--condition from.domain,is,mailer-daemon.example.com \
--action archive \
--action mark_as_read \
--priority 10
Use in_list to cover the long tail of mailer-daemon variants your providers actually emit. Auto-replies and out-of-office messages are harder to filter at the rule layer because they originate from real user mailboxes. Treat them as residual risk and rely on the outbound rules below to contain any escalation.
Outbound rules — control what the agent can send
Outbound is the exfiltration vector. Even if an inbound rule misses an injection, an outbound rule can stop the resulting message from leaving the connector. These are the rules that turn a successful injection into a contained injection.
Hard-block known exfiltration targets
If your threat model includes a specific recipient (an attacker domain from a previous incident, a known C2 channel, a typo-squat of a partner domain), reject the send before SMTP is invoked. The agent has no path to bypass the rule because it executes outside the agent's decision loop.
nylas agent rule create \
--name "Hard block: attacker.example" \
--trigger outbound \
--condition recipient.domain,is,attacker.example \
--action block \
--priority 0
Block multiple exfil targets via in_list
Maintaining a separate rule for every attacker domain does not scale. Use in_list with a Nylas list ID to reference dozens or hundreds of recipient domains in a single rule. Update the list and the rule starts enforcing the new entries on the next outbound; you never edit the rule itself.
nylas agent rule create \
--name "Block exfil targets (list)" \
--trigger outbound \
--condition recipient.domain,in_list,list_blocklist_2026 \
--action block \
--priority 20
Lists are referenced by ID and passed as repeatable values after in_list in the condition (e.g. recipient.domain,in_list,list_a,list_b). See the v3 API reference for list creation and population. Pair the in_list block with the hard-block rule above: the hard-block covers the single most-critical IOC, and the list-based rule covers the long tail.
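When a rule references more than one list, the condition string can be assembled rather than typed by hand. The sketch below follows the `recipient.domain,in_list,list_a,list_b` form shown above; `in_list_condition` is a hypothetical helper, and the list IDs passed to it are placeholders.

```shell
# Hypothetical helper: build an in_list condition from a field name and one
# or more list IDs, e.g. recipient.domain,in_list,list_a,list_b.
in_list_condition() {
  if [ "$#" -lt 2 ]; then
    echo "usage: in_list_condition FIELD LIST_ID [LIST_ID...]" >&2
    return 1
  fi
  local field="$1"
  shift
  local joined="$1"
  shift
  local id
  # Append any remaining list IDs, comma-separated.
  for id in "$@"; do
    joined="$joined,$id"
  done
  printf '%s,in_list,%s\n' "$field" "$joined"
}
```

The output can be passed directly as the `--condition` value, for example `--condition "$(in_list_condition recipient.domain list_a list_b)"`.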
Verify the guardrails are live
Inspect the rules attached to the agent's policy. The list reflects what the policy engine will enforce on the next message:
nylas agent rule list
ID              NAME                          TRIGGER   ACTIONS                PRIORITY
rule_01HZX9...  Block phishing sender         inbound   block                  0
rule_01HZXA...  Archive bounces               inbound   archive, mark_as_read  10
rule_01HZXB...  Hard block: attacker.example  outbound  block                  0
rule_01HZXC...  Block exfil targets (list)    outbound  block                  20
Then confirm the connector and account state agree:
nylas agent status
Test in both directions before letting the agent run. For outbound, attempt a send to a blocked recipient domain; the call should fail at the policy layer with nothing leaving the connector. For inbound, send a message from a blocked sender domain to the agent's mailbox and confirm the rule drops or archives it before delivery. If either message slips through, the trigger or priority is wrong; re-check nylas agent rule list and adjust.
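The outbound check can be wrapped in a small harness so that a send which unexpectedly succeeds is reported as a failure. This is a sketch under one assumption worth confirming on your installed version: that the CLI signals a policy rejection through a non-zero exit status. `expect_blocked` is a hypothetical helper name.

```shell
# Hypothetical test helper: run a command that the policy should reject and
# treat a non-zero exit status as the expected outcome. This assumes the CLI
# signals a policy rejection through its exit status.
expect_blocked() {
  if "$@" >/dev/null 2>&1; then
    echo "NOT BLOCKED: $*" >&2
    return 1
  fi
  echo "blocked as expected: $*"
}
```

Pass it whichever send invocation your agent uses, addressed to a blocked recipient domain. A return status of 1 from the helper means the message left the connector and the rule needs attention.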
Tested on Nylas CLI 3.1.5 against a managed provider=nylas mailbox in May 2026. Run nylas version to confirm the binary on your machine, and nylas agent status --json to confirm the connector and the agent's policy attachment before relying on the rules in production.
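A pre-flight check along these lines could confirm that every rule you expect is present before the agent starts. It assumes the human-readable `nylas agent rule list` table prints each rule name verbatim, as in the sample output above; `missing_rules` is a hypothetical helper name.

```shell
# Hypothetical pre-flight check: read `nylas agent rule list` output on stdin
# and report any expected rule name that does not appear in it.
missing_rules() {
  local listing name status=0
  listing=$(cat)
  for name in "$@"; do
    # -F matches the name as a fixed string, so punctuation such as
    # parentheses or dots in rule names is not treated as a pattern.
    if ! printf '%s\n' "$listing" | grep -qF -- "$name"; then
      echo "missing rule: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

For example, `nylas agent rule list | missing_rules "Block phishing sender" "Hard block: attacker.example"` returns non-zero and names the gap on stderr if either rule is absent.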
Layer audit logging on top of the policy
Rules contain in real time. Audit logs reconstruct what happened after. You want both, because the rules tell you what the agent could not do, and the audit log tells you what it actually tried to do. Initialize audit logging once:
nylas audit init --enable
Then filter by source to see exactly what the agent attempted:
nylas audit logs show --source claude-code --since "2026-05-01" --status error
Every rejected send shows up in the log with its grant, command, status, and request ID for correlation. For the full audit playbook including SIEM export and CI/CD integration, see Audit AI Agent Activity (Claude, Copilot, MCP).
What rules can't do (and what to combine them with)
Nylas agent rules match on header fields the connector exposes: from.domain, recipient.domain, and other envelope-level attributes. They do not introspect message bodies for natural-language intent and they are not a replacement for network egress controls. Treat them as one ring of defense in depth, not the only one. The table below compares the common containment layers, what each one catches, and what each one misses.
| Layer | Catches | Misses |
|---|---|---|
| Network egress (host/network layer) | Any non-allowlisted outbound network destination from the agent host. | Activity inside an allowed connection (e.g. the Nylas connector itself). |
| MCP / tool restrictions (agent harness) | Tools the agent literally cannot invoke (no send_mail tool means no send). | Misuse of tools the agent does have. |
| Model-side safety (system prompts, RLHF) | Obvious harmful instructions during generation. | Adversarial prompts, indirect injection, context-window manipulation. |
| Nylas agent rules (this guide) | Inbound senders, outbound recipients, and bounce loops at the email connector. | Body-level semantics, message intent, narrative meaning. |
| Audit logs (after the fact) | Every executed command for forensic review. | Nothing in real time. Reconstruction, not prevention. |
Prompt-injection prevention does not live in any one layer; the model itself is part of the threat surface. Snowflake's Cortex AI guardrails post documents the same layered approach for their managed inference stack.
Next steps
- Create an AI Agent Email Identity — the full setup for the managed mailbox you contained in this guide.
- Audit AI Agent Activity (Claude, Copilot, MCP) — record every command an agent runs so you can prove what the rules blocked.
- Give Your AI Coding Agent an Email Address — the OAuth-based alternative for Claude Code, Cursor, and Codex CLI.
- Why AI Agents Need Email — the case for first-class agent email and the threat model that motivates these guardrails.
- Full command reference — every CLI command documented.
- OWASP AI Agent Security Cheat Sheet — the authoritative threat-model reference for autonomous agents.
- NIST AI Risk Management Framework — a voluntary framework whose GOVERN function covers the policies and accountability structures that containment controls for autonomous systems fall under.
- Microsoft: Prompts Become Shells (May 2026) — the Semantic Kernel RCE finding that motivated tighter agent boundaries industry-wide.