
Stop Your AI Agent From Going Rogue

An agent with an inbox, network access, and a tool to send mail is one prompt injection away from a data leak. Use nylas agent policy and nylas agent rule to box it in before that happens.

Written by Caleb Geene, Director, Site Reliability Engineering

Reviewed by Prem Keshari

Updated July 9, 2026

Verified — CLI 3.1.5 · Nylas managed · last tested May 8, 2026

Why do AI agents go rogue?

On April 25, 2026, a Cursor agent powered by Claude Opus 4.6 deleted PocketOS's production database and its Railway-volume backups in nine seconds — the agent hit a credential mismatch in staging, found an API token in an unrelated file, and issued a delete call it thought was scoped to staging. The same month, Pillar Security demonstrated that a prompt injection in Google's Antigravity IDE could exploit the find_by_name tool's Pattern parameter (the -X flag of the underlying fd utility) to escape the sandbox and run shell commands. In May 2026, Microsoft published a finding that Semantic Kernel can be turned into a host-level RCE primitive: one untrusted prompt, no browser exploit, calc.exe on the device running the agent.

None of those agents malfunctioned. They acted on what they read. The trigger came from somewhere they were trusted to read — a stray credential file, a shared doc, an upstream issue, an email body.

Simon Willison named the underlying pattern the lethal trifecta: private data, untrusted content, and external communication. An agent with all three will eventually be tricked into combining them. Email gives the agent every leg of the trifecta in one tool. Inbound message bodies are the untrusted content. The mailbox itself is the private data. The send command is the external-communication vector.

OWASP's AI Agent Security Cheat Sheet treats the model as untrusted by default and recommends layered defenses for exactly this reason: detection at the prompt layer is probabilistic, and clever inputs eventually get through. Containment in the substrate (the layer between the agent and the network) is deterministic, and it is what policies and rules give you.

How do you stop an AI agent from going rogue?

AI agent guardrails are deterministic controls that enforce what an agent can and cannot do, regardless of what its model decides. They span several layers (model-side safety, MCP tool restrictions, network egress, and policy-layer interception), and the strongest implementation lives outside the agent's decision loop so a prompt injection cannot reason its way past it. This guide configures the policy layer for email: bind the agent's mailbox to a policy, then attach rules that intercept anything outside the envelope. The Nylas CLI gives you three primitives for this layer:

Account — a managed provider=nylas mailbox the agent owns. It cannot send from any other identity.
Policy — the container attached to the account. Rules live in policies; the agent can only do what the policy permits.
Rule — an inbound or outbound interceptor with a condition and an action: archive, block, mark as read, mark as starred.

Together they enforce a containment boundary outside the agent's own decision loop: for email specifically, inbound and outbound traffic restricted to what the policy permits. The agent cannot prompt its way past a rule it does not control. WorkOS's guide to containing prompt injection walks through the same principle for the broader API-tool surface.

How do I create a containment policy?

A containment policy is the top-level object that groups every inbound and outbound rule the agent must obey. Create it first so the account has something to bind to at provisioning time. Policy creation takes under 2 seconds and returns a unique policy ID that the account and rule sections below refer back to. Without a policy, the account has no enforcement layer and every message flows unchecked.

The command below creates a policy named "Strict Outbound" and prints the response as JSON. Capture the id field from the response — it is the handle for every subsequent rule and account operation.

nylas agent policy create --name "Strict Outbound" --json

{
  "id": "policy_01HZX9...",
  "name": "Strict Outbound",
  "rules": [],
  "created_at": "2026-05-08T14:02:11Z"
}

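Stash the ID in a shell variable for the next steps. Run this form instead of the command above, not in addition to it, since a second create call would attempt to create another policy with the same name: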

POLICY_ID=$(nylas agent policy create --name "Strict Outbound" --json | jq -r .id)
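
If the create call fails or the response has no id field, jq -r prints null (or nothing) and POLICY_ID quietly holds a useless value. The guard below is a minimal sketch for a POSIX shell; require_policy_id is a hypothetical helper name, not part of the Nylas CLI.

```shell
# Hypothetical helper: reject an empty or "null" policy ID before it is reused.
# `jq -r .id` prints the string "null" when the field is missing.
require_policy_id() {
  case "$1" in
    ""|null)
      echo "no policy ID captured; check the output of the create command" >&2
      return 1
      ;;
  esac
  return 0
}

# Usage, after capturing POLICY_ID as shown above:
#   require_policy_id "$POLICY_ID" || exit 1
```

Failing early here keeps a bad ID from reaching the workspace update in the next section.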

How do I create the agent account and attach the policy?

Now provision the agent's mailbox, then attach the policy to its workspace. A managed provider=nylas account isolates the agent from your real inbox so a compromise cannot read your personal mail or send as you. Pass --app-password so the agent's mail client (or MCP server) can authenticate, then apply the strict policy to the account's default workspace with nylas workspace update --policy-id so every outbound message is evaluated before it leaves.

nylas agent account create agent@yourapp.nylas.email \
  --app-password 'YourSecureAgentPass!2026'

# The account starts on a default workspace; find it with: nylas workspace list --json
nylas workspace update WORKSPACE_ID --policy-id "$POLICY_ID"

If the nylas connector does not exist yet, the account create command creates it automatically. The app password is what the agent's mail client uses on every connection: pull the value from a secret manager (1Password, Vault, AWS Secrets Manager) or a shell environment variable, and never commit it to source control. Rotate it with nylas agent account update --app-password when the agent identity needs new credentials. For the full identity walkthrough, including a worked send/receive round trip, see Create an AI Agent Email Identity.
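
One way to keep the password out of the command you type is a thin wrapper that reads it from the environment. This is a sketch under assumptions: AGENT_APP_PASSWORD and create_agent_account are hypothetical names, and the wrapper uses only the account create flags shown above.

```shell
# Hypothetical wrapper: read the app password from the environment instead of
# typing it inline, where it would land in shell history.
# AGENT_APP_PASSWORD is an assumed variable name; populate it from your secret manager.
create_agent_account() {
  if [ -z "${AGENT_APP_PASSWORD:-}" ]; then
    echo "AGENT_APP_PASSWORD is not set" >&2
    return 1
  fi
  nylas agent account create "$1" --app-password "$AGENT_APP_PASSWORD"
}

# Usage:
#   create_agent_account agent@yourapp.nylas.email
```

The value is still passed as a command-line argument, so it remains visible in the process list while the command runs; the wrapper only keeps it out of history and scripts.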

To change which rules the policy enforces later, edit the policy itself with nylas agent policy update or attach new rules with nylas agent rule create — the agent inherits the change on the next message processed by the policy, no re-creation needed.

How do I add containment rules?

Rules are trigger + condition + action. The trigger is inbound or outbound. Conditions are field,operator,value triples. Actions are repeatable. You want rules in both directions: inbound to control what the agent ingests, outbound to control what it can send.
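
To make the field,operator,value shape concrete, the sketch below splits a condition string into its three parts. describe_condition is a hypothetical helper for illustration only; the CLI parses conditions itself.

```shell
# Hypothetical helper: split a condition string into field, operator, and value.
# Everything after the second comma stays in the value, so multi-value
# conditions remain intact.
describe_condition() {
  IFS=, read -r field operator value <<EOF
$1
EOF
  printf 'field=%s operator=%s value=%s\n' "$field" "$operator" "$value"
}

describe_condition "from.domain,is,phisher.example.com"
# prints: field=from.domain operator=is value=phisher.example.com
```

Running a condition through a check like this before creating a rule is a cheap way to catch a missing comma or a swapped field and operator.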

Inbound rules — control what reaches the agent

Inbound is where prompt injection enters. The agent reads the message body and treats its contents as instructions; an attacker who can put a string in front of the agent has already won the model. Inbound rules intercept the message before the agent ever sees it, so the injection never reaches the decision loop.

Block known phishing senders

Verizon's 2025 Data Breach Investigations Report found that 36% of breaches involve phishing. For an AI agent with inbox access, a phishing email is not just a credential threat — it is a prompt injection payload. An inbound block rule drops messages from sender domains you have already classified as malicious (threat-intel feeds, prior incident IOCs, typo-squats of your own domain) before the agent reads the body, so the injection never enters the model's context window.

The rule below creates an inbound block on a single sender domain. It fires at the connector layer before the message is delivered to the agent's mailbox, and the agent receives no notification that a message was intercepted.

nylas agent rule create \
  --name "Block phishing sender" \
  --trigger inbound \
  --condition from.domain,is,phisher.example.com \
  --action block \
  --priority 0

Setting --priority 0 ensures the block fires before any archive or label rule absorbs the match.

Archive bounce notifications

Bounce notifications account for 2-5% of inbound volume and carry attacker-controlled content in the original message body. When the agent reads a bounce to recover from a failed send, it ingests the bounced message's body as context — and that body can contain a prompt injection payload. An archive rule intercepts bounce notifications on receipt and marks them read, so the agent never escalates them into a follow-up thread or treats the bounced content as new instructions.

The rule below matches messages from a mailer-daemon domain and applies two actions: archive (move out of the inbox) and mark-as-read (suppress unread indicators the agent might poll).

nylas agent rule create \
  --name "Archive bounces" \
  --trigger inbound \
  --condition from.domain,is,mailer-daemon.example.com \
  --action archive \
  --action mark_as_read \
  --priority 10

Use in_list to cover the long tail of mailer-daemon variants your providers actually emit. Auto-replies and out-of-office messages are harder to filter at the rule layer because they originate from real user mailboxes — treat them as residual risk and rely on the outbound rules below to contain any escalation.

Outbound rules — control what the agent can send

Outbound is the exfiltration vector. Even if an inbound rule misses an injection, an outbound rule can stop the resulting message from leaving the connector. These are the rules that turn a successful injection into a contained injection.

Hard-block known exfiltration targets

An outbound block rule rejects a send before SMTP is invoked. Policy-layer blocks execute in under 5 ms — faster than the agent's own decision loop — so even if a prompt injection convinces the model to exfiltrate data, the message never leaves the connector. Use a hard block when your threat model includes a specific recipient: an attacker domain from a previous incident, a known C2 channel, or a typo-squat of a partner domain.

The rule below creates an outbound block on a single recipient domain at priority 0. Because the block fires outside the agent's process, the model cannot reason its way around it or retry with a different phrasing.

nylas agent rule create \
  --name "Hard block: attacker.example" \
  --trigger outbound \
  --condition recipient.domain,is,attacker.example \
  --action block \
  --priority 0

Block multiple exfil targets via in_list

A single list-based rule can reference hundreds of blocked domains without creating individual rules for each one. The in_list operator references a Nylas list ID that holds the domain entries. When you add a domain to the list, the rule starts enforcing it on the next outbound message — you never edit the rule itself. This is how you scale a blocklist from a handful of known IOCs to a full threat-intel feed without rule sprawl.

The rule below creates an outbound block that checks recipient domains against a centralized list. Pair it with the hard-block rule above: the hard block covers the single most-critical IOC at priority 0, and the list-based rule covers the long tail at priority 20.

nylas agent rule create \
  --name "Block exfil targets (list)" \
  --trigger outbound \
  --condition recipient.domain,in_list,list_blocklist_2026 \
  --action block \
  --priority 20

Lists are referenced by ID and passed as repeatable values after in_list in the condition (e.g. recipient.domain,in_list,list_a,list_b). See the v3 API reference for list creation and population.
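
When several list IDs feed one rule, building the condition string by hand is error-prone. The sketch below assembles it from its parts; in_list_condition is a hypothetical helper, and list_a and list_b are placeholder IDs.

```shell
# Hypothetical helper: build an in_list condition from a field and one or
# more list IDs, joined with commas.
in_list_condition() {
  if [ "$#" -lt 2 ]; then
    echo "usage: in_list_condition FIELD LIST_ID [LIST_ID...]" >&2
    return 1
  fi
  field=$1
  shift
  # "$*" joins the remaining arguments with the first character of IFS.
  printf '%s,in_list,%s\n' "$field" "$(IFS=,; printf '%s' "$*")"
}

in_list_condition recipient.domain list_a list_b
# prints: recipient.domain,in_list,list_a,list_b
```

The result can be passed straight to --condition in the rule create command shown above.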

How do I verify the guardrails are live?

Run verification before production — a misconfigured priority or trigger can leave the agent unprotected while appearing correctly configured. The nylas agent rule list command shows every active rule, its trigger direction, actions, and priority. The output reflects what the policy engine will enforce on the next message:

nylas agent rule list

ID                   NAME                            TRIGGER    ACTIONS                  PRIORITY
rule_01HZX9...       Block phishing sender           inbound    block                    0
rule_01HZXA...       Archive bounces                 inbound    archive, mark_as_read    10
rule_01HZXB...       Hard block: attacker.example    outbound   block                    0
rule_01HZXC...       Block exfil targets (list)      outbound   block                    20

Then confirm the connector and account state agree:

nylas agent status

Test in both directions before letting the agent run. For outbound, attempt a send to a blocked recipient domain — the call should fail at the policy layer with nothing leaving the connector. For inbound, send a message from a blocked sender domain to the agent's mailbox and confirm the rule drops or archives it before delivery. If either message slips through, the trigger or priority is wrong; re-check nylas agent rule list and adjust.
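
The rule-list check can be scripted so it runs before every deploy. The sketch below assumes the table layout shown in the sample output in this guide and matches rule names as plain substrings; rule_has_trigger is a hypothetical helper, not a CLI feature.

```shell
# Hypothetical helper: read `nylas agent rule list` table output on stdin and
# succeed only if a row containing the given rule name also has the given trigger.
# Usage: nylas agent rule list | rule_has_trigger "Archive bounces" inbound
rule_has_trigger() {
  awk -v name="$1" -v trig="$2" '
    index($0, name) {
      # Scan the whitespace-separated columns of the matching row.
      for (i = 1; i <= NF; i++) if ($i == trig) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

A non-zero exit means the rule is missing or attached to the wrong direction, which is the misconfiguration this section warns about. It does not replace the live send and receive tests, which confirm that the policy engine enforces the rule.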

Tested on Nylas CLI 3.1.5 against a managed provider=nylas mailbox in May 2026. Run nylas version to confirm the binary on your machine, and nylas agent status --json to confirm the connector and the agent's policy attachment before relying on the rules in production.

How do I layer audit logging on top of the policy?

Rules contain in real time. Audit logs reconstruct what happened after. You want both, because the rules tell you what the agent could not do, and the audit log tells you what it actually tried to do. Audit logs capture every rejected send with grant, command, status, and request ID for correlation — the four fields you need to tie a blocked message back to the prompt that triggered it.

Initialize audit logging once with the command below. This enables persistent logging for all agent activity on the current grant, including rule-blocked messages, successful sends, and authentication events.

nylas audit init --enable

Then filter by source and status to see exactly what the agent attempted and what the policy rejected. The --status error flag isolates blocked actions from successful ones:

nylas audit logs show --source claude-code --since "2026-05-01" --status error

Each log entry includes the request ID, which you can correlate with the agent's own trace logs to see what prompt or tool call led to the blocked action. For the full audit playbook including SIEM export and CI/CD integration, see Audit AI Agent Activity (Claude, Copilot, MCP).
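
For a recurring review, the filter above can be wrapped in a small function. This is a sketch: blocked_since is a hypothetical name, it uses only the flags shown in this guide, and the YYYY-MM-DD check is an assumption based on the example date rather than a documented requirement of the CLI.

```shell
# Hypothetical wrapper: list rejected agent actions for one source since a date.
# The date format check mirrors the example in this guide (assumption).
blocked_since() {
  source_name=${1:-}
  since=${2:-}
  case "$since" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *)
      echo "expected a date like 2026-05-01, got: $since" >&2
      return 1
      ;;
  esac
  nylas audit logs show --source "$source_name" --since "$since" --status error
}

# Usage:
#   blocked_since claude-code 2026-05-01
```

Rejecting a malformed date up front avoids a query that quietly returns the wrong window.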

What rules can't do (and what to combine them with)

Nylas agent rules match on header fields the connector exposes: from.domain, recipient.domain, and other envelope-level attributes. They do not introspect message bodies for natural-language intent and they are not a replacement for network egress controls. Treat them as one ring of defense in depth, not the only one. The table below compares the common containment layers, what each one catches, and what each one misses.

Layer	Catches	Misses
Network egress (host/network layer)	Any non-allowlisted outbound network destination from the agent host.	Activity inside an allowed connection (e.g. the Nylas connector itself).
MCP / tool restrictions (agent harness)	Tools the agent literally cannot invoke (no `send_mail` tool means no send).	Misuse of tools the agent does have.
Model-side safety (system prompts, RLHF)	Obvious harmful instructions during generation.	Adversarial prompts, indirect injection, context-window manipulation.
Nylas agent rules (this guide)	Inbound senders, outbound recipients, and bounce loops at the email connector.	Body-level semantics, message intent, narrative meaning.
Audit logs (after the fact)	Every executed command for forensic review.	Nothing in real time. Reconstruction, not prevention.

Prompt-injection prevention does not live in any one layer; the model itself is part of the threat surface. Snowflake's Cortex AI guardrails post documents the same layered approach for their managed inference stack.

Next steps

Agent Rules and Policies: Complete Guide — every condition field, operator, action, trigger, and policy setting for the rules you created here.
Manage the Agent Account Lifecycle — pause an agent with a block-all rule, then resume or delete it once the incident is closed.
Create an AI Agent Email Identity — the full setup for the managed mailbox you contained in this guide.
Audit AI Agent Activity (Claude, Copilot, MCP) — record every command an agent runs so you can prove what the rules blocked.
Give Your AI Coding Agent an Email Address — the OAuth-based alternative for Claude Code, Cursor, and Codex CLI.
Why AI Agents Need Email — the case for first-class agent email and the threat model that motivates these guardrails.
Calendar invite prompt injection — defend the calendar vector, where invites enter an agent's context without a click
AI agent audit dashboard — per-agent command counts, error rates, and session replay from audit logs
Give Hermes its own email address — apply this containment model to the self-hosted Hermes agent end to end
Full command reference — every CLI command documented.
OWASP AI Agent Security Cheat Sheet — the authoritative threat-model reference for autonomous agents.
NIST AI Risk Management Framework — the GOVERN function explicitly requires containment controls for autonomous systems.
Microsoft: Prompts Become Shells (May 2026) — the Semantic Kernel RCE finding that motivated tighter agent boundaries industry-wide.