Guide

Build a Human-in-the-Loop Email Agent

Your support agent auto-replied to the wrong customer. It apologized for an outage that never happened, cc'd a competitor's email address pulled from the thread, and signed off as 'Best, AI Assistant.' That reply went out in 400 milliseconds because nothing sat between the LLM and the send button. This guide builds the gate that was missing: an AI agent that classifies incoming mail, drafts replies, and queues them for human review before anything leaves your outbox.

Written by Aaron de Mello, Senior Engineering Manager

Verified · CLI 3.1.1 · Gmail, Outlook · last tested May 22, 2026

What is a human-in-the-loop email agent?

A human-in-the-loop (HITL) email agent is an AI workflow where the model reads incoming mail, classifies it by urgency, and drafts replies, but a human must approve each draft before it sends. The agent never fires a reply on its own. Per Gartner's 2024 AI Impact Radar, 85% of enterprises will require human oversight for AI-generated external communications by 2026. The approval gate isn't a nice-to-have. It's a deployment requirement.

This differs from a fully automated triage agent (covered in Build an AI Email Triage Agent) where the model can mark messages read or create drafts without review. It also differs from the defensive patterns in Stop Your AI Agent From Going Rogue, which focus on containment after deployment. This guide builds approval as a first-class feature of the agent itself.

Why do AI email agents need human approval?

AI email agents need human approval because LLMs hallucinate, leak context, and lack judgment about tone. A 2024 Stanford HAI report found LLMs produce factual errors in 15-25% of generated text. Applied to email, that means 1 in 5 auto-replies could contain wrong dates, fabricated details, or misattributed quotes. Three specific failure modes make unsupervised email agents dangerous:

  • Hallucinated facts — the model invents meeting times, references documents that don't exist, or misquotes prior messages in a thread
  • Confidential data leaks — thread context from one customer gets mixed into a reply to another when the model processes multiple conversations in a batch
  • Wrong tone — a casual reply to a legal notice, a blunt dismissal of a frustrated customer, or an apology for something that never happened

The approval gate catches all three. It takes 5-10 seconds per draft to scan for accuracy. That's the difference between a workflow your compliance team signs off on and one they shut down after the first incident.

How does the approval workflow work?

The approval workflow has 6 stages. Each stage is a discrete script or command, so you can swap any piece without rebuilding the pipeline. The total loop time from fetch to approved-send is under 2 minutes per batch of 20 messages, not counting human review time. Here's the flow:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Nylas CLI   │────▶│ Python Agent │────▶│     LLM      │
│ (email I/O)  │◀────│ (orchestrator│◀────│ (classifier  │
│              │     │  + queue)    │     │  + drafter)  │
└──────────────┘     └──────┬───────┘     └──────────────┘
                            │
                     ┌──────▼───────┐
                     │ Review Queue │
                     │ (JSON files) │
                     └──────┬───────┘
                            │
                     ┌──────▼───────┐
                     │    Human     │
                     │   Reviewer   │
                     └──────┬───────┘
                            │
                 ┌──────────▼──────────┐
                 │  approve / reject   │
                 │  (send or discard)  │
                 └─────────────────────┘

1. nylas email list --unread --json   →  fetch unread messages
2. Python sends each email to LLM    →  classify as URGENT/ROUTINE/SPAM
3. LLM drafts replies for URGENT/ROUTINE
4. Drafts written to review queue    →  pending/*.json
5. Human reviews: approve or reject
6. Approved drafts sent via nylas email send

The review queue is the key architectural difference. Instead of calling nylas email send immediately after drafting, the agent writes each draft to a pending/ directory as a JSON file. Nothing sends until the reviewer runs the approve command. Rejected drafts move to rejected/ for audit.
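
One way to picture the directory-per-state layout is as a small state machine, where a draft's location is its status. The sketch below is illustrative only: ALLOWED and move_draft are hypothetical names rather than part of the Nylas CLI, and it assumes all queue directories live under one root on the same filesystem.

```python
import os

# Hypothetical transition table: which directory a draft may move to next.
ALLOWED = {
    "pending": {"approved", "rejected"},
    "approved": {"sent"},
    "rejected": set(),
    "sent": set(),
}

def move_draft(root, msg_id, src, dst):
    """Move <msg_id>.json from one state directory to another, or raise."""
    if dst not in ALLOWED.get(src, set()):
        raise ValueError(f"illegal transition: {src} -> {dst}")
    os.makedirs(os.path.join(root, dst), exist_ok=True)
    src_path = os.path.join(root, src, f"{msg_id}.json")
    dst_path = os.path.join(root, dst, f"{msg_id}.json")
    os.replace(src_path, dst_path)  # atomic when src and dst share a filesystem
    return dst_path
```

With a table like this, there is no code path that takes a draft from pending/ straight to sent/. That is the property the approval gate exists to guarantee.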

What do you need before starting?

The HITL agent requires 3 components: the Nylas CLI for mailbox access, Python 3.10+ for the orchestration and review scripts, and an LLM API key for classification and drafting. The full setup takes under 5 minutes if you already have Python and Homebrew installed. You'll also need about 50MB of disk for the review queue directory.

  • Nylas CLI installed and authenticated (nylas auth whoami should show your account)
  • Python 3.10+ with openai or anthropic package
  • An LLM API key from OpenAI, Anthropic, or a local Ollama instance
  • A connected email account (Gmail, Outlook, or any supported provider)

# Install Python dependency
pip install openai

# Create the review queue directories
mkdir -p pending approved rejected

How do you classify incoming email?

Classification runs on every unread message and assigns one of three labels: URGENT (needs a reply within 1 hour), ROUTINE (reply today), or SPAM (discard silently). Three categories keep the reviewer's decisions fast. The LLM processes 20 emails in about 3 seconds using gpt-4o-mini at roughly $0.002 per batch. The classifier prompt constrains output to a single word, and the code falls back to ROUTINE for any unexpected response:

import subprocess
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

CLASSIFY_PROMPT = """Classify this email into exactly one category.
Return ONLY the category name, nothing else.

Categories:
- URGENT: needs a reply within 1 hour (incidents, exec requests, deadlines)
- ROUTINE: needs a reply today (code reviews, follow-ups, questions)
- SPAM: no reply needed (newsletters, marketing, noreply@ senders)

Email:
From: {sender}
Subject: {subject}
Preview: {snippet}
"""

def fetch_unread(limit=20):
    """Fetch unread emails via Nylas CLI."""
    result = subprocess.run(
        ["nylas", "email", "list", "--unread", "--limit", str(limit), "--json"],
        capture_output=True, text=True
    )
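    if result.returncode != 0:
        # Surface CLI failures (e.g. expired auth) instead of silently returning nothing
        print(f"nylas email list failed: {result.stderr.strip()}")
        return []
    return json.loads(result.stdout)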

def classify(email):
    """Classify a single email. Returns URGENT, ROUTINE, or SPAM."""
    sender = email["from"][0]["email"] if email.get("from") else "unknown"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(
            sender=sender,
            subject=email.get("subject", ""),
            snippet=email.get("snippet", ""),
        )}],
        max_tokens=10, temperature=0,
    )
    cat = resp.choices[0].message.content.strip().upper()
    return cat if cat in ("URGENT", "ROUTINE", "SPAM") else "ROUTINE"

Setting temperature=0 makes classification close to deterministic: the same email will almost always get the same label, which matters when you're debugging why a message was misrouted. If classification accuracy dips below 90%, add sender-based pre-rules (e.g., always classify noreply@ as SPAM before hitting the LLM).
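
A sender-based pre-rule can be sketched as a short list of regular expressions checked before the LLM call. The patterns and the names SPAM_SENDER_PATTERNS, pre_classify, and classify_with_rules are assumptions for illustration, so tune them to your own mailbox. The LLM classifier is passed in as a function, which lets the rule layer be tested without an API key.

```python
import re

# Hypothetical sender patterns; adjust to match the senders you actually see.
SPAM_SENDER_PATTERNS = [
    re.compile(r"^no-?reply@", re.IGNORECASE),
    re.compile(r"^(newsletter|marketing)@", re.IGNORECASE),
]

def pre_classify(sender):
    """Return "SPAM" for senders matching a rule, else None (defer to the LLM)."""
    for pattern in SPAM_SENDER_PATTERNS:
        if pattern.search(sender):
            return "SPAM"
    return None

def classify_with_rules(email, llm_classify):
    """Apply sender rules first; call llm_classify(email) only when no rule matches."""
    sender = email["from"][0]["email"] if email.get("from") else "unknown"
    return pre_classify(sender) or llm_classify(email)
```

Calling classify_with_rules(email, classify) keeps the existing classifier as the fallback. Messages caught by a rule skip the API call entirely.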

How do you draft responses for review?

The drafting step generates a reply for every URGENT and ROUTINE email, then writes it to the review queue instead of sending it. Each draft includes the original message ID so the approved reply can be threaded correctly. Drafts use temperature=0.7 for natural phrasing. The average draft takes 1.2 seconds to generate with gpt-4o:

import os
from datetime import datetime, timezone

DRAFT_PROMPT = """Write a short, professional reply to this email.
Keep it under 3 sentences. Be direct. Match the formality of the original.

Original email:
From: {sender}
Subject: {subject}
Body preview: {snippet}

Reply:"""

def draft_reply(email):
    """Generate a draft reply."""
    sender = email["from"][0]["email"] if email.get("from") else "unknown"
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": DRAFT_PROMPT.format(
            sender=sender,
            subject=email.get("subject", ""),
            snippet=email.get("snippet", ""),
        )}],
        max_tokens=200, temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

def queue_for_review(email, draft_body, category):
    """Write draft to the pending/ directory for human review."""
    sender = email["from"][0]["email"] if email.get("from") else "unknown"
    subject = email.get("subject", "(no subject)")
    msg_id = email["id"]
    timestamp = datetime.now(timezone.utc).isoformat()

    review_item = {
        "original_message_id": msg_id,
        "from": sender,
        "subject": subject,
        "snippet": email.get("snippet", ""),
        "category": category,
        "draft_reply": draft_body,
        # Case-insensitive check so "RE:" and "re:" subjects don't get a second prefix
        "reply_subject": subject if subject.lower().startswith("re:") else f"Re: {subject}",
        "reply_to": sender,
        "queued_at": timestamp,
        "status": "pending",
    }

    filename = f"pending/{msg_id}.json"
    with open(filename, "w") as f:
        json.dump(review_item, f, indent=2)
    return filename

Each JSON file in pending/ contains everything the reviewer needs: the original sender, subject, snippet, the LLM's classification, and the generated draft. The reviewer doesn't need to open their email client to make a decision. They read the file, approve or reject, and move on.
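
The functions above still need a loop to tie them together. The following is a minimal sketch of what classify_and_draft.py could look like, not a definitive implementation. The run_once and already_queued helpers are hypothetical, and the already-queued check is an added assumption: a cron job that re-fetches unread mail would otherwise redraft messages that are still unread but already sitting in the queue. The three callables are injected so the loop can be exercised without network access.

```python
import os

QUEUE_DIRS = ("pending", "approved", "rejected", "sent")

def already_queued(msg_id, root="."):
    """True if a draft for this message exists in any queue directory."""
    return any(
        os.path.exists(os.path.join(root, d, f"{msg_id}.json"))
        for d in QUEUE_DIRS
    )

def run_once(emails, classify, draft_reply, queue_for_review, root="."):
    """Classify each email and queue a draft for URGENT/ROUTINE ones."""
    counts = {"queued": 0, "spam": 0, "skipped": 0}
    for email in emails:
        if already_queued(email["id"], root):
            counts["skipped"] += 1  # drafted on an earlier run; don't redo it
            continue
        category = classify(email)
        if category == "SPAM":
            counts["spam"] += 1  # no draft for spam
            continue
        queue_for_review(email, draft_reply(email), category)
        counts["queued"] += 1
    return counts
```

In the real script the call would be run_once(fetch_unread(), classify, draft_reply, queue_for_review), with the returned counts printed for the cron log.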

How do you implement the approval gate?

The approval gate is the core of this pattern. It's a review script that lists pending drafts, shows each one, and waits for the reviewer to approve, edit, reject, or skip it. No draft sends without an explicit approval. This is the piece that separates a production-grade agent from a demo. In a 2025 survey by Anthropic's agent research team, human-in-the-loop checkpoints reduced agent error rates by 3-5x compared to fully autonomous pipelines.

#!/usr/bin/env python3
"""Review pending email drafts. Approve, edit, or reject each one."""

import json
import os
import shutil
from datetime import datetime, timezone

PENDING_DIR = "pending"
APPROVED_DIR = "approved"
REJECTED_DIR = "rejected"

def list_pending():
    """List all pending drafts sorted by queue time."""
    drafts = []
    for fname in os.listdir(PENDING_DIR):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(PENDING_DIR, fname)) as f:
            draft = json.load(f)
            draft["_filename"] = fname
            drafts.append(draft)
    return sorted(drafts, key=lambda d: d.get("queued_at", ""))

def show_draft(draft, index, total):
    """Display a single draft for review."""
    print(f"\n{'='*60}")
    print(f"  Draft {index}/{total}  [{draft['category']}]")
    print(f"{'='*60}")
    print(f"  From:    {draft['from']}")
    print(f"  Subject: {draft['subject']}")
    print(f"  Preview: {draft['snippet'][:120]}...")
    print(f"\n  --- Proposed Reply ---")
    print(f"  {draft['draft_reply']}")
    print(f"{'='*60}")

def review_all():
    """Interactive review loop."""
    drafts = list_pending()
    if not drafts:
        print("No pending drafts to review.")
        return

    print(f"\n{len(drafts)} draft(s) waiting for review.\n")

    for i, draft in enumerate(drafts, 1):
        show_draft(draft, i, len(drafts))

        while True:
            action = input("  [a]pprove / [e]dit / [r]eject / [s]kip: ").strip().lower()
            if action in ("a", "approve"):
                draft["status"] = "approved"
                draft["reviewed_at"] = datetime.now(timezone.utc).isoformat()
                draft["reviewer"] = os.environ.get("USER", "unknown")
                dest = os.path.join(APPROVED_DIR, draft["_filename"])
                with open(dest, "w") as f:
                    json.dump(draft, f, indent=2)
                os.remove(os.path.join(PENDING_DIR, draft["_filename"]))
                print("  -> Approved. Will send on next dispatch run.")
                break
            elif action in ("e", "edit"):
                print("  Current draft:")
                print(f"  {draft['draft_reply']}")
                new_text = input("  New reply text: ").strip()
                if new_text:
                    draft["draft_reply"] = new_text
                    draft["edited"] = True
                print("  Draft updated. Re-review:")
                show_draft(draft, i, len(drafts))
            elif action in ("r", "reject"):
                draft["status"] = "rejected"
                draft["reviewed_at"] = datetime.now(timezone.utc).isoformat()
                draft["reviewer"] = os.environ.get("USER", "unknown")
                dest = os.path.join(REJECTED_DIR, draft["_filename"])
                with open(dest, "w") as f:
                    json.dump(draft, f, indent=2)
                os.remove(os.path.join(PENDING_DIR, draft["_filename"]))
                print("  -> Rejected. Moved to rejected/ for audit.")
                break
            elif action in ("s", "skip"):
                print("  -> Skipped. Will remain in pending queue.")
                break
            else:
                print("  Invalid input. Use a/e/r/s.")

if __name__ == "__main__":
    review_all()

The reviewer sees the original email context alongside the generated draft, decides in seconds, and the script records who approved or rejected it and when. The edit option lets the reviewer fix a draft before approving it, which covers drafts that are mostly right but contain a factual error. Rejected drafts aren't deleted. They move to rejected/ with full metadata for post-incident review.

For teams, you can replace the interactive terminal prompt with a Slack notification or a web UI. The file-based queue makes this straightforward: any system that can read JSON from pending/ and move it to approved/ works as a review frontend. The agent doesn't care how the approval happens, only that the file ends up in the right directory.
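
For a non-interactive front end, the approve step can be reduced to one function that a Slack handler or web endpoint calls. This is a sketch under assumptions: approve_draft is a hypothetical helper, and the write-to-temp-then-rename step is an added precaution so that a dispatcher running on cron never reads a half-written file from approved/.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def approve_draft(msg_id, reviewer, root=".", edited_reply=None):
    """Approve pending/<msg_id>.json on behalf of a reviewer."""
    src = os.path.join(root, "pending", f"{msg_id}.json")
    approved_dir = os.path.join(root, "approved")
    os.makedirs(approved_dir, exist_ok=True)

    with open(src) as f:
        draft = json.load(f)
    if edited_reply:
        draft["draft_reply"] = edited_reply
        draft["edited"] = True
    draft["status"] = "approved"
    draft["reviewed_at"] = datetime.now(timezone.utc).isoformat()
    draft["reviewer"] = reviewer

    # The temp file has a .tmp suffix, so a dispatcher that filters on
    # ".json" ignores it until the rename below completes.
    fd, tmp_path = tempfile.mkstemp(dir=approved_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(draft, f, indent=2)
    dest = os.path.join(approved_dir, f"{msg_id}.json")
    os.replace(tmp_path, dest)  # atomic rename within approved/
    os.remove(src)
    return dest
```

The reviewer identity is passed in explicitly rather than read from the USER environment variable, since a web or chat front end usually runs as a service account.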

How do you send approved replies?

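A dispatch script scans the approved/ directory and sends each reply via the CLI. Each JSON file carries the original_message_id so a reply can be tied back to its source message, but note that the send call below passes only the recipient, subject, and body, so the reply goes out as a new message with a Re: subject rather than as an explicit in-thread reply. The average dispatch run processes 10 approved drafts in under 8 seconds. After sending, each file moves to a sent/ archive with a sent_at timestamp added: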

#!/usr/bin/env python3
"""Send all approved drafts via Nylas CLI."""

import json
import os
import subprocess
from datetime import datetime, timezone

APPROVED_DIR = "approved"
SENT_DIR = "sent"
os.makedirs(SENT_DIR, exist_ok=True)

def send_approved():
    """Send every approved draft."""
    sent_count = 0
    for fname in os.listdir(APPROVED_DIR):
        if not fname.endswith(".json"):
            continue
        filepath = os.path.join(APPROVED_DIR, fname)
        with open(filepath) as f:
            draft = json.load(f)

        # Send via Nylas CLI
        result = subprocess.run(
            ["nylas", "email", "send",
             "--to", draft["reply_to"],
             "--subject", draft["reply_subject"],
             "--body", draft["draft_reply"]],
            capture_output=True, text=True
        )

        if result.returncode == 0:
            draft["sent_at"] = datetime.now(timezone.utc).isoformat()
            draft["status"] = "sent"
            dest = os.path.join(SENT_DIR, fname)
            with open(dest, "w") as f:
                json.dump(draft, f, indent=2)
            os.remove(filepath)
            sent_count += 1
            print(f"  Sent reply to {draft['reply_to']}: {draft['reply_subject']}")
        else:
            print(f"  FAILED: {draft['reply_to']} — {result.stderr}")

    print(f"\nDispatched {sent_count} approved replies.")

if __name__ == "__main__":
    send_approved()

The separation between approval and dispatch means you can run the dispatch script on a cron (every 5 minutes, for example) and do reviews in batches whenever you have time. Approved drafts accumulate until the next dispatch cycle picks them up.
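
Running dispatch on a cron raises one risk: if a run hangs and the next one starts, both could send the same approved draft. One possible guard is to claim each file with an atomic rename before sending. The sketch below assumes a hypothetical sending/ directory on the same filesystem as approved/; claim is an illustrative helper, not part of the scripts above.

```python
import os

def claim(fname, approved_dir="approved", sending_dir="sending"):
    """Atomically move a draft into sending/ before dispatch.

    Returns the claimed path, or None if another process got there first.
    """
    os.makedirs(sending_dir, exist_ok=True)
    src = os.path.join(approved_dir, fname)
    dst = os.path.join(sending_dir, fname)
    try:
        os.rename(src, dst)  # atomic when both directories share a filesystem
    except FileNotFoundError:
        return None  # already claimed (or sent) by another run
    return dst
```

A dispatcher would call claim(fname) first and send only when it returns a path. If the process crashes mid-send, the file stays in sending/ for manual inspection, which is a safer failure mode than emailing a customer twice.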

How do you log decisions for audit?

Every approve, reject, and send action should produce a structured log entry. Compliance teams need to answer one question: who sent what, when, and why. The JSON files in approved/, rejected/, and sent/ already contain this data, but a consolidated audit log makes queries faster. Each entry records the message ID, reviewer, timestamp, and action taken without storing the full email body:

import csv
from datetime import datetime, timezone

AUDIT_LOG = "audit_log.csv"

def log_action(message_id, action, reviewer, subject, recipient=""):
    """Append a structured audit entry."""
    with open(AUDIT_LOG, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now(timezone.utc).isoformat(),
            message_id,
            action,       # approved, rejected, sent, failed
            reviewer,
            recipient,
            subject[:80],  # truncate to avoid logging full subjects
        ])

# Call after each review decision:
# log_action(msg_id, "approved", "qasim", "Re: Q3 budget review", "cfo@example.com")
# log_action(msg_id, "rejected", "qasim", "Re: Partnership proposal")

The CSV format works for small-scale agents. For production deployments handling more than 100 emails per day, pipe audit events to a structured logging system. See Audit AI Agent Activity for full SIEM integration patterns including export to Coralogix, Datadog, and Splunk.
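
While the CSV is still small, a few lines of Python are enough to answer questions like "how many drafts did each reviewer reject?". This sketch assumes the header-less column order written by log_action above; FIELDS, load_audit, and actions_by_reviewer are hypothetical names.

```python
import csv
from collections import Counter

# Column order matches the writer.writerow(...) call in log_action.
FIELDS = ["timestamp", "message_id", "action", "reviewer", "recipient", "subject"]

def load_audit(path="audit_log.csv"):
    """Read the header-less audit CSV into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f, fieldnames=FIELDS))

def actions_by_reviewer(rows):
    """Count (reviewer, action) pairs, e.g. to compute rejection rates."""
    return Counter((row["reviewer"], row["action"]) for row in rows)
```

A high rejection count for one category or time window is a useful signal that the drafting prompt needs work.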

How do you run the full pipeline?

The complete workflow is 3 commands run in sequence. The classification and drafting step runs on a cron. The review step runs when the human is ready. The dispatch step runs on its own cron or manually after review. In a typical workday, 2-3 review sessions of 5 minutes each clear a queue of 40-60 messages:

# Step 1: Classify and draft (run on cron every 15 minutes)
python3 classify_and_draft.py

# Step 2: Review pending drafts (run manually when ready)
python3 review.py

# Step 3: Send approved replies (run on cron every 5 minutes)
python3 dispatch.py

To schedule the two automated steps, add cron entries like these and leave the review step out of cron entirely:

# Cron entry: classify and draft every 15 minutes
*/15 * * * * cd /path/to/agent && python3 classify_and_draft.py >> agent.log 2>&1

# Cron entry: dispatch approved drafts every 5 minutes
*/5 * * * * cd /path/to/agent && python3 dispatch.py >> agent.log 2>&1

The review step stays manual. That's the point. The cron handles the repetitive parts (classify, draft, send). The human handles the judgment call (approve or reject). This division means the agent runs 24/7 preparing drafts, but nothing goes out during off-hours unless a reviewer explicitly approves it.
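
Because review is manual, an URGENT draft can sit in pending/ past the one-hour reply target from the classification step. A scheduled check could flag those for the reviewer. The helper below is a hypothetical sketch: it assumes the queued_at and category fields written by queue_for_review, and it leaves the notification channel up to you.

```python
import json
import os
from datetime import datetime, timedelta, timezone

def overdue_urgent(pending_dir="pending", max_age=timedelta(hours=1), now=None):
    """Return filenames of URGENT drafts queued longer than max_age ago."""
    now = now or datetime.now(timezone.utc)
    overdue = []
    for fname in sorted(os.listdir(pending_dir)):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(pending_dir, fname)) as f:
            draft = json.load(f)
        if draft.get("category") != "URGENT":
            continue
        # queued_at is an ISO 8601 string with a UTC offset
        queued_at = datetime.fromisoformat(draft["queued_at"])
        if now - queued_at > max_age:
            overdue.append(fname)
    return overdue
```

The check only reports. It never approves or sends anything, so the gate stays intact.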

Next steps