
Build an AI Email Auto-Responder

An AI auto-responder reads your inbox, picks messages that need a reply, drafts one with an LLM, and waits for your approval before sending. This guide builds that pipeline from scratch in Python, using the Nylas CLI for mailbox access and any OpenAI-compatible model for drafting. The human-approval gate is non-negotiable: never auto-send without review.

Written by Aaron de Mello, Senior Engineering Manager

Verified · CLI 3.1.11 · Gmail, Outlook · last tested May 23, 2026

Command references used in this guide: nylas email list, nylas email read, nylas email send, and nylas email mark-read.

What is an AI email auto-responder?

An AI email auto-responder is a pipeline that monitors your inbox, identifies messages requiring a reply, generates context-aware drafts with a large language model, and queues them for human review before sending. Unlike Gmail's built-in “Smart Reply” (limited to 3 canned suggestions under 10 words each), an LLM-based responder generates full-paragraph replies that reference the sender's specific questions. The key constraint: nothing sends without explicit approval.

LLMs can produce plausible-sounding but incorrect facts — a well-documented limitation called “hallucination” that makes human review essential for outbound email. The approval gate catches invented dates, wrong names, and misquoted details before they reach the recipient. This guide builds the gate as a first-class component, not an afterthought.

How do you fetch and classify incoming email?

The first stage pulls unread messages and sorts them into three buckets: REPLY (needs a response), ACKNOWLEDGE (a brief “got it” suffices), and SKIP (newsletters, noreply senders, automated notifications). Classification runs on the sender, subject, and snippet, so there is no need to fetch the full body yet. Processing 20 messages costs well under a cent with gpt-4o-mini, which OpenAI lists at $0.15 per 1M input tokens (0.015 cents per 1K) at the time of writing.

The Python script below calls the CLI via subprocess to fetch unread mail, then sends each message's metadata to the LLM for classification. Setting temperature=0 makes the classifier nearly deterministic: the same email almost always gets the same label, which simplifies debugging when a message lands in the wrong bucket.

import subprocess
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

CLASSIFY_PROMPT = """Classify this email into exactly one category.
Return ONLY the category name, nothing else.

Categories:
- REPLY: sender expects a substantive response (questions, requests, follow-ups)
- ACKNOWLEDGE: a brief "received, thanks" is enough (FYIs, status updates)
- SKIP: no reply needed (newsletters, noreply@, automated alerts)

Email:
From: {sender}
Subject: {subject}
Preview: {snippet}
"""

def fetch_unread(limit=20):
    """Fetch unread emails via the CLI."""
    result = subprocess.run(
        ["nylas", "email", "list", "--unread", "--limit", str(limit), "--json"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        print(f"Fetch failed: {result.stderr}")
        return []
    return json.loads(result.stdout)

def classify(email):
    """Classify a single email. Returns REPLY, ACKNOWLEDGE, or SKIP."""
    sender = email["from"][0]["email"] if email.get("from") else "unknown"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(
            sender=sender,
            subject=email.get("subject", ""),
            snippet=email.get("snippet", ""),
        )}],
        max_tokens=10, temperature=0,
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in ("REPLY", "ACKNOWLEDGE", "SKIP") else "SKIP"

Messages classified as SKIP get marked read automatically via nylas email mark-read so they don't appear in the next fetch cycle. REPLY and ACKNOWLEDGE messages move to the drafting stage.
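
The routing described above could be wired together roughly as follows. This is a sketch rather than the guide's own script: route_messages and mark_read_cli are hypothetical names, the classifier and drafting functions are passed in as callables so the loop can be exercised without network access, and the mark-read invocation mirrors the one used in the review script later in this guide.

```python
import subprocess


def mark_read_cli(message_id: str) -> bool:
    """Mark one message as read; returns True if the CLI exited cleanly."""
    result = subprocess.run(
        ["nylas", "email", "mark-read", "--id", message_id],
        capture_output=True, text=True,
    )
    return result.returncode == 0


def route_messages(emails, classify, draft_reply, mark_read=mark_read_cli) -> dict:
    """Send each message down the SKIP path or the drafting path.

    classify(email) is expected to return REPLY, ACKNOWLEDGE, or SKIP.
    Returns a count of messages per label for logging.
    """
    counts = {"REPLY": 0, "ACKNOWLEDGE": 0, "SKIP": 0}
    for email in emails:
        label = classify(email)
        counts[label] = counts.get(label, 0) + 1
        if label == "SKIP":
            # Nothing to draft: mark read so the next fetch ignores it
            mark_read(email["id"])
        else:
            # REPLY and ACKNOWLEDGE both go to the drafting stage
            draft_reply(email, label)
    return counts
```

Passing the stages in as arguments is a design convenience for testing; calling the module-level functions directly works just as well in a single-file script.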

How do you generate replies with an LLM?

Drafting reads the full message body (not just the snippet) and generates a reply that references the sender's specific points. The full body fetch uses nylas email read with the message ID. The OpenAI text generation guide covers the chat completions API used here. Generating one draft takes about 1.5 seconds with gpt-4o at roughly $0.01 per message, so a batch of 10 costs under $0.10. The draft prompt constrains the model to 2-3 sentences for ACKNOWLEDGE messages and 5-8 sentences for REPLY messages.
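
To sanity-check a per-message cost figure against your own traffic, a small estimator is enough. The sketch below is illustrative only: estimate_cost is a hypothetical helper, the token counts are guesses (300 matches the max_tokens cap used for drafting), and the per-million prices are assumptions that should be replaced with the figures from your provider's current price list.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the estimated USD cost of one request.

    Prices are expressed in dollars per one million tokens.
    """
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Assumed values: ~800 prompt tokens, 300 completion tokens,
# and placeholder prices of $2.50 (input) and $10.00 (output) per 1M tokens.
per_draft = estimate_cost(800, 300, input_price_per_m=2.50, output_price_per_m=10.00)
batch_of_ten = 10 * per_draft
print(f"per draft: ${per_draft:.4f}, batch of 10: ${batch_of_ten:.3f}")
```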

Each generated draft gets saved to a pending/ directory as a JSON file rather than sent immediately. The file includes the original message ID, sender, subject, the LLM's classification, and the proposed reply text. This separation between drafting and sending is what makes the pipeline safe for production use.

import os
from datetime import datetime, timezone

DRAFT_PROMPT = """Write a reply to this email.
- For REPLY: address each question or request (5-8 sentences max)
- For ACKNOWLEDGE: brief confirmation (2-3 sentences max)
- Match the sender's formality level
- Never invent facts not present in the original

Category: {category}
From: {sender}
Subject: {subject}
Body: {body}

Reply:"""

def fetch_body(message_id):
    """Fetch the full message body."""
    result = subprocess.run(
        ["nylas", "email", "read", message_id, "--json"],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        return ""
    msg = json.loads(result.stdout)
    # Fall back to the snippet if the body is missing or null, then cap the length
    return (msg.get("body") or msg.get("snippet") or "")[:3000]

def draft_reply(email, category):
    """Generate a reply and save to the review queue."""
    sender = email["from"][0]["email"] if email.get("from") else "unknown"
    body = fetch_body(email["id"])

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": DRAFT_PROMPT.format(
            category=category,
            sender=sender,
            subject=email.get("subject", ""),
            body=body,
        )}],
        max_tokens=300, temperature=0.7,
    )
    draft_text = resp.choices[0].message.content.strip()

    # Save to review queue
    review_item = {
        "message_id": email["id"],
        "to": sender,
        "subject": f"Re: {email.get('subject', '')}".replace("Re: Re:", "Re:"),
        "category": category,
        "draft": draft_text,
        "queued_at": datetime.now(timezone.utc).isoformat(),
        "status": "pending",
    }
    os.makedirs("pending", exist_ok=True)
    filepath = f"pending/{email['id']}.json"
    with open(filepath, "w") as f:
        json.dump(review_item, f, indent=2)
    return filepath

The [:3000] slice keeps only the first 3,000 characters of the body, which stops long threads from inflating the prompt and keeps per-message costs predictable. Most mail clients put the newest message first and the quoted history below it, so for longer threads the model sees the most recent message plus the top of the quoted history, which is usually enough context for a reply.
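
One way to make those 3,000 characters count is to drop quoted history before slicing. The helper below is a minimal sketch under two assumptions: trim_body is a hypothetical name, and the body is plain text that marks quoted lines with a leading ">". HTML bodies would need a different approach.

```python
def trim_body(body: str, limit: int = 3000) -> str:
    """Drop quoted reply lines, then cap the result at `limit` characters."""
    kept = [
        line for line in body.splitlines()
        if not line.lstrip().startswith(">")  # ">" marks quoted text in plain-text mail
    ]
    return "\n".join(kept).strip()[:limit]
```

The trade-off is that the model loses the quoted context entirely, so this suits threads where the newest message is self-explanatory.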

How do you add a human approval gate?

The approval gate is a review script that reads JSON files from pending/, displays each draft with its recipient, subject, and category, and waits for the reviewer to approve, reject, or skip. Anthropic's research on building effective agents recommends human-in-the-loop checkpoints for high-stakes actions like sending email, where mistakes are irreversible. The approval step takes 5-10 seconds per draft.

Approved drafts are sent via the CLI. Rejected drafts move to a rejected/ directory with the reviewer's name and timestamp for audit. The original message gets marked as read after the draft is sent so it won't appear in the next fetch cycle.

#!/usr/bin/env python3
"""Review and send approved auto-responses."""

import json
import os
import subprocess
from datetime import datetime, timezone

PENDING = "pending"
REJECTED = "rejected"

def review_and_send():
    os.makedirs(REJECTED, exist_ok=True)
    files = sorted(f for f in os.listdir(PENDING) if f.endswith(".json"))

    if not files:
        print("No pending drafts.")
        return

    print(f"\n{len(files)} draft(s) to review.\n")

    for fname in files:
        path = os.path.join(PENDING, fname)
        with open(path) as f:
            draft = json.load(f)

        print(f"{'='*55}")
        print(f"  To:      {draft['to']}")
        print(f"  Subject: {draft['subject']}")
        print(f"  Type:    {draft['category']}")
        print(f"\n  --- Draft Reply ---")
        print(f"  {draft['draft']}")
        print(f"{'='*55}")

        action = input("  [a]pprove / [r]eject / [s]kip: ").strip().lower()

        if action in ("a", "approve"):
            result = subprocess.run(
                ["nylas", "email", "send",
                 "--to", draft["to"],
                 "--subject", draft["subject"],
                 "--body", draft["draft"],
                 "--yes"],
                capture_output=True, text=True
            )
            if result.returncode == 0:
                # Mark original as read
                subprocess.run(
                    ["nylas", "email", "mark-read", "--id", draft["message_id"]],
                    capture_output=True, text=True
                )
                os.remove(path)
                print("  -> Sent and marked read.")
            else:
                print(f"  -> Send failed: {result.stderr}")
        elif action in ("r", "reject"):
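            draft["status"] = "rejected"
            draft["reviewed_at"] = datetime.now(timezone.utc).isoformat()
            # Record who rejected it for the audit trail (login name from the environment)
            draft["reviewed_by"] = os.environ.get("USER") or os.environ.get("USERNAME", "unknown")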
            with open(os.path.join(REJECTED, fname), "w") as f:
                json.dump(draft, f, indent=2)
            os.remove(path)
            print("  -> Rejected.")
        else:
            print("  -> Skipped.")

if __name__ == "__main__":
    review_and_send()

For teams handling more than 50 messages per day, replace the terminal prompt with a Slack notification or web dashboard. The file-based queue makes integration straightforward: any system that can read JSON from pending/ and call the CLI works as a review frontend.
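
As a starting point for such a frontend, the function below reads the queue into a list a dashboard or Slack bot could render. It is a sketch, not part of the original scripts: load_pending is a hypothetical name, and it assumes the file layout written by draft_reply() above (one JSON object per file with status and queued_at fields).

```python
import json
from pathlib import Path


def load_pending(queue_dir: str = "pending") -> list[dict]:
    """Return every pending draft, oldest first, skipping unreadable files."""
    drafts = []
    for path in Path(queue_dir).glob("*.json"):
        try:
            item = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # half-written or corrupt file; pick it up on a later pass
        if isinstance(item, dict) and item.get("status") == "pending":
            item["path"] = str(path)  # lets the frontend delete or move the file later
            drafts.append(item)
    # ISO 8601 UTC timestamps sort correctly as plain strings
    drafts.sort(key=lambda d: d.get("queued_at", ""))
    return drafts
```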

How do you handle edge cases and safety?

Three failure modes need explicit handling. First, the LLM sometimes generates replies that reference attachments or meeting links mentioned in the original email but fabricates the URLs. The fix: strip any URLs from the draft that don't appear in the original message body. Second, thread loops occur when two auto-responders reply to each other. Even with the approval gate in place, a loop floods the review queue, and an unattended one can exhaust the sending quota: Gmail caps outbound mail at 500 messages per day for consumer accounts and 2,000 for Workspace. The fix: skip messages from known automated addresses (noreply@, mailer-daemon@).

Third, confidentiality leaks happen when the model references content from a different email thread that was processed in the same batch. The fix: process each message in a separate LLM call with no shared context. The script above already does this — each draft_reply() call creates its own prompt with only that message's content. Never batch multiple emails into one LLM request.

# Sender blocklist to prevent reply loops
SKIP_SENDERS = {
    "noreply@", "no-reply@", "mailer-daemon@",
    "notifications@", "donotreply@", "bounce@",
}

def should_skip(sender: str) -> bool:
    """Check if the sender is an automated address."""
    return any(sender.lower().startswith(prefix) for prefix in SKIP_SENDERS)

# URL validation to catch fabricated links
import re

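URL_RE = re.compile(r'https?://\S+')
TRAILING = '.,;:!?)>]"\''

def strip_fabricated_urls(draft: str, original_body: str) -> str:
    """Remove URLs from the draft that don't appear in the original."""
    # Trim sentence punctuation so "https://a.example/x." matches "https://a.example/x"
    original_urls = {u.rstrip(TRAILING) for u in URL_RE.findall(original_body)}
    def replace_url(match):
        raw = match.group(0)
        url = raw.rstrip(TRAILING)
        # Keep the trailing punctuation either way so the sentence still reads cleanly
        return raw if url in original_urls else "[link removed]" + raw[len(url):]
    return URL_RE.sub(replace_url, draft)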

# Schedule: run classify + draft every 15 minutes via cron
# */15 * * * * cd /path/to/responder && python3 auto_respond.py >> auto.log 2>&1

Log every draft and send action with the message ID, classification, and reviewer. Don't log the full email body — message IDs are enough for audit. The Audit AI Agent Activity guide covers structured logging patterns for agent workflows.
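
A JSON Lines file is one simple way to keep that audit trail. The sketch below is an assumption about format, not something the pipeline above already does: audit_record and append_audit are hypothetical helpers, and the field names are illustrative.

```python
import json
from datetime import datetime, timezone


def audit_record(action: str, message_id: str, category: str, reviewer: str) -> str:
    """Build one JSON Lines audit entry; the email body is deliberately left out."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,          # e.g. "drafted", "sent", "rejected"
        "message_id": message_id,  # enough to look the message up later
        "category": category,
        "reviewer": reviewer,
    })


def append_audit(path: str, record: str) -> None:
    """Append one record to the log, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
```

A call such as append_audit("audit.jsonl", audit_record("sent", draft["message_id"], draft["category"], reviewer)) would fit right after a successful send in the review script.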

Next steps