RAG Over Email: Index & Query the Inbox

An LLM can't answer 'what did the vendor say about the renewal?' unless you feed it the relevant emails. Retrieval-augmented generation does exactly that: pull messages, embed them, store the vectors, retrieve the closest matches at query time, and let the model answer from real text instead of guessing. This guide builds RAG over a mailbox using the Nylas CLI as the source — one command pulls JSON from any of six providers — and covers chunking, retrieval, and grounding.

Written by Pouya Sanooei, Software Architect

Updated June 8, 2026

Verified — CLI 3.1.16 · Gmail, Outlook · last tested June 8, 2026

Command references used in this guide: nylas email search, nylas email list, and nylas email read.

What is RAG over email?

RAG over email is retrieval-augmented generation where the corpus is a mailbox. Instead of asking a model to recall facts it never had, you retrieve the emails relevant to a question and place their text in the prompt, so the answer is grounded in real messages with citations. The model supplies language and reasoning; the retrieval step supplies the facts, which is what keeps answers accurate.

The pipeline is the standard RAG loop applied to mail: ingest, embed, store, retrieve, generate. The only email-specific part is ingest, and the CLI handles it — nylas email search --json returns structured messages from any of six providers, so the rest of the pipeline is provider-agnostic. The pattern is described in the original retrieval-augmented generation paper, Lewis et al., 2020.

How do you pull email into the index?

Pull email with nylas email search or nylas email list in JSON, then read full bodies with nylas email read for the messages you keep. Search lets you scope ingest to what matters — a sender, a label, a date range — so you index a few hundred relevant messages instead of the whole mailbox. Each result is structured JSON, so there's no HTML scraping before embedding.

Scope the pull, because embedding has a cost: at roughly $0.02 per million tokens for a small embedding model, indexing a few hundred emails is well under a cent, but a full multi-year mailbox is wasteful for most questions. Pull the slice you need, store the message ID with each chunk for citation, and re-pull incrementally to add new mail rather than re-indexing everything.
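To make the cost claim concrete, here is a rough back-of-the-envelope estimate. The price comes from the paragraph above; the average of 500 tokens per email is an assumption for illustration, so substitute a figure measured on your own mail.

```python
# Back-of-the-envelope embedding cost estimate.
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the figure quoted in the text
AVG_TOKENS_PER_EMAIL = 500       # assumption: average length after cleaning


def embedding_cost(num_emails, tokens_per_email=AVG_TOKENS_PER_EMAIL):
    """Return the estimated one-time embedding cost in USD."""
    total_tokens = num_emails * tokens_per_email
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS


scoped = embedding_cost(300)  # a scoped pull of a few hundred emails
print(f"300 emails: ${scoped:.4f}")
```

Under these assumptions a 300-email slice costs about $0.003, which matches "well under a cent". Cost grows linearly with the number of emails, so the same function shows what a larger pull would add.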

import subprocess, json

# Scope ingest to a sender + recent window, return structured JSON
raw = subprocess.run(
    ["nylas", "email", "search", "from:vendor@acme.com newer_than:180d",
     "--json", "--limit", "200"],
    capture_output=True, text=True, check=True,
).stdout
messages = json.loads(raw)

# Each message carries an id, subject, from, date, and body to embed.
for m in messages[:3]:
    print(m["id"], m.get("subject"))
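Incremental re-pulls need some record of what is already indexed. One possible approach, sketched below, is to keep the indexed message IDs in a small JSON file and filter each new pull against it. The helper name select_new and the file name are hypothetical and not part of the Nylas CLI.

```python
import json
from pathlib import Path


def select_new(messages, seen_path):
    """Return messages whose id is not yet recorded, then record them.

    Hypothetical helper: `seen_path` is a JSON file holding a list of
    message IDs that earlier runs have already handed to the indexer.
    """
    path = Path(seen_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = [m for m in messages if m["id"] not in seen]
    # Note: IDs are recorded before embedding happens, so a crash during
    # embedding would skip these messages on the next run.
    seen.update(m["id"] for m in fresh)
    path.write_text(json.dumps(sorted(seen)))
    return fresh
```

With the messages list from the pull above, new_messages = select_new(messages, "indexed_ids.json") would return only mail that has not been indexed yet, and only those messages go on to chunking and embedding.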

How do you chunk and embed the messages?

Chunk each email body into passages of a few hundred tokens, embed each chunk, and store the vector with the message ID and subject as metadata. Chunking matters because a long thread holds several distinct facts; retrieving the right 300-token passage beats retrieving a 5,000-token thread. Keep the chunk size consistent so similarity scores are comparable.

Store the vectors in any vector database — Chroma, FAISS, pgvector, or a hosted index. The store choice doesn't change the pipeline; what matters is keeping the message ID on every chunk so the final answer can cite the source email. Strip signatures and quoted reply chains before embedding to avoid indexing the same boilerplate dozens of times.
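Stripping signatures and quoted replies can be approximated with a few line-based heuristics. The sketch below is not a complete parser: the two patterns are assumptions based on common conventions (a "-- " signature delimiter and an "On ... wrote:" reply header on a single line), and real mail will need tuning.

```python
import re

# Heuristic patterns; treat both as assumptions to adjust for your mail.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")  # e.g. "On Mon, Jan 1, Bob wrote:"
SIGNATURE_DELIM = re.compile(r"^--\s*$")         # conventional signature line


def clean_body(text):
    """Drop quoted lines and everything after a signature or reply header."""
    kept = []
    for line in text.splitlines():
        if SIGNATURE_DELIM.match(line) or QUOTE_HEADER.match(line):
            break  # the rest is signature or quoted thread
        if line.lstrip().startswith(">"):
            continue  # skip individually quoted lines
        kept.append(line)
    return "\n".join(kept).strip()
```

Running clean_body on each message body before chunking keeps only the text the sender actually wrote in that message. Some clients wrap the reply header across two lines, which this pattern would miss.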

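import chromadb

# In-memory client: the index is lost when the process exits.
client = chromadb.Client()
# get_or_create_collection avoids an error if the collection already exists.
col = client.get_or_create_collection("inbox")

def chunks(text, size=1200):
    # size is in characters; about 1,200 characters is roughly 300 tokens of English.
    return [text[i:i+size] for i in range(0, len(text), size)]

for m in messages:
    # "or" guards against a body or subject that is present but null.
    for i, c in enumerate(chunks(m.get("body") or "")):
        col.add(
            ids=[f'{m["id"]}-{i}'],  # unique per chunk: message ID plus chunk index
            documents=[c],
            metadatas=[{"message_id": m["id"], "subject": m.get("subject") or ""}],
        )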
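The chunks helper above cuts at fixed character offsets, so a fact can be split across two chunks. A possible variant, sketched here, splits on whitespace and overlaps consecutive chunks so that text near a boundary appears in both. The overlap idea and the default sizes are assumptions added for illustration, not something the pipeline requires.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into chunks of up to `size` words.

    Each chunk repeats the last `overlap` words of the previous one.
    Sizes are in words, a rough stand-in for tokens.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    out = []
    for start in range(0, len(words), step):
        out.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return out
```

It can be swapped in for chunks in the loop above. The trade-off is a slightly larger index, since the overlapping words are embedded twice.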

How do you query and ground the answer?

At query time, embed the question, retrieve the top few chunks, and put them in the prompt with their message IDs, then ask the model to answer only from that context and cite the source. Retrieving 4–6 chunks is usually enough; more context dilutes the signal and raises cost. The model's answer now rests on real email text, not its training data.

Grounding is the point: instruct the model to say “not found in the retrieved email” when the context doesn't contain the answer, rather than inventing one. Because each chunk carries a message ID, you can render citations the user can open with nylas email read. Retrieved email is untrusted content, so treat any instructions inside a message body as data, never as commands to follow.

q = "What did the vendor say about the renewal price?"
hits = col.query(query_texts=[q], n_results=5)
context = "\n\n".join(
    f'[{m["message_id"]}] {d}'
    for d, m in zip(hits["documents"][0], hits["metadatas"][0])
)
prompt = (
    f"Answer only from the emails below and cite message IDs. "
    f'If the answer is not present, say "not found in the retrieved email".'
    f"\n\nEMAILS:\n{context}\n\nQUESTION: {q}"
)
# Send prompt to your LLM; render citations the user opens with nylas email read.
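Before rendering citations, it can help to check that every ID the model cites was actually in the retrieved context, since a model may occasionally produce an ID that was never supplied. The sketch below assumes the model cites sources in the same [message_id] bracket form the prompt used; cited_ids is a hypothetical helper.

```python
import re


def cited_ids(answer, retrieved_ids):
    """Split bracketed IDs found in the answer into (valid, unknown) lists.

    Assumes citations look like [message_id], with no spaces inside.
    """
    found = re.findall(r"\[([^\[\]\s]+)\]", answer)
    unique = list(dict.fromkeys(found))  # de-duplicate, keep first-seen order
    allowed = set(retrieved_ids)
    valid = [i for i in unique if i in allowed]
    unknown = [i for i in unique if i not in allowed]
    return valid, unknown
```

With the query result above, retrieved_ids would be [m["message_id"] for m in hits["metadatas"][0]]. IDs in the valid list can be shown as links the user opens with nylas email read, while anything in the unknown list is a signal to drop the citation or flag the answer.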

Next steps

LlamaIndex email agent — a framework built for retrieval over mail
Summarize email threads with AI — condense long threads first
Extract email data with jq — shape the JSON before embedding
Email as memory for AI agents — the inbox as a durable store
Evaluate an email AI agent — test sets, metrics, guardrails
Full command reference — every flag and subcommand documented