Guide

Import Email into Neo4j for Graph Analysis

Email is naturally a graph. Every message from Alice to Bob is a directed edge. CC adds parallel edges. Calendar meetings create hyperedges connecting multiple people. Import this data from any email provider into Neo4j or NetworkX and query it: shortest introduction paths, bridge connectors, communication clusters, PageRank influence.

Written by Aaron de Mello, Senior Engineering Manager

Reviewed by Hazik

Verified · CLI 3.1.1 · Gmail, Outlook · last tested April 11, 2026

Why email is a graph problem

Email metadata forms a natural directed graph where each message is an edge and each address is a node. Relational databases struggle with path-traversal queries on this structure, while graph databases like Neo4j answer shortest-path and community-detection queries in milliseconds even on graphs with millions of edges.

Every email creates a directed edge. When Alice emails Bob, that’s Alice → Bob. CC adds parallel edges. Reply threads create cycles. Thread IDs group related edges into conversation subgraphs. Calendar meetings are hyperedges connecting multiple people simultaneously. According to the Radicati Group, the average office worker sends and receives 121 emails per day. Over roughly 260 working days, that’s about 31,000 messages per person per year, each contributing at least one edge, which is enough to reveal organizational structure through graph analysis alone.

Graph queries are natural here. Shortest path between two contacts tells you who can introduce you. Betweenness centrality finds bridge connectors. Community detection reveals teams and project groups. A SQL JOIN on a flat contacts table can’t answer “who is 3 hops from me?” without recursive CTEs that scale poorly beyond 2 levels.
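To make the "3 hops" question concrete, here is a minimal sketch in plain Python. It assumes the email graph is already held as an adjacency dict mapping each address to the addresses it has emailed. The function name `within_hops` and the sample addresses are hypothetical. A breadth-first search visits contacts in order of hop count, so it can stop expanding at the limit.

```python
from collections import deque

def within_hops(adjacency, start, max_hops):
    """Return {contact: hop_count} for everyone reachable within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # the starting contact is not its own connection
    return distances

# Hypothetical sample chain: alice -> bob -> carol -> dave -> erin
sample = {
    "alice@a.com": ["bob@a.com"],
    "bob@a.com": ["carol@b.com"],
    "carol@b.com": ["dave@c.com"],
    "dave@c.com": ["erin@d.com"],
}
reachable = within_hops(sample, "alice@a.com", 3)
print(reachable)
```

This sketch follows outgoing edges only. To ignore direction, as an introduction search usually would, add each edge to the adjacency dict in both directions first.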

Graph schema design

The graph schema models email communication as 2 node types (Person, Company) and 4 edge types (EMAILED, CC_ON, WORKS_AT, MET_WITH). This structure keeps queries simple while supporting weighted edges — the weight property on EMAILED edges tracks message frequency, which is the strongest signal for relationship strength.

Separating Person and Company nodes lets you aggregate communication volume at the organizational level. A 500-person company typically produces 4,000–6,000 unique Person nodes and 200–400 Company domain nodes from a 90-day email export. Each node and edge type carries specific properties for downstream queries.

Nodes:
  Person  { email, name, domain }
  Company { domain }

Edges:
  EMAILED    Person -> Person  { date, subject, thread_id, weight }
  CC_ON      Person -> Person  { date, subject }
  WORKS_AT   Person -> Company
  MET_WITH   Person -> Person  { title, date, recurring }
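As an illustration of the schema, the sketch below expands one message into the edges it would contribute. The input shape (a flat dict with `from`, `to`, `cc`, `date`, `subject`, `thread_id`) and the function name `message_to_edges` are assumptions made for this example, as is the choice to point CC_ON from the sender to each CC'd address.

```python
def message_to_edges(msg):
    """Turn one message dict into (edge_type, source, target, properties) tuples."""
    sender = msg["from"]
    shared = {"date": msg["date"], "subject": msg["subject"]}
    edges = []
    for recipient in msg.get("to", []):
        # weight starts at 1 per message; aggregate later to get frequency
        edges.append(("EMAILED", sender, recipient,
                      {**shared, "thread_id": msg["thread_id"], "weight": 1}))
    for recipient in msg.get("cc", []):
        edges.append(("CC_ON", sender, recipient, dict(shared)))
    # WORKS_AT links every address once to the Company node for its domain
    people = dict.fromkeys([sender, *msg.get("to", []), *msg.get("cc", [])])
    for person in people:
        edges.append(("WORKS_AT", person, person.rpartition("@")[2], {}))
    return edges

# Hypothetical sample message
sample_message = {
    "from": "alice@acme.com",
    "to": ["bob@globex.com"],
    "cc": ["carol@acme.com"],
    "date": "2026-04-01",
    "subject": "Kickoff",
    "thread_id": "t1",
}
for edge in message_to_edges(sample_message):
    print(edge)
```

One message with one To and one CC recipient yields one EMAILED edge, one CC_ON edge, and three WORKS_AT edges.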

Export email data as edge lists

Exporting email as edge lists means extracting the sender, recipients, CC list, date, subject, and thread ID from each message into a flat JSON structure that any graph tool can ingest. Nylas CLI normalizes this across Gmail, Outlook, and Yahoo into a single JSON schema, so the export step is identical regardless of provider.

The --limit 1000 flag caps the export at 1,000 messages, which typically produces 2,000–5,000 edges depending on how many recipients each message has. The jq pipeline extracts only graph-relevant fields, reducing a 4 MB raw export to roughly 400 KB of edge data.

# Export structured edge data
nylas email list --json --limit 1000 > emails.json

# Extract graph-relevant fields (null-safe for messages with no To or CC)
jq '[.[] | {
  id: .id,
  from: .from[0].email,
  from_name: .from[0].name,
  to: [(.to // [])[].email],
  cc: [(.cc // [])[].email],
  date: .date,
  subject: .subject,
  thread_id: .thread_id
}]' emails.json > edges.json
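To check how many edges an export actually produces, a short Python sketch can flatten the records in edges.json into sender-recipient pairs and count repeats. It assumes the record shape written by the jq step (`from` as a string, `to` and `cc` as lists of addresses). The function name `count_edges` and the inline sample are hypothetical.

```python
import json
from collections import Counter

def count_edges(messages):
    """Map each (sender, recipient) pair to the number of messages between them."""
    weights = Counter()
    for msg in messages:
        sender = msg.get("from")
        if not sender:
            continue  # skip records with no sender
        for recipient in (msg.get("to") or []) + (msg.get("cc") or []):
            weights[(sender, recipient)] += 1
    return weights

# In practice: messages = json.load(open("edges.json"))
sample_json = """[
  {"from": "a@x.com", "to": ["b@x.com", "c@y.com"], "cc": ["d@y.com"]},
  {"from": "a@x.com", "to": ["b@x.com"], "cc": []}
]"""
weights = count_edges(json.loads(sample_json))
print(f"{sum(weights.values())} edges, {len(weights)} unique pairs")
```

The total is the raw edge count, and the number of unique pairs is what remains once repeated messages are folded into a weight.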

NetworkX: in-memory analysis

NetworkX is a pure-Python graph library that runs entirely in memory with no external database. It includes over 500 built-in algorithms for centrality, PageRank, shortest paths, and community detection. For email graphs under 10,000 nodes, NetworkX processes queries in under 1 second on a standard laptop.

The script calls nylas email list --json --limit 500 via subprocess, parses each message into sender-recipient edges, and increments the weight property on duplicate edges to track frequency. Degree centrality identifies the most-connected contacts, betweenness centrality finds bridge connectors between groups, and PageRank surfaces influential contacts: people who receive email from other well-connected senders.

#!/usr/bin/env python3
"""Import email into NetworkX for graph analysis."""
import json, subprocess
import networkx as nx

result = subprocess.run(
    ["nylas", "email", "list", "--json", "--limit", "500"],
    capture_output=True, text=True, check=True
)
emails = json.loads(result.stdout)

G = nx.DiGraph()
for msg in emails:
    if not msg.get("from"):
        continue  # skip messages with no sender, such as some drafts
    sender = msg["from"][0]["email"]
    # rpartition tolerates malformed addresses that lack an "@"
    G.add_node(sender, name=msg["from"][0].get("name", ""),
               domain=sender.rpartition("@")[2])
    for r in msg.get("to") or []:  # "to" may be missing or null
        to_email = r["email"]
        G.add_node(to_email, name=r.get("name", ""),
                   domain=to_email.rpartition("@")[2])
        if G.has_edge(sender, to_email):
            G[sender][to_email]["weight"] += 1
        else:
            G.add_edge(sender, to_email, weight=1)
    for cc in msg.get("cc") or []:
        cc_email = cc["email"]
        G.add_node(cc_email, domain=cc_email.rpartition("@")[2])
        if G.has_edge(sender, cc_email):
            G[sender][cc_email]["weight"] += 1
        else:
            G.add_edge(sender, cc_email, weight=1, cc=True)

print(f"Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Degree centrality: most-connected contacts
print("\nMost connected:")
for email, score in sorted(nx.degree_centrality(G).items(),
                           key=lambda x: -x[1])[:10]:
    print(f"  {email}: {score:.3f}")

# Betweenness — bridge connectors
print("\nBridge connectors:")
for email, score in sorted(nx.betweenness_centrality(G).items(),
                           key=lambda x: -x[1])[:5]:
    print(f"  {email}: {score:.3f}")

# PageRank
print("\nPageRank (influence):")
for email, score in sorted(nx.pagerank(G).items(),
                           key=lambda x: -x[1])[:10]:
    print(f"  {email}: {score:.4f}")
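To show what the degree centrality scores mean, here is a small standard-library sketch of the same normalization as I understand NetworkX to document it: each node's in-degree plus out-degree, divided by n - 1. The function name and sample edges are hypothetical, and this is an illustration rather than a replacement for the library call.

```python
def degree_centrality(edges):
    """Directed degree centrality: (in-degree + out-degree) / (n - 1)."""
    unique_edges = set(edges)  # a DiGraph keeps one edge per ordered pair
    nodes = {node for edge in unique_edges for node in edge}
    degree = dict.fromkeys(nodes, 0)
    for source, target in unique_edges:
        degree[source] += 1
        degree[target] += 1
    scale = 1 / (len(nodes) - 1) if len(nodes) > 1 else 1
    return {node: count * scale for node, count in degree.items()}

# Hypothetical sample: a and b email each other, a also emails c
scores = degree_centrality([("a", "b"), ("b", "a"), ("a", "c")])
for node in sorted(scores):
    print(node, scores[node])
```

In a directed graph a score can exceed 1.0, because a pair of contacts who email each other contributes two edges to each node's degree.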

Detect communication communities

Community detection groups contacts into clusters that communicate more with each other than with outsiders. The Louvain method, published by Blondel et al. in 2008, greedily optimizes modularity, runs in roughly O(n log n) time in practice, and handles graphs with millions of nodes. On a typical 3,000-node email graph, Louvain identifies 8–15 communities in under 200 milliseconds.

NetworkX includes Louvain via louvain_communities(). The example passes it an undirected graph (converted from the directed email graph with G.to_undirected()), so edges in both directions between two people collapse into a single undirected edge. The conversion keeps the attributes of only one of the two directed edges rather than summing their weights. Louvain is also randomized, so pass a seed argument if you need the same communities on every run. Sorting communities by size surfaces the largest groups first, and aggregating domains within each community shows whether a cluster aligns with a company, a project, or a cross-organizational team.

from networkx.algorithms.community import louvain_communities
from collections import Counter

communities = louvain_communities(G.to_undirected())
print(f"Found {len(communities)} communities")

for i, comm in enumerate(sorted(communities, key=len, reverse=True)[:5]):
    domains = Counter(G.nodes[n].get("domain", "") for n in comm)
    print(f"  Community {i+1} ({len(comm)} people):")
    print(f"    Top domains: {dict(domains.most_common(3))}")
    print(f"    Sample: {sorted(comm)[:3]}")
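Modularity, the quantity Louvain optimizes, can be computed by hand for a small graph. The sketch below uses the standard definition for an undirected, unweighted graph: the sum over communities of L_c / m - (d_c / 2m)^2, where m is the total edge count, L_c the number of edges inside community c, and d_c the sum of degrees in c. The function name and the sample graph are hypothetical.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    community_of = {node: i for i, comm in enumerate(communities) for node in comm}
    internal = [0] * len(communities)    # L_c: edges with both ends in c
    degree_sum = [0] * len(communities)  # d_c: total degree of nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
               for i in range(len(communities)))

# Hypothetical sample: two triangles joined by one bridge edge (c-d)
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(round(q_split, 4), round(q_single, 4))
```

Splitting the two triangles scores higher than lumping everyone together, which is the signal Louvain follows when it merges or separates groups.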

Neo4j: persistent graph database

Neo4j is a native graph database that stores data as nodes and edges on disk, supports the Cypher query language, and includes a browser-based visualization interface at port 7474. Unlike NetworkX, Neo4j handles graphs with tens of millions of nodes and persists data across restarts. According to DB-Engines, Neo4j has ranked as the most popular graph database every month since 2013.

The Docker command maps two ports: 7474 for the Neo4j Browser UI and 7687 for the Bolt protocol that Cypher clients connect to. The NEO4J_AUTH variable sets the initial username and password. A clean Neo4j container uses roughly 500 MB of RAM at idle and grows to 1–2 GB for a 100,000-node email graph.

docker run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/changeme123 \
  neo4j:latest

Cypher import statements

Cypher’s LOAD CSV command imports CSV files into Neo4j by matching each row to a MERGE or CREATE operation. MERGE is idempotent: it creates a node only if one with the same key doesn’t already exist, so re-running the import on updated exports won’t duplicate Person or Company nodes. The EMAILED relationships are written with CREATE, however, so re-importing rows that were already loaded adds duplicate edges. Neo4j’s Cypher planner processes LOAD CSV at roughly 10,000–50,000 rows per second depending on index coverage.

The import script creates Person nodes from a contacts CSV, derives Company nodes from the domain property, links each person to their company with a WORKS_AT edge, and then loads email edges from a separate CSV. The two indexes on Person.email and Company.domain speed up MATCH lookups from O(n) to O(log n), which matters when the email CSV contains tens of thousands of rows.

// Create indexes first so the MERGE and MATCH lookups below can use them
CREATE INDEX FOR (p:Person) ON (p.email);
CREATE INDEX FOR (c:Company) ON (c.domain);

// Create person nodes
LOAD CSV WITH HEADERS FROM 'file:///contacts.csv' AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name, p.domain = row.domain;

// Company nodes from domains (MERGE rejects null keys, so skip people without a domain)
MATCH (p:Person) WHERE p.domain IS NOT NULL
MERGE (c:Company {domain: p.domain})
MERGE (p)-[:WORKS_AT]->(c);

// Email edges
LOAD CSV WITH HEADERS FROM 'file:///emails.csv' AS row
MATCH (s:Person {email: row.from_email})
MATCH (r:Person {email: row.to_email})
CREATE (s)-[:EMAILED {date: row.date, subject: row.subject}]->(r);
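The import statements expect contacts.csv (columns email, name, domain) and emails.csv (columns from_email, to_email, date, subject). One possible way to produce them from the edges.json export is sketched below. It assumes the record shape written by the earlier jq step. The function name `write_import_csvs` and the sample records are hypothetical. CC recipients are added as contacts but not as EMAILED rows, which is a choice made for this example.

```python
import csv
import io

def write_import_csvs(messages, contacts_file, emails_file):
    """Write contacts and email-edge CSVs; return (contact_count, edge_count)."""
    contacts = {}    # email -> display name
    email_rows = []
    for msg in messages:
        sender = msg.get("from")
        if not sender:
            continue
        # Keep a non-empty display name once one has been seen
        contacts[sender] = contacts.get(sender) or msg.get("from_name") or ""
        for recipient in msg.get("to") or []:
            contacts.setdefault(recipient, "")
            email_rows.append({"from_email": sender, "to_email": recipient,
                               "date": msg.get("date"), "subject": msg.get("subject")})
        for recipient in msg.get("cc") or []:
            contacts.setdefault(recipient, "")

    contact_writer = csv.DictWriter(contacts_file, fieldnames=["email", "name", "domain"])
    contact_writer.writeheader()
    for email, name in contacts.items():
        contact_writer.writerow({"email": email, "name": name,
                                 "domain": email.rpartition("@")[2]})

    email_writer = csv.DictWriter(
        emails_file, fieldnames=["from_email", "to_email", "date", "subject"])
    email_writer.writeheader()
    email_writer.writerows(email_rows)
    return len(contacts), len(email_rows)

# Hypothetical sample records, written to in-memory buffers for the demo
sample_messages = [
    {"from": "alice@acme.com", "from_name": "Alice", "to": ["bob@globex.com"],
     "cc": ["carol@acme.com"], "date": "2026-04-01", "subject": "Kickoff"},
    {"from": "bob@globex.com", "from_name": "Bob", "to": ["alice@acme.com"],
     "cc": [], "date": "2026-04-02", "subject": "Re: Kickoff"},
]
contacts_out, emails_out = io.StringIO(), io.StringIO()
counts = write_import_csvs(sample_messages, contacts_out, emails_out)
print(counts)
print(contacts_out.getvalue())
```

For a real import, open the two output files with newline="" and place them where the file:/// URLs resolve, which by default is Neo4j's import directory.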

Useful Cypher queries

Once email data is loaded into Neo4j, Cypher queries answer relationship questions that would require recursive CTEs or application-level code in a relational database. Neo4j’s native graph storage traverses edges in constant time per hop, so a 5-hop shortest-path query on a 50,000-node graph typically completes in under 10 milliseconds.

The queries cover five common use cases: finding the shortest introduction path between two people, ranking the most connected contacts by edge count, measuring company-to-company email volume, identifying bridge connectors who link different organizations, and listing senders who haven’t received a reply. Each query uses only standard Cypher syntax compatible with Neo4j 4.x and 5.x.

// Shortest introduction path
MATCH path = shortestPath(
  (a:Person {email: 'you@company.com'})-[:EMAILED*]-(b:Person {email: 'target@acme.com'})
) RETURN path;

// Most connected contacts
MATCH (p:Person)
WITH p, size([(p)-[:EMAILED]-() | 1]) AS connections
RETURN p.email, p.name, connections ORDER BY connections DESC LIMIT 10;

// Company-to-company volume
MATCH (a:Person)-[e:EMAILED]->(b:Person)
WHERE a.domain <> b.domain
WITH a.domain AS from_co, b.domain AS to_co, count(e) AS volume
RETURN from_co, to_co, volume ORDER BY volume DESC LIMIT 20;

// Bridge connectors
MATCH (a:Person)-[:EMAILED]->(bridge:Person)-[:EMAILED]->(b:Person)
WHERE a.domain <> bridge.domain AND bridge.domain <> b.domain
RETURN bridge.email, count(DISTINCT a.domain) + count(DISTINCT b.domain) AS companies
ORDER BY companies DESC LIMIT 10;

// Unanswered senders
MATCH (them:Person)-[:EMAILED]->(you:Person {email: 'you@company.com'})
WHERE NOT (you)-[:EMAILED]->(them)
RETURN them.email, count(*) AS unanswered ORDER BY unanswered DESC LIMIT 10;
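When sanity-checking query results, it can help to reproduce one query's logic against a plain edge list. The sketch below mirrors the unanswered-senders query under the assumption that edges are (sender, recipient) tuples. The function name and sample addresses are hypothetical.

```python
from collections import Counter

def unanswered_senders(edges, me):
    """Count inbound messages from people `me` has never emailed."""
    replied_to = {target for source, target in edges if source == me}
    counts = Counter(source for source, target in edges
                     if target == me and source not in replied_to)
    return counts.most_common()

# Hypothetical sample: x and z wrote in and got nothing back; y got a reply
me = "you@company.com"
sample_edges = [
    ("x@a.com", me), ("x@a.com", me),
    ("y@b.com", me), (me, "y@b.com"),
    ("z@c.com", me),
]
result = unanswered_senders(sample_edges, me)
print(result)
```

Like the Cypher version, this treats a sender as answered if you have ever emailed them at all, regardless of thread or date. A stricter check would compare thread IDs and timestamps.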

Add calendar MET_WITH edges

Calendar meetings add a second edge type to the graph that captures face-to-face and video relationships distinct from email-only communication. A meeting with 5 attendees produces 10 pairwise MET_WITH edges (the combination C(5,2)), so 200 meetings typically generate 1,000–3,000 meeting edges depending on average attendee count.
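The pairwise arithmetic can be checked with a few lines of Python. This sketch uses itertools.combinations on the sorted, de-duplicated attendee list, so each unordered pair appears once with the alphabetically smaller address first. The function name and sample addresses are hypothetical.

```python
from itertools import combinations
from math import comb

def meeting_pairs(attendees):
    """Return each unordered attendee pair once, smaller address first."""
    return list(combinations(sorted(set(attendees)), 2))

# Hypothetical five-person meeting
attendees = ["eve@a.com", "bob@a.com", "dan@b.com", "amy@a.com", "cat@b.com"]
pairs = meeting_pairs(attendees)
print(len(pairs), comb(len(attendees), 2))  # both are 10
```

At 10 edges per five-person meeting, 200 such meetings would produce 2,000 edges, which sits inside the range quoted above.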

The Nylas CLI exports calendar events with nylas calendar events list --json, which returns organizer, participants, title, timing, and recurrence across Google Calendar, Outlook, and iCloud. The jq pipeline extracts the fields needed for graph import, and the recurring flag lets downstream queries weight standing meetings differently from one-off events.

nylas calendar events list --json --limit 200 | jq '[.[] | {
  organizer: .organizer.email,
  attendees: [(.participants // [])[].email],
  title: .title,
  date: .when.start_time,
  recurring: (.recurrence != null)
}]' > meetings.json

The Cypher import query unwinds the meeting list once and the attendee list twice to generate all pairwise combinations of attendees within each meeting, filtering with a1 < a2 so that each pair appears only once. The combined relationship score weights meetings at 3x the value of emails because, according to research from the Harvard Business Review, a 15-minute face-to-face interaction generates more trust than 20 email exchanges.

// Pairwise meeting edges
UNWIND $meetings AS mtg
UNWIND mtg.attendees AS a1
UNWIND mtg.attendees AS a2
WITH mtg, a1, a2 WHERE a1 < a2
MERGE (p1:Person {email: a1})
MERGE (p2:Person {email: a2})
CREATE (p1)-[:MET_WITH {title: mtg.title, date: mtg.date, recurring: mtg.recurring}]->(p2);

// Combined relationship score (email + meetings)
MATCH (you:Person {email: 'you@company.com'})-[e:EMAILED]-(them:Person)
WITH you, them, count(e) AS emails
OPTIONAL MATCH (you)-[m:MET_WITH]-(them)
WITH them, emails, count(m) AS meetings
RETURN them.email, emails, meetings, emails + meetings * 3 AS score
ORDER BY score DESC LIMIT 20;
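The scoring rule is simple enough to prototype outside the database. The sketch below assumes two dicts of per-contact counts and applies the same emails + meetings * 3 formula. The function name, the `meeting_weight` parameter, and the sample counts are hypothetical.

```python
def relationship_scores(email_counts, meeting_counts, meeting_weight=3):
    """Return (contact, emails, meetings, score) rows, highest score first."""
    rows = []
    for contact, emails in email_counts.items():
        meetings = meeting_counts.get(contact, 0)
        rows.append((contact, emails, meetings, emails + meetings * meeting_weight))
    return sorted(rows, key=lambda row: -row[3])

# Hypothetical counts: d has meetings but no email history
email_counts = {"a@x.com": 10, "b@y.com": 2, "c@z.com": 5}
meeting_counts = {"b@y.com": 4, "d@w.com": 9}
ranked = relationship_scores(email_counts, meeting_counts)
for row in ranked:
    print(row)
```

As in the Cypher query, only contacts with at least one email appear, so someone you meet often but never email is left out. Iterating over the union of both dicts would include them.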

Next steps

With email and calendar data loaded into a graph database, the next step is enriching nodes with external data and applying the graph to specific workflows. Relationship scoring, org chart reconstruction, and contact enrichment each build on the Person-EMAILED-Person foundation created in this guide. The resources here cover those extensions plus the full Cypher and NetworkX reference documentation.