Guide

Model Email as a Graph: Neo4j, Cypher, and Network Analysis

Email is naturally a graph. Every message from Alice to Bob is a directed edge. CC adds parallel edges. Calendar meetings create hyperedges connecting multiple people. Import this data from Gmail, Outlook, Exchange, Yahoo, iCloud, and IMAP into Neo4j or NetworkX and query it: shortest introduction paths, bridge connectors, communication clusters, PageRank influence.

By Aaron de Mello

Why email is a graph problem

Every email creates a directed edge: when Alice emails Bob, that’s Alice → Bob. Each CC'd recipient adds another edge from the sender. Replies add edges in the opposite direction, so threads form cycles, and thread IDs group related edges into conversation subgraphs. A calendar meeting is a hyperedge that connects several people at once.

Graph queries are natural here. Shortest path between two contacts tells you who can introduce you. Betweenness centrality finds bridge connectors. Community detection reveals teams and project groups. Try doing any of that with SQL JOINs on a flat contacts table.
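To make the shortest-path idea concrete, here is a minimal sketch that uses only the Python standard library. The function name introduction_path and the sample addresses are hypothetical. It treats an email in either direction as a connection and runs a breadth-first search, which returns a path with the fewest hops.

```python
from collections import deque

def introduction_path(messages, start, target):
    """Return the shortest chain of contacts linking start to target, or None."""
    # Build an undirected adjacency map: an email in either direction counts as a tie.
    neighbors = {}
    for sender, recipient in messages:
        neighbors.setdefault(sender, set()).add(recipient)
        neighbors.setdefault(recipient, set()).add(sender)

    # Breadth-first search explores contacts one hop at a time.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for person in sorted(neighbors.get(path[-1], ())):
            if person not in seen:
                seen.add(person)
                queue.append(path + [person])
    return None

# Hypothetical sample data: (sender, recipient) pairs
messages = [
    ("you@company.com", "alice@company.com"),
    ("alice@company.com", "bob@partner.io"),
    ("bob@partner.io", "target@acme.com"),
    ("carol@company.com", "you@company.com"),
]
print(introduction_path(messages, "you@company.com", "target@acme.com"))
```

Every address between the first and last entries of the returned list is a person who could make the introduction.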

Graph schema design

Nodes:
  Person  { email, name, domain }
  Company { domain }

Edges:
  EMAILED    Person -> Person  { date, subject, thread_id, weight }
  CC_ON      Person -> Person  { date, subject }
  WORKS_AT   Person -> Company
  MET_WITH   Person -> Person  { title, date, recurring }

Export email data as edge lists

# Export structured edge data
nylas email list --json --limit 1000 > emails.json

# Extract graph-relevant fields
cat emails.json | jq '[.[] | {
  id: .id,
  from: .from[0].email,
  from_name: .from[0].name,
  to: [(.to // [])[].email],
  cc: [(.cc // [])[].email],
  date: .date,
  subject: .subject,
  thread_id: .thread_id
}]' > edges.json

NetworkX: in-memory analysis

NetworkX installs with a single pip install networkx, needs no database server, and includes algorithms for centrality, PageRank, and community detection. It is a good fit for graphs under roughly 10,000 nodes.

#!/usr/bin/env python3
"""Import email into NetworkX for graph analysis."""
import json, subprocess
import networkx as nx

result = subprocess.run(
    ["nylas", "email", "list", "--json", "--limit", "500"],
    capture_output=True, text=True, check=True
)
emails = json.loads(result.stdout)

G = nx.DiGraph()
for msg in emails:
    if not msg.get("from"):
        continue  # skip messages with no sender recorded
    sender = msg["from"][0]["email"]
    G.add_node(sender, name=msg["from"][0].get("name", ""),
               domain=sender.split("@")[-1])
    # Each To recipient adds (or strengthens) a sender -> recipient edge
    for r in msg.get("to") or []:
        to_email = r["email"]
        G.add_node(to_email, name=r.get("name", ""),
                   domain=to_email.split("@")[-1])
        if G.has_edge(sender, to_email):
            G[sender][to_email]["weight"] += 1
        else:
            G.add_edge(sender, to_email, weight=1)
    # CC recipients count toward the same edge weight
    for cc in msg.get("cc") or []:
        cc_email = cc["email"]
        G.add_node(cc_email, domain=cc_email.split("@")[-1])
        if G.has_edge(sender, cc_email):
            G[sender][cc_email]["weight"] += 1
        else:
            G.add_edge(sender, cc_email, weight=1, cc=True)

print(f"Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Degree centrality
for email, score in sorted(nx.degree_centrality(G).items(),
                           key=lambda x: -x[1])[:10]:
    print(f"  {email}: {score:.3f}")

# Betweenness — bridge connectors
print("\nBridge connectors:")
for email, score in sorted(nx.betweenness_centrality(G).items(),
                           key=lambda x: -x[1])[:5]:
    print(f"  {email}: {score:.3f}")

# PageRank
print("\nPageRank (influence):")
for email, score in sorted(nx.pagerank(G).items(),
                           key=lambda x: -x[1])[:10]:
    print(f"  {email}: {score:.4f}")

Detect communication communities

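# Reuses the graph G built in the script above
from networkx.algorithms.community import louvain_communities
from collections import Counter

# A fixed seed makes the partition repeatable between runs
communities = louvain_communities(G.to_undirected(), seed=42)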
print(f"Found {len(communities)} communities")

for i, comm in enumerate(sorted(communities, key=len, reverse=True)[:5]):
    domains = Counter(G.nodes[n].get("domain", "") for n in comm)
    print(f"  Community {i+1} ({len(comm)} people):")
    print(f"    Top domains: {dict(domains.most_common(3))}")
    print(f"    Sample: {sorted(comm)[:3]}")

Neo4j: persistent graph database

For datasets over 10,000 nodes, or when you need persistent storage and a visual query interface, use Neo4j. Replace the placeholder password below with your own:

docker run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/changeme123 \
  neo4j:latest

Cypher import statements

// Create indexes first so the MERGE and MATCH lookups below stay fast
CREATE INDEX FOR (p:Person) ON (p.email);
CREATE INDEX FOR (c:Company) ON (c.domain);

// Create person nodes
LOAD CSV WITH HEADERS FROM 'file:///contacts.csv' AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name, p.domain = row.domain;

// Company nodes from domains (MERGE rejects null keys, so skip people without a domain)
MATCH (p:Person) WHERE p.domain IS NOT NULL
MERGE (c:Company {domain: p.domain})
MERGE (p)-[:WORKS_AT]->(c);

// Email edges: one EMAILED relationship per message and recipient
LOAD CSV WITH HEADERS FROM 'file:///emails.csv' AS row
MATCH (s:Person {email: row.from_email})
MATCH (r:Person {email: row.to_email})
CREATE (s)-[:EMAILED {date: row.date, subject: row.subject}]->(r);
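
The LOAD CSV statements above expect contacts.csv (email, name, domain) and emails.csv (from_email, to_email, date, subject), but the export step produces JSON. The following is one possible sketch of the conversion, assuming records shaped like the edges.json output. The helper names build_rows and to_csv_text and the sample records are hypothetical. CC recipients are left out because the import only creates EMAILED edges.

```python
import csv
import io

def build_rows(records):
    """Turn edges.json-style records into contact rows and per-recipient email rows."""
    contacts = {}    # email -> row, in first-seen order
    email_rows = []

    def remember(address, name=""):
        row = contacts.setdefault(address, {
            "email": address,
            "name": "",
            "domain": address.split("@")[-1] if "@" in address else "",
        })
        if name and not row["name"]:
            row["name"] = name  # keep the first non-empty display name

    for rec in records:
        sender = rec.get("from")
        if not sender:
            continue  # no sender, nothing to link
        remember(sender, rec.get("from_name") or "")
        for recipient in rec.get("to") or []:
            remember(recipient)
            email_rows.append({
                "from_email": sender,
                "to_email": recipient,
                "date": rec.get("date"),
                "subject": rec.get("subject") or "",
            })
    return list(contacts.values()), email_rows

def to_csv_text(fieldnames, rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical sample in the shape of edges.json
records = [
    {"from": "alice@company.com", "from_name": "Alice",
     "to": ["bob@acme.com", "carol@company.com"], "cc": [],
     "date": 1700000000, "subject": "Kickoff", "thread_id": "t1"},
    {"from": "bob@acme.com", "from_name": "Bob",
     "to": ["alice@company.com"], "cc": None,
     "date": 1700003600, "subject": "Re: Kickoff", "thread_id": "t1"},
]
contacts, email_rows = build_rows(records)
contacts_csv = to_csv_text(["email", "name", "domain"], contacts)
emails_csv = to_csv_text(["from_email", "to_email", "date", "subject"], email_rows)
print(contacts_csv)
print(emails_csv)
```

In practice you would read the records from edges.json with json.load and write the two strings to contacts.csv and emails.csv inside Neo4j's import directory, which is where file:/// URLs resolve by default.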

Useful Cypher queries

// Shortest introduction path
MATCH path = shortestPath(
  (a:Person {email: 'you@company.com'})-[:EMAILED*]-(b:Person {email: 'target@acme.com'})
) RETURN path;

// Most connected contacts (distinct correspondents, not message count)
MATCH (p:Person)-[:EMAILED]-(other:Person)
WITH p, count(DISTINCT other) AS connections
RETURN p.email, p.name, connections ORDER BY connections DESC LIMIT 10;

// Company-to-company volume
MATCH (a:Person)-[e:EMAILED]->(b:Person)
WHERE a.domain <> b.domain
WITH a.domain AS from_co, b.domain AS to_co, count(e) AS volume
RETURN from_co, to_co, volume ORDER BY volume DESC LIMIT 20;

// Bridge connectors
MATCH (a:Person)-[:EMAILED]->(bridge:Person)-[:EMAILED]->(b:Person)
WHERE a.domain <> bridge.domain AND bridge.domain <> b.domain
RETURN bridge.email, count(DISTINCT a.domain) + count(DISTINCT b.domain) AS companies
ORDER BY companies DESC LIMIT 10;

// Unanswered senders
MATCH (them:Person)-[:EMAILED]->(you:Person {email: 'you@company.com'})
WHERE NOT (you)-[:EMAILED]->(them)
RETURN them.email, count(*) AS unanswered ORDER BY unanswered DESC LIMIT 10;

Add calendar MET_WITH edges

nylas calendar events list --json --limit 200 | jq '[.[] | {
  organizer: .organizer.email,
  attendees: [.participants[].email],
  title: .title,
  date: .when.start_time,
  recurring: (.recurrence != null)
}]' > meetings.json

// Pairwise meeting edges
UNWIND $meetings AS mtg
UNWIND mtg.attendees AS a1
UNWIND mtg.attendees AS a2
WITH mtg, a1, a2 WHERE a1 < a2
MERGE (p1:Person {email: a1})
MERGE (p2:Person {email: a2})
CREATE (p1)-[:MET_WITH {title: mtg.title, date: mtg.date, recurring: mtg.recurring}]->(p2);
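
The pairwise query above reads a $meetings parameter, which has to be supplied by a client. Below is a rough sketch of one way to do that with the official neo4j Python driver, assuming the meetings.json layout from the export step and the connection details from the Docker command earlier. The helper names clean_meetings, attendee_pairs, and load_meetings are hypothetical. The cleaning step drops missing attendee emails, because MERGE fails on a null key, and attendee_pairs shows in plain Python what the double UNWIND with a1 < a2 produces.

```python
from itertools import combinations

PAIRWISE_QUERY = """
UNWIND $meetings AS mtg
UNWIND mtg.attendees AS a1
UNWIND mtg.attendees AS a2
WITH mtg, a1, a2 WHERE a1 < a2
MERGE (p1:Person {email: a1})
MERGE (p2:Person {email: a2})
CREATE (p1)-[:MET_WITH {title: mtg.title, date: mtg.date, recurring: mtg.recurring}]->(p2)
"""

def clean_meetings(meetings):
    """Drop missing and duplicate attendee emails; skip meetings with fewer than two people."""
    cleaned = []
    for mtg in meetings:
        attendees = sorted({a for a in mtg.get("attendees") or [] if a})
        if len(attendees) >= 2:
            cleaned.append({**mtg, "attendees": attendees})
    return cleaned

def attendee_pairs(meeting):
    """Each unordered pair of attendees once, mirroring the a1 < a2 filter."""
    return list(combinations(sorted(meeting["attendees"]), 2))

def load_meetings(meetings, uri="bolt://localhost:7687",
                  user="neo4j", password="changeme123"):
    """Send cleaned meetings to Neo4j as the $meetings parameter."""
    from neo4j import GraphDatabase  # assumes: pip install neo4j
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            session.run(PAIRWISE_QUERY, meetings=meetings).consume()
    finally:
        driver.close()

# Hypothetical sample in the shape of meetings.json
meetings = [
    {"organizer": "alice@company.com",
     "attendees": ["carol@company.com", "alice@company.com", None,
                   "bob@acme.com", "alice@company.com"],
     "title": "Sync", "date": 1700000000, "recurring": False},
    {"organizer": "dave@company.com", "attendees": ["dave@company.com"],
     "title": "Focus time", "date": 1700007200, "recurring": True},
]
cleaned = clean_meetings(meetings)
for mtg in cleaned:
    print(mtg["title"], attendee_pairs(mtg))

# With a running Neo4j instance: load_meetings(cleaned)
```

A meeting with n attendees expands to n(n-1)/2 MET_WITH edges, so large all-hands events can dominate the graph. Capping attendee count before loading is one option.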

// Combined relationship score (email + meetings)
MATCH (you:Person {email: 'you@company.com'})-[e:EMAILED]-(them:Person)
WITH you, them, count(e) AS emails
OPTIONAL MATCH (you)-[m:MET_WITH]-(them)
WITH them, emails, count(m) AS meetings
RETURN them.email, emails, meetings, emails + meetings * 3 AS score
ORDER BY score DESC LIMIT 20;
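
The score in the query above is a weighted sum in which each meeting counts as three emails. As a small illustration of how that weighting changes a ranking, here is a sketch with hypothetical counts. The function name relationship_score and the meeting_weight parameter are assumptions for the example; the value 3 comes from the query.

```python
def relationship_score(emails, meetings, meeting_weight=3):
    """Same formula as the Cypher query: emails + meetings * 3."""
    return emails + meetings * meeting_weight

# Hypothetical (email count, meeting count) per contact
counts = {
    "alice@company.com": (12, 2),
    "bob@acme.com": (15, 0),
    "carol@company.com": (3, 6),
}
ranked = sorted(counts, key=lambda c: relationship_score(*counts[c]), reverse=True)
for contact in ranked:
    emails, meetings = counts[contact]
    print(contact, relationship_score(emails, meetings))
```

Here the contact with the fewest emails ranks first because of frequent meetings. Tune the weight to match how much a meeting should count relative to a message in your own data.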

Next steps