Import Email into a Graph Database

Email is naturally a graph — people are nodes, messages are edges. Every email from Alice to Bob creates a directed edge. CC adds more edges. Calendar meetings create hyperedges connecting multiple people simultaneously. A graph database lets you query these relationships directly: shortest path between two contacts, most influential connector, communication clusters.

Why email is a graph problem

Every email creates a directed edge from sender to recipient: when Alice emails Bob, the edge is Alice → Bob. CC adds further edges from the same message: if Alice emails Bob and CCs Carol, one source event yields two edges. Replies create cycles (Alice → Bob → Alice), and thread IDs group related edges into conversation subgraphs.

Calendar meetings are hyperedges — a single event connects multiple people simultaneously. A recurring weekly standup with five attendees creates a dense cluster that no individual email could represent. When you combine email edges with calendar hyperedges, you get a rich communication graph that captures both asynchronous and synchronous interaction.

Graph queries are natural here. Shortest path between two contacts tells you who can introduce you. Betweenness centrality finds bridge connectors — people who link otherwise disconnected groups. Community detection reveals teams and clusters. PageRank identifies who receives the most attention. Try doing any of that with a SQL JOIN across a flat contacts table. You could write the query, but it would be slow, brittle, and unreadable. Graph databases make these operations first-class.

Export email data as JSON

Start by exporting your recent emails as JSON. The --json flag gives you structured data with sender, recipients, CC, timestamps, and thread IDs. Then use jq to extract just the fields you need for graph construction.

# Export emails as structured JSON
nylas email list --json --limit 1000 > emails.json

# Extract the graph-relevant fields
cat emails.json | jq '[.[] | {
  id: .id,
  from: .from[0].email,
  from_name: .from[0].name,
  to: [(.to // [])[].email],
  cc: [(.cc // [])[].email],
  date: .date,
  subject: .subject,
  thread_id: .thread_id
}]' > email_edges.json

The output is an array of objects, each representing one email with its sender, recipients, CC list, and metadata. The to and cc fields are arrays because a single message can have multiple recipients. Each recipient becomes a separate edge in the graph.
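
The "one edge per recipient" expansion described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the export tooling: the function name iter_edges and the sample addresses are made up, and the records are assumed to have the shape of email_edges.json (a from string plus to and cc lists).

```python
def iter_edges(records):
    """Yield (sender, recipient, kind) for every to/cc address in every record."""
    for rec in records:
        sender = rec.get("from")
        if not sender:
            continue  # no sender, so there is nothing to draw an edge from
        for kind in ("to", "cc"):
            # A missing or null list is treated as "no recipients of this kind"
            for recipient in rec.get(kind) or []:
                yield sender, recipient, kind


# Made-up sample in the assumed shape of email_edges.json
sample = [
    {"from": "alice@example.com", "to": ["bob@example.com"], "cc": ["carol@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com"], "cc": []},
]

edges = list(iter_edges(sample))
for edge in edges:
    print(edge)
```

Keeping the kind ("to" or "cc") on each tuple makes it easy to map the two cases to different edge types later.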

Define the graph schema

Before importing, plan your node and edge types. This schema supports both email and calendar data, with company grouping for organizational queries.

Nodes:
  Person  { email, name, domain, company }
  Company { name, domain }

Edges:
  EMAILED    Person → Person  { date, subject, thread_id }
  CC_ON      Person → Person  { date, subject }
  WORKS_AT   Person → Company
  MET_WITH   Person → Person  { event_title, date, recurring }

Person nodes are identified by email address. The domain field is extracted from the email (everything after the @). Company nodes are identified by domain. EMAILED and CC_ON are separate edge types so you can weight direct communication differently from CC inclusion. MET_WITH edges come from calendar data and represent synchronous interaction.
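
As a rough sketch of the identity rules above, the node types can be modelled with Python dataclasses. The class layout, the works_at helper, and the choice to lower-case the domain are assumptions made for illustration; the schema itself does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    domain: str      # identifying key
    name: str = ""


@dataclass(frozen=True)
class Person:
    email: str       # identifying key
    name: str = ""

    @property
    def domain(self):
        # Everything after the last "@", lower-cased (normalisation is an assumption)
        return self.email.rpartition("@")[2].lower()


def works_at(person):
    """Return the WORKS_AT edge implied by a person's email domain."""
    return person, "WORKS_AT", Company(domain=person.domain)


alice = Person(email="Alice@Example.com", name="Alice")
print(works_at(alice))
```

Because both classes are frozen and keyed on a single field, two records with the same domain produce equal Company values, which mirrors how the database import later deduplicates nodes.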

NetworkX quick start

NetworkX is a Python library for graph analysis. It installs with pip (pip install networkx), requires no external database, and includes algorithms for centrality, community detection, shortest paths, and PageRank. This makes it ideal for quick exploration before committing to a full graph database.

#!/usr/bin/env python3
"""Import email into a NetworkX graph for analysis."""
import json
import subprocess
import networkx as nx
from collections import Counter

# Export emails via Nylas CLI
result = subprocess.run(
    ["nylas", "email", "list", "--json", "--limit", "500"],
    capture_output=True, text=True, check=True
)
emails = json.loads(result.stdout)

# Build directed graph
G = nx.DiGraph()

def add_person(graph, person):
    """Add a Person node keyed by email address and return that address."""
    email = person["email"]
    graph.add_node(email, name=person.get("name", ""), domain=email.split("@")[-1])
    return email

for msg in emails:
    # Skip drafts and malformed records that carry no sender
    if not msg.get("from"):
        continue
    sender = add_person(G, msg["from"][0])

    # "to" and "cc" may be missing or null, so fall back to an empty list
    for field in ("to", "cc"):
        for recipient in msg.get(field) or []:
            target = add_person(G, recipient)
            if not G.has_edge(sender, target):
                G.add_edge(sender, target, weight=0, cc_count=0)
            # weight counts every message on this edge, direct or CC
            G[sender][target]["weight"] += 1
            # cc_count tracks how many of those messages were CCs
            if field == "cc":
                G[sender][target]["cc_count"] += 1

print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Total emails processed: {len(emails)}")

# Most connected contacts (by degree centrality)
centrality = nx.degree_centrality(G)
top_contacts = sorted(centrality.items(), key=lambda x: -x[1])[:10]
print("\nMost connected contacts:")
for email, score in top_contacts:
    print(f"  {email}: {score:.3f}")

# Betweenness centrality (bridge nodes)
betweenness = nx.betweenness_centrality(G)
bridges = sorted(betweenness.items(), key=lambda x: -x[1])[:5]
print("\nBridge connectors:")
for email, score in bridges:
    print(f"  {email}: {score:.3f}")

# PageRank
pagerank = nx.pagerank(G)
top_pr = sorted(pagerank.items(), key=lambda x: -x[1])[:10]
print("\nPageRank (who receives most attention):")
for email, score in top_pr:
    print(f"  {email}: {score:.4f}")

The script builds a directed graph where edge weight represents email frequency. Degree centrality tells you who communicates with the most people. Betweenness centrality finds bridge connectors: people who sit between otherwise disconnected groups. PageRank identifies who receives the most attention, weighted by the importance of the senders. Note that nx.pagerank uses the weight attribute by default, while nx.betweenness_centrality ignores it unless you pass weight= (and then treats it as a distance, not a frequency). Together the three metrics give a useful first picture of influence in your email network, limited to the messages you exported.
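
To make the degree centrality numbers less of a black box, here is a dependency-free sketch of the calculation as I understand NetworkX performs it on a directed graph: in-degree plus out-degree, divided by n - 1. The function and the toy names are illustrative only, so treat this as an approximation of the library's behaviour and not as its implementation.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality for a directed graph given as (source, target) pairs.

    Each distinct directed edge adds one to both endpoints, and the total is
    divided by (n - 1). Scores can exceed 1 when two people email each other
    in both directions, because both edges count.
    """
    nodes = set()
    degree = defaultdict(int)
    for source, target in set(edges):    # repeated emails collapse into one edge
        nodes.update((source, target))
        degree[source] += 1               # contributes to out-degree
        degree[target] += 1               # contributes to in-degree
    if len(nodes) <= 1:
        return {node: 1.0 for node in nodes}
    return {node: degree[node] / (len(nodes) - 1) for node in nodes}


# Toy data: alice and bob email each other, alice also emails carol
toy = [("alice", "bob"), ("bob", "alice"), ("alice", "carol"), ("alice", "bob")]
scores = degree_centrality(toy)
print(scores)
```

The duplicate alice to bob message does not change the score, which is why degree centrality measures breadth of contact and the edge weight measures volume.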

Detect communities

Community detection groups nodes into clusters based on connection density. The Louvain algorithm is fast and works well on email graphs. Each community typically maps to a team, project group, or external company you interact with.

# Community detection with Louvain
from networkx.algorithms.community import louvain_communities
from collections import Counter

communities = louvain_communities(G.to_undirected())
print(f"\nFound {len(communities)} communities:")
for i, community in enumerate(sorted(communities, key=len, reverse=True)[:5]):
    domains = Counter(G.nodes[n].get("domain", "") for n in community)
    top_domains = dict(domains.most_common(3))
    members = sorted(community)[:5]
    print(f"  Community {i+1} ({len(community)} people):")
    print(f"    Top domains: {top_domains}")
    print(f"    Sample members: {members}")

# Find which community a specific person belongs to
# (same size-ordered numbering as the listing above)
target = "partner@acme.com"
for i, community in enumerate(sorted(communities, key=len, reverse=True)):
    if target in community:
        others = [n for n in community if n != target]
        print(f"\n{target} is in community {i+1} with {len(others)} others")
        print(f"  Shares community with: {others[:10]}")
        break

The domain breakdown for each community tells you what it represents. A community dominated by acme.com addresses is your relationship with Acme. A community with mixed domains from your own company and an external vendor is likely a project team. Communities with mostly freemail addresses are personal contacts.
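
The reading of domain breakdowns described above can be turned into a rough heuristic. The sketch below is only one possible encoding: the FREEMAIL set, the 0.6 threshold, the describe_community name, and the "internal team" label are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical, deliberately short list; extend it for your own data
FREEMAIL = {"gmail.com", "outlook.com", "yahoo.com"}


def describe_community(members, own_domain, threshold=0.6):
    """Guess what a community represents from its members' email domains."""
    domains = Counter(m.rpartition("@")[2].lower() for m in members)
    total = sum(domains.values())
    if total == 0:
        return "empty"
    top_domain, top_count = domains.most_common(1)[0]
    freemail_share = sum(c for d, c in domains.items() if d in FREEMAIL) / total
    if freemail_share >= threshold:
        return "personal contacts"
    if top_count / total >= threshold:
        if top_domain == own_domain:
            return "internal team"
        return f"relationship with {top_domain}"
    return "mixed project group"


print(describe_community(
    ["a@acme.com", "b@acme.com", "c@acme.com", "me@yourcompany.com"],
    own_domain="yourcompany.com",
))
```

A label like this is a starting point for manual review, not a classification to rely on; small communities in particular can flip category with one extra member.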

Visualize with matplotlib

A visual representation makes patterns immediately obvious. Color nodes by domain to see company clusters. Edge thickness by weight shows communication intensity. Spring layout positions densely connected nodes closer together.

import matplotlib.pyplot as plt

plt.figure(figsize=(16, 12))
pos = nx.spring_layout(G, k=0.5, iterations=50)

# Color nodes by domain
domains = list(set(nx.get_node_attributes(G, "domain").values()))
domain_colors = {d: plt.cm.tab20(i / len(domains)) for i, d in enumerate(domains)}
colors = [domain_colors.get(G.nodes[n].get("domain", ""), "gray") for n in G.nodes()]

# Scale node size by degree
sizes = [30 + 20 * G.degree(n) for n in G.nodes()]

# Scale edge width by weight
weights = [G[u][v].get("weight", 1) * 0.3 for u, v in G.edges()]

nx.draw(
    G, pos,
    node_color=colors,
    node_size=sizes,
    edge_color="gray",
    alpha=0.7,
    arrows=True,
    arrowsize=5,
    width=weights,
)

# Label high-degree nodes
high_degree = {n: n.split("@")[0] for n in G.nodes() if G.degree(n) > 5}
nx.draw_networkx_labels(G, pos, high_degree, font_size=6, font_color="black")

plt.title("Email Communication Graph")
plt.savefig("email_graph.png", dpi=150, bbox_inches="tight")
print("Saved to email_graph.png")

The visualization works best with 50–200 nodes. For larger graphs, filter to a specific time range or domain subset before plotting. You can also export to GEXF format for interactive exploration in Gephi:

# Export to GEXF for Gephi
nx.write_gexf(G, "email_graph.gexf")
print("Open email_graph.gexf in Gephi for interactive exploration")
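
Filtering to a time range or domain subset, as suggested above for larger graphs, can happen on the exported records before the graph is built. The sketch below assumes each record's date field is a Unix timestamp in seconds, which may not match your export; the filter_records name and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def filter_records(records, since=None, domains=None):
    """Keep records sent at or after `since` whose sender domain is in `domains`."""
    kept = []
    for rec in records:
        # ASSUMPTION: "date" is epoch seconds; adapt this line for ISO strings
        sent = datetime.fromtimestamp(rec["date"], tz=timezone.utc)
        if since is not None and sent < since:
            continue
        if domains is not None and rec["from"].rpartition("@")[2] not in domains:
            continue
        kept.append(rec)
    return kept


# Made-up records in the assumed shape of email_edges.json
sample = [
    {"from": "alice@acme.com", "date": 1672531200},   # 2023-01-01 UTC
    {"from": "bob@acme.com", "date": 1704067200},     # 2024-01-01 UTC
    {"from": "carol@other.org", "date": 1704153600},  # 2024-01-02 UTC
]

recent = filter_records(
    sample,
    since=datetime(2023, 6, 1, tzinfo=timezone.utc),
    domains={"acme.com"},
)
print([rec["from"] for rec in recent])
```

Rebuilding the graph from the filtered list keeps the plot within a readable node count without touching the full export.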

Neo4j import

NetworkX is great for exploration, but if you want persistent storage, full-text search on properties, or a visual query interface, Neo4j is a widely used option. Start a local instance with Docker.

docker run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password123 \
  neo4j:latest

Open http://localhost:7474 in your browser to access the Neo4j Browser. First, prepare CSV files from your email export, then load them with Cypher.

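# Generate CSV files for Neo4j import.
# Sender names come from from_name; addresses seen only as recipients get an empty name.
echo '"email","name","domain"' > contacts.csv
jq -r '
  [.[] | {email: .from, name: (.from_name // "")},
         (.to[] | {email: ., name: ""})]
  | group_by(.email)[]
  | [.[0].email,
     (map(.name | select(. != "")) | first // ""),
     (.[0].email | split("@")[1])]
  | @csv' email_edges.json >> contacts.csv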

# Generate email edges CSV
cat email_edges.json | jq -r '
  .[] | . as $msg | .to[] |
  [$msg.from, ., $msg.date, $msg.subject, $msg.thread_id]
  | @csv' > emails.csv

echo '"from_email","to_email","date","subject","thread_id"' | cat - emails.csv > tmp && mv tmp emails.csv

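Copy the CSV files into the Neo4j import directory, then run the Cypher import. In the official Docker image that directory is /var/lib/neo4j/import, so for the container started above docker cp contacts.csv neo4j:/var/lib/neo4j/import/ (and the same for emails.csv) puts the files where file:/// URLs can find them.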

// Create indexes first so the MERGE and MATCH lookups below stay fast
CREATE INDEX FOR (p:Person) ON (p.email);
CREATE INDEX FOR (c:Company) ON (c.domain);

// Create person nodes
LOAD CSV WITH HEADERS FROM 'file:///contacts.csv' AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name, p.domain = row.domain;

// Create company nodes from person domains
MATCH (p:Person)
MERGE (c:Company {domain: p.domain})
MERGE (p)-[:WORKS_AT]->(c);

// Create email edges (thread_id is kept so conversations can be grouped later)
LOAD CSV WITH HEADERS FROM 'file:///emails.csv' AS row
MATCH (sender:Person {email: row.from_email})
MATCH (recipient:Person {email: row.to_email})
CREATE (sender)-[:EMAILED {date: row.date, subject: row.subject, thread_id: row.thread_id}]->(recipient);

The MERGE operation ensures nodes are created only once, even if the same email address appears in multiple messages. The WORKS_AT edge connects each person to their company node based on domain. Indexes on email and domain make subsequent queries fast.

Useful Cypher queries

Once your data is in Neo4j, these queries answer the most common relationship questions. Each query runs in the Neo4j Browser or via the cypher-shell CLI.

// Shortest path between two people
MATCH path = shortestPath(
  (a:Person {email: 'alice@yourcompany.com'})-[:EMAILED*]-(b:Person {email: 'target@acme.com'})
)
RETURN path;

// Most influential connector (highest degree)
MATCH (p:Person)
WITH p, size([(p)-[:EMAILED]-() | 1]) as connections
RETURN p.email, p.name, connections
ORDER BY connections DESC LIMIT 10;

// Company-to-company communication volume
MATCH (a:Person)-[e:EMAILED]->(b:Person)
WHERE a.domain <> b.domain
WITH a.domain AS from_company, b.domain AS to_company, count(e) AS volume
RETURN from_company, to_company, volume
ORDER BY volume DESC LIMIT 20;

// People who bridge two companies
MATCH (a:Person)-[:EMAILED]->(bridge:Person)-[:EMAILED]->(b:Person)
WHERE a.domain <> bridge.domain
  AND bridge.domain <> b.domain
  AND a.domain <> b.domain
RETURN bridge.email, bridge.name,
       count(DISTINCT a.domain) + count(DISTINCT b.domain) AS companies_connected
ORDER BY companies_connected DESC LIMIT 10;

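// Communication over time (weekly)
// date() expects an ISO 8601 date string; convert first if your export stores epoch seconds
MATCH ()-[e:EMAILED]->()
WITH date(e.date) AS d, count(*) AS emails
RETURN d.weekYear AS year, d.week AS week, sum(emails) AS total
ORDER BY year, week;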

// Reciprocity check — who emails you but you never reply to
MATCH (them:Person)-[:EMAILED]->(you:Person {email: 'you@yourcompany.com'})
WHERE NOT (you)-[:EMAILED]->(them)
RETURN them.email, them.name, count(*) AS unanswered
ORDER BY unanswered DESC LIMIT 10;

The shortest path query is particularly powerful for sales and business development. If you want to reach someone at a target company, it tells you exactly who in your network can make the introduction, and how many hops away they are. The bridge query finds people who connect otherwise separate groups — these are your most valuable connectors.
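
For intuition about what the shortest path query computes, here is a breadth-first search over a plain edge list. It ignores edge direction, as the undirected pattern in the Cypher query does. The introduction_path name and the sample addresses are illustrative, and this is a sketch of the idea, not of how Neo4j executes shortestPath internally.

```python
from collections import defaultdict, deque


def introduction_path(edges, start, goal):
    """Shortest chain of contacts from start to goal, ignoring edge direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    previous = {start: None}   # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours[node]:
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None  # no chain of emails connects the two people


edges = [
    ("alice@yourcompany.com", "dave@yourcompany.com"),
    ("dave@yourcompany.com", "erin@partner.io"),
    ("target@acme.com", "erin@partner.io"),
    ("alice@yourcompany.com", "frank@yourcompany.com"),
]
path = introduction_path(edges, "alice@yourcompany.com", "target@acme.com")
print(" -> ".join(path))
```

The number of hops is len(path) - 1, and the people in the middle of the list are the candidates to ask for an introduction.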

Add calendar edges

Email captures asynchronous communication, but meetings capture synchronous interaction. Adding calendar data to your graph creates a much richer picture. People who meet regularly have stronger relationships than people who only exchange occasional emails.

# Export calendar events
nylas calendar events list --json --limit 200 | jq '[.[] | {
  organizer: .organizer.email,
  attendees: [.participants[].email],
  title: .title,
  date: .when.start_time,
  recurring: (.recurrence != null)
}]' > meetings.json

Import the meeting data as MET_WITH edges in Neo4j. Each meeting creates one edge per pair of attendees, and each edge records whether the meeting is recurring.

// Import meeting edges (run from a Python script that reads meetings.json)
// For each meeting, create pairwise MET_WITH edges between all attendees

UNWIND $meetings AS meeting
UNWIND meeting.attendees AS a1
UNWIND meeting.attendees AS a2
WITH meeting, a1, a2 WHERE a1 < a2
MERGE (p1:Person {email: a1})
MERGE (p2:Person {email: a2})
CREATE (p1)-[:MET_WITH {
  event_title: meeting.title,
  date: meeting.date,
  recurring: meeting.recurring
}]->(p2);
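
The Python script mentioned in the query comments might look like the sketch below. It assumes the official neo4j driver package is installed and that the connection details match the Docker command used earlier; the function names are hypothetical. The meeting_pairs helper is not needed by the import itself and is there only to show which pairs the a1 < a2 filter keeps.

```python
from itertools import combinations

# Same statement as above, passed to the driver with $meetings as a parameter
IMPORT_QUERY = """
UNWIND $meetings AS meeting
UNWIND meeting.attendees AS a1
UNWIND meeting.attendees AS a2
WITH meeting, a1, a2 WHERE a1 < a2
MERGE (p1:Person {email: a1})
MERGE (p2:Person {email: a2})
CREATE (p1)-[:MET_WITH {
  event_title: meeting.title,
  date: meeting.date,
  recurring: meeting.recurring
}]->(p2)
"""


def meeting_pairs(meeting):
    """Return each unordered attendee pair once, smaller address first."""
    attendees = sorted({a for a in meeting.get("attendees", []) if a})
    return list(combinations(attendees, 2))


def load_meetings(path="meetings.json", uri="bolt://localhost:7687",
                  auth=("neo4j", "password123")):
    """Send the whole meeting list to Neo4j as the $meetings parameter."""
    import json
    from neo4j import GraphDatabase  # third-party driver: pip install neo4j

    with open(path) as fh:
        meetings = json.load(fh)
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            session.run(IMPORT_QUERY, meetings=meetings).consume()


standup = {"title": "Standup", "attendees": ["c@x.com", "a@x.com", "b@x.com"]}
print(meeting_pairs(standup))
```

Call load_meetings() once the database is running. A meeting with n distinct attendees produces n * (n - 1) / 2 edges, so a five-person standup adds ten MET_WITH edges per occurrence.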

Now you can query for relationship patterns that span both channels. The most interesting query: who do you meet with but never email?

// People you meet with but never email
MATCH (you:Person {email: 'you@yourcompany.com'})-[:MET_WITH]-(them:Person)
WHERE NOT (you)-[:EMAILED]->(them)
  AND NOT (them)-[:EMAILED]->(you)
RETURN them.email, them.name, count(*) AS meetings
ORDER BY meetings DESC;

// Strongest relationships (both email and meetings)
MATCH (you:Person {email: 'you@yourcompany.com'})-[e:EMAILED]-(them:Person)
WITH you, them, count(e) AS email_count
OPTIONAL MATCH (you)-[m:MET_WITH]-(them)
WITH them, email_count, count(m) AS meeting_count
RETURN them.email, them.name, email_count, meeting_count,
       email_count + meeting_count * 3 AS relationship_score
ORDER BY relationship_score DESC LIMIT 20;

The relationship score weights meetings 3x higher than individual emails because a meeting represents a deliberate time investment. Adjust this multiplier based on your communication patterns. Some organizations communicate primarily through email; others are meeting-heavy.
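
To see how the multiplier changes the ranking, the same formula can be tried outside the database. The function below mirrors the expression in the query; the contact names and counts are invented for illustration.

```python
def relationship_score(email_count, meeting_count, meeting_weight=3):
    """Weighted sum from the query: emails count once, meetings meeting_weight times."""
    return email_count + meeting_count * meeting_weight


# Invented (email_count, meeting_count) pairs
contacts = {
    "bob@example.com": (40, 0),
    "carol@example.com": (5, 12),
    "dave@example.com": (10, 8),
}

ranked = sorted(contacts, key=lambda c: relationship_score(*contacts[c]), reverse=True)
for contact in ranked:
    print(contact, relationship_score(*contacts[contact]))
```

With the default weight of 3, a contact with few emails but many meetings can outrank a heavy email correspondent; lowering meeting_weight to 1 reverses that in this sample.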

Next steps

You now have a complete email communication graph that you can query for shortest paths, community structure, bridge connectors, and relationship strength. The next guide in this series takes the graph further.