Handle Email API Outages: Backoff and Queues
Every API has bad minutes. A 99.99% SLA still allows 52.6 minutes of failure a year, and your script doesn't get to choose which minutes. This guide covers the client side of outage survival: telling a platform incident apart from a local misconfiguration in under 30 seconds, backing off exponentially instead of hammering a struggling endpoint, spooling outbound mail to disk, and flushing the spool when service returns.
Written by Prem Keshari, Senior SRE
Command references used in this guide: nylas doctor, nylas email send, nylas email list, and nylas audit logs show.
How do you tell an outage from a local misconfiguration?
Outage detection is a two-signal check: your own diagnostics and the platform's status page. The nylas doctor --json command runs its 5 diagnostic checks in a few seconds and reports per-check status with measured API latency. If doctor shows your config healthy but API calls still fail, check status.nylas.com before touching anything local.
# Classify the failure (a doctor that can't run counts as unhealthy)
FAILING=$(nylas doctor --json | \
  jq '[.checks[] | select(.status != "ok")] | length' || echo unknown)
FAILING=${FAILING:-unknown}
if [ "$FAILING" = "0" ]; then
  echo "local setup healthy; suspect a platform or provider incident"
  echo "check https://status.nylas.com"
else
  nylas doctor --json | jq '.checks[] | select(.status != "ok")'
fi
The two signals split the debugging tree cleanly. If a local check fails, fix credentials or network; no amount of waiting helps. If local checks pass and the status page reports an incident, stop debugging and start queueing. That classification takes under 30 seconds and prevents the classic outage mistake of rotating working credentials in a panic.
How does exponential backoff work in a shell script?
Exponential backoff doubles the wait after each failed attempt: 1, 2, 4, 8 seconds, and so on, up to a cap. The doubling matters during an outage because thousands of clients retrying on a fixed interval re-overload the recovering service in synchronized waves. Adding random jitter desynchronizes your retries from everyone else's; the AWS Architecture Blog analysis of backoff and jitter found that jittered backoff cuts both total client work and time-to-success compared to plain exponential backoff.
# Exponential backoff with jitter, capped at 300s
send_with_backoff() {
  # jitter is declared local so it doesn't leak into the caller's scope
  local delay=1 max_delay=300 attempt=1 max_attempts=8 jitter
  while ! nylas email send --to "$1" --subject "$2" --body "$3" --yes; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    jitter=$(printf '%03d' $((RANDOM % 1000)))
    sleep "$delay.$jitter" # e.g. 2.047 = 2s + 47ms jitter
    delay=$((delay * 2)); [ "$delay" -gt "$max_delay" ] && delay=$max_delay
    attempt=$((attempt + 1))
  done
}
send_with_backoff ops@example.com "Disk alert" "Volume at 91%"
Eight attempts with doubling delays spread retries across just over 2 minutes (seven waits of 1+2+4+8+16+32+64 = 127 seconds, plus under a second of jitter on each) before giving up, which rides out brief incidents without manual intervention. The 300-second cap is never reached at this setting; it only takes effect from the tenth wait onward. For anything longer, retrying in a foreground loop holds your script hostage; that's when you switch from retrying to queueing.
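The arithmetic can be checked without sleeping or touching the API. The sketch below is a pair of hypothetical helpers (backoff_delay and total_wait are illustrative names, not part of the nylas CLI) that reproduce the doubling-with-cap schedule of send_with_backoff and sum the waits. Jitter is ignored, since it adds less than a second per wait.

```shell
# Hypothetical helpers for reasoning about a backoff schedule.
# They only do arithmetic; nothing here sleeps or calls the API.

# Base delay in seconds before retry number $1, doubling from 1s up to a cap ($2).
backoff_delay() {
  local retry="$1" cap="${2:-300}" delay=1 i=1
  while [ "$i" -lt "$retry" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge "$cap" ]; then
      delay=$cap
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Total seconds spent waiting across $1 attempts (that is, $1 - 1 waits).
total_wait() {
  local attempts="$1" cap="${2:-300}" total=0 i=1
  while [ "$i" -lt "$attempts" ]; do
    total=$((total + $(backoff_delay "$i" "$cap")))
    i=$((i + 1))
  done
  echo "$total"
}

total_wait 8    # prints 127
total_wait 12   # prints 1111 (the last two waits are capped at 300)
```

Running total_wait with a few candidate values of max_attempts is a quick way to decide where the foreground retry budget should end and the spool should take over.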
How do you queue emails locally during an outage?
A local spool is a directory of JSON files, one per unsent message, written whenever a send fails. The pattern mirrors what Postfix does with its deferred queue, in about 15 lines of shell instead of a daemon. Each file holds the recipient, subject, and body, so a later flush pass can replay it exactly.
SPOOL=~/.local/spool/email
mkdir -p "$SPOOL"
# Try to send; spool on failure
queue_or_send() {
  if ! nylas email send --to "$1" --subject "$2" --body "$3" --yes; then
    # Write to a dotfile first, then mv: the flush loop globs *.json,
    # so it can never read a half-written file
    local name
    name="$(date -u +%s%N)"  # %N (nanoseconds) needs GNU date
    if jq -n --arg to "$1" --arg subject "$2" --arg body "$3" \
        '{to: $to, subject: $subject, body: $body}' \
        > "$SPOOL/.$name.tmp" && mv "$SPOOL/.$name.tmp" "$SPOOL/$name.json"; then
      echo "spooled: $2"
    else
      # Don't claim the message was spooled if the write failed
      echo "spool write failed: $2" >&2
      return 1
    fi
  fi
}
queue_or_send ops@example.com "Nightly report" "$(cat /tmp/report.txt)"
Timestamped filenames keep the spool ordered, so messages flush in the order they were written. The spool also survives reboots and script crashes, which an in-memory retry loop doesn't. Disk cost is trivial: a thousand spooled messages of typical alert size use under 5 MB.
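Because the spool is just files, checking its depth and age takes a glob. The following is a minimal sketch, assuming the filename scheme used above (epoch seconds in the first 10 characters of each name); spool_depth and spool_oldest_age are hypothetical helper names.

```shell
# Count spooled messages in directory $1.
spool_depth() {
  local dir="$1" n=0 f
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
    n=$((n + 1))
  done
  echo "$n"
}

# Age in seconds of the oldest spooled message; prints nothing if the
# spool is empty. $2 optionally overrides "now" (epoch seconds).
spool_oldest_age() {
  local dir="$1" now="${2:-$(date +%s)}" f base
  for f in "$dir"/*.json; do
    [ -e "$f" ] || return 0
    base=${f##*/}
    # Globs sort lexically, so the first match is the oldest file;
    # the first 10 characters of its name are epoch seconds.
    echo $((now - ${base%"${base#??????????}"}))
    return 0
  done
}
```

A monitoring check could alert when spool_depth ~/.local/spool/email stays above zero for longer than the incident you know about, since that suggests the flush itself is stuck.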
How do you flush the queue when service returns?
The flush pass gates on a health check, then replays each spooled file and deletes it only after a successful send. Gating on nylas doctor prevents the flush from burning attempts while the API is still down, and deleting after (not before) the send means a crash mid-flush loses nothing.
SPOOL=~/.local/spool/email
# Nothing spooled? Skip the health check entirely
ls "$SPOOL"/*.json >/dev/null 2>&1 || exit 0
# Only flush when every doctor check passes — and fail closed
# if doctor itself can't run
FAILING=$(nylas doctor --json | \
  jq '[.checks[] | select(.status != "ok")] | length' || echo 1)
if [ "$FAILING" != "0" ]; then
  echo "still unhealthy, leaving spool intact"
  exit 0
fi
for f in "$SPOOL"/*.json; do
  [ -e "$f" ] || break
  if nylas email send \
    --to "$(jq -r .to "$f")" \
    --subject "$(jq -r .subject "$f")" \
    --body "$(jq -r .body "$f")" \
    --yes; then
    rm "$f"
  else
    break # API regressed mid-flush; stop and retry next run
  fi
done
Run the flush from cron every 5 minutes and it becomes self-healing: during normal operation the spool is empty and the pass exits immediately, costing nothing. After an incident, the next pass drains the backlog in send order. Watch for duplicate risk on messages that failed mid-send; the idempotency pattern in the reliable automation guide covers deduplicating against the Sent folder.
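A scheduled flush can also duplicate mail in a second way: if draining a large backlog takes longer than the interval, the next pass starts while the first is still running, and both may replay the same file. One possible guard is sketched below, using a lock directory because mkdir either creates it or fails in a single atomic step. The function name run_exclusive, the lock path, and the flush script path are all hypothetical.

```shell
# Run a command only if nobody else holds the lock directory $1.
# Returns 75 (EX_TEMPFAIL by sysexits convention) when the lock is held,
# otherwise the exit status of the command.
run_exclusive() {
  local lock="$1" rc
  shift
  mkdir "$lock" 2>/dev/null || return 75
  "$@"
  rc=$?
  rmdir "$lock"
  return "$rc"
}

# Example invocation (paths are placeholders):
#   run_exclusive "${TMPDIR:-/tmp}/email-flush.lock" /path/to/flush.sh
```

The trade-off with this approach is that a pass killed between mkdir and rmdir leaves a stale lock that has to be removed by hand; where util-linux is available, flock is the usual alternative that avoids it.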
What happens to webhooks during an outage?
Inbound notifications have their own retry layer, separate from anything you script. When your webhook endpoint returns a non-200 response, Nylas retries the delivery up to 3 times with exponential backoff, with the final attempt landing 10–20 minutes after the first. The webhook_delivery_attempt field in each payload increments on every retry, so your handler can detect redelivery and dedupe on the event id.
That means a short receiver outage on your side (a deploy, a restart) usually loses nothing: deliveries that failed during the window replay automatically. An outage longer than the retry window is different, and the recovery move is a reconciliation pass that lists recent messages and compares them against what your handler processed. The webhook events reference documents the full retry and payload contract.
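Both halves of that, deduping redeliveries and reconciling after a long gap, can be sketched with plain files. The block assumes your handler can extract an event id from each payload, that ids are safe to use as filenames, and that you can produce one-id-per-line lists of what the API reports and what your handler processed. The function names, directory layout, and file format are hypothetical.

```shell
# Record an event id; succeed only the first time it is seen.
# mkdir is atomic, so two concurrent redeliveries can't both win.
first_delivery() {
  local seen_dir="$1" event_id="$2"
  mkdir -p "$seen_dir" && mkdir "$seen_dir/$event_id" 2>/dev/null
}

# Print ids present in the upstream list ($1) but absent from the
# processed list ($2). Both files hold one id per line, in any order.
missing_ids() {
  local tmp rc
  tmp="$(mktemp -d)" || return 1
  sort -u "$1" > "$tmp/upstream"
  sort -u "$2" > "$tmp/processed"
  comm -23 "$tmp/upstream" "$tmp/processed"   # lines unique to upstream
  rc=$?
  rm -rf "$tmp"
  return "$rc"
}
```

A handler would wrap its work in something like first_delivery "$SEEN" "$id" && process_event, and a post-outage pass would feed the output of missing_ids back through the same processing path.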
Which failure needs which response?
Outage response is a small decision table, and having it written down before the incident beats improvising during one. The 4 rows below cover the failure modes a terminal email integration actually meets, with the signal that identifies each scenario in under 60 seconds and the response that fits it.
| Scenario | Signal | Response |
|---|---|---|
| Local misconfiguration | doctor check fails | Fix credentials/network; retrying won't help |
| Transient blip (seconds) | One failed call, doctor healthy | Backoff loop absorbs it |
| Platform incident (minutes+) | Status page reports incident | Spool sends locally, flush after recovery |
| Your receiver down | webhook_delivery_attempt > 1 after restart | Automatic redelivery; dedupe on event id |
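The outbound rows of the table reduce to a small lookup that can live next to your scripts. This is a sketch with hypothetical names: the first argument is the count of non-ok doctor checks (or "unknown" if doctor couldn't run), and the second is a yes/no answer from looking at the status page. The receiver-down row is left out because redelivery is handled on the inbound side.

```shell
# Map the two outage signals to a response from the table above.
#   $1: number of failing doctor checks ("unknown" if doctor couldn't run)
#   $2: "yes" if the status page reports an incident, otherwise "no"
outage_response() {
  local failing="$1" incident="$2"
  if [ "$failing" != "0" ]; then
    echo "fix-local"   # credentials or network; retrying won't help
  elif [ "$incident" = "yes" ]; then
    echo "spool"       # queue locally, flush after recovery
  else
    echo "backoff"     # transient blip; the retry loop absorbs it
  fi
}

outage_response 0 yes   # prints spool
```

Treating "unknown" as a local failure matches the earlier rule that a doctor that can't run counts as unhealthy.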
After recovery, verify rather than assume: confirm the last spooled message reached the Sent folder with nylas email list --folder Sent --limit 5, and check nylas audit logs show --status error for anything that failed quietly during the window.
Next steps
- Email API SLAs: uptime vs success rate — what the 52.6-minute error budget means and how vendors measure it
- Build reliable email automation — exit codes, bounded retries, and idempotent sends for normal operation
- Monitor email integration health — catch the outage before your users report it
- Email webhook events reference — payload shapes, retry rules, and HMAC verification
- Full command reference — every flag and subcommand documented