
Email API SLAs: Uptime vs Success Rate

The difference between a 99.9% and a 99.99% SLA is 7 hours and 53 minutes of allowed downtime per year. If your product sends password resets or calendar invites through an email API, that gap is the difference between a quiet quarter and a support-ticket fire. This guide explains what an email API SLA actually guarantees, why success rate is a stricter metric than uptime, and how to verify a vendor's reliability claims from your own terminal.

Written by Caleb Geene, Director, Site Reliability Engineering

Verified · CLI 3.1.16 · last tested June 5, 2026

Command references used in this guide: nylas doctor and nylas audit logs show.

What does an email API SLA actually guarantee?

A service level agreement (SLA) is a contractual commitment to a measurable availability target, with financial remedies (usually service credits) when the vendor misses it. A 99.99% SLA does not promise your requests never fail. It promises that failures stay below 0.01% of the measurement window, and that you get compensated if they don't.

Three details in the contract matter more than the headline number. First, the measurement window: 99.99% per month allows 4.38 minutes of failure each month, while 99.99% per year allows a single 52.6-minute outage. Second, what counts as "down": full outage only, or elevated error rates too. Third, the exclusions: most SLAs carve out scheduled maintenance and upstream provider failures. Read all three before comparing vendors on the number alone.

How much downtime does each SLA tier allow?

Each added nine cuts the allowed downtime by a factor of 10. A 99% service can be dark for 3.65 days a year and still meet its SLA. At 99.99%, the budget is 52.6 minutes a year. The table below converts each tier into concrete time budgets, based on a 365.25-day year.

SLA tier   Downtime per year   Per month      Per week
99%        3.65 days           7.3 hours      1.68 hours
99.9%      8.77 hours          43.8 minutes   10.1 minutes
99.99%     52.6 minutes        4.38 minutes   1.01 minutes
99.999%    5.26 minutes        26.3 seconds   6.05 seconds

These budgets are also your error budget in the SRE sense: the amount of unreliability you can spend on deploys, migrations, and experiments before breaching the commitment. A team operating at 99.99% has 4.38 minutes per month to spend. That constraint shapes engineering practice far more than the marketing number does.
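The arithmetic behind the table can be reproduced with a short shell sketch. It assumes the same 365.25-day year, treats a month as one twelfth of that year, and leans on awk for the floating-point math. The function name downtime_minutes is illustrative and not part of any CLI.

```shell
#!/bin/sh
# Allowed downtime, in minutes, for an SLA tier over a measurement window.
# Usage: downtime_minutes <sla_percent> <window_minutes>
downtime_minutes() {
  awk -v sla="$1" -v window="$2" \
    'BEGIN { printf "%.2f\n", window * (100 - sla) / 100 }'
}

YEAR_MIN=525960   # 365.25 days * 24 hours * 60 minutes
MONTH_MIN=43830   # one twelfth of the year above (an assumption of this sketch)
WEEK_MIN=10080    # 7 days * 24 hours * 60 minutes

# Print the budget for each tier in the table, in minutes.
for tier in 99 99.9 99.99 99.999; do
  printf '%-8s year=%s min  month=%s min  week=%s min\n' \
    "${tier}%" \
    "$(downtime_minutes "$tier" "$YEAR_MIN")" \
    "$(downtime_minutes "$tier" "$MONTH_MIN")" \
    "$(downtime_minutes "$tier" "$WEEK_MIN")"
done
```

For the 99.99% tier this prints 52.60, 4.38, and 1.01 minutes, matching the table row. Swapping in a vendor's actual measurement window (for example, a 30-day billing month) changes the budget slightly, which is one more reason to read the window definition in the contract.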

Uptime vs success rate: which should an SLA measure?

Uptime measures whether the service answers at all; success rate measures whether each request actually succeeded. A service can report 100% uptime while 5% of requests fail with 500 errors, because the health-check endpoint still responds. Success rate (total successful API calls ÷ total API calls) counts every one of those failures against the SLA, which makes it the stricter and more honest metric.
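To make the gap concrete, here is a minimal sketch that applies the success-rate formula to request counts you might pull from your own logs. The function names success_rate and meets_sla and the sample counts are hypothetical; only the formula itself comes from the definition above.

```shell
#!/bin/sh
# Success rate as a percentage: successful calls / total calls * 100.
# Usage: success_rate <total_calls> <failed_calls>
success_rate() {
  awk -v total="$1" -v failed="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }
    printf "%.4f\n", (total - failed) / total * 100
  }'
}

# Exit 0 if the measured rate meets the target, non-zero otherwise.
# Usage: meets_sla <rate_percent> <target_percent>
meets_sla() {
  awk -v rate="$1" -v target="$2" 'BEGIN { exit !(rate >= target) }'
}

# The scenario from the text: the service answers health checks the whole
# time (100% uptime), but 5% of 1,000,000 requests return 500 errors.
rate="$(success_rate 1000000 50000)"
if meets_sla "$rate" 99.9; then
  echo "${rate}% meets a 99.9% success-rate SLA"
else
  echo "${rate}% misses a 99.9% success-rate SLA"
fi
```

The sample run reports a 95.0000% success rate, which misses even a 99.9% target, while an uptime-only measurement of the same period would show no breach at all.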

Nylas measures its SLA as success rate. According to the Nylas engineering post "How Nylas improved API reliability from 99.9% to 99.99%" (November 2025), the metric is "total successful API calls ÷ total API calls", a definition that counts elevated error rates, not just full outages. When you evaluate any email API vendor, ask which definition their SLA uses. A 99.99% uptime guarantee and a 99.99% success-rate guarantee are materially different commitments.

How does an email API reach 99.99% in practice?

Moving from 99.9% to 99.99% means cutting the allowed failure budget by 90%: from 43.8 minutes per month to 4.38. The Nylas SRE team documented the specific changes that closed that gap in its reliability engineering post, and the techniques generalize to any high-volume API platform.

Three practices did most of the work. Deploys go through an automated canary phase that shifts traffic from 5% to 50% in stages, and roll back automatically within minutes if the new build shows even a 0.01% regression in success rate. Chaos tests run multiple times per week, simulating database node loss, API rate-limit spikes, and regional disruptions. And the primary API gateway and databases moved off Kubernetes onto dedicated infrastructure, which also produced a 12% reduction in average request latency. The post's summary is blunt: "Reliability isn't a checkbox; it's a culture."
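The canary gate described above can be illustrated with a small decision function. This is a sketch under stated assumptions, not the Nylas implementation: it assumes the 0.01% regression is measured as an absolute drop in success rate (percentage points) between the baseline fleet and the canary, and the name canary_decision and the sample counts are hypothetical.

```shell
#!/bin/sh
# Decide whether a canary should be promoted or rolled back.
# Usage: canary_decision <base_total> <base_failed> <canary_total> <canary_failed> [threshold_points]
# threshold_points defaults to 0.01 (assumed to mean percentage points).
canary_decision() {
  awk -v bt="$1" -v bf="$2" -v ct="$3" -v cf="$4" -v thr="${5:-0.01}" 'BEGIN {
    # Without traffic on both sides there is nothing to compare.
    if (bt <= 0 || ct <= 0) { print "insufficient-data"; exit }
    base   = (bt - bf) / bt * 100
    canary = (ct - cf) / ct * 100
    if (base - canary >= thr) print "rollback"; else print "promote"
  }'
}

# Baseline at 99.995%, canary at 99.97%: a 0.025-point drop.
canary_decision 1000000 50 100000 30

# Baseline and canary both at 99.995%: no regression.
canary_decision 1000000 50 100000 5
```

The first call prints rollback and the second prints promote. A production gate would also need a minimum sample size per stage, since a 0.01-point difference is not meaningful on a few hundred requests.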

How do you verify an email API's reliability yourself?

Vendor dashboards show the vendor's view; your terminal shows yours. The nylas doctor command runs 5 local diagnostic checks in a few seconds, reporting per-check status plus measured API latency. The --json flag makes the output scriptable, so you can run it on a schedule and alert on regressions.

# One-shot health check with measured API latency
nylas doctor

# Machine-readable output for scripting
nylas doctor --json

# Extract just the failing checks ([] = healthy)
nylas doctor --json | jq '[.checks[] | select(.status != "ok")]'
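To run the check on a schedule, the jq filter above can feed a small wrapper that converts its output into an exit code for cron or a monitoring agent. This is a minimal sketch: the function name alert_if_failing and the exit-code convention (0 healthy, 1 failing, 2 unknown) are assumptions, and it relies only on the "[] = healthy" behaviour shown above.

```shell
#!/bin/sh
# Read the filtered array of failing checks on stdin and turn it into an
# exit code: 0 = healthy, 1 = at least one failing check, 2 = no output.
alert_if_failing() {
  # Strip whitespace so pretty-printed and compact JSON compare the same way.
  # (This also removes spaces inside strings, so the echoed JSON is display-only.)
  failing="$(tr -d '[:space:]')"
  case "$failing" in
    '')   echo "UNKNOWN: no output from nylas doctor" >&2; return 2 ;;
    '[]') echo "OK: all checks healthy"; return 0 ;;
    *)    echo "ALERT: failing checks: $failing" >&2; return 1 ;;
  esac
}

# Intended use, for example from a cron job:
#   nylas doctor --json | jq '[.checks[] | select(.status != "ok")]' | alert_if_failing
```

Treating empty input as a separate "unknown" state keeps a missing binary or an expired credential from being reported as a healthy run.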

For request-level verification, audit logging records every CLI command with the Nylas request ID it produced. When a send fails or support asks for details, nylas audit logs show --request-id retrieves the exact entry, so both sides are looking at the same request. Enable it once and the log accrues locally from then on, with no extra step per command.

# Enable audit logging (one-time setup)
nylas audit init --enable

# Show the last 20 commands with their request IDs
nylas audit logs show

# Show only failed commands
nylas audit logs show --status error

# Look up one request by its Nylas request ID
nylas audit logs show --request-id req_abc123

Live platform status is published at status.nylas.com. Pairing the status page with your own doctor checks distinguishes a platform incident from a local misconfiguration in under 30 seconds. The monitoring guide turns these one-shot checks into a scheduled pipeline.

What should you ask a vendor before signing an SLA?

An SLA negotiation comes down to 6 questions, and you can resolve most of them from the vendor's public pricing page before talking to sales. Nylas, for example, publishes a 99.99% availability SLA for annual contracts on its pricing page, with full terms defined in the customer agreement.

  1. Metric definition — uptime or success rate? Success rate counts partial failures; uptime may not.
  2. Measurement window — monthly windows surface short repeated outages that an annual window averages away.
  3. Which plans include it — SLAs often apply only to annual or enterprise contracts, not month-to-month plans.
  4. Exclusions — scheduled maintenance, upstream provider outages (a Gmail outage isn't the API vendor's downtime), and force majeure.
  5. Remedies — credit percentages per breach tier, and whether you must file a claim within a deadline to receive them.
  6. Evidence — does the vendor publish a status page and post-incident reports you can audit against your own logs?

Question 4 matters double for email APIs that sync against Gmail, Outlook, and other upstream providers. Provider-side incidents are outside any vendor's control, so the practical question is how the platform behaves during one: whether requests queue and retry, and whether webhook deliveries replay after recovery. The outage handling guide covers building for that case on the client side.
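On the client side, the simplest form of "queue and retry" is a wrapper that re-runs a command with exponential backoff. The sketch below is one hedged way to do that in shell; the function name retry_with_backoff and its parameters are hypothetical, and it should only wrap commands that are safe to repeat.

```shell
#!/bin/sh
# Re-run a command until it succeeds, doubling the wait after each failure.
# Usage: retry_with_backoff <max_attempts> <base_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Intended use: up to 4 attempts, waiting 2s, 4s, then 8s between them.
#   retry_with_backoff 4 2 nylas doctor
```

Retrying a read or a health check is harmless. Retrying a send can deliver the same message twice unless the request is idempotent, so check how the vendor deduplicates before wrapping write operations.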

Next steps