## Context
A platform serving students and institutions across several continents ran primarily out of a single cloud region. p95 API latency from Kathmandu, Dubai, and Lagos routinely sat 3× higher than from North America. Mobile web felt “fine” on Wi‑Fi but fragile on 4G once JSON payloads and waterfall calls stacked up.
Leadership wanted improvement before rewriting the whole backend. The mandate: measurable latency wins, minimal downtime, and a credible rollback if something went wrong.
## Baseline (honest numbers)
We picked three synthetic checks and three dashboards derived from RUM (real-user monitoring):
| Check | Before (p95, ms) | Notes |
|---|---|---|
| Auth token exchange | 420–680 | TLS + RTT dominated |
| Application list (paginated) | 510–890 | N+1-ish queries amplified RTT |
| File metadata (signed URL) | 380–620 | Round trip to origin region |
RUM confirmed the same story: TTFB on document requests correlated with distance to origin, not CPU on the cluster.
## Strategy: parallel region, not lift-and-shift on day one
- Stand up a read-optimized path in Singapore (API + Redis replica + read pool).
- Route only read-heavy, idempotent endpoints to the new region via GeoDNS with a low TTL.
- Keep writes on the primary region initially, with async replication we already trusted.
- Feature flag per tenant cohort so we could steer 10% → 50% → 100% of traffic.
This avoided rewriting data ownership on day one while still moving bytes closer to users.
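The cohort steering above can be sketched as a deterministic hash bucket per tenant; `route_region`, `rollout_percent`, and the region names here are illustrative, not the production code:

```python
import hashlib

def bucket(tenant_id: str) -> int:
    """Map a tenant to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(tenant_id.encode()).hexdigest()
    return int(digest, 16) % 100

def route_region(tenant_id: str, rollout_percent: int) -> str:
    """Send the first rollout_percent buckets to the new read region."""
    return "ap-southeast-1" if bucket(tenant_id) < rollout_percent else "primary"
```

Because the bucket is a pure function of the tenant ID, raising the percentage from 10 to 50 to 100 only adds cohorts; no tenant flaps between regions as the rollout widens.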
## Implementation notes
### Connection pools
Opening one pool per instance against a faraway primary still hurt when bursts hit. We:
- Right-sized pool limits per service so we did not exhaust DB `max_connections` when regional replicas doubled instance counts.
- Added pool wait-time metrics; when waits climbed, we scaled replicas before CPUs maxed out.
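The sizing rule is arithmetic: the sum of pool sizes across all instances has to stay under the database's `max_connections`, with headroom left for admin, migration, and monitoring sessions. A minimal sketch (the function name and the 0.8 headroom factor are illustrative assumptions, not our production values):

```python
def max_pool_size(db_max_connections: int, instance_count: int,
                  headroom: float = 0.8) -> int:
    """Largest per-instance pool that keeps total connections under budget.

    headroom reserves a slice of max_connections for admin sessions,
    migrations, and monitoring (the 0.8 default is illustrative).
    """
    budget = int(db_max_connections * headroom)
    return max(1, budget // instance_count)
```

The point is that this number shrinks as instance counts grow: doubling instances in a new region halves the safe per-instance pool, which is exactly the failure mode we hit.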
### Caching boundaries
We cached stable reads (reference data, feature flags) at the edge with short TTLs and explicit cache keys per locale. We did not cache personalized lists without a hard invalidation story—stale “application status” is worse than +200ms latency.
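A minimal sketch of the cache shape described above, with the locale baked into the key so entries never bleed across locales; `TTLCache` is an illustrative stand-in for whatever edge store is actually in play:

```python
import time

class TTLCache:
    """Minimal TTL cache for stable reads; keys are explicit per locale."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def key(self, resource: str, locale: str) -> str:
        # Explicit locale in the key prevents cross-locale bleed.
        return f"{resource}:{locale}"

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[0] < time.monotonic():
            return None  # missing or expired
        return entry[1]

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Note what is deliberately absent: no personalized data and no invalidation hooks. Anything that would need a hard invalidation story stayed uncached.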
### Rollback
The GeoDNS toggle plus the feature flag meant rollback was "flip traffic and drain pools," not "restore a database." We rehearsed it once in staging against a runbook that ran under 15 minutes wall clock.
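The rollback itself reduces to two steps: flip the routing flag, then wait for regional pools to drain. A hedged sketch, where `flag_store` and `in_flight` stand in for the real flag service and the pool's in-flight-connection gauge:

```python
import time

def rollback(flag_store: dict, in_flight, timeout_s: float = 60.0,
             poll_s: float = 0.5) -> bool:
    """Flip reads back to the primary region, then wait for drain.

    flag_store and in_flight are illustrative stand-ins; returns True
    once regional pools are empty, False if the drain times out.
    """
    flag_store["read_region"] = "primary"      # step 1: flip traffic
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:         # step 2: drain pools
        if in_flight() == 0:
            return True
        time.sleep(poll_s)
    return False
```

Rehearsing this in staging is what turns it from a diagram into a runbook: the timeout value and the "what if drain never completes" branch only get honest numbers under practice.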
## Results
After full cutover for read paths, and after moving upload initiation to a regional bucket (with S3 Transfer Acceleration where applicable):
| Check | After (p95, ms) | What changed |
|---|---|---|
| Auth token exchange | 260–410 | RTT + edge TLS termination |
| Application list | 320–520 | Fewer round trips + regional DB reads |
| File metadata | 210–380 | Regional origin for signed URL step |
Numbers varied by ISP and time of day; the direction held across two intake cycles.
## What I’d do earlier next time
- Instrument per-query timings broken down by DB region from day one—otherwise you argue about “the network” forever.
- Put connection pool dashboards next to latency dashboards; they are the same story told twice.
- Write the rollback runbook before the first 10% traffic shift, not after the first incident dress rehearsal.
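The first lesson above amounts to tagging every query timing with the DB region it hit. A minimal sketch, assuming an in-process dict as a stand-in for the real metrics backend:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# (query_name, db_region) -> list of durations in ms; an in-process
# stand-in for whatever metrics backend you actually ship to.
timings = defaultdict(list)

@contextmanager
def timed_query(name: str, db_region: str):
    """Record wall-clock duration per query, tagged with the DB region,
    so "it's the network" vs "it's the database" stops being an argument."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        timings[(name, db_region)].append(elapsed_ms)
```

Usage is one line at each call site, e.g. `with timed_query("application_list", "ap-southeast-1"): rows = fetch_page(...)`; the region tag is what makes the dashboards arguable-with rather than argued-about.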
## Takeaway
Regional rollout is a product and ops problem as much as an infra one. If you can’t explain who gets which stack and how to undo it in minutes, you don’t have a rollout—you have a hope.