
Case study: scaling synchronized video interviews

How we kept interview rooms stable when concurrent sessions jumped from dozens to thousands—pools, backpressure, and honest fallbacks.


Context

A B2C admissions product added live video interviews alongside async applications. Early traffic was modest; interview rooms were created on demand and torn down after each session. When marketing campaigns and intake windows aligned, concurrent rooms spiked within minutes. The first pain was not “video quality” but connection storms: signalling spikes, runaway room creation, and third-party SFU quotas we had assumed would never matter.

This write-up is about the shape of the system and the guardrails that held—without pretending there was a single magic library swap.

What “good” looked like

| Signal | Target |
| --- | --- |
| Room join success (first attempt) | ≥ 98% under peak |
| P95 time-to-first-frame | < 4s on median home broadband |
| Orphan rooms after disconnect | Auto-closed < 2 min |
| Support tickets tagged “could not join” | Down week-over-week after each change |

Architecture snapshot

  • Client: Next.js app with a dedicated interview route; WebRTC via the vendor’s browser SDK (TURN/STUN configured per environment).
  • Room service: Small NestJS service that minted short-lived JWTs, mapped interviewId → roomId, and enforced one active session per candidate slot.
  • State: Redis for ephemeral room metadata, join tokens, and rate-limit counters; Postgres for the source-of-truth schedule and audit trail.
  • Observability: Structured logs with interviewId, roomId, sfuRegion, and join phase; RED-style metrics on the BFF that fronted the SDK.
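To make the observability bullet concrete, a join-phase log event might be shaped like this. The four fields (interviewId, roomId, sfuRegion, join phase) come from the setup above; the phase names and everything else are illustrative, not the production schema:

```typescript
// One structured log line per join phase; emitted as JSON so a single grep
// over roomId answers "which phase did this join die in?".
function logJoinPhase(fields: {
  interviewId: string;
  roomId: string;
  sfuRegion: string;
  phase: "token_minted" | "signalling" | "sdp_answer" | "first_frame"; // illustrative phases
  ok: boolean;
}): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), ...fields });
  console.log(line);
  return line;
}
```

Returning the serialized line is just for testability; the useful part is that every phase shares the same correlation fields.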

Bottleneck 1: unbounded room creation

The first production spike showed duplicate rooms for the same slot when users refreshed or opened two tabs. The SFU billed per room-minute; worse, moderators landed in different rooms than candidates.

Change: idempotent POST /rooms keyed by (slotId, candidateId) with a Redis SETNX lease. If a room existed and was still healthy, return the same credentials. If the lease expired after a crash, create a new room but close the stale one via provider API where supported.
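A minimal sketch of that idempotent create path. The interface, names, and an in-memory stand-in for the Redis SETNX lease are illustrative, not the production code; in production the store is Redis (`SET key value NX PX ttl`):

```typescript
// Lease store abstraction; backed by Redis SET ... NX PX in production.
interface LeaseStore {
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  get(key: string): Promise<string | null>;
}

// In-memory stand-in so the sketch runs without Redis.
class MemoryStore implements LeaseStore {
  private entries = new Map<string, { value: string; expiresAt: number }>();
  async setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean> {
    const e = this.entries.get(key);
    if (e && e.expiresAt > Date.now()) return false; // lease still held
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    return true;
  }
  async get(key: string): Promise<string | null> {
    const e = this.entries.get(key);
    return e && e.expiresAt > Date.now() ? e.value : null;
  }
}

interface RoomProvider {
  createRoom(): Promise<string>;            // SFU provider call
  closeRoom(roomId: string): Promise<void>; // where the provider supports it
}

// Idempotent create keyed by (slotId, candidateId): refreshes and second
// tabs get the same room back instead of minting a duplicate.
async function getOrCreateRoom(
  store: LeaseStore,
  provider: RoomProvider,
  slotId: string,
  candidateId: string,
  leaseTtlMs = 2 * 60 * 1000,
): Promise<string> {
  const key = `room-lease:${slotId}:${candidateId}`;
  const existing = await store.get(key);
  if (existing) return existing; // healthy room already leased
  const roomId = await provider.createRoom();
  const won = await store.setIfAbsent(key, roomId, leaseTtlMs);
  if (won) return roomId;
  // Lost the race to a concurrent request: close our room, use the winner's.
  await provider.closeRoom(roomId).catch(() => {});
  return (await store.get(key)) ?? roomId;
}
```

The lease TTL doubles as crash recovery: if the service dies holding a stale room, the key expires and the next join creates a fresh one, matching the stale-room cleanup described above.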

Bottleneck 2: signalling and token mint storms

Every “Join” click minted fresh tokens and called the provider. Under load, the provider returned 429 and the UI showed a generic error; users assumed “their Wi‑Fi was bad.”

Changes:

  • Token cache: cache minted credentials for 60–90s where the vendor allowed; invalidate on explicit “Leave”.
  • Client-side backoff with jitter on 429/5xx, surfaced as “High demand—retrying…” instead of a dead end.
  • Queue depth metric on the mint endpoint so we could page before users noticed.
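The backoff-with-jitter behaviour above can be sketched like this. The helper name and thresholds are illustrative; the real client wires something like it around the SDK join call and shows the “High demand—retrying…” copy while waiting:

```typescript
// Retry a join/token-mint call with capped exponential backoff and full
// jitter. Only transient errors (429/5xx-shaped) are retried, and retries
// are bounded so users never hit an infinite spinner.
async function withBackoff<T>(
  attempt: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  opts = { maxAttempts: 5, baseMs: 500, capMs: 8000 },
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      if (!isTransient(err) || i + 1 >= opts.maxAttempts) throw err;
      // Full jitter: random delay in [0, min(cap, base * 2^i)] keeps
      // thundering-herd retries from re-synchronizing on the provider.
      const delay = Math.random() * Math.min(opts.capMs, opts.baseMs * 2 ** i);
      await sleep(delay); // UI shows "High demand—retrying…" here
    }
  }
}
```

Bounding `maxAttempts` is what makes the support handoff possible: after the last attempt the error surfaces instead of spinning forever.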

Bottleneck 3: cold regions

Some cohorts were concentrated in South Asia and MENA, while the default SFU region was still tuned for US-East-shaped traffic.

Change: derive preferred region from the first successful speed sample (lightweight ping) stored in Redis for the session, and pass it when creating rooms. Documented a manual override for ops when a provider had an incident in one POP.
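The region pick can be sketched as below. The article only specifies “first successful speed sample,” so treat this as one reasonable variant that takes the median of a few pings per region; region names are illustrative, and in production the winner is cached in Redis for the session and passed to room creation:

```typescript
// Choose the SFU region with the lowest median latency from lightweight
// ping samples (ms). Falls back to the ops-configured default when no
// samples arrived, which is also the hook for the manual incident override.
function preferredRegion(samples: Record<string, number[]>, fallback: string): string {
  let best = fallback;
  let bestMedian = Infinity;
  for (const [region, pings] of Object.entries(samples)) {
    if (pings.length === 0) continue;
    const sorted = [...pings].sort((a, b) => a - b);
    const median = sorted[Math.floor(sorted.length / 2)];
    if (median < bestMedian) {
      bestMedian = median;
      best = region;
    }
  }
  return best;
}
```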

What we deliberately did not do

  • No “custom SFU” in the first phase—the goal was reliability and cost predictability, not owning media plumbing.
  • No endless retries without caps—bounded retries + support handoff beat infinite spinners.

Outcome

After three incremental releases (idempotency + caching + regional hints), first-attempt join rate moved into the target band during the next intake window, and room-minute waste dropped sharply. The remaining incidents were mostly client-side camera permissions and corporate VPNs—we addressed those with clearer pre-flight copy and a downloadable checklist for institutions.

If you’re about to ship something similar

  • Treat room lifecycle as a state machine, not a fire-and-forget API call.
  • Price and quota the third-party path like any other dependency—load test the mint path, not only the pixel path.
  • Make join phases visible in logs so on-call can answer “did we fail before or after SDP?” in one grep.
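The “state machine, not fire-and-forget” point can be made concrete with a tiny transition table. The states and events here are illustrative, not the production names:

```typescript
// Room lifecycle as an explicit state machine: every event either has a
// legal next state or is an anomaly worth logging loudly.
type RoomState = "requested" | "created" | "joined" | "closing" | "closed";

const transitions: Record<RoomState, Partial<Record<string, RoomState>>> = {
  requested: { provisioned: "created", provider_error: "closed" },
  created: { participant_joined: "joined", lease_expired: "closing" },
  joined: { all_left: "closing", disconnect_timeout: "closing" },
  closing: { provider_closed: "closed" },
  closed: {},
};

// Returns the next state, or null for an illegal event — exactly the case
// that flags a bug (e.g. a join event arriving for an already-closed room).
function step(state: RoomState, event: string): RoomState | null {
  return transitions[state][event] ?? null;
}
```

Driving the orphan-room cleanup from `disconnect_timeout` and `lease_expired` events is what keeps the “auto-closed in under 2 minutes” target enforceable rather than aspirational.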