Scenario
Three engineers, one Node + Postgres API, and a mobile web client. Incidents were rare but slow to diagnose: logs were plain strings, the deploy version was not always attached to a log line, and we had a Grafana folder full of charts nobody opened after the first week.
We did not buy a new vendor stack. We tightened what we had.
Start with one query
Pick the question you ask most often after an incident, for example "which deploy broke checkout?", and make sure your logs can answer it with a single filter. That means every log line carries the deploy version as its own field. Structured JSON beats clever grep.
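One way this could look in a Node service is sketched below. It is a minimal illustration, not the logger we run: the field names (deploy_version, service, msg), the formatLogLine helper, and the sample values are all assumptions made for the example. The point is only that the deploy version is stamped onto every line, so "which deploy broke checkout?" becomes a filter on one field.

```typescript
// Hypothetical log schema; adapt the field names to your own conventions.
interface LogContext {
  service: string;
  deployVersion: string; // e.g. a git SHA injected at build or deploy time
}

type LogLevel = "info" | "warn" | "error";

// Build one JSON log line that always carries the deploy version.
// Caller-supplied fields are spread first so they cannot overwrite
// the core fields (ts, level, service, deploy_version, msg).
function formatLogLine(
  ctx: LogContext,
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    ts: now.toISOString(),
    level,
    service: ctx.service,
    deploy_version: ctx.deployVersion,
    msg: message,
  });
}

// Example usage with made-up values.
const exampleCtx: LogContext = { service: "api", deployVersion: "abc1234" };
const exampleLine = formatLogLine(exampleCtx, "error", "checkout failed", {
  route: "/checkout",
  status: 500,
});
console.log(exampleLine);
```

With lines shaped like this, the post-incident question reduces to filtering on level, route, and deploy_version in whatever log viewer you already have.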
Metrics you will actually read
Request rate, error rate, and latency on your outermost API are enough to start. Add business metrics only when someone commits to looking at them weekly.
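To make the three starting metrics concrete, here is a rough sketch of how they could be computed from the requests seen in one time window. The RequestSample and WindowSummary types and the summarize function are hypothetical names for this example, "error" is assumed to mean a 5xx response, and latency is reported as a nearest-rank 95th percentile. In practice a metrics library or your proxy would likely do this for you; the sketch only shows what the numbers mean.

```typescript
// Hypothetical shape of one observed request.
interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

interface WindowSummary {
  requestsPerSecond: number;
  errorRate: number; // fraction of responses with status >= 500 (assumption)
  p95LatencyMs: number;
}

// Summarize the samples collected during a window of the given length.
function summarize(samples: RequestSample[], windowSeconds: number): WindowSummary {
  if (samples.length === 0) {
    return { requestsPerSecond: 0, errorRate: 0, p95LatencyMs: 0 };
  }
  const errors = samples.filter((s) => s.status >= 500).length;
  const sorted = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 95% of
  // samples at or below it. Integer arithmetic avoids float rounding.
  const rank = Math.ceil((95 * sorted.length) / 100);
  return {
    requestsPerSecond: samples.length / windowSeconds,
    errorRate: errors / samples.length,
    p95LatencyMs: sorted[rank - 1],
  };
}

// Example usage with made-up samples from a 10-second window.
const exampleSummary = summarize(
  [
    { durationMs: 40, status: 200 },
    { durationMs: 55, status: 200 },
    { durationMs: 900, status: 500 },
  ],
  10,
);
console.log(exampleSummary);
```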
Alerts that respect sleep
Alert on user-visible symptoms and clear thresholds, not on every blip. Page humans for outages; send everything else to a channel you triage in the morning.
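The routing rule can be sketched as a small pure function. Everything here is illustrative: the WindowStats and AlertThresholds types, the routeAlert name, and the threshold values are assumptions, not settings from our system. The sketch pages only when the error rate stays above an outage threshold for several consecutive windows, sends milder sustained symptoms to a channel, and ignores single blips and windows with too little traffic to judge.

```typescript
type Route = "page" | "channel" | "none";

// Hypothetical per-window stats for the outermost API.
interface WindowStats {
  requests: number;
  errorRate: number; // fraction between 0 and 1
  p95LatencyMs: number;
}

// Hypothetical thresholds; tune them to your own traffic.
interface AlertThresholds {
  pageErrorRate: number; // sustained error rate treated as an outage
  warnErrorRate: number;
  warnP95LatencyMs: number;
  minRequests: number; // below this, a window is too quiet to judge
  sustainedWindows: number; // consecutive windows required before alerting
}

// Decide where an alert goes based on the most recent windows (oldest first).
function routeAlert(recent: WindowStats[], t: AlertThresholds): Route {
  if (recent.length < t.sustainedWindows) return "none";
  const last = recent.slice(-t.sustainedWindows);
  // A quiet window makes rates meaningless, so do not alert on it.
  if (last.some((w) => w.requests < t.minRequests)) return "none";
  if (last.every((w) => w.errorRate >= t.pageErrorRate)) return "page";
  const degraded = (w: WindowStats) =>
    w.errorRate >= t.warnErrorRate || w.p95LatencyMs >= t.warnP95LatencyMs;
  if (last.every(degraded)) return "channel";
  return "none";
}

// Example thresholds (made-up numbers).
const exampleThresholds: AlertThresholds = {
  pageErrorRate: 0.25,
  warnErrorRate: 0.01,
  warnP95LatencyMs: 1500,
  minRequests: 20,
  sustainedWindows: 3,
};
```

Requiring several consecutive breaching windows is one simple way to keep a single bad minute from waking anyone up; the trade-off is that a real outage is paged a few windows later.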