2026-05-09 / 6 min read / ops philosophy

Decisions, not dashboards

cubby ai team

Field note

Authored from the ops floor. Cite freely; rewrite only with attribution.

The ops dashboard, as a default surface, is a category error. It optimizes for visibility. The team paying the bill needs the next action. Those are not the same artifact, and the gap between them is where alert fatigue, missed pages, and 9-figure outage tabs live.

We have the numbers. The pingfatigue.com research index, which aggregates SRE and clinical alarm studies, cites the incident.io 2024 On-Call Survey of more than 500 engineers: a median of 42 pages per engineer per week, 41 percent considered leaving over alert load, 62 percent report weekly sleep disruption. The Catchpoint SRE Report 2024 places industry-median false-positive rates between 60 and 80 percent. Healthcare, which has lived inside this problem longer than software has, runs ICU alarm false-positive rates of 85 to 99 percent — the basis for The Joint Commission's NPSG.06.01.01 alarm safety mandate.

Visibility, as a product strategy, is exhausted. A panel that surfaces 42 pages a week per engineer, most of them noise, is not a tool. It is a tax on the responder's attention.

The dashboard does a different job than the responder does

A dashboard answers "what is the system doing?" A responder answers "what should I do next?" Those questions sit at different altitudes. Splunk's own writeup on Mean Time to Acknowledge (MTTA) puts the consequence in cash: unplanned outages cost roughly $125,000 per hour on average, and Global 2000 companies lose about 9 percent of annual profit to unplanned downtime, roughly $400 billion a year across the cohort. What moves that number is not how rich the panel is. It is how fast a human acknowledges and acts.
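To make those figures concrete, here is a rough back-of-the-envelope sketch. It assumes cost accrues linearly at the quoted average hourly rate and that the Global 2000 cohort has exactly 2,000 companies; both are simplifications for illustration, not claims from the cited writeup, and the function and constant names are hypothetical.

```python
# Figures quoted in the text (attributed there to Splunk's MTTA writeup).
OUTAGE_COST_PER_HOUR_USD = 125_000
GLOBAL_2000_DOWNTIME_LOSS_USD = 400_000_000_000

# Assumption for illustration: the cohort has exactly 2,000 companies.
COHORT_SIZE = 2_000


def delay_cost_usd(minutes: float) -> float:
    """Cost of a delay, assuming cost accrues linearly at the average hourly rate."""
    return OUTAGE_COST_PER_HOUR_USD * minutes / 60


# Simple average; real losses are unlikely to be spread evenly across companies.
per_company_usd = GLOBAL_2000_DOWNTIME_LOSS_USD / COHORT_SIZE

print(f"10 minutes of delay: ${delay_cost_usd(10):,.0f}")
print(f"Average annual loss per company: ${per_company_usd:,.0f}")
```

Under those assumptions, ten minutes of unacknowledged outage costs a little over $20,000, and the average annual loss works out to $200 million per company.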

This is the gap Gartner's decision intelligence framing was built around. Gartner defines decision intelligence as a discipline that uses data, analytics, and AI to support, augment, and automate decisions: the unit of work is the decision, not the chart. Gartner's market guide for decision intelligence platforms makes the same shift explicit at the platform level: such a platform has to model decisions and track their outcomes, not just visualize the inputs.

The research literature is moving in the same direction. The 2025 survey "A Survey of AIOps in the Era of Large Language Models" (arXiv 2507.12472) reviewed 183 papers from 2020 to 2024 and traced the field's center of gravity shifting from anomaly detection toward natural-language root-cause analysis and assisted remediation, on a five-level automation ladder that ends in autonomous execution. Visibility is the floor of that ladder. Action is the ceiling.

What a decision-shaped surface looks like

If the unit is a decision, the surface owes the responder three things, in this order:

1. The next move, named. Not a status but a verb: "Roll back deploy 8af2." "Page the on-call DBA." "Mute the auth-svc latency monitor for 15 minutes."
2. The single piece of evidence that decided it. One log line, one metric step-change, one diff. The other 27 panels can stay collapsed.
3. The rejected alternatives and why each lost. "We did not page network because traceroute is clean." "We did not roll forward because the canary is still failing health checks."
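
One way to picture the three parts above is as a small record type. This is a minimal sketch, not a description of any shipped product: the class names, field names, and the DO / BECAUSE / NOT labels are hypothetical, and the evidence string is invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RejectedAlternative:
    action: str  # the move that was considered
    reason: str  # why it lost


@dataclass(frozen=True)
class Decision:
    next_move: str  # a verb phrase, not a status
    evidence: str   # the single deciding log line, metric step-change, or diff
    rejected: tuple[RejectedAlternative, ...] = ()

    def render(self) -> str:
        """Render in the order the responder needs: move, evidence, rejections."""
        lines = [f"DO: {self.next_move}", f"BECAUSE: {self.evidence}"]
        for alt in self.rejected:
            lines.append(f"NOT: {alt.action} ({alt.reason})")
        return "\n".join(lines)


decision = Decision(
    next_move="Roll back deploy 8af2",
    # Hypothetical evidence, invented for this example.
    evidence="5xx rate stepped up right after deploy 8af2 went out",
    rejected=(
        RejectedAlternative("Page network", "traceroute is clean"),
        RejectedAlternative("Roll forward", "canary is still failing health checks"),
    ),
)

print(decision.render())
```

The point of the shape is the ordering: the verb comes first, exactly one piece of evidence follows, and the rejected options are kept as footnotes rather than dropped.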

The shape mirrors the way a senior engineer hands off context at 3 a.m.: head, then heart, then footnotes. Most dashboards invert that and ask the responder to do the synthesis. Under fatigue, the synthesis is exactly what gets dropped.

This is also why metric design matters. Dashboards are measured in coverage. Decision surfaces have to be measured in time-to-action: minutes from page to keystroke. Coverage with no action is the operating definition of alert fatigue.
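
Time-to-action can be computed from two timestamps per incident: when the page fired and when the responder took the first action. The sketch below assumes both are already recorded somewhere; the function name and the sample timestamps are hypothetical, and the median is only one reasonable summary.

```python
from datetime import datetime
from statistics import median


def minutes_to_action(paged_at: datetime, first_action_at: datetime) -> float:
    """Minutes from the page firing to the responder's first action."""
    return (first_action_at - paged_at).total_seconds() / 60


# Hypothetical incidents: (page fired, first action taken).
incidents = [
    (datetime(2026, 5, 1, 3, 2), datetime(2026, 5, 1, 3, 9)),
    (datetime(2026, 5, 2, 14, 30), datetime(2026, 5, 2, 14, 34)),
    (datetime(2026, 5, 3, 22, 10), datetime(2026, 5, 3, 22, 25)),
]

times = [minutes_to_action(paged, acted) for paged, acted in incidents]
median_minutes = median(times)

print(f"median time-to-action: {median_minutes} min")
```

A median is used here so that one slow incident does not dominate the number; a team that cares most about worst cases might track a high percentile instead.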

The stance

We are not against dashboards. We use them — to inspect after the fact, to design SLOs, to teach a new on-call what normal looks like. But the default surface during an incident, or during any operational decision, should be the decision itself. Everything else is a drawer the responder can open if they want to.

Build for the keystroke, not the page view.

Sources

pingfatigue.com — Alert Fatigue Research Index — aggregates SRE, healthcare, and knowledge-worker studies on alert load (Google SRE Book targets, DORA 2024, incident.io 2024).
incident.io — Alert fatigue solutions for DevOps teams in 2025 — survey-style aggregation: ~2,000 alerts/week per team, ~3% needing immediate action.
Splunk — Mean Time to Acknowledge (MTTA) — defines MTTA and cites the $125k/hour and $400B/year downtime figures.
Gartner — Decision Intelligence glossary — Gartner's discipline-level definition; the unit of work is the decision.
arXiv 2507.12472 — A Survey of AIOps in the Era of Large Language Models — 183-paper survey showing the field's drift from detection toward assisted remediation.