Alert Fatigue: Designing Alerting Rules That Get Acknowledged, Not Ignored

Q: How do I convince my team to delete alerts?

Start with data. Pull the last 90 days of alert history and calculate the actionable rate. When the team sees that 70% of pages led to "looked at it, did nothing," the case makes itself. Frame deletion not as removing safety nets but as clearing noise that prevents real safety nets from being heard.

Q: What is a healthy number of alerts per on-call shift?

No universal number, but fewer than two pages per 8-hour shift is a widely cited benchmark. The actionable rate matters more than raw count. Five pages with 90% actionable rate is healthier than one page that's always noise. Track both volume and quality. Optimize for quality first.

Q: Should we alert on infrastructure metrics at all?

Yes, but not on the pager. CPU and memory are invaluable as diagnostic context during incidents and inputs to capacity planning. The distinction is between alerting (interrupting a human to demand action) and monitoring (making data available for humans who choose to look). CPU at 85% should be on a dashboard and maybe trigger a capacity planning ticket if sustained. It should not wake someone up unless directly correlated with user-facing degradation.

Q: How do SLO-based alerts work if we haven't defined SLOs yet?

Start with one. Pick your most critical user journey, define a modest SLO (99.5% success over 30 days), instrument error rate and latency, set up one fast-burn and one slow-burn alert. That single SLO-based pair will be more meaningful than dozens of static rules. Expand from there.

Q: How do we handle the transition period?

Run both systems in parallel for 2-4 weeks. Keep old alerts active but route them to a low-priority audit channel. Route new SLO alerts to the pager. At the end, compare empirically: did the new system miss anything that mattered? This shadow-mode approach builds confidence through evidence rather than faith.

Core Distinction

Signal vs. Noise

The design error that produces alert fatigue is treating every internal metric as if it were a user-facing symptom. CPU spikes during batch jobs. Memory climbs before garbage collection. Disk usage creeps toward a threshold an automated cleanup handles every Tuesday. None of these are user-facing, but if your alerting can't distinguish routine fluctuation from real trouble, it pages someone anyway.

Every alerting system must distinguish signals from symptoms. A signal is a data point: CPU utilization jumped to 85%. A symptom is a user impact: checkout latency tripled. Signals are interesting. Symptoms are actionable.

Think of a car alarm. A single metric crossing a static threshold is a vibration. A combination of elevated error rates, increased latency, and dropping throughput, sustained over a meaningful window, is an actual problem. A well-designed car alarm doesn't fire on vibration alone. It cross-references multiple inputs: vibration plus door ajar plus ignition tampering. Good alerting works the same way.

Moving from single-metric, static-threshold alerts to multi-signal, time-windowed alerts is one of the most useful changes most teams can make. New tooling is not required; new thinking is.
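As a minimal sketch of what "multi-signal, time-windowed" can mean in practice, the function below fires only when at least two conditions hold for every sample in a window. The `Sample` shape, the threshold values, and the two-of-three rule are all assumptions made for illustration, not recommendations.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One per-minute observation of a service (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    throughput_rps: float


def should_alert(
    window: Sequence[Sample],
    *,
    max_error_rate: float = 0.01,       # assumed thresholds; tune per service
    max_p99_latency_ms: float = 800.0,
    min_throughput_rps: float = 50.0,
    min_signals: int = 2,
) -> bool:
    """Fire only if at least `min_signals` conditions hold for the whole window."""
    if not window:
        return False
    sustained = [
        all(s.error_rate > max_error_rate for s in window),
        all(s.p99_latency_ms > max_p99_latency_ms for s in window),
        all(s.throughput_rps < min_throughput_rps for s in window),
    ]
    return sum(sustained) >= min_signals
```

A single healthy sample inside the window is enough to suppress the alert, which is how a brief spike gets filtered out while a sustained, multi-signal problem still fires.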

Somewhere during initial setup, often during an adrenaline-fueled week after a bad outage, someone configured alerts on every metric that moved. CPU over 70%? Alert. Memory over 80%? Alert. Disk over 75%? Alert. Any 5xx error? Alert. Every one added with good intentions. None individually wrong. But collectively they create a wall of noise that buries the signal.

Pager Rules

Alert on Symptoms, Not Causes

A simple test: if this alert fires at 3 AM, can the on-call engineer take a specific, documented action to resolve it? If the answer is no, it should not wake anyone up.

Cause-based alerts fire on internal system behavior: CPU, memory, connection pool utilization. They tell you what a machine is doing. They do not tell you whether anyone cares.

Symptom-based alerts fire on externally visible degradation: error rate, latency, availability. They tell you what users are experiencing.

The distinction matters because causes and symptoms have a many-to-many relationship. A single cause can produce multiple symptoms, and a single symptom can have multiple causes. If you alert on causes, you get paged four times for one incident: the database slows down, the API follows, the queue backs up, and the retry rate climbs. If you alert on symptoms, you get paged once: users are seeing errors. You then investigate causes with dashboards and traces, which are built for investigation, rather than with your pager, which is built for interruption.

We learned this the hard way during a database connection pool scare. We had alerts on pool utilization at 80%. One Friday night the pool hit 82% for about six minutes during a batch import. The on-call engineer got paged, spent 40 minutes investigating, found nothing wrong, and went back to bed. The following Friday, a different engineer got the same alert during the same batch window. This happened three weeks running. By the fourth week, when the pool hit 92% because of an actual connection leak, the on-call glanced at the alert and dismissed it as "the Friday batch thing." We caught the real leak two hours later from a customer complaint about timeouts. After that, we ripped out every infrastructure-metric pager alert and replaced them with symptom alerts on error rate and p99 latency.

A well-instrumented service might expose 200 metrics. Perhaps 5-10 represent genuine user-facing symptoms worth alerting on. The rest are diagnostic context: invaluable during an investigation, counterproductive as pager triggers. Cause-based metrics are not useless. During an incident, you absolutely want that CPU chart and connection pool graph. They help you diagnose. But the pager should only ring when the patient is symptomatic, not when lab results are slightly outside normal range.

SLO Lens

SLO-Based Alerting

Symptom-based alerting asks whether users are hurting. SLO-based alerting asks whether they are hurting enough to act now.

Not every elevated error rate is an emergency. If your 99.9% availability SLO allows 43 minutes of monthly downtime, a brief spike consuming 30 seconds is not an emergency. A sustained surge burning minutes per hour is.
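The arithmetic behind that 43-minute figure is easy to check. This is a small sketch assuming a 30-day month; the function name is illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


budget = error_budget_minutes(0.999)   # about 43.2 minutes for 99.9% over 30 days
spike_share = 0.5 / budget             # a 30-second spike uses roughly 1% of it
```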

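Burn rate alerting measures how quickly you are consuming error budget relative to the sustainable rate, the pace that would use up exactly the whole budget by the end of the SLO window. A burn rate of 1x lasts the full window; anything higher exhausts the budget early.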

Fast burns detect acute incidents: consuming budget at 14x sustainable rate over a 1-hour window. A deploy gone wrong, a dependency failure, a config change that broke auth. Something is actively on fire. This needs a human now.

Slow burns detect chronic degradation: consuming at 2-3x sustainable rate over 3 days. Gradual P99 latency rise, a slow memory leak, or a retry storm building over days. Not a page. A ticket. Engineering attention during business hours before it becomes a page.
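Burn rate is commonly computed as the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes that definition and reuses the 14x and 3x thresholds mentioned in this section. The function names and the "page" / "ticket" / "ok" labels are hypothetical, and the two rates are assumed to come from a 1-hour and a 3-day window respectively.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows (1x = sustainable)."""
    if requests == 0:
        return 0.0  # no traffic, no budget consumed
    return (errors / requests) / (1.0 - slo_target)


def classify(
    fast_window_rate: float,      # burn rate measured over the short window
    slow_window_rate: float,      # burn rate measured over the long window
    fast_threshold: float = 14.0,
    slow_threshold: float = 3.0,
) -> str:
    """Map the two burn rates to an urgency level."""
    if fast_window_rate >= fast_threshold:
        return "page"
    if slow_window_rate >= slow_threshold:
        return "ticket"
    return "ok"
```

With a 99.9% SLO, 200 failures in 10,000 requests is a burn rate of about 20x, which this sketch would page on; 40 failures in 10,000 is about 4x, which would become a ticket if it persisted across the long window.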

During low traffic, minor error counts barely dent the budget. During peak, even small percentage increases represent large numbers of unhappy users and the burn rate accelerates. Alerts are proportional to impact, not to a static threshold someone guessed at during initial setup.

SLO-based alerting also gives teams a shared language with the business. "Our error rate exceeded 0.5% for 15 minutes" means nothing to a product manager. "We burned 30% of our monthly reliability budget in one afternoon" gets their full attention.

If you haven't defined SLOs yet, start simple. Pick your most critical user journey, the one that generates the most support tickets if it breaks. Define a modest SLO: "this journey succeeds at least 99.5% of the time over a rolling 30-day window." Set up a fast-burn alert (14x rate over 1 hour) and a slow-burn alert (3x rate over 3 days). You now have one SLO-based alert pair that is more meaningful than dozens of static-threshold rules.

Fast burn

Short windows catch active incidents that are consuming reliability budget at an unsustainable rate.

Slow burn

Longer windows surface creeping degradation before it turns into a pager-worthy incident.

Shared language

Error-budget consumption translates reliability into business terms better than isolated threshold percentages.

Escalation Design

Tiering and Routing

Every false-urgency page erodes trust in the system.

Page: User-facing degradation that will worsen without intervention. Error budget burning unsustainably. Phone call or high-priority notification. Should fire no more than a few times per week across the entire portfolio. If it fires daily, the system has a design problem.

Notify: Slow burn underway. Capacity threshold approaching within days. Dependency showing early signs of instability. Team channel message, automatic ticket creation. Nobody's sleep interrupted.

Log: Mildly anomalous metric within historical variance. Dashboard data for trend analysis. No notification.
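The three tiers above can be written down as an explicit decision function, which makes the policy reviewable. This is only a sketch: the `Condition` fields, the 7-day capacity horizon, and the choice to send stable user-facing degradation to "notify" are assumptions, not rules from any particular platform.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    PAGE = "page"
    NOTIFY = "notify"
    LOG = "log"


@dataclass(frozen=True)
class Condition:
    """Hypothetical summary of what a rule has detected."""
    user_facing_degradation: bool = False
    worsening: bool = False
    fast_burn: bool = False
    slow_burn: bool = False
    dependency_unstable: bool = False
    days_until_capacity_limit: Optional[float] = None


def assign_tier(c: Condition, capacity_horizon_days: float = 7.0) -> Tier:
    # Page: the budget is burning unsustainably or users are hurting and it is getting worse.
    if c.fast_burn or (c.user_facing_degradation and c.worsening):
        return Tier.PAGE
    capacity_soon = (
        c.days_until_capacity_limit is not None
        and c.days_until_capacity_limit <= capacity_horizon_days
    )
    # Notify: early warnings that deserve attention during business hours.
    if c.user_facing_degradation or c.slow_burn or c.dependency_unstable or capacity_soon:
        return Tier.NOTIFY
    # Log: everything else stays on dashboards.
    return Tier.LOG
```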

Routing matters as much as tiering. An alert routed to the wrong team is functionally no alert at all. It sits unacknowledged in someone else's channel while the people who could fix it remain unaware. Route alerts by service ownership, not organizational hierarchy. Make ownership explicit, review it quarterly, enforce it through automation.
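One way to enforce ownership through automation is to make routing fail loudly when a service has no owner, and to check for gaps before they matter. The sketch below assumes a simple service-to-rotation mapping; the service and rotation names are invented for illustration.

```python
from typing import Iterable, List, Mapping

# Hypothetical ownership map: service name -> on-call rotation.
OWNERS = {
    "checkout-api": "payments-oncall",
    "tracking-api": "logistics-oncall",
}


class UnownedServiceError(LookupError):
    """Raised when an alert targets a service nobody owns."""


def route(service: str, owners: Mapping[str, str]) -> str:
    """Return the rotation that owns `service`, refusing to guess if none does."""
    try:
        return owners[service]
    except KeyError:
        raise UnownedServiceError(f"no owner registered for {service!r}") from None


def find_unowned(services: Iterable[str], owners: Mapping[str, str]) -> List[str]:
    """List services that would have nowhere to route; useful in a CI check."""
    return sorted(set(services) - set(owners))
```

Running `find_unowned` against the full service inventory during the quarterly review turns "make ownership explicit" into a check that can fail a build.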

Grouping and deduplication: When a database outage causes 15 downstream services to throw errors, the engineer should receive one alert, not 15. This requires defining the dependency graph. I've seen teams skip dependency mapping because it feels like overhead, and every single one regretted it within six months. Without the graph, the platform can't know that 15 simultaneous alerts share a root cause.
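To show why the dependency graph matters, here is a minimal sketch of root-cause grouping: an alerting service is treated as a downstream symptom if anything it depends on, directly or transitively, is also alerting. The graph format and function names are assumptions, and real platforms implement grouping in their own ways.

```python
from typing import Dict, Iterable, Mapping, Set


def transitive_deps(service: str, depends_on: Mapping[str, Iterable[str]]) -> Set[str]:
    """All services that `service` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        stack.extend(depends_on.get(dep, ()))
    return seen


def group_alerts(
    alerting: Iterable[str], depends_on: Mapping[str, Iterable[str]]
) -> Dict[str, Set[str]]:
    """Map each root-cause service to the downstream alerting services it explains."""
    firing = set(alerting)
    upstream = {s: (transitive_deps(s, depends_on) & firing) - {s} for s in firing}
    roots = {s for s, ups in upstream.items() if not ups}
    groups: Dict[str, Set[str]] = {root: set() for root in roots}
    for service, ups in upstream.items():
        for root in ups & roots:
            groups[root].add(service)
    # Services caught in a dependency cycle have no root; keep them rather than drop them.
    covered = roots.union(*groups.values())
    for service in firing - covered:
        groups[service] = set()
    return groups
```

With 15 services that all depend on `db` and all 16 alerting at once, `group_alerts` returns a single group keyed by `db`, which is the one alert the engineer needs to see.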

Useful benchmark for pages: fewer than two per 8-hour on-call shift. Actionable rate matters more than raw count. Five pages with 90% actionable rate is healthier than one page that's always noise.

Page

Reserved for urgent user-facing degradation that will worsen without immediate intervention.

Notify

Used for slow-burn or early-warning conditions that deserve attention without interrupting sleep.

Log and group

Mild anomalies stay in dashboards, and correlated incidents should collapse into one actionable signal.

Lifecycle

The Alert Lifecycle

Alerts are opinions about infrastructure formed at a specific time by a person who may no longer work at the company. Outdated alerts hurt the team every time they fire.

Quarterly audits ask three questions:

Has this alert fired in the last 90 days? If not, it may monitor a condition that no longer occurs. Delete it or document why it must remain.

When it fired, did someone take action? If consistently acknowledged and resolved without intervention, the alert is noise. Tune it or remove it.

Is the threshold still appropriate? Systems change. A memory threshold meaningful on a 4GB instance is nonsensical on a 32GB container.

If more than 30% of alerts are dismissed without action, your rules need significant rework. Target an actionable rate of 80% or higher.
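The first two audit questions and the 30% rule are mechanical enough to script; the third (is the threshold still appropriate?) needs human judgment. The sketch below assumes you can export, per rule, how often it fired and how often someone acted on it during the 90-day window. The `RuleHistory` shape and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RuleHistory:
    """Hypothetical per-rule export for the audit window."""
    name: str
    fired: int      # times the rule fired in the window
    acted_on: int   # fires that led to a corrective action


def never_fired(rules: Iterable[RuleHistory]) -> List[str]:
    """Rules to delete, or to document a reason for keeping."""
    return [r.name for r in rules if r.fired == 0]


def actionable_rate(rules: Iterable[RuleHistory]) -> float:
    """Share of all fires that led to action, across the whole rule set."""
    rules = list(rules)
    total = sum(r.fired for r in rules)
    if total == 0:
        return 1.0  # nothing fired, so nothing was dismissed
    return sum(r.acted_on for r in rules) / total


def needs_rework(rules: Iterable[RuleHistory], max_dismissed: float = 0.30) -> bool:
    """True when more than `max_dismissed` of fires were dismissed without action."""
    return (1.0 - actionable_rate(rules)) > max_dismissed
```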

We ran our first audit expecting to trim maybe 15% of our rules. We ended up deleting 43%. About a third of those were alerts for services that had been decommissioned months earlier but nobody had cleaned up the monitoring. The on-call ticket volume dropped noticeably that same sprint, and not a single deleted alert was missed in the following quarter.

The hardest part of alert pruning is emotional. Engineers are reluctant to delete alerts because each one was born from a real incident. Deleting feels like tempting fate. But an unactionable alert does not prevent incidents. It consumes attention, and attention is the resource alert fatigue depletes. The engineer who proposes deleting a noisy alert is doing more for reliability than the one who adds three untested alerts after every outage.

Freshness check

If an alert has not fired in 90 days, question whether it still reflects a living risk.

Actionability check

If a rule consistently resolves without human action, it is more likely noise than signal.

Threshold drift

Infrastructure evolves. Thresholds that once made sense can become nonsense after platform changes.

Feedback System

The Feedback Loop

Every post-incident review should include: Did we get alerted quickly enough? With the right urgency? And the question most teams forget: Which alerts fired in the past month that led to no meaningful action?

Track alert quality with a few metrics: volume over time, time to acknowledge, alerts per incident, and percentage resolved without action.

Volume over time — a steady upward trend without increased incident frequency means alerts are being created without scrutiny.
Time to acknowledge — if it's climbing, engineers are overwhelmed or have learned most alerts don't matter.
Alerts per incident — target one or two with correlated alerts grouped. Ten or more per incident means grouping needs work.
Percentage resolved without action — the best noise proxy you have.
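These four metrics can be derived from a plain export of alert events. The following is a rough sketch that assumes each event records a time bucket, the seconds until acknowledgment, an optional incident ID, and whether any action was taken. The `AlertEvent` fields are invented for illustration, and the median is used here as one reasonable summary of time to acknowledge.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class AlertEvent:
    """Hypothetical export row for one fired alert."""
    week: int                      # any time bucket index
    seconds_to_ack: float
    incident_id: Optional[str]     # None if the alert matched no incident
    action_taken: bool


def quality_report(events: Sequence[AlertEvent]) -> Dict[str, object]:
    if not events:
        raise ValueError("no alert events to report on")
    volume = Counter(e.week for e in events)
    per_incident = Counter(e.incident_id for e in events if e.incident_id is not None)
    return {
        "volume_by_week": dict(sorted(volume.items())),
        "median_seconds_to_ack": median(e.seconds_to_ack for e in events),
        "alerts_per_incident": (
            sum(per_incident.values()) / len(per_incident) if per_incident else 0.0
        ),
        "pct_resolved_without_action": (
            100.0 * sum(not e.action_taken for e in events) / len(events)
        ),
    }
```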

The loop closes when post-incident insights turn into rule changes: new alerts for gaps, tuned thresholds for misfires, deletions for irrelevancies. Without this step, retrospectives are theater.

Foster a culture where proposing to delete an alert is respected, not feared. The teams with functional alerting translate post-incident insights into alerting changes. The others hold retrospectives and change nothing.


Overhaul Story

What a Real Overhaul Looks Like

A logistics company we worked with (about 200 business customers, 60 services, four-person on-call rotation) had accumulated over 400 alerting rules in 18 months. The on-call engineer averaged 12 pages per shift, nearly two an hour overnight. About 70% of those pages were acknowledged and resolved with no corrective action: transient CPU spikes, single 5xx errors from health check timeouts, pod restarts Kubernetes handled on its own.

The breaking point came when a database connection pool exhaustion event took down the tracking API for 22 minutes. The alert that should have caught it (elevated P99 latency on the tracking endpoint) did fire, but alongside 11 other alerts from cascading downstream effects. The on-call engineer, already habituated to noise, took 14 minutes to begin investigation. Post-incident analysis showed 8 of those 11 concurrent alerts provided no diagnostic value beyond the original P99 alert.

The overhaul had three phases. Phase one was the audit: every alert tagged with fire count, acknowledgment time, and resolution action over the preceding 90 days, then sorted into four buckets. Actionable: the alert led to a human taking a corrective step. Self-resolving: the condition cleared before anyone acted. Duplicate: another alert already covered the same condition from a different angle. Orphaned: the alert monitored a service or metric that no longer existed. Of 400 rules, 112 were orphaned, 94 were duplicates, 87 were self-resolving. That left 107 that had led to real action.
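The four-bucket sort can be expressed as a small classifier. This sketch assumes a precedence order (orphaned, then duplicate, then self-resolving, then actionable) that the account above does not spell out, and the `AuditedRule` fields are hypothetical. The last lines simply re-check the arithmetic from the audit.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    ORPHANED = "orphaned"
    DUPLICATE = "duplicate"
    SELF_RESOLVING = "self-resolving"
    ACTIONABLE = "actionable"


@dataclass(frozen=True)
class AuditedRule:
    """Hypothetical audit record for one alerting rule."""
    target_exists: bool           # does the monitored service or metric still exist?
    covered_by_other_rule: bool   # does another rule already catch this condition?
    corrective_actions: int       # fires in the last 90 days that led to a human fix


def bucket(rule: AuditedRule) -> Bucket:
    if not rule.target_exists:
        return Bucket.ORPHANED
    if rule.covered_by_other_rule:
        return Bucket.DUPLICATE
    if rule.corrective_actions == 0:
        return Bucket.SELF_RESOLVING
    return Bucket.ACTIONABLE


# Re-checking the numbers reported for the 400-rule audit.
removed = {"orphaned": 112, "duplicate": 94, "self-resolving": 87}
actionable_count = 400 - sum(removed.values())   # 107
```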

Phase two was the redesign. The surviving 107 alerts went through the symptom-vs-cause lens. 43 cause-based alerts (CPU, memory, disk, pod health) became dashboard metrics and were removed from the pager. Those metrics didn't disappear. They were still visible on dashboards, still available during investigations. They just stopped interrupting sleep. The remaining 64 symptom-based alerts were consolidated and restructured around SLO burn rates for the five core user journeys: shipment creation, tracking lookup, route calculation, carrier dispatch, and billing reconciliation. Each journey got a fast-burn page and a slow-burn ticket. Rules dropped from 400 to 38.

Phase three was the feedback loop. A biweekly "alert quality standup," 20 minutes, where the on-call engineer from the previous rotation reviewed every alert, classified it as signal or noise, and proposed threshold adjustments. They also tracked three metrics on a team dashboard: alerts per shift, percentage leading to action, and mean time to acknowledge.

After six months: alerts per shift dropped from 12 to 2.4. Actionable rate climbed from 30% to 88%. Mean time to acknowledge genuine incidents fell from 14 minutes to 3 minutes, not because engineers got faster, but because they no longer sorted through noise first. On-call satisfaction scores improved from 3.1 to 7.8 out of 10. When a similar database issue occurred four months later, the single SLO burn-rate alert was acknowledged in under 90 seconds and resolved before most users noticed.

Phase 1: audit

Every rule was tagged as actionable, self-resolving, duplicate, or orphaned based on the last 90 days.

Phase 2: redesign

Cause-based pager rules were removed, and the remaining symptom rules were rebuilt around SLO burn rates.

Phase 3: feedback loop

A regular alert-quality review kept thresholds, rules, and outcomes aligned with operational reality.

Trend Metrics

Alert Volume Over Time

One metric worth calling out specifically: alert volume trend. A steady upward trend in alert count, especially if it correlates with new services being onboarded but not with increased incident frequency, indicates new alerts are being created without corresponding scrutiny. Volume should be roughly stable or declining as the system matures and alert quality improves. If it's climbing, something is wrong with your alert creation culture.

Similarly, track alerts per incident. In a well-tuned system, a single incident should produce one or two related alerts with correlated alerts grouped. If your ratio is 10 alerts per incident or higher, grouping and deduplication need work.

Cutover

Handling the Transition

If you're reducing alerts and moving to SLO-based rules, run both systems in parallel for 2-4 weeks. Keep old alerts active but route them to a low-priority audit channel. Route the new SLO-based alerts to the pager. At the end, compare: did the new alerts catch everything the old ones caught? Did the old ones fire on anything the new ones missed that actually mattered? This shadow-mode approach gives you empirical confidence before you commit to the cutover, and it builds team trust through evidence rather than faith.
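The end-of-trial comparison is a set comparison over the incidents that actually mattered. A minimal sketch, assuming you can list the incident IDs from the trial period and which of them each alerting system caught; the function names and the cutover rule are illustrative.

```python
from typing import Dict, Iterable, List, Set


def compare_shadow_run(
    incidents: Iterable[str], old_caught: Set[str], new_caught: Set[str]
) -> Dict[str, List[str]]:
    """Compare which real incidents each alerting system caught during the trial."""
    real = set(incidents)
    return {
        "caught_by_both": sorted(real & old_caught & new_caught),
        "missed_by_new": sorted((real & old_caught) - new_caught),
        "missed_by_old": sorted((real & new_caught) - old_caught),
        "missed_by_both": sorted(real - old_caught - new_caught),
    }


def ready_for_cutover(report: Dict[str, List[str]]) -> bool:
    """Assumed rule: cut over only if the new alerts missed nothing the old ones caught."""
    return not report["missed_by_new"]
```

Anything listed under `missed_by_new` is a gap to close before retiring the old rules; anything under `missed_by_both` is a gap neither system covered.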

Myths

Common Misconceptions

"More alerts equals more safety." More alerts means more noise, which means less attention. A system with 50 rules at 95% actionable rate is safer than one with 500 rules at 20%.

"We cannot delete that alert, it was created after the big outage." An alert's emotional origin says nothing about its current value. Many post-incident alerts are created under pressure with thresholds that are either too sensitive (the team is in "never again" mode) or too narrow (tuned to detect the specific failure that just happened, not the category of failure it represents). An alert deserves to exist only if it is actionable today.

"Reducing alerts means missing incidents." Reducing alerts doesn't reduce observability. Metrics, logs, and traces still exist. What changes is what crosses the threshold of human interruption versus what stays available for investigation.

"Static thresholds are good enough." They assume systems behave the same at all times. A 60% CPU threshold that's appropriate at 3pm is wildly aggressive during a 3am batch processing window. SLO-based burn rates handle natural variability that static numbers cannot.

"Alert fatigue is a tooling problem." Switching monitoring platforms won't solve this. The same team that generated 500 noisy alerts in Nagios will generate 500 in Datadog. The tooling may make better design easier to implement, but the design thinking has to come first.

Noise is not safety

High alert volume often makes the system less safe by teaching engineers that urgency is usually false.

Deletion is not recklessness

An alert earns its place by being actionable now, not by having been emotionally created after a past outage.

Tooling is not the root cause

A platform can help implement better rules, but it cannot invent better alert design thinking for the team.

Takeaways

Alert on symptoms, not causes. The pager is for user-facing impact; infrastructure metrics belong on dashboards.
Every alert should have a documented action. If the on-call response is "look at it and hope it resolves," the alert shouldn't exist.
Use SLO burn rates instead of static thresholds to make alerts proportional to business impact.
Tier ruthlessly: page, notify, or log. Only conditions requiring immediate human intervention should produce an intrusive notification.
Audit alerts quarterly and delete without guilt. An alert that hasn't led to meaningful action in 90 days is noise.
Group and deduplicate. One incident should produce one page, not fifteen.
Close the feedback loop. Post-incident reviews should produce alerting changes, not just meeting notes.
Treat alert design as ongoing work, not a one-time config. Healthy on-call teams normalize proposing deletions and review alerting quality as regularly as code quality.
