Anomaly Detection and AIOps
1. What an anomaly is (and what it isn’t)
An anomaly is an observation that deviates significantly from a system’s established pattern of behavior. That’s the textbook line. The interesting part is what the textbook line leaves out: an anomaly is not the same as a problem, and most of the difficulty in this field lives in the gap between the two.
A common taxonomy distinguishes three kinds of anomaly:
A point anomaly is a single sample that’s weird on its own. CPU jumps from 20% to 100%; one HTTP request takes ten seconds in a service whose p99 is fifty milliseconds. The vast majority of anomaly-detection content addresses this case because it’s the easiest to reason about.
A contextual anomaly is weird in its context but normal somewhere else. Three in the morning is a quiet hour for most consumer services, so a tenfold spike at 3 AM is anomalous, even though the same value at 8 PM would be routine. Context can be temporal (time of day, day of week, holiday vs. ordinary), spatial (this region vs. another), or relational (this service’s error rate vs. its peers’).
A collective anomaly is one where no individual sample is suspicious but the pattern is. Latency creeps up half a millisecond per hour for two days. No single sample violates any threshold. The drift itself is the anomaly.
That taxonomy is useful because each kind demands different detection methods, and most production “anomaly detection” failures come from applying a point-anomaly method to a collective-anomaly problem.
The harder distinction is between an anomaly and a problem. Detectors find the first; on-call engineers want the second. The gap is wider than most teams expect:
- An anomaly without a problem is the noise the pager hates. A cron job fires, batch traffic spikes, an autoscaler adds capacity, latency briefly degrades during a release. All anomalous, none worth waking anyone up for.
- A problem without an anomaly is the silent killer. A slow memory leak, a quietly stale cache, a third-party dependency returning the wrong data with a healthy 200 status code. The signal sits inside the noise floor of every detector you have.
The economics make things even worse. Production systems have millions of metrics. A detector with a 0.1% false-positive rate, run every minute against a million metrics, generates well over a million false alerts a day. Tuning that down is half the job. The other half is making sure the genuine problems aren’t being thrown out with the noise.
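The arithmetic behind that claim fits in a few lines. This is a back-of-the-envelope sketch, assuming every evaluation is an independent trial with the same false-positive rate; the function name and parameters are illustrative, not from any particular tool.

```python
def false_alerts_per_day(num_metrics, false_positive_rate, evaluations_per_day=24 * 60):
    """Expected false alerts per day, assuming each evaluation errs independently."""
    return num_metrics * evaluations_per_day * false_positive_rate


# Figures from the paragraph above: a million metrics, checked every minute, 0.1% FPR.
daily = false_alerts_per_day(num_metrics=1_000_000, false_positive_rate=0.001)
print(f"{daily:,.0f} false alerts per day")
```

Real metrics are correlated, so the true count will differ, but the order of magnitude is the point.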
“Detect anomalies” is the wrong starting point. The framing that pays is “detect actionable signals,” and most of what follows is about closing the gap between the two.
2. The statistical ladder out of static thresholds
Most teams who reach for “anomaly detection” are actually climbing out of static thresholds, and most never need to leave the statistical ladder to do it. The ladder runs through five or six well-understood rungs, each one answering a specific failure of the rung below.
Static thresholds are where everyone starts. Alert if CPU > 80%, latency > 500ms, error rate > 1%. They’re easy to write and easy to understand, which is why they spread. They fail in three ways: services with high normal baselines never breach them, services with low normal baselines breach them under harmless load shifts, and a single global threshold can’t fit the dozens of services that share a dashboard.
Moving averages were the first answer. Compute the average over the last N minutes; alert when the current value exceeds that average by some margin. The threshold becomes adaptive. The catch: a plain moving average has no memory of recency, so old data and recent data weigh equally. Doubling the window doesn’t change that.
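A minimal sketch of a moving-average threshold, assuming a fixed window and a multiplicative margin (both hypothetical defaults chosen for illustration):

```python
from collections import deque


def moving_average_alerts(samples, window=5, margin=1.5):
    """Return (index, value, baseline) wherever value > margin * mean of the previous `window` samples."""
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(samples):
        if len(recent) == window:
            baseline = sum(recent) / window  # every sample in the window weighs the same
            if value > margin * baseline:
                alerts.append((i, value, baseline))
        recent.append(value)
    return alerts


cpu = [20, 22, 21, 19, 20, 21, 60, 22, 20]  # illustrative CPU percentages
alerts = moving_average_alerts(cpu)
```

Only the spike at index 6 alerts. It then sits in the window at full weight and inflates the baseline until it ages out, which is the equal-weighting problem described above.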
Exponentially weighted moving averages (EWMA) fix the memory problem. The update rule is one line:
EWMA_t = α · x_t + (1 − α) · EWMA_(t−1)
Here x_t is the newest sample and α is a smoothing factor between 0 and 1. New samples get weight α; older samples decay geometrically. Pick α = 0.1 and roughly the last ten samples dominate. The whole point of EWMA is that it gives recent data more weight without throwing old data away abruptly. Production observability stacks use EWMA all over the place, often without naming it.
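The same rule as a short sketch. Seeding the average with the first sample is an assumption (other initialisations are common), and `weight_of_last` is a hypothetical helper that quantifies how much of the total weight the most recent samples carry.

```python
def ewma(samples, alpha=0.1):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in samples:
        prev = smoothed[-1] if smoothed else x  # assumption: seed with the first sample
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed


def weight_of_last(n, alpha):
    """Share of total weight held by the most recent n samples, given a long history."""
    return 1 - (1 - alpha) ** n


smoothed = ewma([10, 20], alpha=0.5)
recent_share = weight_of_last(10, alpha=0.1)
```

With α = 0.1 the most recent ten samples carry about 65% of the weight, and the rest tails off geometrically rather than dropping out at a window edge.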
Z-scores add a measure of how surprising a sample is. Track the mean and standard deviation of recent values; express the new sample in standard-deviation units away from the mean. The classic “3-sigma” alert fires when a sample is more than three standard deviations from baseline. Z-scores work well when the underlying signal is roughly normally distributed. They fail badly when it isn’t, which in observability data is most of the time.
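A rolling 3-sigma check can be sketched as follows. The window size and the sample latencies are illustrative assumptions.

```python
from collections import deque
from statistics import mean, stdev


def zscore_alerts(samples, window=30, threshold=3.0):
    """Return (index, value) for samples more than `threshold` sigmas from the rolling mean."""
    history = deque(maxlen=window)
    alerts = []
    for i, x in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                alerts.append((i, x))
        history.append(x)
    return alerts


latency_ms = [50, 52, 48, 51, 49] * 6 + [90, 50, 51]  # one obvious outlier at index 30
alerts = zscore_alerts(latency_ms)
```

The outlier is flagged, but it also enters the history and widens the standard deviation for the next thirty samples, so a second spike shortly afterwards is harder to catch.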
Seasonal-trend decomposition (STL) is the rung that matters for any system with daily or weekly cycles. STL splits a time series into three components: a slow trend, a repeating seasonal pattern, and a residual. Anomalies are unusual values in the residual, not in the raw signal. Without seasonal decomposition, every Monday morning traffic ramp looks anomalous. With it, only Monday mornings that don’t look like Monday mornings do.
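The idea of scoring residuals instead of raw values can be shown without a full STL implementation. The sketch below is deliberately not STL: it assumes no trend, uses the per-phase median as the seasonal profile, and scores residuals with the median absolute deviation. A real deployment would more likely use a library STL implementation.

```python
from statistics import median


def seasonal_residual_alerts(samples, period, threshold=5.0):
    """Flag indices whose residual (value minus per-phase median) is extreme."""
    # Seasonal profile: the median of every sample that shares a phase in the cycle.
    profile = [median(samples[phase::period]) for phase in range(period)]
    residuals = [x - profile[i % period] for i, x in enumerate(samples)]
    center = median(residuals)
    # Median absolute deviation as a robust scale; tiny fallback avoids division by zero.
    mad = median(abs(r - center) for r in residuals) or 1e-9
    return [i for i, r in enumerate(residuals) if abs(r - center) / mad > threshold]


pattern = [10, 50, 100, 50]        # one illustrative "day" in four buckets
jitter = [0, 1, -1, 2, -2, 1]      # small per-day variation
traffic = [base + jitter[day] for day in range(6) for base in pattern]
traffic[9] = 100                   # peak-level traffic in an off-peak bucket
flagged = seasonal_residual_alerts(traffic, period=4)
```

The value 100 appears routinely at the daily peak and is never flagged there. It is flagged only at index 9, where the seasonal profile expects about 50.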
Holt-Winters forecasting extends the same idea: model trend and seasonality, forecast the next value, alert when reality deviates from the forecast by more than the model’s expected error. Production-grade alerting at smart shops is usually some version of this, dressed up with service-specific tuning.
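A compact sketch of additive Holt-Winters used as a detector. The smoothing parameters, the initialisation from the first two seasons, and the choice to track expected error as an EWMA of absolute forecast error are all assumptions made for illustration, not a recommended configuration.

```python
def holt_winters_alerts(samples, period, alpha=0.3, beta=0.05, gamma=0.2,
                        k=4.0, min_err=1.0):
    """Return indices where |actual - forecast| exceeds k times the expected error."""
    if len(samples) < 2 * period:
        raise ValueError("need at least two full seasons to initialise")
    first, second = samples[:period], samples[period:2 * period]
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / period ** 2
    season = [x - level for x in first]
    expected_err = min_err
    alerts = []
    for t in range(period, len(samples)):
        x, s = samples[t], season[t % period]
        forecast = level + trend + s          # one-step-ahead forecast
        err = abs(x - forecast)
        if err > k * max(expected_err, min_err):
            alerts.append(t)
        # Standard additive updates for level, trend, and the seasonal term.
        new_level = alpha * (x - s) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (x - new_level) + (1 - gamma) * s
        level = new_level
        expected_err = 0.1 * err + 0.9 * expected_err  # assumption: EWMA of absolute error
    return alerts


pattern = [10.0, 20.0, 40.0, 20.0]
traffic = pattern * 8
traffic[21] += 50.0                            # injected spike
alerts = holt_winters_alerts(traffic, period=4)
```

The injected spike at index 21 is the first point flagged. Note that this sketch feeds the anomalous sample back into the model, which distorts the next few forecasts; whether to exclude flagged points from updates is a design choice.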
Change-point detection asks a different question: “did the regime shift?” Rather than scoring each sample, it looks for moments where the underlying distribution changes. Useful for catching slow drift that no point-anomaly detector will notice.
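One classic way to ask the regime-shift question is a two-sided CUSUM, sketched below. The baseline length, slack, and threshold are illustrative assumptions, and the baseline is taken to be the first `baseline_n` samples.

```python
from statistics import mean, stdev


def cusum_change_point(samples, baseline_n=20, slack=0.5, threshold=5.0):
    """Return the first index where accumulated drift exceeds `threshold`, else None."""
    base = samples[:baseline_n]
    mu = mean(base)
    sigma = stdev(base) or 1e-9
    up = down = 0.0
    for i in range(baseline_n, len(samples)):
        z = (samples[i] - mu) / sigma
        # Accumulate deviations beyond the slack; reset to zero when they cancel out.
        up = max(0.0, up + z - slack)
        down = max(0.0, down - z - slack)
        if up > threshold or down > threshold:
            return i
    return None


steady = [99.0 if i % 2 == 0 else 101.0 for i in range(20)]
drifting = [(99.0 if i % 2 == 0 else 101.0) + 0.2 * (i - 19) for i in range(20, 40)]
change_at = cusum_change_point(steady + drifting)
```

In this data the drift is detected at index 29, ten samples after it starts, even though no sample up to that point is more than three baseline standard deviations from the baseline mean.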
The lesson of the ladder is that most teams jumping to ML are skipping work that pays. EWMA plus seasonal decomposition handles the majority of operational signal at a fraction of the operational cost of a maintained ML model.
3. The ML toolbox for anomaly detection
When statistical methods aren’t enough, ML enters. One question collapses most of the method-choice tree before any of the rest: do you have labels or not?
Most reliability data is unlabeled. A team has millions of metrics streaming in continuously and a handful of postmortem documents describing past incidents in prose. Mapping postmortems back to “which metrics were anomalous between 14:23 and 15:47 on a given Tuesday” is manual archaeology, and most teams don’t do it. So most production ML for anomaly detection is unsupervised.
Isolation Forest is the workhorse of unsupervised tabular anomaly detection. The intuition is elegant: build a forest of random decision trees, and measure how many splits it takes to isolate each point. Anomalies isolate fast (few splits, because they live far from the bulk of the data); normal points take many splits. The method is fast, scales well, needs little tuning, and is the default starting point for batch unsupervised detection.
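The isolation intuition can be demonstrated in pure Python. This is a simplified sketch, not the published algorithm: it follows only each point’s own branch through freshly drawn random splits, and it omits the subsampling and path-length normalisation that a production implementation would include.

```python
import random


def _path_length(point, data, rng, depth, max_depth):
    """Number of random splits needed before `point` is alone (or depth runs out)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return _path_length(point, same_side, rng, depth + 1, max_depth)


def isolation_scores(points, n_trees=100, max_depth=10, seed=0):
    """Mean path length per point; shorter means easier to isolate, hence more anomalous."""
    rng = random.Random(seed)
    return [
        sum(_path_length(p, points, rng, 0, max_depth) for _ in range(n_trees)) / n_trees
        for p in points
    ]


data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(50)]
points = cluster + [(10.0, 10.0)]              # one point far from the bulk
scores = isolation_scores(points)
most_isolated = min(range(len(points)), key=lambda i: scores[i])
```

The far-away point has the shortest average path because a single random split usually separates it from the cluster.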
Autoencoders are neural networks trained to reconstruct their own input through a low-dimensional bottleneck. Trained on normal data, they get good at reconstructing normal patterns and bad at reconstructing anything else. Reconstruction error becomes the anomaly score. They handle complex, high-dimensional inputs (logs, traces, multivariate metric tensors) better than simpler methods, at the cost of training and operational overhead.
Density-based methods like DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and LOF (Local Outlier Factor) flag points that live in low-density regions relative to their neighborhood. Conceptually intuitive: anomalies are the lonely points in feature space. Operationally expensive at scale, because the neighborhood calculation is quadratic in the naive implementation.
Supervised methods enter when labels exist. Two sources of labels actually work in practice: synthetic faults injected by chaos engineering, and carefully curated postmortem databases. Given labels, classic classifiers (random forest, gradient-boosted trees) on engineered features perform well and are easier to operate than deep models. The bottleneck is the labels, not the classifier.
Sequence models address temporal patterns specifically. LSTMs (long short-term memory networks) and, more recently, transformers, learn the shape of a time series and flag departures from learned dynamics. Useful when the anomaly isn’t a single bad value but a pattern: a request rate that’s drifting up too smoothly, a latency distribution that’s deforming in a way no single sample reveals.
The dirty truth that vendor marketing avoids: across published benchmarks, simple methods with careful seasonal preprocessing routinely outperform more sophisticated ones for most operational use cases. A z-score with a sane window beats a poorly tuned LSTM. An isolation forest with engineered features beats a sloppy autoencoder. The method choice matters less than the preprocessing, the parameter tuning, and the maintenance.
The real choice isn’t method-vs-method. It’s whether the team can label, can preprocess, can monitor model staleness, and can maintain the pipeline over time. Each of those is harder than picking a method, and each one is what production anomaly detection actually consists of.
4. AIOps as a category
AIOps is a Gartner coinage from around 2016, originally an abbreviation for “Algorithmic IT Operations” before being relabeled “Artificial Intelligence for IT Operations” once the AI branding became more useful. The term describes applying ML and statistical methods to IT operations data: metrics, logs, traces, events, topology. The underlying discipline is older than the term; the term is mostly a marketing artifact that gave vendors a category to sell into.
The canonical AIOps pipeline has four layers:
- Ingest. Collect signals from across the operations stack. Logs from every service, metrics from every host, traces from every request, topology from the deployment platform, change events from CI/CD.
- Correlate. Group related signals together. One real incident routinely produces hundreds of distinct alerts; correlation collapses them into something an on-call human can hold in their head.
- Analyze. Detect anomalies, predict failures, suggest root causes. This is the layer where the ML work actually happens.
- Automate. Close the loop with remediation: auto-restart, auto-rollback, auto-assign-ticket, runbook execution.
The honest assessment of which layers actually benefit from ML is less flattering than the vendor pitch.
Ingest is plumbing. No ML there worth naming. Schema management, pipeline reliability, sampling strategy, retention policy. Important work, not AI work.
Correlate is the layer Gartner sold hardest as ML-driven, and it’s the layer where deterministic methods generally win. The most reliable way to correlate alerts is topology-aware deduplication: if you know service A depends on service B, and B has an alert, you suppress A’s symptoms. Time-window dedup catches most of the rest. Unsupervised clustering of alerts looks good in demos and turns brittle in production. Some products do use ML here; most that do also have a deterministic backbone underneath.
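Topology-aware deduplication is simple enough to sketch directly. The dependency map and service names below are hypothetical; a real system would pull topology from a service catalog or tracing data.

```python
def root_alerts(alerting, depends_on):
    """Keep only alerting services that have no alerting dependency, direct or transitive."""
    alerting = set(alerting)

    def has_alerting_dependency(service, seen):
        for dep in depends_on.get(service, ()):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting or has_alerting_dependency(dep, seen):
                return True
        return False

    return sorted(s for s in alerting if not has_alerting_dependency(s, {s}))


# Hypothetical topology: checkout -> payments -> payments-db, checkout -> cart.
depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["payments-db"],
    "cart": [],
}
roots = root_alerts({"checkout", "payments", "payments-db"}, depends_on)
```

Three alerts collapse to one, `payments-db`. One caveat of this sketch: if services in a dependency cycle all alert, each suppresses the other, so a real implementation needs an explicit rule for cycles.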
Analyze is where ML earns its keep. Anomaly detection on metrics, predictive failure on hardware telemetry, log-pattern extraction (turning millions of variable-string log lines into a manageable set of templates), trace outlier detection. These are genuine wins, but they’re a small slice of the AIOps bundle.
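Log-pattern extraction, in its simplest form, means masking the variable parts of each line so that lines with the same shape collapse into one template. The sketch below uses a few hand-written regular expressions; it is a crude stand-in for the template-mining algorithms that products use, and the mask list is an assumption.

```python
import re
from collections import Counter

# Order matters: mask IP addresses before generic numbers.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+(?:\.\d+)?\b"), "<NUM>"),
]


def template_of(line):
    """Replace variable tokens with placeholders so similar lines share a template."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


logs = [
    "connection to 10.0.0.7 timed out after 3000 ms",
    "connection to 10.0.0.12 timed out after 2987 ms",
    "user 4821 logged in",
]
templates = Counter(template_of(line) for line in logs)
```

Counting templates over time turns free-text logs into a small set of numeric series, which the detection methods from earlier sections can then score.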
Automate is workflow. Take the signal that analysis produced, run a runbook against it. Useful, not ML.
The vendor landscape arranged around this pipeline includes Datadog Watchdog, Dynatrace Davis, New Relic AI, Splunk ITSI, ServiceNow ITOM, and PagerDuty AIOps. Most started as statistical correlation engines, picked up unsupervised anomaly detection along the way, and have recently bolted on LLM incident-assistant features (next section). Quality varies, and the marketing universally outpaces the product.
The honest read on “AIOps” as a purchase decision: the bundle has one or two genuinely ML-driven layers and a lot of plumbing dressed in AI vocabulary. Don’t buy “AIOps.” Buy the specific capability your operations gap actually needs. Alert fatigue gap: look at correlation and dedup. Detection gap: look at analytic capability. Mean-time-to-acknowledge gap: look at the LLM-driven triage layer covered below. The label that wraps all of these is doing less work than vendors imply.
5. LLMs in incident response
Large language models are the most recent entrant in this space, and they solve a different problem from the detection methods covered above. They don’t detect anomalies in continuous metrics. They help humans understand and respond to incidents once detection has fired. That distinction is doing a lot of work, and most of the confusion around “AI in observability” since 2023 collapses once you hold it firmly.
The places LLMs land genuine wins:
- Incident summarization. When five alerts fire across three services within sixty seconds, an LLM can produce a coherent paragraph describing what’s happening, given access to the alert text and recent metric context. The on-call engineer arrives at a chat thread that already says “checkout-service is returning 5xx because payment-gateway-client is timing out, started at 03:47 UTC after the 03:42 deploy” rather than five disjointed PagerDuty notifications.
- Log triage. LLMs are good at reading large bodies of unstructured text and surfacing the unusual lines. A thousand-line stack-trace dump that would take an engineer twenty minutes to skim becomes a three-bullet summary with the suspicious frames highlighted.
- Runbook synthesis and execution. “We’ve seen this before; here’s the runbook we used” suggestions, sometimes paired with the ability to execute the runbook with human approval. The runbook itself is typically deterministic; the LLM is the matcher and the explainer.
- RCA narration. Turning a graph of correlated signals and recent changes into a draft postmortem narrative. The draft still needs review, but the blank-page problem is gone.
- On-call companion. Chat-style interfaces against observability data. Instead of writing a PromQL query, the engineer asks the assistant in English. The assistant generates the query, runs it, returns the result, and explains what it means.
The places LLMs don’t help, and where the marketing gets ahead of the product:
- Continuous-metric detection at scale. LLMs are a poor fit for high-frequency numeric signals. Time-series anomaly detection remains statistical or specialist-ML work, and pretending otherwise produces expensive and slow detectors that perform worse than a tuned z-score.
- Real-time alerting paths. LLM inference is too expensive and too slow to sit in the millisecond-budget hot path of production alerting. Detection runs in a separate, cheap layer; the LLM enters after the detection has already fired.
- Authoritative root cause. LLMs are very good at producing plausible-sounding root-cause narratives. Plausibility is not correctness. A confident, well-written hypothesis that contradicts the actual cause is more dangerous than no hypothesis at all, because it anchors the engineer’s investigation in the wrong direction. The RCA-suggestion features that work treat the LLM output as a starting point for human investigation, not a verdict.
The pattern emerging across mature teams is ML for detection, LLM for triage. Detection runs cheap and fast on time-series data; the LLM enters after a signal has fired, to help the on-call engineer reach situational awareness in seconds rather than minutes. The division matches strengths to costs, and it avoids the temptation to use the most expensive model for the cheapest part of the workflow.
One more note on automation. The “agents that fix incidents on their own” pitch is real in marketing decks and rare in production. Most teams that have tried it use LLM suggestions with a human-in-the-loop check, because the blast radius of an LLM acting unilaterally on prod is roughly the same as a chaos experiment with no abort condition. The autonomy ceiling will rise, but slowly, and behind the same maturity gates that govern any other production automation.
6. Anomaly-based alerting vs SLO-based alerting
Anomaly-based alerting and SLO-based alerting are two competing philosophies for deciding when to wake a human up. They answer different questions, optimize for different failure modes, and have different operational costs. Mature teams run both, for different reasons.
Anomaly-based alerting asks: does the system look weird right now? Page when the current behavior departs from the recent past by more than some threshold. The signal is statistical departure, computed by whichever method from §2 or §3 the team has chosen. Strength: catches novel failures early, including ones that haven’t yet produced user-visible damage. Weakness: noisy. Every deployment, every traffic shift, every cron job firing is anomalous. The cost of running anomaly-based alerting is the never-ending work of tuning the noise floor down.
SLO-based alerting asks: is user-visible reliability threatened? SLO stands for service level objective, a number that defines the acceptable failure rate over a window (for example, “99.9% of requests should succeed over a rolling 30-day window”). The SLI is the service level indicator, the actual measured failure rate. The error budget is the amount of failure the SLO permits over that window (0.1% of requests, for a 99.9% target), and the burn rate is how fast measured failures are consuming it. SLO-based alerting fires when the burn rate exceeds a sustainable pace. Strength: quiet, focused, expressed in terms users feel. Weakness: blind to slow-burn problems that haven’t yet eaten the budget but will.
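Burn rate is a one-line ratio: the observed error rate divided by the error rate the SLO allows. A sketch, with hypothetical function names and request counts:

```python
def burn_rate(failed, total, slo_target):
    """Observed error rate divided by the allowed error rate (1 - SLO target)."""
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - slo_target)


def days_until_budget_exhausted(rate, window_days=30):
    """Days a full budget would last if the current burn rate held steady."""
    return float("inf") if rate <= 0 else window_days / rate


# 50 failures in 10,000 requests against a 99.9% target.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
days_left = days_until_budget_exhausted(rate)
```

A burn rate of 1 spends the budget exactly over the window. Here the rate is about 5, so a full 30-day budget would be gone in roughly six days; what counts as a page-worthy rate is a policy decision for the team.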
The trade-off is the precision-recall trade-off applied to on-call. Anomaly-based alerting is high-recall: it catches everything weird, including a steady stream of false positives. SLO-based alerting is high-precision: it pages only when user pain is real, at the cost of missing weak signals.
A concrete example makes the difference vivid.
Suppose your p99 latency starts climbing slowly. Each individual sample is still within the SLO. The SLI hasn’t moved. The error budget burn rate is fine. An SLO-based alerting system says nothing. An anomaly detector trained on the recent past can flag the drift early: the trend has changed, even though the threshold hasn’t been crossed. The on-call engineer who responds to the anomaly has hours, possibly days, to investigate before the slow drift becomes a fast incident. By the time SLO-based alerting would have fired, the burn rate has already crossed dangerous thresholds and the situation is no longer ambiguous.
Now imagine the inverse. A flaky downstream service produces brief, self-resolving error spikes every few hours. Each spike is anomalous; none of them last long enough to consume meaningful budget. Anomaly-based alerting pages every time, until the on-call team mutes the metric in self-defense. SLO-based alerting stays silent, because the user impact is negligible.
The mature pattern is two systems running in parallel:
- SLO-based alerting for the pager. Only signals that correspond to real user-visible impact get a human out of bed. The signal-to-noise ratio for the pager is the single most important operational metric the team owns, and SLO-based alerting protects it.
- Anomaly-based alerting for investigation and dashboards. Weak signals, early warnings, novel patterns, regressions that haven’t yet violated any objective. These show up in a separate stream that an on-call human reviews on shift, but doesn’t get paged on. They feed root-cause hypotheses, deployment health checks, capacity planning.
The two systems serve different audiences. The pager serves the on-call human at 3 AM. The anomaly stream serves the team at 11 AM, reviewing what shifted overnight. Conflating them is the most common failure mode in this space: teams either page on every anomaly (and burn out the on-call rotation) or use only SLO-based alerting (and miss everything that the SLO didn’t notice).
7. The hard problems and prerequisites
Anomaly detection is a domain where the methods are easier than the conditions for the methods to work. Most of what kills production anomaly-detection projects isn’t the algorithm. It’s the data, the labels, and the operational discipline around them. Six traps recur across every team that tries this seriously.
Base-rate fallacy. A 1% false-positive rate sounds modest. Multiply it by a million metrics polled once a minute and you generate more than fourteen million false alerts per day. Methods get evaluated in academic benchmarks where the base rates are tame. Production base rates are brutal, and a method that’s “99% accurate” on a benchmark can be useless in production because the 1% of healthy samples it misflags dwarfs the 0.01% of samples that are genuine incidents.
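The fallacy becomes concrete once precision is computed from the base rate. The sketch below reads the 0.01% figure as the share of samples that are genuinely bad, and assumes the detector catches 99% of them; both are interpretations for illustration.

```python
def alert_precision(prevalence, true_positive_rate, false_positive_rate):
    """Fraction of fired alerts that correspond to a real problem (Bayes' rule)."""
    true_alerts = prevalence * true_positive_rate
    false_alerts = (1 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


precision = alert_precision(
    prevalence=0.0001,         # assumption: 0.01% of samples are real incidents
    true_positive_rate=0.99,   # assumption: detector catches 99% of them
    false_positive_rate=0.01,  # the 1% false-positive rate from the text
)
print(f"{precision:.2%} of alerts are real")
```

Under these assumptions fewer than one alert in a hundred is real, despite a detector that is right 99% of the time on each class.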
Concept drift. The system you trained the detector on is not the system you operate today. Code deploys, infrastructure migrates, traffic patterns shift, new services appear, old ones get deprecated. Models that performed well at launch silently degrade. Retraining cadence is a real problem, and it’s underdiscussed because it’s unglamorous. The model that worked six months ago is now flagging the new normal as anomalous and missing the new failure modes.
Cold start. New services have no history. Anomaly detection methods that need weeks of baseline data leave new code unprotected during the exact window when it’s riskiest. The pragmatic answer is borrowing baselines from analogous services or running both static thresholds and adaptive methods in parallel until enough data accumulates, but neither is satisfying.
Interpretability tax. An ML alert that says “something is wrong with high probability” is not actionable. The engineer needs to know which dimension is weird, by how much, and ideally why. Methods that can’t explain themselves get distrusted, and trust is what determines whether the alert gets investigated or muted. The most successful production deployments of ML-based detection treat interpretability as a hard requirement, not an afterthought.
Labeling scarcity. “This window was an incident” labels emerge from postmortems, slowly, with disagreement between engineers about when the incident started and ended. Most teams have hundreds of unlabeled incidents and dozens of inconsistently labeled ones. Without a labeling pipeline, supervised methods are off the table, and unsupervised methods can’t be evaluated rigorously.
Adversarial rarity. Real incidents are rare and weird by definition. Training data is dominated by normal operation. The events the detector most needs to catch are the events the detector has seen least often. This is the same shape as the problem fraud detection and rare-disease diagnosis face, and the field has learned that the methods that work are the ones that take the rarity seriously: anomaly scoring rather than classification, careful handling of class imbalance, synthetic data generation, and a willingness to operate at low precision in exchange for usable recall.
The prerequisites for adopting any of this seriously mirror those for chaos engineering:
- Telemetry good enough to characterize normal. Request rate, error rate, latency distribution, resource utilization, with consistent labels across services. If you can’t describe “normal” in numbers, no detector will help you find departures from it.
- Agreed-on signals. What does this team consider an incident? Without that agreement, no detector can be evaluated, and no labeling pipeline can produce coherent labels.
- Postmortem culture that produces usable labels. The labeled-data pipeline begins in the postmortem document. If postmortems are skipped, written carelessly, or buried, the labeled-data pipeline never exists, and the supervised side of this discipline is closed off.
- SLOs, even imperfect ones. SLO-based alerting is the precision side of the alerting pair from §6. Without it, anomaly-based alerting has nothing to be compared against and no quiet fallback.
The closing connection worth naming: chaos engineering and anomaly detection sit on the same axis of the reliability practice. Chaos engineering deliberately produces abnormal conditions in production. Anomaly detection notices abnormal conditions in production. Together they form a feedback loop: chaos generates the labeled abnormal data that anomaly detection needs to be evaluated against, and anomaly detection validates that chaos experiments were observable to begin with. Neither works well in a system that hasn’t done the observability and postmortem groundwork. Both reward teams that have.
The honest version of the field’s promise: anomaly detection won’t tell you when your system is broken. It will surface departures from normal that might mean something is broken, and a disciplined team with good telemetry, agreed signals, and a working postmortem culture will turn those signals into a faster path to recovery. The methods matter, the AIOps category matters, the LLM tooling matters. The prerequisites matter more.