Anomaly Detection for Data Quality: Why Your Monitoring Cries Wolf, and How to Build Alerts the Team Will Trust

Last updated: June 2026

A data team turns on anomaly detection across their warehouse on a Monday. By Wednesday the dedicated Slack channel has 140 alerts in it. Row counts are flagged as unusually high because a marketing campaign launched over the weekend. A nightly table is flagged as late because the clocks shifted for daylight saving. A revenue column gets an alert for an unusual spike that turns out to be one large enterprise deal closing, which is exactly the thing the whole company wanted to happen. By Friday, someone mutes the channel. Three weeks later a real bug doubles every order record for a day, the alert fires into the muted channel, and nobody notices until finance asks why the month came in 8 percent ahead of forecast.

That sequence is the core problem with data quality monitoring, and almost every team that adopts it lives through some version of the story. The same sensitivity that lets a system catch problems nobody thought to test for is the sensitivity that buries the real signal under a pile of false alarms. A monitor that never fires is useless. A monitor that fires constantly is worse than useless, because it trains the team to look away and then fails quietly the one time it counts.

This guide is about closing the distance between those two failure modes. It covers how anomaly detection for data actually works, why the naive version produces so much noise, and the specific practices that separate monitoring a team trusts from monitoring a team mutes. It assumes you own data that other people depend on, not that you are trying to decide whether monitoring is worth having.

What anomaly detection actually means for data

It helps to be precise about what this category does, because it is often confused with ordinary testing.

A test is an assertion you write in advance. In dbt or a similar framework you declare that a primary key is unique, that a column is never null, that an amount is never negative. The check is deterministic and it only catches the failure you already predicted. Anomaly detection works from the other direction. Instead of you stating the rule, the system learns what a metric normally looks like and flags departures from that baseline. Monte Carlo’s primer on the subject lays out the common statistical building blocks, including the Z-score and interquartile range methods that measure how far a value sits from its historical center, alongside machine learning approaches for messier patterns. The point of all of them is the same: catch the breakages, outliers, and unknown unknowns that no one wrote a test for.
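To make the two statistical building blocks concrete, here is a minimal sketch of a Z-score check and an interquartile range check in plain Python. The function names, the thresholds (3 standard deviations, 1.5 times the IQR), and the sample row counts are illustrative assumptions, not the settings of any particular product.

```python
import statistics

def zscore_outlier(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history` (threshold is an assumed default)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat history: any change at all is a departure.
        return value != mean
    return abs(value - mean) / stdev > threshold

def iqr_outlier(history, value, k=1.5):
    """Flag `value` if it falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

# Hypothetical daily row counts for one table.
daily_rows = [1000, 1020, 980, 1010, 995, 1005, 990, 1015]

print(zscore_outlier(daily_rows, 2000), iqr_outlier(daily_rows, 2000))
print(zscore_outlier(daily_rows, 1012), iqr_outlier(daily_rows, 1012))
```

Both checks learn their bounds from history rather than from a rule someone wrote, which is the distinction from a test. The IQR version tends to be less thrown off by a few extreme values in the history, since quartiles move less than a mean and standard deviation do.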

A useful way to see the split in practice is AWS Glue’s data quality feature, which keeps rules and anomaly detection as separate but complementary tools. You write explicit rules when you know exactly what to check, and you point analyzers at columns when you know they matter but cannot yet express what good looks like. The analyzers gather statistics over time and surface deviations from the profile they have built. That division captures the whole reason anomaly detection exists. You reach for it precisely where your foresight runs out.

The four things worth watching, and the trap inside each

Most data anomaly detection comes down to four signals, and each one has a built-in way of fooling you.

Freshness asks whether a table updated when it should have. This is the most useful signal and also the one most likely to misfire on a calendar quirk. A table that is reliably late every public holiday, or that shifted by an hour during a daylight saving change, will trip a freshness monitor that does not know about the calendar.

Volume asks whether the number of rows landed inside its normal range. Real business events break this constantly. A campaign, a new partner integration, a pricing change, or a single large customer can move volume far outside its baseline for entirely legitimate reasons.

Distribution asks whether the values themselves still look right, catching null spikes, shifts in averages, or a sudden collapse in the number of distinct values. This is the richest signal and the noisiest, because genuine seasonality looks identical to a problem if the model has not accounted for it.

Schema asks whether the structure changed. Of the four, this is the most trustworthy, since a renamed or dropped column is rarely ambiguous and almost always worth knowing about immediately.
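Part of why the schema signal is so unambiguous is that it reduces to a set comparison. A minimal sketch, assuming each snapshot is a dictionary of column name to type string (the snapshot format and the column names are hypothetical):

```python
def schema_changes(previous, current):
    """Compare two schema snapshots (dicts of column name -> type)."""
    prev_cols, curr_cols = set(previous), set(current)
    return {
        "dropped": sorted(prev_cols - curr_cols),
        "added": sorted(curr_cols - prev_cols),
        "retyped": sorted(
            col for col in prev_cols & curr_cols
            if previous[col] != current[col]
        ),
    }

# Hypothetical snapshots taken on consecutive days.
yesterday = {"order_id": "bigint", "amount": "numeric", "created": "timestamp"}
today = {"order_id": "bigint", "amount": "text", "created_at": "timestamp"}

print(schema_changes(yesterday, today))
```

Note that a rename shows up here as one dropped column plus one added column. This sketch does not try to pair them up.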

The lesson is not that some signals are good and others are bad. It is that three of the four routinely fire on events that are real and harmless, and a monitoring setup that does not anticipate that will spend most of its alerts on things the business was happy about.

Why the naive version drowns you

The single biggest reason data monitoring generates noise is statistical, and it is worth understanding because it explains the Wednesday-afternoon flood.

Suppose you monitor 300 tables with a dozen metrics each. That is 3,600 checks running on every refresh. Even if each individual check has a small chance of firing falsely, say one in a hundred, you should expect about 36 false alarms per cycle by pure arithmetic (3,600 × 0.01). None of them indicate a real problem. They are the expected behavior of running thousands of tests at once. This is the multiple comparisons problem, and it does not go away by making a single threshold less sensitive, because desensitizing a threshold to cut false alarms also makes you miss real ones. Monitoring has a hard tradeoff at its center: as a practical guide to false alerts in monitoring puts it, you cannot drive both false positives and false negatives to zero at the same time, so the real job is choosing an acceptable balance rather than chasing a perfect one.
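The arithmetic is short enough to check directly. This sketch uses the numbers from the example above and adds one assumption of its own: that the checks fire falsely independently of one another, which real monitors sharing upstream tables will not strictly satisfy.

```python
tables = 300
metrics_per_table = 12
false_positive_rate = 0.01  # one in a hundred, per check

checks = tables * metrics_per_table
expected_false_alarms = checks * false_positive_rate

# Probability that at least one check fires falsely in a cycle,
# assuming the checks are independent (a simplifying assumption).
p_at_least_one = 1 - (1 - false_positive_rate) ** checks

print(f"checks per refresh: {checks}")
print(f"expected false alarms per refresh: {expected_false_alarms:.0f}")
print(f"chance of at least one false alarm: {p_at_least_one:.6f}")
```

Even cutting the per-check rate tenfold, to one in a thousand, still leaves an expected three or four false alarms on every refresh at this scale.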

This is exactly where the better platforms earn their keep, and it is a question of statistical discipline rather than fancier dashboards. QuantumLayers has written about the advanced statistical safeguards that address the volume problem directly, including false discovery rate correction and effect size reporting, so that what reaches a human is a deviation large enough and significant enough to matter rather than the statistical noise that is guaranteed to appear when thousands of checks run together. The principle holds no matter which tool you use. If your monitoring does not correct for how many checks it is running, the math will hand you a steady stream of alarms that are technically real outliers and practically meaningless.
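One widely used false discovery rate correction is the Benjamini-Hochberg procedure. The sketch below is a generic illustration of that procedure paired with a simple effect size filter. It is not a description of how any named vendor implements its safeguards, and the check names, p-values, and the 10 percent minimum change are all made up for the example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if the check still counts as
    significant after controlling the false discovery rate at `alpha`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value is <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    keep = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[idx] = True
    return keep

def worth_alerting(checks, alpha=0.05, min_relative_change=0.10):
    """checks: list of (name, p_value, relative_change).
    Keep checks that survive FDR correction AND moved by a meaningful amount."""
    significant = benjamini_hochberg([p for _, p, _ in checks], alpha)
    return [
        name
        for (name, _, change), ok in zip(checks, significant)
        if ok and abs(change) >= min_relative_change
    ]

# Hypothetical results from one refresh.
checks = [
    ("orders.row_count", 0.001, 0.95),
    ("orders.null_rate", 0.039, 0.02),
    ("users.row_count", 0.008, 0.01),
    ("events.distinct_ids", 0.041, 0.30),
    ("billing.avg_amount", 0.20, 0.05),
]

naive = [name for name, p, _ in checks if p < 0.05]
print("naive p < 0.05:", naive)
print("after FDR + effect size:", worth_alerting(checks))
```

In this made-up batch, a plain p < 0.05 cutoff flags four of the five checks. The correction keeps two, and the effect size filter drops the one whose change was statistically detectable but only 1 percent in size.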

Seasonality is the difference between a useful baseline and a useless one

Data has rhythms, and a monitor that ignores them will treat every rhythm as a problem.

Most business metrics are seasonal at more than one scale. Web traffic and orders dip on weekends and spike on weekday mornings. Billing tables surge at the end of the month. Retail moves with holidays. A flat threshold, or even a simple average across all of history, will flag the normal Monday rise and the normal month-end surge as anomalies, which means it will be wrong on a predictable schedule.

The fix is to anchor expectations to comparable periods rather than to a single static number. Compare this Tuesday to the last several Tuesdays, not to the weekly mean. Compare this month-end to prior month-ends. The same logic shows up clearly in uptime monitoring, where the advice for cutting false alarms is to set thresholds against your own measured distribution: a guide to reducing false positive alerts recommends putting your timeout comfortably above your real p99 response time, because a limit set below normal variation will trigger on healthy traffic. The data equivalent is to learn the actual shape of a metric, including its weekly and monthly cycles, before deciding what counts as abnormal. A baseline that does not understand your calendar is not really a baseline.
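Comparing a day against the same weekday in prior weeks can be sketched in a few lines. The four-week lookback, the 25 percent tolerance, and the synthetic history (1,000 rows on weekdays, 400 on weekends) are assumptions chosen only to make the effect visible.

```python
import statistics
from datetime import date, timedelta

def same_weekday_baseline(history, day, weeks=4):
    """Values observed on the same weekday in each of the prior `weeks` weeks."""
    prior_days = [day - timedelta(weeks=w) for w in range(1, weeks + 1)]
    return [history[d] for d in prior_days if d in history]

def is_anomalous(history, day, value, weeks=4, tolerance=0.25):
    """Flag `value` if it differs from the median of comparable days
    by more than `tolerance` (as a fraction of that median)."""
    baseline = same_weekday_baseline(history, day, weeks)
    if len(baseline) < 2:
        return False  # not enough comparable history to judge
    center = statistics.median(baseline)
    return abs(value - center) > tolerance * center

# Synthetic history: five full weeks, busy weekdays and quiet weekends.
start = date(2026, 5, 4)
history = {}
for offset in range(35):
    d = start + timedelta(days=offset)
    history[d] = 400 if d.weekday() >= 5 else 1000

# The first Saturday after the history ends.
saturday = next(
    d for d in (start + timedelta(days=35 + i) for i in range(7))
    if d.weekday() == 5
)

flat_mean = statistics.fmean(history.values())
print("flat baseline flags 410:", abs(410 - flat_mean) > 0.25 * flat_mean)
print("same-weekday baseline flags 410:", is_anomalous(history, saturday, 410))
print("same-weekday baseline flags 900:", is_anomalous(history, saturday, 900))
```

A normal Saturday of 410 rows looks broken against the all-history mean and fine against prior Saturdays. The reverse case matters just as much: 900 rows on a Saturday is close to the overall mean but well outside what Saturdays look like, and only the comparable-period baseline catches it.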

Make every alert actionable, or do not send it

The teams that run quiet, trusted monitoring borrow a rule from site reliability engineering, and it is the most important rule in this whole piece.

Engineers who run production systems learned long ago that volume is the enemy of attention. One analysis of alerting practice across many teams found that of the flood of pages a typical stack produces, only a single-digit percentage needs immediate human action, while the rest quietly burn out the on-call rotation and teach people to ignore the pager. The discipline that follows is simple to state. If an alert fires and no one can take a specific action in response, it should not be an alert. It belongs on a dashboard, or in a log, or in a weekly digest, not in the channel that is supposed to mean something is wrong.

For data teams the practical translation is to alert against a defined expectation rather than against every wobble. Tie your monitoring to data service level objectives, the same way reliability teams tie paging to SLOs. A guide on alert fatigue makes the point that a lot of noise comes not from too many incidents but from alerting on the wrong things and on thresholds that were never tuned, and that SLOs give you a principled line for what deserves attention. Decide that a given table must be fresh by 7 a.m. and complete to within one percent, then alert when misses start eating into the error budget that commitment allows. Everything that does not threaten a commitment someone actually depends on can be observed without being escalated. Severity tiers and routing matter here too. A schema change on a core table pages the owner. A minor volume wobble on a staging table updates a dashboard and waits.
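A freshness commitment and a severity tier can be expressed as data and checked in a few lines. This is a deliberately simplified sketch: the class, the tier names, and the routing targets are hypothetical, all timestamps are assumed to be naive datetimes in a single timezone, and it checks only a single day's deadline rather than tracking an error budget over time.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class FreshnessSLO:
    table: str
    deadline: time   # the table must have loaded today by this time
    tier: str        # "core" or "staging" (assumed tier names)

def freshness_breached(slo, last_loaded_at, now):
    """True if today's deadline has passed and no load has landed today."""
    deadline_today = datetime.combine(now.date(), slo.deadline)
    if now < deadline_today:
        return False  # the commitment is not due yet
    return last_loaded_at is None or last_loaded_at.date() < now.date()

def route(slo, last_loaded_at, now):
    """Return 'page', 'dashboard', or None depending on breach and tier."""
    if not freshness_breached(slo, last_loaded_at, now):
        return None
    return "page" if slo.tier == "core" else "dashboard"

orders = FreshnessSLO(table="analytics.orders", deadline=time(7, 0), tier="core")
scratch = FreshnessSLO(table="staging.orders_raw", deadline=time(7, 0), tier="staging")

now = datetime(2026, 6, 2, 7, 30)
stale = datetime(2026, 6, 1, 6, 45)   # last load was yesterday
fresh = datetime(2026, 6, 2, 6, 50)   # load landed before today's deadline

print(route(orders, stale, now))
print(route(scratch, stale, now))
print(route(orders, fresh, now))
```

The same stale load produces a page for the core table and only a dashboard update for the staging one, which is the routing rule described above.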

A tuning loop that keeps trust intact

Monitoring is not something you switch on and walk away from. The teams whose alerts stay credible run a short, repeating loop.

Once a week, look at every alert that fired and sort it into one of three buckets. It was real and you acted on it, in which case the system worked. It was a known benign pattern, like the month-end surge, in which case you teach the model about the seasonality or suppress that specific case. Or it was a threshold set too tight, in which case you loosen it. Over a few cycles this drives the false positive rate down to where the channel becomes worth reading again. It also helps to measure that rate per monitor rather than in aggregate, because the noise is almost never spread evenly. A small number of badly tuned checks usually produce most of the false alarms, and quantifying the false positive rate for each alert type tells you exactly where to spend your tuning effort.
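Measuring the rate per monitor needs nothing more than the triage results from the weekly review. A minimal sketch, assuming each fired alert has been labeled with one of the three buckets (the bucket labels and monitor names here are invented):

```python
from collections import Counter

def false_positive_rates(triaged):
    """triaged: iterable of (monitor_name, bucket), where bucket is
    'real', 'benign', or 'too_tight'. Returns (monitor, rate) pairs,
    noisiest monitor first."""
    fired = Counter()
    false_alarms = Counter()
    for monitor, bucket in triaged:
        fired[monitor] += 1
        if bucket != "real":
            false_alarms[monitor] += 1
    rates = {m: false_alarms[m] / fired[m] for m in fired}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical triage results from one weekly review.
week = (
    [("orders.volume", "benign")] * 5
    + [("orders.volume", "too_tight")] * 2
    + [("orders.volume", "real")]
    + [("users.freshness", "too_tight"), ("users.freshness", "real")]
    + [("core.schema", "real")]
)

for monitor, rate in false_positive_rates(week):
    print(f"{monitor}: {rate:.0%} false positives")
```

In this invented week, one volume monitor accounts for seven of the eight false alarms, so that is where the tuning effort goes first.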

The other half of staying trusted is scope. Resist the urge to monitor everything on day one, which is what produces the 140-alert Wednesday. Start with the handful of tables that would cause the loudest meeting if they were wrong, get those quiet and reliable, and expand only once the team believes the alerts. Coverage that no one reads is not coverage.

What to do first if you are starting from scratch

If you have no anomaly detection today, sequence the rollout instead of buying a platform and pointing it at the whole warehouse.

Begin with freshness and volume on your two or three most-consumed tables, since stale or empty data on a heavily used table is both the most common failure and the easiest to catch. Set the expectations against comparable periods from the start, so you are not fighting seasonality from day one. Route by severity and make sure every alert that does fire carries enough context to act on, naming the table, what moved, and who owns it, so the responder is not starting an investigation from a single red dot. Then run the weekly tuning loop and keep widening scope as trust grows.
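What "enough context to act on" looks like depends on your tooling, but the minimum fields can be sketched as a small structure. The field names, the message layout, and the example values below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    table: str       # which table is affected
    signal: str      # freshness, volume, distribution, or schema
    observed: str    # what the monitor saw
    expected: str    # what it expected, from comparable periods
    owner: str       # who should respond
    severity: str    # e.g. "page" or "dashboard" (assumed tier names)

    def render(self):
        """Format the alert as a short plain-text message."""
        return (
            f"[{self.severity.upper()}] {self.table}: {self.signal}\n"
            f"observed: {self.observed}\n"
            f"expected: {self.expected}\n"
            f"owner: {self.owner}"
        )

alert = Alert(
    table="analytics.orders",
    signal="volume",
    observed="2,104,331 rows",
    expected="about 1,050,000 rows (median of last 4 Tuesdays)",
    owner="@orders-data-team",
    severity="page",
)
print(alert.render())
```

Making every field required means an alert cannot be constructed without an owner or an expectation, which is a cheap way to enforce the rule at the point where alerts are defined.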

The goal in the first month is not to cover everything. It is to produce an alert channel that no one wants to mute. Once the team believes that a message in that channel means something, you can extend the same standard across the rest of the warehouse, and the monitoring will keep working because people are still paying attention to it.

The bottom line

Anomaly detection earns its place by catching the failures nobody had the foresight to test for. That is also the source of its main risk, because a system sensitive enough to find the unexpected is sensitive enough to fire on every harmless surprise the business throws at it. The teams that get value from it are the ones that treat the noise as a design problem rather than a fact of life.

Correct for the number of checks you are running so the math does not flood you. Model seasonality so the calendar stops looking like a crisis. Make every alert actionable and tie it to an expectation someone depends on. Then run a tuning loop that keeps the false positive rate low enough for the channel to stay credible. The worst outcome is not an alert that misses a problem. It is a muted channel, which carries the full appearance of coverage and none of the protection. The next real anomaly is already on its way into your data. The only question is whether anyone will be watching the channel when it arrives.


Lurika is an independent publication covering data analytics. We are not owned by any analytics vendor.