Automation Thesis

Most outages happen with no alert at all. Your dashboard was never the problem.

Alerts tell you what already broke; a coworker notices the thing that has no alert yet.


Apollo Space Research


October 4, 2025 · 10 min read

The worst outage I ever watched unfold had a green dashboard the entire time. Every check passed. CPU calm, error rate flat, all the lights the soft reassuring color of a system that is fine. And meanwhile a queue was filling one job at a time, a slow tide that no threshold had been written for, until the morning it crested and took a customer-facing flow down with it. The graphs were honest. They were just answering a question nobody had thought to ask three weeks earlier.

That is the failure that no alert catches: the one you didn’t know to write an alert for. And it is most of them.

Here is the line I want to be precise about, because almost every monitoring stack is built on the wrong side of it. Alerts tell you what already broke; a coworker notices the thing that has no alert yet.

The dashboard answers questions you already had

The model everyone inherits is simple and seductive. You decide, in advance, what “broken” looks like (latency over a number, errors over a rate, disk past a line) and you wire a tripwire to each one. When a wire trips, a page fires. It feels like vigilance. It is genuinely useful. And it quietly assumes something it shouldn’t.

It assumes you already imagined the failure.

Every threshold on that dashboard is a question some engineer asked on a good day, with coffee and foresight, months ago. “What if latency spikes?” Good question; there’s a graph for it now. But consider the queue that fills one job at a time, the third-party API that started returning subtly wrong data instead of errors, the config drift that only matters under a traffic pattern you’ve never had before. Nobody asked those questions, so nobody drew those lines, so the dashboard stays green through the whole slow-motion accident.

A dashboard is a museum of the failures you already survived. It has nothing to say about the one walking toward you.

This is the part that stings for any team that has run a real fleet. Your monitoring is a perfect record of your past incidents and blind to your next one. Every panel was added the morning after something hurt. You are, in effect, fighting the last war on a wall of beautiful screens, and the next outage, the one with no alert at all, slides underneath every one of them.

[Figure: A wall of green dashboards answers only the failures someone imagined in advance, while the unimagined failure slides underneath every panel untracked.]

The naive fix is to add more alerts. It makes things worse.

So the obvious move, the one every team reaches for, is: add more tripwires. Cover more cases. Lower the thresholds. If the problem is the failures you didn’t predict, predict more of them.

I have watched this play out on team after team, and it fails in a specific, almost cruel way. You cannot pre-imagine the failure you’ve never seen (that’s what “never seen” means), so the new alerts still only cover the past. And each one you add raises the noise floor. Lower a threshold to catch more and it fires on Tuesdays when nothing is wrong. Now the on-call engineer has forty pages a week, thirty-nine of them noise, and the human cost arrives: alert fatigue. The page that finally does matter arrives in the same font, at 3am, as the thirty-nine that didn’t, and someone half-asleep swipes it away.

The naive fix doesn’t just fail to catch the unknown outage. It trains your team to ignore the alerts that work. You have spent your vigilance budget on questions you already knew the answers to, and there is none left for the question that hadn’t occurred to you.

And the cost compounds quietly. Every new tripwire is another thing to maintain, to tune, to mute during the maintenance window, to remember the reasoning behind six months from now when it fires and no one on the current rotation knows why it exists. The dashboard that was supposed to reduce your operational burden becomes a second system you operate. Teams end up with hundreds of alerts and a shared, unspoken agreement about which dozen actually mean anything. Which is to say, the real monitoring has quietly moved back into people’s heads, where it started, except now there’s a wall of screens pretending otherwise.

The deeper problem is that a threshold is a static guess about a moving system. It cannot ask “is this normal for right now?” because it doesn’t know what right now is. It only knows the number you typed in March. Black Friday traffic looks like an attack. A planned migration looks like a meltdown. The 3am batch job that always runs looks, to a dumb threshold, exactly like a 3am incident that has never happened before. The dashboard can’t tell the two apart, because telling them apart was never something a fixed line could do.
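To make the contrast concrete, here is a minimal sketch of a fixed tripwire next to a check that knows what this hour usually looks like. It is an illustration under stated assumptions, not a description of any particular product: the function names, the sample CPU readings, and the three-standard-deviation tolerance are all hypothetical, and a per-hour mean is only one of many ways to model "normal".

```python
from statistics import mean, stdev

def static_alert(value, limit):
    """The fixed tripwire: fires whenever value exceeds a number typed in advance."""
    return value > limit

def build_hourly_baseline(history):
    """history: iterable of (hour_of_day, value) pairs from past healthy periods.
    Returns {hour: (mean, stdev)} for hours with at least two samples."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def contextual_alert(hour, value, baseline, tolerance=3.0):
    """Fires when value sits more than `tolerance` standard deviations
    away from what this hour of day usually looks like."""
    if hour not in baseline:
        return False  # no context for this hour yet (a simplifying choice)
    mu, sigma = baseline[hour]
    sigma = max(sigma, 1e-9)  # avoid dividing by zero on perfectly flat history
    return abs(value - mu) / sigma > tolerance

# Hypothetical CPU readings: a nightly batch job near 90% at 03:00,
# and a quiet afternoon near 20% at 15:00.
history = [(3, 88), (3, 91), (3, 90), (3, 89),
           (15, 20), (15, 22), (15, 19), (15, 21)]
baseline = build_hourly_baseline(history)

print(static_alert(90, 80), contextual_alert(3, 90, baseline))    # True False
print(static_alert(60, 80), contextual_alert(15, 60, baseline))   # False True
```

With these made-up numbers, the fixed line pages on the routine 3am batch job and stays silent on an afternoon reading three times its usual level; the hour-aware check does the opposite.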

The shift: from thresholds you set to a coworker who watches

So what does the other side of the line look like? Not a smarter threshold. A different kind of thing entirely: something that watches the way a good operations engineer watches, which is to say, for change in character, not for a number crossing a line.

Think about how the best person on your team actually catches problems. They don’t memorize thresholds. They glance at a graph and something feels off: “that’s not how this usually looks at this hour.” They notice the deploy went out and the shape of traffic shifted in a way that has nothing to do with any alert. They connect the support ticket that just came in to the latency that’s still technically within bounds. They are not running queries. They are holding a model of normal in their head and feeling the deviation from it.

That is the capability worth building, and it is not a bigger dashboard. Alerts tell you what already broke; a coworker notices the thing that has no alert yet. The shift is from a system you configure with everything you fear, to a system that learns what your normal looks like and speaks up when reality stops matching, especially for the deviations no one wrote a rule for.

Concretely, that changes what the system is doing minute to minute. A threshold asks one frozen question forever: “is X over N?” A watching system asks a live one: “is what I’m seeing right now consistent with what this system looks like when it’s healthy?” The first question can only catch the failures you enumerated. The second can catch the ones you didn’t, because it isn’t matching against your list of fears. It’s matching against reality, and reality is where the unimagined outage actually lives.

[Figure: A static threshold only fires when a number crosses a line you set in the past; a watching coworker holds a live model of normal and notices when the system's character changes, even with no rule written for it.]

What “notices first” actually requires

It’s easy to say “a system that watches.” It’s worth being honest about what that demands, because it’s exactly the set of things a dashboard doesn’t have. Four of them, and none is a feature you bolt on later.

First, it has to be running when you aren’t looking: not waiting for you to open a tab, but continuously holding the question. The slow-filling queue crested at 6am on a Saturday. A dashboard is only as vigilant as the person staring at it, and at 6am on a Saturday nobody is. A coworker that notices first has to be the thing that’s awake.

Second, it has to see across the seams. The outage with no alert almost always lives between two systems: the deploy that shifted traffic, the upstream API that degraded, the queue that backed up because a downstream worker slowed by a hair. Each individual signal stayed in bounds. Only the relationship between them broke. A wall of separate panels can’t see a relationship; it can only see panels. The watcher has to hold them in one view to notice they’ve started disagreeing.
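A small sketch of what "only the relationship broke" can look like, using the slow-filling queue as the example. Everything here is an assumption made for illustration: the function names, the per-minute rates, and the idea of measuring how often the backlog rises are hypothetical, and a real watcher would need something sturdier than this.

```python
def each_in_bounds(samples, low, high):
    """The per-panel view: is every individual reading inside its own limits?"""
    return all(low <= s <= high for s in samples)

def backlog_drift(enqueued, processed):
    """The cross-panel view: running backlog implied by jobs arriving
    versus jobs completed, interval by interval (never below zero)."""
    backlog = 0
    drift = []
    for arrived, done in zip(enqueued, processed):
        backlog = max(0, backlog + arrived - done)
        drift.append(backlog)
    return drift

def growing_steps_fraction(drift):
    """Fraction of intervals in which the backlog rose. A value near 1.0
    means the two signals have quietly stopped agreeing."""
    rises = sum(1 for a, b in zip(drift, drift[1:]) if b > a)
    return rises / max(len(drift) - 1, 1)

# Hypothetical hour of per-minute rates: workers slowed "by a hair".
slow_enqueued = [100] * 60
slow_processed = [98] * 60

# Hypothetical healthy hour: the same values, but they trade places.
ok_enqueued = [100, 98] * 30
ok_processed = [98, 100] * 30

slow_drift = backlog_drift(slow_enqueued, slow_processed)
ok_drift = backlog_drift(ok_enqueued, ok_processed)

print(each_in_bounds(slow_enqueued, 90, 110), each_in_bounds(slow_processed, 90, 110))  # True True
print(slow_drift[-1], growing_steps_fraction(slow_drift))  # 120 1.0
print(ok_drift[-1], round(growing_steps_fraction(ok_drift), 2))  # 0 0.49
```

In both hours every individual reading is the same kind of number, and each panel would stay green. Only when the two series are held in one view does the steady climb show up.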

Third, and this is the part that separates a useful coworker from a screaming smoke detector: it has to bring you the why, not just the that. “Latency is up 12%” is a fact you now have to investigate. “Latency is up 12%, and it started four minutes after the 14:02 deploy, concentrated on the checkout path, while the upstream payments API began returning slower responses; here’s the correlated window” is a deviation with a story attached. The story is the product. A flag with no reason is one more thing on your plate. A flag with a reason is an investigation that’s already half done.
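One way to picture the difference between the "that" and the "why" is a flag that carries the recent changes that preceded it. The sketch below is a toy under stated assumptions: `ChangeEvent`, `explain`, and the 15-minute lookback are hypothetical names and values, and proximity in time only suggests where to look first; it does not establish cause.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    """Hypothetical record of something that changed in production."""
    at: datetime
    description: str

def explain(anomaly_start, metric, delta_pct, events, lookback=timedelta(minutes=15)):
    """Attach the changes that happened shortly before the anomaly began,
    nearest first. Correlation in time only, not proof of cause."""
    candidates = [e for e in events
                  if timedelta(0) <= anomaly_start - e.at <= lookback]
    candidates.sort(key=lambda e: anomaly_start - e.at)
    summary = f"{metric} is up {delta_pct}%"
    if not candidates:
        return summary + "; no recent change found in the lookback window"
    reasons = "; ".join(
        f"{int((anomaly_start - e.at).total_seconds() // 60)} min after {e.description}"
        for e in candidates
    )
    return f"{summary}, starting {reasons}"

events = [
    ChangeEvent(datetime(2025, 10, 4, 9, 0), "a config change"),
    ChangeEvent(datetime(2025, 10, 4, 14, 2), "the 14:02 deploy"),
]

print(explain(datetime(2025, 10, 4, 14, 6), "Latency", 12, events))
# Latency is up 12%, starting 4 min after the 14:02 deploy
```

The morning config change falls outside the lookback window and is left out, so the flag arrives with one candidate story instead of a list of everything that happened that day.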

There’s a fourth thing, easy to skip and the one that earns trust over time: it has to learn what to keep quiet about. A watcher that flags every deviation is just the noisy dashboard again, wearing a smarter coat. The 3am batch job, the weekly migration, the predictable Monday-morning surge: those are deviations from a flat baseline, but they are not problems, and a coworker learns that the way a person does: by seeing them happen, seeing them resolve fine, and adjusting the sense of normal accordingly. The goal isn’t to notice everything. It’s to notice the right things, and to get quieter, not louder, as it learns your system. That’s the inverse of alert sprawl, and it’s why this approach gets more trustworthy with age instead of more exhausting.
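As a rough illustration of "seeing them happen, seeing them resolve fine", the sketch below counts how many times a recurring deviation has ended harmlessly and stops flagging it once it has earned enough trust. The class name, the pattern key, and the count of three are all hypothetical, and a real system would need a far richer notion of "the same deviation" than a tuple.

```python
from collections import Counter

class QuietLearner:
    """Toy model of learning what to keep quiet about."""

    def __init__(self, benign_needed=3):
        self.benign_needed = benign_needed
        self.benign_counts = Counter()

    def should_flag(self, pattern):
        """Flag a deviation until it has resolved fine often enough."""
        return self.benign_counts[pattern] < self.benign_needed

    def record_outcome(self, pattern, was_problem):
        """Update trust after a human (or later evidence) settles the outcome."""
        if was_problem:
            self.benign_counts[pattern] = 0  # one real incident resets the trust
        else:
            self.benign_counts[pattern] += 1

learner = QuietLearner()
nightly_batch = ("cpu_spike", "daily", "03:00")  # hypothetical pattern key

print(learner.should_flag(nightly_batch))  # True: never seen before
for _ in range(3):
    learner.record_outcome(nightly_batch, was_problem=False)
print(learner.should_flag(nightly_batch))  # False: it has resolved fine three times
```

The reset on a real incident is a deliberate choice in this sketch: quiet is earned slowly and lost at once, which is roughly how a careful colleague treats a pattern that finally bit them.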

Put those four together and you don’t have a better alert. You have something closer to a teammate who keeps an eye on production: awake when you’re not, watching across the seams, learning what to ignore, and, when something genuinely feels off, telling you what changed and why it might matter. That holds especially for the kind of failure no one ever wrote a rule to catch.

The turn: vigilance was never a dashboard problem

Here’s the part that isn’t about software.

The reason your best engineer catches the outage with no alert isn’t that they have better dashboards than everyone else. It’s that they carry production in their head. They know what Tuesday looks like, what a healthy deploy feels like, which graph wiggles are boring and which are the first tremor of something bad. That knowledge is the actual asset, and right now it lives in one or two people. It goes home at night, it takes vacations, and it walks out the door when they take another job.

We have spent a decade building dashboards to compensate for the fact that this knowledge doesn’t scale. But a dashboard was always a poor stand-in for the thing we actually wanted, which was a colleague who notices. The screens were never the goal. The noticing was.

Alerts tell you what already broke; a coworker notices the thing that has no alert yet. The most valuable person on your team isn’t the one who built the most thresholds. It’s the one who feels the deviation, and that feeling is finally something you can build into the system itself, so it doesn’t go home at 6pm and doesn’t leave when they do.

That’s what we’re building at Apollo Space: not another wall of green panels, but a coworker that holds your system’s normal in its head and speaks up when reality stops matching it, even when no one wrote the rule. If the quietest fear in your week is the outage your dashboard will be green through, that’s not a gap in your monitoring. It’s the difference between a screen you watch and a teammate who watches with you.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.
