Engineering

The error message said "not configured." It was configured.

The most expensive bug isn't the failure, it's the failure that names the wrong cause.

Apollo Space Research

Apollo Space

· 10 min read

A request died at 2am. The error in the log was four words long: “integration is not configured.” So the on-call engineer did the obvious, responsible thing: opened the settings, checked the credential, re-entered it, restarted the service, and watched it fail again. Checked a second credential. Read the setup guide. Three hours in, with the sun coming up, someone finally pulled the raw call and saw the truth: the integration was configured perfectly. The provider on the other end had simply timed out, and somewhere in the stack a generic handler had translated “I couldn’t reach them” into “you set this up wrong.”

Nobody wrote bad code that night. The bug was real and ten minutes from fixed. The three hours went to the four words that pointed the wrong way.

This post is about why those four words cost so much, why almost every system produces them by default, and the discipline that makes an error describe what actually happened instead of what someone guessed.

The bug that lies is worse than the bug that crashes

Start with the two failures every engineer has met, and notice which one you’d rather have.

The first is the honest crash. Something falls over, the stack trace points at the exact line, and you fix it. It’s annoying, sometimes at a terrible hour, but it doesn’t waste your time: the failure and its cause are the same object, sitting in the same place. You read it, you believe it, you’re done.

The second is the failure that names the wrong cause. The system breaks for reason A and reports reason B. Now your debugging is aimed at a target that was never broken. Every hour you spend “fixing” B is worse than wasted, because at the end of it the thing still fails and you’ve now convinced yourself the part that’s actually broken is fine; after all, you just checked it. The lie doesn’t just cost time. It actively steers you away from the answer and leaves you more confident in the wrong model than when you started.

A crash costs you minutes. A misnamed failure costs you the hours you spend trusting it.

This is why a generic error handler is so seductive and so dangerous. It looks like care. Every failure gets caught, every failure gets a tidy message, nothing leaks an ugly trace to the user. But a handler that catches everything and labels it with one of three friendly strings has thrown away the one thing that mattered: what specifically went wrong. It traded the messy truth for a clean lie. The most expensive bug isn’t the failure, it’s the failure that names the wrong cause, and tidy error handling manufactures those at scale.

Figure: two lanes from one failure, a provider timeout. In the naming lane, a generic handler labels it “not configured,” a human hunts the config for hours, and the fix never touches the real fault. In the evidence lane, the captured cause says the provider call timed out at the boundary, which points straight at the fix.

The naive version: catch it, name it, move on

The way most systems get here is not negligence. It’s the most reasonable-looking line of code you can write.

A call somewhere might fail, so you wrap it. The wrapper catches the exception, and now you have to say something to whoever’s downstream. You don’t have time, at the moment you’re writing the wrapper, to enumerate every way the call could go wrong: the network, the auth, the rate limit, the malformed response, the provider being down, the provider being up but slow. So you pick the most common cause you can think of, write a message for that, and ship it. Probably it’s a config problem. Let’s say “not configured.”

It works. For months. The first hundred times the call fails, it really is a config problem, and the message is right, and everyone’s happy. Then one night the call fails for the hundred-and-first reason, a timeout, and the message says exactly what it always said, and it’s a lie, and nobody knows, because the message has been trustworthy for so long that no one thinks to doubt it.

That’s the trap. The naive handler isn’t wrong on day one. It’s wrong on the day the failure mode changes, and by then everyone has learned to trust it. You don’t get punished for the guess when you write it. You get punished months later, at 2am, by an engineer who has no reason to suspect the words on the screen.

And the deeper problem: the handler had the truth and threw it away. At the instant the exception was caught, the system knew it was a timeout: the exception type said so, the elapsed time said so, the absence of any response said so. All of that was in hand. The handler looked at a rich, specific failure and flattened it into a string someone typed weeks earlier from a guess. The information loss happened on purpose, in the name of a clean message.
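To make the flattening concrete, here is a minimal sketch of the naive wrapper described above. The exception classes and the `call_provider` stand-in are hypothetical names invented for illustration, not part of any real system; the point is only that two different failures leave the handler as the same string.

```python
class ProviderTimeout(Exception):
    """Hypothetical: the provider did not answer in time."""


class MissingCredential(Exception):
    """Hypothetical: no credential is stored for the integration."""


def call_provider(simulate: str) -> str:
    # Stand-in for a real network call; raises the failure it is asked to simulate.
    if simulate == "timeout":
        raise ProviderTimeout("no response within 30s")
    if simulate == "no_credential":
        raise MissingCredential("no API key stored")
    return "ok"


def naive_call(simulate: str) -> str:
    try:
        return call_provider(simulate)
    except Exception:
        # The exception type and its message are both in hand here,
        # and both are discarded in favor of a string written in advance.
        return "not configured"
```

Calling `naive_call("timeout")` and `naive_call("no_credential")` gives the reader identical text, even though the wrapper could tell them apart at the moment it caught them.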

Our way: the failure carries its own evidence

The fix is not “write better error messages.” Better guesses are still guesses. The fix is to stop guessing: to make the failure describe itself from what actually happened, not from what someone predicted would happen.

The key idea is simple. When a call fails, the thing that fails knows more about the failure than anyone who reads about it later ever will. It knows what it was trying to do, who it was talking to, what came back, and where it stopped. An error message should be the messenger for that, not a label chosen in advance and stamped on every failure that walks by.

So at every boundary where one part of the system talks to another (a provider, a database, a tool, another service), we record the cause instead of naming it. Not “not configured,” but: this call, to this provider, with this shape of request, got no response within the timeout. The message isn’t authored ahead of time from a guess. It’s assembled at the moment of failure from the facts of the failure. Whoever reads it next, and we’ll get to who that is, receives what happened, not someone’s old theory of what usually happens.
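One way this could look in code is sketched below. It is an illustration under assumptions, not a description of Apollo Space's implementation: `CapturedCause`, `BoundaryError`, and `call_at_boundary` are hypothetical names, and the "shape of request" is approximated here as the sorted field names of a dict.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedCause:
    """Hypothetical record of what happened at a boundary."""
    operation: str        # what the caller was trying to do
    target: str           # who it was talking to
    request_shape: tuple  # field names only, never the values
    exception_type: str   # what actually came back
    detail: str
    elapsed_s: float      # how long it ran before it stopped

    def describe(self) -> str:
        return (
            f"{self.operation} to {self.target} "
            f"with fields {list(self.request_shape)} "
            f"failed after {self.elapsed_s:.2f}s: "
            f"{self.exception_type}: {self.detail}"
        )


class BoundaryError(Exception):
    """Carries the captured cause instead of a label chosen in advance."""

    def __init__(self, cause: CapturedCause):
        super().__init__(cause.describe())
        self.cause = cause


def call_at_boundary(operation, target, request, send):
    # `send` is any callable that performs the real call (an assumption here).
    start = time.monotonic()
    try:
        return send(request)
    except Exception as exc:
        cause = CapturedCause(
            operation=operation,
            target=target,
            request_shape=tuple(sorted(request)),
            exception_type=type(exc).__name__,
            detail=str(exc),
            elapsed_s=time.monotonic() - start,
        )
        # Chain the original exception so the raw traceback is still reachable.
        raise BoundaryError(cause) from exc
```

The message is built from the exception that was caught, so a timeout reads as a timeout and a rejected credential reads as a rejected credential, with no string written ahead of time.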

A failure deep in the stack is caught at the boundary. The naming path stamps it with a guessed label and sends the next reader hunting the wrong thing. The evidence path records what called what, what answered, and where it stopped, handing the reader the true cause to act on the first time.

Notice what this costs and what it doesn’t. It does not cost much code: capturing the cause you already caught is cheaper than inventing a message for it. It does not leak ugliness to the end user, because the rich cause and the friendly user-facing line are two different things; the user still sees “something went wrong on our side,” while the system sees the timeout. What it costs is the comfort of a tidy, pre-written string. You give up the illusion that you knew, in advance, why things would break. In exchange you get a failure that tells the truth on the night the failure mode is one you never imagined.
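The separation between the user-facing line and the recorded cause can be as small as the sketch below. The function name `split_views` and the record fields are hypothetical, chosen only to illustrate keeping the two audiences apart.

```python
import json

USER_FACING_LINE = "Something went wrong on our side."


def split_views(exc: Exception, target: str) -> tuple[str, str]:
    """Return (what the user sees, what the system records)."""
    internal = {
        "target": target,
        "exception_type": type(exc).__name__,
        "detail": str(exc),
    }
    # The user gets the friendly line; the log gets the facts of the failure.
    return USER_FACING_LINE, json.dumps(internal, sort_keys=True)
```

Nothing about the friendly line requires the internal record to be vague; they are built separately from the same caught exception.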

Why this matters more when the reader is an agent

Here’s the part that turns a debugging nicety into something load-bearing. For most of software’s history, the reader of an error was a human. Humans are forgiving readers: a person can squint at “not configured,” feel that it doesn’t smell right, pull the raw call, and route around the lie. Slow and expensive, but possible, because a human brings outside suspicion to the text.

An agent reading the same error brings no such suspicion. If the message says “not configured,” the agent will, sensibly, go try to configure it. It will check the credential, re-enter it, and suggest the user re-authenticate. It will faithfully chase exactly the wrong fix, because it has no independent reason to disbelieve the words it was handed. A misnamed failure doesn’t just waste a human’s hours now. It sends an automated actor confidently down the wrong path, and an automated actor goes down that path fast and at scale.

Which flips the priority completely. In a system where agents read failures and act on them, the error message is no longer a diagnostic afterthought you tidy up later. It’s an instruction. The quality of every recovery the system can do (file the right ticket, retry the right call, tell the user the true thing) is capped by whether the failure described itself honestly at the moment it happened. Garbage in, confident garbage out.

When a human reads a lie, you lose hours. When an agent reads a lie, you lose hours at machine speed.

This is why we treat the cause-capturing boundary as part of the foundation, not the polish. An agent that can read a true failure can route it: see the timeout, retry once, and if it still times out, tell the user “the provider is down right now,” which is true, instead of “your integration is misconfigured,” which is false and sends everyone, human and agent alike, to fix a thing that was never broken. The recovery is only as good as the message. So the message has to be good first.
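The routing described above (see the timeout, retry once, then tell the user the true thing) might be sketched like this. The exception classes and the `recover` function are hypothetical, and a real agent would act on a richer captured cause than an exception type alone.

```python
class ProviderTimeout(Exception):
    """Hypothetical: the boundary recorded that the provider never answered."""


class CredentialRejected(Exception):
    """Hypothetical: the boundary recorded that the provider refused the credential."""


def recover(attempt, max_retries=1):
    """Run `attempt` (a zero-argument callable) and route on the true cause."""
    retries = 0
    while True:
        try:
            return ("ok", attempt())
        except ProviderTimeout:
            if retries < max_retries:
                retries += 1
                continue  # a timeout is worth one more try
            return ("tell_user", "The provider is not responding right now.")
        except CredentialRejected:
            # Retrying cannot help here; only this branch sends anyone to the settings.
            return ("tell_user", "The provider rejected the stored credential. Please re-authenticate.")
```

Each branch is only reachable because the failure kept its identity; if everything arrived as “not configured,” the only available recovery would be the wrong one.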

The turn: it’s about respect for whoever debugs next

Who is the failure for?

Not for the program. The program already crashed; it doesn’t need the message. The failure is written for the next person, or the next agent, who shows up at the worst possible moment with the least possible context and has to figure out what’s wrong. That reader is tired. It’s 2am, or it’s the fortieth ticket of the day, or it’s an agent three steps into a task with no memory of how the system is wired. They are going to trust the words you left them. That trust is the whole transaction.

A guessed error message is a small act of disrespect toward that reader. It says, “I didn’t have time to find out why this really breaks, so here’s my best hunch, good luck.” And the reader pays for that shortcut in full, at the hour they can least afford it, while you sleep. A captured cause is the opposite. It says, “I caught this when I knew the most about it, and I wrote down what was actually true, so you wouldn’t have to reconstruct it from nothing.” That’s not a feature you can install. It’s a habit: the discipline of telling the truth about a failure at the one moment you’re certain of it, on behalf of someone you’ll never meet.

The most expensive bug isn’t the failure, it’s the failure that names the wrong cause. Everything we build to recover from failures, automatically and at speed, rests on the failures telling the truth first.


That’s the discipline we’re building into Apollo Space: a system where every failure describes what actually happened, so the next reader, a teammate at 2am or an agent mid-task, never burns an hour fixing a part that was never broken. If you’ve ever lost a night to four words that pointed the wrong way, you already know which bug we decided to kill first.
