Observability for agents is a different sport

Logs tell you what the code did. They cannot tell you why an agent decided to do it, and for a non-deterministic system, the why is the only debugging unit that matters.

Apollo Space Research

Apollo Space

December 8, 2025 · 11 min read

A user types one sentence to an agent, the agent does the wrong thing, and you open the logs to find out why. The logs are immaculate. The function was called with the arguments it was called with, it returned what it returned, the request was a clean 200. Every line is true and not one of them tells you the thing you need: why did the agent decide to do that, instead of the right thing?

That gap is not a missing log line. It is a different sport.

Logs answer “what did the code do.” Debugging an agent needs “why did it decide that,” and for a non-deterministic system, the decision trace is the only unit that matters.

This post is about why the dashboards that ran software for forty years go quiet exactly when an agent goes wrong, and what you have to record instead.

The naive version: treat the agent like a service

The obvious move, when you put an agent behind an API, is to observe it the way you observe every other service. You already have the stack. Request in, span out, latency on a graph, error rate with an alert on it, a stack trace when something throws. It’s a solved problem. You wire the agent into the same pipeline and call it covered.

It works right up until the agent is wrong without being broken.

Here is the case that breaks the model. A user asks the agent to do one thing. The agent does a different, plausible thing instead, confidently, successfully, with no exception anywhere. The user asked it to look something up; it went and wrote something. The request succeeded. The latency was fine. The error rate didn’t move. By every signal your service-grade observability collects, that request was a good request.
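To make that concrete, here is a minimal sketch of what a service-grade health check sees for such a request. The `RequestRecord` fields, the `looks_healthy` function, and the latency budget are illustrative assumptions, not the schema of any particular monitoring stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestRecord:
    """What a typical service span keeps (hypothetical fields)."""
    status_code: int
    latency_ms: float
    exception: Optional[str]
    handler: str  # which function ran


def looks_healthy(rec: RequestRecord, latency_budget_ms: float = 2000.0) -> bool:
    """Service-grade check: no exception, no 5xx, within the latency budget."""
    return (
        rec.exception is None
        and rec.status_code < 500
        and rec.latency_ms <= latency_budget_ms
    )


# The user asked for a lookup; the agent performed a write instead.
wrong_action = RequestRecord(
    status_code=200, latency_ms=840.0, exception=None, handler="create_record"
)

print(looks_healthy(wrong_action))  # True
```

Nothing in the record encodes what the user asked for, so no threshold on these fields can flag the turn.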

So you open the trace, and what’s in it is the shape of the code: which function ran, what it returned, how long it took. What’s not in it is the shape of the decision: that the agent read the sentence, classified the intent as a write when it was a read, planned a sequence around that wrong classification, and executed it perfectly. The bug isn’t in any function. The bug is in a choice. And your logs don’t record choices, because for forty years the code didn’t make any; it followed them.

The bottleneck never disappears. It just moves, from “did the code run” to “why did the agent choose.”

That is the whole difference, and it is not small.

Why “what” stops being enough

Step back and look at what made traditional logs work for so long. The reason a stack trace is useful is that ordinary code is deterministic. Same input, same path, every time. So when something goes wrong, “what happened” is “why it happened”: the path the code took is fully explained by the inputs and the branches. You read the trace, you find the line, and you have found the cause. What and why are the same word.

An agent severs that. Same input does not mean same path. The model can read one sentence two ways on two runs. It can pick tool A today and tool B tomorrow on the identical request. The path is no longer explained by the input alone; it’s explained by a judgment the model made given the input. So “what happened” and “why it happened” come apart, and the entire value of a log, which rested on what and why being the same thing, comes apart with them.

Once you see that, the right unit of observation changes. For deterministic code, the unit is the call: this function, these arguments, this return. For an agent, the call is the least interesting thing in the room. The unit you actually need is the decision, and a single user message produces a short chain of them.

We found it useful to name the four links that recur on almost every turn:

Classify: what did the agent decide the user actually wanted? This is where the read-vs-write confusion lives, where “schedule a call” gets read as “draft an email,” where the whole turn is won or lost before any tool runs.
Plan: given that reading, what sequence did it decide to do? One step or five, in what order, with what assumptions.
Tools: which capabilities did it actually reach for, with what arguments, and what came back? Not just “the API returned 200,” but why this tool and not that one.
Reply: what did it tell the user, and was that reply grounded in what the tools actually returned, or improvised over a gap?

A log captures the third one, tools, and only the mechanical half of it. The bug usually lives in the first one. You were watching the only link that was working.
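The four links can be sketched as a small data structure. This is one possible shape, not a fixed schema: every class and field name below (`Classify`, `Plan`, `ToolCall`, `Reply`, `Turn`, `log_view`) is a hypothetical chosen for illustration, as is the invoice example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Classify:
    intent: str   # what the agent decided the user wanted, e.g. "read" or "write"
    reason: str   # why it read the message that way


@dataclass
class Plan:
    steps: list[str]
    reason: str


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any
    reason: str   # why this tool and not another


@dataclass
class Reply:
    text: str
    grounded_in: list[str]   # names of the tool calls the reply relies on


@dataclass
class Turn:
    user_message: str
    classify: Classify
    plan: Plan
    tool_calls: list[ToolCall] = field(default_factory=list)
    reply: Optional[Reply] = None


def log_view(turn: Turn) -> list[dict[str, Any]]:
    """What a traditional log keeps: the mechanical half of the tools link."""
    return [
        {"tool": c.name, "arguments": c.arguments, "result": c.result}
        for c in turn.tool_calls
    ]


turn = Turn(
    user_message="What's the status of invoice 1042?",
    classify=Classify(intent="write", reason="'status' read as a field to set"),
    plan=Plan(steps=["update_invoice"], reason="a write needs one update call"),
    tool_calls=[ToolCall(
        name="update_invoice",
        arguments={"id": 1042, "status": "open"},
        result={"http_status": 200},
        reason="only tool that writes invoices",
    )],
    reply=Reply(text="Invoice 1042 is now marked open.", grounded_in=["update_invoice"]),
)

print(log_view(turn))        # the tool call only, with no sign of the misread intent
print(turn.classify.intent)  # write
```

The `log_view` projection drops every `reason` field and the whole classify link, which is exactly the part that explains this turn.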

A non-deterministic agent turn is a chain of four decisions (classify the intent, plan the steps, choose and call tools, compose the reply), and a traditional log captures only the tool call. The classify link, where most agent bugs are actually born, is invisible to it.

The decision trace: record the why, not just the what

So the fix is not a better log format. It’s a different thing to record.

A decision trace captures, for one user turn, the chain of choices the agent made and the reason attached to each link. Not “tool X ran.” Rather: the agent classified this message as this kind of intent, which is why it planned these steps, which is why it called this tool with these arguments, and composed this reply from that result. The trace is a record of judgment, not just execution: every link carries the choice and enough of the why to argue with it.

The difference shows the instant something goes wrong. Picture the read-misclassified-as-write bug again, the one the clean 200 hid.

With a log, you see a write tool fired and succeeded. Dead end: the write worked, so where’s the bug? You start guessing. Maybe it’s the tool. Maybe it’s the data. Maybe the user phrased it oddly. You’re reconstructing the agent’s reasoning from the outside, like reading a verdict with no trial.

With a decision trace, you open the turn and the first link tells you everything: the classify step labeled a read as a write. The plan was correct for that wrong label. The tool ran correctly for that wrong plan. The reply was honest about that wrong tool. The entire chain was internally flawless and rooted in one bad call at link one, and you can see it, because the trace recorded the call, not just its consequences. You don’t guess. You read the verdict and the trial.

This is the move: stop reconstructing the agent’s reasoning after the fact, and record it as it happens. The reason “why did it decide that” felt unanswerable is that the answer was never written down. The decision existed and then evaporated, and you were left holding its exhaust, the tool call, and asked to infer the engine.
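Recording as it happens can be as simple as appending each choice, with its reason, at the point the agent makes it. The sketch below assumes the agent emits a short rationale alongside each decision (for example as part of a structured output); the `DecisionTrace` class, its fields, and the example values are all hypothetical.

```python
import json
import time
from typing import Any


class DecisionTrace:
    """Append-only record of one turn's decisions (hypothetical schema)."""

    def __init__(self, turn_id: str, user_message: str) -> None:
        self.turn_id = turn_id
        self.user_message = user_message
        self.links: list[dict[str, Any]] = []

    def record(self, link: str, choice: Any, reason: str) -> None:
        """Write the choice and its reason down at the moment it is made."""
        self.links.append(
            {"link": link, "choice": choice, "reason": reason, "at": time.time()}
        )

    def to_json(self) -> str:
        """Serialize the whole turn so it can be stored next to ordinary logs."""
        return json.dumps(
            {
                "turn_id": self.turn_id,
                "user_message": self.user_message,
                "links": self.links,
            }
        )


trace = DecisionTrace("turn-001", "What's the status of invoice 1042?")
trace.record("classify", {"intent": "write"}, "'status' read as a field to set")
trace.record("plan", ["update_invoice"], "a write needs one update call")
trace.record("tools", {"name": "update_invoice", "http_status": 200},
             "only tool that writes invoices")
trace.record("reply", "Invoice 1042 is now marked open.", "reports the update result")

first = trace.links[0]
print(first["link"], first["choice"], first["reason"])
```

Opening the first link is enough to see the misread intent; the tool call that a log would have kept is only the third entry.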

A log is the agent’s exhaust. A decision trace is its reasoning, written down before it evaporates.

And once the reasoning is written down, two things you couldn’t do before become routine. You can replay the same message across two different models and lay their decision chains side by side: same input, different judgment, and now you can see exactly which link diverged. And you can grade a turn not on “did it 200,” but on “did each decision in the chain make sense given the one before it.” The trace stops being a forensic tool you reach for after a fire. It becomes the thing you watch to know whether the agent is reasoning well at all.
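Finding the link that diverged is a short walk over the two chains in order. This sketch assumes each chain has already been reduced to one comparable value per link; the `first_divergence` helper and the two example chains are hypothetical, and a real comparison would likely normalize reply text rather than test exact equality.

```python
from typing import Any, Optional

# Order matters: a divergence at an early link explains everything after it.
LINK_ORDER = ("classify", "plan", "tools", "reply")


def first_divergence(a: dict[str, Any], b: dict[str, Any]) -> Optional[str]:
    """Return the earliest link where two decision chains differ, or None."""
    for link in LINK_ORDER:
        if a.get(link) != b.get(link):
            return link
    return None


# Two hypothetical chains recorded for the same user message on two models.
chain_model_a = {
    "classify": "read",
    "plan": ["lookup_invoice"],
    "tools": [{"name": "lookup_invoice", "arguments": {"id": 1042}}],
    "reply": "Invoice 1042 is open.",
}
chain_model_b = {
    "classify": "write",
    "plan": ["update_invoice"],
    "tools": [{"name": "update_invoice", "arguments": {"id": 1042, "status": "open"}}],
    "reply": "Invoice 1042 is now marked open.",
}

print(first_divergence(chain_model_a, chain_model_b))  # classify
print(first_divergence(chain_model_a, chain_model_a))  # None
```

Every later link differs too, but reporting only the earliest one points at the decision that caused the rest.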

Two debugging sessions for the same wrong-action bug. With a log, you see only that a write tool returned 200, a dead end that sends you guessing at the tool, the data, the phrasing. With a decision trace, the classify link shows a read was labeled a write, and the rest of the chain is correct for that one bad call: the cause is visible, not inferred.

Why the alerts have to change too

There’s a second half to this, and it’s the one teams discover the hard way after they’ve already added decision traces: the alerting was built for the old sport too.

Service alerts fire on the failures the code can have: an exception, a timeout, a 500, a latency spike. Those are real and you still want them. But the characteristic agent failure throws none of them. The agent that confidently did the wrong but correct-looking thing produced a 200 and a fast response. If your pager only knows how to wake you for exceptions, it will sleep through every interesting agent bug there is. The worst failures of a non-deterministic system are the ones that look, to a service monitor, like complete success.

The naive instinct is to alert harder on the same signals: tighten the latency threshold, watch the error rate more closely. That catches nothing new, because the bug was never in those signals.

What you actually need is alerting on the shape of the decision chain. A turn where the agent classified an intent it has no tool to serve. A reply that claims an action the tool log shows never happened: the agent saying “done” over a step that didn’t run. A plan that looped without converging. A classify-to-reply chain where the final answer doesn’t follow from anything the tools returned. None of those is an exception. Every one of them is a turn that went wrong. You can only alert on them because you recorded the decisions: the same trace that lets you debug after the fact is the thing that lets you notice before the user does.
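Those four conditions can be expressed as checks over a recorded turn. The sketch below is deliberately crude and rests on assumptions: the `Turn` fields, the alert names, the `tools_by_intent` registry, and the repeat threshold are all hypothetical, and both "looped" and "ungrounded" are approximated by simple proxies (a repeated plan step, an empty list of cited tool results).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any


@dataclass
class Turn:
    intent: str                       # output of the classify link
    plan: list[str]                   # steps the agent attempted, in order
    tool_calls: list[dict[str, Any]]  # each: {"name": str, "ok": bool}
    claimed_actions: list[str]        # actions the reply says were done
    reply_sources: list[str]          # tool calls the reply relies on


def decision_alerts(
    turn: Turn, tools_by_intent: dict[str, list[str]], max_repeats: int = 3
) -> list[str]:
    """Return the names of decision-shape alerts raised by one turn."""
    alerts = []
    # The agent classified an intent it has no tool to serve.
    if not tools_by_intent.get(turn.intent):
        alerts.append("unserved_intent")
    # The reply claims an action that no successful tool call backs up.
    ran = {c["name"] for c in turn.tool_calls if c["ok"]}
    if any(action not in ran for action in turn.claimed_actions):
        alerts.append("claimed_action_not_run")
    # Crude loop proxy: the same plan step repeated too many times.
    if turn.plan and max(Counter(turn.plan).values()) > max_repeats:
        alerts.append("plan_loop")
    # Crude grounding proxy: the reply relies on no tool result at all.
    if not turn.reply_sources:
        alerts.append("ungrounded_reply")
    return alerts


registry = {"read": ["lookup_invoice"], "write": ["update_invoice"]}

good = Turn("read", ["lookup_invoice"], [{"name": "lookup_invoice", "ok": True}],
            ["lookup_invoice"], ["lookup_invoice"])
bad = Turn("refund", ["retry"] * 5, [{"name": "send_email", "ok": False}],
           ["send_email"], [])

print(decision_alerts(good, registry))  # []
print(decision_alerts(bad, registry))
```

Neither turn would trip an exception-based alert; only the second raises anything here, and only because the decisions were recorded.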

The dashboard was never the problem. It was answering the question it was built for, “what did the code do,” flawlessly, about a system whose failures don’t live there anymore.

The turn: you are debugging a coworker now

Here’s the part that isn’t about telemetry.

When the thing inside your system stopped being code that follows instructions and became something that makes judgments, the questions you ask of it changed, and that’s a more human shift than it sounds. You no longer debug a machine by finding the broken line. You debug a decision-maker by understanding why it decided what it decided. That’s not a software question. It’s the question you ask a new colleague who did something surprising: not “what did you do,” which you can already see, but “walk me through your thinking,” and then you find the one assumption that sent the rest astray.

A decision trace is how you ask an agent to walk you through its thinking. It’s the difference between managing a coworker whose reasoning you can inspect and trusting a black box because its last response happened to look right. One of those scales to running real work. The other is a demo that hasn’t failed in front of a customer yet.

The agents will keep getting more capable. The models under them will keep getting smarter. None of that removes the need to see why they decided what they decided; it raises it, because a more capable agent makes bigger, faster, more consequential choices, and a choice you can’t inspect is a choice you can’t trust. Logs answer “what did the code do.” Debugging an agent needs “why did it decide that,” and that question only has an answer if you wrote the decision down.

This is part of what we’re building at Apollo Space, an operating system where agents do real company work, and where every turn they take leaves a decision trace you can open, read, and argue with. If you’ve ever stared at a perfectly clean log next to a perfectly wrong outcome, you already know that “what happened” was never the question. The question was always why.
