What is confidence drift in AI agents?

Confidence drift occurs when an agent's internal confidence scores become miscalibrated over time: it reports high confidence in decisions that turn out to be wrong, or low confidence in decisions that would have been correct. It's analogous to a sensor going out of calibration, where the readings look normal but don't reflect reality.

How do you detect hallucination in agent actions?

Through grounding checks: verifying that the agent's stated reasoning references real data in its memory or tool results rather than fabricated information. We cross-reference every factual claim in the agent's reasoning chain against its source data and flag any claim that can't be traced to a real source.

What metrics should you track for AI agent health?

Decision accuracy (did the agent's actions produce expected outcomes), confidence calibration (do high-confidence decisions actually succeed more often), memory freshness (is the agent's knowledge current), action success rate (do tool calls succeed), and reasoning quality (are the agent's explanations consistent with its actions).

Engineering

Your monitoring was built for software that waits

Your monitoring stack was built for services that respond to requests. Agents don't respond; they decide, act, and learn. Here's why your observability needs a fundamental rethink.


Apollo Space Research

Apollo Space

September 2, 2025 · 13 min read

The Dashboard That Lies

Imagine a system where every dashboard reads green. Uptime: 99.97%. Latency p99: 184ms. Error rate: 0.02%. CPU and memory within normal bounds. By every traditional metric, the system is healthy.

And yet the SDR agent is producing the worst outreach emails it has ever written.

Not wrong emails. Not error emails. Not emails that bounce or fail to send. The emails send successfully. They are grammatically correct. They reference real prospects with accurate company information. They follow the brand voice rules. They pass every automated quality check in place.

They are just bad. Generic. Off-target. The kind of cold email you delete without finishing the first sentence. The reply rate quietly drops by more than half over a couple of weeks, and nobody notices until a human looks at the actual emails and says, “These are terrible.”

The cause is a memory retrieval regression. A dependency update changes the similarity scoring algorithm in the vector database, which shifts which episodic memories get retrieved during the reasoning phase. The agent is still accessing memories, just slightly wrong ones. Instead of retrieving the most relevant past outreach for a given prospect profile, it retrieves adjacent but less relevant episodes. The reasoning looks coherent. The outputs look plausible. The quality degrades silently.

Every monitoring tool says the system is fine. Because the system is fine. The infrastructure is fine. The agent’s judgment is compromised.

This is the scenario that reveals why monitoring AI agents requires fundamentally different observability than monitoring traditional services.

Why Traditional Observability Breaks Down

Traditional observability is built around a simple model: a service receives a request, processes it, and returns a response. You measure whether the service is up (availability), how fast it responds (latency), and whether it returns errors (error rate). The “Four Golden Signals” from Google’s SRE book (latency, traffic, errors, and saturation) have been the gospel of monitoring for a decade.

Agents break this model in three ways.

Agents Don’t Have a Request-Response Cycle

A traditional service is reactive: something comes in, something goes out, you measure the in-to-out. An agent is proactive. It perceives signals from its environment, reasons about them, decides to act, and acts. There’s no “request.” There’s no “response.” There’s a continuous loop of perception, reasoning, and action.

You can measure whether the agent’s API is up. But the agent’s API being up tells you nothing about whether the agent is making good decisions. It’s like measuring whether a doctor’s office has electricity: technically relevant, but it tells you nothing about whether the doctor is giving good diagnoses.

Agent Failures Are Semantic, Not Structural

When a traditional service fails, it fails structurally: a 500 error, a timeout, a crash. The failure is visible in logs and metrics. You can alert on it because the failure manifests as an anomaly in a measurable signal.

Agent failures are semantic. The agent sends an email (success from an infrastructure perspective) that says the wrong thing (failure from a business perspective). The agent approves a PR (no errors) that introduces a subtle logic bug (catastrophic outcome). The agent generates a report (200 OK) with a hallucinated statistic (trust violation).

Traditional monitoring catches structural failures. Semantic failures are invisible to it.

Agent Quality Degrades Gradually

When a traditional service fails, it usually fails obviously. The error rate spikes. The latency shoots up. There’s a cliff, and you fall off it. Traditional alerting is designed for cliffs: you set a threshold, and when the metric crosses it, you get paged.

Agent quality doesn’t cliff. It erodes. The memory retrieval gets slightly less relevant. The confidence calibration drifts a few percentage points. The decision quality degrades from “excellent” to “good” to “acceptable” to “mediocre” over weeks. By the time the quality hits a threshold that would trigger a traditional alert, you’ve been producing subpar results for weeks and your users have noticed.

Gradual degradation is the adversary, and traditional monitoring has no weapon against it.

The New Observability Stack

Here’s what we built. Not as a theoretical framework, but as the actual monitoring system running in production for Apollo Space’s twelve agents.

1. Confidence Calibration Monitoring

Every time an Apollo Space agent makes a decision, it reports a confidence score. “I’m 87% confident this email will get a reply.” “I’m 63% confident this PR change is safe.” “I’m 92% confident this competitor signal is significant.”

Confidence scores are only useful if they’re calibrated, meaning an action the agent reports as “90% confident” actually succeeds about 90% of the time. When calibration drifts, the agent’s self-assessment becomes unreliable, and the entire trust architecture (which uses confidence thresholds to decide what needs human review) breaks down.

We monitor calibration using reliability diagrams. We bucket decisions by confidence level (80-85%, 85-90%, 90-95%, etc.) and compare the agent’s predicted success rate against the actual success rate over a rolling 30-day window. Perfect calibration is a diagonal line: predicted = actual. Drift appears as deviation from the diagonal.

Our alert triggers when the calibration error (the average absolute difference between predicted and actual success rates across all buckets) exceeds 8 percentage points. At that level, the agent’s confidence scores are unreliable enough that the trust architecture’s thresholds need recalibration.
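As a rough illustration, the calibration error defined above can be computed like this. It is a minimal sketch, not the production implementation: the function name, the 5-point bucket width applied across the whole 0-100 range, the use of mean confidence as each bucket's predicted rate, and the sample data are all assumptions. Only the definition of the error and the 8-point threshold come from the text, and the 30-day window is assumed to be applied before the decisions are passed in.

```python
from collections import defaultdict

ALERT_THRESHOLD = 8.0  # percentage points, from the alert rule above


def calibration_error(decisions, bucket_width=5):
    """Mean absolute gap (in percentage points) between predicted and
    actual success rates, averaged over non-empty confidence buckets.

    decisions: iterable of (confidence_pct, succeeded) pairs, where
    confidence_pct is 0-100 and succeeded is a bool.
    """
    buckets = defaultdict(list)
    for confidence, succeeded in decisions:
        # Clamp so that 100% lands in the top bucket instead of its own.
        index = min(int(confidence // bucket_width), 100 // bucket_width - 1)
        buckets[index].append((confidence, succeeded))

    if not buckets:
        raise ValueError("no decisions to evaluate")

    gaps = []
    for members in buckets.values():
        # Predicted rate: mean reported confidence in the bucket (assumption).
        predicted = sum(c for c, _ in members) / len(members)
        actual = 100.0 * sum(1 for _, ok in members if ok) / len(members)
        gaps.append(abs(predicted - actual))
    return sum(gaps) / len(gaps)


# Hypothetical data: ten decisions at 92% confidence, of which seven
# succeeded, and ten at 82% confidence, of which eight succeeded.
sample = [(92, i < 7) for i in range(10)] + [(82, i < 8) for i in range(10)]
error = calibration_error(sample)
should_alert = error > ALERT_THRESHOLD
```

In this made-up sample the 90-95% bucket is off by 22 points and the 80-85% bucket by 2, so the average gap is 12 points and the alert would fire. Weighting buckets by the number of decisions they contain is a common alternative; the unweighted average is used here because it matches the definition in the text.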

In practice, this alert fires rarely: twice so far. The first time was a memory-retrieval regression of exactly the kind described above. The second was a shift in ICP (ideal customer profile) that made historical performance data less predictive, because the agent’s procedural memory was optimized for a prospect profile the team had moved away from.

Both times, traditional monitoring showed nothing. Both times, the confidence calibration alert caught the problem within 48 hours.

2. Hallucination Detection

Hallucination in agents is different from hallucination in chatbots. When a chatbot hallucinates, it says something false. When an agent hallucinates, it acts on something false. The stakes are categorically higher.

Our hallucination detection works by grounding checks. Every factual claim in the agent’s reasoning chain is checked against source data:

If the agent says “the prospect raised a Series B in January,” that claim is verified against the data in semantic memory or the most recent tool retrieval.
If the agent says “similar prospects have a 34% response rate to this type of email,” that statistic is verified against episodic memory aggregations.
If the agent says “the competitor’s pricing increased by 40%,” that claim is verified against the competitor watch agent’s stored snapshots.

Any claim that can’t be traced to a verifiable source is flagged as a potential hallucination. We don’t block the action immediately, because not every ungrounded claim is a hallucination: sometimes the agent makes valid inferences that aren’t directly traceable to a single source. But we log the flag and include it in the agent’s health score.
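The shape of a grounding check can be sketched as follows. Everything here is illustrative: the `Claim` type, the `ungrounded_claims` function, the source names, and the sample facts are hypothetical, and the sketch assumes claims have already been extracted from the reasoning chain into structured form. Exact matching is a deliberate simplification, since a real check has to tolerate paraphrase and inference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str
    attribute: str
    value: str


def ungrounded_claims(claims, sources):
    """Return the claims that no source supports.

    sources: mapping of source name -> set of (subject, attribute, value)
    facts. Exact matching only; this is a simplification for illustration.
    """
    flagged = []
    for claim in claims:
        key = (claim.subject, claim.attribute, claim.value)
        if not any(key in facts for facts in sources.values()):
            flagged.append(claim)
    return flagged


# Hypothetical source data and claims.
sources = {
    "semantic_memory": {("Acme", "funding_round", "Series B")},
    "competitor_snapshots": set(),
}
claims = [
    Claim("Acme", "funding_round", "Series B"),
    Claim("Rival Inc", "price_change", "+40%"),
]

# Flags are logged for review; they do not block the action.
flags = ungrounded_claims(claims, sources)
```

Here the funding claim is found in semantic memory, while the pricing claim has no supporting snapshot and is flagged for review.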

Our hallucination rate across all agents is approximately 1.2% of reasoning steps. Sounds low. But across twelve agents making thousands of decisions per week, that’s dozens of potentially ungrounded claims. We review the flagged claims weekly. Most are benign (reasonable inferences, paraphrased facts). About 15% are genuine hallucinations that would have led to incorrect actions.

The 15% is the number that matters. Without grounding checks, those hallucinations would have entered the action stream: emails with wrong data, reports with fabricated statistics, competitive briefs based on non-existent signals. Silent, plausible, and wrong.

3. Action Audit Trails

This is the most operationally important piece of agent observability, and it’s the one most teams skip.

Every action every Apollo Space agent takes is logged with:

Timestamp: When the action was taken
Trigger: What signal initiated the decision cycle
Context: What memory was retrieved and what tools were consulted
Reasoning: The agent’s stated reason for the action (extracted from the LLM’s reasoning output)
Action: What was done
Expected outcome: What the agent predicted would happen
Actual outcome: What actually happened (filled in asynchronously as outcomes become known)
Confidence: The agent’s reported confidence level
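One way to model such a record, as a sketch only: the class name, field types, and the `record_outcome` helper are assumptions, while the fields themselves mirror the list above (with an added `agent` field so records from different agents can be told apart).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ActionRecord:
    agent: str                # which agent acted (added for illustration)
    trigger: str              # signal that initiated the decision cycle
    context: list[str]        # memories retrieved and tools consulted
    reasoning: str            # the agent's stated reason for the action
    action: str               # what was done
    expected_outcome: str     # what the agent predicted would happen
    confidence: float         # reported confidence, 0.0-1.0
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    actual_outcome: Optional[str] = None  # unknown when the record is written

    def record_outcome(self, outcome: str) -> None:
        """Fill in the outcome later, once it becomes known."""
        self.actual_outcome = outcome


# Hypothetical record for an outreach email.
record = ActionRecord(
    agent="sdr",
    trigger="prospect opened pricing page",
    context=["episodic:outreach-similar-profile"],
    reasoning="Similar prospects replied to short follow-ups.",
    action="send follow-up email",
    expected_outcome="reply within 3 days",
    confidence=0.87,
)
record.record_outcome("no reply")
```

Keeping `actual_outcome` optional reflects the asynchronous fill-in: the record is written when the action is taken and completed later.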

This isn’t a log file. It’s a queryable database. You can ask questions like:

“Show me all SDR agent actions in the last week where confidence was above 85% but the outcome was negative”
“Show me all QA agent decisions where the reasoning referenced a procedural memory that was last validated more than 60 days ago”
“Show me all competitor watch alerts where the actual business impact was zero (false positives that wasted attention)”
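To make the first of these questions concrete, here is a small sketch using Python's built-in `sqlite3` module. The table name, column names, outcome labels, and sample rows are all hypothetical, and SQLite is used only because it needs no setup; the text does not say which database backs the audit trail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE actions (
        agent TEXT,
        taken_at TEXT,      -- UTC timestamp, 'YYYY-MM-DD HH:MM:SS'
        confidence REAL,    -- 0.0-1.0
        action TEXT,
        outcome TEXT        -- NULL until the outcome is known
    )
    """
)

# Hypothetical rows; timestamps are expressed as offsets from now.
sample_rows = [
    ("sdr", "-2 days", 0.91, "email to prospect A", "negative"),
    ("sdr", "-3 days", 0.60, "email to prospect B", "negative"),
    ("sdr", "-20 days", 0.95, "email to prospect C", "negative"),
    ("qa", "-1 days", 0.93, "approve PR", "negative"),
    ("sdr", "-1 days", 0.90, "email to prospect D", "positive"),
]
conn.executemany(
    "INSERT INTO actions VALUES (?, datetime('now', ?), ?, ?, ?)",
    sample_rows,
)

# "SDR agent actions in the last week where confidence was above 85%
# but the outcome was negative"
rows = conn.execute(
    """
    SELECT action, confidence FROM actions
    WHERE agent = 'sdr'
      AND confidence > 0.85
      AND outcome = 'negative'
      AND taken_at >= datetime('now', '-7 days')
    ORDER BY confidence DESC
    """
).fetchall()
```

Only the email to prospect A matches: the other rows are too old, too low in confidence, from a different agent, or had a positive outcome.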

The audit trail serves three purposes. First, debugging: when an agent does something wrong, the audit trail tells you exactly why, including what data it had, what it was thinking, and where the reasoning went off track. Second, accountability: when a client asks “why did your agent send that email,” you can provide a complete chain of reasoning, not a shrug. Third, improvement: by analyzing patterns in the audit trail (which types of decisions have the highest failure rates, which reasoning patterns correlate with bad outcomes), you can systematically improve agent performance.

We store approximately 12,000 action records per week across all twelve agents. The storage cost is negligible. The value of being able to answer “why did the agent do that” within seconds is immeasurable.

4. Agent Health Scores

Traditional health is binary: the service is up or it’s down. Agent health is a continuous score composed of multiple dimensions.

Apollo Space’s agent health score is a weighted composite:

Dimension	Weight	What It Measures
Decision Accuracy	30%	Percentage of actions that produced expected outcomes (30-day rolling)
Confidence Calibration	25%	How well predicted confidence matches actual success rates
Memory Freshness	15%	Percentage of semantic memory entries validated within the last 90 days
Action Success Rate	15%	Percentage of tool calls that succeed (API calls, sends, queries)
Hallucination Rate	10%	Percentage of reasoning steps with ungrounded claims
Loop Efficiency	5%	Average number of PRAO cycles needed to complete a decision (lower is better)
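The composite can be sketched as a plain weighted sum. The weights are taken from the table above; everything else is an assumption. In particular, the text does not say how each dimension is normalized, so this sketch assumes every dimension has already been converted to a 0-100 sub-score where higher is better (for example, a low hallucination rate maps to a high sub-score).

```python
# Weights from the table above.
WEIGHTS = {
    "decision_accuracy": 0.30,
    "confidence_calibration": 0.25,
    "memory_freshness": 0.15,
    "action_success_rate": 0.15,
    "hallucination_rate": 0.10,
    "loop_efficiency": 0.05,
}


def health_score(sub_scores):
    """Weighted composite of per-dimension sub-scores.

    sub_scores: mapping of dimension name -> value on a 0-100 scale,
    higher is better (normalization is assumed to happen upstream).
    """
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical sub-scores for one agent.
example = {
    "decision_accuracy": 80,
    "confidence_calibration": 90,
    "memory_freshness": 70,
    "action_success_rate": 100,
    "hallucination_rate": 95,
    "loop_efficiency": 60,
}
score = health_score(example)
```

With these made-up sub-scores the composite comes out to about 84.5. Because the weights sum to 1, the composite stays on the same 0-100 scale as its inputs.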

Each agent’s health score is calculated hourly and displayed on a dashboard that looks nothing like a traditional monitoring dashboard. There are no green/red status lights. Instead, each agent has a health trajectory: a time series of its composite score over the last 90 days.

What you’re looking for isn’t a threshold breach. It’s a trend. An agent whose health score has been declining 0.5 points per week for three weeks is concerning even if the absolute score is still above any threshold you’d set. That’s the gradual degradation problem, and trend detection is how you catch it.
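One simple way to put a number on such a trend is a least-squares slope over the recent daily scores. This is a sketch of one possible approach, not a description of the dashboard's actual method; the function name and the sample series are hypothetical.

```python
def weekly_slope(daily_scores):
    """Least-squares slope of a daily score series, in points per week."""
    n = len(daily_scores)
    if n < 2:
        raise ValueError("need at least two daily scores")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_scores) / n
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_scores)
    )
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return 7 * covariance / variance


# Hypothetical series: three weeks losing 0.5 points per week from 90.
scores = [90 - 0.5 * day / 7 for day in range(21)]
slope = weekly_slope(scores)
```

For this series the slope is about -0.5 points per week even though every score is still close to 90, which is exactly the situation a fixed threshold would miss.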

Our alerting rules:

Health score below 70: immediate alert, agent enters supervised mode (all actions require human approval)
Health score declining for 5+ consecutive days: warning alert, investigation required
Any single dimension below its individual threshold: dimension-specific alert (e.g., hallucination rate above 3% triggers a grounding review)
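These rules can be expressed roughly as follows. The thresholds (70, five days, 3%) come from the list above; the function names, the alert strings, and the reading of “declining for 5+ consecutive days” as five consecutive day-over-day drops are assumptions. Only the hallucination-rate dimension is shown, since it is the one example threshold the text gives.

```python
def consecutive_declines(scores):
    """Count how many of the most recent day-over-day changes were drops."""
    count = 0
    pairs = zip(reversed(scores[:-1]), reversed(scores[1:]))
    for earlier, later in pairs:
        if later < earlier:
            count += 1
        else:
            break
    return count


def evaluate_alerts(daily_scores, hallucination_rate_pct):
    """daily_scores: composite health scores, oldest first, one per day."""
    alerts = []
    if daily_scores[-1] < 70:
        alerts.append("immediate: enter supervised mode")
    if consecutive_declines(daily_scores) >= 5:
        alerts.append("warning: investigate declining trend")
    if hallucination_rate_pct > 3.0:
        alerts.append("dimension: grounding review")
    return alerts


# Hypothetical week: the score is healthy in absolute terms but has
# dropped five days in a row.
alerts = evaluate_alerts([88.0, 87.5, 87.0, 86.2, 85.9, 85.1], 1.2)
```

In this example only the trend warning fires: the latest score is well above 70 and the hallucination rate is under 3%.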

The Observability Tax

We won’t pretend this is free. Agent observability adds overhead: compute overhead for grounding checks, storage overhead for audit trails, and latency overhead for confidence calibration.

Our numbers: observability adds approximately 12% to the compute cost of running each agent and 180ms to the average decision cycle latency. For the SDR agent, where decisions are measured in hours (when to send the next email), 180ms is imperceptible. For the observability agent itself, which needs to react in near-real-time to system anomalies, we’ve optimized the grounding checks to run asynchronously: the action proceeds immediately, and the grounding check confirms or flags it retroactively.
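The asynchronous pattern can be sketched with `asyncio`. This is an illustration of the general idea under stated assumptions, not the production code: the function names, the event strings, and the sample claim are hypothetical, and the grounding check is a stand-in that only does a set lookup after a short delay.

```python
import asyncio


async def grounding_check(claims, known_facts):
    # Stand-in for the real check, which would query memory and tool results.
    await asyncio.sleep(0.01)
    return [claim for claim in claims if claim not in known_facts]


async def decide_and_act(action, claims, known_facts, events):
    # Start the check in the background, then act without waiting for it.
    check = asyncio.create_task(grounding_check(claims, known_facts))
    events.append(f"acted: {action}")
    # Retroactive confirmation or flag once the check finishes.
    flagged = await check
    events.append(f"flagged: {flagged}" if flagged else "confirmed")


events = []
asyncio.run(
    decide_and_act(
        "restart worker",
        ["queue depth is rising"],
        {"queue depth is rising"},
        events,
    )
)
```

The event list shows the action recorded before the grounding result arrives. In a long-running agent the result would more likely be handled by a callback or a separate consumer rather than awaited in the same function; awaiting it here keeps the sketch self-contained.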

The storage cost is about 4GB per month per agent for full audit trails. At current cloud storage prices, that’s roughly $0.10 per agent per month. Negligible.

Is the 12% compute overhead worth it? Consider what happens if you ship an observability agent without confidence calibration monitoring. A couple of weeks later, a vector database regression silently degrades your SDR agent for close to two weeks. The cost of that kind of degradation, measured in lost pipeline from below-par outreach, can run to a meaningful chunk of expected pipeline value, far more than the monitoring would have cost to catch it early.

The 12% compute overhead for confidence monitoring is modest, on the order of a few hundred dollars a month. The alternative is finding out your agent is broken when a human reads the output and says, “This is terrible.”

We’ll take the 12% overhead.

What Datadog Won’t Tell You

We have respect for the traditional observability vendors: Datadog, New Relic, Grafana, the whole ecosystem. Their tools are excellent for what they were designed to monitor: infrastructure, services, and request-response systems.

They weren’t designed to answer the question that matters for agents: “Is this system making good decisions?”

Good decisions aren’t a metric you can scrape from a /metrics endpoint. Good decisions require understanding the agent’s reasoning, comparing its predictions against outcomes, and detecting gradual calibration drift across hundreds of decisions over weeks. This is a fundamentally different data model than time-series metrics.

Some vendors are starting to add “AI monitoring” features: mostly token usage tracking, prompt latency, and error rates on LLM API calls. These are useful for cost management but irrelevant for agent quality. Knowing that your LLM calls cost $47.32 today tells you about your infrastructure spend. It tells you nothing about whether the agent’s SDR emails are any good.

The observability stack for agents needs to be built from first principles, not bolted onto existing monitoring tools. The data model is different (decision records, not time-series metrics). The alerting logic is different (trend detection, not threshold breaches). The debugging workflow is different (trace a decision chain, not trace a request).

Most organizations running AI agents in production find that their existing monitoring tools weren’t built to catch agent-specific failure modes. The few teams that build custom observability for agent quality tend to see meaningfully fewer production incidents tied to agent behavior.

The Principle

Here’s the principle we keep coming back to: you get what you monitor.

If you monitor uptime, you get uptime. Your agents will run. They may run badly, but they’ll run.

If you monitor decision quality, you get decision quality. Your agents will make good decisions, or you’ll know within 48 hours that they’ve stopped making good decisions.

Traditional observability was designed for an era when software’s job was to execute instructions. Agents don’t execute instructions. They make judgments. And monitoring judgment requires a fundamentally different approach, one built around confidence, calibration, grounding, and the continuous question: “Is this system’s reasoning sound?”

That question doesn’t have a Prometheus metric. But it’s the only question that matters.
