
The day our agent filed its own bug

The agent hit a broken tool, apologized to the user, opened the ticket itself, and by morning the fix was merged and it simply finished the job.


Apollo Space Research


September 22, 2025 · 8 min read

A person asked the agent for something ordinary: pull a thread together, line up the next step, get it moving. Halfway through, a tool it needed threw an error. Not a model mistake. A genuinely broken pipe somewhere downstream, the kind that would have stopped any program cold.

Here is what it didn’t do. It didn’t loop. It didn’t hallucinate a cheerful “all done!” over a task that hadn’t happened. It didn’t bury the failure three layers deep in a log nobody would read until Tuesday.

It told the person, plainly, that this one part was broken right now. Then it opened a ticket: the bug, the steps to reproduce it, the place it had snagged. By the next morning the fix was merged, and the same conversation picked the thread back up and finished the job. Nobody on the team typed a word into a bug tracker.

That’s the whole story, and the rest of this post is why it matters. The agent hit a broken tool, apologized to the user, opened the ticket itself, and by morning the fix was merged and it simply finished the job.

What software usually does when a tool breaks

Picture the normal version, because everyone has lived it.

A program calls something that isn’t there. The call fails. And then one of two bad things happens. Either the whole thing falls over with a red stack trace that means nothing to the person who triggered it, or, worse in the AI era, it pretends. It returns a confident answer assembled from nothing, because returning something is what it was trained to do, and a failure is just an awkward gap to paper over.

Both outcomes push the same work onto a human. Someone has to notice the failure. Someone has to read the trace, understand it, decide it’s real, find the right repository, and write up what went wrong clearly enough that a second person can fix it. Then someone has to remember to follow up on the original request, which by now has scrolled off the screen.

That handoff, from “a tool broke” to “a human is writing the ticket,” is where most reliability quietly dies. Not because the bug was hard, but because nobody was on the hook to carry it from the moment it broke to the moment it got logged.

On the left, a tool throws an error a human has to notice, triage, and write up before anyone fixes it. On the right, the agent reads the error like any other result, files the ticket with steps to reproduce, and keeps its promise to the user.

The gap is human attention, and attention is the most expensive, most interruptible resource a company owns. A broken tool that needs a person to notice it is a broken tool that stays broken until someone happens to look.

The naive fix: tell the agent to be honest

The obvious first move is to make the agent admit failure instead of faking success. That’s real progress: a system that says “I couldn’t do this part” is already better than one that confidently lies.

But honesty alone just relocates the problem.

Now the failure is visible, which is good, and it’s sitting in a chat window, which is not. The person reads “I couldn’t complete that step,” shrugs, and moves on, because they came for an outcome, not a diagnosis. The error is honest and orphaned at the same time. No ticket exists. The next person who triggers the same flow hits the same wall and gets the same polite apology. The system is transparent about being broken and does nothing about it.

An honest dead end is still a dead end. Naming the failure is the floor, not the ceiling.

What the agent actually did

The difference in our story isn’t that the agent was smarter. It’s that the agent treated the error as its own problem to route, the way a good colleague would.

When the tool failed, the agent read the error the way it reads any other result, as information, not as the end of the road. It saw that the failure was real and reproducible, not a fluke of phrasing. It told the user the truth about what it couldn’t do, so nobody was left waiting on a promise it couldn’t keep. And then, without anyone asking, it did the thing a human would have had to remember to do: it filed the ticket, with enough detail that a fix could start from the steps instead of from a guess.
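One way to picture “reading the error like any other result” is a tool wrapper that returns failures as data instead of raising them, plus a ticket built from that data. What follows is a minimal sketch under stated assumptions, not a description of the actual implementation: `ToolResult`, `call_tool`, `is_reproducible`, `build_ticket`, and the `calendar.sync` tool name are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    """Outcome of one tool call; a failure is data, not an exception."""
    tool: str
    args: dict
    ok: bool
    value: Any = None
    error: Optional[str] = None


def call_tool(name: str, fn: Callable[..., Any], /, **args: Any) -> ToolResult:
    """Run a tool and capture any exception as a readable result."""
    try:
        return ToolResult(tool=name, args=args, ok=True, value=fn(**args))
    except Exception as exc:  # deliberately broad: the agent reads whatever broke
        return ToolResult(tool=name, args=args, ok=False,
                          error=f"{type(exc).__name__}: {exc}")


def is_reproducible(name: str, fn: Callable[..., Any], args: dict,
                    attempts: int = 2) -> bool:
    """Treat a failure as real only if it repeats on every retry."""
    return all(not call_tool(name, fn, **args).ok for _ in range(attempts))


def build_ticket(result: ToolResult) -> dict:
    """Turn a failed result into a bug report with steps to reproduce."""
    return {
        "title": f"Tool '{result.tool}' fails: {result.error}",
        "steps_to_reproduce": [
            f"Call tool '{result.tool}' with arguments {result.args!r}",
            "Observe the error below",
        ],
        "observed_error": result.error,
    }


def broken_sync(account: str) -> None:
    """Hypothetical stand-in for a tool with a downstream outage."""
    raise ConnectionError("upstream service unavailable")


result = call_tool("calendar.sync", broken_sync, account="demo")
ticket = None
user_message = "Done."
if not result.ok and is_reproducible("calendar.sync", broken_sync, result.args):
    ticket = build_ticket(result)
    user_message = (f"The {result.tool} step is broken right now. "
                    "I've filed a ticket and will finish once it's fixed.")
```

The design choice worth noticing is that the failure never leaves the agent’s normal control flow. Because the error arrives as a value, the same code path that would have used a successful result can decide to retry, tell the user, and write the ticket.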

The agent hit a broken tool, apologized to the user, and opened the ticket itself. That last clause is the one that changes the economics. The failure didn’t wait for a human to discover it. It announced itself, fully written up, the moment it happened.

Overnight, a separate session picked up the ticket, made the fix, and merged it. And here’s the part that turns a clever trick into an actual loop: in the morning, the same conversation, the one that had snagged the night before, came back, found the tool working, and finished what the person originally asked for. The user’s request and the bug fix closed in the same thread, on the same problem, without a relay race of humans in between.

A broken tool becomes a closed loop: the agent hits the failure, tells the user, files the ticket itself, an overnight session fixes and merges it, and the same conversation resumes and finishes, so next time the tool just works.

Why a loop beats a fire alarm

Most monitoring stops at the alarm. Something breaks, a notification fires, and a human is now responsible for everything after the beep: triage, repair, follow-up, and remembering the original ask. The alarm is honest. It is also just the start of the work.

A loop is different because it doesn’t end at the noticing. The agent that hit the broken tool was the first link in a chain that included filing, fixing, and finishing, and the chain closed without handing the baton to a tired person at 11 p.m.

The mechanism that makes this possible is unremarkable once you see it. An error is just another result the agent can read. A bug report is just structured writing, which is the thing these systems are best at. A fix session is just another task that can run while the office is dark. And a conversation that remembers where it was can resume instead of restarting. None of those pieces is exotic. What’s new is wiring them into a single circuit, so a failure flows all the way around to a fix without falling out of the loop in the middle.
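To make the circuit concrete, here is a small sketch of the four pieces wired together: an error read as a result, a ticket filed, a fix applied by a separate session, and a parked conversation that resumes. Everything in it is an assumption for illustration: the `Conversation` and `Tracker` classes, the in-memory dictionaries standing in for a ticket system and a tool registry, and the `export_report` tool name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Conversation:
    """A conversation that remembers where it stopped."""
    request: str
    tool: str
    status: str = "running"           # running -> parked -> done
    blocked_on: Optional[int] = None  # id of the ticket it is waiting on
    output: Optional[str] = None


@dataclass
class Tracker:
    """In-memory stand-in for a bug tracker."""
    tickets: Dict[int, dict] = field(default_factory=dict)

    def file(self, title: str) -> int:
        ticket_id = len(self.tickets) + 1
        self.tickets[ticket_id] = {"title": title, "merged": False}
        return ticket_id


def step(convo: Conversation, tools: Dict[str, Callable[[], str]],
         tracker: Tracker) -> None:
    """Advance the conversation once: finish, or park behind a ticket."""
    if convo.status == "done":
        return
    if convo.blocked_on is not None and not tracker.tickets[convo.blocked_on]["merged"]:
        return  # the fix is not merged yet, so stay parked
    try:
        convo.output = tools[convo.tool]()
        convo.status, convo.blocked_on = "done", None
    except Exception as exc:
        # The error is just another result: file it and remember the ticket.
        convo.blocked_on = tracker.file(f"{convo.tool} failed: {exc}")
        convo.status = "parked"


def fix_session(ticket_id: int, tracker: Tracker,
                tools: Dict[str, Callable[[], str]],
                tool_name: str, fixed_impl: Callable[[], str]) -> None:
    """Stand-in for the overnight session: swap in the fix, mark it merged."""
    tools[tool_name] = fixed_impl
    tracker.tickets[ticket_id]["merged"] = True


def broken_export() -> str:
    raise RuntimeError("downstream pipe is broken")


tools = {"export_report": broken_export}
tracker = Tracker()
convo = Conversation(request="send the weekly report", tool="export_report")

step(convo, tools, tracker)   # evening: tool fails, ticket filed, conversation parks
status_evening = convo.status
fix_session(convo.blocked_on, tracker, tools, "export_report", lambda: "report sent")
step(convo, tools, tracker)   # morning: the same conversation resumes and finishes
```

The only state that has to survive the night in this sketch is the conversation’s `blocked_on` ticket id. That single pointer is what lets the morning step resume the original request instead of starting over.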

The naive system asks a human to carry the failure from “it broke” to “it’s logged” to “it’s fixed” to “it’s done.” Every one of those handoffs is a place the ball gets dropped. The loop removes the handoffs. The agent that found the bug is the same actor, across sessions, that sees it through.

A fire alarm tells you something is wrong. A loop is already on its way to the fix.

The turn: trust is built from moments like this

Think about what makes you trust a coworker with something real.

It is almost never the polished answer they give when you ask a direct question. It’s the time they hit a wall on your project, told you straight instead of hiding it, wrote up exactly what went wrong so it wouldn’t bite the next person, and then came back and finished the thing they said they’d finish. That’s not intelligence. That’s ownership: carrying a problem all the way to its end instead of dropping it the second it got inconvenient.

That is the bar for software that’s going to run real operations. Not a model that’s confident when things go right, but a system that’s honest when things go wrong, and accountable enough to close its own loop. The agent that filed its own bug wasn’t impressive because it was clever. It was trustworthy because, faced with a failure, it did the boring, responsible, unglamorous sequence a good colleague does, and the person who asked for help woke up to a finished job and a fix they never had to chase.

The agent hit a broken tool, apologized to the user, opened the ticket itself, and by morning the fix was merged and it simply finished the job. That sentence is short. The trust it earns is not.

We’re building this at Apollo Space: software that owns the failure as well as the win, closes its own loops, and comes back to finish what it started. If the most exhausting part of your week is carrying other people’s dropped balls to the finish line, that’s exactly the job we think a system should do first.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.
