Can Apollo process your refunds? Only the ones the policy already decided

A refund is not a judgment call, it is a policy with an audit trail. The agent settles the in-policy ones in seconds, routes the edge cases to a human, and logs every decision either way.

Apollo Space Research

Apollo Space

· 11 min read

Say a customer writes in on a Saturday: “Charged twice for the same order, please refund one.” The order is there. The duplicate charge is there. Your policy says duplicate charges get refunded, no questions. And yet that message sits in a queue until Monday, because the person who could approve it is asleep, and the system that could approve it was never told what “approve” means.

That delay is not a hard problem. It is an unread instruction.

Here is the line this whole post orbits: a refund is not a judgment call, it is a policy with an audit trail. Most refunds were already decided the day you wrote the policy. The agent’s job is not to decide them. It is to recognize the ones the policy already decided, do them in seconds, and hand the rest to a human, with a record of every choice.

Why “let the AI handle refunds” scares the right people

The naive pitch is the scary one: point a smart model at your support inbox and let it issue refunds. Most operators hear that and flinch, correctly. Money is leaving the building. A confident model that misreads “I’d like to understand the refund policy” as “issue me a refund” doesn’t make a typo. It makes a transaction.

So the instinct is to do the opposite: keep every refund manual, route all of them to a person, and treat the model as a glorified suggestion box. Safe. Also slow, and expensive in a way nobody puts on a dashboard: the in-policy refunds, the boring duplicate-charge ninety percent, now wait in line behind the genuinely hard ten percent. Your most expensive judgment is spent on cases that needed no judgment at all.

Both versions miss the same thing. They treat “process a refund” as one decision with one risk level, when it is really two completely different jobs wearing the same label.

The boring refund and the risky refund are not the same task. Treating them as one is the whole mistake.

One job is recognition: does this request match a rule we already wrote? That job is bounded, checkable, and safe to automate, because the human already made the decision, months ago, in the policy. The other job is judgment: this case is outside the rules; what should we do? That job is genuinely human, and the agent’s only correct move is to stop and ask.

The trick is not making the model brave enough to decide refunds. It is making the system honest about which of the two jobs it is looking at.

The naive way: one model, one inbox, one big “issue refund” button

Let’s show the dumb version first, because it fails in an instructive way.

You give a capable model access to your payments tool and your support inbox and a prompt that says, roughly, “read the message, decide if a refund is warranted, and issue it.” It works in the demo. The demo always uses the duplicate-charge case, because that one is easy.

Then production arrives with the messages the demo never showed. “Refund me or I’m calling my bank.” A request for an order from fourteen months ago. A refund on an item that was clearly used for three weeks and is now being returned the day before a known sale. A polite note that, read closely, is asking how refunds work, not asking for one. The model now has to be a policy expert, a fraud analyst, and a customer-service voice all at once, on every message, with real money on the line and no second reader.

The failure is not that the model is dumb. It is that you asked one prompt to hold the entire policy in its head and apply it under pressure, and a policy held in a prompt is a policy nobody can audit. When the model gets one wrong, you can’t point to the rule it broke, because there was no rule. There was a paragraph of instructions and a hope.

This is the same failure mode behind most “the AI did something weird” stories. The intelligence wasn’t missing. The structure was. The model was handed a decision that should have been a lookup.

A duplicate-charge refund follows a closed loop: the agent matches the request to the order, checks it against the written refund policy, finds it clearly inside the rules, issues the refund through the payments tool, and writes a decision to the audit log, all without waiting for a person.

Our way: the policy is the program, the agent is the interpreter

Here is the reframe that fixes it. Stop asking the model to decide refunds. Start giving the model a policy it can read, and ask it only to recognize when a request clearly falls inside or clearly falls outside that policy.

The policy is the part that’s already written down: duplicate charges get refunded. Returns inside the window get refunded. Items outside the window need a reason. Refunds above some amount need a second look. A customer flagged for repeated chargebacks gets a human, always. None of that is a model decision. It’s the company’s decision, made once, in advance, exactly the way you’d write it for a new support hire on their first day.

The agent’s job, then, splits cleanly into three outcomes for any incoming request:

  • Clearly in-policy. The request matches a rule with no ambiguity: the duplicate charge, the in-window return. The agent issues the refund through the payments tool, sends the customer a clear note, and logs the decision with the rule it matched. Seconds, not days.
  • Clearly out-of-policy. The request plainly violates a rule: the order is past every window, the amount is over the auto-approve line, or the account is flagged. The agent does not refund and does not argue. It routes to a human with the case assembled: the order, the rule it failed, what the customer asked, and a draft of either a refund or a kind decline.
  • Ambiguous. The agent genuinely can’t tell, so it routes to a human too. It never guesses with money. The default for uncertainty is ask, never act.
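The three outcomes above can be sketched as a plain lookup. This is a minimal illustration, not Apollo’s implementation: the rule ids, the `RefundRequest` fields, the 30-day window, and the $100 auto-approve line are all assumptions made for the example, and the `intent` field stands in for whatever the recognition step concluded about the message.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from enum import Enum
from typing import Optional


class Outcome(Enum):
    IN_POLICY = "in_policy"          # agent refunds and logs
    OUT_OF_POLICY = "out_of_policy"  # route to a human with the failed rule
    AMBIGUOUS = "ambiguous"          # route to a human, never guess


# Assumed thresholds; replace with the numbers in your written policy.
RETURN_WINDOW_DAYS = 30
AUTO_APPROVE_LIMIT = Decimal("100.00")


@dataclass(frozen=True)
class RefundRequest:
    # Hypothetical shape. `intent` is what the recognition step concluded the
    # customer asked for; None means it could not tell (a policy question, say).
    intent: Optional[str]            # "duplicate_charge", "return", or None
    amount: Decimal
    order_date: date
    duplicate_charge_found: bool
    chargeback_flagged: bool


def decide(req: RefundRequest, today: date) -> tuple[Outcome, str]:
    """Return the outcome and the id of the rule that produced it."""
    # Hard stops come first: these always go to a person.
    if req.chargeback_flagged:
        return Outcome.OUT_OF_POLICY, "chargeback_flag_requires_human"
    if req.amount > AUTO_APPROVE_LIMIT:
        return Outcome.OUT_OF_POLICY, "amount_over_auto_approve_limit"

    if req.intent == "duplicate_charge":
        if req.duplicate_charge_found:
            return Outcome.IN_POLICY, "duplicate_charge"
        # The customer reports a duplicate but the ledger shows none: unclear.
        return Outcome.AMBIGUOUS, "no_rule_matched"

    if req.intent == "return":
        age_days = (today - req.order_date).days
        if age_days <= RETURN_WINDOW_DAYS:
            return Outcome.IN_POLICY, "return_within_window"
        return Outcome.OUT_OF_POLICY, "return_outside_window"

    # Unknown intent: the default for uncertainty is ask, never act.
    return Outcome.AMBIGUOUS, "no_rule_matched"
```

The function returns the rule id alongside the outcome so that whatever calls it can log the reason, not just the result. Note that nothing in it issues a refund; it only sorts the request into a lane.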

Notice what changed. In the naive version, every message was a high-stakes decision. Here, most messages are a lookup, and only the genuinely hard ones reach a person. The human’s time stops being spent on duplicate-charge triage and starts being spent on the cases that actually need a human: the angry one, the borderline one, the one where saying yes is good business even though the policy says no.

And critically, the agent never has the last word on releasing money when it is unsure. The policy is the ceiling. The agent works strictly underneath it.

The part operators actually ask about: the audit trail

Every operator who’s been burned has the same two questions, and they’re the right ones. How do I know it followed the rules? And what happens when it gets one wrong?

The answer to both is the same: a refund is not a judgment call, it is a policy with an audit trail. The audit trail is not a nice-to-have bolted on after. It is the thing that makes automating any of this defensible.

The naive logging you’ve seen elsewhere is a flat activity feed: “Refund issued, $40, 2:14pm.” That tells you something happened. It tells you nothing about why, which is the only thing that matters when you’re reviewing whether to trust the system. A log that records the action but not the reason is a receipt, not an audit.

Apollo is built so every refund decision writes a record with the shape that actually answers the question: which request came in, which order it matched, which policy rule it triggered, what the agent did about it, and, when it routed to a person, who decided and what they chose. Read top to bottom, the log reconstructs the reasoning, not just the result. You can sample ten refunds at random and, for each, see the exact rule the system claims it followed, then check that rule against your real policy.
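As a rough sketch of what such a record could look like, here is one possible shape with a helper for the random spot check. The field names, the JSON-lines serialization, and the `sample_for_review` helper are assumptions for illustration; the post does not specify Apollo’s actual log format.

```python
import json
import random
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    # Hypothetical field names; each mirrors one question from the text.
    request_id: str                     # which request came in
    order_id: Optional[str]             # which order it matched, if any
    rule_id: str                        # which policy rule it triggered
    action: str                         # "refunded" or "routed_to_human"
    decided_by: Optional[str] = None    # reviewer, when routed to a person
    human_choice: Optional[str] = None  # what that reviewer chose


def to_log_line(record: DecisionRecord) -> str:
    """Serialize one decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


def sample_for_review(records, k=10, seed=None):
    """Pick up to k automated refunds at random for a manual spot check."""
    automated = [r for r in records if r.action == "refunded"]
    rng = random.Random(seed)
    return rng.sample(automated, min(k, len(automated)))
```

Compare this with the flat feed: “Refund issued, $40, 2:14pm” has no `rule_id`, so there is nothing to check against the written policy. With the rule recorded, a reviewer can take each sampled record and ask one question: does this request really fall under that rule?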

That’s what turns “trust the AI” into something an operator can actually grant. You’re not trusting the model’s good intentions. You’re auditing its decisions against a written policy, the same way a finance lead spot-checks a junior’s expense approvals. The trust lives in the record, not in the model’s confidence.

You don’t trust the agent. You trust the policy, and you audit the agent against it.

And the wrong ones? When a refund is later judged a mistake, you don’t get a shrug. You get the exact record: the rule the agent matched, the request it matched it to. That tells you whether the agent misread the request (a recognition bug you can fix) or whether the policy itself let it through, which means your rule was wrong and no human reviewer would have caught it either. On inspection, most “the AI messed up” cases turn out to be the second kind: the policy said yes, and the policy was the thing that needed editing.
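That two-way split is simple enough to write down. The sketch below assumes a reviewer re-reads the original request against the written policy and names the rule they think applies; the `Diagnosis` labels and the `diagnose` function are hypothetical names for illustration, and the function is only meant for refunds already judged a mistake.

```python
from enum import Enum
from typing import Optional


class Diagnosis(Enum):
    RECOGNITION_BUG = "recognition_bug"  # the agent matched the wrong rule
    POLICY_GAP = "policy_gap"            # the right rule matched; the rule is wrong


def diagnose(logged_rule_id: str, reviewer_rule_id: Optional[str]) -> Diagnosis:
    """Classify a refund that was later judged a mistake.

    logged_rule_id:   the rule the agent recorded in the audit log.
    reviewer_rule_id: the rule a reviewer says the request falls under,
                      or None if they think no rule applies.
    """
    if reviewer_rule_id != logged_rule_id:
        # The agent misread the request: fix the recognition step.
        return Diagnosis.RECOGNITION_BUG
    # The agent applied the policy as written: edit the policy.
    return Diagnosis.POLICY_GAP
```

The useful property is that each branch points at a different fix. A recognition bug is repaired in the agent; a policy gap is repaired in the rule, and would have slipped past a human following the same rule.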

Two ways to run refunds. The naive lane sends every request, boring and risky alike, to one overloaded human queue, where in-policy refunds wait behind the hard cases. The Apollo lane sorts each request by the policy: clearly in-policy refunds settle automatically and get logged, while out-of-policy and ambiguous cases route to a human with the case already assembled.

Where the line sits, and who draws it

The honest part of this is that the line between “agent handles it” and “human handles it” is not fixed, and you shouldn’t pretend it is.

On day one you draw it conservatively. Maybe only duplicate charges auto-refund, and everything else routes to a person while you watch. You read the audit log for a couple of weeks. You see that the in-window returns were boringly correct every single time, and you move that rule under the line too. You see that refunds above a certain amount occasionally surprised you, so you keep those above the line, where a human always looks. The policy is a dial, and the audit trail is how you turn it with your eyes open.

This is the opposite of the “set it and forget it” pitch, and that’s the point. A refund system you can’t watch is one you shouldn’t have turned on. A refund system that shows you its reasoning, decision by decision, is one you can expand on evidence instead of faith. You start narrow, you read the record, you widen the lane where the record earned it.

The customer-flagged-for-chargebacks case never moves under the line, no matter how clean the log looks. Some decisions are human by design, not by limitation, and a good system knows the difference between “I’m not allowed to decide this yet” and “no one should ever let software decide this alone.”
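One way to picture the dial is as a set of rule ids the agent may settle alone, plus a second set that can never be added to it. This is a sketch under assumptions: the class name, the rule ids, and the choice to raise an error are illustrative, not a description of how Apollo stores its configuration.

```python
# Hypothetical rule ids. NEVER_AUTOMATE holds decisions that are human by design.
NEVER_AUTOMATE = frozenset({"chargeback_flag_requires_human"})


class AutoApproveDial:
    """Tracks which rules the agent may settle without a person."""

    def __init__(self, initial=("duplicate_charge",)):
        # Day one: start narrow, with only the rules passed in.
        self._auto = set()
        for rule_id in initial:
            self.move_under_line(rule_id)

    def move_under_line(self, rule_id: str) -> None:
        """Let the agent handle this rule, after the log has earned it."""
        if rule_id in NEVER_AUTOMATE:
            raise ValueError(f"{rule_id} must always route to a human")
        self._auto.add(rule_id)

    def move_above_line(self, rule_id: str) -> None:
        """Send this rule back to human review."""
        self._auto.discard(rule_id)

    def may_auto_refund(self, rule_id: str) -> bool:
        return rule_id in self._auto and rule_id not in NEVER_AUTOMATE
```

A typical progression would be to start with the default, read the log for a couple of weeks, then call `move_under_line("return_within_window")`. Trying the same with the chargeback rule fails loudly, which is the difference between “not allowed yet” and “never”.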

The turn: the goal was never to automate refunds

Step back from the payments tool and the audit log and look at what actually changed for the person running support.

Before, their best judgment was spent on the wrong cases. The duplicate-charge refund, which needed attention but no judgment, consumed the same queue, the same fatigue, and the same Saturday as the genuinely hard call where a loyal customer is asking for an exception the policy doesn’t cover. The boring work crowded out the work only a human can do. That’s the real tax, and it never showed up as a line item. It showed up as the good people being too busy with lookups to handle the moments that decide whether a customer stays.

A refund is not a judgment call, it is a policy with an audit trail. When the system finally believes that, the boring ninety percent disappears into a logged, auditable, instant lane, and the person who used to clear that queue gets handed only the cases that were always theirs: the exception worth making, the angry customer worth saving, the borderline call where the right answer is a little kindness the policy didn’t anticipate. That is not a smaller job. It is the actual job, finally uncrowded.


This is what we’re building at Apollo Space: not an agent that’s brave enough to spend your money, but a system honest enough to know which decisions you already made, and disciplined enough to leave the rest to you. If you’ve ever watched a duplicate-charge refund wait three days behind a queue, you already know that most refunds were decided long before the request came in. The only thing missing was something that read the policy and didn’t need to sleep.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.