Engineering

A prompt is a program you forgot to version

An unversioned prompt is an unversioned program: a one-word edit can silently regress a behavior nobody can see until a customer hits it. Treat prompts like the code they are.

ASR

Apollo Space Research

Apollo Space

· 11 min read

Someone changed one word in a prompt, concise became brief, and three days later an agent stopped including the deadline in its summaries. No commit explained it. No test caught it. No alert fired. The word looked like a synonym; the model treated it as an instruction, and a behavior that mattered quietly went missing until a customer asked why their renewal date wasn’t in the brief anymore.

That bug took an afternoon to find. The fix took ten seconds. The expensive part was the three days the agent was wrong and nobody knew.

Here is the thing most teams haven’t internalized yet: a prompt is a program. An unversioned prompt is an unversioned program, and a one-word edit can silently regress a behavior you cannot see until production. The rest of this post is about taking that sentence literally, because once you do, the entire discipline you already apply to code snaps into place around the part of your system you’ve been editing like a sticky note.

The naive version: prompts live in a text box

The way most teams handle prompts is the way nobody would dare handle code.

The prompt lives in a settings panel, or a Notion doc, or a string literal someone edits in place. There’s one copy. You change it, you save it, it’s live. If you want to know what it used to say, you don’t, the old version was overwritten the moment you hit save. If you want to know why it says what it says, the reason left with whoever typed it.

This feels fine because a prompt looks like prose. It’s English. It reads like a note to a colleague. So we edit it like a note: loosely, in the moment, trusting that a small wording change is a small change. “Make it a little friendlier.” “Tell it to be more concise.” “Add a line about not making things up.” Each edit is one sentence, and one sentence feels harmless.

It isn’t, and the reason is the whole point. A prompt isn’t read by a colleague who fills in your intent. It’s read by a model that does exactly and only what the tokens say. The prose is an interface to a program whose behavior is acutely sensitive to its input. Swap a word, reorder two clauses, soften a “never” to a “try not to,” and you may have changed nothing, or you may have rewritten a branch of the program’s behavior. From inside the text box, the two are indistinguishable. They look like the same kind of small.

So the naive failure isn’t carelessness. It’s a category error. We classify prompts as content, editable, casual, low-stakes, when they’re actually code: load-bearing, behavior-defining, and exactly as capable of regressing as any function you’d never ship without review.

Why “it’s just words” is the trap

Let’s name the specific pain, because everyone who has shipped an agent has felt it.

You make a change you’re sure is cosmetic. You reword the system prompt to fix one awkward reply. The reply you were fixing gets better. You ship. And somewhere else, in a flow you weren’t looking at, triggered by an input you didn’t test, the behavior shifts. The agent that used to ask a clarifying question now guesses. The one that used to refuse a risky action now attempts it. The summary that always carried the date now sometimes doesn’t.

You didn’t see it because you were looking at the change, not the blast radius. With code, you have decades of muscle memory for this: the function you edited has callers, the callers have tests, the diff is reviewed, and if behavior moves somewhere unexpected, something goes red. With a prompt, none of that scaffolding exists by default. The “function” is a paragraph, its “callers” are every input that flows through it, and its “test suite” is whoever happens to notice in production.

The bottleneck never disappears. It just moves, from “did the model understand?” to “did we understand what we changed?”

And here’s why prompts are arguably worse than code in this one respect: code fails loudly. A type error won’t compile. A null dereference throws. A bad prompt edit produces a perfectly valid-looking response that is subtly, confidently wrong. There’s no stack trace for “the tone drifted” or “it stopped mentioning the deadline.” The failure mode of a regressed prompt is plausibility, the most expensive kind of wrong, because nothing about it looks like an error.

A prompt looks like prose in a text box, but a model reads it as a program; the contrast shows that a one-word edit appears cosmetic on the surface while the model treats it as a changed instruction that shifts behavior somewhere you were not looking.

Our way: treat the prompt like the program it is

The fix isn’t clever. It’s the most boring idea in software, applied to the place we forgot to apply it. If a prompt is a program, give it the four things every program has: a version, a diff, a reviewer, and a test that runs before it merges.

The key idea is simple. Let’s walk through each piece, because each one closes a specific door the text box left open.

Version it, so “what did it used to say” always has an answer

Step one is the cheapest and it removes a whole class of pain on its own: the prompt lives in version control, not a text box. Every change is a commit. Every commit has a timestamp, an author, and a parent. The question “what did this prompt say last Tuesday”, the question that turns a three-day investigation into a one-minute git log, now has an answer, always, for free.

This sounds trivial until you’ve debugged a behavior change you can’t reproduce because the prompt that caused it no longer exists anywhere. An unversioned prompt isn’t just hard to review. It’s unrecoverable. The moment you save over it, the evidence is gone. Versioning is the difference between “we changed something around then” and “we changed exactly this, here, and here’s the line.”

Diff it, so a one-word change reads as a one-word change

Once it’s versioned, you get the diff, and the diff is where the danger becomes visible. concisebrief is two characters of difference and a potential behavior change, and a diff puts those two characters under a spotlight instead of burying them in a paragraph nobody re-reads.

A diff reframes the edit. In a text box, you see the new prompt and it reads fine, so you ship. In a diff, you see the change, isolated, highlighted, demanding a reason. The question stops being “does this prompt look good?” and becomes “do I believe this specific word swap is safe?” That’s a much harder question, which is exactly the point. The diff makes you confront the change at the granularity the model will.

Review it, so the author isn’t the only one who reads the change

A prompt change is a behavior change, and behavior changes get reviewed. Same rule as code, same reason as code: the person who wrote the change is the worst-positioned person to spot what it breaks, because if they could see the regression, they wouldn’t have written it. A second reader, human or an agent whose job is to find the failure, asks the question the author skipped. “You softened the refusal here. Are you sure it still refuses the dangerous case?” That sentence is worth more than the entire text box.

Gate it, so the regression dies before the customer finds it

This is the one that actually catches the silent bug, and it’s where prompts finally get to be safer than the loose-prose era ever allowed. Before a prompt change merges, it runs against a set of recorded scenarios, real inputs with known-good behavior, and if the new prompt changes an answer that was supposed to stay the same, the gate goes red.

This is just a test suite, reframed for a probabilistic program. You can’t assert exact-string equality on a model’s output the way you assert 2 + 2 === 4, so instead you assert on behavior: did the summary still include the date? Did the agent still ask before acting? Did the refusal still hold? Each scenario is a tripwire across one behavior you care about. The concisebrief edit that dropped the deadline would have tripped a wire that checks “the summary contains the deadline”, and it would have tripped before merge, in seconds, instead of three days later, in front of a customer.

Naive: ship the prompt, find out in production. Our way: ship the prompt past a wall of recorded behaviors, and find out in CI. The bug is the same bug. The only thing that changed is when you meet it, and meeting it before the merge is the difference between a red check and a support ticket.

A prompt change flows through the same pipeline as a code change: it becomes a commit with a diff, a reviewer reads the isolated change, and a behavior gate replays recorded scenarios so a regressed answer turns the build red before it can reach a customer.

The changelog is the part you’ll be most grateful for

String the four together and you get something the text box could never produce: a history of why your agent behaves the way it does.

That history is the changelog, and it’s worth dwelling on, because it pays a debt you don’t notice accruing. Six weeks from now, someone, maybe you, will look at a prompt and wonder why it has a strange-sounding clause in the middle: “always include the date, even if it seems redundant.” In a text box, that clause is a mystery. Nobody remembers adding it. It looks deletable. Someone will delete it, in the name of cleanup, and reintroduce the exact bug it was added to fix.

In a versioned prompt, that line has a commit. The commit has a message: “add explicit date instruction, model was dropping it from summaries.” Now the clause isn’t a mystery. It’s a scar with a story, and you don’t reopen wounds whose story you can read. The changelog is how a prompt stops being a pile of accreted superstitions and becomes a record of decisions, each one traceable to the failure that taught it.

This is the quiet reason teams who version their prompts pull ahead. It’s not that their prompts are better-written. It’s that their prompts can’t silently rot, because every line of them is accountable to a reason, and the reasons don’t leave when the author does.

The turn: discipline is how you earn the right to move fast

Here’s the part that isn’t about prompts.

It’s tempting to read all of this as friction, version control, diffs, reviews, gates, four ceremonies bolted onto what used to be a quick edit in a settings panel. And if you’re shipping a weekend toy, it is overkill; edit the text box, have fun. But the moment a prompt is doing real work, talking to your customers, touching your money, deciding what your agent does when nobody’s watching, the loose edit isn’t speed. It’s a debt you’re taking out against a future afternoon you’ll spend bisecting a regression you can’t see.

The discipline isn’t the opposite of speed. It’s what makes speed safe. A team that versions and gates its prompts can change them fearlessly, because the gate catches the mistake instead of the customer. A team editing a live text box can’t change anything fearlessly, because every edit is a coin flip they won’t see the result of until it’s already shipped. The careful team ends up faster, not despite the ceremony, but because the ceremony is the only thing that lets them stop being afraid of their own edits.

That’s the trade the whole post is arguing for: treat the prompt like the program it is, and you buy back the ability to move on it without holding your breath. A prompt is a program. An unversioned prompt is an unversioned program, and a one-word edit can silently regress a behavior you cannot see until production, unless you’ve built the thing that sees it for you first.


That’s part of how we build Apollo, agents whose behavior lives in version control, changes through a diff a reviewer reads, and merges only past a wall of recorded behaviors that turn red when an answer drifts. If you’ve ever changed one word in a prompt and spent the next three days finding out what it cost, you already know why prompts deserve a changelog. They were always code. We just kept editing them like notes.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist