We resolved a three-way design argument by refusing to argue

Run all three specs against identical scenarios and let the numbers fold two of them.

Apollo Space Research

Apollo Space

· 11 min read

Three engineers, three designs for the same agent memory layer, and three explanations of why the other two were wrong. One wanted a flat log the agent re-reads every turn. One wanted a vector store that retrieves by similarity. One wanted a small structured table the agent writes facts into by hand. Each could defend their choice for twenty minutes without repeating themselves. None of them could prove it.

We had a meeting on the calendar to settle it. We cancelled the meeting.

This post is about what happened when we stopped treating a design disagreement as a debate to win and started treating it as a measurement we hadn’t taken yet, and why that one move changed how we make every load-bearing decision now.

The naive version: argue until someone gives up

The default way a team resolves a design argument is to argue. You book a room, you put the three options on a whiteboard, and you let the strongest case win.

It feels rigorous. It is not.

What actually decides the room is rarely the design. It’s who’s most senior, who’s most fluent, who has the most patience for a long meeting, and who happened to frame the question first. The vector store sounds sophisticated, so it gets the benefit of the doubt. The flat log sounds primitive, so it has to fight uphill, even if, for this workload, primitive is exactly right. The person who can draw the cleanest diagram wins, and a clean diagram is a measure of drawing, not of the thing the diagram describes.

And here’s the failure underneath the failure: even when the room agrees, nobody has learned anything. You picked the option that argued best. You still don’t know how any of them behave under load, how they degrade when the agent’s context fills up, what they cost per conversation, or which one quietly drops a fact the user mentioned forty turns ago. You traded a real question, which design actually works here, for a cheaper one: who made the better case in a room. Those are not the same question, and confusing them is how teams ship the most articulate option instead of the best one.

The most expensive design decisions get made this way constantly. Not because engineers are lazy, but because the alternative looks slow and arguing feels like progress.

The move: stop arguing, start measuring

So we did the thing that felt slower and was faster. We refused to argue.

The key idea is simple. A design disagreement is almost never a disagreement about values. Everyone in that room wanted the same thing: an agent that remembers the right facts, cheaply, without falling over. It was a disagreement about predictions. Each engineer was predicting how their design would behave on a workload none of them had run yet. A prediction you can’t test is just a strongly held guess, and three strongly held guesses don’t add up to a decision. They add up to a meeting.

The fix isn’t to find a better arguer. It’s to turn the predictions into something you can check.

So instead of one team picking a winner on a whiteboard, all three designs got built, small, just enough to run, and all three got pointed at the exact same set of scenarios. Same conversations. Same facts to remember. Same recall questions, forty turns later. Same measurement of cost, latency, and whether the fact actually came back. The argument didn’t get resolved. It got dissolved, because once the three designs ran the same gauntlet, two of them visibly fell down and there was nothing left to argue about.
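The shape of that gauntlet can be sketched in a few lines. This is a minimal illustration, not our actual harness: the `Memory` interface, the `Scenario` fields, and both toy designs (including the deliberately weak `LastTurnOnly`) are hypothetical names invented for the example. The part that matters is that every design is fed the same turns and asked the same question.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Memory(Protocol):
    """Hypothetical interface every candidate design must implement."""
    def observe(self, turn: str) -> None: ...
    def recall(self, question: str) -> str: ...


@dataclass(frozen=True)
class Scenario:
    name: str
    turns: tuple[str, ...]  # conversation turns, fed in order
    question: str           # recall question asked after the last turn
    expected: str           # substring a correct answer must contain


class FlatLog:
    """Toy flat log: keeps every turn and hands all of it back."""
    def __init__(self) -> None:
        self.turns: list[str] = []

    def observe(self, turn: str) -> None:
        self.turns.append(turn)

    def recall(self, question: str) -> str:
        return " ".join(self.turns)


class LastTurnOnly:
    """Deliberately weak stand-in, included only so something fails."""
    def __init__(self) -> None:
        self.last = ""

    def observe(self, turn: str) -> None:
        self.last = turn

    def recall(self, question: str) -> str:
        return self.last


def run_bakeoff(
    designs: dict[str, Callable[[], Memory]],
    scenarios: list[Scenario],
) -> dict[str, dict[str, bool]]:
    """Run every design against every scenario; no design gets its own test."""
    results: dict[str, dict[str, bool]] = {}
    for design_name, make in designs.items():
        results[design_name] = {}
        for scenario in scenarios:
            memory = make()  # fresh instance so scenarios cannot leak state
            for turn in scenario.turns:
                memory.observe(turn)
            answer = memory.recall(scenario.question)
            results[design_name][scenario.name] = scenario.expected in answer
    return results


# One fact, then forty turns of filler, then the recall question.
early_fact = Scenario(
    name="early_fact",
    turns=("My dog is named Biscuit.",)
    + tuple(f"Filler turn {i}." for i in range(40)),
    question="What is my dog's name?",
    expected="Biscuit",
)

results = run_bakeoff(
    {"flat_log": FlatLog, "last_turn_only": LastTurnOnly},
    [early_fact],
)
print(results)
```

A real harness would also record cost and latency per scenario. The pass/fail check here only covers whether the fact came back.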

Figure: Three engineers each predict that their memory design wins; instead of debating, all three specs run against one identical scenario set, and the measured results fold two designs and leave one standing.

The whiteboard asks which design sounds best. The scenario set asks which design did best. Only the second question has an answer you can trust on a Tuesday when everyone’s tired.

Why identical scenarios are the whole trick

There’s a weaker version of this that looks the same and doesn’t work, and it’s worth spelling out, because it’s the one most teams reach for when they try to “test it first.”

The weak version: each engineer goes off and benchmarks their own design. The vector-store author runs it on the queries that flatter retrieval. The flat-log author tests it on short conversations where re-reading everything is free. The structured-table author picks the cases where facts are clean and well-typed. Everyone comes back with green numbers. Everyone’s design “won” on the test they designed. You’ve now turned three opinions into three benchmarks, which is worse, because a benchmark wearing a number looks like evidence even when it’s just the same bias with a decimal point.

A measurement only resolves an argument if all sides face the identical test. The instant each design gets graded on its own home turf, you’re back to arguing, just with charts.

A benchmark you wrote to make your design look good is not a measurement. It’s an opinion with error bars.

So the discipline is unglamorous and absolute. One scenario set, written before anyone’s design is favored, by someone who isn’t trying to win. The same hostile cases for all three: the conversation that runs long enough to overflow context, the fact mentioned once and needed much later, the near-duplicate facts that a similarity search confuses, the user who corrects themselves and expects the correction to stick. Every design meets every case. Nobody grades their own homework.
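One way to make “identical” checkable rather than promised is to freeze the scenario set and fingerprint it, so every result can be tied back to the same test. The sketch below is an assumption about how that could look, not a record of what we ran: the case names, the `fingerprint` helper, and `check_same_test` are all hypothetical.

```python
import hashlib
import json

# The four hostile cases, written down once, before any design is favored.
SCENARIOS = [
    {"case": "context_overflow",
     "description": "conversation runs long enough to overflow context"},
    {"case": "single_mention",
     "description": "fact mentioned once and needed much later"},
    {"case": "near_duplicates",
     "description": "similar facts that a similarity search confuses"},
    {"case": "self_correction",
     "description": "user corrects themselves; the correction must stick"},
]


def fingerprint(scenarios: list[dict]) -> str:
    """Hash a canonical JSON form so any edit to the set changes the hash."""
    canonical = json.dumps(scenarios, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


FROZEN = fingerprint(SCENARIOS)


def check_same_test(scenarios_used: list[dict]) -> None:
    """Reject results produced against anything but the frozen set."""
    if fingerprint(scenarios_used) != FROZEN:
        raise ValueError("results were produced against a different scenario set")


# A design author who quietly drops the case that hurts them gets caught.
home_turf = [s for s in SCENARIOS if s["case"] != "near_duplicates"]
print(fingerprint(home_turf) == FROZEN)
```

The hash does nothing clever. It just makes grading your own homework visible.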

That single constraint, identical scenarios, written by a disinterested hand, is what converts a debate into a decision. Drop it, and you don’t have a bake-off. You have three engineers each demoing on the laptop where it works.

What the numbers did that the meeting couldn’t

When the three designs ran the same gauntlet, two of them folded on their own, not because anyone argued them down, but because the scenarios showed something no diagram could.

I’ll keep the specifics illustrative, because the shape is the lesson, not the digits. Say the flat log recalled almost everything but cost more with every turn, because re-reading the whole history is fine at turn five and ruinous at turn fifty. Say the vector store was cheap and fast but confused the near-duplicate facts: when a user mentioned two similar things, it sometimes returned the wrong one, and “sometimes returns the wrong fact” is a quiet way to lose a user’s trust. Say the structured table held up on both recall and cost: it remembered the corrected fact, it didn’t confuse the duplicates, and its cost stayed flat as the conversation grew, because it only ever wrote down what mattered.
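The “fine at turn five, ruinous at turn fifty” curve is easy to see with made-up numbers. The two constants below are assumptions chosen for round arithmetic, not measurements: 100 tokens per turn for the flat log, and a fixed 200-token table for the structured design.

```python
TOKENS_PER_TURN = 100  # assumed average size of one conversation turn
TABLE_TOKENS = 200     # assumed fixed size of the structured table


def flat_log_cost(turn: int) -> int:
    """Tokens read at a given turn when the whole history is re-read."""
    return turn * TOKENS_PER_TURN


def table_cost(turn: int) -> int:
    """Tokens read at a given turn when only the table is read."""
    return TABLE_TOKENS


def cumulative(cost_fn, turns: int) -> int:
    """Total tokens read over a conversation of the given length."""
    return sum(cost_fn(t) for t in range(1, turns + 1))


print(flat_log_cost(5), flat_log_cost(50))  # 500 5000
print(cumulative(flat_log_cost, 50))        # 127500
print(cumulative(table_cost, 50))           # 10000
```

Under these assumptions the flat log's per-turn cost grows linearly, so its total grows with the square of the conversation length, while the table's total grows linearly. The exact constants don't matter. The bend in the curve does.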

None of that was visible from the whiteboard. All of it was obvious from the scenarios. The vector store’s confusion wasn’t a flaw anyone could have argued into existence in a meeting; it only appeared when a real conversation gave it two facts it couldn’t tell apart. The flat log’s cost curve wasn’t a number anyone had in the room; it only showed up when a scenario ran long enough to bend it. The measurement didn’t pick the design that sounded best. It picked the one that behaved best on the cases that matter, and the other two folded without a single voice raised.
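Here is a toy version of the kind of failure a scenario surfaces. It is a sketch under loud assumptions: a word-overlap (Jaccard) score stands in for real embeddings, the `retrieve` helper and the `flight_day` key are invented for the example, and a production vector store will behave differently in detail. What carries over is the mechanism: similarity to the question is not the same thing as being the current fact.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard similarity: shared words divided by all words."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb)


def retrieve(facts: list[str], question: str) -> str:
    """Return the stored fact most similar to the question."""
    return max(facts, key=lambda fact: overlap(fact, question))


# The user states a fact, then corrects it.
facts = ["My flight is on Tuesday.", "Actually, my flight moved to Thursday."]
question = "When is my flight?"

# The stale fact scores 0.5, the correction 0.25, so the stale one wins.
retrieved = retrieve(facts, question)
print(retrieved)

# A structured table keyed by meaning overwrites instead of accumulating.
table: dict[str, str] = {}
table["flight_day"] = "Tuesday"
table["flight_day"] = "Thursday"  # the correction replaces the old value
print(table["flight_day"])
```

Nobody would have drawn this on a whiteboard, because on a whiteboard the retrieval box just says “retrieve.” It takes an actual pair of near-duplicate facts to make the wrong one come back.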

Figure: Each design's measured behavior across the identical scenarios: the flat log's per-turn cost climbs as the conversation grows, the vector store confuses near-duplicate facts, and the structured table holds recall and cost flat, so it's the one left standing.

Here’s what’s easy to miss: the winner could have lost. That’s the point. If the structured table had quietly dropped the corrected fact, the same gauntlet would have folded it, and we’d have shipped the vector store with a clear conscience. The method doesn’t have a favorite. It just refuses to let the most fluent argument stand in for the most tested one. Run all three specs against identical scenarios, and the numbers fold two of them, and you never have to take it personally that yours was one of the two.

Why this is cheaper, not slower

The objection writes itself: building three designs is more work than picking one. Of course it is. We did it anyway, and it saved time, which only sounds like a paradox until you price the alternative.

A design decision made by argument isn’t free. It’s expensive on a delay. You pick the option that argued best, you build it for real, and then, weeks later, in front of an actual workload, you discover the thing a scenario would have shown you in an afternoon. Now the cost isn’t a meeting. It’s a rewrite, plus everything that was built on top of the wrong foundation, plus the credibility you spend explaining why the sophisticated choice didn’t pan out. The bill for a bad design decision is always paid later, with interest, and the interest is the part nobody budgets for.

The bake-off pays the bill early and small. Three throwaway implementations and one honest scenario set cost a few days. A load-bearing design choice that turns out wrong costs a quarter. We pay the few days on purpose, the same way you’d rather spend an hour on a survey than rebuild a house on a foundation that’s off by a foot.

Measuring three designs looks slower than picking one. Rebuilding the wrong one is what’s actually slow.

And there’s a compounding return the meeting never gives you. After the bake-off, you don’t just have a decision. You have the reason, written down, reproducible, with the scenarios attached. When someone asks six months later why the agent uses a structured table, the answer isn’t “we discussed it and that’s what we landed on.” It’s “here are the cases, here’s how each design behaved, run it yourself.” A decision you can re-run is a decision that doesn’t have to be re-argued every time a new engineer joins and finds the choice surprising.

When you still have to argue

I want to be honest about the edge, because a method that claims to settle everything is selling something.

Not every decision has a scenario you can write. Some choices are about values, or strategy, or a bet on where the world is going, and you can’t measure your way to those, because the data doesn’t exist yet. Whether to build for the customer you have or the customer you want is not a bake-off. Those decisions stay human, and they should.

But the trap is calling a measurable question unmeasurable so you can keep arguing it. Most “we just have to make a judgment call” design fights aren’t judgment calls at all. They’re predictions about behavior, dressed up as taste, defended in a meeting because measuring felt like too much trouble. The discipline is knowing the difference: if there’s a scenario that would change someone’s mind, you don’t have a debate. You have an experiment you haven’t run yet, and the meeting is just procrastination with snacks.

The turn

The three engineers who couldn’t agree are still on the team, and not one of them lost an argument that day, because there wasn’t one to lose. That’s the part that matters more than the memory layer.

When a decision comes down to who argued best, somebody walks out having lost, and they carry it. The next design fight gets a little more personal, a little more about winning than about being correct, because last time winning is what got rewarded. But when the scenarios decide, nobody loses to anybody. The vector-store author didn’t get out-talked by a colleague; they got out-measured by a workload, and that’s a thing you can shrug off and learn from instead of resent. The method took the ego out of the room by taking the room out of the decision. What’s left is three engineers who trust the next call a little more, because they’ve seen that the call gets made by the work, not by the loudest correct-sounding voice in it.

That’s the quiet thing a bake-off buys that no diagram does. Not just the right design, but a team that doesn’t have to win against each other to get to it.


That’s what we’re building at Apollo Space: a way of working where the hardest decisions get settled by what actually happens, not by who explains it best, so the smartest person in the room never has to be the one who talks the most. If you’ve ever watched the better-sounding design beat the better one, you already know why we’d rather run the scenarios than have the meeting.
