Latency is a product decision, not an engineering one

A deep agent that thinks for three minutes can feel like a coworker or a frozen tab, and the difference is never the speed. It's whether you told the user what the wait is for.

Apollo Space Research

Ask a shallow agent a question and the answer lands in a second. Ask a deep one, the kind that reads your company brain, pulls a contract, checks the calendar, drafts the reply, and double-checks its own work, and the honest answer can take three minutes. Three minutes is forever in a chat box. It is nothing in a workday. The same three minutes feels like a coworker thinking, or a frozen tab that needs killing, and the wait is identical in both cases.

What changed wasn’t the latency. It was what the screen said while you waited.

That’s the whole post. The wait is not the problem. The unexplained wait is the problem. Latency is a product decision, not an engineering one, and most teams lose the moment by treating a slow-but-deep agent like a slow-but-shallow one, and reaching for the spinner.

The reflex everyone has, and why it fails

The naive move is to make it faster. A request takes three minutes; that feels too long; so the engineering instinct kicks in and we go optimize. Cache the retrieval. Parallelize the tool calls. Swap the big model for a smaller one on the cheap legs. Shave the chain.

It’s the right instinct in the wrong place. You can shave a few seconds off a multi-specialist chain, and you should. But a genuinely deep answer, one that reads four sources, reasons across them, drafts something real, and checks it, has a floor. The floor is the work. You cannot make “read the contract, find the renewal date, draft the email, verify the date is right” instant, because none of those steps is filler. Every second is a step the user actually wanted done. Optimize all you like; the floor stays a floor.

So the team that only optimizes hits a wall and concludes the product is too slow to ship. They were measuring the wrong thing. They measured the duration. The user was never reacting to the duration.

Here’s the experiment that settles it. Take two identical three-minute waits. In front of the first, put a spinner. In front of the second, put a line of text that updates: Reading the contract… found the renewal date… drafting the reply… checking the date against the calendar. Same three minutes. Same answer. The first feels broken by second forty. The second feels like watching someone competent work, and people will sit through all three minutes without a flicker of doubt.

The bottleneck never disappears. It just moves, from how long the work takes to how well you narrate it.

Three honest shapes for a wait

Once you accept that the wait is a design surface and not a bug, the question stops being “how do we make it fast” and becomes “what kind of wait is this, and what does an honest one look like.” There are three shapes, and the entire craft is picking the right one for the job in front of you.

The naive version uses one shape for everything: the spinner, on every request, regardless of how long the work will take. A spinner is a promise that says this is almost done. Put it in front of a two-second answer and it’s fine. Put it in front of a three-minute one and it lies. By second thirty the user has been told “almost done” for twenty-eight seconds too long, and a promise that keeps breaking is worse than no promise at all. The spinner didn’t fail because it was ugly. It failed because it told the wrong story about the length of the wait.

The three honest shapes map to three honest answers to one question: how long is this, and can you watch it happen?

A request arrives and the agent first estimates how long the work will take. A short job gets an instant answer. A medium job streams its reasoning live so the user can watch it work. A long job acknowledges immediately, runs in the background, and notifies the user the moment it is done.

Acknowledge instantly is for the work you can’t show and can’t rush. The user asks for something that will take long enough that even narrating it would be a strange thing to stare at: a research run, a multi-document synthesis, a job that touches a dozen sources. You don’t make them watch. You answer in under a second with the only honest thing you can say: “On it. I’ll have this for you in a few minutes, and I’ll ping you when it’s done.” Then you hand the work to the background and you keep your word. The acknowledgment costs nothing and changes everything, because it converts an open-ended wait into a closed one. The user goes and does something else. The agent comes back. That’s not a degraded experience; for long work, it’s the only good one.

Stream the reasoning is for the medium wait, the ten-to-ninety-second chain where the steps are legible and watching them is reassuring rather than tedious. This is the shape that does the most work, because it turns dead time into evidence. Every step you surface (“searching the brain,” “found three relevant notes,” “drafting,” “checking the figure”) is simultaneously a progress bar and a receipt. The user isn’t just waiting; they’re watching the thing be done correctly, which is exactly the anxiety a spinner leaves unaddressed. And when the answer finally lands, they already trust it, because they watched it get built.
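A minimal sketch of that shape in Python, assuming the chain can be expressed as an ordered list of labeled steps. The names here (`Progress`, `Answer`, `run_with_narration`) are hypothetical and not part of any real agent framework, and the steps in the usage note are stand-ins for real tool calls. The point is only the ordering: each label goes out before its step runs, so the screen is never silent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple, Union


@dataclass
class Progress:
    """One user-visible line of narration, emitted before its step runs."""
    message: str


@dataclass
class Answer:
    """The final result, emitted once every step has finished."""
    text: str


# A step is a label plus a callable that mutates shared state (hypothetical shape).
Step = Tuple[str, Callable[[Dict[str, str]], None]]


def run_with_narration(
    steps: List[Step],
    render: Callable[[Dict[str, str]], str],
) -> Iterator[Union[Progress, Answer]]:
    """Run steps in order, yielding each label first, then the final answer."""
    state: Dict[str, str] = {}
    for label, step in steps:
        yield Progress(label)  # tell the user what is about to happen
        step(state)            # then do the actual work
    yield Answer(render(state))
```

A caller would loop over the generator and render each `Progress` as it arrives, for example with steps labeled "Reading the contract" and "Drafting the reply", and swap in the `Answer` at the end. Because it is a generator, nothing runs ahead of what the user has been told.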

Answer instantly is the easy case, and the trap is forgetting it exists. Not every request is deep. “What’s on my calendar today” should never stream its reasoning at you like it’s solving a hard problem. That’s theater, and theater erodes trust as fast as a broken spinner. A shallow question deserves a shallow, instant answer. The skill is knowing which question you’re holding before you pick the shape.

The decision happens before the work starts

Here’s the part that’s easy to get backwards. The latency shape is not something you choose after the work is slow. It’s something you choose before the work begins, because by the time you know it was slow, the user has already been staring at the wrong thing.

The naive system runs the work, notices it’s taking a while, and then scrambles to show something. Too late. The first five seconds set the user’s expectation for the whole wait, and you spent them with a spinner that was about to break its promise. You can’t recover a frozen-tab feeling by adding narration at second sixty. The story has to start at second zero.

So the deep version makes one cheap decision up front: roughly how long is this going to take, and which of the three shapes fits? That estimate doesn’t have to be precise. It has to be early. A request to read one note and answer is short. A request that will fan out to three specialists is medium-to-long. A research run is long. You don’t need to know it’ll be 47 seconds versus 71; you need to know it’s the streaming kind and not the instant kind, and you need to know it before the user’s first second of waiting is spent.

Get the classification right and the wait explains itself from the first frame. Get it wrong and no amount of polish later saves it.
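One way to sketch that up-front decision in Python. Everything numeric here is an assumption for illustration: the per-step cost is invented, and the cut-offs simply reuse the ten-to-ninety-second band described for the streaming shape. `WaitShape`, `estimate_seconds`, and `choose_shape` are hypothetical names.

```python
from enum import Enum


class WaitShape(Enum):
    ANSWER_INSTANTLY = "answer instantly"
    STREAM_REASONING = "stream the reasoning"
    ACKNOWLEDGE_AND_NOTIFY = "acknowledge, run in background, notify"


# Assumed values; tune them against real traces.
SECONDS_PER_STEP = 8.0          # invented average cost of one planned step
STREAM_FROM_SECONDS = 10.0      # below this, just answer
BACKGROUND_FROM_SECONDS = 90.0  # above this, acknowledge and notify


def estimate_seconds(planned_steps: int) -> float:
    """Rough estimate from the plan size; being early matters more than being exact."""
    return planned_steps * SECONDS_PER_STEP


def choose_shape(planned_steps: int) -> WaitShape:
    """Pick the wait shape before any work starts."""
    seconds = estimate_seconds(planned_steps)
    if seconds < STREAM_FROM_SECONDS:
        return WaitShape.ANSWER_INSTANTLY
    if seconds <= BACKGROUND_FROM_SECONDS:
        return WaitShape.STREAM_REASONING
    return WaitShape.ACKNOWLEDGE_AND_NOTIFY
```

With these assumed numbers, a one-step request is answered instantly, a four-step request streams, and a twenty-step research run goes to the background. The estimator is deliberately crude; what matters is that it runs before the user's first second of waiting.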

Two ways to handle a three-minute request. On the left, the system runs first and shows a spinner that quietly promises almost done, so the wait reads as a frozen tab. On the right, the system classifies the length first, then either streams the reasoning live or acknowledges instantly and notifies on completion, so the same wait reads as a coworker working.

Where the shapes break, and the rule that fixes it

Even with the right three shapes, there are two failures that show up in practice, and naming them is how you avoid them.

The first is the wait that ends in silence. You told the user “I’ll ping you,” and then the background job finishes and nothing pings. Now you’ve taught them that your acknowledgment is a lie, and the next time you say “on it,” they won’t believe you. An async promise is only as good as the notification that closes it. If you can’t reliably deliver the “done,” you have no business offering the instant acknowledgment; you’d be better off making them wait where they can at least see the spinner. The close is not optional. It’s the other half of the promise.
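A sketch of both halves of that promise using Python's `asyncio`, assuming the job and the notification channel are both async callables. `acknowledge_then_notify`, `job`, and `send` are hypothetical names, and the message wording is illustrative. The design choice worth noticing is that a failed job still sends a message, so the wait is closed either way.

```python
import asyncio
from typing import Awaitable, Callable


async def acknowledge_then_notify(
    job: Callable[[], Awaitable[str]],
    send: Callable[[str], Awaitable[None]],
) -> "asyncio.Task[None]":
    """Acknowledge right away, run the job in the background, always close the loop."""
    await send("On it. I'll ping you when this is ready.")

    async def run_and_close() -> None:
        try:
            result = await job()
            message = f"Done: {result}"
        except Exception as exc:
            # A failed job still owes the user a notification.
            message = f"I couldn't finish this: {exc}"
        await send(message)

    # Return the task so the caller can hold a reference to it; the event loop
    # keeps only a weak reference to running tasks.
    return asyncio.create_task(run_and_close())
```

This sketch does not cover the case where `send` itself fails. A real system would likely need retries or a durable queue behind the notification, since that delivery is the part the promise depends on.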

The second is the stream that says nothing. Streaming the reasoning only reassures if the reasoning is legible. A live feed of “thinking… thinking… thinking…” is a spinner with extra steps: it has motion but no information, and the user learns just as fast that the motion is meaningless. The steps you surface have to be real and specific: the source you read, the draft you made, the check you ran. Suppose a chain runs eight internal steps; you don’t show eight, you show the four a human would recognize as progress. Narration is a curation problem, not a firehose.
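Curation can be as simple as a filter, sketched here in Python under the assumption that each internal step is tagged with whether a human would recognize it as progress. `InternalStep` and `narrate` are hypothetical names, and how the tag gets set is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InternalStep:
    label: str          # what happened, in plain language
    user_visible: bool  # would a human recognize this as progress?


def narrate(steps: List[InternalStep]) -> List[str]:
    """Keep only the steps marked as recognizable progress, skipping immediate repeats."""
    lines: List[str] = []
    for step in steps:
        is_repeat = bool(lines) and lines[-1] == step.label
        if step.user_visible and not is_repeat:
            lines.append(step.label)
    return lines
```

Given an eight-step chain in which plumbing such as prompt building or retries is tagged as not visible, `narrate` returns only the handful of lines worth showing. The repeat check is what keeps a feed from collapsing into the same word three times in a row.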

Both failures share one root, and it gives us the rule. The wait is not the problem. The unexplained wait is the problem, and “explained” means the user always knows three things: that you heard them, what you’re doing, and when it ends. Miss any one of those and the best engineering in the world reads as broken. Hit all three and a three-minute wait reads as competence.

This is why latency is a product decision. The engineering decides how fast the work can go. The product decides how the wait feels, and the second is what the user actually grades you on.

The turn: waiting was never the enemy

Step back from agents for a second, because this isn’t really about agents.

Think about the last time you trusted someone with something hard. You didn’t ask them to be instant. You asked a contractor to renovate a kitchen and you did not expect it done by lunch; you expected a start date, a sense of the steps, and a call when it was ready. You trusted them more for the honest timeline, not less. Depth takes time, and a person who pretends otherwise is a person you stop believing. We somehow forgot to extend that to software, because software taught us to expect everything in a tenth of a second, and so we treat any wait as a failure.

But the most valuable work an agent can do for your company is exactly the work that takes a minute: the reading-across-everything, the drafting, the checking. If we insist that all of it be instant, we get a fast agent that does shallow things, and we throw away the deep work because it didn’t fit in a spinner. The better trade is the human one: let the deep work take the time it needs, and spend real care on the wait, so a minute of thinking reads the way it should, as someone good, doing something worth waiting for.

That’s the part we keep returning to at Apollo: a coworker is allowed to think. The job isn’t to make every answer instant. It’s to make every wait honest, to tell you what the agent is doing, and to come back the moment it’s done. If you’ve ever stared at a spinner wondering whether anything was happening at all, you already know the wait was never the thing that bothered you. The silence was.
