Engineering

Execution is local, coordination is networked

The safest agent control plane has no "run this" endpoint, by design.

Apollo Space Research

Apollo Space

· 9 min read

There is a request our control plane will never serve, and it isn’t blocked by a permission check. It’s missing. Search the whole API surface, every route, every handler, every version, and you will not find an endpoint that means “run this command on that machine.” Not disabled. Not gated behind an admin scope. Absent, the way a door you never cut into a wall is absent.

People assume that’s a feature we haven’t built yet. It’s the one we deliberately refused to build.

This post is about why a system that coordinates dozens of agents across several machines should never be able to tell a machine what to do, and what you build instead when you take that power off the table on purpose.

The naive control plane: a server that gives orders

The obvious way to run a fleet of agents is to put a brain in the middle. One coordinator, reachable over the network, holding the list of every worker and every machine. A task comes in. The coordinator picks a worker, opens a connection, and sends the order: you, over there, run this.

It’s the shape every orchestration tutorial draws, and for a weekend it’s wonderful. You get one place to look, one place to click, one throat to choke. The coordinator knows everything and commands everything.

Now follow what that coordinator actually is. It is a server, exposed to a network, whose legitimate job is to make other machines execute arbitrary work on command. That is not a description of an orchestrator. That is the textbook definition of remote code execution, one of the most severe classes of vulnerability there is, except you built it on purpose and called it a feature.

The danger isn’t hypothetical, and it isn’t about a bug in your code. The danger is the capability. The moment a network endpoint can cause code to run on a machine it doesn’t live on, the authentication layers in front of it become the only thing standing between an attacker and a shell on your whole fleet. One stolen token, one confused-deputy request, one agent talked into calling the wrong endpoint, and the blast radius is every machine the coordinator can reach. You didn’t add a risk. You built a risk and made it the center of the architecture.

The right response is not a thicker wall around the dangerous door. It is to not cut the door.

The rule: execution is local, coordination is networked

So we split the system along the one seam that matters. Doing work and agreeing on work are different jobs, and they get different rules.

Execution is local. A machine runs only what its own configuration tells it to run. The decision to start a worker, claim a task, or spawn a sub-agent is made on the machine itself, by a supervisor reading its own local config, and it is carried out as a plain local process, never as a command that arrived over a wire. Nothing on the network can reach in and make a machine do work. The machine decides, the machine acts.
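To make "local" concrete, here is a minimal sketch of a supervisor in that spirit. It is an illustration under stated assumptions, not our production code: the `supervisor.json` file name, the `workers` and `argv` keys, and both function names are hypothetical. The point is what the supervisor accepts as input, which is a file on its own disk and nothing else.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def load_local_config(path: Path) -> list:
    """Read worker definitions from a file on this machine's own disk."""
    return json.loads(path.read_text())["workers"]


def start_workers(config_path: Path) -> list:
    """Start every worker named in the local config as a plain local process.

    Note what is absent: no socket, no HTTP handler, no parameter that
    carries a command from another machine. The only input is a local file.
    """
    procs = []
    for worker in load_local_config(config_path):
        procs.append(
            subprocess.Popen(worker["argv"], stdout=subprocess.PIPE, text=True)
        )
    return procs


# Hypothetical config for illustration: one worker that prints a line and exits.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "supervisor.json"
    cfg.write_text(json.dumps({
        "workers": [
            {"name": "echo", "argv": [sys.executable, "-c", "print('worker up')"]}
        ]
    }))
    # Wait for each worker and collect what it printed.
    outputs = [p.communicate()[0].strip() for p in start_workers(cfg)]

print(outputs)
```

Changing what this machine runs means changing the file on this machine. There is no code path here that a remote caller could use instead.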

Coordination is networked. The shared layer, the part every machine can see, is a record of what’s happening, not a lever that makes things happen. Machines publish to it: heartbeats, outcomes, “I’ve started this,” “I’ve finished that.” Machines read from it to stay in sync. But when it comes to causing anything, data flows in one direction only. A machine writes its own status outward; nothing writes commands inward.

The network can see everything and command nothing.

That’s the whole rule, and it inverts the naive design exactly. In the order-giving model, the network’s job is to cause execution. In ours, the network’s job is to observe it. Execution is local, coordination is networked, and the line between them is the line an attacker would need to cross, drawn so there’s no crossing to make.
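One way to picture the shared layer is as an interface with two verbs and no third. The sketch below is an in-memory stand-in, assuming hypothetical names (`StatusBoard`, `StatusEvent`, `publish`, `read`); a real deployment would sit on a shared store, but the shape of the surface is the part that matters.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StatusEvent:
    machine_id: str
    kind: str      # e.g. "heartbeat", "started", "finished"
    detail: str
    at: float


@dataclass
class StatusBoard:
    """Shared record of what is happening. Append and read; nothing else."""
    _events: list = field(default_factory=list)

    def publish(self, machine_id: str, kind: str, detail: str = "") -> None:
        # A machine writes its own status outward.
        self._events.append(StatusEvent(machine_id, kind, detail, time.time()))

    def read(self, since: float = 0.0) -> list:
        # Any machine may read to stay in sync.
        return [e for e in self._events if e.at >= since]


board = StatusBoard()
board.publish("machine-a", "heartbeat")
board.publish("machine-a", "started", "task-17")
board.publish("machine-b", "finished", "task-12")

kinds = [(e.machine_id, e.kind) for e in board.read()]

# The public surface is exactly two verbs: no run, dispatch, or send_command.
public_api = sorted(n for n in dir(StatusBoard) if not n.startswith("_"))
print(public_api)
```

Nothing stops a reader from acting on what it sees, but that decision is made locally by the reader. The board itself has no method that addresses a machine.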

Figure: Two control-plane designs side by side. On the left, a networked coordinator sends run-commands into each machine, so any stolen credential becomes remote code execution across the fleet. On the right, each machine executes only its own local config, and the shared network layer can read status but can never send a command inward.

How machines agree without anyone giving orders

Here’s the question that sounds like a wall: if no machine can command another, how do two machines avoid doing the same job at once? A fleet that can’t be ordered around sounds like a fleet that trips over itself.

The naive answer reaches straight back for a boss: have the coordinator assign each task to exactly one worker. And there it is again. To assign is to command, and to command is to cut the door we just refused to cut. The instinct to add a dispatcher is the instinct that rebuilds the remote-code-execution surface one convenience at a time. Resist it, and you have to find another way for machines to agree.

The other way is a claim, not an assignment. The shared layer holds the list of available work. A machine that wants a task doesn’t wait to be told; it reaches for the task and tries to claim it. The claim is atomic: the shared store lets exactly one machine win. Two supervisors grab for the same task in the same instant, one gets it, and the other is simply told “taken” and moves on to the next.

Read that carefully, because the direction of control flipped. Nobody pushed work onto a machine. The machine pulled work toward itself, and the only thing the network did was act as a referee for who got there first. Coordination happened, exactly one worker runs the task, and not a single order was given. The shared layer never said “you, do this.” It only ever said “that one’s taken.” Pull, not push, is what lets execution stay local while coordination stays networked.
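A common way to get that "exactly one winner" property is a conditional update: set the owner only where the owner is still empty, and check whether a row actually changed. The sketch below uses SQLite as a stand-in for the shared store. The table layout and the names `try_claim` and `claim_next` are assumptions for illustration; the calls here run one after another, and in a real shared database it is the single conditional statement that keeps two simultaneous claimants from both succeeding.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Stand-in for the shared store: a task list where owner starts as NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, owner TEXT)")
    conn.executemany(
        "INSERT INTO tasks (id, owner) VALUES (?, NULL)",
        [("task-1",), ("task-2",)],
    )
    conn.commit()
    return conn


def try_claim(conn: sqlite3.Connection, task_id: str, machine_id: str) -> bool:
    """Claim a task only if nobody owns it. One statement, so one winner."""
    cur = conn.execute(
        "UPDATE tasks SET owner = ? WHERE id = ? AND owner IS NULL",
        (machine_id, task_id),
    )
    conn.commit()
    # rowcount is 1 if we took the task, 0 if someone else already had it.
    return cur.rowcount == 1


def claim_next(conn: sqlite3.Connection, machine_id: str):
    """Pull loop: the machine reaches for work; the store only referees."""
    rows = conn.execute(
        "SELECT id FROM tasks WHERE owner IS NULL ORDER BY id"
    ).fetchall()
    for (task_id,) in rows:
        if try_claim(conn, task_id, machine_id):
            return task_id
    return None


store = make_store()
first = try_claim(store, "task-1", "machine-a")   # machine-a gets there first
second = try_claim(store, "task-1", "machine-b")  # machine-b is told "taken"
fallback = claim_next(store, "machine-b")         # and moves on to the next task
print(first, second, fallback)
```

The store's only answers are "yours" and "taken." At no point does it name a machine and hand it work.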

This is also why the failure modes are gentle. A machine that goes dark doesn’t strand a queue, because it was never holding an order it now can’t follow. It simply stops claiming, and because a claim has to be renewed to stay valid, its un-renewed claims free up for whoever’s still awake. There’s no coordinator to fail, because there’s no coordinator. The fleet degrades one worker at a time instead of all at once behind a single brain.
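The "un-renewed claims free up" behavior can be sketched as a lease: a claim carries an expiry time, the holder pushes it forward while it is alive, and anyone may take the task once the expiry has passed. Everything below is a hypothetical illustration (the `LeaseTable` name, the 30-second TTL, the explicit `now` argument used to keep the example deterministic), and as an in-memory table it shows the logic only, not cross-machine atomicity.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseTable:
    ttl: float                                   # seconds a claim stays valid
    _leases: dict = field(default_factory=dict)  # task_id -> (owner, expires_at)

    def claim(self, task_id: str, machine_id: str, now: float) -> bool:
        holder = self._leases.get(task_id)
        if holder is not None and holder[0] != machine_id and holder[1] > now:
            return False  # another machine holds a live lease: "taken"
        self._leases[task_id] = (machine_id, now + self.ttl)
        return True

    def renew(self, task_id: str, machine_id: str, now: float) -> bool:
        holder = self._leases.get(task_id)
        if holder is None or holder[0] != machine_id or holder[1] <= now:
            return False  # nothing to renew, or the lease already lapsed
        self._leases[task_id] = (machine_id, now + self.ttl)
        return True


leases = LeaseTable(ttl=30.0)
a_claims = leases.claim("task-1", "machine-a", now=0.0)    # lease runs to 30.0
b_blocked = leases.claim("task-1", "machine-b", now=10.0)  # still held by a
a_renews = leases.renew("task-1", "machine-a", now=20.0)   # lease runs to 50.0

# machine-a goes dark here and stops renewing.
b_still_blocked = leases.claim("task-1", "machine-b", now=45.0)  # lease live
b_takes_over = leases.claim("task-1", "machine-b", now=51.0)     # lease lapsed
print(a_claims, b_blocked, a_renews, b_still_blocked, b_takes_over)
```

Recovery needs no one to notice the failure and reassign anything. The dead machine's silence is the signal, and the next claimant does the rest.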

Figure: A claim loop with no dispatcher. The shared work list offers an available task, and two machines reach for it at the same moment. The atomic store lets one win and tells the other it is taken. The winner runs the task locally and publishes its outcome back to the shared layer, which never sends a command in return.

Why “just lock it down” is the wrong fix

The reflex of every security-minded engineer who hears “no run endpoint” is to argue the other side. Keep the dispatcher, just secure it properly. Strong auth, scoped tokens, an allowlist, rate limits, an audit log. We could do all of that. It’s good practice. It is also the wrong altitude for this problem.

Here’s the distinction that matters. Securing the order-giving endpoint is a bet that your controls never fail: that no token leaks, no agent is ever socially engineered into calling the wrong route, no dependency ships a flaw, no future teammate quietly widens a scope to ship a feature on a Friday. Removing the endpoint is a bet that you can live without the capability. The first bet you have to win every single day, forever, across everyone who will ever touch the system. The second bet you win once, in the architecture, and it stays won.

There’s a saying in security that the only truly safe code is the code you didn’t write. The control-plane version is sharper: the only endpoint that can’t be exploited is the one that doesn’t exist. A door defended by a guard is only as safe as the guard’s worst day. A wall with no door in it has no worst day.

This is why we treat “could we add a dispatch endpoint for convenience?” as a question with a permanent answer. Not no for now. No, structurally, the way a load-bearing wall is not a place you put a window because the view would be nice. The convenience it would buy is real. The capability it would hand to anyone who ever compromised one credential is the entire fleet, and that trade isn’t close.

The turn: the safest power is the one you chose not to hold

You can usually tell how much someone has run real infrastructure by how nervous they get around their own god-mode buttons.

The junior instinct is to build the powerful lever and feel proud of it: look how much this one endpoint can do. The instinct that comes from having been paged at 3am because a powerful lever got pulled by the wrong hand is the opposite. You start to see every capability you hold as a capability someone can eventually take from you. And the most durable way to make sure a power is never abused is to not have it in the first place, not for yourself, not for your future self, not for the attacker wearing your future self’s credentials.

That’s the discipline under the missing endpoint. It isn’t that we don’t trust our own controls; it’s that we refuse to make the whole fleet depend on them being perfect forever. We gave up the satisfying central button so that no stolen key ever unlocks the building. A system you can’t fully command from the outside is a system that can’t be fully commanded against you from the outside, and on a platform meant to run a real company’s real operations, with agents touching real work, that trade is the only one that lets you sleep.

The hard part of this was never the code. The atomic claim is a few lines. The hard part was the restraint: looking at a power that would have made a dozen things easier and deciding, on purpose, never to build it.


That’s the line we hold at Apollo Space: execution stays on the machine that owns it, coordination stays a thing machines observe rather than obey, and the most dangerous button on the board is the one we never wired up. If you’ve ever stared at an internal tool that could do anything and felt your stomach drop a little, you already understand why the safest control plane is the one that politely refuses to take orders.
