The deny-list never converges

You can't block your way to a safe agent: there is always one more phrasing. The only fence that holds is the list of things the agent is allowed to do.

Apollo Space Research

Apollo Space

January 4, 2026 · 11 min read

A team ships an agent that can manage a calendar and answer questions about a customer. To keep it safe, they write a rule: never delete anything. The next week a user types “clear my Friday,” and the agent, helpfully, cancels four meetings. So they add a rule against “clear.” Then someone says “wipe.” Then “nuke my afternoon.” Then a polite “could you go ahead and remove everything after 3.” Each one ships a new line in the blocklist. Each one was already in production before anyone thought to block it.

That list will grow forever and never finish. There is always one more phrasing.

You can’t block your way to a safe agent. The only fence that holds is the list of things the agent is allowed to do.

This post is about why the first model (guard the agent by blocking bad inputs) is a treadmill that never reaches the end, and why the second model (grant the agent a small set of capabilities and let it touch nothing else) is the only one that actually closes the hole. The difference isn’t a better blocklist. It’s the opposite of a blocklist.

The naive way: enumerate the bad

The instinct is reasonable, and it’s the one almost everyone reaches for first. An agent reads natural language and acts on it. Natural language is dangerous. So you put a filter in front of it: catch the prompts that ask for something destructive, catch the inputs that try to make it leak data, catch the phrasings that smell like an attack, and refuse those. Everything else passes through.

This is a deny-list: a list of the bad things, and a rule that everything not on the list is fine.

It feels like security because it does something visible. You can demo it. You type “delete the database” and the agent refuses, and everyone in the room nods. The blocked input is right there on screen, caught. What’s not on screen is the shape of the space you’re trying to cover.

Here is the pain, and it’s the same pain in every system built this way. The set of things you actually want the agent to do is small and finite. The set of unsafe inputs is infinite. Not large. Infinite. Natural language has no edge. For every phrasing you block, there are a hundred you didn’t, and a thousand more that don’t exist yet because nobody has typed them. “Delete” is blocked, but “remove” isn’t. “Remove” gets blocked, but “get rid of” isn’t. You block English, and the input arrives in Portuguese. You block the imperative, and it arrives as a question (“what would happen if you cleared my Friday?”), and the model, being helpful, shows you by doing it.
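To make the gap concrete, here is a minimal sketch of a keyword deny-list. The blocklist contents, the `is_blocked` helper, and the sample prompts (including the Portuguese one) are illustrative assumptions, not a real product's filter.

```python
import re

# Hypothetical blocklist, grown one incident at a time.
BLOCKED_WORDS = {"delete", "clear", "wipe", "nuke"}


def is_blocked(prompt: str) -> bool:
    """Return True if any whole word in the prompt is on the blocklist."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return any(word in BLOCKED_WORDS for word in words)


prompts = [
    "delete everything on Friday",          # caught: "delete" is listed
    "nuke my afternoon",                    # caught: "nuke" is listed
    "could you remove everything after 3",  # missed: "remove" is not listed
    "get rid of my Friday meetings",        # missed: no listed word at all
    "apaga todas as reuniões de sexta",     # missed: the request is in Portuguese
]

for prompt in prompts:
    verdict = "BLOCKED" if is_blocked(prompt) else "passed"
    print(f"{verdict:8} {prompt}")
```

Adding “remove” to the set would catch the third prompt and leave the fourth and fifth untouched, which is the treadmill in miniature.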

The deny-list is a fence around an infinite field, built one post at a time, by hand, after each animal has already walked through.

Why it never converges

There’s a precise reason this approach can’t win, and it’s worth saying plainly because it’s not a matter of trying harder.

A deny-list converges only if the thing you’re enumerating is finite. Spam filters work (imperfectly, but they work) because spam has economic structure: there is a bounded number of scams worth running, and the patterns repeat. A deny-list of dangerous intentions expressed in human language has no such structure. Intent is open-ended. The same destructive action can be requested in unlimited ways, in any language, by accident, by malice, or by a model talking to another model that phrased it in a way no human ever would.

So the blocklist grows, and every entry is reactive. You learn the phrasing the day it causes the incident. The list is always one example behind the world, and the gap between “behind” and “caught up” never closes, because the world keeps minting new phrasings faster than you can read your own logs.

Worse, every blocked phrasing makes the agent slightly dumber at its real job. Block “remove” broadly and the agent can no longer remove a typo from a draft. The deny-list doesn’t just fail to stop the bad inputs; it starts eating the good ones, because “bad” and “good” share the same words. You end up with an agent that’s both unsafe and annoying: it lets through the attack it didn’t anticipate and refuses the ordinary request that happened to use a flagged word.
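The same effect shows up in a few lines. This sketch assumes a deliberately broad substring filter; `BLOCKED_SUBSTRINGS` and `is_refused` are hypothetical names chosen for illustration.

```python
# Hypothetical "broad" blocklist: match the flagged terms anywhere in the text.
BLOCKED_SUBSTRINGS = ["delete", "clear", "wipe", "remove"]


def is_refused(prompt: str) -> bool:
    """Refuse any prompt that contains a flagged term as a substring."""
    text = prompt.lower()
    return any(term in text for term in BLOCKED_SUBSTRINGS)


# Ordinary requests that happen to share words with destructive ones.
print(is_refused("remove the typo from my draft"))   # True: a harmless edit is refused
print(is_refused("make the agenda clearer"))         # True: "clear" hides inside "clearer"

# A destructive request that uses none of the flagged words.
print(is_refused("cancel every meeting on Friday"))  # False: it passes the filter
```

Tightening the match to whole words would rescue “clearer” but not the typo request, and neither choice stops the third prompt.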

The naive model asks the impossible question, “is this input bad?”, about an input space with no bottom.

[Figure: Two ways to fence an agent. On the left, a deny-list tries to surround an unbounded field of inputs with hand-placed blocks, and new phrasings keep slipping through the gaps. On the right, an allow-list draws one small boundary around the few actions the agent may take, and everything outside it is refused by default.]

The other way: grant the few, refuse the rest

Flip the question. Stop asking “is this input bad?” and start asking “is this action one I granted?”

This is an allow-list, and in a system that takes it seriously the term of art is a capability. You don’t hand the agent the world and then try to catch it when it reaches somewhere it shouldn’t. You hand it a small, explicit set of things it can do (read these records, create a draft here, propose a calendar change), and the system enforces that nothing else is even reachable. Not “discouraged.” Not reachable. The action you didn’t grant has no door.

Now look at what changed about the math. The set you have to enumerate is no longer the infinite set of bad inputs. It’s the finite set of good actions. And that set you can write down, because you already know it: it’s the agent’s job description. An agent that schedules meetings might need four or five capabilities. You can name all of them on an index card. Everything in the infinite remainder is denied without anyone ever having to anticipate it, name it, or block it.
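A default-deny dispatcher is one way to picture this. The sketch below is an assumption-heavy illustration: the action names, the stand-in handlers, and the `Outcome` type are all invented here, and a real system would enforce the grant at a process or API boundary rather than in a dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Outcome:
    allowed: bool
    detail: str


# Hypothetical handlers standing in for real CRM and calendar calls.
def read_customer(customer_id: str) -> str:
    return f"record for {customer_id}"


def propose_calendar_change(summary: str) -> str:
    return f"proposal queued: {summary}"


# The grant: the complete list of actions this agent may call.
GRANTED: dict[str, Callable[[str], str]] = {
    "read_customer": read_customer,
    "propose_calendar_change": propose_calendar_change,
}


def dispatch(action: str, argument: str) -> Outcome:
    """Run a granted action; refuse anything that is not in the grant."""
    handler = GRANTED.get(action)
    if handler is None:
        # No rule had to anticipate this action. Absence from the grant is enough.
        return Outcome(False, f"'{action}' is not granted")
    return Outcome(True, handler(argument))


print(dispatch("read_customer", "c-42"))
print(dispatch("delete_event", "friday-standup"))
print(dispatch("apagar_tudo", "sexta"))
```

Nothing in `dispatch` inspects the wording of a request. The only lookup is against the short list of granted actions, so an action name nobody has seen before is refused the same way as a familiar one.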

The naive way required you to predict every attack. The capability way requires you to predict none of them. A phrasing nobody has ever invented, in a language nobody on the team speaks, asking for an action outside the grant, still fails: not because a filter recognized it, but because the action it asked for was never on the table. You don’t have to know what the attack looks like. You only have to know what the agent is allowed to do.

That’s the property a deny-list can never have: it closes by default. Silence is denial. The empty space outside the grant is safe precisely because it’s empty, and you didn’t have to fill it with rules to make it so.

Capability, not vocabulary

It’s tempting to hear “allow-list” and picture a nicer keyword filter, a list of good phrasings instead of bad ones. That’s not it, and the distinction is the whole point.

A capability is not about words. It’s about what the agent can reach. The agent can phrase its intent any way it likes, in any language, through any chain of reasoning, and then, at the boundary, it can only call the handful of actions it was granted, scoped to the data it was scoped to. The natural-language part stays expressive and open. The acting part is narrow and closed. You let the model be as creative as it wants in deciding what to do, and you make the set of things it can actually do small enough to audit on one screen.
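Scoping can be sketched as a grant that names both an action and the data it applies to. The `Grant` type, the resource identifiers, and `is_permitted` below are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    action: str
    resources: frozenset[str]  # the only resources this action may touch


# Hypothetical grants for one agent: short enough to audit on one screen.
GRANTS = [
    Grant("read_record", frozenset({"customer:acme"})),
    Grant("create_draft", frozenset({"drafts:support"})),
]


def is_permitted(action: str, resource: str) -> bool:
    """True only when some grant covers both the action and the resource."""
    return any(g.action == action and resource in g.resources for g in GRANTS)


print(is_permitted("read_record", "customer:acme"))    # granted action, granted data
print(is_permitted("read_record", "customer:globex"))  # granted action, ungranted data
print(is_permitted("delete_record", "customer:acme"))  # ungranted action
```

The check never sees the prompt. However the model reasoned its way to a call, only the action and resource pair is tested against the grant.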

Consider the destructive action that started this post. The deny-list approach tries to recognize every way a human might ask to delete things, forever. The capability approach simply doesn’t grant the calendar agent a delete capability at all, or grants it only as a proposal that a human approves, never a direct action. Now “clear my Friday,” “wipe it,” “nuke my afternoon,” and the polite hypothetical all land in the same place: a thing the agent can suggest but cannot execute. One decision (what to grant) replaced an infinite list of decisions about what to block. The phrasings stopped mattering the moment the action stopped being reachable.
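The proposal pattern might look like the sketch below. All of the names (`Calendar`, `AgentCalendar`, `Proposal`, `propose_cancel`, `approve`) are assumptions made for illustration. Python does not enforce privacy, so a wrapper class only models the idea. Real enforcement would live at a boundary the agent cannot cross, such as a separate service or credential.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    description: str
    event_ids: tuple[str, ...]


@dataclass
class Calendar:
    """Full calendar, held by the human-facing side only."""
    events: dict[str, str]
    pending: list[Proposal] = field(default_factory=list)

    def approve(self, proposal: Proposal) -> None:
        # A human decision: only now is anything actually cancelled.
        self.pending.remove(proposal)
        for event_id in proposal.event_ids:
            self.events.pop(event_id, None)


class AgentCalendar:
    """The agent's view: it can read and propose, and has no cancel method."""

    def __init__(self, calendar: Calendar) -> None:
        self._calendar = calendar

    def list_events(self) -> dict[str, str]:
        return dict(self._calendar.events)  # a copy, so edits do not leak back

    def propose_cancel(self, description: str, event_ids: list[str]) -> Proposal:
        proposal = Proposal(description, tuple(event_ids))
        self._calendar.pending.append(proposal)
        return proposal


calendar = Calendar(events={"e1": "Fri 10:00 standup", "e2": "Fri 14:00 review"})
agent = AgentCalendar(calendar)

# Whatever the phrasing was, the most the agent can do is file a proposal.
proposal = agent.propose_cancel("clear my Friday", ["e1", "e2"])
print(len(calendar.events), len(calendar.pending))  # 2 1

calendar.approve(proposal)  # the human approves
print(len(calendar.events), len(calendar.pending))  # 0 0
```

The design choice is that cancellation exists only on the object the human side holds. The agent's handle has no method to call, so no wording of the request changes the outcome.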

This is also why the capability model degrades gracefully and the deny-list degrades catastrophically. When a deny-list is wrong, it’s wrong open: the unanticipated input gets through and something happens that shouldn’t. When a capability grant is wrong, it’s wrong closed: the agent tried to do something useful, couldn’t, and surfaced that it was blocked. A system that fails closed asks for a permission. A system that fails open files an incident. Imagine an agent that, mid-task, hits a wall and tells you “I’d need access to do that. Approve?” That’s a Tuesday. The other failure is a postmortem.

[Figure: Two failure modes. A deny-list catches the phrasings it learned and lets an unknown one through, so the agent acts when it should not and the team learns from the incident. A capability grant has no door for an ungranted action, so the agent stops, asks for permission, and the human decides before anything happens.]

The cost, paid honestly

This isn’t free, and pretending it is would be its own kind of dishonesty.

The capability model puts the work up front. Someone has to decide, for each agent, exactly what it may touch, and that decision is real, and it can be gotten wrong, and a too-narrow grant means an agent that keeps stopping to ask for things it should obviously have. The deny-list feels cheaper at the start because you ship with an empty list and add to it only when something goes wrong. The capability list makes you do the thinking on day one, before there’s an incident to motivate it.

That up-front cost is the entire bargain, and it’s a good one. The deny-list’s “cheap start” is a loan with brutal interest: you pay it back one incident at a time, each one in production, each one teaching you a phrasing you’ll now block too late. The capability list charges you the full price once, in a design meeting, where the worst outcome is an awkward conversation about scope instead of a customer’s data in the wrong place.

And the up-front list is the one you can actually finish. A grant is bounded: it’s the agent’s job, written down. A blocklist is unbounded by construction. One of these two lists converges. It isn’t the one you build by reacting.

The turn: trust is a grant, not a hope

Step back from agents for a second, because this isn’t really a fact about software.

The deny-list is how you supervise someone you don’t trust and can’t fire: you let them do everything, you watch for the moment they do something wrong, you’re always a step behind, and the relationship is exhausting for everyone. The capability grant is how you actually delegate. You don’t hand a new hire the keys to every system and a list of things they’re forbidden to touch. You give them access to the things their job needs, and the rest simply isn’t theirs to break, not because you don’t trust them, but because bounded access is what trust looks like when it’s real. The boundary isn’t an insult. It’s the thing that lets the trust exist at all.

That’s the model an AI-native company has to run on, because the alternative doesn’t scale to a workforce of agents. You cannot hand-write a blocklist for every way every agent might misstep, across every language, forever. You can write down what each one is for, grant exactly that, and let the empty space outside the grant do the work that no list of rules ever could. The agent gets to be useful inside its fence, and the fence holds because of what’s not in it.

The deny-list never converges. The grant does, and a grant you can finish is the only kind of safety you can actually ship.

That’s how we think about agent boundaries at Apollo Space: not a longer list of the things an agent must never do, but a short, honest list of the things it’s allowed to do, and a default of no for everything else. If you’ve ever shipped a filter on Monday and patched it again on Friday, you already know which list converges. It was never the one you wrote by reacting.
