The agent that builds its own tools

When the agent writes the function it needs instead of waiting for someone to ship it, the line between using a tool and making one disappears, and the guardrail moves to where it belongs.

Apollo Space Research

December 23, 2025 · 11 min read

An agent is mid-task. It needs to total a column of numbers that live in three different places, in three different formats, none of which line up. It has a calculator tool, a search tool, a dozen integrations, and not one of them does this exact thing. In the ordinary setup, this is where the agent stops, shrugs in prose, and files a request for a human to build the missing function. In ours, it writes the function, runs it in a box that can’t hurt anything, gets the number, and keeps going.

The whole detour took seconds. Nobody shipped a feature.

That is the idea this post is about, and it’s smaller than it sounds: the moment an agent can write the tool it’s missing, “using tools” and “building tools” stop being two different jobs. They become one continuous motion. The interesting part isn’t that it’s possible. It’s where you put the guardrail once it is.

Why the tool list always runs out

Start with the way every agent works today, because the failure is built into the shape.

You give an agent a list of tools. Each one is a function with a name, a description, and a schema: send_email, create_task, search_docs. The model reads the task, picks a tool, fills in the arguments, and the runtime executes it. This is a good design. A tool is a promise with a shape, and that shape is what makes the agent’s choices checkable. We’re not arguing against tools.
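To make "a promise with a shape" concrete, here is a minimal sketch of what such a tool could look like. The `Tool` class, the `call_tool` helper, and the body of `create_task` are hypothetical illustrations, not Apollo's actual runtime; the only point is that the schema lets the runtime check the model's choice before anything executes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Tool:
    """A tool is a function plus the shape that makes calls checkable."""
    name: str
    description: str
    schema: Dict[str, type]        # argument name -> expected type
    run: Callable[..., Any]


def call_tool(tool: Tool, args: Dict[str, Any]) -> Any:
    """Check the model's arguments against the schema, then execute."""
    missing = tool.schema.keys() - args.keys()
    extra = args.keys() - tool.schema.keys()
    if missing or extra:
        raise ValueError(
            f"{tool.name}: missing={sorted(missing)} extra={sorted(extra)}"
        )
    for key, expected in tool.schema.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"{tool.name}: {key} must be {expected.__name__}")
    return tool.run(**args)


# Hypothetical stand-in for a real integration.
create_task = Tool(
    name="create_task",
    description="Create a task with a title and a priority.",
    schema={"title": str, "priority": int},
    run=lambda title, priority: {
        "title": title, "priority": priority, "status": "open"
    },
)
```

A call with the wrong shape fails before the function runs, which is the checkability the paragraph above describes.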

We’re arguing against the list being finite.

Because the list is always finite, and the world isn’t. Whoever built the agent imagined some set of things it would need to do, wrote a tool for each, and stopped. The first time the agent meets a task that falls between two tools (total these three mismatched columns, reformat this oddly shaped export, diff yesterday’s report against today’s), it has nothing. The capability it needs is one small function away, and that function doesn’t exist, because nobody imagined this exact request in advance.

Here’s the part that stings: the missing function is usually trivial. Ten lines. A loop, a parse, a sum. A junior engineer would write it in a minute. But the agent can’t, because in the ordinary architecture the agent is a user of tools, never an author of them. So it does the only thing a user can do when the catalog comes up short. It improvises in natural language, “the total appears to be roughly…”, and gets it subtly wrong. Or it stalls and asks a human to go build the ten-line function, which lands in a backlog, which is where small obvious things go to wait behind big ones.
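For a sense of scale, the missing function from the opening example might look something like the sketch below. The three input formats (thousands separators, a currency symbol, accounting-style parentheses for negatives) are assumptions made up for illustration; the post does not say which formats were involved.

```python
from decimal import Decimal
from typing import List


def parse_amount(raw: str) -> Decimal:
    """Parse '1,234.50', '$99.25', or accounting-style '(12.00)'."""
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    cleaned = text.strip("()").replace("$", "").replace(",", "").strip()
    value = Decimal(cleaned)
    return -value if negative else value


def total(*columns: List[str]) -> Decimal:
    """Sum every cell across any number of mismatched columns."""
    return sum(
        (parse_amount(cell) for column in columns for cell in column),
        Decimal("0"),
    )
```

A loop, a parse, a sum. Using `Decimal` rather than floats keeps the cents exact, which matters if the number ends up in a report.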

The naive fix is to add more tools. Anticipate harder. Ship a bigger catalog. But you can’t pre-build every ten-line function a real company will ever need, any more than you can pre-write every sentence someone might say. The catalog grows, the gap moves, and the agent is still standing at the edge of it. The bottleneck never disappears. It just moves to the next thing nobody thought to build.

Let the agent write the function

So flip the architecture. Keep the list, but hand the agent one more tool: a tool that makes tools.

The idea is simple, and it’s worth walking through why it works. The agent already speaks the language tools are written in. The same model that picks send_email and fills its arguments can also write the body of a small function from scratch. That’s just code, and code is something it’s fluent in. The catalog was never a limit on what the agent could express. It was a limit on what it was allowed to run.

Figure: A finite tool catalog always runs out because the world has more tasks than any list can hold; a synthesis tool turns the agent from a user of a fixed list into an author of the small functions it discovers it needs.

When the agent hits a task with no matching tool, it doesn’t stop. It writes a small function that would do the job, hands that function to a synthesis tool, and the runtime executes the freshly written code in a sandbox, returns the result, and the agent folds the answer back into its work. The missing capability existed for exactly as long as the task needed it.
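The control flow in that paragraph can be sketched in a few lines. Everything named here is hypothetical: `run_step` is an illustrative dispatcher, `fake_model` stands in for the model writing code, and `toy_sandbox` uses a bare `exec`, which provides no isolation at all and is only a placeholder for a real sandbox.

```python
from typing import Any, Callable, Dict

Catalog = Dict[str, Callable[..., Any]]


def run_step(
    name: str,
    args: Dict[str, Any],
    catalog: Catalog,
    write_function: Callable[[str, Dict[str, Any]], str],
    run_in_sandbox: Callable[[str, str, Dict[str, Any]], Any],
) -> Any:
    """Use the catalog tool if one exists; otherwise synthesize one."""
    tool = catalog.get(name)
    if tool is not None:
        return tool(**args)                   # ordinary tool use
    source = write_function(name, args)       # the agent writes the function
    return run_in_sandbox(source, name, args) # the runtime executes it


def fake_model(name: str, args: Dict[str, Any]) -> str:
    """Stand-in for the model; returns fixed source for the demo."""
    return (
        "def weighted_total(values, weights):\n"
        "    return sum(v * w for v, w in zip(values, weights))\n"
    )


def toy_sandbox(source: str, entrypoint: str, args: Dict[str, Any]) -> Any:
    """Placeholder only: exec() in-process is NOT a sandbox."""
    namespace: Dict[str, Any] = {}
    exec(source, namespace)
    return namespace[entrypoint](**args)
```

Both branches return a value the same way, so the rest of the agent's loop does not need to know whether the capability came from the catalog or was written a moment ago.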

This is the move that collapses the two jobs. Using a tool is choosing a function and filling its arguments. Building a tool is writing a function and giving it a shape. When the agent can do both in the same breath, the wall between them is gone. It isn’t reaching into a catalog anymore. It’s reaching for a capability, and if the capability isn’t there, it writes it.

And notice what this does to the backlog. The ten-line function that used to wait behind the big features (the reformatter, the one-off diff, the odd little sum) never enters the backlog at all. It gets written, run, and discarded inside a single task, because it was never worth a human’s afternoon and now it doesn’t need one.

If you stopped reading here, you’d be right to be nervous. An agent that writes and runs its own code sounds like exactly the thing you don’t want loose in a company’s systems. That is the real subject of this post.

The guardrail doesn’t disappear. It moves.

Every instinct says: this is dangerous, so forbid it. Don’t let the agent write code. Keep the catalog locked.

That instinct is reaching for the wrong control. Forbidding synthesis doesn’t make the agent safe; it makes the agent useless at the exact moment it could have been most helpful, and it pushes the same risk somewhere quieter. Because the agent that can’t write a function will instead guess the answer in prose, and a confident wrong number with no code behind it is far more dangerous than a function you can read. You didn’t remove the risk. You hid it inside fluent English where nothing can check it.

So we don’t forbid the writing. We move the guardrail.

Here’s the principle, and it’s the load-bearing one: you don’t make a code-writing agent safe by stopping it from writing code. You make it safe by controlling what the code is allowed to touch. The danger was never that the agent authored a function. The danger is what that function can reach: the network, the filesystem, the production database, the customer’s money. Lock those down and the authorship is harmless. The function runs in a box with no door to the outside; it gets inputs, it returns a value, and the worst a bad one can do is return a bad value, which the next layer is built to catch anyway.

The naive version of safety is a list of forbidden actions (don’t delete files, don’t call this API, don’t spend money) written into a prompt and hoped over. That fails the way all prompt rules fail: the model can be talked out of them, and a rule the model can reason around is not a guardrail, it’s a suggestion. The real version isn’t a rule the agent reads. It’s a wall the agent can’t see past. The sandbox doesn’t ask the function to behave. It removes the ability to misbehave: no network unless granted, no filesystem unless granted, and a hard ceiling on time and memory so a runaway loop dies in the box instead of in production.

Figure: Two ways to make a code-writing agent safe: a prompt rule the model can reason around versus a sandbox wall it cannot reach past, with the synthesized function isolated to inputs and a returned value while the network, files, and money stay outside the box.
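As a rough illustration of the "hard ceiling on time and memory" part, the sketch below runs synthesized source in a separate Python process with CPU, memory, and wall-clock limits. It is an assumption-heavy sketch, not Apollo's sandbox: `run_in_box` and `RUNNER` are hypothetical names, it is POSIX-only (it relies on the standard `resource` module and `preexec_fn`), and it does nothing about the network or filesystem, which would need OS-level isolation such as containers or namespaces that is not shown here.

```python
import json
import resource
import subprocess
import sys
from typing import Any, Dict

# Child-side program: load the source, call the entrypoint, emit JSON.
RUNNER = """
import json, sys
payload = json.load(sys.stdin)
namespace = {}
exec(payload["source"], namespace)
result = namespace[payload["entrypoint"]](**payload["args"])
json.dump({"result": result}, sys.stdout)
"""


def _cap(kind: int, value: int) -> None:
    """Lower one resource limit, never exceeding the inherited hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def run_in_box(
    source: str,
    entrypoint: str,
    args: Dict[str, Any],
    *,
    wall_seconds: float = 5.0,
    cpu_seconds: int = 2,
    memory_bytes: int = 512 * 1024 * 1024,
) -> Any:
    """Run synthesized code in a child process with time and memory caps."""

    def apply_limits() -> None:          # runs in the child, before exec
        _cap(resource.RLIMIT_CPU, cpu_seconds)
        _cap(resource.RLIMIT_AS, memory_bytes)

    proc = subprocess.run(
        [sys.executable, "-I", "-c", RUNNER],   # -I: isolated mode
        input=json.dumps(
            {"source": source, "entrypoint": entrypoint, "args": args}
        ),
        capture_output=True,
        text=True,
        timeout=wall_seconds,                   # wall-clock ceiling
        preexec_fn=apply_limits,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.strip() or f"exit code {proc.returncode}")
    return json.loads(proc.stdout)["result"]
```

The design choice worth noticing is that none of these limits are expressed as instructions to the code. A runaway loop is killed by the operating system when it exceeds its CPU budget, whether or not the function "intended" to behave.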

There’s a second guardrail, quieter than the sandbox and just as important: a tool the agent writes once doesn’t have to be written from scratch the next time. When a synthesized function turns out to be good (it ran clean, it returned the right shape, it solved a task a real person actually had), it can be promoted: reviewed the way any change is reviewed, given a stable name and a schema, and added to the catalog as a first-class tool. The next agent that needs it doesn’t re-derive it. It just picks it, the way it picks send_email.

That promotion path is what keeps synthesis from becoming chaos. The throwaway functions stay throwaway. The ones that prove themselves graduate, from a thing one agent improvised in a sandbox into a tool the whole system can trust, with a shape you can check and a history you can audit. The catalog stops being a fixed list someone wrote in advance. It becomes a thing that grows from real use, one proven function at a time.
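One way to picture the promotion path is as a gate with explicit conditions. The sketch below is hypothetical throughout: the `Candidate` and `CatalogEntry` types, the `promote` function, and the threshold of three clean runs are all assumptions chosen for illustration, not a description of how Apollo decides.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Candidate:
    """A synthesized function plus the evidence gathered about it."""
    source: str
    entrypoint: str
    clean_runs: int = 0
    failed_runs: int = 0
    approved_by: Optional[str] = None   # set by a human reviewer


@dataclass(frozen=True)
class CatalogEntry:
    """A promoted tool: stable name, schema, and an audit trail."""
    name: str
    schema: Dict[str, str]
    source: str
    entrypoint: str
    approved_by: str


def promote(
    candidate: Candidate,
    catalog: Dict[str, CatalogEntry],
    *,
    name: str,
    schema: Dict[str, str],
    min_clean_runs: int = 3,            # assumed threshold
) -> CatalogEntry:
    """Add a proven, reviewed function to the catalog, or refuse."""
    if candidate.failed_runs > 0 or candidate.clean_runs < min_clean_runs:
        raise ValueError("candidate has not proven itself yet")
    if candidate.approved_by is None:
        raise PermissionError("promotion requires human review")
    if name in catalog:
        raise KeyError(f"tool name already taken: {name}")
    entry = CatalogEntry(
        name, schema, candidate.source, candidate.entrypoint,
        candidate.approved_by,
    )
    catalog[name] = entry
    return entry
```

A candidate that never meets these conditions simply stays in the log as a throwaway, which matches the split described above between functions that evaporate and functions that graduate.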

What an agent actually does with this

It’s worth grounding this, because “writes its own tools” can sound grander than the day-to-day reality, and the day-to-day reality is the point.

Most of what a synthesis tool gets used for is unglamorous. Reshaping a messy export into the shape the next step expects. Computing something the calculator can’t express: a weighted total, a date difference across a quarter boundary, a check that two lists actually match. Parsing a format nobody anticipated. These are not feats of engineering. They’re the small connective functions that sit between the big tools, the glue that a human operator would write without thinking and a fixed catalog can never quite cover.

Say an agent is reconciling two sets of records that should agree and don’t: counts in one system, counts in another. No tool compares them; why would there be one, for this exact pair? So it writes the six-line comparison, runs it in the box, and reports the three rows that differ. The capability lived for one task and then evaporated, and the agent never had to say “I can’t do that, please build me a tool.” It built the tool. The whole thing was invisible.
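A comparison like the one in that example could be as small as the sketch below. The data shape (a mapping from record key to count in each system) and the name `differing_rows` are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[int], Optional[int]]


def differing_rows(left: Dict[str, int], right: Dict[str, int]) -> List[Row]:
    """Return (key, left_count, right_count) wherever the systems disagree."""
    keys = sorted(left.keys() | right.keys())
    return [
        (key, left.get(key), right.get(key))
        for key in keys
        if left.get(key) != right.get(key)
    ]
```

A key that exists in only one system shows up with `None` on the other side, so missing records are reported alongside mismatched counts rather than silently skipped.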

That invisibility is the tell that it’s working. Nobody filed a feature request. Nobody waited a sprint. The agent met a gap and closed it, inside its own turn, and the only trace it leaves is a function in a log that a reviewer can read if they ever need to ask “how did it get that number?” The answer is right there, in code, which is exactly where you want the answer to be: not buried in a sentence the model wrote, where nothing can audit it.

The turn: capability stops being something you ship

Step back from the sandbox and the schemas, and here’s what actually changed.

For the whole history of software, capability was something you shipped. Someone decided the product should do a thing, an engineer built the thing, it went through review and release, and then, weeks later, if you were lucky, users could do the thing. The list of what the software could do was always a list someone wrote in advance, and the gap between “I need this” and “I can do this” was measured in releases. Most small needs never crossed that gap at all, because they weren’t worth a release, so they just stayed needs forever.

An agent that writes its own tools quietly erases that gap for the small stuff, and the small stuff is most of the stuff. The ten-line function that would never have justified an engineer’s afternoon now gets written the instant it’s needed and forgotten the instant it isn’t. Capability stops being a thing you ship on a schedule and becomes a thing that appears on demand, inside the work, while you’re asleep. The product is no longer the fixed set of things someone built. It’s the open set of things the agent can build, fenced by what you’ll let it touch.

That’s the part worth sitting with. The hard question stops being “what features did we ship?” It becomes “what are we willing to let an agent reach?”, and that’s a much better question, because it’s the one that was always actually load-bearing. The fence is the product now. Get the fence right and the capability takes care of itself.

That’s what we’re building at Apollo: not a smarter agent with a longer list of buttons, but an agent that writes the button it’s missing and runs it somewhere it can’t do harm. The line between using a tool and making one was never a law of nature. It was just a wall nobody had a safe reason to take down yet.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.
