Engineering

A tool is a promise. Give it a schema.

A tool with a loose interface is a promise the model can break, a schema at the call layer turns that promise into a contract the system can enforce, so the model retries instead of hallucinating a shape.

ASR

Apollo Space Research

Apollo Space

· 11 min read

An agent is asked to create a calendar event. It produces a tool call with the date written as “next Thursday,” the duration as a friendly “about an hour,” and the attendee as a name instead of an address. Every word is reasonable. Every word is wrong for the function on the other side, which wanted an ISO timestamp, an integer count of minutes, and an email. The call fails, or worse, it half-succeeds and books the wrong slot. Nobody typed a bad instruction. The model just made up a shape the tool never agreed to accept.

That gap, between what the model emits and what the tool will actually take, is where most agent failures live. Not in the reasoning. In the handshake.

A tool with a loose interface is a promise the model can break. The fix is to make the promise something the system can check.

This post is about that fix: why structured, schema-validated input and output is the difference between a tool you hope the model uses correctly and a tool the model cannot use incorrectly without being told, on the spot, to try again.

The naive tool: a docstring and a prayer

The first way everyone builds a tool is the obvious way. You write a function, you give it a name, you write a sentence of documentation, and you hand the model a list of those sentences. create_event(details). send_email(message). record_expense(info). The model reads the description, decides this is the tool it wants, and writes whatever arguments seem right.

It works in the demo. The demo prompt is clean, the model is fresh, the example is the happy path. So the model writes a tidy little JSON blob, the function accepts it, and everyone nods.

Then it meets the real world. The real user writes “book me 30 with the Acme folks tomorrow afternoon.” Now the model has to invent the shape entirely on its own. Is attendees a list or a string? Did I call the key start or start_time or when? Is the date a word or a number? The description said details, details is not a shape. So the model picks one, confidently, and roughly a third of the time it picks one the function never expected.

Here’s the part that makes it dangerous instead of merely annoying. The function often accepts the wrong shape and does the wrong thing quietly. A loosely-typed handler takes the string “tomorrow,” fails to parse it, falls back to now, and books the meeting for this exact minute. No error. No retry. Just a wrong result wearing the costume of a successful one. The model thinks it kept its promise. The user finds out on Thursday that it didn’t.

The naive tool’s contract lives in one place only: an English sentence the model is free to misread. There is nothing in the system that can tell the difference between a call that honored the contract and a call that violated it. So nothing does. The violation sails straight through to the user.

The schema is the contract, written where a machine can read it

The key idea is simple. Stop describing the tool’s input in prose the model interprets, and start declaring it in a schema the system validates.

A schema is just the promise made explicit and machine-checkable. Not “pass the details,” but: start is a required ISO-8601 timestamp; duration_minutes is a required positive integer; attendees is a required list of strings, each matching an email pattern; title is a required non-empty string. Every field has a type. Required fields are marked required. Strings that have a shape, emails, dates, currency codes, carry that shape as a constraint, not a hope.

Now the call passes through a gate before it ever reaches your function. The model emits its arguments; the gate checks them against the schema; only a call that satisfies every constraint is allowed through. A string where an integer was promised doesn’t reach the function and quietly corrupt a booking. It bounces off the gate.

Two things change the instant you do this, and they matter in different ways.

The first is that the model is steered before it even calls. A capable model that’s shown a schema generates against that schema, it sees duration_minutes: integer and emits a number, sees start: date-time and emits a timestamp. The shape stops being a guess because the shape is written down. Half the malformed calls never happen, because the description was never ambiguous in the first place.

The second is the one that actually saves you: when the model gets it wrong anyway, and on a hard, real-world request it sometimes will, the wrong call is caught at the gate, not in production. And a caught call can be handed back.

Two ways an agent's tool call reaches a function. On the left, the naive path: a prose docstring, the model invents a shape, the loosely-typed handler accepts a wrong value and quietly produces a wrong result that reaches the user. On the right, the schema-validated path: the model emits arguments, a validation gate checks them against a declared schema, a malformed call is rejected with a precise reason and never touches the function, and only a conforming call runs.

A rejection the model can read is a retry, not a crash

This is the move that turns validation from a safety net into a feedback loop, and it’s the part most people miss.

The naive instinct, once you add a gate, is to treat a rejected call as an error, log it, fail the turn, surface a stack trace, give up. That’s strictly better than the silent-wrong-result, because at least it’s honest. But it wastes the most useful thing the gate just produced: a precise, structured explanation of exactly what was wrong.

When the schema rejects a call, it doesn’t say “invalid input.” It says duration_minutes: expected integer, got string "about an hour". It says attendees[0]: "the Acme folks" does not match email format. That message is not for a human log. It’s for the model.

So you hand it back. The rejected call, with the validator’s reason attached, goes straight into the model’s next turn: that call didn’t satisfy the contract, here’s exactly which field and why, try again. And the model, which is genuinely good at fixing a specific, named mistake, fixes it. It re-reads “tomorrow afternoon,” consults the constraint it just violated, and emits a real ISO timestamp. The second call passes the gate.

The naive way to picture a tool boundary is a wall: the call either gets through or it crashes into the wall. The schema turns the wall into a turnstile that talks back. A bad call isn’t the end of the attempt, it’s one corrected draft on the way to a good one. The model isn’t punished for the malformed shape; it’s told the shape, in the one moment it can act on the information.

A validation error the model can read is not a failure. It’s the most useful sentence in the loop.

This is why “make the model smarter” is the wrong place to spend. A smarter model still has to guess at an undeclared shape, and still guesses wrong on the hard inputs. A schema’d tool with a readable rejection lets a current model self-correct, because the correction the model needs isn’t more intelligence, it’s a specific fact about what the tool will accept, delivered at the exact moment it’s calling.

A malformed tool call entering the validation loop and coming out conforming. The model emits a call with a bad field; the schema gate rejects it with a precise structured reason naming the field and the violation; that reason is fed back into the model's next turn; the model emits a corrected call that satisfies the schema and is allowed to execute. The loop turns a rejected call into a retry instead of a crash.

The output side has the same promise, and the same break

Everything so far has been about what goes into the tool. The half people forget is what comes out.

A tool that returns an unstructured blob is the same broken promise, pointed the other direction. The function runs, succeeds, and returns “Created the event for you!”, a friendly sentence with no structure. Now the next step in the agent’s plan, which needs the event’s ID to send the invite, has to parse English to find it. Sometimes there’s no ID in the sentence at all. The agent, needing a value that isn’t there, does the thing models do when a value is missing under pressure: it makes one up. A plausible-looking ID that points to nothing. The chain keeps going, confidently, on a fabricated handle.

The fix is symmetric. The tool’s return is schema’d too: { event_id: string, start: timestamp, status: enum }. The function is contracted to produce exactly that, and the value gets validated on the way out the same way the arguments got validated on the way in. The next step doesn’t parse prose to find the ID, it reads a field that is guaranteed to be there, or the tool failed honestly and said so. A structured output is a promise the tool makes to the rest of the system, and validating it is how you stop a downstream agent from hallucinating around a missing field.

Input schema keeps the model from lying to the tool. Output schema keeps the tool from lying to the model. A tool worth trusting honors the contract in both directions, and the contract is enforced at the boundary, not assumed in the prose.

What this discipline actually buys

It’s worth being plain about the trade, because schemas aren’t free. Writing them is more work than writing a docstring. You have to decide, up front, exactly what every field is and what shape it takes, which means you have to actually know what your tool accepts, instead of discovering it in production when a wrong value sails through.

That up-front cost is the whole point. The expensive failure in an agent system was never the malformed call itself. It was the malformed call that succeeded at being wrong, the meeting booked for the wrong minute, the expense recorded against the wrong field, the downstream step running on a fabricated ID, all of it silent, all of it discovered late, by a user, as a bug that’s hard to trace because nothing ever threw an error.

A schema moves that failure from production to the call boundary, and from silent to loud-but-recoverable. Suppose a tool gets called a thousand times in a busy week across a company’s agents. The handful of malformed calls that a loose interface would have turned into quiet wrong results instead get caught, named, handed back, and corrected inside the same turn, before anything touched a calendar, a ledger, or a customer. You don’t see them as incidents. You don’t see them at all. That’s the tell that it’s working.

The trust doesn’t live in the model being careful. It lives in the boundary being checkable. That’s a much more reliable place to keep it, because the boundary doesn’t get tired, doesn’t get a confusing prompt, and doesn’t have a good day and a bad day. It just holds the contract.

The turn: a contract is a kindness, not a constraint

Strip the agents out of this and it’s an old idea about working with anyone, machine or person.

The fastest way to make a collaborator unreliable is to be vague about what you need and then be annoyed when they guess wrong. The fastest way to make them reliable is the opposite: tell them exactly what “right” looks like, and tell them the instant they’ve missed it, while they can still fix it. We don’t think of a clear spec as a cage. We think of it as respect, here’s precisely what I need, and here’s a fast, specific no when it’s not that yet, so you can get to yes.

A schema is that, for a tool. It’s not there to catch the model out. It’s there so the model can succeed on purpose instead of by luck, so a wrong shape becomes a corrected draft instead of a quiet disaster, and a tool stops being a promise you hope holds and becomes a contract the system keeps for you.

That’s a piece of what we’re building at Apollo: agents whose tools have real edges, where a bad call gets a precise no and a second try instead of slipping through to bite someone on a Thursday. If you’ve ever shipped an agent that confidently did the wrong thing and never threw an error, you already know the cheapest place to catch that wasn’t the model. It was the handshake.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist