What's the difference between an AI agent and an LLM API call?

An LLM API call is stateless: it takes input, produces output, and forgets everything. An agent maintains memory across interactions, has access to tools that let it act on the world, and runs a decision loop that lets it pursue goals autonomously over time.

What programming languages are Apollo Space's agents built in?

The orchestration layer and most agents are built in Python. Performance-critical components like the memory retrieval system use Rust. The tool integrations use a mix of Python and TypeScript depending on the APIs involved.

How do you prevent agents from getting stuck in loops?

Through circuit breakers (maximum iterations per decision cycle), divergence detection (recognizing when an agent is repeating the same action without progress), and escalation protocols (routing to a human when the agent's confidence drops below a threshold).

Engineering

Anatomy of an AI agent: memory, tools, and decision loops

Most people think an AI agent is an LLM with a system prompt. It's not. An agent is memory, tools, and a decision loop. Here's how we built twelve of them.


Apollo Space Research

Apollo Space

September 1, 2025 · 14 min read

The Confusion That Costs Teams Months

Every week, we talk to a CTO or engineering lead who says they’re “building agents.” Then we ask three questions:

Does it remember what happened yesterday?
Can it take actions without your code explicitly calling an API?
Can it decide what to do next without a human choosing the step?

If the answer to any of the three is no, you haven’t built an agent. You’ve built a prompt chain. And that’s fine: prompt chains are useful. But confusing a prompt chain with an agent is how teams spend six months building something that can’t do what they promised it would do.

An agent is three things: memory, tools, and a decision loop. Remove any one, and you have something else, something potentially valuable, but not an agent. Understanding these three components and how they interact is the difference between building systems that actually work and building elaborate auto-complete.

This is how we build agents at Apollo Space. Not theory, implementation.

Component 1: Memory

The most misunderstood part of agent architecture is memory. Most teams confuse memory with context windows, and the confusion is expensive.

A context window is the text you pass to the LLM on each call. It’s the prompt, the recent conversation, the retrieved documents. It’s what the model sees right now. When the call ends, the context window disappears. The model has no knowledge that the conversation ever happened.

Memory is what persists between calls. It’s the agent’s accumulated knowledge, experience, and procedures. It’s what allows the agent to improve over time, recognize patterns, and act on information from weeks or months ago.

We implement three types of memory in Apollo Space’s agents, borrowing terminology from cognitive science:

Episodic Memory: What Happened

Episodic memory stores events: specific things that happened, when they happened, and what the outcome was. It’s the agent’s autobiography.

For Apollo Space’s SDR agent, episodic memory looks like this:

Event: Sent outreach email to VP Engineering at AcmeCorp
Date: 2026-01-14 09:32:00
Action: Cold email, personalized on recent Series B
Outcome: Opened, no reply
Follow-up: Scheduled for 2026-01-18

Event: Sent follow-up to VP Engineering at AcmeCorp
Date: 2026-01-18 14:15:00
Action: Follow-up referencing their job posting for DevOps lead
Outcome: Reply received, positive, meeting booked for 2026-01-22
Follow-up: Meeting prep triggered

This isn’t a conversation log. It’s a structured record of actions and outcomes that the agent can query. When the SDR agent is deciding how to approach a new prospect, it doesn’t just use the current context; it queries its episodic memory for similar prospects, similar outreach strategies, and their outcomes.

The query might be: “Show me the last 20 outreach sequences to VP-level prospects at Series B companies in the developer tools space, ranked by meeting conversion rate.” The results inform the agent’s strategy for the next prospect.

Episodic memory is stored in a vector database (we use Qdrant) with structured metadata. Each episode is embedded for semantic similarity search and tagged with structured fields (outcome, prospect attributes, strategy used) for filtered queries. The combination of semantic and structured search is important: pure semantic search returns similar events, while structured filters let the agent ask precise analytical questions about its own history.
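To make the combination concrete, here is a minimal in-memory sketch in plain Python. It does not use the Qdrant client; the `Episode` fields, the two-dimensional toy vectors, and the `search_episodes` helper are all illustrative assumptions rather than Apollo Space's schema. It applies the structured filter first, then ranks the survivors by cosine similarity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Episode:
    event: str
    outcome: str
    embedding: list[float]  # in practice produced by an embedding model
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_episodes(episodes, query_vec, filters, top_k=3):
    """Keep episodes whose metadata matches every filter, then rank by similarity."""
    candidates = [
        e for e in episodes
        if all(e.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda e: cosine(e.embedding, query_vec), reverse=True)
    return ranked[:top_k]


# Hypothetical episodes with made-up embeddings.
episodes = [
    Episode("Cold email on Series B", "no_reply", [0.9, 0.1],
            {"seniority": "vp", "stage": "series_b"}),
    Episode("Follow-up on DevOps job posting", "meeting_booked", [0.7, 0.3],
            {"seniority": "vp", "stage": "series_b"}),
    Episode("Cold email to founder", "meeting_booked", [0.8, 0.2],
            {"seniority": "founder", "stage": "seed"}),
]

hits = search_episodes(episodes, [1.0, 0.0], {"seniority": "vp", "stage": "series_b"})
```

Filtering before ranking keeps the founder episode out of the results even though its embedding is close to the query, which is the behavior that pure similarity search cannot guarantee.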

Semantic Memory: What’s True

Semantic memory stores facts and knowledge: things the agent knows that aren’t tied to specific events. It’s the agent’s understanding of the world.

For the competitor watch agent, semantic memory includes:

CompeteLogic’s current pricing tiers and feature breakdown
The CTO of AcmeCorp previously worked at Datadog (and therefore likely values observability features)
Q4 is historically the slowest quarter for logistics companies’ technology purchasing
Our win rate against CompeteLogic is 62% when we lead on pricing stability

This knowledge is accumulated over time from the agent’s episodic memory (patterns extracted from events), external data sources, and human-provided context. It’s updated as new information arrives: when CompeteLogic changes its pricing, the semantic memory updates immediately.

The distinction between episodic and semantic memory matters for retrieval. When the SDR agent is preparing for a call, it needs both: episodic memory tells it what happened in previous interactions with this prospect, and semantic memory tells it what’s true about the prospect’s industry, their company, and competitive dynamics.

We store semantic memory as a knowledge graph with weighted edges. Entities (companies, people, products, concepts) are nodes. Relationships (competes-with, works-at, prefers, used-to-use) are edges with confidence scores. The confidence scores decay over time, because a fact from six months ago is less reliable than a fact from yesterday. The decay prevents the agent from acting on stale information without anyone having to tell it the information is stale.
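One way to express time-decayed confidence is an exponential half-life. The sketch below is an assumption about the shape of the decay, not the function Apollo Space uses: the 90-day half-life and the `decayed_confidence` helper are hypothetical.

```python
from datetime import datetime, timedelta

HALF_LIFE_DAYS = 90.0  # assumed: confidence halves every 90 days


def decayed_confidence(base: float, observed_at: datetime, now: datetime,
                       half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Scale an edge's base confidence by how long ago the fact was observed."""
    age_days = max((now - observed_at).total_seconds() / 86400.0, 0.0)
    return base * 0.5 ** (age_days / half_life_days)


now = datetime(2026, 3, 1)
fresh = decayed_confidence(0.9, now - timedelta(days=1), now)    # observed yesterday
stale = decayed_confidence(0.9, now - timedelta(days=180), now)  # observed six months ago
```

With these assumed parameters, a fact observed 180 days ago keeps a quarter of its original confidence, so a retrieval step can rank or drop it without any explicit staleness flag.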

Procedural Memory: What To Do

Procedural memory stores how to do things: learned procedures, heuristics, and decision frameworks. It’s the agent’s muscle memory.

This is the least intuitive type of memory because it feels like it should be code. And some of it is: hardcoded procedures, explicit rule sets, API call sequences. But the most valuable procedural memory is learned, not coded.

For example, Apollo Space’s QA agent has a procedural memory entry that says:

Procedure: Testing auth middleware changes
Learned from: Bug #1847 (November 2025) — login redirect loop
Steps:
  1. Test the happy path (standard login flow)
  2. Test with expired sessions
  3. Test with SSO providers if SSO code is in the diff
  4. CRITICAL: Test redirect behavior for unauthenticated users
     hitting protected routes — this is the most common failure
     mode for auth middleware changes
  5. Test with multiple concurrent sessions
Confidence: 0.94
Last validated: 2026-02-15

This procedure wasn’t written by a human. It was extracted from the agent’s episodic memory after it caught a login redirect bug. The agent identified the pattern (auth middleware changes often break redirect behavior) and codified it as a reusable testing procedure.

Procedural memory is the mechanism through which agents improve at their jobs over time. Each significant outcome (especially failures) gets analyzed, and if a generalizable procedure can be extracted, it’s added to procedural memory. Over months, this accumulates into something that looks remarkably like expertise.

We store procedural memory as structured procedures with confidence scores, context triggers (when to apply this procedure), and validation timestamps. Procedures that haven’t been validated in 90 days are flagged for review, and procedures whose confidence drops below 0.7 (due to contradictory outcomes) are retired.
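The review and retirement rules can be sketched as a small status function. The `Procedure` dataclass and `lifecycle_status` are hypothetical names; only the 90-day and 0.7 thresholds come from the text, and treating exactly 90 days as due for review is an assumption.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_AFTER = timedelta(days=90)  # from the text: unvalidated for 90 days
RETIRE_BELOW = 0.7                 # from the text: confidence floor


@dataclass
class Procedure:
    name: str
    confidence: float
    last_validated: date


def lifecycle_status(proc: Procedure, today: date) -> str:
    """Retirement takes precedence over review; otherwise the procedure stays active."""
    if proc.confidence < RETIRE_BELOW:
        return "retired"
    if today - proc.last_validated >= REVIEW_AFTER:
        return "needs_review"
    return "active"


auth_tests = Procedure("Testing auth middleware changes", 0.94, date(2026, 2, 15))
status_soon = lifecycle_status(auth_tests, date(2026, 3, 1))   # 14 days since validation
status_later = lifecycle_status(auth_tests, date(2026, 6, 1))  # 106 days since validation
```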

Component 2: Tools

Memory gives agents knowledge. Tools give agents power.

A tool, in agent architecture, is any function the agent can call to act on the world or retrieve information from it. Tools are the bridge between reasoning and action.

Apollo Space’s agents have access to tools in four categories:

Information Tools

These retrieve data the agent doesn’t have in memory. Examples:

Query the CRM for a prospect’s engagement history
Search the web for recent news about a company
Read a GitHub PR diff
Check current system metrics from the observability stack

Action Tools

These change the state of the world. Examples:

Send an email
Post a Slack message
Create a JIRA ticket
Approve or request changes on a GitHub PR

Analysis Tools

These perform computation the LLM can’t reliably do alone. Examples:

Run a statistical analysis on pipeline conversion rates
Calculate the financial impact of a budget variance
Compare two code snapshots for behavioral differences
Score a lead against the ICP criteria

Meta Tools

These let the agent manage itself. Examples:

Write to its own memory
Adjust its confidence thresholds
Escalate to a human
Request more information before proceeding

The tool design matters more than most teams realize. A common mistake is giving agents tools that are too granular (send an HTTP request) or too abstract (do the right thing). The right level of abstraction is the level at which a competent human would describe the action. Not “construct a POST request to the Slack API with this JSON payload” but “send a message to the #sales channel saying the deal with AcmeCorp has been updated.”

Each tool has three metadata properties that the agent uses when planning:

Reversibility: Can this action be undone? Sending a Slack message has low reversibility; creating a draft document has high reversibility. The agent factors reversibility into its decisions: it’s more willing to take high-reversibility actions without human approval.
Cost: What’s the cost of using this tool? Not just compute cost, but attention cost (sending a notification interrupts someone) and reputation cost (sending a bad email damages a relationship). The agent weighs cost against expected value before acting.
Latency: How long does this tool take to return a result? A CRM query takes milliseconds. A web search takes seconds. Running a QA test suite takes minutes. The agent plans its actions to parallelize high-latency tool calls when possible.
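A sketch of how these three properties might be attached to a tool and consulted during planning. Everything here is an assumption for illustration: the `ToolSpec` fields, the single 0-to-1 cost score, and the approval rule itself, which is one plausible policy rather than the one the agents actually use.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    reversibility: Reversibility
    cost: float              # assumed: compute, attention, and reputation cost folded into 0..1
    latency_seconds: float


def needs_human_approval(tool: ToolSpec, expected_value: float) -> bool:
    """Assumed policy: reversible actions run freely; irreversible ones
    need approval unless their expected value exceeds their cost."""
    if tool.reversibility is Reversibility.HIGH:
        return False
    return expected_value <= tool.cost


draft = ToolSpec("create_draft_document", Reversibility.HIGH, cost=0.1, latency_seconds=0.5)
slack = ToolSpec("send_slack_message", Reversibility.LOW, cost=0.4, latency_seconds=0.3)
```

Keeping the metadata on the tool definition, rather than in the prompt, means the same check applies no matter which agent or plan invokes the tool.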

Component 3: The Decision Loop

Memory is knowledge. Tools are capability. The decision loop is the mechanism that turns knowledge and capability into autonomous action.

Every Apollo Space agent runs the same core decision loop, which we call PRAO: Perceive, Reason, Act, Observe.

Perceive

The agent receives input from its environment. This can be:

A trigger event (a new PR was opened, a scheduled timer fired, a Slack message was received)
A message from another agent (the competitor watch agent detected a pricing change)
A human request (a user asked the agent to do something)

Perception isn’t passive reception; it includes filtering and prioritization. The SDR agent receives dozens of signals per hour: email opens, link clicks, CRM updates, calendar changes. Perception is the stage where the agent decides which signals are relevant right now.

We implement perception as a priority queue with relevance scoring. Each incoming signal is scored based on the agent’s current goals, the signal’s urgency (time-sensitive vs. can-wait), and the signal’s expected impact (how much would acting on this signal change outcomes). Only signals above the relevance threshold enter the reasoning stage.
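A minimal sketch of such a queue using Python's `heapq`. The equal weighting of the three factors, the 0.5 threshold, and the `PerceptionQueue` class are assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field

RELEVANCE_THRESHOLD = 0.5  # assumed cutoff


@dataclass(order=True)
class ScoredSignal:
    neg_score: float               # negated so the min-heap pops the highest score first
    name: str = field(compare=False)


def relevance(goal_match: float, urgency: float, impact: float) -> float:
    """Assumed scoring: an unweighted mean of three factors in the range 0..1."""
    return (goal_match + urgency + impact) / 3


class PerceptionQueue:
    def __init__(self, threshold: float = RELEVANCE_THRESHOLD):
        self.threshold = threshold
        self._heap: list[ScoredSignal] = []

    def push(self, name: str, goal_match: float, urgency: float, impact: float) -> bool:
        """Queue the signal only if it scores above the threshold."""
        score = relevance(goal_match, urgency, impact)
        if score <= self.threshold:
            return False
        heapq.heappush(self._heap, ScoredSignal(-score, name))
        return True

    def pop(self):
        """Return the most relevant queued signal name, or None if empty."""
        return heapq.heappop(self._heap).name if self._heap else None
```

A real scoring function would likely weight the factors differently per agent; the point of the sketch is that filtering happens before any LLM call is made.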

Reason

The agent decides what to do. This is where the LLM earns its keep.

Reasoning takes the perceived signal, retrieves relevant memory (episodic: what happened in similar situations; semantic: what’s true about this context; procedural: what’s the standard approach), gathers any additional information via tools, and produces a plan.

The plan isn’t a single action; it’s a sequence of actions with conditional branches. “Send a follow-up email. If the prospect replies within 48 hours, schedule a meeting. If they don’t reply, wait 5 days and try a different angle. If they reply negatively, log the objection and archive the sequence.”
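The follow-up example can be represented as a small tree in which each step maps observed outcomes to the next step. The `PlanStep` type, the outcome labels, and the `walk` helper are hypothetical; the source does not describe the plan format.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    action: str
    # Maps an observed outcome label to the step that should follow it.
    branches: dict[str, "PlanStep"] = field(default_factory=dict)


plan = PlanStep("send_follow_up_email", {
    "reply_within_48h": PlanStep("schedule_meeting"),
    "no_reply": PlanStep("wait_5_days", {"done": PlanStep("try_different_angle")}),
    "negative_reply": PlanStep("log_objection", {"done": PlanStep("archive_sequence")}),
})


def walk(step: PlanStep, outcomes: list[str]) -> list[str]:
    """Follow the plan through a sequence of outcomes and return the actions taken.
    Stops at the first outcome the current step has no branch for."""
    actions = [step.action]
    for outcome in outcomes:
        next_step = step.branches.get(outcome)
        if next_step is None:
            break
        step = next_step
        actions.append(step.action)
    return actions
```

An outcome with no matching branch is exactly the case where the agent would need to go back to reasoning instead of continuing the plan.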

The critical engineering decision in the reasoning stage is what goes into the LLM context. The context window is finite (and expensive), so we can’t dump all memories and all tool results into every reasoning call. We use a retrieval strategy that combines:

Recency-weighted episodic memory: Recent events about this specific prospect/project/system
Similarity-weighted semantic memory: Facts relevant to the current situation
Trigger-matched procedural memory: Procedures that match the current context
Tool results: Fresh data gathered via tools for the current signal

The total context is typically 4,000-8,000 tokens of memory plus the current signal. We found that more context doesn’t reliably improve reasoning quality: the signal-to-noise ratio degrades, and the agent starts paying attention to irrelevant details. Precision in context construction matters more than volume.
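A budgeted context builder might look like the sketch below. It assumes each retrieved item already carries a relevance score from its own retriever, and it uses a crude four-characters-per-token estimate; `build_context` and `estimate_tokens` are hypothetical helpers, and a real system would use the model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough assumption: about four characters per token."""
    return max(1, len(text) // 4)


def build_context(candidates, budget_tokens=8000):
    """Greedily pack the highest-scoring items into the token budget.

    candidates: iterable of (score, source, text) tuples, where source is
    e.g. "episodic", "semantic", "procedural", or "tool".
    Returns the chosen (source, text) pairs and the tokens used.
    """
    chosen, used = [], 0
    for score, source, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip items that do not fit; smaller ones may still fit
        chosen.append((source, text))
        used += cost
    return chosen, used


candidates = [
    (0.9, "episodic", "a" * 400),
    (0.8, "procedural", "b" * 400),
    (0.2, "semantic", "c" * 400),
]
chosen, used = build_context(candidates, budget_tokens=200)
```

The lowest-scoring item is dropped once the budget is full, which is the precision-over-volume trade-off described above.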

Act

The agent executes its plan by calling tools. Each action is logged to episodic memory with a timestamp, the reasoning that led to it, and the expected outcome.

The key engineering challenge in the action stage is error handling. What happens when a tool call fails? What happens when the action produces an unexpected result? What happens when the agent’s plan has a step that depends on the result of a previous step that didn’t go as expected?

We handle this with a replanning mechanism. After each action, the agent compares the actual result against the expected result. If they diverge significantly, the agent re-enters the reasoning stage with the new information. This is the “loop” in “decision loop”: it’s not a linear pipeline but an iterative cycle.

The replanning mechanism has circuit breakers. If the agent replans more than three times in a single decision cycle, it escalates to a human. This prevents infinite loops where the agent keeps trying the same failing approach with minor variations.
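The replanning loop and its circuit breaker can be sketched as follows. The callback names (`plan_fn`, `act_fn`, `diverged_fn`) and the `EscalateToHuman` exception are hypothetical; only the limit of three replans per cycle comes from the text.

```python
MAX_REPLANS = 3  # from the text: more than three replans triggers escalation


class EscalateToHuman(Exception):
    """Raised when the circuit breaker trips."""


def run_cycle(plan_fn, act_fn, diverged_fn, max_replans=MAX_REPLANS):
    """Plan, act, and replan on divergence until the result matches expectations.

    plan_fn(feedback) -> plan          (feedback is None on the first pass)
    act_fn(plan) -> (expected, actual)
    diverged_fn(expected, actual) -> bool
    """
    replans = 0
    feedback = None
    while True:
        plan = plan_fn(feedback)
        expected, actual = act_fn(plan)
        if not diverged_fn(expected, actual):
            return actual
        replans += 1
        if replans > max_replans:
            raise EscalateToHuman(f"gave up after {max_replans} replans")
        feedback = actual  # the unexpected result informs the next plan
```

With this counting, the agent executes at most four plans in a cycle (the initial plan plus three replans) before handing off to a human.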

Observe

After acting, the agent observes the outcome. Did the email get sent? Did the test pass? Did the system metrics change? Observation closes the loop and feeds back into memory.

Observation is also where the agent updates its own effectiveness model. If it predicted that a particular outreach strategy had a 30% response rate and the actual response rate over the last month is 45%, it updates its predictive model. This is how agents calibrate their confidence over time: not through explicit retraining, but through continuous observation and adjustment.
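One simple way to express this kind of adjustment is to move the estimate part of the way toward the observed value. The update rule and the 0.5 learning rate below are assumptions for illustration; the source does not say how the predictive model is updated.

```python
def update_estimate(predicted: float, observed: float, learning_rate: float = 0.5) -> float:
    """Move the prediction a fraction of the way toward the observed rate."""
    return predicted + learning_rate * (observed - predicted)


# Using the numbers from the example: predicted 30%, observed 45%.
new_estimate = update_estimate(0.30, 0.45)
```

A lower learning rate makes the agent slower to react to a single unusual month; a higher one makes it track recent results more closely.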

Stateless Calls vs. Stateful Agents

Let’s make the difference concrete with an example.

Stateless LLM call: You send the model a prospect’s LinkedIn profile and ask it to write a cold email. The model writes an email that is personalized only to the profile in front of it. Good enough. But it doesn’t know that you emailed this prospect three weeks ago. It doesn’t know that prospects in this industry respond better to case-study-led emails. It doesn’t know that the prospect’s company just raised a round. Every call starts from zero.

Stateful agent: The SDR agent sees a trigger (new prospect added to CRM). It queries episodic memory (we’ve never contacted this person, but we contacted their colleague in Q4, no response). It queries semantic memory (Series B fintech, 80 employees, CTO from Stripe, high ICP fit). It retrieves procedural memory (fintech prospects respond 2.3x better to ROI-focused emails than feature-focused; best send times for this segment are Tuesday/Wednesday 9-10 AM). It reasons through the approach, drafts the email, sends it at the optimal time, logs the action, and schedules the follow-up.

The email from the stateful agent isn’t just personalized; it’s informed by months of accumulated experience. That’s not a better prompt. It’s a fundamentally different architecture.

How Apollo Space’s Twelve Agents Map to This Architecture

Every Apollo Space agent uses the same PRAO loop and the same three-tier memory system, but the implementation details vary by function.

SDR Agent: Heavy episodic memory (every outreach interaction logged), rich tool set (email, LinkedIn, CRM, calendar), fast decision loop (multiple decisions per hour during business hours).

QA Agent: Heavy procedural memory (testing procedures learned from past bugs), specialized tools (browser automation, API testing, visual regression), event-driven decision loop (triggered by new PRs and commits).

Competitor Watch Agent: Heavy semantic memory (competitor knowledge graph), information-heavy tools (web scraping, change detection, data aggregation), slow decision loop (runs on 6-hour cycles for monitoring, real-time for alerts).

Observability Agent: Minimal episodic memory (metrics are time-series, stored externally), analysis-heavy tools (metric correlation, anomaly detection, root cause analysis), real-time decision loop (continuous monitoring with configurable alert thresholds).

Meeting Digest Agent: Moderate episodic memory (past meetings and action items), focused tools (transcription, summarization, task extraction, calendar integration), event-driven loop (triggered by meeting completion).

The architecture is consistent. The configuration is specific. This is the design principle that lets us run twelve agents without twelve separate codebases: one framework, twelve configurations, with the specialization living in memory and tool sets rather than in code.
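As a rough illustration of "one framework, many configurations", three of the agents described above could be expressed as data. The `AgentConfig` fields and the tool identifiers are hypothetical; the values are paraphrased from the descriptions in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    name: str
    primary_memory: str      # which memory type the agent leans on most
    tools: tuple[str, ...]   # illustrative tool identifiers
    loop_mode: str           # how the decision loop is triggered


CONFIGS = {
    "sdr": AgentConfig(
        "SDR Agent", "episodic",
        ("email", "linkedin", "crm", "calendar"),
        "multiple decisions per hour",
    ),
    "qa": AgentConfig(
        "QA Agent", "procedural",
        ("browser_automation", "api_testing", "visual_regression"),
        "event-driven",
    ),
    "competitor_watch": AgentConfig(
        "Competitor Watch Agent", "semantic",
        ("web_scraping", "change_detection", "data_aggregation"),
        "6-hour cycles",
    ),
}
```

Under this arrangement, adding an agent means adding an entry and its tools, while the loop and memory code stay shared.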

The 80% That’s Not the Model

Here’s the thing nobody in the AI hype cycle wants to admit: the LLM is maybe 20% of what makes an agent work.

The other 80% is the memory system (how you store, retrieve, and maintain the agent’s knowledge), the tool layer (how the agent acts on the world reliably and safely), and the decision loop (how perception flows to reasoning flows to action flows to observation). These are systems engineering problems, not machine learning problems.

When a team tells us their agent isn’t working well, nine times out of ten the problem isn’t the model. It’s that the agent is reasoning with insufficient context because the memory retrieval is poor. Or the agent is making good decisions but can’t execute them because the tool layer is brittle. Or the agent gets stuck in loops because the decision loop doesn’t have proper divergence detection.

Swap the model from GPT-4 to Claude to Gemini, and agent performance changes maybe 10-15%. Improve the memory retrieval system, and performance changes 40-60%. That’s the gap between the narrative (AI is about models) and the reality (agents are about systems).

Build the system right, and the model is a replaceable component. Build the system wrong, and no model will save you.
