
How to Build an AI Agent: Architecture and Frameworks
A technical guide to agentic AI - how autonomous reasoning loops, tool integration, memory, and orchestration frameworks combine to turn a language model into a goal-pursuing agent.
Key facts
- An AI agent is a reasoning loop around an LLM, not a model architecture.
- ReAct and Plan-and-Execute are the two foundational agent patterns.
- Function calling is the universal contract for tool use across frontier models.
- Agent reliability degrades multiplicatively with task length.
- Prompt injection is the dominant security failure mode for tool-using agents.
- Long-term memory requires explicit summarisation and retrieval - it is not built-in.
What Agentic AI Actually Means
Agentic AI refers to systems where a large language model is wrapped in a control loop that lets it observe state, plan, call tools, and revise its approach until a goal is reached. The model is the reasoning core; the agent is the loop and the scaffolding around it.
The shift from chatbot to agent is architectural, not just behavioural: a chatbot returns one response per turn, while an agent can decide to search the web, run code, query a database, and self-correct across many turns before producing a final answer.
Core Architecture: The Reasoning Loop
Every modern agent implements some variant of the perceive-reason-act loop. The two dominant patterns are ReAct (Yao et al., 2022), which interleaves chain-of-thought reasoning with tool calls, and Plan-and-Execute (Wang et al., 2023), which separates a planner that writes a multi-step plan from an executor that carries each step out.
ReAct excels at tasks where the next action depends heavily on the previous observation - browsing, debugging, exploration. Plan-and-Execute is stronger when the goal is decomposable up front and the cost of replanning is high. Production agents typically combine both: a planner emits a sketch, and a ReAct-style executor refines each step.
- Perceive: read inputs, tool outputs, and prior context.
- Reason: produce a thought or a plan with the LLM.
- Act: emit a structured tool call (function calling).
- Observe: receive the tool result and append it to context.
- Repeat until a termination condition (goal met, budget exhausted, human handoff).
Tool Use and Function Calling
Function calling is the API contract that makes agents possible. The developer declares a set of tools as typed JSON schemas; the LLM emits a structured call (name plus arguments) that the runtime executes, returning the result back into the model's context.
Common tool categories are retrieval (vector search, SQL, web search), computation (code execution, calculators), I/O (file system, email, APIs), and computer use (mouse, keyboard, screen). Anthropic's Claude computer use (2024) and OpenAI's Operator (2025) generalised this to controlling arbitrary GUI applications.
Memory: Short-Term and Long-Term
Short-term memory is the model's context window. Long-term memory is whatever you persist between turns and retrieve on demand - usually a vector store of past interactions, plus a structured store of facts the agent has learned about its user or environment.
The standard pattern is summarise-and-store: at the end of a session the agent writes a compressed summary plus extracted facts to long-term storage, and at the start of the next session it retrieves the most relevant entries by embedding similarity. Without this, an agent has amnesia between sessions.
Frameworks: What to Build On
The framework landscape in 2026 has consolidated around a few production-ready options. OpenAI Agents SDK and Anthropic's Claude Agent SDK ship typed tool-calling, tracing, and built-in evaluation hooks tied to their own models. LangGraph models agents as explicit state graphs with checkpointing - the right choice when you need durable, resumable workflows. AutoGen targets multi-agent conversations. CrewAI focuses on role-based teams of agents.
For a single-agent task with a single model provider, start with that provider's first-party SDK. Reach for LangGraph when you need long-running workflows that span hours or days and have to survive process restarts. Reach for multi-agent frameworks only when the problem genuinely decomposes into specialised roles - otherwise the coordination overhead exceeds the benefit.
Why Agents Fail and How to Harden Them
Reliability is the central engineering problem. Per-step success rates compound multiplicatively: a 95% reliable agent step run 20 times in sequence completes successfully only ~36% of the time. Real-world agent reliability tracks task length sharply.
Hardening tactics that consistently move the needle: aggressive validation of every tool argument before execution, sandboxed execution environments for code and computer use, explicit termination criteria and budget caps, human-in-the-loop confirmation on irreversible actions, and structured evaluation against benchmarks like SWE-bench Verified, OSWorld, WebArena, and GAIA before shipping.
- Treat all retrieved content (web, email, files) as untrusted - it can carry prompt-injection payloads.
- Cap loop iterations and token budget; fail closed.
- Log every tool call with inputs, outputs, and reasoning for replay.
- Run evals on each model upgrade - capability changes are not always monotonic.
A Minimal Implementation
A working agent in pseudocode is short: define tools as a list of JSON schemas, then loop - call the model with the conversation, if it emits a tool call execute it and append the result, otherwise return the final message. Add a max-iterations guard and a budget check.
Production agents add tracing, retries with backoff, structured logging, evaluation harnesses, secret management for tool credentials, and a sandbox for any tool that executes code. None of those change the shape of the core loop - they wrap it.
Frequently asked
What is the difference between an AI agent and a chatbot?
+
A chatbot returns one response per user message. An agent runs a control loop that lets the model call tools, observe results, and continue reasoning across many turns before producing a final answer.
Which framework should I use to build an AI agent?
+
For single-agent tasks, start with the first-party SDK of your model provider (OpenAI Agents SDK or Claude Agent SDK). Use LangGraph for long-running, resumable workflows. Use multi-agent frameworks only when the problem clearly decomposes into specialised roles.
How do you give an AI agent memory?
+
Persist a summary of each session plus extracted facts to a vector or structured store, and retrieve the most relevant entries at the start of the next session using embedding similarity. The model's context window provides short-term memory; long-term memory is application code.
Why are AI agents unreliable?
+
Errors compound across steps - a 95% per-step success rate is only ~36% over 20 steps. Combine that with prompt injection from untrusted inputs and brittle long-horizon planning, and reliability becomes the hardest part of shipping an agent.
What is the ReAct pattern?
+
ReAct (Yao et al., 2022) interleaves Reasoning steps with Acting steps in a single loop: the model produces a thought, picks a tool, observes the result, and reasons again. It is the most widely-used scaffolding for modern agents.
Sources & further reading
ReAct: Synergizing Reasoning and Acting in Language Models
Yao et al., 2022
Plan-and-Solve Prompting
Wang et al., 2023
Computer Use (Claude 3.5 Sonnet)
Anthropic
OpenAI Agents SDK Documentation
OpenAI
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez et al., 2024
GAIA: a Benchmark for General AI Assistants
Mialon et al., Meta
Continue in this series
Foundations
Machine Learning: The Foundations
Neural Networks
Deep Learning: Hierarchical Representation from Raw Data
Architecture
The Transformer Architecture
LLMs
Large Language Models: How They Work and Where They Fail
Cross-Modal
Multimodal AI: Text, Vision, Audio, Video, and Action
Learning from Reward
Reinforcement Learning: From AlphaGo to RLHF
