::deep-dive

Agents and Tool Use

ReAct, Toolformer, MCP, and the architectures of language models that act in the world

An agent is a language model that takes actions in the world (calling tools, browsing the web, executing code, writing files, making API requests) and iterates on the results. The foundational paper is ReAct (Yao et al., 2022), which interleaved reasoning steps with action steps and established the template for nearly every subsequent agent framework. Toolformer (Schick et al., 2023) showed how a model could learn to invoke tools through self-supervised fine-tuning. The years since have produced a proliferation of agent frameworks, vendor APIs, and protocols (LangChain, AutoGPT, BabyAGI, OpenAI's Assistants API, Anthropic's tool use, the Model Context Protocol or MCP), most of which iterate on the same core loop.
A doctorate-grade learner needs to understand: the ReAct-style interleaving of reasoning and action; the difference between function calling (the model emits a structured call that the harness executes) and code execution (the model writes arbitrary code that the harness executes); the planning literature (Tree of Thoughts, Reflexion, and related work on search over agent trajectories); the failure modes specific to agents (context window degradation over long trajectories, error cascades when an early step is wrong, failure to recover when the environment diverges from the training distribution); the AutoGPT-era postmortems (the early autonomous-agent hype produced systems that did not generalize well, and the field has learned from this); and the modern protocol layer (MCP for defining and serving tools, the OpenAI function-calling and Anthropic tool-use schemas, and how to design tool surfaces that models actually use well).
The Anthropic engineering blog and the Model Context Protocol specification are required reading for understanding how the current frontier labs think about tool surfaces. Agents are the area of AI where capability gains translate most directly into economic value, but also where reliability and safety problems are most acute. A doctorate-grade understanding requires you to have actually built an agent loop, watched it fail, debugged it, and built a more robust one.
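
The core loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation from the ReAct paper: the `Action: name[arg]` text format, the `add` tool, and `scripted_model` (a stand-in for a real LLM call, so the sketch runs offline) are all assumptions made for this example.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical text protocol: the model writes "Action: tool_name[argument]"
# to call a tool and "Final Answer: ..." to stop.
ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.*)$", re.MULTILINE)


def add(arg: str) -> str:
    """Toy tool: sum a comma-separated list of integers."""
    return str(sum(int(x) for x in arg.split(",")))


TOOLS: Dict[str, Callable[[str], str]] = {"add": add}


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM call (assumption): replies based on the transcript so far."""
    if "Observation:" not in transcript:
        return "Thought: I should add the two numbers.\nAction: add[17, 25]"
    last_obs = transcript.rsplit("Observation:", 1)[1].strip().splitlines()[0]
    return f"Thought: The tool returned the sum.\nFinal Answer: {last_obs}"


def react_loop(
    question: str,
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], str]:
    """Alternate model steps and tool observations until a final answer or the step cap."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip(), transcript
        action = ACTION_RE.search(step)
        if action is None:
            observation = "error: could not parse an Action line"
        elif action.group(1) not in tools:
            observation = f"error: unknown tool {action.group(1)!r}"
        else:
            try:
                observation = tools[action.group(1)](action.group(2))
            except Exception as exc:  # feed tool errors back instead of crashing
                observation = f"error: {exc}"
        transcript += f"Observation: {observation}\n"
    return None, transcript  # step budget exhausted without a final answer


answer, trace = react_loop("What is 17 + 25?", scripted_model, TOOLS)
print(trace)
```

Two choices in this sketch map onto the failure modes listed above: errors are returned to the model as observations so it has a chance to recover, and `max_steps` bounds the trajectory so a confused model cannot loop indefinitely.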

::reading path · in order

::01 · paper
~4h
ReAct: Synergizing Reasoning and Acting in Language Models — Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao (2022)
The foundational reasoning-and-acting paper. Every modern agent framework descends from this loop.
::02 · paper
~5h
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al. (Meta 2023)
Self-supervised tool-use fine-tuning. Foundational for understanding how models can be taught to call APIs.
::03 · paper
~4h
Tree of Thoughts: Deliberate Problem Solving with Large Language Models — Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan (2023)
Search over reasoning trajectories. Generalizes chain-of-thought prompting into deliberate, tree-structured planning; the ideas carry over to search over agent trajectories.
::04 · paper
~4h
Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao (2023)
Self-reflection as a learning mechanism for agents. Useful both as technique and as a study of agent failure modes.
::05 · paper
~5h
Voyager: An Open-Ended Embodied Agent with Large Language Models — Wang et al. (NVIDIA 2023)
Long-horizon agent in Minecraft. Demonstrates skill library acquisition and lifelong learning patterns.
::06 · blog
~3h
Anthropic — Building Effective Agents (engineering blog, 2024)
Practical doctrine from a frontier lab. The workflows-vs-agents distinction is load-bearing.
::07 · blog
~6h
Model Context Protocol specification — Anthropic (modelcontextprotocol.io)
MCP is the emerging standard for tool servers and connectors. Read the spec and implement a minimal server.
::08 · blog
~4h
OpenAI Function Calling and Assistants API documentation
The function-calling schema and assistants framework. The other major industrial approach.
::09 · blog
~3h
Anthropic Tool Use documentation (docs.anthropic.com)
Claude's tool-use schema. Read alongside OpenAI's function calling for cross-vendor literacy.
::10 · paper
~5h
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez et al. (Princeton 2023)
The canonical benchmark for software-engineering agents. Read the paper and inspect a few of the trajectories that succeeded and failed.
::11 · blog
~4h
AutoGPT, BabyAGI, and the agent-hype-cycle postmortems (various community writeups, 2023-2024)
Understand why the early autonomous-agent hype produced unreliable systems. Read several community postmortems to internalize the failure modes.
::12 · code
~15h
LangChain and LangGraph documentation (or alternative agent frameworks)
Read the docs, understand the abstractions, then build something without them. Frameworks are useful for understanding the design space.
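
For the cross-vendor reading in items 08 and 09, it can help to see the same tool expressed in both shapes. The sketch below assumes the layouts as I understand them from the vendor documentation at the time of writing: Anthropic tools carry `name`, `description`, and `input_schema`, while OpenAI Chat Completions tools wrap `name`, `description`, and `parameters` under a `function` key. The `read_file` tool and the `to_openai_function` helper are hypothetical; check the current docs before relying on either layout.

```python
from typing import Any, Dict

# Hypothetical tool definition in the Anthropic tool-use shape.
READ_FILE_ANTHROPIC: Dict[str, Any] = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file from the project workspace and return its "
        "contents. Fails if the file does not exist."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the workspace root, e.g. 'src/main.py'.",
            }
        },
        "required": ["path"],
    },
}


def to_openai_function(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Re-wrap an Anthropic-style tool as an OpenAI-style function tool.

    Both shapes describe arguments with JSON Schema, so only the envelope changes.
    """
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        },
    }


READ_FILE_OPENAI = to_openai_function(READ_FILE_ANTHROPIC)
print(READ_FILE_OPENAI["function"]["name"])
```

The point of the exercise is that the argument schema is shared; most of the design effort goes into the name, the description, and the parameter descriptions, which is what the model actually reads.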

::exercises · build · derive · reproduce

01 Implement a minimal ReAct loop from scratch — a model, a tool registry, and a parser — without any agent framework.
02 Build an MCP server that exposes three real tools (search, file read, code execution) and connect it to Claude.
03 Run an agent on a long-horizon task (e.g., SWE-bench-style issue resolution) and analyze where it fails.
04 Implement Reflexion-style verbal self-reflection on top of your ReAct loop. Measure improvement on a benchmark.
05 Design a tool schema that minimizes agent confusion (clear names, narrow parameters, examples) and A/B test against a deliberately bad version.
06 Write a post-mortem of one AutoGPT-era system, identifying what went wrong and what modern frameworks fix.
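
As a starting point for exercise 04, the control flow of Reflexion-style self-reflection can be sketched as a retry loop with a verbal memory. This is a simplified illustration rather than the paper's implementation: `reflexion_loop` and the three stub functions are hypothetical, and in a real setup `attempt` would be your agent loop, `evaluate` a test suite or benchmark scorer, and `reflect` another model call.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(
    task: str,
    attempt: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str, str], str],
    max_trials: int = 3,
) -> Tuple[Optional[str], int, List[str]]:
    """Retry a task, carrying written reflections on past failures into each new attempt."""
    memory: List[str] = []  # verbal lessons accumulated across trials
    for trial in range(1, max_trials + 1):
        output = attempt(task, memory)
        passed, feedback = evaluate(output)
        if passed:
            return output, trial, memory
        memory.append(reflect(task, output, feedback))
    return None, max_trials, memory


# Stubs (assumptions) so the sketch runs without a model: the "agent" only
# reverses the input once it has at least one reflection in memory.
def stub_attempt(task: str, memory: List[str]) -> str:
    return task[::-1] if memory else task


def stub_evaluate(output: str) -> Tuple[bool, str]:
    if output == "tac":
        return True, ""
    return False, f"expected 'tac', got {output!r}"


def stub_reflect(task: str, output: str, feedback: str) -> str:
    return f"I answered {output!r} but {feedback}. Next time, reverse the input first."


result, trials, lessons = reflexion_loop("cat", stub_attempt, stub_evaluate, stub_reflect)
print(result, trials, lessons)
```

To measure improvement, compare success rate at `max_trials=1` (no reflection) against higher trial budgets on the same task set.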

::milestones · observable

▲ You have built an agent loop from scratch.
▲ You can explain why function calling is more reliable than free-form tool invocation.
▲ You have built an MCP server.
▲ You can debug an agent that is failing on long trajectories.
▲ You can design a tool surface that an agent will actually use well.
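
One way to make the function-calling milestone concrete: because a structured call is JSON with a declared schema, the harness can reject a malformed call before anything executes and return a precise error to the model. The sketch below is a hypothetical harness, not any vendor's API; `validate_arguments`, `dispatch`, and the `repeat` tool are invented for illustration, and the validator covers only a small subset of JSON Schema.

```python
import json
from typing import Any, Dict, List

# Subset of JSON Schema types this sketch understands.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(arguments: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the call is valid."""
    errors: List[str] = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in arguments:
            errors.append(f"missing required argument: {key}")
    for key, value in arguments.items():
        if key not in properties:
            errors.append(f"unexpected argument: {key}")
            continue
        type_name = properties[key].get("type")
        expected = JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it explicitly for non-boolean types.
        stray_bool = isinstance(value, bool) and type_name != "boolean"
        if expected is not None and (stray_bool or not isinstance(value, expected)):
            errors.append(f"argument {key} should be of type {type_name}")
    return errors


def dispatch(raw_call: str, registry: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Parse, validate, and only then execute a structured tool call."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return {"ok": False, "errors": [f"invalid JSON: {exc.msg}"]}
    if not isinstance(call, dict):
        return {"ok": False, "errors": ["call must be a JSON object"]}
    name = call.get("name")
    if not isinstance(name, str) or name not in registry:
        return {"ok": False, "errors": [f"unknown tool: {name}"]}
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        return {"ok": False, "errors": ["arguments must be a JSON object"]}
    errors = validate_arguments(arguments, registry[name]["schema"])
    if errors:
        return {"ok": False, "errors": errors}
    return {"ok": True, "result": registry[name]["fn"](**arguments)}


REGISTRY: Dict[str, Dict[str, Any]] = {
    "repeat": {
        "fn": lambda text, times: text * times,
        "schema": {
            "type": "object",
            "properties": {"text": {"type": "string"}, "times": {"type": "integer"}},
            "required": ["text", "times"],
        },
    }
}

print(dispatch('{"name": "repeat", "arguments": {"text": "ab", "times": 2}}', REGISTRY))
print(dispatch('{"name": "repeat", "arguments": {"text": "ab"}}', REGISTRY))
```

With free-form invocation, the equivalent checks depend on ad hoc text parsing, so a near-miss from the model tends to surface as a silent misparse rather than an error message it can act on.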