Build a Deep Research Agent
Free spec to build a deep research agent — Anthropic's multi-agent research pattern with orchestrator, parallel sub-agents, verifier, and memory. Framework-agnostic: pick the harness in STACK.md (Claude Agent SDK, LangGraph, OpenAI Agents, or CrewAI). Includes typed contracts, deliberate context engineering, eval harness, and production hardening.
How to use this spec
- Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
- Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
- Open in jumps the same prompt directly into v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.
What you're building
Goal 1: Ship a working multi-agent deep-research pipeline: an orchestrator that decomposes one research question into a typed plan, dispatches N sub-agents in parallel with isolated contexts, aggregates with conflict-surfacing, verifies the result, and survives production with retries + traces + circuit-breakers.
Anthropic's deep-research agent pattern as a tool-agnostic starter kanban. Pick your stack once in STACK.md (language / harness / typed-contracts library / async primitive / memory substrate); every later card carries per-variant snippets. Acceptance bar: eval harness passes on 8 seed fixtures with p50 latency < 30s, p50 tokens < 80k, ≥80% per-fixture sub-agent success rate, verification never 'fail', and a retried-429 case shows retries=1 in the trace.
Architecture
flowchart LR
User[User Request] --> Planner[Planner<br/>OrchestratorPlan]
Planner --> Dispatcher{Parallel Dispatcher}
Dispatcher -->|SubagentTask 1| W1[Sub-agent]
Dispatcher -->|SubagentTask 2| W2[Sub-agent]
Dispatcher -->|SubagentTask N| WN[Sub-agent]
W1 --> Aggregator[Aggregator<br/>AggregatedReport]
W2 --> Aggregator
WN --> Aggregator
Aggregator --> Verifier[Verification Agent<br/>VerificationResult]
Verifier --> Memory[(Memory<br/>per-thread state)]
Memory --> Planner
Verifier --> Output[Final Report] The three load-bearing decisions: parallel dispatch (Promise.all over sub-agents, not sequential), isolated sub-agent contexts (deliberate context engineering — each sub-agent gets its own conversation), and conflict surfacing in aggregation (when sub-agents disagree, both outputs presented with attribution, never silently merged).
When multi-agent makes sense (and when it doesn't)
Multi-agent is not free. Anthropic's own data: their research system uses ~15× more tokens than a single-agent baseline. Reach for it when the boundary earns the cost — otherwise ship a single agent.
✓ Use multi-agent when
- Sub-agents need different tools. Billing uses Stripe, technical reads logs, sales hits a CRM. Bundling these into one agent inflates the tool surface and dilutes the system prompt.
- Parallelism saves real wall-clock. Three sub-agents fan out on independent subqueries (research three competitors, classify three docs) and finish in the time of the slowest one.
- Per-domain memory boundaries matter. Keep sales-conversation memory out of support-conversation context. Context bleed at the model is the silent killer.
- You have a real eval pipeline. Multi-agent failure modes (silent state corruption, classifier collapse) are only catchable with measurement.
- The value justifies the token cost. Anthropic's research system: +90.2% accuracy over single-agent on their eval. If the answer matters that much, the 15× spend is fine.
⚠ Ship a single agent when
- One sub-agent handles 80%+ of cases. Don't add a second sub-agent for edge cases — add an
escalatefallback to the single agent. - Token cost matters more than latency. Multi-agent burns ~15× more tokens. If your unit economics are sensitive, the math will not work.
- You don't have an eval set yet. Multi-agent introduces classifier-collapse and silent sub-agent failures — without measurement, you can't tell if a change helped or broke routing.
- Your team is one person. Observability adds at least an engineer-week of infra (tracing, cost tracking, concurrency probe) before it's safe to run unattended.
- You haven't shipped a single-agent version first. Single → multi is the right migration order, never the reverse.
What the community says
How to know it's working
Ship the spec, then measure on these criteria (the eval harness task grades them):
- Eval harness passes ≥80% sub-agent success rate per fixture across 8 seed inputs (evals/fixtures.jsonl)
- p50 end-to-end latency < 30s on the 8 fixtures with ANTHROPIC_API_KEY set
- p50 token usage < 80,000 tokens per run (within Anthropic's 15× chat-multiplier headroom for 8 subtasks)
- Parallel fan-out completes within 1.5× the slowest sub-agent's wall-clock time (proves real parallelism, not sequence)
- Verification status is 'ok' or 'warn' on 100% of fixtures, never 'fail'
- Retry layer succeeds on a forced-429 test case with retries=1 in the trace event
- Circuit breaker opens after 3 consecutive sub-agent timeouts in a run and the aggregator still produces a report from the partial set
Sources
Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.
- ↗ How we built our multi-agent research system Anthropic Engineering
- ↗ Building effective agents Anthropic Engineering
- ↗ Multi-agent systems LangChain Docs
- ↗ Multi-agent handoffs LangChain Docs
- ↗ Workflows and agents LangGraph Docs
- ↗ Build AI agents on Cloudflare Cloudflare Blog
- ↗ financial_research_agent (example) openai/openai-agents-python · GitHub
- ↗ langgraph-supervisor-py langchain-ai · GitHub
- ↗ Orchestrator-workers pattern (notebook) anthropics/claude-cookbooks · GitHub
- ↗ CrewAI crewAIInc · GitHub
- ↗ Multi-agent systems discussion (Sept 2025) Hacker News
- ↗ Multi-agent debate (June 2025) Hacker News
- ↗ Don't build multi-agents Cognition Blog
- ↗ How and when to build multi-agent systems LangChain Blog
- ↗ Multi-agent context engineering @dexhorthy on X
- ↗ Anthropic's multi-agent system explained (talk) YouTube
- ↗ I audited CrewAI's default patterns for token efficiency (43/100) DEV · @garybotlington
- ↗ LangGraph vs CrewAI in production markaicode.com
Build this in your codebase tonight
Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.