Build a Deep Research Agent

Free spec to build a deep research agent — Anthropic's multi-agent research pattern with orchestrator, parallel sub-agents, verifier, and memory. Framework-agnostic: pick the harness in STACK.md (Claude Agent SDK, LangGraph, OpenAI Agents, or CrewAI). Includes typed contracts, deliberate context engineering, eval harness, and production hardening.

DRA-1

Lay the foundation for your deep research agent

DRA-2

Define the typed messages the orchestrator and sub-agents exchange

DRA-3

Build the orchestrator's planner that breaks a question into sub-agent tasks

DRA-4

Build the sub-agent that handles one research task end-to-end

DRA-5

Fan out sub-agents in parallel without one failure blocking the rest

DRA-6

Aggregate the sub-agents' findings into one report, flagging conflicts

DRA-7

Add a verifier sub-agent that audits the aggregated report

DRA-8

Give your deep research agent memory across follow-up questions

DRA-9

Build an eval harness that grades your deep research agent on real questions

DRA-10

Ship the deep research agent to production: retries, traces, circuit breakers

How to use this spec

Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
The "Open in" button sends the same prompt directly to v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.

What you're building

Goal 1: Ship a working multi-agent deep-research pipeline: an orchestrator that decomposes one research question into a typed plan, dispatches N sub-agents in parallel with isolated contexts, aggregates with conflict-surfacing, verifies the result, and survives production with retries + traces + circuit-breakers.
This is Anthropic's deep-research agent pattern, packaged as a tool-agnostic starter kanban. Make the five stack decisions once in STACK.md (language, harness, typed-contracts library, async primitive, memory substrate); every later card carries per-variant snippets. Acceptance bar: the eval harness passes on 8 seed fixtures with p50 latency < 30s, p50 tokens < 80k, a ≥80% per-fixture sub-agent success rate, a verification status that is never 'fail', and a retried-429 case that shows retries=1 in the trace.
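
A minimal sketch of what the typed contracts could look like, assuming TypeScript and no validation library. Only the four type names (OrchestratorPlan, SubagentTask, AggregatedReport, VerificationResult) and the status values come from this page; every field name and the `parsePlan` helper are hypothetical placeholders for whatever your typed-contracts library gives you.

```typescript
// Hypothetical field names throughout; only the type names and status values come from the spec.
type VerificationStatus = "ok" | "warn" | "fail";

interface SubagentTask {
  id: string;
  question: string; // the one sub-question this sub-agent owns
  allowedTools: string[]; // tool names this sub-agent may call
}

interface OrchestratorPlan {
  question: string; // the original research question
  tasks: SubagentTask[];
}

interface Finding {
  taskId: string;
  claim: string;
  sources: string[];
}

interface AggregatedReport {
  findings: Finding[];
  conflicts: { topic: string; taskIds: string[] }[];
}

interface VerificationResult {
  status: VerificationStatus;
  notes: string[];
}

// Planner output arrives as untyped JSON, so check its shape before dispatching anything.
function parsePlan(raw: unknown): OrchestratorPlan {
  if (typeof raw !== "object" || raw === null) throw new Error("plan must be an object");
  const obj = raw as Record<string, unknown>;
  if (typeof obj.question !== "string") throw new Error("plan.question must be a string");
  if (!Array.isArray(obj.tasks) || obj.tasks.length === 0) {
    throw new Error("plan.tasks must be a non-empty array");
  }
  const tasks = obj.tasks.map((t: unknown, i: number): SubagentTask => {
    if (typeof t !== "object" || t === null) throw new Error(`tasks[${i}] must be an object`);
    const task = t as Record<string, unknown>;
    if (typeof task.id !== "string" || typeof task.question !== "string") {
      throw new Error(`tasks[${i}] needs string id and question`);
    }
    const tools = task.allowedTools;
    if (!Array.isArray(tools) || !tools.every((x: unknown) => typeof x === "string")) {
      throw new Error(`tasks[${i}].allowedTools must be an array of strings`);
    }
    return { id: task.id, question: task.question, allowedTools: tools as string[] };
  });
  return { question: obj.question, tasks };
}
```

Rejecting a malformed plan at this boundary keeps a bad planner response from turning into N wasted sub-agent runs.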

Architecture

flowchart LR
  User[User Request] --> Planner[Planner<br/>OrchestratorPlan]
  Planner --> Dispatcher{Parallel Dispatcher}
  Dispatcher -->|SubagentTask 1| W1[Sub-agent]
  Dispatcher -->|SubagentTask 2| W2[Sub-agent]
  Dispatcher -->|SubagentTask N| WN[Sub-agent]
  W1 --> Aggregator[Aggregator<br/>AggregatedReport]
  W2 --> Aggregator
  WN --> Aggregator
  Aggregator --> Verifier[Verification Agent<br/>VerificationResult]
  Verifier --> Memory[(Memory<br/>per-thread state)]
  Memory --> Planner
  Verifier --> Output[Final Report]

The three load-bearing decisions: parallel dispatch (a concurrent fan-out over sub-agents, such as Promise.allSettled in TypeScript so that one rejection does not discard the other results, rather than a sequential loop), isolated sub-agent contexts (deliberate context engineering: each sub-agent gets its own conversation), and conflict surfacing in aggregation (when sub-agents disagree, both outputs are presented with attribution, never silently merged).
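
The three decisions can be sketched together in a few dozen lines. This is an illustration under stated assumptions, not the spec's implementation: the `RunSubagent` signature, the `topic` field, and both function names are hypothetical, and the string comparison in `surfaceConflicts` is a crude stand-in for real disagreement detection. The sketch uses Promise.allSettled rather than Promise.all, because Promise.all rejects as soon as one sub-agent fails and the other results are lost.

```typescript
interface SubagentTask { id: string; question: string }
interface Finding { taskId: string; topic: string; claim: string }
type Message = { role: "user" | "assistant"; content: string };

// Hypothetical signature: an async sub-agent takes one task plus its own message history.
type RunSubagent = (task: SubagentTask, context: Message[]) => Promise<Finding>;

interface DispatchResult {
  findings: Finding[];
  failures: { taskId: string; reason: string }[];
}

async function dispatchAll(tasks: SubagentTask[], run: RunSubagent): Promise<DispatchResult> {
  // Every sub-agent is started before any is awaited, and each call receives
  // a freshly built context array, so no conversation history is shared.
  const settled = await Promise.allSettled(
    tasks.map((task) => run(task, [{ role: "user", content: task.question }])),
  );
  const result: DispatchResult = { findings: [], failures: [] };
  settled.forEach((outcome, i) => {
    const taskId = tasks[i]?.id ?? `task-${i}`;
    if (outcome.status === "fulfilled") result.findings.push(outcome.value);
    else result.failures.push({ taskId, reason: String(outcome.reason) });
  });
  return result;
}

interface Conflict {
  topic: string;
  claims: { taskId: string; claim: string }[];
}

// Group findings by topic. A topic with more than one distinct claim is reported
// as a conflict that keeps every claim and the task it came from.
function surfaceConflicts(findings: Finding[]): Conflict[] {
  const byTopic = new Map<string, Finding[]>();
  for (const f of findings) {
    const group = byTopic.get(f.topic) ?? [];
    group.push(f);
    byTopic.set(f.topic, group);
  }
  const conflicts: Conflict[] = [];
  byTopic.forEach((group, topic) => {
    const distinct = new Set(group.map((f) => f.claim.trim().toLowerCase()));
    if (distinct.size > 1) {
      conflicts.push({ topic, claims: group.map(({ taskId, claim }) => ({ taskId, claim })) });
    }
  });
  return conflicts;
}
```

Failures are returned alongside findings instead of being thrown, so the aggregator can still build a report from the partial set.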

When multi-agent makes sense (and when it doesn't)

Multi-agent is not free. Anthropic's own data: their multi-agent research system uses ~15× more tokens than a chat interaction. Reach for it when the boundary earns the cost; otherwise ship a single agent.

✓ Use multi-agent when

Sub-agents need different tools. A billing agent calls Stripe, a technical agent reads logs, a sales agent queries a CRM. Bundling these into one agent inflates the tool surface and dilutes the system prompt.
Parallelism saves real wall-clock. Three sub-agents fan out on independent subqueries (research three competitors, classify three docs) and finish in the time of the slowest one.
Per-domain memory boundaries matter. Keep sales-conversation memory out of support-conversation context. Context bleeding from one domain into another is a silent failure mode.
You have a real eval pipeline. Multi-agent failure modes (silent state corruption, classifier collapse) are only catchable with measurement.
The value justifies the token cost. Anthropic reports that its multi-agent research system outperformed a single-agent setup by 90.2% on its internal research eval. If the answer matters that much, the ~15× token spend is fine.

⚠ Ship a single agent when

One sub-agent handles 80%+ of cases. Don't add a second sub-agent for edge cases; add an escalation fallback to the single agent.
Token cost matters more than latency. Multi-agent burns ~15× the tokens of a chat interaction. If your unit economics are sensitive, the math will not work.
You don't have an eval set yet. Multi-agent introduces classifier-collapse and silent sub-agent failures — without measurement, you can't tell if a change helped or broke routing.
Your team is one person. Observability adds at least an engineer-week of infra (tracing, cost tracking, concurrency probe) before it's safe to run unattended.
You haven't shipped a single-agent version first. Single → multi is the right migration order, never the reverse.
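
To make the token-cost point concrete, here is a back-of-envelope calculation. Only the ~15× multiplier comes from this page; the chat token count, the price, and the monthly volume are hypothetical inputs you should replace with your own measurements and your provider's current pricing.

```typescript
// All inputs are illustrative assumptions except the ~15x multiplier quoted on this page.
interface CostInputs {
  chatTokensPerRun: number; // tokens a plain chat answer would use for the same question
  multiplier: number; // 1 for chat, ~15 for the multi-agent pipeline
  usdPerMillionTokens: number; // blended input/output price (assumed)
  runsPerMonth: number;
}

function monthlyCostUsd(inputs: CostInputs): number {
  const { chatTokensPerRun, multiplier, usdPerMillionTokens, runsPerMonth } = inputs;
  const tokensPerMonth = chatTokensPerRun * multiplier * runsPerMonth;
  return (tokensPerMonth * usdPerMillionTokens) / 1_000_000;
}

const shared = { chatTokensPerRun: 5_000, usdPerMillionTokens: 5, runsPerMonth: 10_000 };
const chatCost = monthlyCostUsd({ ...shared, multiplier: 1 }); // 250
const multiAgentCost = monthlyCostUsd({ ...shared, multiplier: 15 }); // 3750
console.log({ chatCost, multiAgentCost });
```

With these made-up numbers the same workload goes from $250 to $3,750 a month, which is the gap the accuracy gain has to pay for.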

What the community says

"Our research system uses 15× more tokens than a single Claude conversation. The boundary has to earn the cost — for hard, parallelizable research it does; for simple Q&A it doesn't."

Anthropic engineering · "How we built our multi-agent research system" · anthropic.com

"Hit a wall debugging two agents that kept stomping on each other's files. The fix that worked: full worktree isolation per subagent. The race condition was silent — passed every unit test, then corrupted state on the third concurrent request."

u/idoman · r/LocalLLM · 2026-05-18 · reddit.com

"How are you managing multiple coding agents in parallel? I keep hitting the same problem — they pick the same files, overwrite each other, and there's no good way to coordinate handoffs. Worktrees help but it's still rough."

r/codex · 2026-05-18 · reddit.com

How to know it's working

Ship the spec, then measure on these criteria (the eval harness task grades them):

Eval harness reports a ≥80% sub-agent success rate on each of the 8 seed fixtures (evals/fixtures.jsonl)
p50 end-to-end latency < 30s on the 8 fixtures with ANTHROPIC_API_KEY set
p50 token usage < 80,000 tokens per run (within Anthropic's 15× chat-multiplier headroom for 8 subtasks)
Parallel fan-out completes within 1.5× the slowest sub-agent's wall-clock time (proves real parallelism, not sequence)
Verification status is 'ok' or 'warn' on 100% of fixtures, never 'fail'
Retry layer succeeds on a forced-429 test case with retries=1 in the trace event
Circuit breaker opens after 3 consecutive sub-agent timeouts in a run and the aggregator still produces a report from the partial set
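
The last two criteria can be sketched as two small primitives. This is a hedged illustration, not the spec's hardening layer: `withRetry`, `CircuitBreaker`, the `TraceEvent` shape, and the default backoff values are all hypothetical, and the code assumes a rate-limit error carries a numeric `status` of 429, which depends on your client library.

```typescript
interface TraceEvent { taskId: string; retries: number; ok: boolean }

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumption: rate-limit errors expose a numeric `status` field equal to 429.
function isRateLimit(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 429;
}

async function withRetry<T>(
  taskId: string,
  call: () => Promise<T>,
  trace: TraceEvent[],
  maxRetries = 3,
  baseDelayMs = 200,
): Promise<T> {
  let retries = 0;
  for (;;) {
    try {
      const value = await call();
      trace.push({ taskId, retries, ok: true }); // one trace event per task, with the retry count
      return value;
    } catch (err) {
      if (!isRateLimit(err) || retries >= maxRetries) {
        trace.push({ taskId, retries, ok: false });
        throw err;
      }
      await sleep(baseDelayMs * 2 ** retries); // exponential backoff: 200ms, 400ms, 800ms by default
      retries += 1;
    }
  }
}

// Opens after `threshold` consecutive timeouts; any success resets the count.
class CircuitBreaker {
  private consecutiveTimeouts = 0;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  isOpen(): boolean {
    return this.consecutiveTimeouts >= this.threshold;
  }

  recordSuccess(): void {
    this.consecutiveTimeouts = 0;
  }

  recordTimeout(): void {
    this.consecutiveTimeouts += 1;
  }
}
```

In a run, the dispatcher would check `isOpen()` before launching further sub-agents and pass whatever findings it already has to the aggregator, which is how a report can still be produced from the partial set.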

Sources

Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.
