Build an AI Coding Agent

Free spec to build an AI coding agent — the edit-test-verify loop behind Aider, OpenHands, SWE-agent, and Claude Code. Framework-agnostic: STACK.md picks the model, diff strategy, test runner, and guardrail substrate. Includes a repo indexer, planner, fallback patching ladder, parsed test feedback, allowlist guardrails, and an eval harness.

ACA-1: Lay the foundation for your AI coding agent
ACA-2: Define the typed contracts the loop's modules will exchange
ACA-3: Build the repo indexer that ranks files by relevance to the task
ACA-4: Build the planner that turns a task into a typed edit plan
ACA-5: Build the diff-apply tool with a fallback patching ladder
ACA-6: Build the test-runner tool that returns parsed pass/fail
ACA-7: Wire the edit-test-verify loop and ship the first end-to-end run
ACA-8: Add guardrails so the agent is safe to run unattended
ACA-9: Build an eval harness that grades your AI coding agent on a fixed set of bugs
ACA-10: Ship the AI coding agent to production: retries, traces, cost logs, circuit breaker

How to use this spec

Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
Hit "Copy as Prompt" in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally. Paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
The "Open in" menu sends the same prompt directly into v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.

What you're building

Goal 1: Ship a working AI coding agent built on the edit-test-verify loop. A repo indexer ranks files, a planner produces a typed edit plan, a diff-apply tool with a nine-strategy fallback ladder makes the edits, and a test-runner returns parsed pass/fail results. The loop iterates on failures until the tests pass, and it ships with guardrails, an eval harness, and production-grade tracing.
Aider, OpenHands, SWE-agent, and Claude Code all converge on the same spine: read repo → plan → apply diff → run tests → read output → iterate. This kanban builds that loop as a framework-agnostic CLI. You make five decisions once in STACK.md (language, model, diff_strategy, test_runner, guardrail_substrate), and every later card carries per-variant snippets. Acceptance bar: the eval harness passes ≥75% of 8 hand-crafted bug-fix scenarios with p50 iterations < 6 and a parallelism-safe run trace; the circuit breaker opens after 3 consecutive same-signature failures and exits cleanly; and the regression run stays within ±5% of evals/baseline.json after the production-hardening pass.
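The spine described above (plan, apply diff, run tests, iterate on failures) can be sketched in a few lines. This is a minimal illustration, not the spec's implementation: the `plan`, `apply_diff`, and `run_tests` callables and the fields on `RunResult` are assumptions made for the example, and the default of 12 iterations mirrors the 12-iteration max mentioned in the acceptance criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunResult:
    status: str                  # "success" or "halted"
    iterations: int
    reason: Optional[str] = None


def run_loop(
    task: str,
    plan: Callable[[str, List[str]], str],   # hypothetical: (task, failures) -> edit plan
    apply_diff: Callable[[str], bool],       # hypothetical: True if the edit applied
    run_tests: Callable[[], List[str]],      # hypothetical: failure messages, empty = pass
    max_iterations: int = 12,
) -> RunResult:
    """Iterate plan -> apply -> test until tests pass or the budget runs out."""
    failures: List[str] = []
    for iteration in range(1, max_iterations + 1):
        edit_plan = plan(task, failures)     # previous failures feed the next plan
        if not apply_diff(edit_plan):
            failures = ["diff did not apply"]
            continue
        failures = run_tests()
        if not failures:
            return RunResult("success", iteration)
    return RunResult("halted", max_iterations, reason="max_iterations")
```

Passing the three tools in as callables keeps the loop independent of the model, diff strategy, and test runner, which is one way to honor the framework-agnostic goal.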

Architecture

flowchart TB
  User[User Task] --> Index[Repo Indexer<br/>RepoIndex + RelevanceRanking]
  Index --> Planner[Planner<br/>EditPlan]
  Planner --> Loop{Edit-Test-Verify Loop}
  Loop -->|Diff| Apply[Diff-Apply Tool<br/>9-strategy fallback ladder]
  Apply -->|ApplyResult + commit_sha| Loop
  Loop -->|verification_command| Tests[Test-Runner Tool<br/>parsed TestResult]
  Tests -->|passed/failed/failures| Loop
  Loop -->|failures + prev plan| Planner
  Loop -->|status=success| Output[RunResult + commit chain]
  Loop -->|max_iterations or circuit_breaker| Halt[RunResult halted]
  Guards[Guardrails<br/>touched-file allowlist<br/>forbidden bash<br/>max iterations] -.->|PreToolUse| Apply
  Guards -.->|PreToolUse| Tests
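The Test-Runner Tool node above hands the loop a parsed TestResult with passed/failed/failures rather than raw output. As a rough sketch of what that parsing could look like for pytest, the code below reads the `FAILED <nodeid> - <message>` lines from pytest's short test summary and the counts from the final summary line. The `Failure` fields and the function name `parse_pytest_output` are assumptions for illustration, and the regex does not handle every node id (parametrized ids containing spaces, for example).

```python
import re
from dataclasses import dataclass, field
from typing import List

# Matches short-summary lines such as:
# FAILED tests/test_x.py::test_y - AssertionError: assert 1 == 2
FAILED_LINE = re.compile(
    r"^FAILED (?P<file>[^:\s]+)::(?P<test>\S+)(?: - (?P<message>.*))?$"
)


@dataclass
class Failure:
    file: str
    test: str
    message: str


@dataclass
class TestResult:
    __test__ = False  # stops pytest from trying to collect this class
    passed: int
    failed: int
    failures: List[Failure] = field(default_factory=list)


def _last_count(stdout: str, word: str) -> int:
    """Return the last 'N <word>' count in the output, or 0 if absent."""
    matches = re.findall(rf"(\d+) {word}\b", stdout)
    return int(matches[-1]) if matches else 0


def parse_pytest_output(stdout: str) -> TestResult:
    """Turn raw pytest text into one Failure object per failing test."""
    failures = []
    for line in stdout.splitlines():
        m = FAILED_LINE.match(line.strip())
        if m:
            failures.append(Failure(m["file"], m["test"], m["message"] or ""))
    failed = max(_last_count(stdout, "failed"), len(failures))
    return TestResult(_last_count(stdout, "passed"), failed, failures)
```

Only the compact Failure list needs to go back to the planner; the raw stdout can be kept in the trace for debugging.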

The three load-bearing decisions:

Structured-output planning before any edit. The planner returns a typed EditPlan, never free text, following Anthropic's "explore first, then plan, then code".
Flexible patching with a nine-strategy fallback ladder in the diff-apply tool. Aider's research found that disabling tolerance for imperfect diffs produces 9x more apply failures.
Parsed test feedback from the test-runner. The model gets one Failure object per failing test, not raw stdout, because handing it raw pytest text past ~25k tokens degrades its adherence to the system prompt.
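To make the fallback-ladder idea concrete, here is a small sketch with three illustrative rungs: exact match, match ignoring trailing whitespace, and match ignoring indentation. This page does not enumerate the nine strategies, so these three rungs, the search/replace edit shape, and every function name below are assumptions chosen for the example. Each rung refuses to act unless the target is found exactly once, and the ladder stops at the first rung that succeeds.

```python
from typing import Callable, List, Optional, Tuple


def _indent_of(line: str) -> str:
    return line[: len(line) - len(line.lstrip())]


def exact(source: str, search: str, replace: str) -> Optional[str]:
    """Rung 1: the search text must appear verbatim exactly once."""
    if not search or source.count(search) != 1:
        return None
    return source.replace(search, replace, 1)


def _replace_window(source, search, replace, normalize, reindent=False):
    """Find one window of lines equal to `search` after normalizing both sides."""
    src, want = source.splitlines(), search.splitlines()
    if not want:
        return None
    key = [normalize(line) for line in want]
    hits = [
        i for i in range(len(src) - len(want) + 1)
        if [normalize(line) for line in src[i:i + len(want)]] == key
    ]
    if len(hits) != 1:          # missing or ambiguous: let the next rung try
        return None
    start, new = hits[0], replace.splitlines()
    if reindent:
        # Shift the replacement by the indentation the search text was missing.
        have, asked = _indent_of(src[start]), _indent_of(want[0])
        if have.startswith(asked):
            extra = have[len(asked):]
            new = [extra + line if line.strip() else line for line in new]
    out = src[:start] + new + src[start + len(want):]
    return "\n".join(out) + ("\n" if source.endswith("\n") else "")


def ignore_trailing_whitespace(source, search, replace):
    """Rung 2: tolerate trailing spaces or tabs on any line."""
    return _replace_window(source, search, replace, str.rstrip)


def ignore_indentation(source, search, replace):
    """Rung 3: tolerate a uniformly wrong indentation level."""
    return _replace_window(source, search, replace, str.strip, reindent=True)


LADDER: List[Tuple[str, Callable[[str, str, str], Optional[str]]]] = [
    ("exact", exact),
    ("ignore_trailing_whitespace", ignore_trailing_whitespace),
    ("ignore_indentation", ignore_indentation),
]


def apply_with_ladder(source: str, search: str, replace: str):
    """Try each rung in order; return (new_source, rung_name) or (None, None)."""
    for name, strategy in LADDER:
        result = strategy(source, search, replace)
        if result is not None:
            return result, name
    return None, None
```

Returning the rung name alongside the result makes it possible to log which strategy rescued each edit, which is useful when measuring apply success rate.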

When an autonomous coding agent makes sense (and when it doesn't)

An edit-test-verify loop is not a free upgrade over a chat-style coding assistant. Aider's own writeup says it best: Aider "intentionally has quite limited and narrow agentic behavior to avoid long delays, high token costs and the need for users to repeatedly code review incorrect solutions." Reach for an autonomous agent when the verification signal is strong and the change is well-scoped; otherwise stay in the loop yourself.

✓ Build an autonomous coding agent when

You have real tests that pass-or-fail definitively. Verification is the single highest-leverage decision per Anthropic's Claude Code best-practices post. Without a strong pass/fail signal, the loop can't tell when it's done.
The task is well-scoped to a known slice of the repo. "Fix the off-by-one in auth/session.py" works; "refactor the whole codebase" does not. The repo indexer is good but it's lexical, not semantic — the ranking degrades as scope widens.
Edits are local and diff-shaped. The nine-strategy fallback ladder buys forgiveness, not magic. Sprawling cross-file rewrites with logic re-flows are still hard; one-to-five-file fixes with clear test signal are the sweet spot.
You can spare 5–30 minutes of unattended runtime. p50 iterations is 3–6 on the seed eval set; each iteration is one model call plus one test run. Expect 5–30 minutes for a real fix, not 30 seconds.
You've already shipped a chat-style coding setup first. An autonomous agent on top of a working chat workflow is a step up; as your first attempt at AI-assisted coding, it has too many failure modes you can't yet recognize.
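The point above about the indexer being lexical rather than semantic is easier to see with a toy ranker. The sketch below scores files by how often the task's words appear in the path and the contents, weighting path hits more heavily. The tokenizer, the stopword list, the `path_weight` value, and the function names are all assumptions for illustration; the actual indexer in the spec may rank differently.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
STOPWORDS = {"the", "and", "fix", "for", "with", "that", "this"}


def tokenize(text: str) -> List[str]:
    """Lowercase identifier-like words, split snake_case, drop very short parts."""
    parts: List[str] = []
    for word in TOKEN.findall(text):
        parts.extend(p for p in word.lower().split("_") if len(p) >= 3)
    return parts


def rank_files(
    task: str, files: Dict[str, str], path_weight: int = 5
) -> List[Tuple[str, int]]:
    """Return (path, score) pairs, best first, skipping files with no overlap."""
    query = {t for t in tokenize(task) if t not in STOPWORDS}
    ranked = []
    for path, content in files.items():
        path_tokens = Counter(tokenize(path))
        body_tokens = Counter(tokenize(content))
        score = sum(
            path_weight * path_tokens[t] + body_tokens[t] for t in query
        )
        if score > 0:
            ranked.append((path, score))
    # Sort by score descending, then by path so ties are deterministic.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

A narrowly scoped task such as "Fix the off-by-one in auth/session.py" shares distinctive words with one or two files, so the ranking is sharp. A broad task shares common words with most of the repo, which is why a purely lexical score degrades as scope widens.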

⚠ Stay with chat-style assistance when

Your test signal is weak or absent. If the only verification is "the code looks right," the loop will produce something that looks right and doesn't work. Build the test suite first, the agent second.
The task is exploratory, not corrective. "What's the best way to architect this?" is a planning question, not a fix-the-failing-test question. Use plan-mode chat for the exploration, then hand a scoped task to the agent.
You don't have a guardrail substrate yet. Without the touched-file allowlist and the forbidden-bash list from Task 8 (ACA-8), the agent can delete your .env or force-push to main on iteration 7. Don't run unattended without guards.
Token cost matters more than developer time. p50 token usage is 20–60k per run on the seed set; SWE-bench-shaped tasks routinely cross 100k. If unit economics are sensitive, the math fails fast.
You can't tolerate the eval harness's setup cost. Without the harness from Task 9 (ACA-9) you can't tell if a model swap, prompt edit, or library bump silently regressed quality. Every later improvement becomes unverifiable.
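The two guards named above, a touched-file allowlist and a forbidden-bash list, can be sketched as plain predicate functions that a pre-tool-use hook would call before any edit or shell command. Everything here is an assumption for illustration: the function names, the protected names (taken from the .env, secrets/, and .git/ examples on this page), and the two regex patterns, which cover only the "delete .env" and "force-push" cases. Regex checks on shell strings are easy to bypass, so treat this as a first line of defense rather than a sandbox.

```python
import re
from pathlib import PurePosixPath
from typing import Iterable

PROTECTED_NAMES = {".env"}
PROTECTED_DIRS = {"secrets", ".git"}

FORBIDDEN_BASH = [
    # git push with --force (including --force-with-lease) or -f
    re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
    # rm aimed at a .env file, with or without flags
    re.compile(r"\brm\s+(-\w+\s+)*\S*\.env\b"),
]


def is_edit_allowed(path: str, allowed_roots: Iterable[str]) -> bool:
    """Allow an edit only inside an allowed root and outside protected paths."""
    p = PurePosixPath(path)
    if p.is_absolute() or ".." in p.parts:
        return False                      # no escaping the repo
    if p.name in PROTECTED_NAMES or PROTECTED_DIRS.intersection(p.parts):
        return False                      # deny wins over any allowlist entry
    for root in allowed_roots:
        root_parts = PurePosixPath(root).parts
        if p.parts[: len(root_parts)] == root_parts:
            return True
    return False


def is_command_allowed(command: str) -> bool:
    """Reject any shell command that matches a forbidden pattern."""
    return not any(pattern.search(command) for pattern in FORBIDDEN_BASH)
```

Checking the deny rules before the allowlist means a protected path stays protected even if someone later allows the whole repo root.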

What the community says

"Aider scored 18.9% on the main SWE-bench benchmark... did not use RAG, vector search, tools or give the LLM access to search the web or unilaterally execute code. Aider intentionally has quite limited and narrow agentic behavior to avoid long delays, high token costs and the need for users to repeatedly code review incorrect solutions."

Paul Gauthier (Aider author) · "How Aider scored SOTA on SWE-bench" · aider.chat

"Give Claude a way to verify its work. Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do."

Anthropic engineering · "Claude Code best practices" · anthropic.com

"With unified diffs, GPT acts more like it's writing textual data intended to be read by a program, not talking to a person. Diffs are usually consumed by the patch program, which is fairly rigid. This seems to encourage rigor, making GPT less likely to leave informal editing instructions in comments."

Paul Gauthier · "Unified diffs make GPT-4 Turbo 3X less lazy" · aider.chat

How to know it's working

Ship the spec, then measure on these criteria (the eval harness task grades them):

Eval harness passes ≥75% of the 8 hand-crafted bug-fix scenarios in evals/fixtures/ (typo, off-by-one, missing-import, regex, async-race, type-error, dep-cycle, multi-file-refactor)
p50 iterations per successful run < 6 across the 8 fixtures (proves the loop converges, not spins)
p50 end-to-end latency < 120s per fixture with the configured model from STACK.md
Apply success rate ≥ 90% across all attempted diffs (proves the nine-strategy fallback ladder earns its complexity)
Touched-file allowlist denies 100% of edits to .env / secrets/ / .git/ in the guardrails test fixtures
Retry helper succeeds on a forced-429 mock with retries=1 visible in the trace event
Circuit breaker exits the stuck-fix fixture within 3 iterations (not the 12-iteration max), with RunResult.status="halted" and reason="circuit_breaker_open"
The Task 10 (ACA-10) regression run is within ±5% of evals/baseline.json on pass rate and p50 iterations
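The circuit-breaker criterion above depends on recognizing that consecutive iterations failed in the same way. One way to sketch that, under assumptions, is to hash the sorted set of (test id, message) pairs into a signature and count consecutive repeats. The class name, the signature scheme, and the `record` method are hypothetical; only the threshold of 3 consecutive same-signature failures comes from this page.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def failure_signature(failures: Iterable[Tuple[str, str]]) -> str:
    """Stable fingerprint of a failure set, independent of ordering."""
    canonical = "\n".join(f"{test}|{message}" for test, message in sorted(failures))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class CircuitBreaker:
    """Opens after `threshold` consecutive iterations with the same signature."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._last: Optional[str] = None
        self._streak = 0

    def record(self, signature: str) -> bool:
        """Register one failed iteration; return True if the breaker is now open."""
        if signature == self._last:
            self._streak += 1
        else:
            # A different failure means the agent made some progress: restart.
            self._last, self._streak = signature, 1
        return self.is_open

    @property
    def is_open(self) -> bool:
        return self._streak >= self.threshold
```

In a loop, the caller would compute the signature after each failed test run and stop with status "halted" and reason "circuit_breaker_open" as soon as `record` returns True.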
