Build an AI Coding Agent
Free spec to build an AI coding agent — the edit-test-verify loop behind Aider, OpenHands, SWE-agent, and Claude Code. Framework-agnostic: STACK.md picks the model, diff strategy, test runner, and guardrail substrate. Includes a repo indexer, planner, fallback patching ladder, parsed test feedback, allowlist guardrails, and an eval harness.
How to use this spec
- Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
- Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
- "Open in" sends the same prompt directly into v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.
What you're building
Goal 1: Ship a working AI coding agent built on the edit-test-verify loop: a repo indexer ranks files, a planner produces a typed edit plan, a diff-apply tool with a nine-strategy fallback ladder makes the edits, a test-runner returns parsed pass/fail, and the loop iterates on failures until tests pass — with guardrails, an eval harness, and production-grade tracing.
Aider, OpenHands, SWE-agent, and Claude Code all converge on the same spine: read repo → plan → apply diff → run tests → read output → iterate. This kanban builds that loop as a framework-agnostic CLI. Pick five decisions once in STACK.md (language / model / diff_strategy / test_runner / guardrail_substrate); every later card carries per-variant snippets. Acceptance bar: eval harness passes ≥75% of 8 hand-crafted bug-fix scenarios with p50 iterations < 6 and a parallelism-safe run trace; circuit breaker opens after 3 consecutive same-signature failures and exits cleanly; ±5% regression bar held after the production-hardening pass.
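The loop and its circuit breaker can be sketched in a few lines. This is a minimal illustration, not the spec's implementation: `run_loop`, `failure_signature`, and the two callables are hypothetical names, and hashing test ids plus messages as the "signature" is an assumption (real failure messages may contain timestamps or addresses that need normalizing first). The `RunResult` status and reason strings and the 12-iteration cap follow the acceptance criteria on this page.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(frozen=True)
class Failure:
    test_id: str
    message: str


@dataclass
class TestResult:
    passed: bool
    failures: list[Failure] = field(default_factory=list)


@dataclass
class RunResult:
    status: str                    # "success" or "halted"
    iterations: int
    reason: Optional[str] = None   # set only when halted


def failure_signature(failures: list[Failure]) -> str:
    """Order-independent hash of the failing test ids and messages (assumed definition)."""
    joined = "\n".join(sorted(f"{f.test_id}::{f.message}" for f in failures))
    return hashlib.sha256(joined.encode()).hexdigest()


def run_loop(
    plan_and_apply: Callable[[list[Failure]], None],  # hypothetical: plans and applies one edit
    run_tests: Callable[[], TestResult],              # hypothetical: runs the verification command
    max_iterations: int = 12,
    breaker_threshold: int = 3,
) -> RunResult:
    last_signature = None
    same_count = 0
    failures: list[Failure] = []
    for iteration in range(1, max_iterations + 1):
        plan_and_apply(failures)           # previous failures feed the next plan
        result = run_tests()
        if result.passed:
            return RunResult("success", iteration)
        failures = result.failures
        signature = failure_signature(failures)
        # Count consecutive iterations that fail in exactly the same way.
        same_count = same_count + 1 if signature == last_signature else 1
        last_signature = signature
        if same_count >= breaker_threshold:
            return RunResult("halted", iteration, "circuit_breaker_open")
    return RunResult("halted", max_iterations, "max_iterations")
```

With a test runner that keeps returning the same failure, this sketch halts on the third iteration instead of spending the full iteration budget.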
Architecture
flowchart TB
User[User Task] --> Index[Repo Indexer<br/>RepoIndex + RelevanceRanking]
Index --> Planner[Planner<br/>EditPlan]
Planner --> Loop{Edit-Test-Verify Loop}
Loop -->|Diff| Apply[Diff-Apply Tool<br/>9-strategy fallback ladder]
Apply -->|ApplyResult + commit_sha| Loop
Loop -->|verification_command| Tests[Test-Runner Tool<br/>parsed TestResult]
Tests -->|passed/failed/failures| Loop
Loop -->|failures + prev plan| Planner
Loop -->|status=success| Output[RunResult + commit chain]
Loop -->|max_iterations or circuit_breaker| Halt[RunResult halted]
Guards[Guardrails<br/>touched-file allowlist<br/>forbidden bash<br/>max iterations] -.->|PreToolUse| Apply
Guards -.->|PreToolUse| Tests
The three load-bearing decisions:
- Structured-output planning before any edit. The planner returns a typed EditPlan, never free text (Anthropic's "explore first, then plan, then code").
- Flexible patching with a nine-strategy fallback ladder in the diff-apply tool. Aider's research: disabling tolerance for imperfect diffs produces 9x more apply failures.
- Parsed test feedback from the test-runner: one Failure object per failing test, not raw stdout. Handing the model raw pytest text past ~25k tokens degrades its adherence to the system prompt.
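To make the fallback-ladder idea concrete, here is a small sketch for search/replace edits. It shows three illustrative strategies only; this page does not enumerate the spec's nine, so the strategy names, the `apply_with_ladder` function, and the re-indentation rule are all assumptions. Each rung is tried in order, and a rung only succeeds when the search block matches exactly one location.

```python
from typing import Callable, Optional


def _indent_of(line: str) -> str:
    return line[: len(line) - len(line.lstrip())]


def apply_exact(source: str, search: str, replace: str) -> Optional[str]:
    """Rung 1: the search block must appear verbatim exactly once."""
    if source.count(search) != 1:
        return None
    return source.replace(search, replace, 1)


def _apply_linewise(source, search, replace, key, reindent) -> Optional[str]:
    """Match whole lines after normalizing each one with `key`."""
    src = source.splitlines()
    want_raw = search.splitlines()
    want = [key(line) for line in want_raw]
    n = len(want)
    hits = [i for i in range(len(src) - n + 1)
            if [key(line) for line in src[i:i + n]] == want]
    if n == 0 or len(hits) != 1:
        return None  # no match, or ambiguous: let the next rung (or the model) try
    i = hits[0]
    new = replace.splitlines()
    if reindent:
        # Swap the search block's base indent for the indent found in the file.
        old_ind, new_ind = _indent_of(want_raw[0]), _indent_of(src[i])
        new = [
            line if not line.strip()
            else new_ind + (line[len(old_ind):] if line.startswith(old_ind) else line.lstrip())
            for line in new
        ]
    patched = "\n".join(src[:i] + new + src[i + n:])
    return patched + "\n" if source.endswith("\n") else patched


def apply_ignoring_trailing_ws(source, search, replace):
    """Rung 2: tolerate trailing-whitespace differences."""
    return _apply_linewise(source, search, replace, key=str.rstrip, reindent=False)


def apply_ignoring_indent(source, search, replace):
    """Rung 3: tolerate a wrong base indent and re-indent the replacement."""
    return _apply_linewise(source, search, replace, key=str.strip, reindent=True)


LADDER: list[tuple[str, Callable[[str, str, str], Optional[str]]]] = [
    ("exact", apply_exact),
    ("trailing_ws", apply_ignoring_trailing_ws),
    ("any_indent", apply_ignoring_indent),
]


def apply_with_ladder(source: str, search: str, replace: str):
    """Return (strategy_name, patched_source), or (None, source) if every rung fails."""
    if not search.strip():
        return None, source
    for name, strategy in LADDER:
        patched = strategy(source, search, replace)
        if patched is not None:
            return name, patched
    return None, source
```

Returning the name of the rung that succeeded is a deliberate choice in this sketch: it lets a trace record how often each fallback is used, which is one way to check whether the ladder is earning its complexity.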
When an autonomous coding agent makes sense (and when it doesn't)
An edit-test-verify loop is not a free upgrade over a chat-style coding assistant. Aider's own writeup says it best: "intentionally has quite limited and narrow agentic behavior to avoid long delays, high token costs and the need for users to repeatedly code review incorrect solutions." Reach for autonomous when the verification signal is strong and the change is well-scoped — otherwise stay in the loop yourself.
✓ Build an autonomous coding agent when
- You have real tests that pass-or-fail definitively. Verification is the single highest-leverage decision per Anthropic's Claude Code best-practices post. Without a strong pass/fail signal, the loop can't tell when it's done.
- The task is well-scoped to a known slice of the repo. "Fix the off-by-one in auth/session.py" works; "refactor the whole codebase" does not. The repo indexer is good, but it is lexical, not semantic, so the ranking degrades as scope widens.
- Edits are local and diff-shaped. The nine-strategy fallback ladder buys forgiveness, not magic. Sprawling cross-file rewrites with logic re-flows are still hard; one-to-five-file fixes with clear test signal are the sweet spot.
- You can spare 5–30 minutes of unattended runtime. p50 iterations is 3–6 on the seed eval set; each iteration is one model call plus one test run. Expect 5–30 minutes for a real fix, not 30 seconds.
- You've already shipped a chat-style coding setup first. Autonomous on top of a working chat workflow is a step up; autonomous as your first attempt at AI-assisted coding has too many failure modes you can't yet recognize.
⚠ Stay with chat-style assistance when
- Your test signal is weak or absent. If the only verification is "the code looks right," the loop will produce something that looks right and doesn't work. Build the test suite first, the agent second.
- The task is exploratory, not corrective. "What's the best way to architect this?" is a planning question, not a fix-the-failing-test question. Use plan-mode chat for the exploration, then hand a scoped task to the agent.
- You don't have a guardrail substrate yet. Without the touched-file allowlist and the forbidden-bash list from Task 8, the agent can delete your .env or force-push to main on iteration 7. Don't run unattended without guards.
- Token cost matters more than developer time. p50 token usage is 20–60k per run on the seed set; SWE-bench-shaped tasks routinely cross 100k. If unit economics are sensitive, the math fails fast.
- You can't tolerate the eval harness's setup cost. Without Task 9's harness you can't tell if a model swap, prompt edit, or library bump silently regressed quality. Every later improvement becomes unverifiable.
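The two guards named above (the touched-file allowlist and the forbidden-bash list) might look roughly like the sketch below. Everything here is illustrative: the function names, the protected-path set, and the specific command rules are assumptions, paths are assumed to be POSIX-style and repo-relative, and a string-level command check is easy to bypass, so treat it as a first line of defense in front of a real sandbox rather than a replacement for one.

```python
import posixpath
import shlex

PROTECTED_PARTS = {"secrets", ".git"}   # assumed protected directories
SHELL_META = set(";&|`$><")             # refuse anything that chains or redirects


def is_edit_allowed(path: str, allowed_dirs: list[str]) -> bool:
    """Allow an edit only inside allowlisted directories and never on protected paths."""
    norm = posixpath.normpath(path)      # collapses "src/../.git" to ".git"
    if norm.startswith("/") or norm == ".." or norm.startswith("../"):
        return False                     # absolute path or escape from the repo root
    for part in norm.split("/"):
        if part in PROTECTED_PARTS or part == ".env" or part.startswith(".env."):
            return False
    return any(norm == d or norm.startswith(d.rstrip("/") + "/") for d in allowed_dirs)


def _is_recursive_flag(token: str) -> bool:
    if token == "--recursive":
        return True
    return token.startswith("-") and not token.startswith("--") and "r" in token.lower()


def is_command_allowed(command: str) -> bool:
    """Reject shell metacharacters, recursive rm, and forced git pushes."""
    if any(ch in SHELL_META for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False                     # unbalanced quotes: refuse rather than guess
    if not tokens:
        return False
    if tokens[0] == "rm" and any(_is_recursive_flag(t) for t in tokens[1:]):
        return False
    if tokens[:2] == ["git", "push"] and any(
        t in {"-f", "--force", "--force-with-lease"} for t in tokens[2:]
    ):
        return False
    return True
```

In a hook-based setup, checks like these would run before each tool call (the PreToolUse edges in the diagram) and deny the call when they return False.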
How to know it's working
Ship the spec, then measure on these criteria (the eval harness task grades them):
- Eval harness passes ≥75% of the 8 hand-crafted bug-fix scenarios in evals/fixtures/ (typo, off-by-one, missing-import, regex, async-race, type-error, dep-cycle, multi-file-refactor)
- p50 iterations per successful run < 6 across the 8 fixtures (proves the loop converges, not spins)
- p50 end-to-end latency < 120s per fixture with the configured model from STACK.md
- Apply success rate ≥ 90% across all attempted diffs (proves the nine-strategy fallback ladder earns its complexity)
- Touched-file allowlist denies 100% of edits to .env / secrets/ / .git/ in the guardrails test fixtures
- Retry helper succeeds on a forced-429 mock with retries=1 visible in the trace event
- Circuit breaker exits the stuck-fix fixture within 3 iterations (not the 12-iteration max), with RunResult.status="halted" and reason="circuit_breaker_open"
- Task 10 regression run is within ±5% of evals/baseline.json on pass rate and p50 iterations
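The pass-rate, p50, and regression criteria above reduce to a few lines of arithmetic. The sketch below assumes each run is recorded as a dict with `status` and `iterations` keys and that the baseline file stores `pass_rate` and `p50_iterations`; those field names are hypothetical. It also reads "±5%" as a symmetric tolerance relative to the baseline value, which is one interpretation; an absolute or one-sided bar would need a different check.

```python
from statistics import median


def summarize(runs: list[dict]) -> dict:
    """Pass rate over all runs; p50 iterations over successful runs only."""
    if not runs:
        raise ValueError("no runs to summarize")
    successes = [r["iterations"] for r in runs if r["status"] == "success"]
    return {
        "pass_rate": len(successes) / len(runs),
        "p50_iterations": median(successes) if successes else None,
    }


def meets_acceptance(summary: dict, min_pass_rate: float = 0.75, max_p50: float = 6) -> bool:
    """Acceptance bar from this page: pass rate >= 75% and p50 iterations < 6."""
    p50 = summary["p50_iterations"]
    return summary["pass_rate"] >= min_pass_rate and p50 is not None and p50 < max_p50


def within_regression_bar(current: dict, baseline: dict, tolerance: float = 0.05) -> bool:
    """True when both metrics are within +/- tolerance of the baseline (relative, assumed)."""
    for key in ("pass_rate", "p50_iterations"):
        cur, base = current[key], baseline[key]
        if cur is None or base is None:
            return False
        if abs(cur - base) > tolerance * abs(base):
            return False
    return True
```

For example, eight runs with six successes taking 2, 3, 3, 4, 5, and 7 iterations give a pass rate of 0.75 and a p50 of 3.5, which clears the acceptance bar.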
Sources
Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.
- ↗ Aider — How Aider scored SOTA on SWE-bench aider.chat
- ↗ Aider — Unified diffs make GPT-4 Turbo 3X less lazy aider.chat
- ↗ Aider — Edit formats reference aider.chat
- ↗ Aider — File editing troubleshooting (context-bloat note) aider.chat
- ↗ Anthropic — Claude Code best practices code.claude.com
- ↗ Aider-AI/aider github.com
- ↗ All-Hands-AI/OpenHands github.com
- ↗ princeton-nlp/SWE-agent github.com
- ↗ anthropics/claude-agent-sdk-python github.com
- ↗ SWE-bench leaderboard swebench.com
- ↗ Show HN: OpenHands, an open source alternative to Devin news.ycombinator.com
- ↗ I used to be religiously pro-Aider — HN discussion news.ycombinator.com
- ↗ Aider Is SOTA for Both SWE-bench and SWE-bench Lite news.ycombinator.com
- ↗ OpenHands and AMD: Local Coding Agents news.ycombinator.com
- ↗ Thoughts on a month with Devin news.ycombinator.com
Build this in your codebase tonight
Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.