
Build an Agent with Memory

A free spec for building a memory-augmented agent: four explicit memory tiers (working, episodic, semantic, procedural), a write tool the agent decides when to call, retrieval that blends recency and relevance, write-time consolidation, and a forgetting primitive. Framework-agnostic: pick the substrate in STACK.md (Postgres+pgvector, SQLite+sqlite-vss, Qdrant, or a specialized store such as Mem0, Letta, or Zep). Includes typed contracts, a multi-session eval, and privacy compliance.

  • AWM-1: Lay the foundation for your memory-augmented agent
  • AWM-2: Define the four memory tiers and the typed records they carry
  • AWM-3: Build the memory write tool the agent calls when it learns something worth keeping
  • AWM-4: Build memory retrieval that ranks by relevance AND recency, not just similarity
  • AWM-5: Assemble the agent's context window from working memory plus retrieved long-term memory
  • AWM-6: Consolidate old episodic memory into durable semantic memory on a write-time pass
  • AWM-7: Resolve conflicts when newer memory contradicts older memory
  • AWM-8: Build an eval harness that grades cross-session recall and contradiction handling
  • AWM-9: Add privacy primitives: forget a user, redact a fact, enforce retention windows
  • AWM-10: Ship the memory-augmented agent to production: traces, storage limits, observability into what got retrieved

How to use this spec

  1. Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
  2. Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
  3. Use Open in to send the same prompt directly to v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor in one click.

What you're building

Goal 1: Ship a working memory-augmented agent: four explicit memory tiers (working, episodic, semantic, procedural), a write tool the model calls deliberately, retrieval that ranks by recency AND relevance, write-time consolidation that distills old episodes into durable facts, typed conflict resolution, multi-session eval, right-to-be-forgotten, and production traces.
This spec packages the memory-augmented agent pattern as a framework-agnostic starter kanban. Make five tooling decisions in STACK.md (language / memory_substrate / embedder / model / consolidation_cadence); every later card reads your picks and runs the matching variant. Acceptance bar: the eval harness passes on 12 fixtures with cross-session recall >= 90%, contradiction handling correct on 6/6, retrieval p50 < 50ms in-memory or < 200ms on Postgres, top-K precision >= 85%, forget(user) removes every direct and derived record in one call, and the +/-5% regression bar holds after production hardening.

Architecture

flowchart TB
  User[User Turn] --> Assemble[assemble_context]
  Procedural[(Procedural<br/>skills + system)] --> Assemble
  Working[(Working<br/>last N turns)] --> Assemble
  Semantic[(Semantic<br/>durable facts)] -->|retrieve top-K| Assemble
  Episodic[(Episodic<br/>past events)] -->|retrieve top-K| Assemble
  Assemble --> Model[Model]
  Model -->|remember tool call| Resolver[Conflict Resolver]
  Resolver -->|append| Episodic
  Resolver -->|overwrite or preserve| Semantic
  Episodic -.write-time consolidation.-> Consolidator[Consolidation Job]
  Consolidator -.derived facts.-> Semantic
  Forget[forget user] -.cascade via derived_from.-> Episodic
  Forget -.cascade.-> Semantic
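
The forget path at the bottom of the flowchart only works if every derived record names its sources. Below is a minimal Python sketch of that cascade, assuming an in-memory dict stands in for the substrate. The MemoryRecord fields, the Tier enum, and forget_user are illustrative names chosen here, not the spec's typed contracts.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


@dataclass
class MemoryRecord:
    # Hypothetical record shape; the real contract is defined in the tier task.
    id: str
    user_id: str
    tier: Tier
    text: str
    # IDs of the records this one was distilled from (empty for direct writes).
    derived_from: list[str] = field(default_factory=list)


def forget_user(store: dict[str, MemoryRecord], user_id: str) -> int:
    """Delete the user's records plus anything derived from them.

    Returns the number of records removed.
    """
    # Start with records written directly under the user.
    doomed = {rid for rid, rec in store.items() if rec.user_id == user_id}

    # Follow derived_from links until no new dependants turn up, so a fact
    # derived from a derived fact is removed as well.
    changed = True
    while changed:
        changed = False
        for rid, rec in store.items():
            if rid not in doomed and any(src in doomed for src in rec.derived_from):
                doomed.add(rid)
                changed = True

    for rid in doomed:
        del store[rid]
    return len(doomed)
```

On a real substrate the same walk would likely be a recursive query or a foreign-key cascade; the point of the sketch is that the cascade follows derived_from rather than relying on user_id alone.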

The three load-bearing decisions:

  • Explicit memory types, not one blob. Working, episodic, semantic, and procedural memory each have their own namespace, write rules, and retrieval profile. This is the difference between a memory architecture and "we save the chat history".
  • Write-time consolidation, not read-time. The expensive episodic-to-semantic distillation runs on the write path or in a background job, never inside retrieve, so the read hot path stays fast.
  • Recency AND relevance retrieval. The score blends α · semantic similarity + β · recency + γ · BM25, with a hard is_valid filter that drops superseded records. Pure cosine similarity returns stale-but-similar matches in production.
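
The blended score can be sketched in a few lines of Python. This is an illustration under assumptions, not the spec's retrieval implementation: the weights (0.6 / 0.25 / 0.15), the exponential recency decay with a 7-day half-life, and the Candidate shape are all hypothetical, and the BM25 term is assumed to arrive pre-computed and normalized to [0, 1].

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    id: str
    embedding: list[float]
    age_seconds: float
    bm25: float            # lexical score, assumed pre-normalized to [0, 1]
    is_valid: bool = True  # False once a newer record supersedes this one


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recency(age_seconds: float, half_life_seconds: float = 7 * 24 * 3600) -> float:
    # Assumed decay curve: 1.0 for a brand-new record, halving every half-life.
    return 0.5 ** (age_seconds / half_life_seconds)


def retrieve(query_embedding, candidates, k=3, alpha=0.6, beta=0.25, gamma=0.15):
    # Hard filter first: superseded records never reach the ranking step.
    live = [c for c in candidates if c.is_valid]
    scored = [
        (
            alpha * cosine(query_embedding, c.embedding)
            + beta * recency(c.age_seconds)
            + gamma * c.bm25,
            c,
        )
        for c in live
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

With these weights a fresh, nearly identical record outranks an older exact match, and a superseded record is never returned no matter how similar it is. Tune α, β, and γ against the eval fixtures rather than treating these defaults as recommended values.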

When agent memory makes sense (and when it doesn't)

Agent memory is not free. Letta's own benchmarks show a plain filesystem scores 74% on memory tasks, beating specialized memory libraries. Reach for a deliberate memory architecture only when the use case earns the cost; otherwise ship a stateless agent and add memory after evals tell you it's worth it.

✓ Use agent memory when

  • The user expects continuity across sessions. A coaching agent, a personal assistant, a long-running coding agent — the entire product promise breaks if turn 1 of session 5 doesn't remember turn 12 of session 1. Memory IS the feature.
  • Personalization is a product feature, not a nice-to-have. Per-user preferences ("prefers metric units", "reviewed PR #842 last week") move the agent from generic to useful. Without a memory architecture, you're re-stuffing the entire history every turn — expensive and fragile.
  • Token cost matters more than substrate cost. Replaying the full history every turn makes cumulative token cost grow as O(turns²). A memory layer that retrieves a fixed top-K keeps each turn's context at O(retrieved_k). For high-turn-count users, the substrate cost is dwarfed by the token savings.
  • Compliance requires explicit forget semantics. GDPR right-to-be-forgotten, HIPAA retention windows, SOC 2 data classification — these all need forget(user) as a first-class operation with cascade-delete on derived records. Stuffing chat history into context can't satisfy that audit.
  • You've shipped a stateless version and have an eval baseline. Memory failure modes (stale facts, contradictions, retrieval misses) are silent — without an eval harness, you can't tell whether your memory layer helped or just added latency. Stateless agent first, eval harness second, memory third.
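
The token-cost argument in the list above can be checked with quick arithmetic. The sketch below uses assumed numbers (150 tokens per turn, a 10-turn working window, 5 retrieved records of 60 tokens each); none of these figures come from the spec, so substitute your own traffic.

```python
def replay_tokens(turns: int, tokens_per_turn: int) -> int:
    # Turn t resends all t turns so far, so the total is
    # tokens_per_turn * (1 + 2 + ... + turns).
    return tokens_per_turn * turns * (turns + 1) // 2


def memory_tokens(turns: int, tokens_per_turn: int, window: int,
                  k: int, tokens_per_record: int) -> int:
    # Each turn sends at most `window` recent turns plus k retrieved records.
    total = 0
    for t in range(1, turns + 1):
        total += min(t, window) * tokens_per_turn + k * tokens_per_record
    return total


print(replay_tokens(200, 150))                                        # 3015000
print(memory_tokens(200, 150, window=10, k=5, tokens_per_record=60))  # 353250
```

Under these assumptions a 200-turn user costs roughly 3.0M input tokens with full replay versus about 0.35M with a bounded window plus retrieval. The per-turn context stays bounded in the second case, so the gap widens as turn count grows.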

⚠ Ship without memory when

  • Tasks are single-session by design. A code-review agent that runs once per PR doesn't need memory of previous PRs unless you've explicitly designed for it. Don't bolt on memory because "it might be useful"; add an escalation path if the rare case appears.
  • The context window holds the whole interaction. Sub-50-turn conversations with a 200K-token model can fit raw. The retrieval-accuracy degradation curve in long context is real, but at 50 turns you're nowhere near it. Don't add an architectural layer to save tokens you weren't going to spend.
  • You don't have an eval harness yet. Memory bugs are silent killers — a stale fact gets retrieved, the model trusts it, the user loses trust in the agent, and you have no way to detect the bug in production. Eval harness before memory architecture, every time.
  • The user-volume × turns × tokens math doesn't justify the substrate. Postgres+pgvector and Qdrant are not free to run. If your product has 100 weekly active users with 5 turns each, the memory substrate cost exceeds the token savings. Run the math before committing.
  • You haven't shipped a stateless version first. Stateless → memory-augmented is the right migration order, never the reverse. Memory adds load-bearing complexity (consolidation, conflict resolution, forgetting) — adding it before you understand the actual conversation patterns produces an architecture that fights the workload.

What the community says

"The goal of this work is to investigate the extent to which an LLM can manage memory and different memory hierarchies, applying lessons from operating systems to extend effective context lengths."
Charles Packer (pacjam) · MemGPT lead author, HN discussion of the original paper · 2023-10-15 · news.ycombinator.com
"Mem0 = memory storage + retrieval. Doesn't learn patterns. We built an internal solution capturing structured events (agent output, user corrections, accepted changes) to extract evolving preference profiles, enabling agents to improve without explicit instruction."
Ask HN poster (YC W23) · Ask HN: Mem0 stores memories, but doesn't learn user patterns · 2026-02-18 · news.ycombinator.com
"Relevant data is moved into the context the llm needs to answer chat questions. When it runs out of memory it moves a condensed, searchable version of the content to another data store."
u/tayo42 · HN — MemGPT: Towards LLMs as Operating Systems · 2023-10-15 · news.ycombinator.com

How to know it's working

Ship the spec, then measure against these criteria (the eval harness task grades them):

  • Cross-session recall >= 90% on 4 user-preference fixtures (session 1 establishes a preference, session N asks; model output includes the preference)
  • Contradiction handling correct on 6/6 fixtures — facts overwrite (new wins, old marked superseded_by), opinions preserve (both stay live with timestamps), events always append
  • Retrieval p50 latency < 50ms in-memory or < 200ms Postgres on 1000 records per user namespace
  • Retrieval top-K precision >= 85% — relevant record in top-3 of retrieve results across the 3 precision fixtures
  • forget(user_id, scope=ALL) removes every direct and derived record within one call (verified by post-call retrieve returning 0 results, including consolidated semantic facts derived from the user's episodes)
  • Per-user namespace stays under MAX_NAMESPACE_BYTES (default 50MB) after the storage-limit enforcer runs — eviction order is importance ASC, last_accessed_at ASC
  • Substrate circuit breaker opens after 3 consecutive retrieval failures in one run_id; assemble_context falls back to procedural + working memory only and emits a memory_degraded trace event
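
The contradiction-handling rule in the criteria above (facts overwrite, opinions preserve, events append) can be sketched as one write-path function. This is a minimal illustration, assuming the caller labels each record's kind and that a hypothetical key field identifies what the record is about; Kind, Record, and resolve_write are names invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    FACT = "fact"
    OPINION = "opinion"
    EVENT = "event"


@dataclass
class Record:
    id: str
    kind: Kind
    key: str            # hypothetical subject key, e.g. "user.units"
    value: str
    created_at: float   # timestamp kept so preserved opinions stay ordered
    superseded_by: Optional[str] = None

    @property
    def is_valid(self) -> bool:
        # A record stays live until a newer one supersedes it.
        return self.superseded_by is None


def resolve_write(store: list[Record], new: Record) -> None:
    """Apply the per-kind conflict rule, then append the new record."""
    if new.kind is Kind.FACT:
        # Facts overwrite: the new value wins and older live facts with the
        # same key are marked superseded rather than deleted.
        for old in store:
            if old.kind is Kind.FACT and old.key == new.key and old.is_valid:
                old.superseded_by = new.id
    # Opinions are preserved alongside earlier ones, and events always append,
    # so neither touches existing records.
    store.append(new)
```

Marking the old fact with superseded_by instead of deleting it keeps an audit trail and gives retrieval a simple is_valid check to filter on.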
