Build an Agent with Memory

Free spec to build a memory-augmented agent — four explicit memory tiers (working, episodic, semantic, procedural), write tool the agent decides when to call, retrieval that blends recency and relevance, write-time consolidation, and a forgetting primitive. Framework-agnostic: pick the substrate in STACK.md (Postgres+pgvector, SQLite+sqlite-vss, Qdrant, or specialized stores like Mem0, Letta, or Zep). Includes typed contracts, multi-session eval, and privacy compliance.

Build this in my repo →

AWM-1

Lay the foundation for your memory-augmented agent

AWM-2

Define the four memory tiers and the typed records they carry

AWM-3

Build the memory write tool the agent calls when it learns something worth keeping

AWM-4

Build memory retrieval that ranks by relevance AND recency, not just similarity

AWM-5

Assemble the agent's context window from working memory plus retrieved long-term memory

AWM-6

Consolidate old episodic memory into durable semantic memory on a write-time pass

AWM-7

Resolve conflicts when newer memory contradicts older memory

AWM-8

Build an eval harness that grades cross-session recall and contradiction handling

AWM-9

Add privacy primitives: forget a user, redact a fact, enforce retention windows

AWM-10

Ship the memory-augmented agent to production: traces, storage limits, observability into what got retrieved

Nothing in flight

Nothing shipped yet

5 minutes of planning, ~22 hours saved. Tekk wires this 10-task spec into your repo in your stack — sign up to seed your workspace.

Build this →

How to use this spec

Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
Open in jumps the same prompt directly into v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.

What you're building

Goal 1: Ship a working memory-augmented agent: four explicit memory tiers (working, episodic, semantic, procedural), a write tool the model calls deliberately, retrieval that ranks by recency AND relevance, write-time consolidation that distills old episodes into durable facts, typed conflict resolution, multi-session eval, right-to-be-forgotten, and production traces.
A memory-augmented agent pattern as a framework-agnostic starter kanban. Pick five tooling decisions in STACK.md (language / memory_substrate / embedder / model / consolidation_cadence); every later card reads your picks and runs the right variant. Acceptance bar: eval harness passes on 12 fixtures with cross-session recall >= 90%, contradiction handling correct on 6/6, retrieval p50 < 50ms in-memory or < 200ms Postgres, top-K precision >= 85%, forget(user) removes every direct + derived record in one call, and the +/-5% regression bar holds after production hardening.

Architecture

flowchart TB
  User[User Turn] --> Assemble[assemble_context]
  Procedural[(Procedural<br/>skills + system)] --> Assemble
  Working[(Working<br/>last N turns)] --> Assemble
  Semantic[(Semantic<br/>durable facts)] -->|retrieve top-K| Assemble
  Episodic[(Episodic<br/>past events)] -->|retrieve top-K| Assemble
  Assemble --> Model[Model]
  Model -->|remember tool call| Resolver[Conflict Resolver]
  Resolver -->|append| Episodic
  Resolver -->|overwrite or preserve| Semantic
  Episodic -.write-time consolidation.-> Consolidator[Consolidation Job]
  Consolidator -.derived facts.-> Semantic
  Forget[forget user] -.cascade via derived_from.-> Episodic
  Forget -.cascade.-> Semantic

The three load-bearing decisions: explicit memory types, not one blob (working / episodic / semantic / procedural each have their own namespace, write rules, and retrieval profile — this is the difference between a memory architecture and "we save the chat history"), write-time consolidation, not read-time (the expensive episodic-to-semantic distillation runs on the write path or a background job, never inside retrieve, so the read hot path stays fast), and recency AND relevance retrieval (score blend = α · semantic similarity + β · recency + γ · BM25, with a hard is_valid filter that drops superseded records — pure cosine similarity returns stale-but-similar matches in production).

When agent memory makes sense (and when it doesn't)

Agent memory is not free. Letta's own benchmarks show a plain filesystem scores 74% on memory tasks, beating specialized memory libraries. Reach for a deliberate memory architecture when the boundary earns the cost — otherwise ship a stateless agent and add memory after evals tell you it's worth it.

✓ Use agent memory when

The user expects continuity across sessions. A coaching agent, a personal assistant, a long-running coding agent — the entire product promise breaks if turn 1 of session 5 doesn't remember turn 12 of session 1. Memory IS the feature.
Personalization is a product feature, not a nice-to-have. Per-user preferences ("prefers metric units", "reviewed PR #842 last week") move the agent from generic to useful. Without a memory architecture, you're re-stuffing the entire history every turn — expensive and fragile.
Token cost matters more than substrate cost. Replaying the full history every turn scales O(turns²). A memory layer with retrieval-on-context scales O(retrieved_k). For high-turn-count users, the substrate cost is dwarfed by the token savings.
Compliance requires explicit forget semantics. GDPR right-to-be-forgotten, HIPAA retention windows, SOC 2 data classification — these all need forget(user) as a first-class operation with cascade-delete on derived records. Stuffing chat history into context can't satisfy that audit.
You've shipped a stateless version and have an eval baseline. Memory failure modes (stale facts, contradictions, retrieval misses) are silent — without an eval harness, you can't tell whether your memory layer helped or just added latency. Stateless agent first, eval harness second, memory third.

⚠ Ship without memory when

Tasks are single-session by design. A code-review agent that runs once per PR doesn't need memory of previous PRs unless you've explicitly designed for it. Don't bolt on memory because "it might be useful" — add an escalate path if the rare case appears.
The context window holds the whole interaction. Sub-50-turn conversations with a 200K-token model can fit raw. The retrieval-accuracy degradation curve in long context is real, but at 50 turns you're nowhere near it. Don't add an architectural layer to save tokens you weren't going to spend.
You don't have an eval harness yet. Memory bugs are silent killers — a stale fact gets retrieved, the model trusts it, the user loses trust in the agent, and you have no way to detect the bug in production. Eval harness before memory architecture, every time.
The user-volume × turns × tokens math doesn't justify the substrate. Postgres+pgvector and Qdrant are not free to run. If your product has 100 weekly active users with 5 turns each, the memory substrate cost exceeds the token savings. Run the math before committing.
You haven't shipped a stateless version first. Stateless → memory-augmented is the right migration order, never the reverse. Memory adds load-bearing complexity (consolidation, conflict resolution, forgetting) — adding it before you understand the actual conversation patterns produces an architecture that fights the workload.

What the community says

"The goal of this work is to investigate the extent to which an LLM can manage memory and different memory hierarchies, applying lessons from operating systems to extend effective context lengths."

Charles Packer (pacjam) · MemGPT lead author, HN discussion of the original paper · 2023-10-15 · news.ycombinator.com

"Mem0 = memory storage + retrieval. Doesn't learn patterns. We built an internal solution capturing structured events (agent output, user corrections, accepted changes) to extract evolving preference profiles, enabling agents to improve without explicit instruction."

Ask HN poster (YC W23) · Ask HN: Mem0 stores memories, but doesn't learn user patterns · 2026-02-18 · news.ycombinator.com

"Relevant data is moved into the context the llm needs to answer chat questions. When it runs out of memory it moves a condensed, searchable version of the content to another data store."

u/tayo42 · HN — MemGPT: Towards LLMs as Operating Systems · 2023-10-15 · news.ycombinator.com

How to know it's working

Ship the spec, then measure on these criteria (the eval harness task grades them):

Cross-session recall >= 90% on 4 user-preference fixtures (session 1 establishes a preference, session N asks; model output includes the preference)
Contradiction handling correct on 6/6 fixtures — facts overwrite (new wins, old marked superseded_by), opinions preserve (both stay live with timestamps), events always append
Retrieval p50 latency < 50ms in-memory or < 200ms Postgres on 1000 records per user namespace
Retrieval top-K precision >= 85% — relevant record in top-3 of retrieve results across the 3 precision fixtures
forget(user_id, scope=ALL) removes every direct and derived record within one call (verified by post-call retrieve returning 0 results, including consolidated semantic facts derived from the user's episodes)
Per-user namespace stays under MAX_NAMESPACE_BYTES (default 50MB) after the storage-limit enforcer runs — eviction order is importance ASC, last_accessed_at ASC
Substrate circuit breaker opens after 3 consecutive retrieval failures in one run_id; assemble_context falls back to procedural + working memory only and emits a memory_degraded trace event

Sources

Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.

Build this in your codebase tonight

Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.

Build it in my repo →

Build an Agent with Memory

How to use this spec

What you're building

Architecture

When agent memory makes sense (and when it doesn't)

✓ Use agent memory when

⚠ Ship without memory when

What the community says

How to know it's working

Sources

Build this in your codebase tonight

Related specs

Deep Research Agent

AI Coding Agent

AI Sales Agent

RAG Agent