Build a RAG Agent

Free spec to build a RAG agent — the agentic flavor of retrieval-augmented generation: query rewriting, hybrid vector + lexical retrieval, cross-encoder rerank, citation-grounded generation, and a self-grounding check that refuses to answer without evidence. Framework-agnostic: pick vector DB, embedder, reranker, model, and eval framework once in STACK.md.

Build this in my repo →

RAG-1

Lay the foundation for your RAG agent

RAG-2

Ingest your corpus and chunk it so retrieval can find the right slice

RAG-3

Embed your chunks and stand up a hybrid vector + lexical index

RAG-4

Build the query rewriter that turns one user question into multiple retrieval queries

RAG-5

Retrieve the top candidates and rerank them with a cross-encoder

RAG-6

Generate the grounded answer with inline citations to retrieved chunks

RAG-7

Add the self-grounding check that refuses to answer without evidence

RAG-8

Build an eval harness that scores retrieval and faithfulness on real questions

RAG-9

Give your RAG agent memory across follow-up questions

RAG-10

Ship the RAG agent to production: caching, traces, latency budget

Nothing in flight

Nothing shipped yet

5 minutes of planning, ~20 hours saved. Tekk wires this 10-task spec into your repo in your stack — sign up to seed your workspace.

Build this →

How to use this spec

Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
Open in jumps the same prompt directly into v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.

What you're building

Goal 1: Ship an agentic RAG agent: query rewriter expands one user question into multiple retrieval queries, hybrid retrieval (vector + lexical) feeds a cross-encoder reranker, the generator answers with inline citations to retrieved chunks, and a self-grounding check refuses to answer when the cited evidence does not support the claim. Acceptance bar: on an 8-fixture eval set, faithfulness ≥ 0.85, context precision ≥ 0.75, context recall ≥ 0.85, answer relevance ≥ 0.80, and refuse-rate on off-corpus questions ≥ 0.90.
The agentic-RAG pattern as a tool-agnostic starter kanban. Pick the stack once in STACK.md (language / vector DB / embedder / reranker / model / eval framework); every later card carries per-variant snippets. The load-bearing decisions: query rewriting unlocks recall on questions that don't lexically match the corpus, reranking is non-optional past K~20, and the citation-or-refuse trust gate turns a demo into a production system.

Architecture

flowchart LR
  User[User Question] --> Reformulator[Reformulator<br/>(turn-aware)]
  Reformulator --> Rewriter[Query Rewriter<br/>RetrievalQueries]
  Rewriter --> Retrieve{Hybrid Retrieve<br/>per rewritten query}
  Retrieve -->|Vector top-K| Fuse[RRF Fusion + Dedupe]
  Retrieve -->|Lexical top-K| Fuse
  Fuse --> Rerank[Cross-Encoder Rerank<br/>top-N=5]
  Rerank --> Generate[Grounded Generation<br/>CitedAnswer w/ inline citations]
  Generate --> Ground{Self-Grounding Check<br/>per citation}
  Ground -->|all supported| Answer[Cited Answer]
  Ground -->|any unsupported| Refuse[Refuse w/ reason]
  Answer --> Memory[(Turn Buffer<br/>last 5)]
  Memory --> Reformulator

The three load-bearing decisions: query rewriting before retrieval (one user question → N retrieval queries via multi-query, HyDE, or sub-question — unlocks recall on phrasings the corpus doesn't index by), cross-encoder reranking is non-optional (Anthropic Contextual Retrieval data: hybrid alone cuts retrieval failures 49%, hybrid + rerank cuts them 67%), and citation-or-refuse trust gate (a post-generation grounding check that refuses to answer when the cited evidence does not support the claim — the difference between a demo and a production system).

When agentic RAG makes sense (and when it doesn't)

Agentic RAG buys you recall, precision, and trust at the cost of latency, complexity, and token spend. Reach for it when the corpus is too big for context and citations actually matter — otherwise ship something simpler.

✓ Use agentic RAG when

Your corpus is too big for the context window. Even 200k-token windows can't hold a real docs site or knowledge base. Retrieval is the only way to surface the right slice without paying to read the whole corpus on every call.
Citations are mandatory. Legal, medical, support, internal-docs use cases where the user needs to verify where an answer came from. Naive RAG paraphrases; agentic RAG cites by chunk_id and refuses without evidence.
Recall matters as much as precision. Single-shot retrieval misses on phrasings the corpus doesn't index by. Query rewriting plus hybrid retrieval plus rerank consistently outperforms embeddings-alone on every benchmark since 2024.
You can afford 1–3s extra latency for the trust gain. The full pipeline adds rewrite + rerank + grounding check overhead. If your users would rather wait three seconds for a verifiable answer than one second for a maybe-right one, you have product-market fit for agentic RAG.
You have an eval fixture set. Without ground-truth questions and ground-truth chunks, you can't tell if prompt tweaks improved or regressed retrieval. The Ragas / TruLens harness is non-optional past prototype.

⚠ Ship something simpler when

Your corpus fits in context. If you can paste your entire knowledge base into the system prompt (under ~50k tokens), do that. contextual_corpus_in_system beats RAG on every metric for small corpora. Reach for retrieval when you exceed the budget.
Single-shot retrieval already saturates. Naive RAG (one embed, one retrieve, one generate) is fine for FAQ matching where queries closely mirror the source. Don't add rewrite + rerank + grounding to a pipeline that already hits 95% on your eval.
You don't need citations. Chat about general knowledge, brainstorming, creative writing — the model's weights are the corpus. Adding retrieval here just slows things down and constrains the model away from useful generation.
You don't have ground-truth eval data. Without faithfulness / precision / recall fixtures, agentic RAG's complexity is unobservable — you can't tell if a change helped or broke the system. Build the eval set first; the pipeline second.
Static FAQ where keyword search wins. If the corpus is small and queries are predictable, BM25 alone (or a Postgres full-text-search query) ships in an afternoon and beats embeddings on cost. Don't over-engineer; the embedding bill compounds.

What the community says

"Combining contextual embeddings (using Voyage) and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 49%. Adding a reranking step reduced this further to 67%."

Anthropic engineering · "Introducing Contextual Retrieval" · 2024-09-19 · anthropic.com

"Stopped chasing hallucination tricks once I made the model refuse to answer without retrieved evidence. Suddenly the system felt trustworthy."

r/LocalLLaMA practitioner · RAG hallucination thread · 2024-05 · reddit.com

"Agentic Retrieval-Augmented Generation integrates RAG with agent-based reasoning. Unlike traditional RAG, an agent powered by an LLM reasons step-by-step, determining precisely when and how to retrieve information during an interaction."

LangChain docs · "Agentic RAG" — canonical definition · docs.langchain.com

How to know it's working

Ship the spec, then measure on these criteria (the eval harness task grades them):

On the 8-fixture seed eval set, Ragas faithfulness ≥ 0.85 averaged across the 6 on-corpus questions
Context precision ≥ 0.75 on the same 6 fixtures (retrieved chunks are actually relevant to the ground-truth answer)
Context recall ≥ 0.85 (every chunk needed to answer is in the retrieved top-N)
Answer relevance ≥ 0.80 (generated answer addresses the question, not a tangent)
Refuse-rate on the 2 off-corpus fixtures ≥ 0.90 (system says "I don't have evidence" instead of hallucinating)
End-to-end p50 latency < 3s, p95 < 8s (rewrite + retrieve + rerank + generate + ground)
Per-query cost variance ≤ 2× between cached and uncached runs (caching working as intended)

Sources

Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.

Build this in your codebase tonight

Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.

Build it in my repo →

Build a RAG Agent

How to use this spec

What you're building

Architecture

When agentic RAG makes sense (and when it doesn't)

✓ Use agentic RAG when

⚠ Ship something simpler when

What the community says

How to know it's working

Sources

Build this in your codebase tonight

Related specs

Deep Research Agent

AI Coding Agent

AI Sales Agent

Agent with Memory