Build a RAG Agent
Free spec to build a RAG agent — the agentic flavor of retrieval-augmented generation: query rewriting, hybrid vector + lexical retrieval, cross-encoder rerank, citation-grounded generation, and a self-grounding check that refuses to answer without evidence. Framework-agnostic: pick vector DB, embedder, reranker, model, and eval framework once in STACK.md.
How to use this spec
- Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
- Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
- Open in jumps the same prompt directly into v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.
What you're building
Goal 1: Ship an agentic RAG agent: query rewriter expands one user question into multiple retrieval queries, hybrid retrieval (vector + lexical) feeds a cross-encoder reranker, the generator answers with inline citations to retrieved chunks, and a self-grounding check refuses to answer when the cited evidence does not support the claim. Acceptance bar: on an 8-fixture eval set, faithfulness ≥ 0.85, context precision ≥ 0.75, context recall ≥ 0.85, answer relevance ≥ 0.80, and refuse-rate on off-corpus questions ≥ 0.90.
The agentic-RAG pattern as a tool-agnostic starter kanban. Pick the stack once in STACK.md (language / vector DB / embedder / reranker / model / eval framework); every later card carries per-variant snippets. The load-bearing decisions: query rewriting unlocks recall on questions that don't lexically match the corpus, reranking is non-optional past K~20, and the citation-or-refuse trust gate turns a demo into a production system.
Architecture
flowchart LR
User[User Question] --> Reformulator[Reformulator<br/>(turn-aware)]
Reformulator --> Rewriter[Query Rewriter<br/>RetrievalQueries]
Rewriter --> Retrieve{Hybrid Retrieve<br/>per rewritten query}
Retrieve -->|Vector top-K| Fuse[RRF Fusion + Dedupe]
Retrieve -->|Lexical top-K| Fuse
Fuse --> Rerank[Cross-Encoder Rerank<br/>top-N=5]
Rerank --> Generate[Grounded Generation<br/>CitedAnswer w/ inline citations]
Generate --> Ground{Self-Grounding Check<br/>per citation}
Ground -->|all supported| Answer[Cited Answer]
Ground -->|any unsupported| Refuse[Refuse w/ reason]
Answer --> Memory[(Turn Buffer<br/>last 5)]
Memory --> Reformulator The three load-bearing decisions: query rewriting before retrieval (one user question → N retrieval queries via multi-query, HyDE, or sub-question — unlocks recall on phrasings the corpus doesn't index by), cross-encoder reranking is non-optional (Anthropic Contextual Retrieval data: hybrid alone cuts retrieval failures 49%, hybrid + rerank cuts them 67%), and citation-or-refuse trust gate (a post-generation grounding check that refuses to answer when the cited evidence does not support the claim — the difference between a demo and a production system).
When agentic RAG makes sense (and when it doesn't)
Agentic RAG buys you recall, precision, and trust at the cost of latency, complexity, and token spend. Reach for it when the corpus is too big for context and citations actually matter — otherwise ship something simpler.
✓ Use agentic RAG when
- Your corpus is too big for the context window. Even 200k-token windows can't hold a real docs site or knowledge base. Retrieval is the only way to surface the right slice without paying to read the whole corpus on every call.
- Citations are mandatory. Legal, medical, support, internal-docs use cases where the user needs to verify where an answer came from. Naive RAG paraphrases; agentic RAG cites by chunk_id and refuses without evidence.
- Recall matters as much as precision. Single-shot retrieval misses on phrasings the corpus doesn't index by. Query rewriting plus hybrid retrieval plus rerank consistently outperforms embeddings-alone on every benchmark since 2024.
- You can afford 1–3s extra latency for the trust gain. The full pipeline adds rewrite + rerank + grounding check overhead. If your users would rather wait three seconds for a verifiable answer than one second for a maybe-right one, you have product-market fit for agentic RAG.
- You have an eval fixture set. Without ground-truth questions and ground-truth chunks, you can't tell if prompt tweaks improved or regressed retrieval. The Ragas / TruLens harness is non-optional past prototype.
⚠ Ship something simpler when
- Your corpus fits in context. If you can paste your entire knowledge base into the system prompt (under ~50k tokens), do that.
contextual_corpus_in_systembeats RAG on every metric for small corpora. Reach for retrieval when you exceed the budget. - Single-shot retrieval already saturates. Naive RAG (one embed, one retrieve, one generate) is fine for FAQ matching where queries closely mirror the source. Don't add rewrite + rerank + grounding to a pipeline that already hits 95% on your eval.
- You don't need citations. Chat about general knowledge, brainstorming, creative writing — the model's weights are the corpus. Adding retrieval here just slows things down and constrains the model away from useful generation.
- You don't have ground-truth eval data. Without faithfulness / precision / recall fixtures, agentic RAG's complexity is unobservable — you can't tell if a change helped or broke the system. Build the eval set first; the pipeline second.
- Static FAQ where keyword search wins. If the corpus is small and queries are predictable, BM25 alone (or a Postgres full-text-search query) ships in an afternoon and beats embeddings on cost. Don't over-engineer; the embedding bill compounds.
What the community says
How to know it's working
Ship the spec, then measure on these criteria (the eval harness task grades them):
- On the 8-fixture seed eval set, Ragas faithfulness ≥ 0.85 averaged across the 6 on-corpus questions
- Context precision ≥ 0.75 on the same 6 fixtures (retrieved chunks are actually relevant to the ground-truth answer)
- Context recall ≥ 0.85 (every chunk needed to answer is in the retrieved top-N)
- Answer relevance ≥ 0.80 (generated answer addresses the question, not a tangent)
- Refuse-rate on the 2 off-corpus fixtures ≥ 0.90 (system says "I don't have evidence" instead of hallucinating)
- End-to-end p50 latency < 3s, p95 < 8s (rewrite + retrieve + rerank + generate + ground)
- Per-query cost variance ≤ 2× between cached and uncached runs (caching working as intended)
Sources
Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.
- ↗ Anthropic — Introducing Contextual Retrieval anthropic.com
- ↗ LangChain — Agentic RAG canonical definition + Hybrid RAG docs.langchain.com
- ↗ LangChain — Build RAG pipeline with LangGraph (node graph) docs.langchain.com
- ↗ LlamaIndex — OpenAI agent with query engine (retrieval-as-tool) docs.llamaindex.ai
- ↗ Pinecone — Rerankers and Two-Stage Retrieval pinecone.io
- ↗ Cohere — Improve Search Performance with a Single Line of Code (Rerank) cohere.com
- ↗ Ragas — Metrics-driven development of LLM applications docs.ragas.io
- ↗ TruLens — RAG triad: context relevance, groundedness, answer relevance trulens.org
- ↗ Pinecone Canopy — production-shaped reference RAG stack (GitHub) github.com
- ↗ llmware — enterprise RAG with parsers, fact-checking, prompt catalog (GitHub) github.com
- ↗ Anthropic cookbook — Contextual Embeddings guide notebook (GitHub) github.com
- ↗ HN discussion — Anthropic Contextual Retrieval news.ycombinator.com
- ↗ r/LocalLLaMA — The best way to eliminate hallucinations in RAG reddit.com
Build this in your codebase tonight
Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.