Build an AI Voice Agent

Free spec to build an AI voice agent on LiveKit Agents — a realtime voice loop with streaming STT, speed-tier LLM, streaming TTS, transformer-based end-of-turn detection, adaptive barge-in, function tools, conversation memory, and an LLM-judge eval harness. Budget-engineered around the 1000ms round-trip target (200/400/200/200 across STT/LLM/TTS/network).

AVA-1

Lay the foundation for your AI voice agent

AVA-2

Wire the LiveKit room and the audio I/O for your voice agent

AVA-3

Stream the user's speech into text in real time

AVA-4

Wire the LLM turn handler with streaming + preemptive generation

AVA-5

Stream the assistant's response back as speech

AVA-6

Add semantic turn detection and barge-in handling

AVA-7

Give the agent a tool it can call mid-conversation

AVA-8

Give your voice agent memory across the conversation

AVA-9

Build an eval harness with judges that grades real call transcripts

AVA-10

Ship your AI voice agent: phone number, traces, recording, regression

5 minutes of planning, ~19 hours saved. Tekk wires this 10-task spec into your repo in your stack — sign up to seed your workspace.

How to use this spec

Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
Hit "Copy as Prompt" in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally; paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
"Open in" sends the same prompt directly to v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.

What you're building

Goal 1: Ship a working AI voice agent on LiveKit Agents: streaming STT → LLM → TTS pipeline that hits a p50 round-trip under 1000ms on web transport, detects end-of-turn semantically (not VAD-only), handles barge-in cleanly, calls tools mid-conversation, remembers prior turns, and survives production with traces + recording + regression evals.
The primary variant is an Anthropic-style voice agent built on LiveKit Agents, with Pipecat as the framework-agnostic alternative. Pick the provider tier once in STACK.md (STT / LLM / TTS / turn-detector / memory substrate); every later card runs the right variant. Acceptance bar: p50 end-to-end latency < 1000ms on the 6-scenario eval suite, false-interruption rate < 5% on the 50-clip turn-detection fixture, tool calls cancel on barge-in within 200ms, the multi-turn memory test passes, and the regression harness shows p50 latency within ±10% of the baseline after production hardening.

Architecture

flowchart LR
  User[User Audio] -->|WebRTC frames| Room[LiveKit Room]
  Room --> VAD[Silero VAD<br/>pre-warmed]
  VAD --> STT[Streaming STT<br/>~90-200ms]
  STT -->|partial transcripts| EOU[MultilingualModel EOU<br/>~50ms CPU]
  STT -->|partials| LLM[Speed-tier LLM<br/>preemptive gen<br/>~120-400ms TTFT]
  EOU -->|turn-end signal| LLM
  LLM -->|token stream| Tools{Function tool?}
  Tools -->|no| TTS[Streaming TTS<br/>~75-200ms first-byte]
  Tools -->|yes| Tool[Tool call<br/>interruption-aware]
  Tool -->|result| TTS
  TTS -->|audio frames| Room
  Room -->|WebRTC frames| User
  LLM <-->|history| Memory[(Session Memory<br/>in-memory)]
  TTS -.->|trace events| Tracer[Tracer<br/>JSONL stderr]
  STT -.->|trace| Tracer
  LLM -.->|trace| Tracer

The load-bearing constraint is the ~1000ms p50 latency budget: 200ms STT + 400ms LLM + 200ms TTS + 200ms network. Three architectural decisions defend it. First, every stage streams: STT emits partial transcripts mid-utterance, the LLM streams tokens to TTS as they arrive, and TTS plays audio chunks while it is still synthesizing. A single non-streaming stage is enough to break the budget. Second, turn detection is a transformer classifier (MultilingualModel, 135M params, ~50ms on CPU) layered on top of Silero VAD, not VAD alone. LiveKit reports that this cuts false interruptions by 85%, which is the difference between a conversation that feels human and one that talks over the user. Third, tools honor interruption: every long-running tool checks speech_handle.interrupted and cancels on barge-in. Without this, the agent finishes an abandoned tool call 4 seconds after the user has moved on and reads a stale result over the new turn. On top of these, preemptive generation runs the LLM speculatively before the EOU model confirms the turn has ended. That saves ~100ms of perceived latency but burns tokens on every false EOU, which is one more reason to gate turns with the transformer classifier rather than raw VAD.
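The third decision, tools that honor interruption, can be sketched with plain asyncio. This is a minimal sketch under stated assumptions, not LiveKit's API: `SpeechHandle` is a hypothetical stand-in that exposes only the `interrupted` flag the spec mentions, and `lookup_calendar`, its 4-second duration, and the 50ms poll interval are illustrative choices.

```python
import asyncio
from typing import Optional, Tuple


class SpeechHandle:
    """Hypothetical stand-in for the framework's speech handle."""

    def __init__(self) -> None:
        self.interrupted = False


async def lookup_calendar(
    handle: SpeechHandle,
    work_seconds: float = 4.0,
    poll_seconds: float = 0.05,
) -> Optional[str]:
    """Simulated slow tool that polls for barge-in and abandons its work."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + work_seconds
    while loop.time() < deadline:
        if handle.interrupted:
            # The user moved on: drop the result instead of speaking it.
            return None
        await asyncio.sleep(poll_seconds)
    return "You are free at 3pm."


async def demo() -> Tuple[Optional[str], float]:
    """Start the tool, simulate a barge-in after 100ms, and time the cancel."""
    handle = SpeechHandle()
    loop = asyncio.get_running_loop()
    start = loop.time()
    task = asyncio.create_task(lookup_calendar(handle))
    await asyncio.sleep(0.1)
    handle.interrupted = True  # simulated barge-in
    result = await task
    return result, loop.time() - start
```

Calling `asyncio.run(demo())` should return `None` shortly after the simulated barge-in rather than after the full four seconds. A short poll interval keeps cancellation well inside the 200ms barge-in target, at the cost of a few extra wake-ups; a real tool would more likely check the flag between network calls.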

When a voice agent makes sense (and when it doesn't)

Voice agents are a different beast than text agents. The action space is small (speak / call tool / stay silent), but the failure mode is brutal: latency. A voice agent that's 200ms slow feels broken; a chat agent that's 200ms slow is fine. Reach for this pattern when the conversation HAS to be in audio — phone support, drive-time scheduling, hands-busy workflows. Otherwise ship a chat agent and save yourself the budget.

✓ Build a voice agent when

The user can't type. Phone support, drive-time scheduling, kitchen/clinical/warehouse settings where hands are full. Audio is the only viable channel.
Speed-of-answer beats answer quality. Routing calls, taking reservations, confirming appointments — fast and correct on simple flows wins. The 1000ms latency budget is achievable for these.
The conversation is naturally short. <5 turns, one task per call. Long multi-turn flows ("help me debug this code") suffer from cumulative latency and audio fatigue.
Your team can ship the production layer. Voice agents need tracing, recording, redaction, and provider failover: a real engineer-week beyond the OSS framework. Dograh's founder reports that 60–70% of their total spend went to the hosted platform fee, which reflects how hard that production layer is to build in-house.
Telephony/phone numbers are a hard requirement. If the use case is inbound calls, voice is non-negotiable. LiveKit SIP or Twilio dial-in are the two production paths.

⚠ Ship a chat agent when

The user has a screen and a keyboard. Chat is faster to read, easier to skim, costs less to run. Voice is a worse interface unless audio is genuinely required.
The task needs to show data. Tables, code, images, links — voice can't render any of it. Speaking a 12-row table back to the user is the failure mode.
The conversation will exceed 5 turns. Cumulative latency compounds. Multi-turn voice flows that work in demos feel like phone trees in production. Move complex flows to chat with a voice handoff.
Your unit economics can't absorb streaming costs. Streaming STT + speed-tier LLM + streaming TTS is more expensive per-minute than batched chat. If the use case is high-volume / low-value, the math may not work.
You haven't shipped a chat version first. Chat → voice is the right migration order. Voice introduces every text-agent failure mode plus latency, turn detection, and interruption. Ship the simpler version first.

What the community says

"Even with solid OSS (Pipecat/LiveKit), we still had to do a lot of plumbing — variable extraction, tracing, testing etc. We'd spent more time building infrastructure than building the actual agents. 60–70% of our total spend was the Vapi platform fee."

a6kme (Dograh founder) · Show HN: Dograh – an OSS Vapi alternative · 2025-11 · news.ycombinator.com

"Livekit feels more or less for media handling then building voice agent. Pipecat is good project … but not enterprise ready, need to do lot of work to deploy."

p_srivastav · Is there open source alternative for VAPI or retellai? · 2025-10 · news.ycombinator.com

"One of the hardest problems to solve right now for voice AI applications is end-of-turn detection. VAD only picks up on when someone is speaking, whereas a human also uses semantics. Our transformer EOU model reduces false interruptions by 85%."

LiveKit engineering · Using a transformer to improve end-of-turn detection · livekit.com

How to know it's working

Ship the spec, then measure on these criteria (the eval harness task grades them):

p50 end-to-end latency < 1000ms on a 6-scenario fixture suite, measured per-stage (STT / LLM / TTS) and as round-trip total
p95 end-to-end latency < 1500ms — captures tail-latency regression that p50 hides
False-interruption rate < 5% on a 50-clip turn-detection fixture (25 real-EOU + 25 false-EOU; transformer-EOU + adaptive interruption)
False-silence rate < 5% on the same fixture — model doesn't miss real end-of-turn
Barge-in latency < 200ms — user interrupts, agent stops speaking within 200ms
Tool-call accuracy: tool_use_judge passes on the calendar-lookup scenario, including the cancel-on-interrupt path
Multi-turn memory: 3-turn fixture (turn 1 user names themselves; turn 3 user asks for the name) — accuracy_judge passes
Hallucination trap: agent asked about a non-existent reservation acknowledges no record instead of inventing one — accuracy_judge passes
Regression: after the AVA-10 hardening, re-run the baseline; p50 latency must stay within ±10% of the original baseline
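
The latency, turn-detection, and regression thresholds above can be checked mechanically. The sketch below assumes a hypothetical trace shape (one dict of per-stage milliseconds per turn) and a hypothetical fixture shape (pairs of ground truth and model decision). The nearest-rank percentile and the per-class denominators are also assumptions, since the spec does not define them.

```python
# Per-stage budget from the spec, in milliseconds.
BUDGET_MS = {"stt": 200, "llm": 400, "tts": 200, "network": 200}


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceiling division
    return ordered[rank - 1]


def grade_latency(turns, baseline_p50=None):
    """Grade round-trip and per-stage latency.

    turns: list of dicts mapping stage name to milliseconds (assumed shape).
    """
    totals = [sum(turn.values()) for turn in turns]
    p50 = percentile(totals, 50)
    p95 = percentile(totals, 95)
    report = {
        "p50_ms": p50,
        "p95_ms": p95,
        "p50_ok": p50 < 1000,
        "p95_ok": p95 < 1500,
        # Stages whose own p50 exceeds their share of the budget.
        "over_budget_stages": sorted(
            stage
            for stage, limit in BUDGET_MS.items()
            if percentile([turn[stage] for turn in turns], 50) > limit
        ),
    }
    if baseline_p50 is not None:
        # Regression gate: p50 must stay within 10% of the baseline.
        report["regression_ok"] = abs(p50 - baseline_p50) <= 0.10 * baseline_p50
    return report


def turn_detection_rates(clips):
    """Return (false_interruption_rate, false_silence_rate).

    clips: list of (is_real_eou, model_said_eou) pairs (assumed shape).
    Each rate is computed over the clips of its own class.
    """
    on_false_eou = [said for real, said in clips if not real]
    on_real_eou = [said for real, said in clips if real]
    false_interruption = sum(on_false_eou) / len(on_false_eou)
    false_silence = sum(1 for said in on_real_eou if not said) / len(on_real_eou)
    return false_interruption, false_silence
```

Reporting `over_budget_stages` alongside the totals keeps the per-stage view the first criterion asks for: a run can pass the 1000ms round-trip bar while one stage quietly eats another stage's headroom.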

Sources

Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.

Build this in your codebase tonight

Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.
