Build an AI Voice Agent
Free spec to build an AI voice agent on LiveKit Agents — a realtime voice loop with streaming STT, speed-tier LLM, streaming TTS, transformer-based end-of-turn detection, adaptive barge-in, function tools, conversation memory, and an LLM-judge eval harness. Budget-engineered around the 1000ms round-trip target (200/400/200/200 across STT/LLM/TTS/network).
How to use this spec
- Click any row above to open the full task — title, description, subtasks, AI instructions, the works. Same layout the product uses internally.
- Hit Copy as Prompt in the right sidebar of any task. You'll get the XML-wrapped prompt Tekk uses internally — paste it into Cursor, Claude Code, Codex, ChatGPT, or anywhere else and the agent has the full task context.
- "Open in" sends the same prompt directly to v0, Lovable, Bolt, Magic Patterns, Replit, or Cursor with one click.
What you're building
Goal 1: Ship a working AI voice agent on LiveKit Agents: streaming STT → LLM → TTS pipeline that hits a p50 round-trip under 1000ms on web transport, detects end-of-turn semantically (not VAD-only), handles barge-in cleanly, calls tools mid-conversation, remembers prior turns, and survives production with traces + recording + regression evals.
Anthropic-style voice agent built on LiveKit Agents as the primary variant (Pipecat as the framework-agnostic alt). Pick the provider tier once in STACK.md (STT / LLM / TTS / turn-detector / memory substrate); every later card runs the right variant. Acceptance bar: p50 end-to-end latency < 1000ms on the 6-scenario eval suite, false-interruption rate < 5% on a 50-clip turn-detection fixture, tool calls cancel on barge-in within 200ms, the multi-turn memory test passes, and the regression harness shows p50 latency within ±10% of the baseline after production hardening.
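The streaming LLM → TTS hand-off named in the goal can be illustrated with plain async generators. This is a minimal sketch, not LiveKit code: `fake_llm`, `fake_tts`, and `run_turn` are hypothetical names, and the sleeps stand in for real provider latency. It only shows the property the spec depends on, namely that the first audio chunk is produced before the LLM has finished generating.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm(prompt: str) -> AsyncIterator[str]:
    """Hypothetical LLM stage: yields tokens one at a time as they are generated."""
    for token in ["Your", "table", "is", "booked."]:
        await asyncio.sleep(0.01)  # simulated per-token generation delay
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Hypothetical TTS stage: synthesizes each token as soon as it arrives."""
    async for token in tokens:
        yield token.encode("utf-8")  # stand-in for one audio chunk


async def run_turn(prompt: str) -> List[str]:
    """Record the order of events to show audio starts before the LLM is done."""
    events: List[str] = []

    async def traced_llm() -> AsyncIterator[str]:
        async for token in fake_llm(prompt):
            events.append(f"llm:{token}")
            yield token
        events.append("llm:done")

    async for chunk in fake_tts(traced_llm()):
        events.append(f"audio:{chunk.decode('utf-8')}")
    return events


if __name__ == "__main__":
    for event in asyncio.run(run_turn("book a table")):
        print(event)
```

A batch pipeline would log every `llm:` event before the first `audio:` event. Here the two interleave, which is what keeps time-to-first-audio close to the time-to-first-token rather than to the full generation time.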
Architecture
flowchart LR
User[User Audio] -->|WebRTC frames| Room[LiveKit Room]
Room --> VAD[Silero VAD<br/>pre-warmed]
VAD --> STT[Streaming STT<br/>~90-200ms]
STT -->|partial transcripts| EOU[MultilingualModel EOU<br/>~50ms CPU]
STT -->|partials| LLM[Speed-tier LLM<br/>preemptive gen<br/>~120-400ms TTFT]
EOU -->|turn-end signal| LLM
LLM -->|token stream| Tools{Function tool?}
Tools -->|no| TTS[Streaming TTS<br/>~75-200ms first-byte]
Tools -->|yes| Tool[Tool call<br/>interruption-aware]
Tool -->|result| TTS
TTS -->|audio frames| Room
Room -->|WebRTC frames| User
LLM <-->|history| Memory[(Session Memory<br/>in-memory)]
TTS -.->|trace events| Tracer[Tracer<br/>JSONL stderr]
STT -.->|trace| Tracer
LLM -.->|trace| Tracer
The load-bearing constraint is the ~1000ms p50 latency budget: 200ms STT + 400ms LLM + 200ms TTS + 200ms network. Three architectural decisions defend it.
First, every stage streams: STT emits partial transcripts mid-utterance, the LLM streams tokens to TTS as they arrive, and TTS plays audio chunks while still synthesizing. A single non-streaming stage breaks the budget.
Second, turn detection is a transformer classifier (MultilingualModel, 135M params, ~50ms CPU) layered on top of Silero VAD, not VAD alone. LiveKit's own data: this cuts false interruptions by 85%, the difference between a conversation that feels human and one that talks over the user.
Third, tools honor interruption: every long-running tool checks speech_handle.interrupted and cancels on barge-in. Without this, the agent finishes an abandoned tool 4 seconds after the user moved on and reads a stale result over the new turn.
Preemptive generation runs the LLM speculatively before EOU confirms. It saves ~100ms of perceived latency but burns tokens on every false EOU, which is exactly why the transformer classifier (not raw VAD) is the right choice.
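The third decision, tools that honor interruption, can be sketched without the framework. In the example below, `FakeSpeechHandle` is a hypothetical stand-in that models only an `interrupted` flag, and `lookup_calendar` is an invented tool; the real handle and tool registration come from the agent framework and are not shown. The point is the polling pattern: do work in short slices and check for barge-in between them.

```python
import asyncio
from typing import Optional, Tuple


class FakeSpeechHandle:
    """Hypothetical stand-in for the framework's speech handle (assumption)."""

    def __init__(self) -> None:
        self.interrupted = False


async def lookup_calendar(
    handle: FakeSpeechHandle, steps: int = 20, step_s: float = 0.05
) -> Optional[str]:
    """Simulated slow tool that checks for barge-in between short slices of work."""
    for _ in range(steps):
        if handle.interrupted:
            return None  # user barged in: drop the result instead of speaking it
        await asyncio.sleep(step_s)  # stands in for one slice of real I/O
    return "3 open slots on Tuesday"


async def demo() -> Tuple[Optional[str], Optional[str]]:
    # Uninterrupted call runs to completion.
    completed = await lookup_calendar(FakeSpeechHandle(), steps=3, step_s=0.01)

    # Interrupted call: flip the flag shortly after the tool starts.
    handle = FakeSpeechHandle()
    task = asyncio.create_task(lookup_calendar(handle))
    await asyncio.sleep(0.1)
    handle.interrupted = True
    abandoned = await task
    return completed, abandoned


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a 50ms slice the tool notices the interruption within one slice, which leaves headroom under the 200ms cancel target. A tool that awaits a single long call cannot poll this way and would need task cancellation instead.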
When a voice agent makes sense (and when it doesn't)
Voice agents are a different beast than text agents. The action space is small (speak / call tool / stay silent), but the failure mode is brutal: latency. A voice agent that's 200ms slow feels broken; a chat agent that's 200ms slow is fine. Reach for this pattern when the conversation HAS to be in audio — phone support, drive-time scheduling, hands-busy workflows. Otherwise ship a chat agent and save yourself the budget.
✓ Build a voice agent when
- The user can't type. Phone support, drive-time scheduling, kitchen/clinical/warehouse settings where hands are full. Audio is the only viable channel.
- Speed-of-answer beats answer quality. Routing calls, taking reservations, confirming appointments — fast and correct on simple flows wins. The 1000ms latency budget is achievable for these.
- The conversation is naturally short. <5 turns, one task per call. Long multi-turn flows ("help me debug this code") suffer from cumulative latency and audio fatigue.
- Your team can ship the production layer. Voice agents need tracing, recording, redaction, and provider failover — a real engineer-week beyond the OSS framework. Builders consistently report (per Dograh's launch) that 60–70% of cost is the hosted platform fee precisely because that production layer is hard.
- Telephony/phone numbers are a hard requirement. If the use case is inbound calls, voice is non-negotiable. LiveKit SIP or Twilio dial-in are the two production paths.
⚠ Ship a chat agent when
- The user has a screen and a keyboard. Chat is faster to read, easier to skim, costs less to run. Voice is a worse interface unless audio is genuinely required.
- The task needs to show data. Tables, code, images, links — voice can't render any of it. Speaking a 12-row table back to the user is the failure mode.
- The conversation will exceed 5 turns. Cumulative latency compounds. Multi-turn voice flows that work in demos feel like phone trees in production. Move complex flows to chat with a voice handoff.
- Your unit economics can't absorb streaming costs. Streaming STT + speed-tier LLM + streaming TTS is more expensive per-minute than batched chat. If the use case is high-volume / low-value, the math may not work.
- You haven't shipped a chat version first. Chat → voice is the right migration order. Voice introduces every text-agent failure mode plus latency, turn detection, and interruption. Ship the simpler version first.
How to know it's working
Ship the spec, then measure on these criteria (the eval harness task grades them):
- p50 end-to-end latency < 1000ms on a 6-scenario fixture suite, measured per-stage (STT / LLM / TTS) and as round-trip total
- p95 end-to-end latency < 1500ms — captures tail-latency regression that p50 hides
- False-interruption rate < 5% on a 50-clip turn-detection fixture (25 real-EOU + 25 false-EOU; transformer-EOU + adaptive interruption)
- False-silence rate < 5% on the same fixture — model doesn't miss real end-of-turn
- Barge-in latency < 200ms — user interrupts, agent stops speaking within 200ms
- Tool-call accuracy: tool_use_judge passes on the calendar-lookup scenario, including the cancel-on-interrupt path
- Multi-turn memory: 3-turn fixture (turn 1 user names themselves; turn 3 user asks for the name) — accuracy_judge passes
- Hallucination trap: agent asked about a non-existent reservation acknowledges no record instead of inventing one — accuracy_judge passes
- Regression: after Task 10 hardening, re-run baseline; p50 latency within ±10% of original baseline
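The numeric gates in the list above can be expressed as a few small functions. This is a sketch under stated assumptions, not the spec's eval harness: the function names are hypothetical, percentiles use the nearest-rank definition, and each turn-detection rate is computed over its own clip class (false interruptions over false-EOU clips, false silences over real-EOU clips), which the spec does not state explicitly.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Budget split from the spec: 200/400/200/200 ms across STT/LLM/TTS/network.
STAGE_BUDGET_MS: Dict[str, float] = {"stt": 200, "llm": 400, "tts": 200, "network": 200}


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def over_budget(stage_p50_ms: Dict[str, float]) -> List[str]:
    """Names of stages whose p50 exceeds their share of the budget."""
    return [stage for stage, ms in stage_p50_ms.items() if ms > STAGE_BUDGET_MS[stage]]


def within_drift(baseline_ms: float, current_ms: float, tolerance: float = 0.10) -> bool:
    """True when the current p50 is within +/- tolerance of the baseline p50."""
    return abs(current_ms - baseline_ms) <= tolerance * baseline_ms


def turn_detection_rates(clips: Sequence[Tuple[bool, bool]]) -> Dict[str, float]:
    """clips: (is_real_eou, predicted_eou) pairs from the turn-detection fixture."""
    real = [predicted for is_real, predicted in clips if is_real]
    false = [predicted for is_real, predicted in clips if not is_real]
    if not real or not false:
        raise ValueError("fixture needs both real-EOU and false-EOU clips")
    return {
        # Detector fired on a mid-utterance pause, so the agent would talk over the user.
        "false_interruption": sum(false) / len(false),
        # Detector missed a real end of turn, so the agent would stay silent.
        "false_silence": sum(not predicted for predicted in real) / len(real),
    }


if __name__ == "__main__":
    round_trips = [700.0, 800.0, 900.0, 1000.0, 1100.0, 1400.0]
    print("p50:", percentile(round_trips, 50), "p95:", percentile(round_trips, 95))
    print("over budget:", over_budget({"stt": 180, "llm": 450, "tts": 150, "network": 120}))
    print("drift ok:", within_drift(baseline_ms=900, current_ms=980))
```

Reporting per-stage results alongside the round-trip total matters because a run can pass the 1000ms total while one stage is over its share, which hides the stage most likely to regress next.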
Sources
Every claim, pattern, and acceptance threshold on this page maps back to one of these. Read them before deviating from the spec.
- ↗ LiveKit Agents — production voice/video/telephony framework github.com
- ↗ LiveKit Agent Starter (Python) — canonical reference implementation github.com
- ↗ Pipecat — frame-based realtime voice/multimodal framework github.com
- ↗ vocode-core — voice-based LLM apps with telephony focus github.com
- ↗ LiveKit Agents documentation docs.livekit.io
- ↗ Using a transformer to improve end-of-turn detection livekit.com
- ↗ How to build the lowest latency voice agent in Vapi (~465ms end-to-end) assemblyai.com
- ↗ Voice AI pipeline: STT, LLM, TTS and the 300ms budget channel.tel
- ↗ Show HN: Dograh — an OSS Vapi alternative news.ycombinator.com
- ↗ Is there open source alternative for VAPI or retellai? news.ycombinator.com
- ↗ Launch HN: Retell AI (YC W24) — Conversational Speech API for Your LLM news.ycombinator.com
- ↗ Are Voice AI Pipeline Platforms a Race to the Bottom? news.ycombinator.com
- ↗ Show HN: Voicetest — open-source test harness for voice AI agents news.ycombinator.com
Build this in your codebase tonight
Sign up — Tekk reads your repo, picks your stack from the five decisions in STACK.md, and writes a personalized version of this 10-task spec. Same architecture, your patterns, your dependencies. Want to do it yourself? Open any task above and hit Copy as Prompt — paste into Cursor, Claude Code, or Codex.