I Built a Voice AI System From Scratch. The Hard Part Wasn't the AI.
A month ago I started building something I couldn't find anywhere: a voice-driven AI daemon that actually lives on your machine. Not a wrapper. Not a demo. A real system — push to talk, voice in, voice out, with memory, tool use, and enough intelligence to know what you're working on.
I shipped six versions in one month. It works. It's open source.
It also takes ten seconds to respond.
That gap — between what I built and what I wanted to build — turned out to be the most interesting part of the whole project. This is the write-up I wish I'd found before I started.
What I Actually Wanted
Not a chatbot. Not Siri. Something closer to an operating system layer with a voice interface — something that knows your current project, remembers what you said yesterday, can read a file or run a command or search the web, and responds in natural speech fast enough that it feels like a conversation.
The keyword is feels. That's the engineering problem nobody talks about.
The Stack
Before the story, the bones:
STT — faster-whisper
small.en model on CPU. ~250ms transcription time. Accurate enough for technical vocabulary — it gets "Tokio", "asyncio", "Hyprland" right most of the time. Rex auto-detects hardware: M-series Mac gets mlx-whisper, NVIDIA with 6GB+ gets Parakeet TDT 0.6B, everything else gets the CPU path. I'm on a GTX 1650, so I get the CPU path.
LLM — OpenAI-compatible API
Any endpoint works. Claude Haiku under the hood — fast, cheap, genuinely follows system prompts. The client is generic: swap base_url in config and it talks to Ollama, Groq, or anything else.
TTS — Piper
Runs as a subprocess so a crash can't take the daemon with it. en_US-lessac-medium voice. Not studio quality. After a week you stop noticing.
Runtime — Python asyncio No threads. Everything non-blocking. Runs as a systemd user service — starts on login, restarts on crash, idles at under 50MB.
IPC — Unix socket
$XDG_RUNTIME_DIR/rex.sock. Newline-terminated JSON. rex-trigger sends start/stop signals, bound to SUPER+Space in Hyprland.
Storage — SQLite
Conversation history. Tool call log. Persistent facts. Project context from .rex/context.md in whatever directory you're working in.
Version by Version
v0.0.1 — Does the pipeline actually work?
First version was embarrassingly simple. Press hotkey, record audio, transcribe with faster-whisper, match a keyword list, speak a canned response.
"Open terminal" → Rex says "Opening terminal." It doesn't open a terminal. It just says the words.
But audio went in and voice came out. The pipeline was real. That was enough.
v0.1 — A real brain
Replace the keyword matcher with an LLM call. Claude Haiku. Conversation history in SQLite — last N turns injected into every prompt.
The system prompt took longer to write than the integration code. That surprised me. The model is not the personality — the prompt is. Mine ended up at 200 words with hard constraints:
- Two to three sentences maximum unless depth is explicitly asked for
- No markdown (Piper reads asterisks aloud)
- No sycophancy — if I'm wrong, say so
- Contractions always — "I'll" not "I will"
The difference was immediate. It stopped sounding like a support bot.
v0.2 — Tool use
This is where it gets interesting. And painful.
The naive implementation: speak → LLM decides what tool to use → LLM asks for confirmation → user confirms → tool runs → LLM interprets result → speaks response. Four round trips. A simple file read takes three seconds and two spoken exchanges. Unusable.
The fix: collapse into one pass. System prompt now says — when tool use is required, return the complete plan immediately. Tool name, arguments, whether confirmation is needed. No clarifying questions. No intent confirmation. Plan and execute.
Tools are tiered by trust:
silent — read_file, clipboard_read, web_search, git_status
confirm — write_file, run_command, git_commit
dangerous — delete_file, anything destructive
Silent tools run immediately. No extra LLM call. Rex just does it and reports in one sentence. The four round-trip flow collapses to one for 80% of operations.
Web search uses ddgr — DuckDuckGo from the terminal. No API key. Rex searches and summarizes in two sentences. Not perfect. Good enough for "what does this error mean."
v0.3–v0.5 — The details nobody writes about
tts.clean_for_speech() — LLMs love markdown. Piper reads whatever you give it literally. Without this function Rex said "asterisk asterisk important thing asterisk asterisk." The cleaner strips markdown, converts /home/kal/projects/rex to "rex project directory", turns 1024MB into "one gigabyte." Tiny function. Makes voice output listenable.
The indicator overlay — GTK4 floating pill, top-center of the screen. Red dot when listening. Spinner when thinking. Green when done. Amber on error. Built with gtk4-layer-shell for Wayland. Without this you're speaking into a void. With it you always know what state the daemon is in.
rex-ask and rex-chat — One-shot text query and a persistent REPL. Both share the same SQLite memory as the voice daemon. Same Rex, different input. Sometimes you're in a meeting. Sometimes you just don't want to speak.
Project context — .rex/context.md in any project directory. When Rex is triggered from that directory, the context injects into every prompt. "You're helping with Rex, a Python asyncio daemon. Current focus: v0.6 Smart PTT." Rex knows what you're building without you explaining it every session.
v0.6 — Smart PTT
The most important version. The one that makes it actually usable.
The original push-to-talk was dumb. Hold key → record → release → transcribe. In practice: you want to interrupt mid-response but have to wait. You release the key early and Rex cuts off. You hold too long and record fifteen seconds of silence.
Smart PTT is a proper state machine:
IDLE → [keypress] → ARMED → [VAD detects voice] → RECORDING
RECORDING → [VAD detects 1.5s silence] → PROCESSING
PROCESSING → [STT + LLM + TTS] → IDLE
Silero VAD runs in the audio pipeline — a tiny ONNX model firing on every 30ms frame. When it detects 1.5 seconds of continuous silence after speech, recording stops automatically. One keypress to arm. Speak. Silence stops it.
No timeouts. No accidental cutoffs. That's v0.6.
The Wall
I want to be direct about this because every voice AI build post skips it.
End-to-end latency on my hardware: ~10 seconds.
Not 300ms. Not 800ms. Ten seconds.
Here's where the time actually goes:
Keypress to VAD confirmation: ~200ms
STT transcription (CPU): ~2,000ms ← the real culprit
Context injection: ~50ms
LLM time-to-first-token: ~400ms
TTS synthesis + playback start: ~300ms
Network variance (API): ~500–7,000ms on bad days
The STT model on CPU is the biggest single cost. small.en on a GTX 1650 without CUDA acceleration because I don't have enough VRAM to run both STT and a local LLM simultaneously. Something has to give.
Even with a better STT setup, the architecture has a ceiling:
Every voice pipeline — commercial or open source — is sequential:
User finishes speaking → STT → LLM → TTS
Each stage waits for the previous to complete. The LLM's ~400ms time-to-first-token sits there, static, every single turn, no matter how fast everything else gets.
Human conversation expects 200–300ms inter-turn response time. That's not a preference — it's neurological. Cross 300ms and the brain stops perceiving dialogue. It starts perceiving a terminal with a slow connection.
The best optimized production stacks in 2026 hit 320–800ms. I'm at 10 seconds. Both are broken by the same standard. Mine is just more honest about it.
The Architecture That Doesn't Exist Yet
Here's what I kept thinking about while debugging slow responses:
When you ask someone a question in person, they start forming an answer before you finish the sentence. By word 4 of a 10-word question, their brain has predicted 3 likely completions and begun composing the most probable response. The reply starts forming before the question ends.
No voice AI does this. Every single one waits for the full utterance, transcribes it completely, then starts reasoning.
The theoretical fix: speculative prefetch.
When VAD detects speech start — not end — take the last few conversation turns and generate the top-N likely query completions using a tiny local model in ~50ms. Fire parallel LLM requests for each completion speculatively.
When STT returns the actual transcript, match against the prefetches.
Hit → serve the pre-generated response immediately. Perceived latency: STT time only (~300ms on good hardware). Miss → discard, generate normally.
At 50% hit rate in a focused task context — a developer assistant that hears mostly the same categories of request — average perceived latency halves. Cost on misses: under $2/month at current API pricing.
This exists in research (PredGen, 2025, touches the adjacent problem). It does not exist as a production tool for local voice agents. Nobody has built it.
That's the thing worth building. I can't build it on this hardware.
What the Hardware Actually Limits
I want to be specific because "hardware constraints" is vague:
4GB VRAM — Can't run an intelligent local model. Anything that fits (3B quantized) is noticeably less capable than Claude Haiku. Good for simple status checks. Falls apart on reasoning. So I use the API, which means network latency is now in the critical path.
8GB RAM — The daemon, STT model, TTS model, and any local inference all compete for the same pool. Running STT on CPU without CUDA means it can't use the GPU at all, so it's slow. Running it on GPU means it competes with everything else for 4GB.
CPU-only STT — The single biggest latency hit. small.en on CPU takes 1.5–2.5 seconds. On an M3 Mac with the neural engine, the same model takes ~100ms. That's not an optimization gap. That's a hardware generation gap.
A machine with unified memory — M3 or M4 MacBook Pro, 16GB+ — would collapse most of my 10-second pipeline to under 1 second. Not because the code would change. Because the hardware allows the pipeline to share memory efficiently between STT, LLM, and TTS simultaneously.
What I Learned That Surprised Me
The system prompt is load-bearing infrastructure. More important than model choice. More important than any optimization. 200 words that define personality, response length, and behavioral constraints. Version control it. Treat changes like breaking changes. A bad system prompt makes Claude sound like a customer support bot. A good one makes it sound like a character.
Streaming changes the feeling more than the latency. Rex starting to speak mid-generation — first sentence out while Claude is still producing the second — makes it feel alive in a way that waiting for the full response doesn't. The latency doesn't change. The perception does.
Local TTS is good enough now. Piper is not ElevenLabs. After a week you stop caring. The voice cleanup function — stripping markdown, humanizing paths — matters more than voice model quality.
The confirmation UX is the hardest design problem. Not the hardest engineering problem. The hardest design problem. Too much confirmation: annoying. Too little: dangerous. The tier model (silent / confirm / dangerous) is the right abstraction. Getting the tier assignments right takes iteration and judgment, not code.
Honesty about limitations attracts better collaborators than polished demos. This one I'm still testing.
Where It Lives Now
v0.6.0. Smart PTT merged. Voice pipeline end-to-end. Tool use, memory, project context, indicator overlay, text mode, macOS support. systemd service.
Ten seconds average response time on my hardware. Probably under one second on an M4 Mac. Theoretically under 300ms with speculative prefetch built and good hardware.
It's not the thing I set out to build. It's the honest answer to what one person can build in one month on a GTX 1650 with 8GB of RAM and no team.
That answer turned out to be more interesting than the thing I wanted to build.
The codebase is clean. The architecture is right. The walls are documented.
When I come back to this — better hardware, a collaborator, or both — I'll know exactly where to start.
Rex is open source — github.com/sigil-xyz/rex
If you're working on voice AI, inference latency, or speculative decoding applied to conversational turn-taking — reach me at kalki.the.dev@gmail.com