
Building a Low-Latency CPU-Based Voice Assistant with Streaming TTS

NEO built a sub-1.3-second time-to-first-audio (TTFA) voice assistant that runs entirely on CPU using KittenML’s TTS model, sub-sentence streaming at punctuation boundaries, and a multi-threaded producer-consumer pipeline.


Problem Statement

We asked NEO to: Build a voice assistant that feels responsive on CPU-only hardware, with no GPU. Most TTS pipelines wait for a full sentence before synthesizing, which stacks latency. The system should start playback as soon as possible (e.g. at comma/semicolon boundaries), use a small look-ahead to preserve prosody, and run the LLM, TTS, and playback stages as a concurrent pipeline.


Solution Overview

NEO built a CPU-optimized voice assistant achieving 1.25s TTFA without GPU:

  1. Sub-Sentence Streaming — Trigger synthesis at commas and semicolons, with a 25-character look-ahead, so audio can start mid-sentence without choppy output
  2. Multi-Threaded Pipeline — LLM (OpenRouter) streams tokens → chunking at punctuation → TTS (KittenML, ONNX) per chunk → playback; all stages run concurrently
  3. Small TTS Model — Under 100 MB (vs. ~150 MB+ for Piper/Sherpa-ONNX), so it loads faster and uses less memory
  4. ONNX Tuning — Thread affinity and parallelism tuned for high-core-count Windows machines, yielding high CPU utilization and low cache-miss rates
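
The sub-sentence trigger in item 1 can be sketched as a small helper: only cut at a punctuation mark once enough look-ahead text has accumulated past it. This is an illustrative sketch, not the repository's code; `split_ready_chunks` and the exact boundary set are assumptions.

```python
import re

LOOKAHEAD_CHARS = 25              # look-ahead from the article
BOUNDARY = re.compile(r"[,;.!?]")  # assumed punctuation triggers

def split_ready_chunks(buffer: str) -> tuple[list[str], str]:
    """Return (chunks ready for TTS, remaining unsynthesized buffer).

    A chunk is emitted only when at least LOOKAHEAD_CHARS of text have
    streamed in past the punctuation mark, so the synthesizer has enough
    context to keep prosody natural across the cut.
    """
    chunks = []
    while True:
        m = BOUNDARY.search(buffer)
        if m and len(buffer) - m.end() >= LOOKAHEAD_CHARS:
            chunks.append(buffer[:m.end()].strip())
            buffer = buffer[m.end():]
        else:
            return chunks, buffer
```

In use, the chunker would call this on the growing token buffer after each LLM token arrives, handing any returned chunks straight to the TTS stage.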

(Figure: CPU voice assistant pipeline architecture)

Workflow / Pipeline

  1. LLM Inference — OpenRouter-hosted model streams tokens to the client
  2. Chunking — Watch the token stream for punctuation triggers; package chunks for synthesis
  3. TTS Synthesis — The KittenML model runs on each chunk via ONNX as soon as the chunk is ready
  4. Audio Playback — Queue synthesized audio and play it continuously; the pipeline keeps all cores busy
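
The four stages above form a producer-consumer chain. A minimal sketch with standard-library queues and threads, where the token list, the fake `<audio:...>` string, and the list-based "playback" are stand-ins for the OpenRouter stream, KittenML/ONNX synthesis, and real speaker output:

```python
import queue
import threading

text_q = queue.Queue()    # chunker -> TTS
audio_q = queue.Queue()   # TTS -> playback
SENTINEL = None           # end-of-stream marker passed down the chain

def chunker(tokens):
    """Stage 2: package streamed tokens into chunks at punctuation."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        if buffer.endswith((",", ";", ".", "!", "?")):
            text_q.put(buffer)
            buffer = ""
    if buffer:
        text_q.put(buffer)
    text_q.put(SENTINEL)

def tts_worker():
    """Stage 3: synthesize each chunk as soon as it is ready."""
    while (chunk := text_q.get()) is not SENTINEL:
        audio_q.put(f"<audio:{chunk}>")   # stand-in for ONNX synthesis
    audio_q.put(SENTINEL)

def playback_worker(played):
    """Stage 4: drain the audio queue continuously."""
    while (clip := audio_q.get()) is not SENTINEL:
        played.append(clip)               # stand-in for speaker output

played = []
threads = [
    threading.Thread(target=chunker, args=(["Hi,", " there."],)),
    threading.Thread(target=tts_worker),
    threading.Thread(target=playback_worker, args=(played,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each stage blocks only on its input queue, synthesis of chunk N overlaps with playback of chunk N−1, which is what keeps time-to-first-audio low.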

Technical Details
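
The ONNX-level tuning called out in the solution overview usually comes down to session options. A minimal configuration sketch using the `onnxruntime` Python API; the thread counts are illustrative and must be tuned per machine, and `kitten_tts.onnx` is a placeholder path, not the repository's actual filename:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 8   # threads used within a single operator
opts.inter_op_num_threads = 2   # operators that may run concurrently
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Placeholder model path; CPU-only execution provider as in the article.
session = ort.InferenceSession(
    "kitten_tts.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```

On high core-count machines, capping `intra_op_num_threads` below the physical core count often beats the default, since the playback and chunking threads also need CPU time.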


Repository & Artifacts

abhishekgandhi-neo/Low-Latency-CPU-Based-Voice-Assistant (view on GitHub)






