
Real-Time Audio Processing

Batch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot lives and dies by this latency budget.

Type: Build
Languages: Python
Prerequisites: Phase 6 · 02 (Spectrograms), Phase 6 · 04 (ASR), Phase 6 · 07 (TTS)
Time: ~75 minutes

The Problem

You want a voice assistant that feels alive. Human conversational turn-taking latency is ~230 ms (silence-to-response). Anything above 500 ms feels robotic; above 1500 ms feels broken. The budget for a full hear → understand → respond → speak loop in 2026 is:

| Stage | Budget |
|---|---|
| Mic → buffer | 20 ms |
| VAD | 10 ms |
| ASR (streaming) | 150 ms |
| LLM (first token) | 100 ms |
| TTS (first chunk) | 100 ms |
| Render → speaker | 20 ms |
| Total | ~400 ms |
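
The budget is easier to enforce when it lives in code next to your measurements. A minimal sketch, assuming you collect one latency figure per stage; the stage keys and the `over_budget` helper are names chosen here for illustration:

```python
# Per-stage budgets in milliseconds, copied from the table above.
BUDGET_MS = {
    "mic_buffer": 20,
    "vad": 10,
    "asr_streaming": 150,
    "llm_first_token": 100,
    "tts_first_chunk": 100,
    "render": 20,
}

def total_budget_ms(budget=BUDGET_MS):
    """Sum of all stage budgets."""
    return sum(budget.values())

def over_budget(measured_ms, budget=BUDGET_MS):
    """Return {stage: overshoot_ms} for every stage slower than its budget."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget.items()
        if measured_ms.get(stage, 0) > limit
    }
```

Calling `over_budget({"vad": 25, "asr_streaming": 120})` flags only the VAD stage, which overshoots by 15 ms.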

Moshi (Kyutai, 2024) clocked 200 ms full-duplex. GPT-4o-realtime (2024) clocks ~320 ms. Cascaded pipelines in 2022 shipped at 2500 ms. The 10× improvement came from three techniques: (1) streaming everywhere, (2) asynchronous pipelining with partial results, (3) interruptible generation.

The Concept

Figure: streaming audio pipeline with ring buffer, VAD gate, and interruption.

Frame / chunk / window. Real-time audio flows as fixed-size blocks. Common choice: 20 ms (320 samples at 16 kHz). Everything downstream must keep up with this cadence.
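
The frame arithmetic is worth making explicit. A small sketch (both helper names are illustrative) that computes the frame length and slices a sample sequence into full frames:

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame: rate x duration."""
    return sample_rate_hz * frame_ms // 1000

def split_frames(samples, frame_len):
    """Yield consecutive full frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]
```

At 16 kHz a 20 ms frame is 320 samples; at 48 kHz the same 20 ms is 960 samples.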

Ring buffer. Fixed-size circular buffer. Producer thread writes new frames, consumer thread reads. Prevents allocations in the hot path. Size ≈ maximum-latency × sample-rate; a 2-second 16 kHz ring = 32,000 samples.

VAD (Voice Activity Detection). Gates downstream work when nobody is speaking. Silero VAD 4.0 (2024) runs <1 ms per 30 ms frame on CPU. webrtcvad is the older alternative.

Streaming ASR. Models that emit partial transcripts as audio arrives. Parakeet-CTC-0.6B in streaming mode (NeMo, 2024) does 2–5% WER at 320 ms latency. Whisper-Streaming (Macháček et al., 2023) chunks Whisper for near-streaming at ~2 s latency.

Interruption. When the user speaks while the assistant is talking, you must (a) detect the barge-in, (b) stop the TTS, and (c) discard the remaining LLM output. All of this has to happen within 100 ms, or the user perceives the assistant as deaf.

WebRTC Opus transport. 20 ms frames, 48 kHz, adaptive bitrate 8–128 kbps. Standard for browser and mobile. LiveKit, Daily.co, Pion are the 2026 stacks for building voice apps.
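
To get a feel for these numbers, the sketch below converts a bitrate into the average encoded payload per 20 ms frame. It ignores RTP/UDP/IP headers, and the function names are illustrative:

```python
def opus_frame_samples(sample_rate_hz=48_000, frame_ms=20):
    """Samples per channel carried by one frame."""
    return sample_rate_hz * frame_ms // 1000

def payload_bytes_per_frame(bitrate_bps, frame_ms=20):
    """Average encoded bytes per frame at a given bitrate (headers excluded)."""
    return bitrate_bps * frame_ms // 8000
```

At 8 kbps each 20 ms frame averages 20 bytes; at 128 kbps it averages 320 bytes. In both cases the sender emits 50 packets per second.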

Jitter buffer. Network packets arrive out of order / late. The jitter buffer reorders and smooths; too small → audible gaps, too large → latency. 60–80 ms typical.
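
The reordering idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not how any particular WebRTC stack implements it: sequence numbers never wrap, playout is driven by one `pop()` call per frame interval, and a missing packet is reported as `None` with no concealment. The class name and `depth_packets` parameter are made up for this example.

```python
class JitterBuffer:
    """Reorders packets by sequence number behind a fixed playout delay."""

    def __init__(self, depth_packets=3):
        self.depth = depth_packets   # 3 packets x 20 ms = 60 ms of added delay
        self.packets = {}            # seq -> payload, waiting for playout
        self.next_seq = None         # next sequence number to play

    def push(self, seq, payload):
        """Store a packet; return False if it arrived after its playout slot."""
        if self.next_seq is not None and seq < self.next_seq:
            return False
        self.packets[seq] = payload
        return True

    def pop(self):
        """Return the next payload in order, or None while filling or on a gap."""
        if self.next_seq is None:
            if len(self.packets) < self.depth:
                return None          # still building up the initial delay
            self.next_seq = min(self.packets)
        payload = self.packets.pop(self.next_seq, None)
        self.next_seq += 1           # the playout clock advances even on a gap
        return payload
```

A larger `depth_packets` tolerates more reordering at the cost of latency, which is the trade-off described above.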

Common gotchas

  • Thread contention. Python's GIL + heavy models can starve the audio thread. Use a C-callback audio library (sounddevice, PortAudio) and keep Python off the hot path.
  • Sample-rate conversion latency. Resampling inside the pipeline adds 5–20 ms. Either resample once upfront or use a low-latency resampler (polyphase, soxr_hq).
  • TTS priming. Even fast TTS like Kokoro has a 100–200 ms warm-up on first request. Cache model + warm it with a dummy run before the first real turn.
  • Echo cancellation. Without AEC, TTS output re-enters the mic and triggers ASR on the bot's own voice. WebRTC AEC3 is the open-source default.
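
The TTS priming point can be checked with a small timing harness. In this sketch `FakeTTS` is a stand-in that is slow on its first call only; in practice you would pass your real engine's synthesize callable, whose name and signature are assumptions here:

```python
import time

def warm_up(synthesize, text="warm-up", runs=2):
    """Call synthesize `runs` times and return the seconds each call took."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)
    return timings

class FakeTTS:
    """Hypothetical engine: pays a one-time cost on the first request."""
    def __init__(self):
        self.primed = False

    def __call__(self, text):
        if not self.primed:
            time.sleep(0.05)         # simulated first-request warm-up
            self.primed = True
        return b"\x00" * 320         # dummy audio bytes
```

Run `warm_up` at startup and discard the audio; the first real turn then sees the second, faster timing.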

Build It

Step 1: ring buffer

python
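import collections

class RingBuffer:
    """Bounded FIFO of samples; once full, the oldest samples are dropped."""
    def __init__(self, capacity):
        # maxlen makes the deque discard from the left on overflow
        self.buf = collections.deque(maxlen=capacity)
    def write(self, frame):
        # producer side: append one frame of samples
        self.buf.extend(frame)
    def read(self, n):
        # consumer side: take up to n of the oldest samples
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]
    def level(self):
        # samples currently buffered
        return len(self.buf)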

Capacity determines max buffering latency. 32,000 samples at 16 kHz = 2 s.
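
The deque version builds a new list on every read. A variant closer to the "no allocations in the hot path" goal preallocates its storage and copies into a caller-supplied buffer. This is a sketch under assumptions: the class and method names are illustrative, it is not thread-safe as written, and a production version would copy whole slices (for example with NumPy) instead of looping per sample.

```python
from array import array

class PreallocatedRing:
    """Ring buffer over one preallocated float array; overwrites oldest on overflow."""

    def __init__(self, capacity):
        self.buf = array("f", [0.0]) * capacity   # storage allocated once
        self.capacity = capacity
        self.write_pos = 0                        # index of the next write
        self.count = 0                            # samples currently stored

    def write(self, frame):
        for x in frame:
            self.buf[self.write_pos] = x
            self.write_pos = (self.write_pos + 1) % self.capacity
        self.count = min(self.capacity, self.count + len(frame))

    def read_into(self, out):
        """Copy the oldest samples into `out`; return how many were copied."""
        n = min(len(out), self.count)
        start = (self.write_pos - self.count) % self.capacity
        for i in range(n):
            out[i] = self.buf[(start + i) % self.capacity]
        self.count -= n
        return n
```

The caller reuses one `out` buffer across reads, so steady-state operation touches no new memory.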

Step 2: VAD gate

python
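def simple_energy_vad(frame, threshold=0.01):
    # Speech if RMS exceeds threshold (mean square compared with threshold squared).
    if len(frame) == 0:
        return False
    return sum(x * x for x in frame) / len(frame) > threshold ** 2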

Replace with Silero VAD in production:

python
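import torch
# torch.hub.load returns (model, utils); only the model is used here.
vad, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
# The model expects float32 audio. Some recent Silero releases also require
# fixed 512-sample chunks at 16 kHz, so check the chunk size for your version.
is_speech = vad(torch.tensor(frame, dtype=torch.float32), 16000).item() > 0.5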
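
Raw per-frame decisions tend to flicker at word boundaries. One common remedy, not part of the steps above, is to add a short attack and hangover around the raw VAD output. A sketch with illustrative, untuned defaults and a made-up class name:

```python
class HangoverGate:
    """Smooths per-frame VAD decisions.

    Opens after `attack` consecutive speech frames and stays open until more
    than `hangover` consecutive non-speech frames have passed.
    """

    def __init__(self, attack=2, hangover=10):
        self.attack = attack
        self.hangover = hangover
        self.speech_run = 0
        self.silence_run = 0
        self.open = False

    def update(self, is_speech):
        if is_speech:
            self.speech_run += 1
            self.silence_run = 0
            if self.speech_run >= self.attack:
                self.open = True
        else:
            self.silence_run += 1
            self.speech_run = 0
            if self.silence_run > self.hangover:
                self.open = False
        return self.open
```

With 20 ms frames, `hangover=10` keeps the gate open for about 200 ms of trailing silence, which trades a little latency at end of speech for fewer clipped words.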

Step 3: streaming ASR

python
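# Parakeet-CTC-0.6B via NeMo. The loop below is schematic: audio_stream() and
# transcribe_streaming() are placeholders for your own mic source and
# chunked-inference wrapper, not documented NeMo calls.
from nemo.collections.asr.models import EncDecCTCModelBPE

asr = EncDecCTCModelBPE.from_pretrained("nvidia/parakeet-ctc-0.6b")
# chunk_ms=320 ms, look_ahead_ms=80 ms
for chunk in audio_stream():
    partial_text = asr.transcribe_streaming(chunk)
    print(partial_text, end="\r")  # overwrite the line with the latest partial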

Step 4: interruption handler

python
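import asyncio

class Dialog:
    def __init__(self):
        self.tts_task = None   # the in-flight reply, if any

    def on_user_speech(self, frame):
        # Call only for frames the VAD marks as speech.
        if self.tts_task and not self.tts_task.done():
            self.tts_task.cancel()   # barge-in: stops reply() at its next await
        # then feed to streaming ASR

    def on_final_user_utterance(self, text):
        # Must be called from inside a running event loop.
        self.tts_task = asyncio.create_task(self.reply(text))

    async def reply(self, text):
        # llm_then_tts and speaker are placeholders for your LLM+TTS stream
        # and your audio sink.
        async for tts_chunk in llm_then_tts(text):
            speaker.write(tts_chunk)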

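This hinges on async I/O and cancellable TTS streaming. Over WebRTC, also stop or flush the outgoing audio track so that already-queued audio does not keep playing. The browser API exposes MediaStreamTrack.stop() for this; RTCPeerConnection itself has close(), not stop().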
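
The cancellation path can be exercised without any audio hardware. In this sketch `fake_tts_stream` stands in for the `llm_then_tts` placeholder, a list stands in for the speaker, and the `"<flushed>"` marker shows where a real system would drop queued audio. All three are assumptions made for the demo.

```python
import asyncio

async def fake_tts_stream(text, chunk_s=0.01):
    """Hypothetical TTS stream: yields one 'chunk' per word."""
    for word in text.split():
        await asyncio.sleep(chunk_s)
        yield word

async def speak(text, played):
    try:
        async for chunk in fake_tts_stream(text):
            played.append(chunk)
    except asyncio.CancelledError:
        played.append("<flushed>")   # a real system flushes the speaker queue here
        raise                        # re-raise so the task ends as cancelled

async def demo():
    played = []
    task = asyncio.create_task(speak("one two three four five six", played))
    await asyncio.sleep(0.025)       # the user starts talking mid-reply
    task.cancel()                    # barge-in
    try:
        await task
    except asyncio.CancelledError:
        pass
    return played

played = asyncio.run(demo())
```

The reply stops partway through: `played` ends with the flush marker and never reaches the last word.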

Use It

The 2026 stack:

| Layer | Pick |
|---|---|
| Transport | LiveKit (WebRTC) or Pion (Go) |
| VAD | Silero VAD 4.0 |
| Streaming ASR | Parakeet-CTC-0.6B or Whisper-Streaming |
| LLM first-token | Groq, Cerebras, vLLM-streaming |
| Streaming TTS | Kokoro or ElevenLabs Turbo v2.5 |
| Echo cancel | WebRTC AEC3 |
| End-to-end native | OpenAI Realtime API or Moshi |

Pitfalls

  • Buffering 500 ms to be safe. The buffer is your latency floor. Shrink it.
  • Not pinning threads. An audio callback running on a thread with lower priority than the UI thread glitches under load.
  • TTS chunks too small. Sub-200 ms chunks make vocoder artifacts audible. 320 ms chunks are the sweet spot.
  • No jitter buffer. Real networks are jittery; without smoothing you get pops.
  • Single-shot error handling. Audio pipelines must be crash-proof. Without per-stage recovery, one exception kills the session.
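
For the last pitfall, one simple pattern is to wrap each per-frame stage so that a failure is logged and replaced by a safe default instead of propagating. A sketch; the `guarded` helper and its parameters are names invented for this example:

```python
import logging

log = logging.getLogger("pipeline")

def guarded(stage, fallback=None, name=None):
    """Wrap a per-frame stage so an exception returns `fallback` instead of raising."""
    label = name or getattr(stage, "__name__", "stage")

    def wrapper(frame):
        try:
            return stage(frame)
        except Exception:
            # Log with traceback, then keep the session alive.
            log.exception("stage %s failed; using fallback", label)
            return fallback

    return wrapper

def flaky_vad(frame):
    """Illustrative stage that fails on an empty frame."""
    if len(frame) == 0:
        raise ValueError("empty frame")
    return True

safe_vad = guarded(flaky_vad, fallback=False)
```

Choose the fallback per stage: "no speech" for a VAD, an empty transcript for ASR, silence for TTS.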

Ship It

Save as outputs/skill-realtime-designer.md. Design a real-time audio pipeline with concrete latency budgets per stage.

Exercises

  1. Easy. Run code/main.py. It simulates a ring buffer + energy VAD and prints stage latencies for a fake 10-second stream.
  2. Medium. Using sounddevice, build a passthrough loop that processes your mic in 20 ms frames and prints VAD state at each frame.
  3. Hard. Build a full duplex echo test with aiortc: browser → WebRTC → Python → WebRTC → browser. Measure glass-to-glass latency with a 1 kHz pulse.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Ring buffer | The circular queue | Fixed-size, lock-free (or SPSC-locked) FIFO for audio frames. |
| VAD | Silence gate | Model or heuristic marking speech vs non-speech. |
| Streaming ASR | Real-time STT | Emits partial text as audio arrives; bounded lookahead. |
| Jitter buffer | Network smoother | Queue reordering out-of-order packets; 60–80 ms typical. |
| AEC | Echo cancellation | Subtracts speaker-to-mic feedback path. |
| Barge-in | User interrupt | System detects user speech mid-TTS; must cancel playback. |
| Full duplex | Simultaneous both ways | User and bot can talk at the same time; Moshi is full duplex. |

Further Reading