Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue

2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex architecture over Mimi codec tokens. This is the new reference design.

Type: Learn
Languages: Python
Prerequisites: Phase 6 · 13 (Neural Audio Codecs), Phase 6 · 11 (Real-Time Audio), Phase 7 · 05 (Full Transformer)
Time: ~75 minutes

The Problem

Every voice agent built from Lessons 11 + 12 has a fundamental latency floor around 300-500 ms: VAD fires, STT processes, LLM reasons, TTS generates. Each stage has its own minimum latency. You can tune and parallelize, but the pipeline shape caps you.

Moshi (Kyutai, 2024-2026) asks a different question: what if there is no pipeline? What if one model takes audio in and emits audio out directly, continuously, with text as an intermediate "inner monologue" instead of a required stage?

The answer is full-duplex speech-to-speech. Theoretical latency is 160 ms (one 80 ms Mimi frame plus 80 ms of acoustic delay). Practical latency is about 200 ms on a single L4 GPU. That is roughly half of the 300-500 ms floor of a pipelined voice agent.
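The latency figures above follow directly from the frame rate. A minimal sketch of the arithmetic, assuming the 80 ms acoustic delay corresponds to exactly one Mimi frame (the constant and function names are illustrative):

```python
# Figures quoted in this lesson; names are illustrative.
MIMI_FRAME_RATE_HZ = 12.5
ACOUSTIC_DELAY_FRAMES = 1  # assumption: the 80 ms acoustic delay is one frame


def frame_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """One frame to buffer the input plus the acoustic delay."""
    return frame_ms(frame_rate_hz) * (1 + delay_frames)


print(frame_ms(MIMI_FRAME_RATE_HZ))  # 80.0
print(theoretical_latency_ms(MIMI_FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES))  # 160.0
```

The gap between 160 ms and the practical 200 ms is compute and transport overhead, which this sketch does not model.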

The Concept

[Figure: Moshi architecture, two parallel Mimi streams + inner-monologue text]

The Moshi architecture

Inputs. Two Mimi codec streams, both at 12.5 Hz × 8 codebooks:

  • Stream 1: user audio (Mimi-encoded, constantly arriving)
  • Stream 2: Moshi's own audio (generated by Moshi)
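To get a feel for the sequence lengths involved, the token rates can be worked out from the two numbers above. This is plain arithmetic; the one-text-token-per-step count comes from the step description in this lesson:

```python
FRAME_RATE_HZ = 12.5  # Mimi frames per second
CODEBOOKS = 8         # codebooks per frame

# Audio tokens per second for one stream (user or Moshi).
audio_tokens_per_second = FRAME_RATE_HZ * CODEBOOKS

# Tokens handled per 80 ms step: user audio + Moshi audio + one text token.
tokens_per_step = 2 * CODEBOOKS + 1

print(audio_tokens_per_second)  # 100.0
print(tokens_per_step)          # 17
```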

The transformer. A 7B-parameter Temporal Transformer processes both streams and a text "inner monologue" stream. At each 80 ms step, it:

  1. Consumes the latest user Mimi tokens (8 codebooks).
  2. Consumes the most recent Moshi Mimi tokens (8 codebooks, as produced).
  3. Generates the next Moshi text token (inner monologue).
  4. Generates the next Moshi Mimi tokens (8 codebooks via a small Depth Transformer).

All three streams (user audio, Moshi audio, Moshi text) run in parallel. Moshi can hear the user while speaking, stop itself when the user interrupts, and back-channel ("mhm") without breaking its main utterance.
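The four-part step can be sketched as a data-flow loop. This is a toy stand-in, not Moshi's forward pass: `DuplexState`, `toy_temporal_step`, and the arithmetic used to produce tokens are all hypothetical, and the codebook size of 2048 is an assumption. The point is only the bookkeeping: every step appends exactly one frame to each of the three streams.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

CODEBOOKS = 8
CODEBOOK_SIZE = 2048  # assumption for illustration


@dataclass
class DuplexState:
    user_audio: List[List[int]] = field(default_factory=list)
    moshi_audio: List[List[int]] = field(default_factory=list)
    moshi_text: List[int] = field(default_factory=list)


def toy_temporal_step(state: DuplexState, user_frame: List[int]) -> Tuple[int, List[int]]:
    """Advance all three streams by one 80 ms frame using toy arithmetic."""
    assert len(user_frame) == CODEBOOKS
    state.user_audio.append(user_frame)                    # 1. consume user tokens
    prev = state.moshi_audio[-1] if state.moshi_audio else [0] * CODEBOOKS  # 2. own last frame
    text_token = (sum(user_frame) + sum(prev)) % 100       # 3. toy text token
    moshi_frame = [(text_token + k) % CODEBOOK_SIZE for k in range(CODEBOOKS)]  # 4. toy audio
    state.moshi_text.append(text_token)
    state.moshi_audio.append(moshi_frame)
    return text_token, moshi_frame


state = DuplexState()
for t in range(3):
    toy_temporal_step(state, [t] * CODEBOOKS)

print(len(state.user_audio), len(state.moshi_audio), len(state.moshi_text))  # 3 3 3
```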

The depth transformer. Within a frame, the 8 codebooks are not predicted in parallel because they have inter-codebook dependencies. A small "depth transformer" (6 layers in the published Moshi configuration) predicts them sequentially within the 80 ms step. This temporal/depth split follows the RQ-Transformer factorization for autoregressive codec LMs; Sesame CSM uses a similar design.
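Sequential prediction within a frame can be sketched as a loop in which codebook k is conditioned on the frame context and on codebooks 0..k-1. The `predict` callable and `toy_predict` rule below are hypothetical stand-ins for the depth transformer, chosen only so the dependency is visible:

```python
from typing import Callable, List, Sequence


def decode_frame(
    context: Sequence[float],
    predict: Callable[[Sequence[float], List[int]], int],
    n_codebooks: int = 8,
) -> List[int]:
    """Predict one frame's codebooks one at a time, each seeing the earlier ones."""
    tokens: List[int] = []
    for _ in range(n_codebooks):
        tokens.append(predict(context, tokens))
    return tokens


def toy_predict(context: Sequence[float], previous: List[int]) -> int:
    """Toy rule: the output depends on the context and on every earlier codebook."""
    return (int(sum(context)) + 7 * len(previous) + sum(previous)) % 2048


frame = decode_frame([1.0, 2.0], toy_predict)
print(len(frame))  # 8
```

Because each call sees `previous`, changing an early codebook changes every later one, which is the dependency a parallel prediction head would miss.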

Why inner-monologue text helps

Without explicit text, the model has to implicitly model language in its acoustic stream. Moshi's insight: force it to emit text tokens alongside audio. The text stream is essentially the transcript of what Moshi is saying. This improves semantic coherence, makes it easier to swap out a language model head, and gives you transcripts for free.
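The "transcripts for free" point can be illustrated with a small sketch. It assumes the text stream carries one token per frame and fills frames with a padding marker when no new text is due, since speech yields fewer text tokens than audio frames. The `PAD` value and the token strings are hypothetical, not Moshi's actual vocabulary:

```python
from typing import Iterable

PAD = "<pad>"  # hypothetical padding marker for frames with no new text


def transcript_from_text_stream(text_tokens: Iterable[str]) -> str:
    """Drop padding and join the remaining per-frame text tokens."""
    return "".join(tok for tok in text_tokens if tok != PAD).strip()


# One entry per 80 ms frame; the values are made up for illustration.
stream = [" Hel", "lo", PAD, PAD, " there", PAD]
print(transcript_from_text_stream(stream))  # Hello there
```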

Hibiki: streaming speech-to-speech translation

Same architecture, trained on translation pairs. Source audio in, target-language audio out, continuously. Hibiki-Zero (Feb 2026) eliminates the need for word-level aligned training data: it uses sentence-level data plus GRPO reinforcement learning for latency optimization.

Four language pairs are supported initially; adapting to a new language takes ≈1000 hours of data.

The broader Kyutai stack (2026)

  • Moshi — full-duplex dialogue (French first, English well-supported)
  • Hibiki / Hibiki-Zero — simultaneous speech translation
  • Kyutai STT — streaming ASR (500 ms or 2.5 s look-ahead)
  • Kyutai Pocket TTS — 100M-param TTS runs on CPU (Jan 2026)
  • Unmute — full pipeline combining these on public servers

Throughput on an L40S GPU: 64 concurrent sessions at 3× real-time.

Sesame CSM — the cousin

Sesame CSM (2025) uses a similar idea — a Llama-3 backbone with a Mimi codec head. But CSM is single-directional (takes context + text, produces speech) rather than full-duplex. It's the best "voice presence" TTS on the market; not quite the same as Moshi's full-duplex capability.

2026 performance numbers

| Model | Latency | Use case | License |
|---|---|---|---|
| Moshi | 200 ms (L4) | full-duplex English / French dialogue | CC-BY 4.0 |
| Hibiki | 12.5 Hz frame rate | French ↔ English streaming translation | CC-BY 4.0 |
| Hibiki-Zero | same | 5 language pairs, no aligned data | CC-BY 4.0 |
| Sesame CSM-1B | 200 ms TTFA | context-conditioned TTS | Apache-2.0 |
| GPT-4o Realtime | ~300 ms | closed, OpenAI API | commercial |
| Gemini 2.5 Live | ~350 ms | closed, Google API | commercial |

Build It

Step 1: the interface

Moshi exposes a WebSocket server that takes 80 ms chunks of Mimi-encoded audio and returns 80 ms chunks of Mimi-encoded audio, in both directions, continuously.

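```python
import asyncio

import websockets

# Helper names as given in this lesson; treat them as illustrative and check the
# installed moshi package for its actual client-side encode/decode utilities.
from moshi.client_utils import encode_audio_mimi, decode_audio_mimi


async def moshi_chat(url: str = "ws://localhost:8998/api/chat") -> None:
    """Open one WebSocket and run the uplink and downlink concurrently."""
    async with websockets.connect(url) as ws:
        # stream_mic_to and stream_from_to_speaker are defined in Step 2.
        mic_task = asyncio.create_task(stream_mic_to(ws))
        spk_task = asyncio.create_task(stream_from_to_speaker(ws))
        await asyncio.gather(mic_task, spk_task)
```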

Step 2: the full-duplex loop

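```python
# mic_stream_at_12_5_hz, serialize, deserialize, and play are placeholders for
# your own audio I/O and wire format; they are not part of any library.

async def stream_mic_to(ws):
    """Uplink: encode each 80 ms microphone chunk and send it."""
    async for chunk_80ms in mic_stream_at_12_5_hz():
        mimi_tokens = encode_audio_mimi(chunk_80ms)
        await ws.send(serialize(mimi_tokens))

async def stream_from_to_speaker(ws):
    """Downlink: decode each received frame and play it."""
    async for msg in ws:
        # text_token is the inner-monologue token for this frame (unused here).
        mimi_tokens, text_token = deserialize(msg)
        audio = decode_audio_mimi(mimi_tokens)
        await play(audio)

if __name__ == "__main__":
    asyncio.run(moshi_chat())
```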

Both directions run concurrently over the same WebSocket. Python asyncio tasks or Rust futures are the usual way to drive the two loops.
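The concurrency pattern can be tried without a server or audio hardware. The sketch below replaces the WebSocket with two `asyncio.Queue` objects and the model with a fake echo; `mic`, `fake_server`, and `speaker` are hypothetical names, and the string "frames" stand in for Mimi tokens:

```python
import asyncio
from typing import List

FRAMES = 5


async def mic(uplink: asyncio.Queue) -> None:
    """Produce user frames; a real microphone would yield one every 80 ms."""
    for t in range(FRAMES):
        await uplink.put(f"user-frame-{t}")
        await asyncio.sleep(0)  # yield control so the other tasks can run
    await uplink.put(None)  # end-of-stream marker


async def fake_server(uplink: asyncio.Queue, downlink: asyncio.Queue) -> None:
    """Stand-in for the model: answer each incoming frame with one outgoing frame."""
    while (frame := await uplink.get()) is not None:
        await downlink.put(frame.replace("user", "moshi"))
    await downlink.put(None)


async def speaker(downlink: asyncio.Queue, played: List[str]) -> None:
    """Consume response frames as they arrive."""
    while (frame := await downlink.get()) is not None:
        played.append(frame)


async def main() -> List[str]:
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    played: List[str] = []
    await asyncio.gather(mic(uplink), fake_server(uplink, downlink), speaker(downlink, played))
    return played


played = asyncio.run(main())
print(len(played))  # 5
```

The uplink and downlink never wait on each other directly; each only waits on its own queue, which is the property the real client needs.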

Step 3: the training objective (conceptual)

For every 80 ms frame t:

  • Input: user_mimi[0..t], moshi_mimi[0..t-1], moshi_text[0..t-1]
  • Predict: moshi_text[t], then moshi_mimi[t, codebook_0..7]

Text is predicted before audio (inner monologue); audio is predicted codebook-sequential within the depth transformer.
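The indexing above can be made concrete with a short sketch that lists, for a frame t, which history is visible and in what order the targets are predicted. The function names are illustrative, and the tuples are just labels, not tensors:

```python
from typing import Dict, List, Tuple


def visible_history(t: int) -> Dict[str, List[int]]:
    """Frame indices the model may condition on when predicting frame t."""
    return {
        "user_mimi": list(range(0, t + 1)),  # user audio up to and including t
        "moshi_mimi": list(range(0, t)),     # Moshi audio up to t-1
        "moshi_text": list(range(0, t)),     # Moshi text up to t-1
    }


def frame_targets(t: int, n_codebooks: int = 8) -> List[Tuple]:
    """Prediction order for frame t: text first, then codebooks 0..n-1."""
    targets: List[Tuple] = [("moshi_text", t)]
    targets += [("moshi_mimi", t, k) for k in range(n_codebooks)]
    return targets


print(visible_history(2)["user_mimi"])  # [0, 1, 2]
print(len(frame_targets(2)))            # 9
```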

Step 4: where Moshi wins and where it doesn't

Moshi wins:

  • Sub-250 ms end-to-end on cheap hardware.
  • Natural back-channels and interruptions.
  • No pipeline glue code.

Moshi does not win:

  • Tool calling (not trained for it; you need a separate LLM path).
  • Long reasoning (Moshi is a 7B-class dialogue model, not Claude/GPT-4).
  • Factual accuracy on niche topics.
  • Most production enterprise use cases (still use pipelines in 2026).

Use It

| Situation | Pick |
|---|---|
| Lowest-latency voice companion | Moshi |
| Live translation call | Hibiki |
| Voice demo / research | Moshi, CSM |
| Enterprise agent with tools | Pipeline (Lesson 12), not Moshi |
| Custom-voice TTS in context | Sesame CSM |
| Speech-to-speech, any languages | GPT-4o Realtime or Gemini 2.5 Live (commercial) |
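One possible reading of the table as code, useful as a starting point for the Ship It skill. The function name, the flags, and especially the order in which the rules are checked are assumptions; the table itself does not rank the situations:

```python
def pick_stack(
    *,
    needs_tools: bool = False,
    open_model_covers_languages: bool = True,
    translation: bool = False,
    custom_voice_tts: bool = False,
) -> str:
    """Map workload flags to a pick from the table (rule order is an assumption)."""
    if needs_tools:
        return "Pipeline (Lesson 12)"
    if not open_model_covers_languages:
        return "GPT-4o Realtime or Gemini 2.5 Live"
    if translation:
        return "Hibiki"
    if custom_voice_tts:
        return "Sesame CSM"
    return "Moshi"


print(pick_stack(needs_tools=True))  # Pipeline (Lesson 12)
print(pick_stack(translation=True))  # Hibiki
print(pick_stack())                  # Moshi
```

Tool calling is checked first because it is the one requirement the lesson says Moshi cannot meet at all.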

Pitfalls

  • Limited tool calling. Moshi is a dialogue model, not an agent framework. Combine with pipeline for tools.
  • Specific-voice conditioning. Moshi uses a single trained persona; cloning is a separate training run.
  • Language coverage. French + English is excellent; others limited. Hibiki-Zero helps, but you still need training data.
  • Resource cost. A full Moshi session holds a GPU slot; not a cheap shared-tenant deploy pattern.

Ship It

Save as outputs/skill-duplex-pipeline.md. The skill picks a pipeline or a full-duplex architecture for a given voice-agent workload and states the reason.

Exercises

  1. Easy. Run code/main.py. It simulates the two-stream + inner-monologue architecture symbolically.
  2. Medium. Pull Moshi from HuggingFace, run the server, test one conversation. Measure wall-clock latency from end-of-user-speech to start-of-Moshi-response.
  3. Hard. Take your Lesson 12 pipeline agent and compare P50 latency vs Moshi on 20 matched test utterances. Write up when a pipeline architecturally wins anyway.
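For exercises 2 and 3, a minimal sketch of the latency bookkeeping. It assumes you record two timestamps per utterance with `time.monotonic_ns()`: when the user stops speaking and when the first response audio arrives. The timestamp pairs below are placeholders that show the calculation, not measurements:

```python
import statistics
from typing import List, Tuple


def response_latency_ms(end_of_user_speech_ns: int, first_response_audio_ns: int) -> float:
    """Wall-clock gap between end of user speech and start of the response."""
    return (first_response_audio_ns - end_of_user_speech_ns) / 1e6


def p50(latencies_ms: List[float]) -> float:
    """Median latency across utterances."""
    return statistics.median(latencies_ms)


# Placeholder (end_of_speech_ns, first_audio_ns) pairs; replace with real
# time.monotonic_ns() readings from your own runs.
samples: List[Tuple[int, int]] = [
    (0, 200_000_000),
    (1_000_000_000, 1_250_000_000),
    (2_000_000_000, 2_300_000_000),
]

latencies = [response_latency_ms(end, start) for end, start in samples]
print(p50(latencies))  # 250.0
```

Use the same end-of-speech definition for both systems; otherwise the pipeline's VAD hangover time ends up counted on one side only.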

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Full-duplex | Hear-and-speak at once | Two audio streams active simultaneously on the same model. |
| Inner monologue | Model's text stream | Moshi emits text tokens alongside its audio output. |
| Depth transformer | Inter-codebook predictor | Small transformer that predicts 8 codebooks within one 80 ms frame. |
| Mimi | Kyutai's codec | 12.5 Hz × 8 codebooks; semantic+acoustic; powers Moshi. |
| Streaming S2S | Audio → audio live | Chunk-by-chunk translation/dialogue, no pipeline stages. |
| Back-channeling | "Mhm" reactions | Moshi can emit small acknowledgments without breaking its turn. |

Further Reading