Skip to content

Audio Generation

Audio is a 1-D signal at 16-48 kHz. A five-second clip is 80-240k samples. No transformer attends to that sequence directly. The solution for every production audio model in 2026 is the same: a neural codec (Encodec, SoundStream, DAC) compresses audio to discrete tokens at 50-75 Hz, and a transformer or diffusion model generates tokens.

Type: Build Languages: Python Prerequisites: Phase 6 · 02 (Audio Features), Phase 6 · 04 (ASR), Phase 8 · 06 (DDPM) Time: ~45 minutes

The Problem

Three audio generation tasks:

  1. Text-to-speech. Given text, produce speech. Clean speech is narrow-band and has strong phonetic structure — solved well by transformer-over-tokens. VALL-E (Microsoft), NaturalSpeech 3, ElevenLabs, OpenAI TTS.
  2. Music generation. Given a prompt (text, melody, chord progression, genre), produce music. Much broader distribution. MusicGen (Meta), Stable Audio 2.5, Suno v4, Udio, Riffusion.
  3. Audio effects / sound design. Given a prompt, produce ambient sound or Foley. AudioGen, AudioLDM 2, Stable Audio Open.

All three run on the same substrate: neural audio codec + token-AR or diffusion generator.

The Concept

Audio generation: codec tokens + transformer or diffusion

Neural audio codecs

Encodec (Meta, 2022), SoundStream (Google, 2021), Descript Audio Codec (DAC, 2023). A convolutional encoder compresses waveform to a per-timestep vector; residual vector quantization (RVQ) converts each vector to a cascade of K codebook indices. Decoder reverses it. 24 kHz audio at 2 kbps using 8 RVQ codebooks at 75 Hz = 600 tokens/sec.

waveform (16000 samples/sec)
    └─ encoder conv ─┐
                     ├─ RVQ layer 1 → indices at 75 Hz
                     ├─ RVQ layer 2 → indices at 75 Hz
                     ├─ ...
                     └─ RVQ layer 8

Two generative paradigms on top

Token-autoregressive. Flatten RVQ tokens into a sequence, run a decoder-only transformer. MusicGen uses "delayed parallel" to emit K codebook streams in parallel with per-stream offsets. VALL-E generates speech tokens from a text prompt + 3-second voice sample.

Latent diffusion. Pack codec tokens as continuous latents or model them with categorical diffusion. Stable Audio 2.5 uses flow matching on continuous audio latents. AudioLDM 2 uses text-to-mel-to-audio diffusion.

The 2024-2026 trend: flow matching is winning for music (faster inference, cleaner samples) while token-AR still dominates speech because it is naturally causal and streams well.

Production landscape

SystemTaskBackboneLatency
ElevenLabs V3TTSToken-AR + neural vocoder~300ms first token
OpenAI GPT-4o audioFull-duplex speechEnd-to-end multimodal AR~200ms
NaturalSpeech 3TTSLatent flow matchingNon-streaming
Stable Audio 2.5Music / SFXDiT + flow matching on audio latents~10s for 1-minute clip
Suno v4Full songsUndisclosed; token-AR suspected~30s per song
Udio v1.5Full songsUndisclosed~30s per song
MusicGen 3.3BMusicToken-AR on Encodec 32kHzReal-time
AudioCraft 2Music + SFXFlow matching~5s for 5s clip
Riffusion v2MusicSpectrogram diffusion~10s

Build It

code/main.py simulates the core idea: train a tiny next-token transformer on synthetic "audio token" sequences generated from two distinct "styles" (alternating low and high tokens for style A, monotonic ramp for style B). Condition on style and sample.

Step 1: synthetic audio tokens

python
def make_tokens(style, length, vocab_size, rng):
    if style == 0:  # "speech-like": alternating
        return [i % vocab_size for i in range(length)]
    # "music-like": ramp
    return [(i * 3) % vocab_size for i in range(length)]

Step 2: train a tiny token predictor

A bigram-style predictor conditioned on style. The point is the pattern: codec tokens → cross-entropy training → autoregressive sampling.

Step 3: sample conditionally

Given the style token and a starting token, sample the next token from the predicted distribution. Continue for 20-40 tokens.

Pitfalls

  • Codec quality caps output quality. If the codec can't represent a sound faithfully, no amount of generator quality helps. DAC is the current open best.
  • RVQ error accumulation. Each RVQ layer models the residual of the previous. Errors on layer 1 propagate. Sampling with temperature 0 on higher layers helps.
  • Musical structure. 30 seconds of tokens is 20k+ tokens at 75 Hz. Hard for transformers. MusicGen uses sliding window + prompt continuation; Stable Audio uses shorter clips + crossfading.
  • Artifacts at boundaries. Crossfading between generated clips needs careful overlap-add.
  • Clean-data appetite. Music generators need tens of thousands of hours of licensed music. The Suno / Udio RIAA lawsuit (2024) brought this to the surface.
  • Voice cloning ethics. A 3-second sample plus a text prompt is enough for VALL-E / XTTS / ElevenLabs to clone a voice. Every production model needs abuse detection + opt-out lists.

Use It

Task2026 stack
Commercial TTSElevenLabs, OpenAI TTS, or Azure Neural
Voice cloning (consent-verified)XTTS v2 (open) or ElevenLabs Pro
Background music, fastStable Audio 2.5 API, Suno, or Udio
Music with lyricsSuno v4 or Udio v1.5
Sound effects / FoleyAudioCraft 2, ElevenLabs SFX, or Stable Audio Open
Real-time voice agentGPT-4o realtime or Gemini Live
Open-weights music researchMusicGen 3.3B, Stable Audio Open 1.0, AudioLDM 2
Dubbing / translationHeyGen, ElevenLabs Dubbing

Ship It

Save outputs/skill-audio-brief.md. Skill takes an audio brief (task, duration, style, voice, license) and outputs: model + hosting, prompt format (genre tags, style descriptors, structural markers), codec + generator + vocoder chain, seed protocol, and eval plan (MOS / CLAP score / CER for TTS / user A/B).

Exercises

  1. Easy. Run code/main.py and set style explicitly. Verify the generated sequences match the style's pattern.
  2. Medium. Add delayed parallel decoding: simulate 2 streams of tokens that must stay offset by 1 step. Train a joint predictor.
  3. Hard. Use HuggingFace transformers to run MusicGen-small locally. Generate a 10-second clip with three different prompts; A/B for style adherence.

Key Terms

TermWhat people sayWhat it actually means
Codec"Neural compression"Encoder / decoder for audio; typical output is 50-75 Hz tokens.
RVQ"Residual VQ"Cascade of K quantizers; each models the residual of the previous.
Token"One codec symbol"Discrete index into a codebook; 1024 or 2048 typical.
Delayed parallel"Offset codebooks"Emit K token streams with staggered offsets to reduce sequence length.
Flow matching"The 2024 win for audio"Straighter-path alternative to diffusion; faster sampling.
Voice prompt"3-second sample"Speaker embedding or token prefix that steers the cloned voice.
Mel spectrogram"The visual"Log-magnitude perceptual spectrogram; used by many TTS systems.
Vocoder"Mel to wave"Neural component that converts mel spectrograms back to audio.

Production note: audio is a streaming problem

Audio is the one output modality users expect to arrive as it is generated, not all-at-once. In production terms this means TPOT matters (Time Per Output Token) because the user's listening speed is the target throughput — not their reading speed. For 16kHz audio tokenized at ~75 tokens/second (Encodec), the server must generate ≥75 tokens/sec per user to keep playback smooth.

Two architectural consequences:

  • Flow-matching audio models cannot stream trivially. Stable Audio 2.5 and AudioCraft 2 render a fixed clip length in one pass. To stream, you chunk the clip and overlap boundaries — think sliding-window diffusion — adding 100-300ms of latency overhead vs a codec AR model.

If the product is "live voice chat" or "real-time music continuation", pick the codec AR path. If it is "render a 30-second clip on submit", flow-matching wins on quality and total latency.

Further Reading