
Autoencoders & Variational Autoencoders (VAE)

A plain autoencoder compresses, then reconstructs. It memorizes; it does not generate. Add one constraint, forcing the code to look Gaussian, and you get a sampler. The reparameterization z = μ + σ·ε is what makes that constraint trainable by backprop, and it is why the latent-diffusion and flow-matching image models you use in 2026 typically have a VAE at the input.

Type: Build | Languages: Python | Prerequisites: Phase 3 · 02 (Backprop), Phase 3 · 07 (CNNs), Phase 8 · 01 (Taxonomy) | Time: ~75 minutes

The Problem

Compress a 784-pixel MNIST digit to a 16-number code, then reconstruct. A plain autoencoder will ace reconstruction MSE but the code space is a lumpy mess. Pick a random point in the code space, decode it, and you get noise. It has no sampler. It is a compression model dressed up.

What you actually want is: (a) the code space is a clean, smooth distribution you can sample from — say an isotropic Gaussian N(0, I), (b) decoding any sample produces a plausible digit, and (c) the encoder and decoder still compress well. Three goals, one architecture, one loss.

Kingma and Welling's 2013 VAE solves this by training the encoder to output a distribution q(z|x) = N(μ(x), σ(x)²), pulling that distribution toward the prior N(0, I) via a KL penalty, and then sampling z from q(z|x) before decoding. At inference time, drop the encoder, sample z ~ N(0, I), and decode. The KL penalty is what forces the code space to be structured.

In 2026 VAEs rarely ship standalone, since diffusion has outclassed them for raw image quality, but they are the encoder of choice for latent-diffusion image models (SD 1/2/XL/3, Flux), and audio stacks such as AudioCraft build on a closely related learned codec. Learn the VAE and you learn the invisible first layer of the image pipelines you use.

The Concept

Autoencoder vs VAE: the reparameterization trick

Autoencoder. z = encoder(x), x̂ = decoder(z), loss = ||x - x̂||². Code space unstructured.

VAE encoder. Outputs two vectors: μ(x) and log σ²(x). These define q(z|x) = N(μ, diag(σ²)).

Reparameterization trick. Sampling from q(z|x) directly blocks gradients: a random draw is not a differentiable function of μ and σ. Rewrite the sample as z = μ + σ·ε where ε ~ N(0, I). Now z is a deterministic function of (μ, σ) plus noise that depends on no parameter, so gradients flow through μ and σ.
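A minimal sketch of why the rewrite helps, assuming a scalar latent and a toy downstream loss f(z) = z² (both are illustrative choices, not taken from code/main.py). With ε held fixed, z is an ordinary function of μ and log σ², so the chain rule gives ∂z/∂μ = 1 and ∂z/∂(log σ²) = ½·σ·ε, and a finite-difference check should agree:

```python
import math

def reparam(mu, log_sigma2, eps):
    """z = mu + sigma * eps, with sigma = exp(0.5 * log_sigma2)."""
    return mu + math.exp(0.5 * log_sigma2) * eps

def toy_loss(z):
    # Hypothetical downstream loss, chosen only for illustration.
    return z * z

def analytic_grads(mu, log_sigma2, eps):
    """Chain rule through z with eps treated as a constant."""
    sigma = math.exp(0.5 * log_sigma2)
    z = mu + sigma * eps
    dloss_dz = 2.0 * z
    dz_dmu = 1.0
    dz_dlv = 0.5 * sigma * eps  # because d(sigma)/d(log_sigma2) = 0.5 * sigma
    return dloss_dz * dz_dmu, dloss_dz * dz_dlv

def numeric_grads(mu, log_sigma2, eps, h=1e-6):
    """Central finite differences, reusing the same eps for every evaluation."""
    f = lambda m, lv: toy_loss(reparam(m, lv, eps))
    g_mu = (f(mu + h, log_sigma2) - f(mu - h, log_sigma2)) / (2 * h)
    g_lv = (f(mu, log_sigma2 + h) - f(mu, log_sigma2 - h)) / (2 * h)
    return g_mu, g_lv

mu, log_sigma2, eps = 0.3, -0.4, 1.25
a_mu, a_lv = analytic_grads(mu, log_sigma2, eps)
n_mu, n_lv = numeric_grads(mu, log_sigma2, eps)
```

The key design point is that ε is drawn once and then treated as data; only μ and σ sit on the differentiable path.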

Loss. The negative Evidence Lower BOund (ELBO), which splits into two terms (β = 1 gives the standard ELBO):

loss = reconstruction + β · KL[q(z|x) || N(0, I)]
     = ||x - x̂||²  + β · Σ_i ( σ_i² + μ_i² - log σ_i² - 1 ) / 2

Reconstruction pushes toward x. KL pushes q(z|x) toward the prior. They trade off. Small β (<1) = sharper samples, code space less Gaussian. Large β (>1) = cleaner code space, blurrier samples. β-VAE (Higgins 2017) made this knob famous and kicked off disentanglement research.
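For reference, the KL term is the standard closed form for two univariate Gaussians, applied per dimension because both q(z|x) and the prior are diagonal. A short derivation, writing q = N(μ, σ²) and p = N(0, 1) for a single latent dimension:

```latex
\begin{aligned}
\mathrm{KL}\big[q \,\|\, p\big]
  &= \mathbb{E}_{q}\big[\log q(z) - \log p(z)\big] \\
  &= \mathbb{E}_{q}\!\left[-\tfrac{1}{2}\log \sigma^{2}
       - \frac{(z-\mu)^{2}}{2\sigma^{2}} + \frac{z^{2}}{2}\right] \\
  &= -\tfrac{1}{2}\log \sigma^{2} - \tfrac{1}{2}
       + \tfrac{1}{2}\big(\sigma^{2} + \mu^{2}\big) \\
  &= \tfrac{1}{2}\big(\sigma^{2} + \mu^{2} - \log \sigma^{2} - 1\big).
\end{aligned}
```

The third line uses E_q[(z − μ)²] = σ² and E_q[z²] = σ² + μ². Summing over dimensions i gives the expression in the loss, and the term is zero exactly when μ = 0 and σ = 1.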

Sampling. At inference: draw z ~ N(0, I), forward through decoder. One forward pass — no iterative sampling like diffusion.

Build It

code/main.py implements a tiny VAE without numpy or torch. The input is synthetic data drawn from a 2-component Gaussian mixture in 8-D. Encoder and decoder are single-hidden-layer MLPs. We implement the tanh activation, the forward pass, the loss, and a hand-written backward pass. This is pedagogy, not production.

Step 1: encoder forward

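```python
def encode(x, enc):
    # matmul, add, and tanh are list-based helpers defined elsewhere in code/main.py.
    h = tanh(add(matmul(enc["W1"], x), enc["b1"]))           # hidden layer
    mu = add(matmul(enc["W_mu"], h), enc["b_mu"])            # mean of q(z|x)
    log_sigma2 = add(matmul(enc["W_sig"], h), enc["b_sig"])  # log-variance of q(z|x)
    return mu, log_sigma2
```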

The encoder predicts log σ² rather than σ so that the network output is unconstrained. (Passing the output through a softplus to get σ is a trap: gradients die as σ approaches 0.)
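The helpers matmul, add, and tanh are not shown in this lesson. Below is a minimal pure-Python sketch of what they might look like, assuming weights are stored as lists of rows and vectors as flat lists. The storage layout, the init_encoder helper, and the sizes h_dim=16 and z_dim=2 are assumptions for illustration, not the actual contents of code/main.py.

```python
import math
import random

def matmul(W, x):
    """Matrix-vector product; W is a list of rows, x is a flat list."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add(a, b):
    """Elementwise vector sum."""
    return [u + v for u, v in zip(a, b)]

def tanh(v):
    """Elementwise tanh."""
    return [math.tanh(u) for u in v]

def init_encoder(x_dim, h_dim, z_dim, rng, scale=0.1):
    """Hypothetical initializer: small Gaussian weights, zero biases."""
    def mat(rows, cols):
        return [[rng.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": mat(h_dim, x_dim), "b1": [0.0] * h_dim,
        "W_mu": mat(z_dim, h_dim), "b_mu": [0.0] * z_dim,
        "W_sig": mat(z_dim, h_dim), "b_sig": [0.0] * z_dim,
    }

def encode(x, enc):
    # Same body as Step 1, repeated so this block runs on its own.
    h = tanh(add(matmul(enc["W1"], x), enc["b1"]))
    mu = add(matmul(enc["W_mu"], h), enc["b_mu"])
    log_sigma2 = add(matmul(enc["W_sig"], h), enc["b_sig"])
    return mu, log_sigma2

rng = random.Random(0)
enc = init_encoder(x_dim=8, h_dim=16, z_dim=2, rng=rng)
mu, log_sigma2 = encode([0.5] * 8, enc)  # two lists of length z_dim
```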

Step 2: reparameterize and decode

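```python
import math

def reparameterize(mu, log_sigma2, rng):
    # rng: any object with a gauss(mean, std) method, e.g. random.Random.
    eps = [rng.gauss(0, 1) for _ in mu]                    # all randomness lives here
    sigma = [math.exp(0.5 * lv) for lv in log_sigma2]      # sigma = exp(log(sigma^2) / 2)
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]  # z = mu + sigma * eps

def decode(z, dec):
    h = tanh(add(matmul(dec["W1"], z), dec["b1"]))     # hidden layer
    return add(matmul(dec["W_out"], h), dec["b_out"])  # linear output: x_hat
```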

Step 3: the ELBO

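```python
import math

def elbo(x, x_hat, mu, log_sigma2, beta=1.0):
    # Returns the quantity to minimize (the negative ELBO when beta == 1),
    # plus its two components for logging.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # squared-error reconstruction
    # Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)]; lv is log(sigma^2).
    kl = 0.5 * sum(math.exp(lv) + m * m - lv - 1 for m, lv in zip(mu, log_sigma2))
    return recon + beta * kl, recon, kl
```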

The KL has an exact closed form because both distributions are Gaussian. Do not integrate numerically. People still ship code with Monte Carlo KL estimates in 2026, which adds sampling noise and extra compute for no benefit.
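To make the comparison concrete, here is a small sketch that computes the same KL both ways. The test values of μ and log σ², the sample count, and the function names are arbitrary choices for illustration. The Monte Carlo version averages log q(z) − log p(z) over draws z ~ q and only approaches the closed-form value as the sample count grows.

```python
import math
import random

def kl_closed_form(mu, log_sigma2):
    """Exact KL[N(mu, diag(sigma^2)) || N(0, I)]."""
    return 0.5 * sum(math.exp(lv) + m * m - lv - 1 for m, lv in zip(mu, log_sigma2))

def kl_monte_carlo(mu, log_sigma2, rng, n_samples=20000):
    """Estimate the same KL as the mean of log q(z) - log p(z) over z ~ q."""
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, log_sigma2):
            sigma = math.exp(0.5 * lv)
            eps = rng.gauss(0, 1)
            z = m + sigma * eps
            # log q(z) - log p(z); the log(2*pi) terms cancel,
            # and (z - m)^2 / sigma^2 is just eps^2.
            total += -0.5 * lv - 0.5 * eps * eps + 0.5 * z * z
    return total / n_samples

mu = [0.5, -1.0]
log_sigma2 = [0.0, -1.0]
exact = kl_closed_form(mu, log_sigma2)
estimate = kl_monte_carlo(mu, log_sigma2, random.Random(0))
```

The closed form costs one pass over the latent dimensions; the estimate costs one pass per sample and still carries variance.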

Step 4: generate

```python
def sample(dec, z_dim, rng):
    # Draw z from the prior N(0, I); the encoder is not used at generation time.
    z = [rng.gauss(0, 1) for _ in range(z_dim)]
    return decode(z, dec)
```

That is the generative model. Five lines.

Pitfalls

  • Posterior collapse. KL term drives q(z|x) → N(0, I) so aggressively that z carries no info about x. Fix: β-annealing (start β=0, ramp to 1), free bits, or skip the KL on inactive dimensions.
  • Blurry samples. A Gaussian decoder likelihood implies MSE reconstruction, and the MSE-optimal prediction is the conditional mean. The mean of a set of plausible digits is a fuzzy digit. Fix: discrete latents (VQ-VAE), a deep hierarchical VAE (NVAE), or use the VAE only as an encoder and stack diffusion on the latents (this is what Stable Diffusion does).
  • β too large, too early. See posterior collapse. Start at β≈0.01 and ramp.
  • Latent dim too small. 16-D works for MNIST, 256-D for ImageNet 256², 2048-D for ImageNet 1024². Stable Diffusion's VAE compresses 512×512×3 → 64×64×4 (8x downsampling per side, so 64x fewer spatial positions and 48x fewer values overall once the 3 → 4 channel change is counted).
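Two of the posterior-collapse fixes listed above, β warm-up and free bits, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the linear ramp shape, the function names, and the default free_nats=0.1 are choices made here, not values from code/main.py. Free bits is also usually applied to a batch-averaged KL per dimension; the single-example version below is a simplification.

```python
import math

def beta_schedule(step, warmup_steps, beta_start=0.0, beta_max=1.0):
    """Linear ramp from beta_start to beta_max over warmup_steps, then constant."""
    if warmup_steps <= 0:
        return beta_max
    frac = min(1.0, step / warmup_steps)
    return beta_start + frac * (beta_max - beta_start)

def kl_per_dim(mu, log_sigma2):
    """Closed-form KL to N(0, 1) for each latent dimension separately."""
    return [0.5 * (math.exp(lv) + m * m - lv - 1) for m, lv in zip(mu, log_sigma2)]

def free_bits_kl(mu, log_sigma2, free_nats=0.1):
    """Clamp each dimension's KL from below at free_nats.

    A dimension whose KL is already under the threshold contributes a
    constant, so it receives no gradient pushing it further toward the prior.
    """
    return sum(max(free_nats, k) for k in kl_per_dim(mu, log_sigma2))

# Example: beta at a few points of a 100-step warm-up.
betas = [beta_schedule(s, warmup_steps=100) for s in (0, 50, 100, 200)]
```

In a training loop the loss would then read recon + beta_schedule(step, warmup_steps) * free_bits_kl(mu, log_sigma2), with both knobs tuned per dataset.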

Use It

The 2026 VAE stack:

| Situation | Pick |
| --- | --- |
| Image-latent encoder for diffusion | Stable Diffusion VAE (sd-vae-ft-ema) or Flux VAE |
| Audio-latent encoder | Encodec (Meta), SoundStream, or DAC (Descript) |
| Video latents | Sora's spatiotemporal patches, Latte VAE, WAN VAE |
| Disentangled representation learning | β-VAE, FactorVAE, TCVAE |
| Discrete latents (for transformer modelling) | VQ-VAE, RVQ (ResidualVQ) |
| Continuous latents for generation | Plain VAE, then condition a flow/diffusion model in that latent space |

A latent-diffusion model is a VAE with a diffusion model living between encoder and decoder. The VAE does coarse compression, the diffusion model does the heavy lifting. The same compress-then-generate pattern holds for video (VAE + video-diffusion DiT) and audio (Encodec + MusicGen transformer).

Ship It

Save outputs/skill-vae-trainer.md.

The skill takes a dataset profile, a latent-dim target, and the downstream use (reconstruction, sampling, or latent-diffusion input). It outputs an architecture choice (plain/β/VQ/RVQ), a β schedule, a latent dim, a decoder likelihood (Gaussian vs categorical), and an evaluation plan (recon MSE, KL per dim, Fréchet distance between the aggregate posterior q(z) and N(0, I)).

Exercises

  1. Easy. Change β in code/main.py to 0.01, 0.1, 1.0, 5.0. Record the final reconstruction MSE and KL. Which β is Pareto-best for your synthetic data?
  2. Medium. Replace the Gaussian decoder likelihood with a Bernoulli likelihood (cross-entropy loss). Compare sample quality on a binarized version of the same synthetic data.
  3. Hard. Extend code/main.py into a mini VQ-VAE: replace the continuous z with a nearest-neighbour lookup in a codebook of K=32 entries. Compare reconstruction MSE and report how many codebook entries get used (codebook collapse is real).
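As a starting point for exercise 3, the nearest-neighbour lookup and the usage count can be sketched as follows. This covers only the quantization step. A trainable VQ-VAE also needs a straight-through gradient and codebook/commitment losses, which are omitted here. The random codebook and latents stand in for learned values and are purely illustrative.

```python
import random
from collections import Counter

def nearest_code(z, codebook):
    """Return (index, vector) of the codebook entry closest to z in squared L2."""
    best = min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])),
    )
    return best, codebook[best]

def codebook_usage(latents, codebook):
    """Count how often each codebook index is selected."""
    return Counter(nearest_code(z, codebook)[0] for z in latents)

rng = random.Random(0)
K, z_dim = 32, 2
# Placeholder codebook and encoder outputs; in the exercise these are learned.
codebook = [[rng.gauss(0, 1) for _ in range(z_dim)] for _ in range(K)]
latents = [[rng.gauss(0, 1) for _ in range(z_dim)] for _ in range(500)]

usage = codebook_usage(latents, codebook)
used = len(usage)  # number of distinct entries hit at least once
```

Reporting used / K after training is a simple way to detect codebook collapse.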

Key Terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Autoencoder | Encode-decode network | x → z → x̂, trained with MSE. Not generative. |
| VAE | AE with a sampler | Encoder outputs a distribution, KL penalty shapes code space. |
| ELBO | Evidence lower bound | log p(x) ≥ (expected log-likelihood of x under the decoder) − KL(q ‖ prior); tight when q equals the true posterior. |
| Reparameterization | z = μ + σ·ε | Rewrites stochastic node as deterministic + pure noise. Enables backprop through sampling. |
| Prior | p(z) | Target distribution for the latent, typically N(0, I). |
| Posterior collapse | "KL term wins" | Encoder ignores x, outputs the prior; decoder must hallucinate. |
| β-VAE | Tunable KL weight | loss = recon + β·KL. Higher β = more disentangled but blurrier. |
| VQ-VAE | Discrete latent | Replace continuous z with nearest codebook vector; enables transformer modelling. |

Production note: the VAE is the hottest path in a diffusion server

In a Stable Diffusion / Flux / SD3 pipeline the VAE is called up to twice per request: once to encode (only for img2img / inpainting) and once to decode. At 1024² the decoder pass is often the single largest activation-memory peak in the whole pipeline because it upsamples the latents (128×128×16 for SD3 and Flux, 128×128×4 for SDXL) back to 1024×1024×3. Two practical consequences:

  • Slice or tile the decode. diffusers exposes pipe.vae.enable_slicing() and pipe.vae.enable_tiling(). Tiling trades a small seam artifact for O(tile²) memory instead of O(H·W). Essential for 1024²+ on consumer GPUs.
  • bf16 decoder, fp32 numerics for the final resize. The original SDXL VAE was released in fp32 and can silently produce NaNs when cast to fp16 at 1024²+. The community checkpoint madebyollin/sdxl-vae-fp16-fix addresses this; always prefer the fp16-fix variant or use bf16.
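A rough back-of-envelope check on the shapes quoted in this lesson, and on why the decode dominates memory. The helper names are made up for this sketch. The 128-channel full-resolution feature map is an assumption about decoder width used only to show the scale; real decoders hold several such tensors at once.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent (H, W, C) for a VAE that downsamples each spatial side by `downsample`."""
    return height // downsample, width // downsample, latent_channels

def numel(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

def compression_ratio(height, width, downsample=8, latent_channels=4, image_channels=3):
    """Image values divided by latent values."""
    image = numel((height, width, image_channels))
    latent = numel(latent_shape(height, width, downsample, latent_channels))
    return image / latent

def activation_mib(height, width, channels, bytes_per_value=2):
    """Size in MiB of one (H, W, C) activation; 2 bytes per value for fp16/bf16."""
    return height * width * channels * bytes_per_value / 2**20

sd_latent = latent_shape(512, 512)                          # (64, 64, 4)
sd_ratio = compression_ratio(512, 512)                      # 48.0
wide_latent = latent_shape(1024, 1024, latent_channels=16)  # (128, 128, 16)
# One hypothetical 128-channel feature map at full 1024x1024 resolution:
one_feature_map = activation_mib(1024, 1024, 128)           # 256.0 MiB
```

Tiling bounds the last number by the tile size rather than the full image, which is the trade-off described in the first bullet.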

Further Reading