Skip to content

Audio Fundamentals — Waveforms, Sampling, Fourier Transform

Waveforms are the raw signal. Spectrograms are the representation. Mel features are the ML-friendly form. Every modern ASR and TTS pipeline walks this ladder, and the first rung is understanding sampling and Fourier.

Type: Learn Languages: Python Prerequisites: Phase 1 · 06 (Vectors & Matrices), Phase 1 · 14 (Probability Distributions) Time: ~45 minutes

The Problem

A microphone produces a pressure-vs-time signal. Your neural net consumes tensors. Between them sits a stack of conventions that, when violated, produce silent bugs: the model trains fine but the WER doubles, or TTS ships a hiss, or a voice cloning system memorizes the microphone instead of the speaker.

Every bug in speech systems traces back to one of three questions:

  1. What sample rate was the data recorded at, and what does the model expect?
  2. Is the signal aliased?
  3. Are you operating on raw samples or on a frequency representation?

Get these right and the rest of Phase 6 is tractable. Get them wrong and even Whisper-Large-v4 produces garbage.

The Concept

Waveform, sampling, DFT, and frequency bins visualized

Waveform. A one-dimensional array of floats in [-1.0, 1.0]. Indexed by sample number. To convert to seconds, divide by the sample rate: t = n / sr. A 10-second clip at 16 kHz is an array of 160,000 floats.

Sampling rate (sr). How many samples per second. Common rates in 2026:

RateUse
8 kHzTelephony, legacy VOIP. Nyquist at 4 kHz kills consonants. Avoid for ASR.
16 kHzASR standard. Whisper, Parakeet, SeamlessM4T v2 all consume 16 kHz.
22.05 kHzTTS vocoder training for older models.
24 kHzModern TTS (Kokoro, F5-TTS, xTTS v2).
44.1 kHzCD audio, music.
48 kHzFilm, pro audio, high-fidelity TTS (VALL-E 2, NaturalSpeech 3).

Nyquist-Shannon. A sample rate of sr can unambiguously represent frequencies up to sr/2. The sr/2 boundary is the Nyquist frequency. Energy above Nyquist gets aliased — folded down into lower frequencies — and corrupts the signal. Always low-pass filter before downsampling.

Bit depth. 16-bit PCM (signed int16, range ±32,767) is the universal exchange format. 24-bit for music, 32-bit float for internal DSP. Libraries like soundfile read int16 but expose float32 arrays in [-1, 1].

Fourier Transform. Any finite signal is a sum of sinusoids at different frequencies. The Discrete Fourier Transform (DFT) computes, for N samples, N complex coefficients — one per frequency bin. bin k maps to frequency k · sr / N Hz. Magnitude is amplitude at that frequency, angle is phase.

FFT. Fast Fourier Transform: an O(N log N) algorithm for the DFT when N is a power of 2. Every audio library uses FFT under the hood. A 1024-sample FFT at 16 kHz gives 512 usable frequency bins spanning 0–8 kHz at 15.6 Hz resolution.

Framing + window. We do not FFT an entire clip. We chop it into overlapping frames (typically 25 ms with 10 ms hop), multiply each frame by a window function (Hann, Hamming) to kill edge discontinuities, then FFT each frame. This is the Short-Time Fourier Transform (STFT). Lesson 02 picks up from here.

Build It

Step 1: read a clip and plot the waveform

code/main.py uses only the stdlib wave module to keep the demo dependency-free. For production you will use soundfile or torchaudio.load (both return (waveform, sr) tuples):

python
import soundfile as sf
waveform, sr = sf.read("clip.wav", dtype="float32")  # shape (T,), sr=int

Step 2: synthesize a sine wave from first principles

python
import math

def sine(freq_hz, sr, seconds, amp=0.5):
    n = int(sr * seconds)
    return [amp * math.sin(2 * math.pi * freq_hz * i / sr) for i in range(n)]

A 440 Hz sine (concert A) at 16 kHz for 1 second is 16,000 floats. Write with wave.open(..., "wb") using 16-bit PCM encoding.

Step 3: compute the DFT by hand

python
def dft(x):
    N = len(x)
    out = []
    for k in range(N):
        re = sum(x[n] * math.cos(-2 * math.pi * k * n / N) for n in range(N))
        im = sum(x[n] * math.sin(-2 * math.pi * k * n / N) for n in range(N))
        out.append((re, im))
    return out

O(N²) — fine for N=256 to confirm correctness, useless for real audio. Real code calls numpy.fft.rfft or torch.fft.rfft.

Step 4: find the dominant frequency

Magnitude peak index k_star maps to frequency k_star * sr / N. Running this on the 440 Hz sine should return a peak at bin 440 * N / sr.

Step 5: demonstrate aliasing

Sample a 7 kHz sine at 10 kHz (Nyquist = 5 kHz). The 7 kHz tone is above Nyquist and folds to 10 − 7 = 3 kHz. The FFT peak appears at 3 kHz. This is the classic aliasing demo and the reason every DAC/ADC ships with a brick-wall low-pass filter.

Use It

The stack you will actually ship in 2026:

TaskLibraryWhy
Read/write WAV/FLAC/OGGsoundfile (libsndfile wrapper)Fastest, stable, returns float32.
Resampletorchaudio.transforms.Resample or librosa.resampleCorrect anti-aliasing built in.
STFT / Meltorchaudio or librosaGPU-friendly; PyTorch ecosystem.
Real-time streamingsounddevice or pyaudioCross-platform PortAudio bindings.
Inspect a fileffprobe or soxiCLI, fast, reports sr/channels/codec.

Decision rule: match sample rate before you match anything else. Whisper expects 16 kHz mono float32. Pass it 44.1 kHz stereo and you will get garbage that looks like a model bug.

Ship It

Save as outputs/skill-audio-loader.md. The skill helps you check that audio input matches the expectations of the downstream model and resamples correctly when it does not.

Exercises

  1. Easy. Synthesize a 1-second mix of 220 Hz + 440 Hz + 880 Hz at 16 kHz. Run DFT. Confirm three peaks at the expected bins.
  2. Medium. Record a 3-second WAV of your voice at 48 kHz. Downsample to 16 kHz using torchaudio.transforms.Resample (with anti-aliasing), then to 16 kHz using naive decimation (every third sample). FFT both. Where does the aliasing appear?
  3. Hard. Build the STFT from scratch using only math and the DFT from Step 3. Frame size 400, hop 160, Hann window. Plot magnitudes with matplotlib.pyplot.imshow. This is the spectrogram of Lesson 02.

Key Terms

TermWhat people sayWhat it actually means
Sample rateHow many samples per secondFrequency in Hz at which the ADC measures the signal.
NyquistThe max frequency you can representsr/2; energy above it aliases back down.
Bit depthResolution of each sampleint16 = 65,536 levels; float32 = 24-bit precision in [-1, 1].
DFTThe Fourier transform for sequencesN samples → N complex frequency coefficients.
FFTThe fast DFTO(N log N) algorithm requiring N = power of 2.
BinFrequency columnk · sr / N Hz; resolution = sr / N.
STFTSpectrogram under the hoodFramed + windowed FFT over time.
AliasingWeird frequency ghostsEnergy above Nyquist mirroring down to lower bins.

Further Reading