Voice Cloning & Voice Conversion

Voice cloning reads your text in someone else's voice. Voice conversion rewrites your voice into someone else's while preserving what you said. Both hang on the same decomposition: separate speaker identity from content.

Type: Build
Languages: Python
Prerequisites: Phase 6 · 06 (Speaker Recognition), Phase 6 · 07 (TTS)
Time: ~75 minutes

The Problem

In 2026, a 5-second audio clip is enough to produce a high-quality clone of anyone's voice on a consumer GPU. ElevenLabs, F5-TTS, OpenVoice v2, and VoiceBox all ship zero-shot or few-shot cloning. The technology is a blessing (accessibility TTS, dubbing, assistive voices) and a weapon (scam calls, political deepfakes, IP theft).

Two closely-related tasks:

  • Voice cloning (TTS-side): text + 5-second reference voice → audio in that voice.
  • Voice conversion (speech-side): source audio (person A saying X) + reference voice of person B → audio of B saying X.

Both factor a waveform into (content, speaker, prosody) and recombine content from one source with speaker from another.

Key constraint you now ship under in 2026: watermarking and consent gates are legally required in the EU (AI Act, enforceable August 2026) and in California (AB 2905, effective 2025). Your pipeline must emit an inaudible watermark and refuse non-consensual clones.

The Concept

Voice cloning vs conversion: factorize, swap speaker, recombine

Zero-shot cloning. Pass a 5-second clip to a model that has been trained on thousands of speakers. The speaker encoder maps the clip to a speaker embedding; the TTS decoder conditions on that embedding plus text.

Used by: F5-TTS (2024), YourTTS (2022), XTTS v2 (2024), OpenVoice v2 (2024).

Few-shot fine-tuning. Record 5-30 minutes of the target voice, then LoRA-fine-tune a base model for an hour. Quality leaps from "okay" to "indistinguishable". Coqui and ElevenLabs both support this pattern; the community uses it with F5-TTS.

Voice conversion (VC). Two families:

  • Recognition-synthesis. Run an ASR-like model to extract a content representation (e.g., soft phoneme posteriors, also called PPGs), then resynthesize with the target speaker embedding. Robust to language and accent. Used by KNN-VC (2023), Diff-HierVC (2023).
  • Disentanglement. Train an autoencoder whose bottleneck separates content, speaker, and prosody in latent space. Swap the speaker embedding at inference. Lower quality but faster. Used by AutoVC (2019), VITS-VC variants.

Neural codec-based cloning (2024+). VALL-E, VALL-E 2, NaturalSpeech 3, and VoiceBox treat audio as discrete tokens from SoundStream / EnCodec and train a large autoregressive or flow-matching model over those codec tokens. Quality is comparable to ElevenLabs on short prompts.
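To make "audio as discrete tokens" concrete: SoundStream and EnCodec get their tokens from residual vector quantization, where each stage quantizes whatever the previous stages left unexplained. The sketch below is a toy version with random, untrained codebooks. The sizes (4 stages, 256 codes, 8 dimensions) and the function names are illustrative assumptions, not either codec's real configuration.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one frame into one token per codebook (residual VQ)."""
    residual = frame.astype(float)
    tokens = []
    for codebook in codebooks:
        # Pick the code vector closest to what is still unexplained.
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))
        tokens.append(index)
        residual = residual - codebook[index]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected code vectors to reconstruct the frame."""
    return sum(codebook[i] for i, codebook in zip(tokens, codebooks))

rng = np.random.default_rng(0)
# Random codebooks for illustration only; real codecs learn them.
codebooks = [rng.normal(scale=1.0 / (2 ** s), size=(256, 8)) for s in range(4)]
frame = rng.normal(size=8)

tokens = rvq_encode(frame, codebooks)          # four integers, one per stage
reconstruction = rvq_decode(tokens, codebooks)
```

A codec language model never sees the waveform. It predicts sequences of these integer tokens, and the codec decoder turns them back into audio.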

The ethics bit, not a bolt-on

Watermarking. PerTh (Perth) and SilentCipher (2024) embed a ~16-32 bit ID imperceptibly in the audio. Survives re-encoding, streaming, and common edits. Production-ready open source.

Consent gates. Pair every cloned output with a verifiable consent record, for example: "I, Rohit, on 2026-04-22, authorize this voice for X purpose." Store the records in a tamper-evident log.
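One simple way to get a tamper-evident log is a hash chain, where each entry commits to the hash of the entry before it. This is a minimal sketch of that idea, not a prescribed design. The record fields and the helper names (`append_consent`, `verify_log`) are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _entry_hash(record, prev_hash):
    # Canonical JSON so the same record always hashes the same way.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_consent(log, record):
    """Append a record, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev": prev_hash,
             "hash": _entry_hash(record, prev_hash)}
    log.append(entry)
    return entry

def verify_log(log):
    """True only if no entry was edited, reordered, or dropped mid-chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(entry["record"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_consent(log, {"speaker": "Rohit", "date": "2026-04-22", "purpose": "X"})
append_consent(log, {"speaker": "Alice", "date": "2026-04-23", "purpose": "dubbing"})
```

A chain like this does not detect truncation of the newest entries. Publishing or countersigning the latest hash elsewhere is the usual way to close that gap.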

Detection. AASIST, RawNet2, and Wav2Vec2-AASIST ship as detectors. ASVspoof 2025 challenge published EERs of 0.8–2.3% for state-of-the-art detectors against ElevenLabs, VALL-E 2, and Bark outputs.
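EER (equal error rate) is the operating point where a detector rejects bona fide speech as often as it accepts spoofed speech. The sketch below approximates it by sweeping thresholds over the observed scores. It assumes a higher score means "more likely bona fide", and the scores are made up for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    Convention assumed here: a higher score means "more likely bona fide".
    """
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Made-up detector scores, not results from any real system.
bonafide = [0.9, 0.8, 0.75, 0.6, 0.4]
spoof = [0.7, 0.3, 0.2, 0.1, 0.05]
eer = equal_error_rate(bonafide, spoof)
```

Evaluation toolkits usually interpolate between thresholds rather than picking the closest one, so expect small differences from this approximation on real data.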

Numbers (2026)

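| Model | Zero-shot? | SECS (target similarity) | WER (intelligibility) | Params |
| --- | --- | --- | --- | --- |
| F5-TTS | Yes | 0.72 | 2.1% | 335M |
| XTTS v2 | Yes | 0.65 | 3.5% | 470M |
| OpenVoice v2 | Yes | 0.70 | 2.8% | 220M |
| VALL-E 2 | Yes | 0.77 | 2.4% | 370M |
| VoiceBox | Yes | 0.78 | 2.1% | 330M |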

A clone with SECS > 0.70 is generally indistinguishable from the target for most listeners.
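SECS (speaker encoder cosine similarity) is the cosine between the speaker embedding of the target and that of the clone. A minimal sketch follows. The 4-dimensional vectors are placeholders: in practice both embeddings come from running the same speaker encoder on the reference and on the generated audio.

```python
import math

def secs(target_embedding, clone_embedding):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(t * c for t, c in zip(target_embedding, clone_embedding))
    norm_t = math.sqrt(sum(t * t for t in target_embedding))
    norm_c = math.sqrt(sum(c * c for c in clone_embedding))
    if norm_t == 0.0 or norm_c == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_t * norm_c)

# Placeholder vectors; real speaker embeddings are much longer.
target = [0.2, 0.9, -0.1, 0.4]
clone = [0.25, 0.8, 0.0, 0.5]
score = secs(target, clone)
```

SECS values are only comparable when they come from the same speaker encoder, so fix the encoder before comparing models.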

Build It

Step 1: decompose with recognition-synthesis (code-only demo in main.py)

python
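def clone_pipeline(ref_audio, text, target_embedder, tts_model, vocoder):
    # Map the reference clip to a fixed-size speaker embedding.
    speaker_emb = target_embedder.encode(ref_audio)
    # Generate a mel spectrogram conditioned on text and speaker.
    mel = tts_model(text, speaker=speaker_emb)
    # Turn the mel spectrogram into a waveform.
    return vocoder(mel)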

Conceptually simple; most of the implementation effort lives in tts_model and the speaker encoder.

Step 2: zero-shot clone with F5-TTS

python
from f5_tts.api import F5TTS
tts = F5TTS()
wav, sr, spec = tts.infer(  # returns (waveform, sample rate, spectrogram)
    ref_file="rohit_5s.wav",
    ref_text="The quick brown fox jumps over the lazy dog.",
    gen_text="Please add milk and bread to my list.",
)

Reference transcript must exactly match the audio; mismatch breaks alignment.

Step 3: voice conversion with KNN-VC

python
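import torch
# 2023 model, https://github.com/bshall/knn-vc, loaded through torch.hub as in the repo README
knn_vc = torch.hub.load("bshall/knn-vc", "knn_vc", prematched=True,
                        trust_repo=True, pretrained=True, device="cuda")
query_seq = knn_vc.get_features("my_voice.wav")  # WavLM features of the source
matching_set = knn_vc.get_matching_set(["alice_1.wav", "alice_2.wav"])  # target pool
out_wav = knn_vc.match(query_seq, matching_set, topk=4)  # waveform tensor at 16 kHz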

KNN-VC runs WavLM to extract per-frame embeddings for source and target pool, then replaces each source frame with its nearest neighbor in the pool. Non-parametric, works with a minute of target speech.
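The matching step itself is small enough to sketch in numpy. Random arrays stand in for the WavLM features here, and `knn_convert` is a hypothetical name. With k=1 it replaces each frame with its single nearest neighbor, as described above. With k > 1 it averages the k nearest pool frames, which smooths the output.

```python
import numpy as np

def knn_convert(source_feats, target_pool, k=4):
    """Replace each source frame with the mean of its k nearest pool frames.

    source_feats: (n_source_frames, dim); target_pool: (n_pool_frames, dim).
    "Nearest" is measured by cosine similarity.
    """
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    pool = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)
    similarity = src @ pool.T                          # (n_source, n_pool)
    nearest = np.argsort(-similarity, axis=1)[:, :k]   # top-k pool indices per frame
    return target_pool[nearest].mean(axis=1)           # (n_source, dim)

rng = np.random.default_rng(0)
source_feats = rng.normal(size=(50, 16))   # stand-in for source-speaker frames
target_pool = rng.normal(size=(300, 16))   # stand-in for target-speaker frames
converted = knn_convert(source_feats, target_pool, k=4)
```

The output is still a feature sequence. A vocoder trained on the same features is needed to turn it back into a waveform.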

Step 4: embed a watermark

python
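import silentcipher  # https://github.com/sony/silentcipher
sc = silentcipher.get_model(model_type="44.1k", device="cpu")
# The message is five 8-bit values (40 bits), so embed a short ID that
# points at the full consent record instead of the record itself.
payload = [12, 200, 7, 91, 33]  # example consent-record ID
watermarked, sdr = sc.encode_wav(wav, 24000, payload)
result = sc.decode_wav(watermarked, 24000, phase_shift_decoding=False)
detected = result["messages"][0] if result["status"] else None

The payload is 40 bits (five 8-bit values), detectable after MP3 re-encode and light noise.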
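A payload of a few bytes cannot hold a consent record, only a pointer to one. One option, sketched here as an assumption rather than anything the watermarking library prescribes, is to embed a truncated hash of the consent ID and resolve it against the consent log. `consent_payload` and `matches_consent` are hypothetical helpers.

```python
import hashlib

def consent_payload(consent_id: str, n_bytes: int = 4) -> bytes:
    """Derive a fixed-size watermark payload from a consent record ID."""
    digest = hashlib.sha256(consent_id.encode("utf-8")).digest()
    return digest[:n_bytes]

def matches_consent(payload: bytes, consent_id: str) -> bool:
    """Check a decoded payload against a candidate consent record ID."""
    return payload == consent_payload(consent_id, len(payload))

payload = consent_payload("abc123")   # 4 bytes = 32 bits
values = list(payload)                # the same payload as 8-bit integers
```

A hash truncated this far will collide once the log grows large, so treat the payload as a lookup hint and the log entry as the authority.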

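Step 5: gate every clone on consent

python
def cloned_inference(text, ref_audio, ref_text, consent_record):
    # verify_signature, hash_speaker, and watermark are project helpers (not shown).
    # Raise instead of assert: asserts are stripped under `python -O`.
    if not verify_signature(consent_record):
        raise PermissionError("Signed consent required")
    if consent_record["speaker_id"] != hash_speaker(ref_audio):
        raise PermissionError("Consent record does not match the reference speaker")
    wav, _sr, _spec = tts.infer(ref_file=ref_audio, ref_text=ref_text, gen_text=text)
    return watermark(wav, payload=consent_record["id"])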
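The gate above calls verify_signature without defining it. A minimal sketch using an HMAC from the standard library follows. It takes an explicit key argument, unlike the one-argument call above, and the key handling and record fields are assumptions for illustration.

```python
import hashlib
import hmac
import json

def _canonical(record):
    # Sign everything except the signature field itself.
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign_consent(record, key):
    """Return a copy of the record with an HMAC-SHA256 signature attached."""
    signature = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return {**record, "signature": signature}

def verify_signature(record, key):
    """True if the signature matches the record contents under this key."""
    expected = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("signature", ""))

key = b"demo-key-load-from-a-secret-store"  # placeholder, never hard-code a real key
record = sign_consent({"id": "abc123", "speaker_id": "spk-42", "purpose": "X"}, key)
```

An HMAC only proves that the service holding the key created the record. If the speaker should be the one signing, an asymmetric signature scheme is the better fit.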

Use It

The 2026 stack:

| Situation | Pick |
| --- | --- |
| 5-sec zero-shot clone, open-source | F5-TTS or OpenVoice v2 |
| Commercial production cloning | ElevenLabs Instant Voice Clone v2.5 |
| Voice conversion (rewriting) | KNN-VC or Diff-HierVC |
| Many-speaker fine-tune | StyleTTS 2 + speaker adapter |
| Cross-lingual cloning | XTTS v2 or VALL-E X |
| Deepfake detection | Wav2Vec2-AASIST |

Pitfalls

  • Misaligned reference transcript. F5-TTS and similar require the reference text to match the reference audio exactly, punctuation included.
  • Reverberant reference. Echo kills the clone. Record dry, close-mic.
  • Emotional mismatch. A cheerful reference clip produces cheerful clones of everything. Match the reference emotion to the target use.
  • Language leakage. Cloning an English speaker then asking the model to speak French often carries the accent anyway; use cross-lingual models (XTTS, VALL-E X).
  • No watermark. Legally unshippable in EU from Aug 2026.

Ship It

Save as outputs/skill-voice-cloner.md. Design a cloning or conversion pipeline with consent gate + watermark + quality target.

Exercises

  1. Easy. Run code/main.py. It demonstrates the speaker-embedding swap by computing the cosine similarity between two "speakers" before and after the swap.
  2. Medium. Use OpenVoice v2 to clone your own voice. Measure SECS between reference and clone. Measure CER via Whisper.
  3. Hard. Apply SilentCipher watermark to 20 clones, run them through 128 kbps MP3 encode+decode, detect the payload. Report bit-accuracy.
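Exercises 2 and 3 each need a small metric. The helpers below are one possible sketch in plain Python: CER as edit distance over reference length, and bit accuracy as the fraction of payload bits that match. The function names are hypothetical, and transcript normalization (case, punctuation) is left to the caller.

```python
def bit_accuracy(sent: bytes, received: bytes) -> float:
    """Fraction of payload bits that survived, for equal-length payloads."""
    if len(sent) != len(received) or not sent:
        raise ValueError("payloads must be non-empty and the same length")
    wrong = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return 1.0 - wrong / (8 * len(sent))

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Rolling-row Levenshtein distance.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)
```

For Exercise 3, average bit_accuracy over the 20 clones and also report how many payloads decoded with zero errors, since a single flipped bit already points at the wrong consent record.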

Key Terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Zero-shot clone | 5 seconds is enough | Pretrained model + speaker embedding; no training. |
| PPG | Phonetic posteriorgram | Per-frame ASR posteriors used as a language-agnostic content representation. |
| KNN-VC | Nearest-neighbor conversion | Replace each source frame with the nearest target-pool frame. |
| Neural codec TTS | VALL-E style | AR model over EnCodec/SoundStream tokens. |
| Watermark | Inaudible signature | Bits embedded in audio that survive re-encoding. |
| SECS | Cloning fidelity | Cosine between target and clone speaker embeddings. |
| AASIST | Deepfake detector | Anti-spoofing model; detects synthesized speech. |

Further Reading