The ElevenLabs newsletter voice workflow I had to rebuild twice
Cloning a voice and turning weekly newsletters into podcast episodes is a five-minute setup that took me three weeks to get right. Three specific things that broke, and what finally fixed each one.
- Monthly cost: ~£18 / $23 on the Creator plan
The pitch is irresistible. You clone your own voice once, paste a newsletter into an API call, and get a listenable podcast episode out the other end. Readers who prefer audio get a daily commute listen, you get better engagement metrics, and the whole pipeline is about forty lines of Python.
I built this. It took three weeks, I rebuilt it twice, and the final version looks almost nothing like what I drew on the whiteboard. Here is what went wrong, in order, with what finally fixed each one. None of this is a criticism of ElevenLabs. It is a great product. The failures were mine, but the tutorials do not mention any of them.
The glossy version
Step one: record 30 minutes of clean audio of yourself reading various texts. Upload to ElevenLabs as training data for Instant Voice Cloning. Step two: confirm the clone sounds like you in a 20-second test clip. Step three: write a Python script that takes a newsletter, chunks it into sections, calls client.text_to_speech.convert on each chunk with your voice ID, and concatenates the output into an MP3. Step four: ship it.
I did all four steps in one evening. The result sounded like me. I was convinced the project was done. It was not done.
Break 1: my voice sounded unhinged after the third paragraph
When you feed ElevenLabs a single long chunk of text, the model hits its internal consistency budget and starts drifting. By paragraph three, the pitch was climbing, by paragraph five I sounded vaguely out of breath, and by paragraph seven there was a rising panic in the vowels that I have genuinely never produced in real life. It was subtle enough that I did not catch it in the 20-second test clip but obvious on anything longer than a minute.
The fix was not to use a longer max_length or a better voice. The fix was to chunk the input at paragraph boundaries, call the API once per paragraph, and stitch the clips together in post. Every paragraph starts fresh from the model's perspective, so the drift resets each time. The new loop looked like this:
from elevenlabs.client import ElevenLabs
from pathlib import Path

client = ElevenLabs(api_key="YOUR_KEY")
voice_id = "YOUR_VOICE_ID"

def newsletter_to_audio(text: str, output_dir: Path) -> list[Path]:
    # One API call per paragraph, so the model's drift resets each time
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    clips = []
    for i, paragraph in enumerate(paragraphs):
        audio = client.text_to_speech.convert(
            voice_id=voice_id,
            model_id="eleven_multilingual_v2",
            text=paragraph,
            voice_settings={
                "stability": 0.6,
                "similarity_boost": 0.75,
                "style": 0.0,
                "use_speaker_boost": True,
            },
        )
        # The SDK streams the audio back in chunks; write them straight to disk
        clip_path = output_dir / f"chunk_{i:03d}.mp3"
        with open(clip_path, "wb") as f:
            for part in audio:
                f.write(part)
        clips.append(clip_path)
    return clips
The key setting was stability: 0.6. The default is lower and produces more expressive but less consistent output. For long-form narration of the same person, higher stability is what you want. I also turned style down to 0 because any style transfer at all made the clone sound like it was doing impressions.
Break 2: pauses and pacing were wrong because punctuation is the only instrument I had
ElevenLabs interprets punctuation as pacing signals. A full stop is a slightly longer pause than a comma, a question mark bumps inflection up, an em dash creates a sentence break. But newsletters, especially ones that go through AI-assisted copy editing, tend to have weird punctuation. Sentences that should be two were one. Lists that should have pauses ran together. The output was technically accurate but sounded like someone reading a legal document at a dinner party.
The fix was a pre-processing pass that normalised punctuation specifically for text-to-speech. Before handing a paragraph to ElevenLabs, the script inserts full stops where there are semicolons, breaks sentences over 30 words, and adds explicit short pauses with SSML-style <break time="0.5s" /> tags, which ElevenLabs respects in the V2 multilingual model.
import re

def prepare_for_tts(text: str) -> str:
    # Turn semicolons into full stops for cleaner pacing
    text = text.replace(";", ".")
    # Add a small break after colons
    text = text.replace(":", ': <break time="0.3s" />')
    # Split run-on sentences at conjunctions
    text = re.sub(r", (and|but|so) ", r". \1 ", text)
    # Add a slightly longer pause between paragraphs inside a single chunk
    text = text.replace("\n\n", ' <break time="0.8s" />\n\n')
    return text
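The 30-word sentence split mentioned earlier is not in the snippet above. A sketch of one way to do it, breaking each long sentence at the comma nearest its midpoint (the midpoint heuristic is illustrative, not necessarily what the production script does):

```python
import re

def split_long_sentences(text: str, max_words: int = 30) -> str:
    """Break any sentence longer than max_words at the comma nearest its midpoint."""
    out = []
    for sentence in re.split(r"(?<=[.!?]) ", text):
        if len(sentence.split()) > max_words and "," in sentence:
            commas = [m.start() for m in re.finditer(",", sentence)]
            mid = len(sentence) // 2
            cut = min(commas, key=lambda i: abs(i - mid))
            # Replace the chosen comma with a full stop so the model pauses
            sentence = sentence[:cut] + "." + sentence[cut + 1:]
        out.append(sentence)
    return " ".join(out)
```

Run it before prepare_for_tts(), since the conjunction regex there already handles shorter run-ons.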
This was the point where the output started sounding like a real podcast rather than a Frankenstein speech.
Break 3: stitched clips had audible seams
The chunking-per-paragraph approach produced clips that were individually clean but sounded strange when concatenated. Each clip had a tiny silence at the start and end, and the energy level between clips did not always match. Headphone listeners noticed immediately. The fix was Descript.
I exported the paragraph clips, dragged them into a single Descript project in sequence, used its Studio Sound feature to normalise levels across all of them, and let its auto-fade handle the seam transitions. Descript also has a "remove pauses" feature that took out the extra dead air at the start and end of each clip without needing to manually trim anything. The final export was one continuous audio file that actually sounded like a single recording session.
The full pipeline that works:
- Newsletter draft in markdown
- Python script: chunk at paragraph boundaries, apply prepare_for_tts(), call ElevenLabs with stability: 0.6
- Drop the paragraph clips into a Descript project
- Apply Studio Sound, remove silences, add intro and outro music
- Export as MP3
Total compute time: about 90 seconds for ElevenLabs, 30 seconds for Descript processing. Total human time: about 5 minutes for the Descript cleanup. Cost per episode: roughly 60 pence for a 1,500-word newsletter on the Creator plan.
The workflow works. It is not good enough to fool a careful listener into thinking it is a real recording, because a trained ear picks up the subtle evenness of pace that AI-cloned speech has even after all the fixes above. For most newsletter audiences this is fine. People are not listening critically for authenticity; they are listening while commuting or washing up. If your product depends on listeners believing a recording is a live human in a booth, clone-plus-stitch is not the technology you want. Either record it yourself or wait for the emotion-conditioned models that are not yet available in the ElevenLabs consumer tiers.