The ElevenLabs newsletter voice workflow I had to rebuild twice
Cloning a voice and turning weekly newsletters into podcast episodes is a five-minute setup that took me three weeks to get right. Three specific things that broke, and what finally fixed each one.
- Monthly cost: ~£18 / $23 on the Creator plan
The pitch is irresistible. You clone your own voice once, paste a newsletter into an API call, and get a listenable podcast episode out the other end. Readers who prefer audio get a daily commute listen, you get better engagement metrics, and the whole pipeline is about forty lines of Python.
I built this. It took three weeks, I rebuilt it twice, and the final version looks almost nothing like what I drew on the whiteboard. Here is what went wrong, in order, with what finally fixed each one. None of this is a criticism of ElevenLabs. It is a great product. The failures were mine, but the tutorials do not mention any of them.
The glossy version
Step one: record 30 minutes of clean audio of yourself reading various texts. Upload to ElevenLabs as training data for Instant Voice Cloning. Step two: confirm the clone sounds like you in a 20-second test clip. Step three: write a Python script that takes a newsletter, chunks it into sections, calls client.text_to_speech.convert on each chunk with your voice ID, and concatenates the output into an MP3. Step four: ship it.
I did all four steps in one evening. The result sounded like me. I was convinced the project was done. It was not done.
Break 1: my voice sounded unhinged after the third paragraph
When you feed ElevenLabs a single long chunk of text, the model hits its internal consistency budget and starts drifting. By paragraph three, the pitch was climbing, by paragraph five I sounded vaguely out of breath, and by paragraph seven there was a rising panic in the vowels that I have genuinely never produced in real life. It was subtle enough that I did not catch it in the 20-second test clip but obvious on anything longer than a minute.
The fix was not to use a longer max_length or a better voice. The fix was to chunk the input at paragraph boundaries, call the API once per paragraph, and stitch the clips together in post. Every paragraph starts fresh from the model's perspective, so the drift resets each time. The new loop looked like this:
from elevenlabs.client import ElevenLabs
from pathlib import Path

client = ElevenLabs(api_key="YOUR_KEY")
voice_id = "YOUR_VOICE_ID"

def newsletter_to_audio(text: str, output_dir: Path) -> list[Path]:
    # One API call per paragraph, so the model's drift resets each time
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    clips = []
    for i, paragraph in enumerate(paragraphs):
        audio = client.text_to_speech.convert(
            voice_id=voice_id,
            model_id="eleven_multilingual_v2",
            text=paragraph,
            voice_settings={
                "stability": 0.6,
                "similarity_boost": 0.75,
                "style": 0.0,
                "use_speaker_boost": True,
            },
        )
        # The SDK streams the audio back in chunks; write them straight to disk
        clip_path = output_dir / f"chunk_{i:03d}.mp3"
        with open(clip_path, "wb") as f:
            for part in audio:
                f.write(part)
        clips.append(clip_path)
    return clips
The key setting was stability: 0.6. The default is lower and produces more expressive but less consistent output. For long-form narration of the same person, higher stability is what you want. I also turned style down to 0 because any style transfer at all made the clone sound like it was doing impressions.
Break 2: pauses and pacing were wrong because punctuation is the only instrument I had
ElevenLabs interprets punctuation as pacing signals. A full stop is a slightly longer pause than a comma, a question mark bumps inflection up, an em dash creates a sentence break. But newsletters, especially ones that go through AI-assisted copy editing, tend to have weird punctuation. Sentences that should be two were one. Lists that should have pauses ran together. The output was technically accurate but sounded like someone reading a legal document at a dinner party.
The fix was a pre-processing pass that normalised punctuation specifically for text-to-speech. Before handing a paragraph to ElevenLabs, the script inserts full stops where there are semicolons, breaks sentences over 30 words, and adds explicit short pauses with SSML-style <break time="0.5s" /> tags, which ElevenLabs respects in the V2 multilingual model.
import re

def prepare_for_tts(text: str) -> str:
    # Turn semicolons into full stops for cleaner pacing
    text = text.replace(";", ".")
    # Add a small break after colons
    text = text.replace(":", ': <break time="0.3s" />')
    # Split run-on sentences at conjunctions
    text = re.sub(r", (and|but|so) ", r". \1 ", text)
    # Add a slightly longer pause between paragraphs inside a single chunk
    text = text.replace("\n\n", ' <break time="0.8s" />\n\n')
    return text
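The 30-word sentence split mentioned earlier is not in the snippet above. A sketch of one way to do it, breaking each long sentence at the comma nearest its midpoint (the midpoint heuristic is illustrative, not necessarily what the production script does):

```python
import re

def split_long_sentences(text: str, max_words: int = 30) -> str:
    """Break any sentence longer than max_words at the comma nearest its midpoint."""
    out = []
    for sentence in re.split(r"(?<=[.!?]) ", text):
        if len(sentence.split()) > max_words and "," in sentence:
            commas = [m.start() for m in re.finditer(",", sentence)]
            mid = len(sentence) // 2
            cut = min(commas, key=lambda i: abs(i - mid))
            # Replace the chosen comma with a full stop so the model pauses
            sentence = sentence[:cut] + "." + sentence[cut + 1:]
        out.append(sentence)
    return " ".join(out)
```

Run it before prepare_for_tts(), since the conjunction regex there already handles shorter run-ons.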
This was the point where the output started sounding like a real podcast rather than a Frankenstein speech.
Break 3: stitched clips had audible seams
The chunking-per-paragraph approach produced clips that were individually clean but sounded strange when concatenated. Each clip had a tiny silence at the start and end, and the energy level between clips did not always match. Headphone listeners noticed immediately. The fix was Descript.
I exported the paragraph clips, dragged them into a single Descript project in sequence, used its Studio Sound feature to normalise levels across all of them, and let its auto-fade handle the seam transitions. Descript also has a "remove pauses" feature that took out the extra dead air at the start and end of each clip without needing to manually trim anything. The final export was one continuous audio file that actually sounded like a single recording session.
The full pipeline that works:
- Newsletter draft in markdown
- Python script: chunk at paragraph boundaries, apply prepare_for_tts(), call ElevenLabs with stability: 0.6
- Drop the paragraph clips into a Descript project
- Apply Studio Sound, remove silences, add intro and outro music
- Export as MP3
Total compute time: about 90 seconds for ElevenLabs, 30 seconds for Descript processing. Total human time: about 5 minutes for the Descript cleanup. Cost per episode: roughly 60 pence for a 1,500-word newsletter on the Creator plan.
The workflow works. It is not good enough to fool a careful listener into thinking it is a real recording, because a trained ear picks up the subtle evenness of pace that AI-cloned speech has even after all the fixes above. For most newsletter audiences this is fine. People are not listening critically for authenticity; they are listening while commuting or washing up. If your product depends on listeners believing a recording is a live human in a booth, clone-plus-stitch is not the technology you want. Either record it yourself or wait for the emotion-conditioned models that are not yet available in the ElevenLabs consumer tiers.