# Whisper API vs Deepgram for podcast transcription: when the cheaper option is actually wrong
The benchmark tables miss two things that matter for podcast work: speaker diarisation and the 25 MB upload cap. Here is what each actually looks like in code, and when to pick which.
- Time saved: 1-2 hrs per episode
- Monthly cost: ~£8-15 / $10-19/mo
A 90-minute podcast with two hosts and a guest. You want a clean transcript, speaker labels, timestamped quotes for the show notes, and a few shareable clips. The first call you make is picking a transcription provider, and this is where most guides give you a benchmark table and tell you Whisper wins on accuracy at $0.006 per minute versus Deepgram's $0.0043. If you stop reading at the price and the WER chart, you will spend the next afternoon stitching together a workaround.
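At those list prices the raw transcription bill is small either way; a quick sketch of the arithmetic (rates as quoted above, treated as assumptions since both providers reprice periodically):

```python
WHISPER_PER_MIN = 0.006    # USD/min, whisper-1 list price as quoted above
DEEPGRAM_PER_MIN = 0.0043  # USD/min, Deepgram Nova list price as quoted above

def transcription_cost(minutes: float, rate_per_min: float) -> float:
    """Raw per-episode transcription cost in USD."""
    return round(minutes * rate_per_min, 4)

transcription_cost(90, WHISPER_PER_MIN)   # 0.54
transcription_cost(90, DEEPGRAM_PER_MIN)  # 0.387
```

A weekly 90-minute show costs pennies per episode on either API, which is why the engineering cost of the workaround, not the per-minute rate, should drive the choice.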
Here is the thing the benchmark tables don't tell you: the Whisper API does not do speaker diarisation. There is no parameter for it. The response comes back as one continuous transcript with no indication of who said what. For a three-person interview, that means post-processing the output through a separate diarisation model like pyannote and aligning the speaker spans against Whisper's segment boundaries, which don't line up with speaker change points. That's two APIs, two sets of credentials, and an audio alignment step that will eat an hour of your Saturday.
Deepgram handles diarisation in the same request. You add diarize=true to the call and the response contains speaker_0, speaker_1, speaker_2 labels on every word. For podcast work, this is not a minor convenience. It decides whether the whole workflow is one API call or four.
The second gotcha is file size. The Whisper API caps uploads at 25 MB. A 90-minute podcast exported at 128 kbps MP3 is around 84 MB. Compressing it to fit means dropping to roughly 32 kbps mono, since 25 MB spread over 90 minutes works out to about 38 kbps, and that loses noticeable clarity on any speech with background noise. The alternative is chunking the file into pieces and stitching the transcripts back together afterwards. Deepgram accepts files up to 2 GB on standard endpoints and will also take a URL if your audio lives on S3, R2, or a CDN.
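The arithmetic is worth a helper, because the cap is in bytes while export settings are quoted in kilobits per second. A minimal sketch (the 25 MB limit comes from the API docs; the function names and the binary-megabyte interpretation of the cap are my assumptions):

```python
WHISPER_LIMIT_BYTES = 25 * 1024 * 1024  # 25 MB upload cap

def estimated_size_bytes(duration_s: float, bitrate_kbps: float) -> float:
    """Approximate encoded size: kilobits/sec * seconds / 8 bits per byte."""
    return duration_s * bitrate_kbps * 1000 / 8

def fits_whisper_limit(duration_s: float, bitrate_kbps: float) -> bool:
    return estimated_size_bytes(duration_s, bitrate_kbps) <= WHISPER_LIMIT_BYTES

fits_whisper_limit(90 * 60, 128)  # False: ~86 MB
fits_whisper_limit(90 * 60, 64)   # False: ~43 MB, still over the cap
fits_whisper_limit(90 * 60, 32)   # True:  ~21.6 MB
```

Running your planned export settings through a check like this before you commit to a provider is cheaper than discovering the 413 after the upload.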
## What each actually looks like in code
Deepgram via the Python SDK, the full podcast workflow in one request:
```python
from deepgram import DeepgramClient, PrerecordedOptions

dg = DeepgramClient(api_key="YOUR_KEY")

options = PrerecordedOptions(
    model="nova-3",
    smart_format=True,
    diarize=True,
    punctuate=True,
    paragraphs=True,
    utterances=True,
    summarize="v2",
)

with open("episode-42.mp3", "rb") as audio:
    source = {"buffer": audio, "mimetype": "audio/mpeg"}
    response = dg.listen.rest.v("1").transcribe_file(source, options)

for utterance in response.results.utterances:
    speaker = utterance.speaker
    text = utterance.transcript
    start = utterance.start
    print(f"[{start:.1f}s] Speaker {speaker}: {text}")
```
That gives you a structured, speaker-labelled transcript with timestamps, plus a results.summary.short field containing a short summary generated server-side. One call, one bill, one response.
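Turning that utterance list into show notes is then a small formatting step. A sketch, assuming utterances as plain dicts rather than SDK response objects (the keys mirror the attribute names used above; `format_show_notes` and the `speaker_names` mapping are mine):

```python
def format_show_notes(utterances, speaker_names=None):
    """Render diarised utterances as speaker-labelled, timestamped lines.

    utterances: iterable of {"speaker": int, "start": float, "transcript": str}
    speaker_names: optional {speaker_id: display_name} mapping
    """
    names = speaker_names or {}
    lines = []
    for u in utterances:
        label = names.get(u["speaker"], f"Speaker {u['speaker']}")
        mins, secs = divmod(int(u["start"]), 60)
        lines.append(f"[{mins:02d}:{secs:02d}] {label}: {u['transcript']}")
    return "\n".join(lines)

print(format_show_notes(
    [
        {"speaker": 0, "start": 3.2, "transcript": "Welcome back to the show."},
        {"speaker": 1, "start": 7.9, "transcript": "Great to be here."},
    ],
    speaker_names={0: "Host"},
))
# [00:03] Host: Welcome back to the show.
# [00:07] Speaker 1: Great to be here.
```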
Whisper via the OpenAI Python library, and then what you have to do next:
```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY")

with open("episode-42.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```
You get accurate text with timestamps, no speaker labels, and an HTTP 413 if your file is over 25 MB. To recover diarisation, you run the same audio through pyannote or a managed alternative and then reconcile the speaker spans against Whisper's segments. This is a genuinely annoying job because Whisper tends to break segments at pauses within a single speaker's turn, not at speaker boundaries, so you end up with half-speaker segments that need to be re-split.
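The core of the reconciliation is majority-overlap assignment: for each Whisper segment, pick the diarisation turn it overlaps most. A minimal sketch, assuming you have already run a diariser and hold its output as `(start, end, speaker)` tuples (the function name is mine, and the re-splitting of half-speaker segments is deliberately left out):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose diarisation
    turn overlaps it most.

    segments: list of (start, end, text) from Whisper's verbose_json output
    turns:    list of (start, end, speaker) from a diariser such as pyannote
    Returns:  list of (start, end, speaker_or_None, text)
    """
    labelled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((seg_start, seg_end, best_speaker, text))
    return labelled

assign_speakers(
    [(0.0, 4.0, "Hi, welcome."), (4.0, 9.0, "Thanks for having me.")],
    [(0.0, 3.8, "SPEAKER_00"), (3.8, 9.5, "SPEAKER_01")],
)
# [(0.0, 4.0, 'SPEAKER_00', 'Hi, welcome.'),
#  (4.0, 9.0, 'SPEAKER_01', 'Thanks for having me.')]
```

Majority overlap is the easy 90%; a segment that straddles a speaker change still gets one label, which is exactly the half-speaker problem described above.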
## When Whisper is actually the better pick
This isn't a hit piece on Whisper. For single-speaker audio, it is still the cleaner choice. For a recorded lecture, an audiobook draft, a solo podcast, or a voice memo, Whisper's transcription quality is excellent and often edges out Deepgram Nova-3 on academic benchmarks. It also handled mid-sentence language switching better in informal testing with bilingual clips, which matters if you record in multiple languages.
The 25 MB cap also hurts less for solo content, because a clean single voice tolerates low bitrates: a 90-minute lecture at 32 kbps mono sounds fine for speech and comes in around 21 MB, under the limit. At 48 kbps it would already be over, at roughly 32 MB.
## When to pick which
| Your audio | Pick |
|---|---|
| Solo podcast, lecture, audiobook | Whisper |
| Interview with 2+ speakers | Deepgram |
| Anything over 25 MB you don't want to chunk | Deepgram |
| Bilingual or code-switching content | Whisper |
| You need topic or summary fields in the same response | Deepgram |
| Maximum word-level accuracy on a 5-min clean voice memo | Whisper |
Both providers revise their models and pricing regularly. Deepgram Nova-3 launched in early 2025, and the accuracy gap has narrowed considerably since the benchmarks most blog posts still cite. Run both against ten minutes of your actual audio before committing. Once you have a transcript, the next step is usually running it through Claude for show notes, chapter markers, and clip selection.