Alchemy Recipe · Intermediate · Workflow

Whisper API vs Deepgram for podcast transcription: when the cheaper option is actually wrong

The benchmark tables miss two things that matter for podcast work: speaker diarisation and the 25 MB upload cap. Here is what each actually looks like in code, and when to pick which.

Time saved: 1-2 hrs per episode
Monthly cost: ~£8-15 ($10-19)

A 90-minute podcast with two hosts and a guest. You want a clean transcript, speaker labels, timestamped quotes for the show notes, and a few shareable clips. The first call you make is picking a transcription provider, and this is where most guides give you a benchmark table and tell you Whisper wins on accuracy at $0.006 per minute versus Deepgram's $0.0043. If you stop reading at the price and the WER chart, you will spend the next afternoon stitching together a workaround.

Here is the thing the benchmark tables don't tell you: the Whisper API does not do speaker diarisation. There is no parameter for it. The response comes back as one continuous transcript with no indication of who said what. For a three-person interview, that means post-processing the output through a separate diarisation model like pyannote and aligning the speaker spans against Whisper's segment boundaries, which don't line up with speaker change points. That's two APIs, two sets of credentials, and an audio alignment step that will eat an hour of your Saturday.

Deepgram handles diarisation in the same request. You add diarize=true to the call and the response contains speaker_0, speaker_1, speaker_2 labels on every word. For podcast work, this is not a minor convenience. It decides whether the whole workflow is one API call or four.

The second gotcha is file size. The Whisper API caps uploads at 25 MB. A 90-minute podcast exported as 128 kbps MP3 is around 86 MB. You can compress it down to 64 kbps to fit, which loses noticeable clarity on any speech with background noise, or chunk the file into four pieces and stitch the transcripts back together afterwards. Deepgram accepts files up to 2 GB on standard endpoints and will also take a URL if your audio lives on S3, R2, or a CDN.
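Before touching either API, it's worth a quick back-of-envelope check on whether your export fits the cap at all. This is a sketch using nothing but arithmetic: the 25 MB figure is Whisper's documented limit, and the size estimate is simply bitrate times duration.

```python
import math

WHISPER_CAP_BYTES = 25 * 1000 * 1000  # Whisper's documented 25 MB upload cap

def estimated_size_bytes(duration_s: float, bitrate_kbps: int) -> int:
    # MP3 size is roughly bitrate * duration; kbps means 1000 bits, not 1024
    return int(bitrate_kbps * 1000 / 8 * duration_s)

def chunks_needed(duration_s: float, bitrate_kbps: int) -> int:
    return math.ceil(estimated_size_bytes(duration_s, bitrate_kbps) / WHISPER_CAP_BYTES)

# A 90-minute episode at 128 kbps: ~86 MB, so four uploads instead of one
print(chunks_needed(90 * 60, 128))  # → 4
```

The same arithmetic tells you 32 kbps mono is the highest common bitrate at which 90 minutes squeezes under the cap in one file.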

What each actually looks like in code

Deepgram via the Python SDK, the full podcast workflow in one request:

from deepgram import DeepgramClient, PrerecordedOptions

dg = DeepgramClient(api_key="YOUR_KEY")

options = PrerecordedOptions(
    model="nova-3",
    smart_format=True,
    diarize=True,
    punctuate=True,
    paragraphs=True,
    utterances=True,
    summarize="v2",
)

with open("episode-42.mp3", "rb") as audio:
    source = {"buffer": audio.read()}  # FileSource payload: raw bytes

response = dg.listen.rest.v("1").transcribe_file(source, options)

for utterance in response.results.utterances:
    speaker = utterance.speaker
    text = utterance.transcript
    start = utterance.start
    print(f"[{start:.1f}s] Speaker {speaker}: {text}")

That gives you a structured, speaker-labelled transcript with timestamps, plus a results.summary.short field containing a summary generated server-side. One call, one bill, one response.

Whisper via the OpenAI Python library, and then what you have to do next:

from openai import OpenAI
client = OpenAI(api_key="YOUR_KEY")

with open("episode-42.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )

for segment in transcript.segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

You get accurate text with timestamps, no speaker labels, and an HTTP 413 if your file is over 25 MB. To recover diarisation, you run the same audio through pyannote or a managed alternative and then reconcile the speaker spans against Whisper's segments. This is a genuinely annoying job because Whisper tends to break segments at pauses within a single speaker's turn, not at speaker boundaries, so you end up with half-speaker segments that need to be re-split.
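To make the reconciliation step concrete, here is a minimal sketch of the usual first pass: reduce the diarisation output to (start, end, speaker) spans and give each Whisper segment the speaker it overlaps most. The tuple shapes are assumptions for illustration, not pyannote's or Whisper's actual return types.

```python
def assign_speakers(segments, turns):
    # segments: Whisper-style (start, end, text) tuples
    # turns: diarisation-style (start, end, speaker) tuples
    labelled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Length of the intersection of the two time spans
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append((best, seg_start, text))
    return labelled
```

Note what this doesn't do: a segment that straddles a speaker change still gets a single label, so those segments need to be re-split at the turn boundary afterwards, and that is the part that eats the hour.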

When Whisper is actually the better pick

This isn't a hit piece on Whisper. For single-speaker audio, it is still the cleaner choice. A recorded lecture, an audiobook draft, a solo podcast, a voice memo: on this kind of material, Whisper's transcription quality is excellent and often edges out Deepgram Nova-3 on academic benchmarks. It also handles mid-sentence language switching better in testing with bilingual clips, which matters if you record in multiple languages.

The 25 MB cap hurts less for solo content: a 90-minute lecture encoded at 32 kbps mono still sounds fine for voice and comes in around 22 MB, just under the limit.

When to pick which

| Your audio | Pick |
| --- | --- |
| Solo podcast, lecture, audiobook | Whisper |
| Interview with 2+ speakers | Deepgram |
| Anything over 25 MB you don't want to chunk | Deepgram |
| Bilingual or code-switching content | Whisper |
| You need topic or summary fields in the same response | Deepgram |
| Maximum word-level accuracy on a 5-min clean voice memo | Whisper |

Both providers ship model updates regularly: Deepgram Nova-3 launched in early 2025, and the accuracy gap has narrowed considerably since the benchmarks most blog posts still cite. Run both against ten minutes of your actual audio before committing. Once you have a transcript, the next step is usually running it through Claude for show notes, chapter markers, and clip selection.
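For that ten-minute bake-off, you need a number, and word error rate is the standard one. A minimal sketch, assuming you hand-correct a reference transcript of the clip first: plain Levenshtein distance over words, divided by the reference length.

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: (substitutions + insertions + deletions) / reference words
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # One-row dynamic-programming edit distance; every operation costs 1
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Run each provider's output through this against the same reference and the lower score wins; anything under a point or two of difference is noise, and the diarisation and file-size constraints above should decide instead.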
