Thoth
Support Roadmap Blog Alternatives Use Cases Other Apps Mac App Store

How I benchmark transcription models in Thoth

2026-05-10

I'm a laser physicist and R&D engineer by training. Call it déformation professionnelle: I can't ship a transcription app and just trust the numbers on the box. When you spend your days designing experiments with proper controls and stress cases, you end up doing the same thing to your own software.

So before recommending any model, I ran my own benchmark.

Why benchmark?

Every transcription model ships with a WER number from some clean studio dataset. Real meetings are messier: accented speakers, background noise, domain vocabulary. I wanted numbers I could trust.

The setup

Three scripts, each stressing a different scenario:

  • Script A: English with a French accent, casual meeting cadence
  • Script B: Native French speaker, code-switching between languages
  • Script C: Clean audio (Simon Sinek TED talk), re-transcription only

I recorded A and B myself on a MacBook Pro M2 internal microphone, at normal speaking distance, with the TV running at low volume in the background. Thoth uses Apple's VoiceProcessingIO for mic capture, which includes automatic gain control and noise suppression, the same stack FaceTime and most Mac apps use. None of the models picked up the TV audio, which is a good sign for real-world use. For C, I used the official TED human-reviewed subtitles as reference.

The scripts were deliberately stressful. Scripts A and B included dense technical vocabulary (CoreML, diarization, quantization), a wall of acronyms (GPU, API, WER, SDK...), proper nouns from French institutions, numbers, prices, and identifiers. These are exactly the categories where transcription models fall apart. If your meetings are mostly conversational English with common vocabulary, you will do noticeably better than what you see here.

WER = (substitutions + deletions + insertions) / reference word count, computed via edit distance.
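For readers who want to reproduce the metric, here is a minimal sketch of that formula in Python. The normalization step (lowercasing and splitting on whitespace) is my assumption for illustration; the function names are hypothetical and this is not the exact scoring script used for the numbers in this post.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = (S + D + I) / number of reference words."""
    # Assumed normalization: lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count against the reference length, WER can exceed 100% when a model hallucinates a lot of extra text.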

Results

Re-transcription

Model                    EN accented (A)   FR native (B)   EN clean (C)
Whisper Base             51.1%             54.3%           9.2%
Whisper Small            45.9%             47.7%           9.2%
Whisper Medium           39.7%             45.3%           36.8%
Whisper Large V3 Turbo   32.3%             38.3%           7.8%
Parakeet TDT v3          38.9%             48.7%           8.7%
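
To make the gap between clean and stressed audio easier to see, the snippet below divides each model's WER on a stress script by its WER on the clean TED talk. The figures are copied from the table above; the helper name and dictionary keys are hypothetical. Note that this compares against my clean TED column, not against published LibriSpeech numbers.

```python
# WER percentages copied from the re-transcription table.
RESULTS = {
    "Whisper Base":           {"en_accented": 51.1, "fr_native": 54.3, "en_clean": 9.2},
    "Whisper Small":          {"en_accented": 45.9, "fr_native": 47.7, "en_clean": 9.2},
    "Whisper Medium":         {"en_accented": 39.7, "fr_native": 45.3, "en_clean": 36.8},
    "Whisper Large V3 Turbo": {"en_accented": 32.3, "fr_native": 38.3, "en_clean": 7.8},
    "Parakeet TDT v3":        {"en_accented": 38.9, "fr_native": 48.7, "en_clean": 8.7},
}


def degradation(results, column, baseline="en_clean"):
    """Ratio of WER on `column` to WER on `baseline`, per model."""
    return {
        model: round(row[column] / row[baseline], 1)
        for model, row in results.items()
    }


if __name__ == "__main__":
    for model, ratio in degradation(RESULTS, "en_accented").items():
        print(f"{model}: {ratio}x worse on accented English than on clean audio")
```

Whisper Medium's ratio comes out close to 1, but only because its clean-audio score is itself inflated, so read that row with care.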

Live transcription (Script A)

Engine                    WER     Latency
WhisperKit Base+Small     65.9%   ~12 s
Parakeet Sliding Window   56.8%   ~11 s
Parakeet EOU 120M         38.4%   ~160 ms

What I found

Whisper Large V3 Turbo came out on top across all three scripts: 32.3% WER on French-accented English, 38.3% on native French, 7.8% on the TED talk. If accuracy is what you're after, it's the clear pick.

Parakeet TDT v3 is close on clean audio (8.7%, nearly matching Large V3 Turbo's 7.8%) but lost ground under accent (38.9% against 32.3%) and fell apart on the French script (48.7%), where it started switching to English mid-recording. It covers 25 languages, but heavy accent scenarios aren't its strong suit. I later found that this switching isn't random noise: Parakeet is actually translating the audio into English. Read the investigation.

Whisper Medium was the real surprise. It sits between Small and Large in size, so you'd expect it to land somewhere in the middle. Instead it posted 36.8% WER on the clean TED audio, where Large V3 Turbo got 7.8% and even Base got 9.2%. I traced it to a silent-skip issue in the CoreML conversion: 245 deletions, near-zero insertions. The model drops entire sections without any sign that it has done so. I flag this in Settings.
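
A deletion-heavy error profile like that only shows up if you split the edit distance into its three components. The sketch below does this by backtracing through the edit-distance matrix. It is an illustration with hypothetical names, not the script behind the 245 figure, and when several alignments tie it prefers substitutions, so the exact split can vary between tools even though the total is the same.

```python
def error_breakdown(ref, hyp):
    """Count substitutions, deletions and insertions between two word lists."""
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution

    counts = {"sub": 0, "del": 0, "ins": 0}
    i, j = n, m
    while i > 0 or j > 0:  # walk back from the bottom-right corner
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            i, j = i - 1, j - 1  # words match, no error
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            counts["sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            counts["del"] += 1
            i -= 1
        else:
            counts["ins"] += 1
            j -= 1
    return counts


if __name__ == "__main__":
    reference = "one two three four five six".split()
    hypothesis = "one two six".split()  # middle of the passage silently dropped
    print(error_breakdown(reference, hypothesis))
```

A skipped passage shows up as a long run of deletions with few substitutions or insertions, which is the pattern described above.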

Parakeet EOU 120M (the live streaming engine) got 38.4% on Script A. That looks rough on paper, but it's producing word-by-word output at around 160 ms latency, against 11 to 12 s for the other two live engines, so comparing it to re-transcription models isn't really fair.

Takeaway

The LibriSpeech numbers in Settings are directionally right, but all models do 10-30x worse on accented or foreign-language speech. Clean native audio gets you close to the published benchmarks. Accented speech doesn't. If accuracy matters, use Whisper Large V3 Turbo.

The full numbers are on the transcription accuracy page.


Thoth is a private meeting recorder for Mac. All transcription runs on your device. Built by one person, no funding, no team. If you find it useful, upgrading to Pro is the best way to support development.
