How I benchmark transcription models in Thoth

2026-05-10

I'm a laser physicist and R&D engineer by training. Call it déformation professionnelle: I can't ship a transcription app and just trust the numbers on the box. When you spend your days designing experiments with proper controls and stress cases, you end up doing the same thing to your own software.

So before recommending any model, I ran my own benchmark.

Why benchmark?

Every transcription model ships with a word error rate (WER) measured on some clean studio dataset. Real meetings are messier: accented speakers, background noise, domain vocabulary. I wanted numbers I could trust.

The setup

Three scripts, each covering a different scenario:

Script A: English with a French accent, casual meeting cadence
Script B: Native French speaker, code-switching between languages
Script C: Clean audio (Simon Sinek TED talk), re-transcription only

I recorded A and B myself on a MacBook Pro M2 internal microphone, at normal speaking distance, with the TV running at low volume in the background. Thoth uses Apple's VoiceProcessingIO for mic capture, which includes automatic gain control and noise suppression, the same stack FaceTime and most Mac apps use. None of the models picked up the TV audio, which is a good sign for real-world use. For C, I used the official TED human-reviewed subtitles as reference.

The scripts were deliberately stressful. Scripts A and B included dense technical vocabulary (CoreML, diarization, quantization), a wall of acronyms (GPU, API, WER, SDK...), proper nouns from French institutions, numbers, prices, and identifiers. These are exactly the categories where transcription models fall apart. If your meetings are mostly conversational English with common vocabulary, you will do noticeably better than what you see here.

WER = (substitutions + deletions + insertions) / reference word count, computed via edit distance.
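A minimal Python sketch of that formula, for readers who want to check their own transcripts. This is not the benchmark code used for this post: the function name `word_error_rate` is made up for illustration, and the normalization (lowercasing, splitting on whitespace) is an assumption. Real scoring pipelines usually also normalize punctuation and number formats, which changes the result.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercasing and whitespace splitting are the only normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count against a fixed reference length, WER computed this way can exceed 100% when the output contains many extra words.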

Results

Re-transcription

Model	WER, EN accented (Script A)	WER, FR native (Script B)	WER, EN clean (Script C)
Whisper Base	51.1%	54.3%	9.2%
Whisper Small	45.9%	47.7%	9.2%
Whisper Medium	39.7%	45.3%	36.8%
Whisper Large V3 Turbo	32.3%	38.3%	7.8%
Parakeet TDT v3	38.9%	48.7%	8.7%

Live transcription (Script A)

Engine	WER	Latency
WhisperKit Base+Small	65.9%	~12 s
Parakeet Sliding Window	56.8%	~11 s
Parakeet EOU 120M	38.4%	~160 ms

What I found

Whisper Large V3 Turbo came out on top across all three scripts: 32.3% on French-accented English, 38.3% on native French, 7.8% on the TED talk. If accuracy is what you're after, it's the clear pick.

Parakeet TDT v3 is close on clean audio (8.7%, nearly matching Large's 7.8%) but dropped to 38.9% on accented English and 48.7% on the French script, where it started code-switching to English mid-recording. It covers 25 languages, but heavy accent scenarios aren't its strong suit. What I later found is that this code-switching isn't random noise: Parakeet is actually translating the audio to English. Read the investigation.

Whisper Medium was the real surprise. It sits between Small and Large in size, so you'd expect it to land somewhere in the middle, and on Scripts A and B it does. On the clean TED audio, though, it posted 36.8% WER where Large got 7.8%. I traced it to a silent-skip issue in the CoreML conversion: 245 deletions, near-zero insertions. The model drops entire sections without any sign that it has done so. I flag this in Settings.
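Splitting the error count by type is what makes a pattern like this visible. The sketch below is an illustration in Python, not the script used for this benchmark: `ErrorCounts` and `count_errors` are hypothetical names, and inputs are assumed to be already tokenized into words. It fills the full edit-distance table, then walks back along one optimal alignment and tallies each kind of error.

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int


def count_errors(reference: list[str], hypothesis: list[str]) -> ErrorCounts:
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner along one optimal alignment.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # words match, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ErrorCounts(subs, dels, ins)


if __name__ == "__main__":
    ref = "we ship the model on device today".split()
    hyp = "we ship today".split()
    print(count_errors(ref, hyp))
```

A transcript that silently skips a passage shows up as a run of deletions with few substitutions or insertions. Several alignments can share the same minimal distance, so the split between categories may shift slightly depending on tie-breaking, but the total error count does not.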

Parakeet EOU 120M (the live streaming engine) got 38.4% on Script A, well ahead of the other two live engines (56.8% and 65.9%). That still looks rough on paper, but it's producing word-by-word output at around 160 ms latency, so comparing it to re-transcription models isn't really fair.

Takeaway

The LibriSpeech numbers in Settings are directionally right, but all models do 10-30x worse on accented or foreign-language speech. Clean native audio gets you close to the published benchmarks. Accented speech doesn't. If accuracy matters, use Whisper Large V3 Turbo.

The full numbers are on the transcription accuracy page.

Thoth is a private meeting recorder for Mac. All transcription runs on your device. Built by one person, no funding, no team. If you find it useful, upgrading to Pro is the best way to support development.
