Transcription Accuracy: How Thoth Models Compare
Studio benchmarks are measured on clean, native-English audio. Real meetings have accented speakers, background noise, and domain vocabulary. We tested on all three. The numbers below are from actual recordings, not marketing copy.
Re-transcription · 42-min recording, M2 MacBook Pro
| Model | Speed (× real-time) | WER · accented EN | WER · clean EN | Languages |
|---|---|---|---|---|
| Whisper Large V3 Turbo | 12.7× | 32.3% | 7.8% | 99 |
| Parakeet TDT V3 Pro | 180× | 38.9% | 8.7% | 25 |
| Whisper Small | 17.7× | 45.9% | 9.2% | 99 |
| Whisper Base | 59.6× | 51.1% | 9.2% | 99 |
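For context on the WER columns: word error rate is word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch in Python; the exact normalisation behind these tables (casing, punctuation, numerals) isn't shown, and real scoring pipelines normalise both transcripts before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over six reference words: ~16.7% WER.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```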
Live transcription · French-accented English
| Engine | WER | Latency |
|---|---|---|
| Parakeet EOU 120M Pro | 38.4% | ~160 ms |
| Parakeet Sliding Window Pro | 56.8% | ~11 s |
| WhisperKit Base+Small | 65.9% | ~12 s |
Large V3 Turbo wins on accuracy. It posts the best WER in both conditions: 32.3% on French-accented English, 7.8% on clean audio. If the transcript needs to be right, this is the one.
Parakeet TDT is 14× faster on the same file, with near-identical WER on clean speech (8.7% vs 7.8%). It falls behind on accented speech and code-switching. Worth it when speed matters and the audio is clean; the arithmetic below shows what the multipliers mean in wall-clock terms.
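As a sanity check on that 14× figure, and assuming the Speed column is a real-time factor (audio duration divided by processing time):

```python
# Convert the real-time-factor multipliers from the table above
# into wall-clock time for the 42-minute test recording.
audio_s = 42 * 60

for model, rtf in [("Whisper Large V3 Turbo", 12.7), ("Parakeet TDT V3 Pro", 180)]:
    print(f"{model}: {audio_s / rtf:.0f} s")

# Whisper Large V3 Turbo: 198 s (~3.3 min)
# Parakeet TDT V3 Pro: 14 s
# 180 / 12.7 ≈ 14.2, hence "14× faster" on the same file.
```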
Parakeet EOU is a different category. Word-by-word output at ~160 ms latency. Comparing its 38.4% WER to batch models isn't fair: it's a streaming engine optimised for real-time, not accuracy.
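To make the category difference concrete, here is a minimal sketch of the loop a word-by-word streaming engine runs. `StreamingASR`, `feed`, and the 160 ms chunk size are hypothetical stand-ins, not the actual Parakeet EOU API:

```python
import time
from typing import Iterable, Iterator

CHUNK_MS = 160  # hypothetical chunk size, matching the ~160 ms latency budget

class StreamingASR:
    """Hypothetical stand-in for a word-by-word engine like Parakeet EOU.
    A real engine runs an incremental decoder in feed(); this stub only
    shows the shape of the interface."""

    def feed(self, chunk: bytes) -> list[str]:
        # Decode incrementally; return the words finalised by this chunk.
        return []

def transcribe_live(chunks: Iterable[bytes], asr: StreamingASR) -> Iterator[str]:
    for chunk in chunks:  # one chunk arrives every CHUNK_MS
        arrived = time.monotonic()
        for word in asr.feed(chunk):
            decode_ms = (time.monotonic() - arrived) * 1000
            # Per-word latency is bounded by chunk duration + decode time.
            yield f"{word} (~{CHUNK_MS + decode_ms:.0f} ms)"
```

A sliding-window engine runs the same outer loop but buffers a multi-second window before each decode pass, so its latency floor is the window length; that is roughly where the ~11-12 s figures in the table come from.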
Published benchmarks are optimistic. Every model here ran 10-30× worse on accented or foreign-language speech than its studio numbers suggest. Real meetings are harder than LibriSpeech.