Transcription Accuracy: How Thoth Models Compare
Studio benchmarks are measured on clean, native-English audio. Real meetings have accented speakers, background noise, and domain vocabulary. We tested on all three. The numbers below are from actual recordings, not marketing copy.
Re-transcription · 42-min recording, M2 MacBook Pro
| Model | Speed (× real-time) | WER · accented EN | WER · clean EN | Languages |
|---|---|---|---|---|
| Whisper Large V3 Turbo | 12.7× | 32.3% | 7.8% | 99 |
| Parakeet TDT V3 Pro | 180× | 38.9% | 8.7% | 25 |
| Whisper Small | 17.7× | 45.9% | 9.2% | 99 |
| Whisper Base | 59.6× | 51.1% | 9.2% | 99 |
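For context on the WER columns: word error rate is word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch in Python; the exact normalisation behind these tables (casing, punctuation, numerals) isn't shown, and real scoring pipelines normalise both transcripts before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # substitution
                    dp[i - 1][j],      # deletion
                    dp[i][j - 1],      # insertion
                )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over six reference words: ~16.7% WER.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```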
Live transcription · French-accented English
| Engine | WER | Latency |
|---|---|---|
| Parakeet EOU 120M Pro | 38.4% | ~160 ms |
| Parakeet Sliding Window Pro | 56.8% | ~11 s |
| WhisperKit Base+Small | 65.9% | ~12 s |
Large V3 Turbo wins on accuracy. It posts the best WER in both conditions: 32.3% on French-accented English, 7.8% on clean audio. If the transcript needs to be right, this is the one.
Parakeet TDT is 14× faster on the same file, with near-identical WER on clean speech (8.7% vs 7.8%). It falls behind on accented speech and code-switching. Worth it when speed matters and the audio is clean; the arithmetic below shows what the multipliers mean in wall-clock terms.
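As a sanity check on that 14× figure, and assuming the Speed column is a real-time factor (audio duration divided by processing time):

```python
# Convert the real-time-factor multipliers from the table above
# into wall-clock time for the 42-minute test recording.
audio_s = 42 * 60

for model, rtf in [("Whisper Large V3 Turbo", 12.7), ("Parakeet TDT V3 Pro", 180)]:
    print(f"{model}: {audio_s / rtf:.0f} s")

# Whisper Large V3 Turbo: 198 s (~3.3 min)
# Parakeet TDT V3 Pro: 14 s
# 180 / 12.7 ≈ 14.2, hence "14× faster" on the same file.
```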
Parakeet EOU is a different category. Word-by-word output at ~160 ms latency. Comparing its 38.4% WER to batch models isn't fair: it's a streaming engine optimised for real-time, not accuracy.
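To make the category difference concrete, here is a minimal sketch of the loop a word-by-word streaming engine runs. `StreamingASR`, `feed`, and the 160 ms chunk size are hypothetical stand-ins, not the actual Parakeet EOU API:

```python
import time
from typing import Iterable, Iterator

CHUNK_MS = 160  # hypothetical chunk size, matching the ~160 ms latency budget

class StreamingASR:
    """Hypothetical stand-in for a word-by-word engine like Parakeet EOU.
    A real engine runs an incremental decoder in feed(); this stub only
    shows the shape of the interface."""

    def feed(self, chunk: bytes) -> list[str]:
        # Decode incrementally; return the words finalised by this chunk.
        return []

def transcribe_live(chunks: Iterable[bytes], asr: StreamingASR) -> Iterator[str]:
    for chunk in chunks:  # one chunk arrives every CHUNK_MS
        arrived = time.monotonic()
        for word in asr.feed(chunk):
            decode_ms = (time.monotonic() - arrived) * 1000
            # Per-word latency is bounded by chunk duration + decode time.
            yield f"{word} (~{CHUNK_MS + decode_ms:.0f} ms)"
```

A sliding-window engine runs the same outer loop but buffers a multi-second window before each decode pass, so its latency floor is the window length; that is roughly where the ~11-12 s figures in the table come from.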
Published benchmarks are optimistic. Every model here ran 10-30× worse on accented or foreign-language speech than its studio numbers suggest. Real meetings are harder than LibriSpeech.