Thoth
Support Roadmap Blog Alternatives Use Cases Other Apps Mac App Store

A band-aid for Parakeet's language drift

2026-05-20

The previous post described the problem: Parakeet TDT v3 has no language conditioning token, so on spontaneous non-English speech it falls back to its English training prior and produces output that reads like a real-time translation. Whisper locks to French from the first token. Parakeet cannot.

The obvious answer was to recommend Whisper for non-English recordings and move on. I did that. Then I kept thinking.

What the model actually sees

Parakeet's decoder is a joint CTC/transducer model compiled into CoreML. The JointDecision model produces logits over the full 8,192-token vocabulary, but CoreML does the argmax internally before returning to Swift. What you get back is the top-1 prediction, already decided.

Except it also returns the top-64 logits and their token IDs, left over from the beam infrastructure. The full distribution never surfaces. But top-64 is enough to ask a useful question.

The diagnostic

I instrumented the decoder to log every time it emitted an English-exclusive token on a French recording. English-exclusive means tokens that are essentially impossible in French prose: the, The, and, And, they, with, that, would, their, and 39 others. These character sequences can occur inside French words, but never as standalone space-prefixed tokens the way they appear in English.
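As a rough illustration of what "standalone space-prefixed token" means in practice, here is a minimal sketch of turning a word list into a set of token IDs. It assumes a SentencePiece-style vocabulary where a leading "▁" marks a word boundary; the function name, the toy vocabulary, and the IDs are all hypothetical and not taken from the actual patch.

```python
# Hypothetical sketch: map blocklisted words to token IDs.
WORD_BOUNDARY = "\u2581"  # SentencePiece-style word-start marker (an assumption)

def build_blocklist(vocab: dict[str, int], words: list[str]) -> set[int]:
    """Return the IDs of standalone, space-prefixed tokens for the given words."""
    ids = set()
    for word in words:
        piece = WORD_BOUNDARY + word
        if piece in vocab:  # skip words that have no single-token form
            ids.add(vocab[piece])
    return ids

# Toy vocabulary; real pieces and IDs would come from the model's tokenizer.
toy_vocab = {
    WORD_BOUNDARY + "the": 10,
    WORD_BOUNDARY + "The": 11,
    WORD_BOUNDARY + "and": 12,
    "the": 13,                  # word-internal piece, deliberately not blocked
    WORD_BOUNDARY + "le": 14,
}
english_blocklist = build_blocklist(toy_vocab, ["the", "The", "and", "with"])
```

Matching on the word-boundary form is what keeps the blocklist from touching pieces that occur in the middle of legitimate French words.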

Then I checked: when the model picks one of those English tokens, is there a French candidate anywhere in the top-64?
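The check can be sketched roughly as follows. This is not the instrumentation code itself: the record layout, the `french_ids` set used to decide what counts as a French candidate, and the choice to compute the near-miss share over all logged wins are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EnglishWin:
    winner_logit: float      # logit of the emitted English token
    top_ids: list[int]       # top-k token IDs returned for this step
    top_logits: list[float]  # matching logits, same order as top_ids

def summarize(wins, french_ids, near_miss_gap=3.0):
    """Aggregate how often a French candidate was available and how close it was."""
    gaps = []
    for w in wins:
        french = [l for t, l in zip(w.top_ids, w.top_logits) if t in french_ids]
        if french:
            # Gap between the English winner and the best French candidate.
            gaps.append(w.winner_logit - max(french))
    total = len(wins)
    return {
        "wins": total,
        "with_french_alternative": len(gaps) / total if total else 0.0,
        "mean_gap": mean(gaps) if gaps else None,
        # Assumption: share is taken over all wins, not only those with an alternative.
        "near_miss_share": sum(g < near_miss_gap for g in gaps) / total if total else 0.0,
    }
```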

On a 57-minute spontaneous French interview:

  • 813 English token wins captured
  • 84% had a French alternative in the top-64
  • Average logit gap between the English winner and the best French candidate: 4.4
  • 35% of cases had a gap under 3.0: the English token barely won

The model knew French was a plausible answer. It just didn't pick it.
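To put the logit gaps in perspective: assuming the logits go through a standard softmax, a gap of g between two tokens means the winner is exp(g) times as probable as the runner-up. A small sketch of that conversion (the function name is illustrative):

```python
import math

def odds_ratio(logit_gap: float) -> float:
    """Probability ratio p(winner) / p(other) implied by a logit gap under softmax."""
    return math.exp(logit_gap)

average_case = odds_ratio(4.4)  # the average gap reported above
near_miss = odds_ratio(3.0)     # the near-miss threshold
```

Under that assumption, the average gap of 4.4 corresponds to roughly 81 to 1 in favor of the English token, and a gap of 3.0 to roughly 20 to 1.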

The workaround

I added a post-processing step in the decoder. Every time the joint model picks a token on the English blocklist, I scan the top-64 for the highest-scoring non-English Latin-script token and substitute it. If nothing suitable is in the top-64, the original prediction stands.
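A minimal sketch of that substitution step, written in Python for readability rather than in the Swift of the actual patch. The Latin-script test, the treatment of pieces starting with "<" as special tokens, and all names here are assumptions, not the real implementation.

```python
import unicodedata

def is_latin_piece(piece: str) -> bool:
    """True if the piece has letters and all of them are Latin-script."""
    if piece.startswith("<"):  # assumed convention for special tokens such as blank
        return False
    letters = [c for c in piece if c.isalpha()]
    return bool(letters) and all("LATIN" in unicodedata.name(c, "") for c in letters)

def substitute(winner_id, top_ids, top_logits, blocklist, id_to_piece):
    """Swap a blocklisted winner for the best non-blocklisted Latin-script candidate."""
    if winner_id not in blocklist:
        return winner_id
    best_id, best_logit = None, float("-inf")
    for tok, logit in zip(top_ids, top_logits):  # at most 64 candidates
        if tok in blocklist or not is_latin_piece(id_to_piece.get(tok, "")):
            continue
        if logit > best_logit:
            best_id, best_logit = tok, logit
    # If nothing suitable is found, the original prediction stands.
    return winner_id if best_id is None else best_id
```

The scan does not rely on the candidates being sorted, so it behaves the same whether or not the top-64 arrive in descending logit order.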

The substitution also updates the token probability fed back into the transducer's LSTM state. This matters: each forced French token conditions the next prediction toward French. The effect compounds. A single substitution doesn't just fix one word: it nudges the following frames away from English.
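The compounding effect can be illustrated with a deliberately simplified greedy loop. The sketch assumes a standard transducer in which the prediction network is conditioned on the previously emitted token, emits at most one symbol per frame, and ignores TDT duration prediction. The toy joint and predictor below are stand-ins built to exhibit the effect, so they show the mechanism rather than provide evidence for it.

```python
BLANK = 0  # hypothetical blank-token ID

def greedy_decode(frames, joint, predictor, state, fix):
    """Greedy transducer-style loop, simplified to one symbol per frame."""
    emitted = []
    for frame in frames:
        winner, top_ids, top_logits = joint(frame, state)
        token = fix(winner, top_ids, top_logits)
        if token != BLANK:
            emitted.append(token)
            # Feed the substituted token, not the original winner, so the
            # next prediction is conditioned on the corrected history.
            state = predictor(token, state)
    return emitted

# Toy stand-ins (not the real model): token 1 is "English", token 2 is "French".
EN, FR = 1, 2

def toy_joint(frame, state):
    # Prefers whichever language the state was last conditioned on.
    return (FR, [FR, EN], [5.0, 4.0]) if state == "fr" else (EN, [EN, FR], [5.0, 4.0])

def toy_predictor(token, state):
    return "fr" if token == FR else "en"

def toy_fix(winner, top_ids, top_logits):
    return FR if winner == EN and FR in top_ids else winner

def no_fix(winner, top_ids, top_logits):
    return winner

patched = greedy_decode(range(4), toy_joint, toy_predictor, "en", toy_fix)
unpatched = greedy_decode(range(4), toy_joint, toy_predictor, "en", no_fix)
```

In the toy run, only the first frame needs a substitution; the remaining frames come out French on their own because the state was updated with the corrected token.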

The whole thing is a linear scan over at most 64 candidates per decoding step. No additional model call. No latency impact.

Results

I re-ran the full benchmark:

Recording                         | Parakeet before | Parakeet after | Whisper
Children's weather segment        | 0%              | 0%             | 0%
Archival recording, 1912          | 2.9%            | 2.7%           | 2.0%
Weightlifting documentary         | 7.1%            | 0%             | 0%
French slang documentary          | 18.2%           | 3.7%           | 0%
Picard dialect documentary        | 16.7%           | 0%             | 0%
Private French-language interview | 31.3%           | 13.5%          | 0.4%

On three of the six recordings, Parakeet now matches Whisper exactly. On a fourth, the 1912 archival recording, the remaining 2.7% is one sentence flagged as Catalan by the language detector; the text is valid French, so real drift on that recording is zero.

The French slang documentary dropped from 18.2% to 3.7%. The remaining instances are genuine: a quoted passage from a 15th-century criminal trial record read aloud in archaic French mixed with period slang, and one sentence where no French candidate appeared in the top-64 at all.

What the workaround cannot do

The 57-minute interview sits at 13.5% instead of 0%. That gap is architectural.

When the model produces a sustained English passage (several sentences in sequence), it means no French token appeared in the top-64 for those frames. The blocklist has nothing to substitute. Those are cases where the acoustic evidence was genuinely weak and the English prior dominated completely.

Whisper sidesteps this with a language token prepended before every chunk. The decoder never has the option of drifting. Parakeet has no equivalent slot. The workaround addresses cases where French was acoustically plausible and the English prior just barely tipped the scales. The hard cases remain.
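For comparison, the conditioning slot Whisper has looks roughly like this. The special-token names follow the open-source Whisper release; the helper function itself is hypothetical and only illustrates the ordering of the prompt.

```python
def whisper_prompt(language: str, task: str = "transcribe",
                   timestamps: bool = False) -> list[str]:
    """Build the special-token prefix the decoder is conditioned on."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

french_prefix = whisper_prompt("fr")
```

Every text token is generated after this prefix, so the language choice is part of the context from the start rather than something the decoder has to infer from the audio.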

What it does to the output quality

The blocklist substitutes the function word but surrounding content words stay as decoded. In cases where the model was on the edge of drifting, the result is clean French. In cases where it had fully committed to English sentence structure, the result is sometimes hybrid: correct French function words inside English content-word scaffolding.

Looking at the diffs on a 1982 INA documentary about the Picard dialect:

Before: "It's the message of terroirs that sought render ainsi to our compréhension."

After: "C'est tout le message de quelques terroirs qu'il souhaite rendre ainsi à notre compréhension."

That worked because "It's" triggered a substitution to "C'est", which pulled the LSTM toward French. The next tokens followed: "the" became "tout le", "that sought render" became "qu'il souhaite rendre", "to our" became "à notre". One substitution, the whole sentence recovered.

Before: "...la quête des survivances orales, the recherche du pain les picards."

After: "...la quête des survivances orales, la recherche du pain les picards."

That is the clean case. One token corrected, no side effects.

Status

This is a local patch on the FluidAudio checkout inside Thoth, not a proper fix to the model or the library. I am preparing a fork and pull request upstream. The English blocklist is language-pair specific (French versus English) and would need equivalent token sets for other language pairs, but the mechanism generalizes.

The architectural problem is unchanged. Parakeet still has no language conditioning token. On highly spontaneous speech, it will still drift. The workaround just reduces how often that happens.

Parakeet is still the faster option, and on all but the most spontaneous audio it now matches Whisper on language correctness. The 57-minute interview sits at 13.5%, which, in my opinion, is not good enough to lift the warning for highly spontaneous French speech.


Thoth is a private meeting recorder for Mac. All transcription runs on your device. Built by one person, no funding, no team. If you find it useful, upgrading to Pro is the best way to support development.


© 2026 Thoth · Privacy · Terms