
Local vs cloud AI summaries: a two-round benchmark

2026-05-13

Every meeting recorder now ships with AI summaries. The question nobody answers clearly is: where does the AI actually run, and does it matter for quality?

I ran a proper benchmark across two very different meetings, three AI models, and three independent judges. Here is what I found.

The setup

Two transcripts, both processed through the same three models:

  • Phi 3.5 Mini (local, 2.3 GB on disk, runs entirely on the Mac)
  • Gemma 3 12B (local, memory-mapped, runs on Apple Silicon via the Neural Engine)
  • Claude Sonnet 4.6 (BYOK cloud: the transcript is sent straight from the Mac to Anthropic's API using my own key; no Thoth server ever touches it)

Round 1 was a French-language interview: dense conversational content, two speakers, domain-specific vocabulary, code-switching. Round 2 was an English technical meeting: multiple speakers, highly specialized terminology, implicit strategic decisions alongside explicit technical ones.

Each summary was scored by three independent judges on six criteria: factual accuracy, completeness, decision capture, action items, quote selection, and language quality. Summaries were anonymized before scoring to reduce self-enhancement bias. The judges were Claude Opus 4.6 (with extended thinking), Gemini 2.5 Pro, and GPT-5.5.
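
Roughly, the aggregation looks like this. The Swift below is a sketch with names of my own choosing, and it assumes each judge's overall score is the unweighted mean of the six criteria; the Average column in the tables is the mean across the three judges.

    // One judge's scores for one anonymized summary, on the six criteria.
    struct Scorecard {
        let accuracy, completeness, decisions, actionItems, quotes, language: Double

        // That judge's overall score: the unweighted mean of the six criteria.
        var overall: Double {
            (accuracy + completeness + decisions + actionItems + quotes + language) / 6
        }
    }

    // Strip model names and hand out neutral labels in shuffled order, so a judge
    // cannot recognize (and favor) its own vendor's output. Assumes three summaries.
    func anonymize(_ summaries: [String]) -> [(label: String, text: String)] {
        zip(["Summary A", "Summary B", "Summary C"], summaries.shuffled())
            .map { pair in (label: pair.0, text: pair.1) }
    }

    // A model's figure in the Average column: the mean of its per-judge overalls.
    func tableScore(_ judgeScores: [Scorecard]) -> Double {
        judgeScores.map(\.overall).reduce(0, +) / Double(judgeScores.count)
    }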

The results

Round 1 (French conversational transcript)

Model               Opus 4.6   Gemini 2.5 Pro   GPT-5.5   Average
Phi 3.5 Mini        4.3        5.3              4.3       4.6
Gemma 3 12B         5.3        6.5              6.3       6.0
Sonnet 4.6 (BYOK)   7.2        8.3              8.0       7.8

Round 2 (English technical transcript)

Model               Opus 4.6   Gemini 2.5 Pro   GPT-5.5   Average
Phi 3.5 Mini        3.8        6.5              4.3       4.9
Gemma 3 12B         4.2        6.8              5.3       5.4
Sonnet 4.6 (BYOK)   8.8        8.8              8.2       8.6

The ranking is identical across all three judges and both rounds. Sonnet leads, Gemma is in the middle, Phi is last. The gap widens on the technical transcript.

What went wrong, specifically

The most important finding is not the ranking. It is the failure modes.

Sonnet hallucinated quotes. In Round 1, two of its four selected quotes did not appear verbatim in the source transcript. They sounded authentic and on-brand for the speaker. Opus 4.6 and GPT-5.5 caught this and penalized it. Gemini missed it and gave Sonnet 9/10 on quote selection. This is the most serious risk of cloud AI summaries for meeting notes: the model produces fluent, plausible content that was never actually said.
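
This particular failure is cheap to guard against mechanically. Here is a sketch of the kind of check that catches it, assuming the transcript and the model's selected quotes are available as plain strings: a quote only counts as verbatim if it appears in the transcript after normalizing case, whitespace, and quotation marks. Close paraphrases will still slip through, but invented quotes get flagged.

    import Foundation

    func verifyQuotes(_ quotes: [String], against transcript: String) -> [(quote: String, verbatim: Bool)] {
        // Lowercase, drop quotation marks, and collapse whitespace so cosmetic
        // differences don't mask a genuine verbatim match.
        func normalize(_ text: String) -> String {
            text.lowercased()
                .replacingOccurrences(of: "[\"'’“”«»]", with: "", options: .regularExpression)
                .replacingOccurrences(of: "\\s+", with: " ", options: .regularExpression)
                .trimmingCharacters(in: .whitespaces)
        }
        let haystack = normalize(transcript)
        return quotes.map { (quote: $0, verbatim: haystack.contains(normalize($0))) }
    }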

Gemma confused participants with competitors. In Round 2, Gemma identified the main competitor as the client throughout its summary. This is not a minor error. It inverts the entire commercial context of the meeting. Someone acting on that summary would have a fundamentally wrong picture of the situation. Gemma also concluded "no formal decisions were made" in both rounds, despite clear implicit decisions in both transcripts.

Phi reproduced transcription errors verbatim. Where the transcript contained garbled text, Phi carried it into the summary uncorrected. It also chose quotes that were either irrelevant or contextually meaningless.

Decision capture was the weakest criterion for local models across both rounds. Neither Phi nor Gemma reliably identified implicit decisions, which are often the most actionable outputs of a meeting. Sonnet was consistently the only model to capture decisions that were never stated explicitly but were clearly agreed upon.

A practical note on prompting

The initial runs were done with an English system prompt across all models. Local models responded in English regardless of the transcript language. Switching to a system prompt in the detected language of the transcript fixed this cleanly. Passing the prompt in the target language is more reliable than instructing the model to respond in a specific language.
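
Concretely, that can be as simple as detecting the transcript's dominant language and keying the prompt off it. Below is a sketch using Apple's NaturalLanguage framework, with placeholder prompt strings rather than Thoth's actual prompts.

    import NaturalLanguage

    // Placeholder prompts, one per supported language; the real prompts are longer.
    let systemPrompts: [NLLanguage: String] = [
        .english: "You are a meeting assistant. Summarize the transcript in English.",
        .french: "Tu es un assistant de réunion. Résume la transcription en français."
    ]

    // Detect the transcript's dominant language and return a prompt written in
    // that language, falling back to English when detection fails.
    func systemPrompt(for transcript: String) -> String {
        let language = NLLanguageRecognizer.dominantLanguage(for: transcript) ?? .english
        return systemPrompts[language] ?? systemPrompts[.english]!
    }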

What this means in practice

The quality gap between local and cloud is real and consistent. On a conversational French transcript, Sonnet scores 7.8 vs 4.6 for the smallest local model. On a dense technical English meeting, that gap grows to 8.6 vs 4.9.

But the right question is not which is better. It is which tradeoff fits your meeting.

If the transcript contains anything you would not send to a third party, use local. Unpublished research, privileged conversations, NDA-protected discussions, patient information. A 4.6 summary you control is better than an 8.6 summary that left your machine.

If the meeting is a standard internal call or a client discussion where depth matters more than absolute privacy, BYOK cloud produces meaningfully better output. The quality difference is large enough to affect how useful the summary actually is.

The third option, which most tools do not offer, is choosing per recording rather than per account. Some meetings warrant local. Others warrant cloud. Thoth lets you pick for each recording independently.

One more thing worth noting: when BYOK is configured correctly, the transcript goes directly from your Mac to the AI provider using your own API key. The recorder's vendor never sees it. That is meaningfully different from a vendor routing your content through their own infrastructure before forwarding it to a model. The privacy posture is different even if the end model is the same.
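
For the curious, a direct BYOK call is not exotic. A minimal sketch against Anthropic's Messages API, with a placeholder model ID and prompt, and none of the error handling or streaming a real implementation would need:

    import Foundation

    // The transcript goes from this Mac straight to api.anthropic.com,
    // authenticated with the user's own key. No vendor server in between.
    func summarize(transcript: String, apiKey: String) async throws -> Data {
        var request = URLRequest(url: URL(string: "https://api.anthropic.com/v1/messages")!)
        request.httpMethod = "POST"
        request.setValue(apiKey, forHTTPHeaderField: "x-api-key")
        request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        let body: [String: Any] = [
            "model": "claude-sonnet-4-6",   // placeholder; use a model ID available on your account
            "max_tokens": 1024,
            "messages": [["role": "user", "content": "Summarize this meeting transcript:\n\n" + transcript]]
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)
        let (data, _) = try await URLSession.shared.data(for: request)
        return data   // JSON response; real code would decode it and surface errors
    }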


Thoth supports five local AI models and BYOK cloud (OpenAI, Anthropic, Google). You choose per recording. Audio never leaves your Mac regardless of which AI path you use. Download on the Mac App Store.
