BENCH-001 · June 2026

Speech-to-Text on Real Multilingual Conversation

15 models · 4 languages · human-verified ground truth

TL;DR

- ElevenLabs leads overall and is the only model under 10% WER on code-switched Hinglish (8.1%); the runner-up, Deepgram, is at 19.3%.
- Dialectal Arabic breaks everyone: the best model scores 53.1% WER, roughly one error for every two words spoken.
- Half the field fails catastrophically: loops, invented content, dropped audio.

01 Results

Error rate of each model against human-verified transcripts, in percent; lower is better. Word error rate (WER) is used throughout, except for Japanese, which is scored by character error rate (CER) because written Japanese does not mark word boundaries.

Model	Spanish WER (%)	Arabic WER (%)	Hinglish WER (%)	Japanese CER (%)
ElevenLabs scribe_v2	2.9🏆	53.6	8.1🏆	19.9
Deepgram nova-3	9.0	53.1🏆	19.3	24.9
Google gemini-3.5-flash	8.3	54.8	35.0	22.2
Google gemini-3.1-pro-preview	9.7	54.9	52.6	23.4
AssemblyAI speech_models: universal-3-pro, universal-2	11.5	79.8	53.3	16.6🏆
Microsoft mai-transcribe-1.5	8.9	96.4	52.3	22.0
Groq whisper-large-v3 · open weights	13.8	77.3	75.7	19.7
OpenAI gpt-4o-transcribe-diarize	15.4	83.9	44.8	—
OpenAI whisper-1	18.0	99.4	94.5	23.4
OpenAI gpt-4o-transcribe	15.1	91.2	91.3	38.2
Sarvam saarika:v2.5	—	—	63.6	—
Cartesia ink-whisper	11.4	97.2	108.4*	58.6
Mistral voxtral-small-24b-2507 · open weights	10.2	130.1*	182.3*	53.1
OpenAI gpt-4o-mini-transcribe-2025-12-15	12.1	384.9*	63.0	23.2
NVIDIA nemotron-3-nano-omni-30b-a3b · open weights	749.1*	2291.5*	678.7*	1281.5*

* Above 100%: WER = (substitutions + deletions + insertions) ÷ words actually spoken. Hallucinated content counts as insertions, so a model that invents enough text makes more errors than there are real words.
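The arithmetic in the footnote above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and a plain Levenshtein alignment rather than the jiwer pipeline used for the reported scores; the function names and sample sentences are made up for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Errors divided by the number of reference tokens, in percent."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# Illustrative sentences, not taken from the benchmark data.
reference = "call me when you land".split()
one_miss = "call me when you lend".split()
# A hypothesis that transcribes the sentence, then loops on invented text.
looping = ("call me when you land " + "thank you " * 6).split()

print(error_rate(reference, one_miss))  # 1 substitution over 5 words
print(error_rate(reference, looping))   # 12 insertions over 5 words

# Passing characters instead of words gives CER, as used for Japanese.
print(error_rate(list("こんにちは"), list("こんばんは")))
```

Because the denominator is the reference length only, twelve invented words against five spoken ones yield 240%, which is how scores above 100% arise.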

AssemblyAI was run through model routing with universal-3-pro and universal-2. Universal-3 Pro supports English, Spanish, Portuguese, French, German, and Italian; the Arabic, Hinglish/Hindi, and Japanese results should be read as Universal-2 fallback scores.

Spanish · Mexican Spanish, family call · WER

Model	WER (%)
Scribe v2	2.9
Gemini 3.5 Flash	8.3
MAI-Transcribe 1.5	8.9
Nova-3	9.0
Gemini 3.1 Pro	9.7
Voxtral 24B	10.2
Ink-Whisper	11.4
U3 Pro / U2 routed	11.5
4o-mini-transcribe	12.1
Whisper large-v3	13.8
4o-transcribe	15.1
4o-diarize	15.4
Whisper-1	18.0
Nemotron 3 Nano	749.1*

Arabic · Gulf dialect, family call · WER

Model	WER (%)
Nova-3	53.1
Scribe v2	53.6
Gemini 3.5 Flash	54.8
Gemini 3.1 Pro	54.9
Whisper large-v3	77.3
U3 Pro / U2 routed	79.8
4o-diarize	83.9
4o-transcribe	91.2
MAI-Transcribe 1.5	96.4
Ink-Whisper	97.2
Whisper-1	99.4
Voxtral 24B	130.1*
4o-mini-transcribe	384.9*
Nemotron 3 Nano	2291.5*

Hinglish · Hindi–English code-switching · WER

Model	WER (%)
Scribe v2	8.1
Nova-3	19.3
Gemini 3.5 Flash	35.0
4o-diarize	44.8
MAI-Transcribe 1.5	52.3
Gemini 3.1 Pro	52.6
U3 Pro / U2 routed	53.3
4o-mini-transcribe	63.0
Saarika v2.5	63.6
Whisper large-v3	75.7
4o-transcribe	91.3
Whisper-1	94.5
Ink-Whisper	108.4*
Voxtral 24B	182.3*
Nemotron 3 Nano	678.7*

Japanese · Casual conversation · CER

Model	CER (%)
U3 Pro / U2 routed	16.6
Whisper large-v3	19.7
Scribe v2	19.9
MAI-Transcribe 1.5	22.0
Gemini 3.5 Flash	22.2
4o-mini-transcribe	23.2
Gemini 3.1 Pro	23.4
Whisper-1	23.4
Nova-3	24.9
4o-transcribe	38.2
Voxtral 24B	53.1
Ink-Whisper	58.6
Nemotron 3 Nano	1281.5*

02 The Data

The recordings are natural conversations between native speakers who know each other — real calls between family members, speaking the way people actually speak: overlapping turns, interruptions, laughter, mid-sentence language switches, imperfect connections. Nothing is scripted, read aloud, or staged.

Ground truth is utterance-level transcription with timestamps and speaker labels, produced and verified by native speakers. Code-switched speech is kept in its natural scripts — Hindi in Devanagari, English in Latin — and each recording carries speaker metadata such as first language, dialect, and age bracket. Conversations are available both as the merged mix benchmarked here and as speaker-isolated tracks.

Language	Setting
Spanish	Mexican Spanish, family call
Arabic	Gulf dialect, family call
Hinglish	Hindi–English code-switching
Japanese	Casual conversation

03 Findings

Clean-language scores don't transfer. Setting aside Nemotron, the models that score 3–18% WER on Spanish score 53–385% on Gulf-dialect Arabic. Microsoft's MAI-Transcribe-1.5, released days before this benchmark with a claim to the #1 spot on the Open ASR Leaderboard, scores a solid 8.9% on Spanish yet misidentifies the Arabic call entirely, transcribing it as German and English gibberish (96.4% WER, reproducible with both generic and region-specific language hints). Published benchmarks built on read speech and broadcast audio say very little about real conversation in lower-resource varieties.
Code-switching is a separating dimension. Hinglish — how hundreds of millions of people actually speak — splits the field: ElevenLabs reaches 8.1% WER and Deepgram 19.3%, frontier multimodal models land mid-pack (Gemini 3.5 Flash at 35.0%), every Whisper-derived model sits between 76% and 108%, and Mistral's Voxtral hits 182% after generating twice as many words as were spoken. Even Sarvam — built specifically for Indian languages — lands at 63.6%. Whisper-family models and Sarvam also force all output into a single script, writing English words in Devanagari.
Failure is catastrophic, not graceful. On Arabic, gpt-4o-mini-transcribe produced more than four times as many words as were actually spoken — page after page of invented content. Cartesia emitted repetition loops (one syllable repeated 70+ times) and transcribed only a quarter of the Japanese audio. gpt-4o-transcribe silently dropped the entire opening of one file. NVIDIA's Nemotron 3 Nano Omni — a reasoning multimodal model — is the starkest case: on Arabic it spends tens of thousands of reasoning tokens and returns no transcript at all, and when it does emit text it loops, producing 38,000 words where 1,664 were spoken (2,292% WER) and romanizing Japanese into Latin script (1,282% CER). In production, these failures would poison downstream data without raising an error. The instability is also run-to-run: across five repeated runs, dedicated transcription APIs return near-identical scores, while gpt-4o-family models swing by tens of points on the harder languages — the same audio, the same settings, a different transcript every time.
Everyone struggles with casual Japanese. Fast, overlapping family conversation holds the best result (AssemblyAI via Universal-2 fallback) to 16.6% CER — roughly one error every six characters — despite Japanese being a well-resourced language.
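Because the failures described above return text rather than errors, a pipeline has to look for them itself. The following is a rough sketch of a reference-free screen, assuming whitespace-separated words and audio duration as the only inputs. The function names and every threshold are hypothetical and were not part of this benchmark; they would need tuning per language, and a script written without spaces, such as Japanese, would need a character-based variant.

```python
def longest_repeat_run(words):
    """Length of the longest run of one word repeated back to back."""
    longest = run = 0
    previous = None
    for word in words:
        run = run + 1 if word == previous else 1
        longest = max(longest, run)
        previous = word
    return longest


def screen_transcript(text, audio_seconds,
                      max_words_per_second=5.0,   # hypothetical ceiling
                      min_words_per_second=0.3,   # hypothetical floor
                      max_repeat_run=10):         # hypothetical loop limit
    """Return a list of warning flags for a single transcript."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    words = text.split()
    rate = len(words) / audio_seconds
    flags = []
    if rate > max_words_per_second:
        flags.append("runaway_output")    # far more words than speech allows
    if rate < min_words_per_second:
        flags.append("dropped_audio")     # far fewer words than expected
    if longest_repeat_run(words) > max_repeat_run:
        flags.append("repetition_loop")   # one token repeated many times
    return flags


# One syllable repeated 70 times over a minute of audio.
print(screen_transcript("la " * 70, audio_seconds=60))
```

This only catches single-token loops and gross length mismatches; repeated multi-word phrases or plausible-sounding invented content at a normal speaking rate would pass unnoticed.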

04 Methodology

Audio. Each model received the merged conversation mix — overlapping speech included — at 16 kHz mono, and was scored against the human-verified transcript of the full conversation.

Models. Each provider's current flagship transcription API was called with its documented language hint: AssemblyAI model routing with universal-3-pro and universal-2; ElevenLabs scribe_v2; Deepgram nova-3; OpenAI gpt-4o-transcribe, gpt-4o-transcribe-diarize, gpt-4o-mini-transcribe (2025-12-15), and whisper-1; Cartesia ink-whisper; Google Gemini 3.1 Pro and 3.5 Flash; Microsoft MAI-Transcribe-1.5; Mistral Voxtral Small 24B; open-weights Whisper large-v3 (via Groq); and NVIDIA Nemotron 3 Nano Omni 30B (open weights, via OpenRouter; added after the original evaluation and reported from a single run rather than a five-run mean). Multimodal chat models were prompted for verbatim transcription at temperature 0. Audio exceeding provider limits was split into 5–10 minute segments, and the segment transcripts were rejoined. Sarvam saarika:v2.5 (batch API) supports Indic languages and English only, so it is scored on Hinglish alone. AssemblyAI's documentation says this routed configuration uses Universal-3 Pro for languages it supports and falls back to Universal-2 for all other languages; therefore the AssemblyAI Spanish result is Universal-3 Pro, while the Arabic, Hinglish/Hindi, and Japanese results are Universal-2 fallback scores rather than native Universal-3 Pro scores.
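The splitting step can be illustrated with a small planner. This is a sketch under stated assumptions, not the procedure used here: it cuts at equal intervals, whereas a real pipeline would likely prefer cutting at pauses, and the names `plan_segments` and `rejoin` are hypothetical. With a 10-minute cap, equal division of any longer recording yields segments of more than 5 minutes each, which matches the 5–10 minute range described above.

```python
import math


def plan_segments(total_seconds, max_seconds=600.0):
    """Split a recording into the fewest equal segments under max_seconds.

    Returns a list of (start, end) offsets in seconds.
    """
    if total_seconds <= 0 or max_seconds <= 0:
        raise ValueError("durations must be positive")
    count = max(1, math.ceil(total_seconds / max_seconds))
    length = total_seconds / count
    return [(i * length, min((i + 1) * length, total_seconds))
            for i in range(count)]


def rejoin(segment_texts):
    """Concatenate per-segment transcripts in order, skipping empty ones."""
    return " ".join(t.strip() for t in segment_texts if t.strip())


# A 25-minute call becomes three segments of equal length.
for start, end in plan_segments(25 * 60):
    print(f"{start:7.1f}s to {end:7.1f}s")
```

Cutting mid-word at a boundary can cost a word or two per join, so boundary placement is one of the details a production version would need to handle more carefully.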

Scoring. References and hypotheses are normalized identically before comparison: timestamps, speaker labels, and annotation tags stripped; Unicode NFKC; casefolding; punctuation removed; Arabic diacritics and orthographic variants folded. WER and CER are computed with jiwer. CER is the headline metric for Japanese; WER elsewhere. The full benchmark was run five times per model, and reported figures are means across runs, except for Nemotron 3 Nano Omni, which is reported from a single run.
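The normalization steps listed above might look roughly like the following. This is a sketch using only the Python standard library, not the benchmark's actual code: the bracketed tag format, the set of Arabic letters folded, and the choice to replace punctuation with a space are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed tag format: anything in square brackets, e.g. [laughter] or [00:01:23].
TAG_PATTERN = re.compile(r"\[[^\]]*\]")

# Arabic short-vowel marks U+064B..U+0652, superscript alef U+0670, tatweel U+0640.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumed orthographic folding; the benchmark's exact mapping is not published here.
ARABIC_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
})


def normalize(text):
    """Apply the same normalization to a reference or a hypothesis."""
    text = TAG_PATTERN.sub(" ", text)
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    text = ARABIC_DIACRITICS.sub("", text)
    text = text.translate(ARABIC_VARIANTS)
    # Replace every character in a Unicode punctuation category with a space.
    # Combining vowel signs (as in Devanagari) are marks, not punctuation,
    # so they are left in place.
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch
                   for ch in text)
    return " ".join(text.split())


print(normalize("[laughter] Hello, WORLD!"))
```

The important property is that both sides of the comparison pass through the same function, so a formatting difference such as casing or a stray diacritic is never counted as a recognition error.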

Caveats. Part of the Whisper-family Hinglish gap is script convention rather than misrecognition, though their error rates remain far higher under any convention. Merged audio includes crosstalk, which penalizes all models equally.

This benchmark is built on a small public sample of Specific's human-verified conversation data. Reach out for the full dataset or evaluation runs on your own models.