BENCH-001 · June 2026

Speech-to-Text on Real Multilingual Conversation

15 models · 4 languages · human-verified ground truth

TL;DR
  • ElevenLabs leads overall and is the only model under 10% WER on code-switched Hinglish (8.1%).
  • Dialectal Arabic breaks everyone: the best model gets roughly every other word wrong.
  • Half the field fails catastrophically: loops, invented content, dropped audio.

01 Results

Error rate of each model against human-verified transcripts; lower is better. All figures are percentages. Word error rate (WER) is used throughout, except for Japanese, which is scored by character error rate (CER) because written Japanese does not separate words with spaces.

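| Provider | Model | Spanish WER | Arabic WER | Hinglish WER | Japanese CER |
|---|---|---|---|---|---|
| ElevenLabs | scribe_v2 | 2.9 🏆 | 53.6 | 8.1 🏆 | 19.9 |
| Deepgram | nova-3 | 9.0 | 53.1 🏆 | 19.3 | 24.9 |
| Google | gemini-3.5-flash | 8.3 | 54.8 | 35.0 | 22.2 |
| Google | gemini-3.1-pro-preview | 9.7 | 54.9 | 52.6 | 23.4 |
| AssemblyAI | speech_models: universal-3-pro, universal-2 | 11.5 | 79.8 | 53.3 | 16.6 🏆 |
| Microsoft | mai-transcribe-1.5 | 8.9 | 96.4 | 52.3 | 22.0 |
| Groq | whisper-large-v3 · open weights | 13.8 | 77.3 | 75.7 | 19.7 |
| OpenAI | gpt-4o-transcribe-diarize | 15.4 | 83.9 | 44.8 | - |
| OpenAI | whisper-1 | 18.0 | 99.4 | 94.5 | 23.4 |
| OpenAI | gpt-4o-transcribe | 15.1 | 91.2 | 91.3 | 38.2 |
| Sarvam | saarika:v2.5 | - | - | 63.6 | - |
| Cartesia | ink-whisper | 11.4 | 97.2 | 108.4* | 58.6 |
| Mistral | voxtral-small-24b-2507 · open weights | 10.2 | 130.1* | 182.3* | 53.1 |
| OpenAI | gpt-4o-mini-transcribe-2025-12-15 | 12.1 | 384.9* | 63.0 | 23.2 |
| NVIDIA | nemotron-3-nano-omni-30b-a3b · open weights | 749.1* | 2291.5* | 678.7* | 1281.5* |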

* Above 100%: WER = (substitutions + deletions + insertions) ÷ words actually spoken. Hallucinated content counts as insertions, so a model that invents enough text makes more errors than there are real words.
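To make the footnote concrete, here is a minimal standard-library sketch of WER and CER as edit distance divided by reference length. It is an illustration only: the benchmark itself computes its scores with jiwer, and the function names and example sentences below are invented for this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, deletions, and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion: reference item missing from hypothesis
                curr[j - 1] + 1,          # insertion: extra item in hypothesis
                prev[j - 1] + (r != h),   # substitution, or free match when equal
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative (made-up) example: five spoken words, eight hallucinated ones.
reference = "call me when you land"
hypothesis = "call me when you land thank you for watching thank you for watching"
score = wer(reference, hypothesis)
```

Here every spoken word is transcribed correctly, yet the eight invented words count as eight insertions against five reference words, so `score` is 1.6, or 160% WER. The denominator never grows with the hypothesis, which is why a looping or hallucinating model can score far above 100%.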

AssemblyAI was run through model routing with universal-3-pro and universal-2. Universal-3 Pro supports English, Spanish, Portuguese, French, German, and Italian; the Arabic, Hinglish/Hindi, and Japanese results should be read as Universal-2 fallback scores.

Spanish · Mexican Spanish, family call · WER
Scribe v2: 2.9
Gemini 3.5 Flash: 8.3
MAI-Transcribe 1.5: 8.9
Nova-3: 9.0
Gemini 3.1 Pro: 9.7
Voxtral 24B: 10.2
Ink-Whisper: 11.4
U3 Pro / U2 routed: 11.5
4o-mini-transcribe: 12.1
Whisper large-v3: 13.8
4o-transcribe: 15.1
4o-diarize: 15.4
Whisper-1: 18.0
Nemotron 3 Nano: 749.1

Arabic · Gulf dialect, family call · WER
Nova-3: 53.1
Scribe v2: 53.6
Gemini 3.5 Flash: 54.8
Gemini 3.1 Pro: 54.9
Whisper large-v3: 77.3
U3 Pro / U2 routed: 79.8
4o-diarize: 83.9
4o-transcribe: 91.2
MAI-Transcribe 1.5: 96.4
Ink-Whisper: 97.2
Whisper-1: 99.4
Voxtral 24B: 130.1
4o-mini-transcribe: 384.9
Nemotron 3 Nano: 2291.5

Hinglish · Hindi–English code-switching · WER
Scribe v2: 8.1
Nova-3: 19.3
Gemini 3.5 Flash: 35.0
4o-diarize: 44.8
MAI-Transcribe 1.5: 52.3
Gemini 3.1 Pro: 52.6
U3 Pro / U2 routed: 53.3
4o-mini-transcribe: 63.0
Saarika v2.5: 63.6
Whisper large-v3: 75.7
4o-transcribe: 91.3
Whisper-1: 94.5
Ink-Whisper: 108.4
Voxtral 24B: 182.3
Nemotron 3 Nano: 678.7

Japanese · Casual conversation · CER
U3 Pro / U2 routed: 16.6
Whisper large-v3: 19.7
Scribe v2: 19.9
MAI-Transcribe 1.5: 22.0
Gemini 3.5 Flash: 22.2
4o-mini-transcribe: 23.2
Gemini 3.1 Pro: 23.4
Whisper-1: 23.4
Nova-3: 24.9
4o-transcribe: 38.2
Voxtral 24B: 53.1
Ink-Whisper: 58.6
Nemotron 3 Nano: 1281.5

02 The Data

The recordings are natural conversations between native speakers who know each other — real calls between family members, speaking the way people actually speak: overlapping turns, interruptions, laughter, mid-sentence language switches, imperfect connections. Nothing is scripted, read aloud, or staged.

Ground truth is utterance-level transcription with timestamps and speaker labels, produced and verified by native speakers. Code-switched speech is kept in its natural scripts — Hindi in Devanagari, English in Latin — and each recording carries speaker metadata such as first language, dialect, and age bracket. Conversations are available both as the merged mix benchmarked here and as speaker-isolated tracks.

| Language | Setting |
|---|---|
| Spanish | Mexican Spanish, family call |
| Arabic | Gulf dialect, family call |
| Hinglish | Hindi–English code-switching |
| Japanese | Casual conversation |

03 Findings

  1. Clean-language scores don't transfer. Setting aside Nemotron, the same models that score 3–18% WER on Spanish score 53–385% on Gulf-dialect Arabic. Microsoft's MAI-Transcribe-1.5, released days before this benchmark claiming the #1 spot on the Open ASR Leaderboard, scores a solid 8.9% on Spanish yet misidentifies the Arabic call entirely, transcribing it as German and English gibberish (96.4% WER, reproducible with both generic and region-specific language hints). Published benchmarks built on read speech and broadcast audio say very little about real conversation in lower-resource varieties.
  2. Code-switching is a separating dimension. Hinglish — how hundreds of millions of people actually speak — splits the field: ElevenLabs reaches 8.1% WER and Deepgram 19.3%, frontier multimodal models land mid-pack (Gemini 3.5 Flash at 35.0%), every Whisper-derived model sits between 76% and 108%, and Mistral's Voxtral hits 182% after generating twice as many words as were spoken. Even Sarvam — built specifically for Indian languages — lands at 63.6%. Whisper-family models and Sarvam also force all output into a single script, writing English words in Devanagari.
  3. Failure is catastrophic, not graceful. On Arabic, gpt-4o-mini-transcribe produced more than four times as many words as were actually spoken — page after page of invented content. Cartesia emitted repetition loops (one syllable repeated 70+ times) and transcribed only a quarter of the Japanese audio. gpt-4o-transcribe silently dropped the entire opening of one file. NVIDIA's Nemotron 3 Nano Omni — a reasoning multimodal model — is the starkest case: on Arabic it spends tens of thousands of reasoning tokens and returns no transcript at all, and when it does emit text it loops, producing 38,000 words where 1,664 were spoken (2,292% WER) and romanizing Japanese into Latin script (1,282% CER). In production, these failures would poison downstream data without raising an error. The instability is also run-to-run: across five repeated runs, dedicated transcription APIs return near-identical scores, while gpt-4o-family models swing by tens of points on the harder languages — the same audio, the same settings, a different transcript every time.
  4. Everyone struggles with casual Japanese. Fast, overlapping family conversation holds the best result (AssemblyAI via Universal-2 fallback) to 16.6% CER — roughly one error every six characters — despite Japanese being a well-resourced language.
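The run-to-run instability described in finding 3 can be quantified by looking at the spread across repeated runs and not only the mean. The sketch below shows one way to do that; the function name is hypothetical and the two score lists are made-up values chosen to illustrate a stable and an unstable model, not figures from this benchmark.

```python
from statistics import mean


def summarize_runs(scores):
    """Summarize repeated-run error rates (in percent) for one model and language."""
    lowest, highest = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "min": lowest,
        "max": highest,
        "range": round(highest - lowest, 1),  # spread between best and worst run
    }


# Made-up illustrative values, NOT benchmark results.
stable_runs = [53.0, 53.1, 53.1, 53.2, 53.1]       # near-identical across runs
unstable_runs = [62.0, 91.0, 118.0, 75.0, 104.0]   # swings by tens of points

stable = summarize_runs(stable_runs)
unstable = summarize_runs(unstable_runs)
```

A mean alone would hide the difference between these two cases: the second list averages 90.0 but ranges over 56 points, so any single run of that model says little about the next one.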

04 Methodology

Audio. Each model received the merged conversation mix — overlapping speech included — at 16 kHz mono, and was scored against the human-verified transcript of the full conversation.

Models. Each provider's current flagship transcription API, called with its documented language hint: AssemblyAI model routing with universal-3-pro and universal-2; ElevenLabs scribe_v2; Deepgram nova-3; OpenAI gpt-4o-transcribe, gpt-4o-transcribe-diarize, gpt-4o-mini-transcribe (2025-12-15), and whisper-1; Cartesia ink-whisper; Google Gemini 3.1 Pro and 3.5 Flash; Microsoft MAI-Transcribe-1.5; Mistral Voxtral Small 24B; open-weights Whisper large-v3 (via Groq); and NVIDIA Nemotron 3 Nano Omni 30B (open weights, via OpenRouter; added after the original evaluation and reported from a single run rather than a five-run mean). Multimodal chat models were prompted for verbatim transcription at temperature 0. Audio exceeding provider limits was split into 5–10 minute segments and rejoined. Sarvam saarika:v2.5 (batch API) supports Indic languages and English only, so it is scored on Hinglish alone. AssemblyAI's documentation says this routed configuration uses Universal-3 Pro for languages it supports and falls back to Universal-2 for all other languages; the AssemblyAI Spanish result is therefore a Universal-3 Pro score, while the Arabic, Hinglish/Hindi, and Japanese results are Universal-2 fallback scores rather than native Universal-3 Pro scores.
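The segment-splitting step can be sketched as follows. The source does not say how cut points were chosen, so this assumes equal-length segments under a 10-minute cap and a plain space-joined rejoin; `segment_bounds` and `rejoin` are hypothetical helper names, and slicing the audio itself is left out.

```python
import math


def segment_bounds(duration_s, max_segment_s=600.0):
    """Return (start, end) times in seconds for equal-length segments.

    Assumption: segments are equal in length and no longer than max_segment_s.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    count = math.ceil(duration_s / max_segment_s)
    step = duration_s / count
    return [
        # The last segment ends exactly at the file duration to avoid float drift.
        (i * step, duration_s if i == count - 1 else (i + 1) * step)
        for i in range(count)
    ]


def rejoin(segment_texts):
    """Concatenate per-segment transcripts in order, skipping empty segments."""
    return " ".join(t.strip() for t in segment_texts if t.strip())


# A 25-minute recording becomes three segments of 500 seconds each.
bounds = segment_bounds(1500.0)
```

With a 10-minute cap, equal splitting keeps every segment of a longer file between 5 and 10 minutes, which matches the range stated above. A cut can still land mid-word, so a real pipeline might prefer to cut at silences.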

Scoring. References and hypotheses are normalized identically before comparison: timestamps, speaker labels, and annotation tags stripped; Unicode NFKC; casefolding; punctuation removed; Arabic diacritics and orthographic variants folded. WER and CER are computed with jiwer. CER is the headline metric for Japanese; WER elsewhere. The full benchmark was run five times per model, and reported figures are means across runs; the exception is Nemotron 3 Nano Omni, which is reported from a single run (see Models).
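A minimal sketch of a normalization pass along these lines, using only the Python standard library. Several details are assumptions rather than the benchmark's actual rules: the bracketed annotation-tag syntax, the exact set of Arabic diacritics, and the particular orthographic folds (alef variants, alef maqsura, ta marbuta). Timestamp and speaker-label stripping is omitted because the source does not describe their format.

```python
import re
import unicodedata

# Assumed annotation syntax: [laughs], <noise>. The real tag format is not given.
TAGS = re.compile(r"\[[^\]]*\]|<[^>]*>")

# Arabic short vowels and related marks (U+064B..U+0652), dagger alef, tatweel.
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumed orthographic folds; the benchmark's exact list is not stated.
ARABIC_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def normalize(text):
    """Normalize a transcript before scoring; apply to reference and hypothesis alike."""
    text = TAGS.sub(" ", text)
    text = unicodedata.normalize("NFKC", text)
    text = text.casefold()
    text = ARABIC_DIACRITICS.sub("", text)
    text = text.translate(ARABIC_VARIANTS)
    # Remove every character in a Unicode punctuation category (P*).
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())  # collapse whitespace
```

Punctuation is matched by Unicode category, not an ASCII list, so marks such as the inverted Spanish exclamation mark and the Devanagari danda are removed while Devanagari vowel signs, which are combining marks, are left intact. The normalized strings would then be passed to the scorer, for example `jiwer.wer(reference, hypothesis)`.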

Caveats. Part of the Whisper-family Hinglish gap is script convention rather than misrecognition, though their error rates remain far higher under any convention. Merged audio includes crosstalk, which penalizes all models equally.

This benchmark is built on a small public sample of Specific's human-verified conversation data. Reach out for the full dataset or evaluation runs on your own models.