Speech-to-Text on Real Multilingual Conversation
- ElevenLabs leads overall and is the only model under 10% WER on code-switched Hinglish (8.1%).
- Dialectal Arabic breaks everyone: even the best model gets more than half the words wrong (53.1% WER).
- Half the field fails catastrophically: loops, invented content, dropped audio.
01 Results
Error rate of each model against human-verified transcripts — lower is better. Word error rate (WER) throughout; character error rate (CER) for Japanese, which has no word boundaries.
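| Model | Spanish (WER %) | Arabic (WER %) | Hinglish (WER %) | Japanese (CER %) |
|---|---|---|---|---|
| ElevenLabs scribe_v2 | 2.9🏆 | 53.6 | 8.1🏆 | 19.9 |
| Deepgram nova-3 | 9.0 | 53.1🏆 | 19.3 | 24.9 |
| Google Gemini 3.5 Flash | 8.3 | 54.8 | 35.0 | 22.2 |
|  | 9.7 | 54.9 | 52.6 | 23.4 |
| AssemblyAI (universal-3-pro / universal-2 routing) | 11.5 | 79.8 | 53.3 | 16.6🏆 |
| Microsoft MAI-Transcribe-1.5 | 8.9 | 96.4 | 52.3 | 22.0 |
|  | 13.8 | 77.3 | 75.7 | 19.7 |
|  | 15.4 | 83.9 | 44.8 | — |
|  | 18.0 | 99.4 | 94.5 | 23.4 |
|  | 15.1 | 91.2 | 91.3 | 38.2 |
| Sarvam saarika:v2.5 | — | — | 63.6 | — |
|  | 11.4 | 97.2 | 108.4* | 58.6 |
| Mistral Voxtral Small 24B | 10.2 | 130.1* | 182.3* | 53.1 |
| OpenAI gpt-4o-mini-transcribe | 12.1 | 384.9* | 63.0 | 23.2 |
| NVIDIA Nemotron 3 Nano Omni 30B | 749.1* | 2291.5* | 678.7* | 1281.5* |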
* Above 100%: WER = (substitutions + deletions + insertions) ÷ words actually spoken. Hallucinated content counts as insertions, so a model that invents enough text makes more errors than there are real words.
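A minimal sketch of that arithmetic, not the benchmark's own scoring code (the Methodology section says scoring uses jiwer). The hypothetical helpers `edit_distance`, `wer`, and `cer` below find the minimum number of substitutions, deletions, and insertions and divide by the reference length. The example strings are invented for illustration, and `cer` here counts every character, including spaces, which is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences.

    Substitutions, deletions, and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Errors divided by the number of reference tokens; can exceed 1.0."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate on whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, for scripts without word boundaries."""
    return error_rate(list(ref), list(hyp))


# Four words were spoken; the hypothesis gets them right, then invents six more.
spoken = "call me back tomorrow"
transcribed = "call me back tomorrow thank you thank you thank you"
print(f"{wer(spoken, transcribed):.0%}")  # 6 insertions / 4 reference words
```

Every spoken word is transcribed correctly here, yet the score is 150%, because the six invented words are all counted as insertions against a four-word reference.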
AssemblyAI was run through model routing with universal-3-pro and universal-2. Universal-3 Pro supports English, Spanish, Portuguese, French, German, and Italian; the Arabic, Hinglish/Hindi, and Japanese results should be read as Universal-2 fallback scores.
02 The Data
The recordings are natural conversations between native speakers who know each other — real calls between family members, speaking the way people actually speak: overlapping turns, interruptions, laughter, mid-sentence language switches, imperfect connections. Nothing is scripted, read aloud, or staged.
Ground truth is utterance-level transcription with timestamps and speaker labels, produced and verified by native speakers. Code-switched speech is kept in its natural scripts — Hindi in Devanagari, English in Latin — and each recording carries speaker metadata such as first language, dialect, and age bracket. Conversations are available both as the merged mix benchmarked here and as speaker-isolated tracks.
| Language | Setting |
|---|---|
| Spanish | Mexican Spanish, family call |
| Arabic | Gulf dialect, family call |
| Hinglish | Hindi–English code-switching |
| Japanese | Casual conversation |
03 Findings
- Clean-language scores don't transfer. The same models that score 3–18% WER on Spanish score 53–385% on Gulf-dialect Arabic. Microsoft's MAI-Transcribe-1.5 — released days before this benchmark claiming the #1 spot on the Open ASR Leaderboard — scores a solid 8.9% on Spanish yet misidentifies the Arabic call entirely, transcribing it as German and English gibberish (96.4% WER, reproducible with both generic and region-specific language hints). Published benchmarks built on read speech and broadcast audio say very little about real conversation in lower-resource varieties.
- Code-switching is a separating dimension. Hinglish — how hundreds of millions of people actually speak — splits the field: ElevenLabs reaches 8.1% WER and Deepgram 19.3%, frontier multimodal models land mid-pack (Gemini 3.5 Flash at 35.0%), every Whisper-derived model sits between 76% and 108%, and Mistral's Voxtral hits 182% after generating twice as many words as were spoken. Even Sarvam — built specifically for Indian languages — lands at 63.6%. Whisper-family models and Sarvam also force all output into a single script, writing English words in Devanagari.
- Failure is catastrophic, not graceful. On Arabic, gpt-4o-mini-transcribe produced more than four times as many words as were actually spoken — page after page of invented content. Cartesia emitted repetition loops (one syllable repeated 70+ times) and transcribed only a quarter of the Japanese audio. gpt-4o-transcribe silently dropped the entire opening of one file. NVIDIA's Nemotron 3 Nano Omni — a reasoning multimodal model — is the starkest case: on Arabic it spends tens of thousands of reasoning tokens and returns no transcript at all, and when it does emit text it loops, producing 38,000 words where 1,664 were spoken (2,292% WER) and romanizing Japanese into Latin script (1,282% CER). In production, these failures would poison downstream data without raising an error. The instability is also run-to-run: across five repeated runs, dedicated transcription APIs return near-identical scores, while gpt-4o-family models swing by tens of points on the harder languages — the same audio, the same settings, a different transcript every time.
- Everyone struggles with casual Japanese. Fast, overlapping family conversation holds the best result (AssemblyAI via Universal-2 fallback) to 16.6% CER — roughly one error every six characters — despite Japanese being a well-resourced language.
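Because the catastrophic failures described above come back as normal-looking responses, a cheap sanity check on each transcript can catch the worst of them before they reach downstream data. The sketch below is an illustration, not part of the benchmark: `transcript_flags`, its thresholds, and the words-per-second heuristic are all assumptions that would need tuning per language (whitespace tokenization does not work for Japanese, for instance).

```python
from itertools import groupby


def longest_repeat_run(tokens):
    """Length of the longest run of identical consecutive tokens."""
    return max((sum(1 for _ in group) for _, group in groupby(tokens)), default=0)


def transcript_flags(text, audio_seconds, min_wps=0.5, max_wps=5.0, max_run=8):
    """Return a list of warning flags for one transcript.

    All thresholds are hypothetical defaults, not values from the benchmark.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    tokens = text.split()
    if not tokens:
        return ["empty"]  # e.g. a model that returned no transcript at all
    flags = []
    words_per_second = len(tokens) / audio_seconds
    if words_per_second > max_wps:
        flags.append("too_many_words")  # likely invented content
    if words_per_second < min_wps:
        flags.append("too_few_words")   # likely dropped audio
    if longest_repeat_run(tokens) > max_run:
        flags.append("repetition_loop")
    return flags


# One token repeated 70 times over a minute of audio trips the loop check.
print(transcript_flags("ha " * 70, audio_seconds=60.0))
```

A check like this cannot measure accuracy without a reference transcript; it only marks outputs whose length or repetition pattern is implausible for the audio duration.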
04 Methodology
Audio. Each model received the merged conversation mix — overlapping speech included — at 16 kHz mono, and was scored against the human-verified transcript of the full conversation.
Models. Each provider's current flagship transcription API, called with its documented language hint: AssemblyAI model routing with universal-3-pro and universal-2; ElevenLabs scribe_v2; Deepgram nova-3; OpenAI gpt-4o-transcribe, gpt-4o-transcribe-diarize, gpt-4o-mini-transcribe (2025-12-15), and whisper-1; Cartesia ink-whisper; Google Gemini 3.1 Pro and 3.5 Flash; Microsoft MAI-Transcribe-1.5; Mistral Voxtral Small 24B; open-weights Whisper large-v3 (via Groq); and NVIDIA Nemotron 3 Nano Omni 30B (open weights, via OpenRouter; added after the original evaluation and reported from a single run rather than a five-run mean). Multimodal chat models were prompted for verbatim transcription at temperature 0. Audio exceeding provider limits was split into 5–10 minute segments and rejoined. Sarvam saarika:v2.5 (batch API) supports Indic languages and English only, so it is scored on Hinglish alone. AssemblyAI's documentation says this routed configuration uses Universal-3 Pro for the languages it supports and falls back to Universal-2 for all others; the AssemblyAI Spanish result is therefore a Universal-3 Pro score, while the Arabic, Hinglish/Hindi, and Japanese results are Universal-2 fallback scores rather than native Universal-3 Pro scores.
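The source does not say how segment boundaries were chosen for over-long audio. One simple scheme, shown purely as an assumption, is to cut the file into equal parts no longer than 10 minutes; for any file longer than 10 minutes the parts then also come out longer than 5 minutes. `plan_segments` is a hypothetical helper that returns boundaries in seconds, and the actual cutting and rejoining are left out.

```python
import math


def plan_segments(total_seconds, max_seconds=600.0):
    """Split a duration into equal segments of at most max_seconds each.

    Returns a list of (start, end) pairs in seconds. With n = ceil(total / max)
    equal parts, each part is longer than max * (n - 1) / n, so files longer
    than max_seconds always yield parts longer than max_seconds / 2.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    count = math.ceil(total_seconds / max_seconds)
    step = total_seconds / count
    bounds = [i * step for i in range(count)] + [total_seconds]
    return list(zip(bounds[:-1], bounds[1:]))


# A 25-minute call becomes three segments of 500 seconds each.
for start, end in plan_segments(25 * 60):
    print(start, end)
```

Fixed boundaries can fall in the middle of a word, so a real pipeline would likely want to move each cut to a nearby pause; that refinement is not shown here.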
Scoring. References and hypotheses are normalized identically before comparison: timestamps, speaker labels, and annotation tags are stripped; Unicode NFKC is applied; text is casefolded; punctuation is removed; and Arabic diacritics and orthographic variants are folded. WER and CER are computed with jiwer. CER is the headline metric for Japanese; WER elsewhere. The full benchmark was run five times per model and reported figures are means across runs, except for Nemotron 3 Nano Omni, which was added later and is reported from a single run.
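As a rough illustration of the normalization steps listed above, the sketch below uses only the Python standard library. It is not the benchmark's normalizer: the bracketed tag syntax, the exact set of Arabic diacritics and letter variants, and the choice to replace punctuation with a space are all assumptions, and timestamp and speaker-label stripping is omitted because the source does not describe their format. The normalized strings would then be passed to a scorer such as jiwer.

```python
import re
import unicodedata

# Assumed annotation syntax: anything in [square] or <angle> brackets.
TAG_RE = re.compile(r"\[[^\]]*\]|<[^>]*>")

# Assumed diacritic set: tanween, short vowels, shadda, sukun (U+064B-U+0652),
# superscript alef (U+0670), and tatweel (U+0640).
ARABIC_DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670\u0640]")

# Assumed orthographic folding: alef variants, alef maqsura, ta marbuta.
ARABIC_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})


def normalize(text):
    """Apply the same normalization to a reference or a hypothesis."""
    text = TAG_RE.sub(" ", text)                # strip annotation tags
    text = unicodedata.normalize("NFKC", text)  # fold width and compatibility forms
    text = text.casefold()
    text = ARABIC_DIACRITICS_RE.sub("", text)
    text = text.translate(ARABIC_VARIANTS)
    # Replace every Unicode punctuation character with a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(text.split())               # collapse whitespace


print(normalize("[laughs] Hola, ¿CÓMO estás?"))
```

Filtering by Unicode category rather than an ASCII punctuation list keeps Devanagari vowel signs (which are marks, not punctuation) while still removing the danda and Spanish inverted marks.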
Caveats. Part of the Whisper-family Hinglish gap is script convention rather than misrecognition, though their error rates remain far higher under any convention. Merged audio includes crosstalk, which penalizes all models equally.