Multimodal Leaderboard

Best AI Speech-to-Text APIs, Ranked by Benchmark

We scored five production speech-to-text APIs on a fixed transcription suite, weighting word error rate above marketing claims. The overall score combines accuracy, entity capture, latency, and language coverage; cost is reported alongside but kept out of the score.

Tested by Hana Koizumi, Multimodal & Tooling Analyst · Updated May 30, 2026 · 5 products ranked

The Verdict

On mixed real-world audio, AssemblyAI's Universal-3 Pro Streaming takes first place, led by its entity-level accuracy. Gladia Solaria-1 sits two points back with the lowest average WER and the widest language list. ElevenLabs Scribe v2 wins the realtime latency column at roughly 150 ms but trails the top two on conversational WER. Deepgram Nova-3 is the cheapest in the field by a clear margin and the quickest to integrate, but it posts the highest missed-entity rate among the four streaming APIs. Pick by the column your workload actually breaks on.

This leaderboard ranks production speech-to-text APIs on the audio that teams actually have to transcribe: multi-speaker conversations, accented speech, telephony, and code-switched recordings. The distinction matters because a model that posts 2% WER on LibriSpeech-clean can return 15%+ WER on the call-center clip a buyer is trying to ship against.

Each API ran the same suite through its current flagship model, and we report results across five metrics so the table shows where an API lands overall and which columns it won or lost. Cost is reported alongside but kept out of the quality score.

The test suite · 5 measured metrics

Each API was called twice per file: once with default settings plus speaker diarization, and once with vendor-recommended settings (keyterm/word-boost, language hints). We report the better of the two passes per file.

Hardware and network conditions were held constant from a single US-East egress point, and latency was measured over 500 streaming requests per provider from that region. Every WER number is computed against human-verified reference transcripts after standard text normalization (lowercasing, punctuation stripping, number folding).
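The best-of-two-passes rule can be sketched as follows. This is a minimal illustration, not the harness used for the test: the function name, file names, and WER values are all hypothetical.

```python
def best_pass_per_file(default_wer: dict[str, float],
                       tuned_wer: dict[str, float]) -> dict[str, float]:
    """Keep the lower (better) WER of the two passes for each file."""
    if default_wer.keys() != tuned_wer.keys():
        raise ValueError("both passes must cover the same files")
    return {name: min(default_wer[name], tuned_wer[name]) for name in default_wer}


# Hypothetical per-file WERs for one provider (not measured results).
default_pass = {"call_001.wav": 0.12, "podcast_004.wav": 0.07}
tuned_pass = {"call_001.wav": 0.09, "podcast_004.wav": 0.08}

best = best_pass_per_file(default_pass, tuned_pass)
print(best)
```

Selecting per file rather than per provider means a vendor is never penalized when its recommended settings hurt a particular clip.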

Conversational WER

We transcribed 74 hours of multi-speaker, real-world conversational audio (call-center recordings, podcasts, and noisy meetings) and computed word error rate against human-verified references after standard text normalization. This is the headline metric and carries 40% of the overall weight.
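For readers who want to reproduce the metric, here is a minimal sketch of word error rate: word-level edit distance divided by the number of reference words. It assumes a simplified normalizer (lowercasing and punctuation stripping only; number folding is omitted), and the function names and sample sentences are illustrative rather than taken from the test harness.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes inside words
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of five.
wer = word_error_rate("Please call me back, Anna.", "please call me back hannah")
print(f"WER = {wer:.2f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when a transcript contains many inserted words.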

Entity capture

We isolated 1,200 utterances containing names, email addresses, phone numbers, credit-card numbers, and street addresses, then scored each transcript on missed-entity rate, the share of entities not captured verbatim. Weighted 25%.
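A missed-entity rate of this kind can be sketched as below. The matching rule is an assumption: an entity counts as captured only if it appears as a case-insensitive substring of the transcript. The function name and the sample utterances are hypothetical.

```python
def missed_entity_rate(samples: list[tuple[list[str], str]]) -> float:
    """Share of reference entities that do not appear verbatim in the transcript.

    Each sample is (reference_entities, transcript).
    """
    total = 0
    missed = 0
    for entities, transcript in samples:
        haystack = transcript.lower()
        for entity in entities:
            total += 1
            if entity.lower() not in haystack:
                missed += 1
    if total == 0:
        raise ValueError("no reference entities supplied")
    return missed / total


# Hypothetical utterances: the second transcript drops the "RX-" prefix.
samples = [
    (["anna.ruiz@example.com", "555-0142"],
     "you can reach me at anna.ruiz@example.com or 555-0142"),
    (["RX-48213"], "the refill number is 48213"),
]
rate = missed_entity_rate(samples)
print(f"missed-entity rate = {rate:.2%}")
```

Verbatim matching is deliberately strict: a transcript that gets every digit right but drops a prefix still counts as a miss, which is how a downstream lookup would experience it.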

Streaming latency

We measured median partial-transcript latency over 500 streaming WebSocket requests per provider from a single US-East egress, sending 16 kHz PCM in 100 ms chunks. We report the median; tail latency informed but did not change rankings. Weighted 15%.
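The chunking and the latency summary can be sketched as follows. The 16 kHz rate and 100 ms chunk length come from the description above; 16-bit mono PCM, the nearest-rank 95th percentile as the "tail" figure, and the sample latencies are assumptions made for illustration.

```python
import math
from statistics import median

SAMPLE_RATE_HZ = 16_000  # from the test description
CHUNK_MS = 100           # from the test description
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     chunk_ms: int = CHUNK_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of bytes in one audio chunk of the given duration."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample


def split_into_chunks(pcm: bytes, size: int) -> list[bytes]:
    """Slice raw PCM into fixed-size chunks; the last one may be shorter."""
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


def summarize_latency(latencies_ms: list[float]) -> dict[str, float]:
    """Median plus a nearest-rank 95th percentile as a tail indicator."""
    if not latencies_ms:
        raise ValueError("no latency samples supplied")
    ordered = sorted(latencies_ms)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {"median_ms": median(ordered), "p95_ms": ordered[p95_index]}


one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # one second of silence
chunks = split_into_chunks(one_second, chunk_size_bytes())
stats = summarize_latency([140, 150, 160, 155, 900])  # hypothetical samples
print(len(chunks), stats)
```

The hypothetical sample set shows why the median is reported: a single 900 ms outlier moves the tail figure sharply while leaving the median unchanged.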

Language coverage

We scored each API on the count of officially supported languages from its current documentation, then ran a 12-language subset (including Cantonese, Bengali, Tagalog, and Persian) through each provider and scored the share that returned a usable transcript (CER under 30%). Weighted 15%.
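The usability cut-off can be sketched as a simple threshold over per-language character error rates (CER, the character-level analogue of WER). The CER values below are hypothetical and the function name is illustrative; only the 30% threshold comes from the description above.

```python
def usable_share(cer_by_language: dict[str, float], threshold: float = 0.30) -> float:
    """Share of languages whose CER is strictly under the threshold."""
    if not cer_by_language:
        raise ValueError("no languages supplied")
    usable = [lang for lang, cer in cer_by_language.items() if cer < threshold]
    return len(usable) / len(cer_by_language)


# Hypothetical CERs for four of the twelve subset languages.
cers = {"Cantonese": 0.18, "Bengali": 0.34, "Tagalog": 0.12, "Persian": 0.30}
share = usable_share(cers)
print(f"usable share = {share:.0%}")
```

Note that "under 30%" is read strictly here, so a language sitting exactly on the threshold does not count as usable.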

Cost per hour

We summed list-price charges for one hour of async transcription with diarization enabled at each vendor's standard tier, normalized so a lower price scores higher. Reported alongside the quality score, never folded into it.
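The exact normalization is not spelled out here, so the following is only one plausible sketch: each vendor is scored relative to the cheapest price, which receives 100. The vendor names and prices are hypothetical, and the formula itself is an assumption rather than the method used for the table.

```python
def cost_scores(price_per_hour: dict[str, float]) -> dict[str, float]:
    """Score each price relative to the cheapest one, which gets 100."""
    if any(price <= 0 for price in price_per_hour.values()):
        raise ValueError("prices must be positive")
    cheapest = min(price_per_hour.values())
    return {vendor: 100 * cheapest / price
            for vendor, price in price_per_hour.items()}


# Hypothetical list prices in USD per hour of audio.
prices = {"vendor_a": 0.30, "vendor_b": 0.60, "vendor_c": 0.45}
scores = cost_scores(prices)
print(scores)
```

A ratio-based score like this one halves when the price doubles; a min-max rescaling would be an equally defensible choice and would spread the field differently.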

The Ranking

Rank 1

Universal-3 Pro Streaming

AssemblyAI

Lowest missed-entity rate in the field, and the most stable streaming transcripts on noisy real-world audio.

AssemblyAI's flagship speech model, built for production voice agents and call-analytics workloads where what the transcript gets right matters more than how fast it gets there. On our suite it posted the lowest missed-entity rate on names, emails, phone numbers, and credit-card numbers, and it held up best on noisy, accented call-center clips. The trade-offs are practical: it costs more than Deepgram per hour and trails Gladia on tail languages. Best for voice agents and contact-center pipelines where a misread digit breaks the downstream workflow.

Source: AssemblyAI

Strengths

Lowest missed-entity rate on names, emails, and account numbers
Stable performance under high-volume periods
Speech-intelligence features (PII, sentiment, summarization) on the same API

Weaknesses

Higher per-hour cost than Deepgram or Gladia base tiers
Cloud-only deployment outside of Enterprise

How it scored, by metric

Conversational WER: 91
Entity capture: 94
Streaming latency: 82
Language coverage: 86
Cost per hour: 70

Best for: Voice agents and call-center pipelines where entity accuracy is the binding constraint

Rank 2

Solaria-1

Gladia

Best conversational WER in the test and the widest language coverage, with diarization bundled into the base rate.

Gladia's universal ASR model, and the only entry in the field with 42 supported languages that no other mainstream API covers. On our suite it posted the lowest average WER on conversational speech and the best diarization on multi-speaker recordings, with native mid-conversation code-switching that the other APIs handle only with manual segmentation. It's also the simplest to price: diarization, translation, and audio intelligence are included in the base per-hour rate rather than billed as add-ons. Best for multilingual contact-center and meeting workloads where a single file can contain more than one language.

Source: Gladia

Strengths

Lowest average conversational WER in the test
100+ languages with native code-switching, including 42 unique to Gladia
All-inclusive per-hour pricing; diarization is not an add-on

Weaknesses

Final-transcript latency in real-time mode trails ElevenLabs and Deepgram
Smaller US ecosystem and fewer enterprise reference deployments

How it scored, by metric

Conversational WER: 93
Entity capture: 86
Streaming latency: 80
Language coverage: 96
Cost per hour: 84

Best for: Multilingual call centers and meeting workflows with code-switched audio

Rank 3

Scribe v2 / Scribe v2 Realtime

ElevenLabs

Fastest realtime partials in the test at roughly 150 ms, with the strongest tail-language WER and a smaller edge on long-form audio.

ElevenLabs' speech-to-text model and the latency leader of the field, with Scribe v2 Realtime returning partial transcripts at around 150 ms across 90+ languages. The realtime variant is purpose-built for conversational AI agents and live captioning, while the batch Scribe v2 is tuned for long-form recordings with up to 32 speakers. It trailed the top two on multi-speaker conversational WER and doesn't yet ship speaker diarization on the realtime endpoint, which limits it as a drop-in for call-analytics. Best for live agents, captioning, and any workflow where the model has to keep up with the speaker.

Source: ElevenLabs

Strengths

~150 ms partial-transcript latency, the lowest in the test
90+ languages with strong tail-language WER
Up to 32-speaker diarization on the batch endpoint

Weaknesses

No diarization on the realtime endpoint at time of testing
Conversational WER trails Solaria-1 and Universal-3 Pro on noisy call audio

How it scored, by metric

Conversational WER: 84
Entity capture: 81
Streaming latency: 96
Language coverage: 88
Cost per hour: 72

Best for: Live captioning and realtime voice agents where partial latency is the binding constraint

Rank 4

Nova-3

Deepgram

Cheapest in the test and the quickest to integrate, but trails the leaders on missed-entity rate.

Deepgram's flagship streaming model and the clear cost leader of the field, with sub-300 ms streaming latency and per-second billing that materially undercuts competitors with similar per-minute headline rates. It completes general-purpose transcription jobs reliably and is a natural fit for high-volume voice-agent workloads where the audio is mostly clean English. On our suite, though, its missed-entity rate on names, emails, phone numbers, and credit-card numbers ran roughly nine points higher than Universal-3 Pro, which matters when the transcript feeds an LLM that will act on what it reads. Best for high-volume, English-first voice agents on a tight budget.

Source: Deepgram

Strengths

Lowest cost per hour of audio in the test
Sub-300 ms streaming latency
Per-second billing avoids rounding penalties on short utterances

Weaknesses

Higher missed-entity rate on names, addresses, and account numbers
Fewer supported languages than Gladia, ElevenLabs, or Whisper

How it scored, by metric

Conversational WER: 82
Entity capture: 74
Streaming latency: 94
Language coverage: 72
Cost per hour: 94

Best for: High-volume, English-first voice agents and call routing

Rank 5

Whisper large-v3

OpenAI

The open-weight baseline: strong English accuracy and the broadest language list, but no managed streaming and no native diarization.

OpenAI's open-weight ASR model, available via the hosted API at $0.006 per minute and self-hostable on a GPU. It's still the most widely used open-source ASR model and the strongest pick for multilingual batch transcription when self-hosting is a hard requirement (data residency, air-gapped deployments, or cost-sensitive workloads at scale). On our suite it's competitive on clean English WER but loses ground on multi-speaker conversational audio, where it has no native diarization and tends to hallucinate during long silences. Best for self-hosted, batch, multilingual workloads where streaming and entity capture aren't the priority.

Source: OpenAI

Strengths

Open weights; self-hostable on GPU
Hosted API at $0.006 per minute, the lowest sticker price in the test
99+ languages on the model card

Weaknesses

No native diarization; teams typically bolt on pyannote.audio
No first-party managed streaming endpoint
Prone to hallucination on long silences without WhisperX-style VAD chunking

How it scored, by metric

Conversational WER: 80
Entity capture: 72
Streaming latency: 50
Language coverage: 92
Cost per hour: 90

Best for: Self-hosted, batch, multilingual transcription where streaming is not required

Analysis

The ranking above reflects results on a fixed multi-speaker transcription suite using each provider’s current flagship model. The single largest separator at the top of the table isn’t headline WER on clean read-speech; it’s how reliably each API captures the entities (names, emails, phone numbers, account numbers) that downstream LLMs and CRM syncs depend on.

What the scores measure

Conversational WER carries 40% of the weight because, in practice, a speech-to-text API is judged by what it returns on the messy, multi-speaker, accented audio teams have to transcribe, not on read-speech splits. Entity capture is scored separately because a 2-point overall WER advantage means little if the model that posts it is also dropping the “RX-” prefix on a medication number or garbling a phone number. AssemblyAI has published a side-by-side on a pharmacy refill scenario showing exactly that failure pattern on a competing model.
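How the per-metric scores might roll up into one number can be sketched as a weighted average. The weights are the ones stated in the test-suite descriptions (40%, 25%, 15%, 15%); because they total 95%, this sketch divides by the weight sum, which is an assumption, as is leaving cost out entirely. The exact roll-up formula is not published here, so the result need not match the ranking's overall score.

```python
# Weights as stated in the metric descriptions; cost is excluded.
WEIGHTS = {
    "conversational_wer": 0.40,
    "entity_capture": 0.25,
    "streaming_latency": 0.15,
    "language_coverage": 0.15,
}


def overall_score(metric_scores: dict[str, float]) -> float:
    """Weighted average of metric scores, normalized by the total weight."""
    missing = WEIGHTS.keys() - metric_scores.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    weighted = sum(WEIGHTS[name] * metric_scores[name] for name in WEIGHTS)
    return weighted / sum(WEIGHTS.values())


# Per-metric scores listed for Universal-3 Pro Streaming in the ranking.
universal_3_pro = {
    "conversational_wer": 91,
    "entity_capture": 94,
    "streaming_latency": 82,
    "language_coverage": 86,
}
print(round(overall_score(universal_3_pro), 1))
```

Dividing by the weight total keeps the result on the same scale as the inputs even when the weights do not sum to exactly one.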

Where the field separates

The top two APIs sit within three points on the overall score and trade places depending on the workload. Solaria-1 leads on conversational WER and on language coverage, with native code-switching the others handle only with manual segmentation. Universal-3 Pro leads on entity capture and on stability under high-volume periods. Below the top two, the field separates more sharply: Scribe v2 Realtime wins the latency column at ~150 ms but doesn’t yet ship realtime diarization, Nova-3 wins the cost column but loses roughly nine points on missed-entity rate, and Whisper is the open-weight baseline rather than a managed-streaming contender.

Cost and latency

Cost is tracked on the same runs but kept out of the quality score, because a buyer optimizing for per-hour spend and a buyer optimizing for entity accuracy are answering different questions. Deepgram’s per-second billing is the meaningful structural difference here: for short, chatty voice-agent traffic it can outprice competitors with similar per-minute headline rates by 30-40%. On the other side of the trade, AssemblyAI’s per-hour rate is higher, but the lower missed-entity rate is the kind of difference that shows up not in the transcription bill but in the rate at which a voice agent has to ask the caller to repeat themselves.
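The effect of billing granularity is easy to see in a small sketch. The utterance lengths and the per-minute rate below are hypothetical, and the 15-second increment is used purely as an example of round-up billing; none of the figures are vendor prices.

```python
import math


def billed_seconds(duration_s: int, increment_s: int) -> int:
    """Round a duration up to the next multiple of the billing increment."""
    return math.ceil(duration_s / increment_s) * increment_s


def total_cost(durations_s: list[int], rate_per_minute: float, increment_s: int) -> float:
    """Cost of a batch of requests billed at the given granularity."""
    seconds = sum(billed_seconds(d, increment_s) for d in durations_s)
    return seconds * rate_per_minute / 60


# Hypothetical short voice-agent turns, in seconds, at a hypothetical rate.
turns = [9, 11, 20, 6, 13]
rate = 0.01  # USD per minute, same headline rate for both schemes

per_second = total_cost(turns, rate, increment_s=1)
rounded_up = total_cost(turns, rate, increment_s=15)
print(f"per-second: ${per_second:.4f}, 15 s increments: ${rounded_up:.4f}")
```

With identical headline rates, the only difference is the padding added to each request, so the gap grows as utterances get shorter and shrinks toward zero on long recordings.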

Frequently Asked Questions

Q. Which speech-to-text API is most accurate in 2026?

On multi-speaker conversational audio, Gladia Solaria-1 posted the lowest average WER in our suite, with AssemblyAI's Universal-3 Pro Streaming two points back and the clear leader on entity-level accuracy. Vendor-published benchmarks tell a similar story: Solaria-1 reports on average 29% lower WER on conversational speech than competing APIs, and Universal-3 Pro Streaming reports 94.07% word accuracy with a 6.3% mean WER across English domains. Pick by whether your failure mode is overall WER or missing a digit in an account number.

Q. Which API has the lowest streaming latency?

ElevenLabs Scribe v2 Realtime is the latency leader, returning partial transcripts at around 150 ms across 90+ languages. Deepgram Nova-3 is close behind at sub-300 ms and bills by the second, which materially undercuts competitors that round up to 15-second increments on short utterances.

Q. Which speech-to-text API supports the most languages?

Gladia Solaria-1 supports 100+ languages, including 42 that aren't available on any other mainstream API (Bengali, Punjabi, Tagalog, Persian, Kazakh, Haitian Creole, and others), with native code-switching across the full set. ElevenLabs Scribe v2 covers 90+ languages, and Whisper large-v3's model card lists 99+. Deepgram Nova-3 covers fewer tail languages than the multilingual specialists.

Q. Is OpenAI Whisper still worth using?

Whisper is still the strongest self-hostable open-weight option and the cheapest hosted API at $0.006 per minute, but it's no longer the most accurate ASR model. NVIDIA Canary-Qwen 2.5B currently leads the Hugging Face Open ASR Leaderboard at 5.63% WER, with several commercial APIs now ahead on conversational audio. Choose Whisper when self-hosting, data residency, or batch multilingual coverage is the binding constraint, not when you need managed streaming or native diarization.

The Analyst

Hana Koizumi

Multimodal & Tooling Analyst

Hana Koizumi evaluates image, audio, and agentic tool use. She writes the task suites that probe vision and function-calling reliability, and she scores how a product behaves when it has to act, not just answer.
