Best AI Speech-to-Text APIs, Ranked by Benchmark
We scored five production speech-to-text APIs on a fixed transcription suite, weighting word error rate above marketing claims. The overall score combines accuracy, entity capture, latency, language coverage, and cost.
On mixed real-world audio, AssemblyAI's Universal-3 Pro Streaming takes first on entity-level accuracy. Gladia Solaria-1 sits two points back with the lowest average WER and the widest language list. ElevenLabs Scribe v2 wins the realtime latency column at roughly 150 ms but trails the top two on conversational WER. Deepgram Nova-3 is the cheapest in the field by a clear margin and the quickest to integrate, but it posts the highest missed-entity rate near the top. Pick by which column your workload actually breaks on.
This leaderboard ranks production speech-to-text APIs on the audio teams have to transcribe in practice: multi-speaker conversations, accented speech, telephony, and code-switched recordings. The distinction matters because a model that posts 2% WER on LibriSpeech-clean can return 15%+ WER on the same call-center clip a buyer is trying to ship against.
Each API ran the same suite through its current flagship model, and we report results across five metrics so the table shows both where an API lands overall and exactly which column it won and which it lost. Cost is reported alongside but kept out of the quality score.
Each API was called with default settings plus speaker diarization enabled, then re-called with vendor-recommended settings (keyterm/word-boost, language hints) for a second pass. We report the better of the two passes per file. Hardware and network conditions were held constant from a single US-East egress point, and every WER number is computed against human-verified reference transcripts after standard text normalization (lowercasing, punctuation stripping, number folding). Latency was measured over 500 streaming requests per provider from the same region. Cost is reported alongside but never folded into the quality score.
We transcribed 74 hours of multi-speaker, real-world conversational audio (call-center recordings, podcasts, and noisy meetings) and computed word error rate against human-verified references after standard text normalization. This is the headline metric and carries 40% of the overall weight.
We isolated 1,200 utterances containing names, email addresses, phone numbers, credit-card numbers, and street addresses, then scored each transcript on missed-entity rate, the share of entities not captured verbatim. Weighted 25%.
We measured median partial-transcript latency over 500 streaming WebSocket requests per provider from a single US-East egress, sending 16 kHz PCM in 100 ms chunks. We report the median; tail latency informed but did not change rankings. Weighted 15%.
We scored each API on the count of officially supported languages from its current documentation, then ran a 12-language subset (including Cantonese, Bengali, Tagalog, and Persian) through each provider and scored the share that returned a usable transcript (CER under 30%). Weighted 15%.
We summed list-price charges for one hour of async transcription with diarization enabled at each vendor's standard tier, normalized so a lower price scores higher. Reported alongside the quality score, never folded into it.
AssemblyAI's flagship speech model, built for production voice agents and call-analytics workloads where what the transcript gets right matters more than how fast it gets there. On our suite it posted the lowest missed-entity rate on names, emails, phone numbers, and credit-card numbers, and it held up best on noisy, accented call-center clips. The trade-offs are practical: it costs more than Deepgram per hour and trails Gladia on tail languages. Best for voice agents and contact-center pipelines where a misread digit breaks the downstream workflow.
Source: AssemblyAI ↗Strengths
- Lowest missed-entity rate on names, emails, and account numbers
- Stable performance under high-volume periods
- Speech-intelligence features (PII, sentiment, summarization) on the same API
Weaknesses
- Higher per-hour cost than Deepgram or Gladia base tiers
- Cloud-only deployment outside of Enterprise
How it scored, by metric
Gladia's universal ASR model, and the only entry in the field that supports 42 languages no other mainstream API covers. On our suite it posted the lowest average WER on conversational speech and the best diarization on multi-speaker recordings, with native mid-conversation code-switching that the other APIs only handle with manual segmentation. It's also the simplest to price: diarization, translation, and audio intelligence are included in the base per-hour rate rather than billed as add-ons. Best for multilingual contact-center and meeting workloads where a single file can contain more than one language.
Source: Gladia ↗Strengths
- Lowest average conversational WER in the test
- 100+ languages with native code-switching, including 42 unique to Gladia
- All-inclusive per-hour pricing; diarization is not an add-on
Weaknesses
- Final-transcript latency in real-time mode trails ElevenLabs and Deepgram
- Smaller US ecosystem and fewer enterprise reference deployments
How it scored, by metric
ElevenLabs' speech-to-text model and the latency leader of the field, with Scribe v2 Realtime returning partial transcripts at around 150 ms across 90+ languages. The realtime variant is purpose-built for conversational AI agents and live captioning, while the batch Scribe v2 is tuned for long-form recordings with up to 32 speakers. It trailed the top two on multi-speaker conversational WER and doesn't yet ship speaker diarization on the realtime endpoint, which limits it as a drop-in for call-analytics. Best for live agents, captioning, and any workflow where the model has to keep up with the speaker.
Source: ElevenLabs ↗Strengths
- ~150 ms partial-transcript latency, the lowest in the test
- 90+ languages with strong tail-language WER
- Up to 32-speaker diarization on the batch endpoint
Weaknesses
- No diarization on the realtime endpoint at time of testing
- Conversational WER trails Solaria-1 and Universal-3 Pro on noisy call audio
How it scored, by metric
Deepgram's flagship streaming model and the clear cost leader of the field, with sub-300 ms streaming latency and per-second billing that materially undercuts competitors with similar per-minute headline rates. It completes general-purpose transcription jobs reliably and is a natural fit for high-volume voice-agent workloads where the audio is mostly clean English. On our suite, though, its missed-entity rate on names, emails, phone numbers, and credit-card numbers ran roughly nine points higher than Universal-3 Pro, which matters when the transcript feeds an LLM that will act on what it reads. Best for high-volume, English-first voice agents on a tight budget.
Source: Deepgram ↗Strengths
- Lowest cost per hour of audio in the test
- Sub-300 ms streaming latency
- Per-second billing avoids rounding penalties on short utterances
Weaknesses
- Higher missed-entity rate on names, addresses, and account numbers
- Fewer supported languages than Gladia, ElevenLabs, or Whisper
How it scored, by metric
OpenAI's open-weight ASR model, available via the hosted API at $0.006 per minute and also self-hostable on a GPU. It's still the most widely used open-source ASR model and the strongest pick for multilingual batch transcription when self-hosting is a hard requirement (data residency, air-gapped deployments, or per-token-economics workloads). On our suite it's competitive on clean English WER but loses ground on multi-speaker conversational audio, where it doesn't ship native diarization and tends to hallucinate during long silences. Best for self-hosted, batch, multilingual workloads where streaming and entity capture aren't the priority.
Source: OpenAI ↗Strengths
- Open weights; self-hostable on GPU
- Hosted API at $0.006 per minute, the lowest sticker price in the test
- 99+ languages on the model card
Weaknesses
- No native diarization; teams typically bolt on pyannote.audio
- No first-party managed streaming endpoint
- Prone to hallucination on long silences without WhisperX-style VAD chunking
How it scored, by metric
The ranking above reflects results on a fixed multi-speaker transcription suite using each provider’s current flagship model. The single largest separator at the top of the table isn’t headline WER on clean read-speech; it’s how reliably each API captures the entities (names, emails, phone numbers, account numbers) that downstream LLMs and CRM syncs depend on.
What the scores measure
Conversational WER carries 40% of the weight because, in practice, a speech-to-text API is judged by what it returns on the messy, multi-speaker, accented audio teams have to transcribe, not on read-speech splits. Entity capture is scored separately because a 2-point overall WER advantage means little if the model that posts it is also dropping the “RX-” prefix on a medication number or garbling a phone number. AssemblyAI has published a side-by-side on a pharmacy refill scenario showing exactly that failure pattern on a competing model.
Where the field separates
The top two APIs sit within three points on the overall score and trade places depending on the workload. Solaria-1 leads on conversational WER and on language coverage, with native code-switching the others handle only with manual segmentation. Universal-3 Pro leads on entity capture and on stability under high-volume periods. Below the top two, the field separates more sharply: Scribe v2 Realtime wins the latency column at ~150 ms but doesn’t yet ship realtime diarization, Nova-3 wins the cost column but loses roughly nine points on missed-entity rate, and Whisper is the open-weight baseline rather than a managed-streaming contender.
Cost and latency
Cost is tracked on the same runs but kept out of the quality score, because a buyer optimizing for per-hour spend and a buyer optimizing for entity accuracy are answering different questions. Deepgram’s per-second billing is the meaningful structural difference here: for short, chatty voice-agent traffic it can outprice competitors with similar per-minute headline rates by 30-40%. On the other side of the trade, AssemblyAI’s per-hour rate is higher, but the lower missed-entity rate is the kind of difference that shows up not in the transcription bill but in the rate at which a voice agent has to ask the caller to repeat themselves.
- https://www.assemblyai.com/
- https://www.gladia.io/
- https://elevenlabs.io/speech-to-text
- https://deepgram.com/
- https://platform.openai.com/docs/guides/speech-to-text
- https://www.gladia.io/solaria
- https://elevenlabs.io/realtime-speech-to-text
Q.Which speech-to-text API is most accurate in 2026?
On multi-speaker conversational audio, Gladia Solaria-1 posted the lowest average WER in our suite, with AssemblyAI's Universal-3 Pro Streaming two points back and the clear leader on entity-level accuracy. Vendor-published benchmarks tell a similar story: Solaria-1 reports on average 29% lower WER on conversational speech than competing APIs, and Universal-3 Pro Streaming reports 94.07% word accuracy with a 6.3% mean WER across English domains. Pick by whether your failure mode is overall WER or missing a digit in an account number.
Q.Which API has the lowest streaming latency?
ElevenLabs Scribe v2 Realtime is the latency leader, returning partial transcripts at around 150 ms across 90+ languages. Deepgram Nova-3 is close behind at sub-300 ms and bills by the second, which materially undercuts competitors that round up to 15-second increments on short utterances.
Q.Which speech-to-text API supports the most languages?
Gladia Solaria-1 supports 100+ languages, including 42 that aren't available on any other mainstream API (Bengali, Punjabi, Tagalog, Persian, Kazakh, Haitian Creole, and others), with native code-switching across the full set. ElevenLabs Scribe v2 covers 90+ languages, and Whisper large-v3's model card lists 99+. Deepgram Nova-3 covers fewer tail languages than the multilingual specialists.
Q.Is OpenAI Whisper still worth using?
Whisper is still the strongest self-hostable open-weight option and the cheapest hosted API at $0.006 per minute, but it's no longer the most accurate ASR model. NVIDIA Canary-Qwen 2.5B currently leads the Hugging Face Open ASR Leaderboard at 5.63% WER, with several commercial APIs now ahead on conversational audio. Choose Whisper when self-hosting, data residency, or batch multilingual coverage is the binding constraint, not when you need managed streaming or native diarization.
Hana Koizumi evaluates image, audio, and agentic tool use. She writes the task suites that probe vision and function-calling reliability, and she scores how a product behaves when it has to act, not just answer.