How we benchmark AI products.
Every leaderboard and comparison on Top AI Tracker comes out of the same process: a fixed battery of measured tests, scored on a published rubric, and re-run as the products change.
We do not rank on impressions and we do not run vendor demos. Each product faces the same set of repeatable tests built to isolate one quality at a time, and we score the measured results against a rubric we keep public. We use the products on the kind of work the buyer actually does, not on a staged showcase.
A single number can hide a lot, so we never publish one without showing the work behind it. Every leaderboard lists its exact tests in a "How We Tested" section, and every ranked product shows its score on each metric, not just its total. The metrics below are the spine of that process; the specific tasks vary by category, because benchmarking a coding model is not the same as benchmarking an image generator.
| Metric | How it is measured |
|---|---|
| Task completion | Each product runs the same fixed task set for its category three times in a sandboxed harness, and we score the share of tasks finished correctly end to end: not a single-shot output, but the whole job done with the tests passing. |
| Accuracy & reliability | We repeat the hardest tasks many times and record the share of runs that produce a correct result with no hand-holding. Run-to-run variance is recorded and penalized: a product that nails a demo once but drifts on the tenth run is marked down for it. |
| Speed | On a fixed workload we measure median time-to-first-token and mean sustained tokens-per-second over hundreds of requests sent from the same region at the same time of day, so a noisy network cannot flatter or punish a product. |
| Cost | We price a month of observed usage at list rates and normalize to cost per useful result, so a cheap product that needs five retries does not get to look like a bargain. Cost is reported next to the quality score, never folded into it. |
| Context handling | A needle-in-context probe places facts deep in long inputs and measures retrieval rate at increasing context lengths, isolating how far a product holds usable context before recall falls off. |
| Consistency over time | Because these products change weekly, we re-run the suite on each meaningful update and date every score. A pick can lose its place when a rival ships, and the date on the ranking advances when it does. |
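
Three of the measurements in the table reduce to simple arithmetic. The following is a minimal sketch of that arithmetic, not the Tracker's actual harness: the function names, input shapes, and the choice of population standard deviation as the run-to-run spread are all assumptions made for illustration.

```python
from statistics import mean, median, pstdev


def completion_rate(runs):
    """Return (mean completion rate, run-to-run spread).

    `runs` is a list of runs; each run is a list of booleans, one per task,
    True when the task finished correctly end to end.
    The spread is the population standard deviation of the per-run rates
    (an assumed choice; the source only says variance is recorded).
    """
    per_run = [sum(run) / len(run) for run in runs]
    return mean(per_run), pstdev(per_run)


def cost_per_useful_result(monthly_cost, useful_results):
    """Normalize a month of spend to cost per useful (correct) result."""
    if useful_results == 0:
        return float("inf")  # nothing useful was produced at any price
    return monthly_cost / useful_results


def median_ttft_ms(samples_ms):
    """Median time-to-first-token over a batch of requests, in milliseconds."""
    return median(samples_ms)


# Hypothetical data: three runs of a four-task set.
runs = [
    [True, True, False, True],
    [True, True, True, True],
    [True, False, False, True],
]
rate, spread = completion_rate(runs)
print(f"completion rate {rate:.2f}, run-to-run spread {spread:.3f}")

# A cheap product that succeeds once in five attempts versus a pricier one
# that succeeds nine times in ten (1,000 attempts each, made-up prices).
print(cost_per_useful_result(20.0, 200))
print(cost_per_useful_result(50.0, 900))
```

With these made-up figures the cheaper product costs more per useful result than the pricier one, which is the effect the Cost row describes.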
We weight the metrics toward what matters most for the category, publish the weighting with each ranking, then rank by the weighted total. Because every product is scored in every metric, a reader can see exactly where one won and where it lost.
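
As a rough illustration of ranking by weighted total, the sketch below combines per-metric scores with category weights and sorts the products. The product names, metric keys, 0-100 score scale, and weights are hypothetical, and the sketch assumes only quality metrics enter the total, since cost and latency are reported separately.

```python
def weighted_total(scores, weights):
    """Weighted average of per-metric scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("every product must be scored on every weighted metric")
    total_weight = sum(weights.values())
    return sum(scores[m] * weights[m] for m in weights) / total_weight


def rank(products, weights):
    """Return product names ordered from highest to lowest weighted total."""
    return sorted(
        products,
        key=lambda name: weighted_total(products[name], weights),
        reverse=True,
    )


# Hypothetical category weighting and per-metric scores.
weights = {"task_completion": 5, "accuracy": 3, "context": 2}
products = {
    "Product A": {"task_completion": 80, "accuracy": 90, "context": 60},
    "Product B": {"task_completion": 90, "accuracy": 70, "context": 70},
}

for name in rank(products, weights):
    print(name, weighted_total(products[name], weights), products[name])
```

Printing the per-metric scores next to the total keeps the breakdown visible: in this made-up case Product B leads overall on task completion and context, while Product A wins on accuracy.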
Cost and latency are tracked on the same runs but reported next to the quality score, never folded into it, because optimizing for spend and optimizing for capability are different questions. Nothing here is final: these products ship meaningful changes almost weekly, so every verdict is dated and the suite is re-run on each meaningful update. When a pick loses its spot to a rival, the date advances and the ranking says what changed.
We take no sponsorships and no payment for placement. A product cannot buy its way onto a leaderboard, buy a higher rank, or buy a better score. Every number reflects our testing and nothing else.
Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.
Devon Mizrahi measures what a model costs to run and how fast it answers. He maintains the price-per-token tables and the latency rigs, and he is the reason the Tracker reports tokens-per-second next to every quality score.
Hana Koizumi evaluates image and audio capabilities and agentic tool use. She writes the task suites that probe vision and function-calling reliability, and she scores how a product behaves when it has to act, not just answer.
Marcus Elwood benchmarks the assistants, IDE copilots, and writing tools people actually buy. He focuses on real-task throughput and the gap between a product's demo and its day-to-day behavior.