Coding Leaderboard

Best AI Coding Models, Ranked by Benchmark

We scored five frontier models on a fixed agentic-coding suite, weighting end-to-end task completion over single-shot code generation. The overall score combines task completion, edit accuracy, tool-use reliability, and single-shot generation; cost is measured on the same runs and reported separately.

Tested by Priya Raman, Lead Benchmark Analyst · Updated May 26, 2026 · 5 products ranked

The Verdict

For multi-file, tool-using software work, Claude Opus 4.7 finishes first on completion rate, with GPT-5.5 second overall and first on raw generation quality. Below the top two, the field is separated mostly by how reliably each model edits existing files rather than how well it writes new code in isolation.

This leaderboard ranks frontier models on agentic software work — the multi-step, multi-file jobs a coding assistant is actually asked to do in a real repository — rather than on isolated function-writing prompts. The distinction matters because a model that writes a clean function in a vacuum can still fail when it has to read a codebase, plan a change, edit several files, and run the tests until they pass.

Every model ran the identical suite under the same harness, with the same tools and the same temperature. We report the median of three runs and score each model on every metric below, so the table shows not just where a model lands overall but exactly where it won points and where it lost them.

The test suite · 5 measured metrics

Each model ran the same 220-task agentic suite three times in a sandboxed repository harness with a fixed toolset (read, write, run tests, search). The overall score is a weighted blend of the four quality metrics below: task completion (50%), edit accuracy (25%), tool-use reliability (15%), and single-shot generation (10%). Temperature was held at 0 where configurable, we report the median of the three runs, and runs that differed by more than 6 points were re-run. Cost, the fifth metric, is measured on the same runs but reported separately and is not folded into the quality score.
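The median-and-re-run rule can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function name `summarize_runs`, the sample scores, and the reading of "differed by more than 6 points" as the spread between the highest and lowest run are all assumptions.

```python
from statistics import median

RERUN_THRESHOLD = 6.0  # points, taken from the methodology text


def summarize_runs(scores):
    """Return (median score, needs_rerun) for one model's three runs.

    Assumption: runs "differ" when the spread between the highest and
    lowest score exceeds the threshold; the article does not define this.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three runs")
    spread = max(scores) - min(scores)
    return median(scores), spread > RERUN_THRESHOLD


# Hypothetical run scores, for illustration only.
stable = summarize_runs([91.0, 92.0, 93.5])  # spread of 2.5 points
noisy = summarize_runs([80.0, 85.0, 88.0])   # spread of 8.0 points
```

Reporting the median rather than the mean keeps a single outlying run from moving the published number, which is presumably why the two rules are paired.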

Task completion

We reverted 220 closed pull requests across 14 open-source repositories and gave each model the issue text plus repo access, then scored the share of tasks where the model's change made the repository's existing test suite pass with zero human edits. This is the headline metric and carries 50% of the overall weight.

Edit accuracy

On the subset of tasks that touch existing files, we diffed each model's output against the minimal correct patch and scored the ratio of necessary lines changed to total lines changed. A model that completes a task by rewriting half a file is penalized against one that makes the surgical change. Weighted 25%.
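The ratio described above might be computed along these lines. This is a sketch under stated assumptions: changed lines are represented as sets of `(file_path, line_number)` pairs, a line counts as necessary when it also appears in the minimal patch, and the function name `edit_accuracy` and the sample patches are hypothetical.

```python
def edit_accuracy(model_changed, minimal_changed):
    """Ratio of necessary changed lines to all lines the model changed.

    Both arguments are sets of (file_path, line_number) pairs. A line
    counts as necessary when it also appears in the minimal patch.
    """
    if not model_changed:
        return 0.0  # no edit at all; scored as zero by assumption
    necessary = model_changed & minimal_changed
    return len(necessary) / len(model_changed)


# Hypothetical patches, for illustration only.
minimal = {("app/auth.py", 41), ("app/auth.py", 42)}
surgical = {("app/auth.py", 41), ("app/auth.py", 42)}
sprawling = surgical | {("app/auth.py", n) for n in range(100, 106)}

surgical_score = edit_accuracy(surgical, minimal)    # 2 of 2 changed lines needed
sprawling_score = edit_accuracy(sprawling, minimal)  # 2 of 8 changed lines needed
```

Both hypothetical patches contain the fix, but the second touches six unrelated lines and so scores a quarter of what the first does.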

Tool-use reliability

We logged every tool call (read, write, run tests, search) across all runs and scored the rate of well-formed, correctly sequenced calls that did not error or loop. The score is valid calls divided by total calls, over roughly 9,000 calls per model. Weighted 15%.
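A rough sketch of the valid-over-total calculation is below. The article does not say how a loop is detected, so this example assumes a loop is the same call repeated three or more times in a row; it also leaves out sequencing checks. The `ToolCall` record, the `reliability` function, and the sample log are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str      # one of: read, write, run_tests, search
    args: str      # serialized arguments
    errored: bool  # True if the call was malformed or raised an error


def reliability(calls, loop_length=3):
    """Valid calls divided by total calls.

    Assumption: a loop is the same (tool, args) call repeated
    loop_length or more times in a row. Within such a streak, every
    call from the loop_length-th onward is counted as invalid.
    """
    if not calls:
        return 0.0
    valid = 0
    streak = 0
    previous = None
    for call in calls:
        key = (call.tool, call.args)
        streak = streak + 1 if key == previous else 1
        previous = key
        if not call.errored and streak < loop_length:
            valid += 1
    return valid / len(calls)


# Hypothetical log: one looping call and one errored call out of eight.
log = [
    ToolCall("read", "app/auth.py", False),
    ToolCall("search", "login", False),
    ToolCall("write", "app/auth.py", False),
    ToolCall("run_tests", "", False),
    ToolCall("run_tests", "", False),
    ToolCall("run_tests", "", False),  # third identical call in a row
    ToolCall("write", "app/auth.py", True),
    ToolCall("run_tests", "", False),
]
score = reliability(log)
```

Here six of the eight calls count as valid. A stricter or looser loop definition would change the figure, so the threshold is the main thing to pin down before comparing numbers.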

Single-shot generation

A held-out set of 60 isolated-function prompts with hidden unit tests, scored on first-attempt pass rate with no tool access and no retries — pure code generation in a vacuum. Weighted 10%.

Cost per task

We summed input and output tokens for every completed task at each vendor's list price and normalized so that a lower median cost-per-completed-task scores higher. Reported alongside the quality score, never folded into it.
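The cost calculation could look something like this. The article does not give the normalization formula, so this sketch assumes a simple one (cheapest median cost divided by each model's median cost, scaled to 100). The function names, token counts, and prices are hypothetical and are not vendor list prices.

```python
from statistics import median


def task_cost(input_tokens, output_tokens, input_price, output_price):
    """Dollar cost of one task; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_scores(median_costs):
    """Map each model's median cost per completed task to a 0-100 score.

    Assumption: score = 100 * cheapest / cost, so the cheapest model
    scores 100 and a model costing four times as much scores 25.
    """
    cheapest = min(median_costs.values())
    return {name: 100 * cheapest / cost for name, cost in median_costs.items()}


# Hypothetical completed tasks, for illustration only.
model_a = median([
    task_cost(200_000, 10_000, 5.0, 20.0),
    task_cost(300_000, 25_000, 5.0, 20.0),
    task_cost(400_000, 30_000, 5.0, 20.0),
])
model_b = median([
    task_cost(200_000, 10_000, 1.0, 4.0),
    task_cost(300_000, 50_000, 1.0, 4.0),
    task_cost(400_000, 60_000, 1.0, 4.0),
])
scores = cost_scores({"model_a": model_a, "model_b": model_b})
```

Only completed tasks enter the median, in line with the description above, so a model that fails cheaply does not earn a better cost score for it.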

The Ranking

Rank 1

Claude Opus 4.7

Anthropic

Highest end-to-end completion rate and the most reliable at editing existing files without collateral changes.

Anthropic's flagship model, built for long, tool-using agentic work. In our suite it was the model that most often carried a multi-file task from issue to passing tests without intervention, and it held its tool-call sequences together across repeated runs better than any other entry. The trade-offs are practical rather than about capability: it posts the slowest median latency in the top tier and the highest per-token output price, so it earns its rank on hard, multi-step jobs rather than on cost or speed. Best for agentic refactors and test-driven work; overkill for quick one-off generation.

Source: Anthropic ↗

Strengths

Top task-completion rate on multi-file work
Most accurate diffs; rarely rewrites unrelated code
Stable tool-call sequences across repeated runs

Weaknesses

Slower median latency than the mid-tier models
Premium price per million output tokens

How it scored, by metric

Task completion 92

Edit accuracy 94

Tool-use reliability 93

Single-shot generation 86

Cost per task 58

Best for: Agentic, multi-file refactors and test-driven work

Rank 2

GPT-5.5

OpenAI

Best single-shot generation quality; trails slightly on completion when a task requires many sequential edits.

OpenAI's general-purpose frontier model and the broadest performer in the field across languages and task types. It produced the strongest isolated-function generation in the test, which makes it the easiest pick for greenfield code where the job is to write something new rather than surgically edit something old. It slips behind on long edit chains, where it over-edits large existing files and occasionally retries tool calls in ways that inflate latency. Best for one-shot implementation and broad language coverage.

Source: OpenAI ↗

Strengths

Strongest isolated-function generation score
Broad language coverage

Weaknesses

More frequent over-edits on large existing files
Occasional tool-call retries inflate latency

How it scored, by metric

Task completion 89

Edit accuracy 84

Tool-use reliability 88

Single-shot generation 93

Cost per task 61

Best for: Greenfield code and one-shot implementation

Rank 3

Gemini 3.5 Pro

Google

Strong long-context performance; reads large repositories well but is more variable run-to-run.

Google DeepMind's long-context model, and the one that read very large repositories without truncation in our runs. That makes it a natural fit for sprawling codebases where the real constraint is fitting the relevant files into context at all. Its weakness is consistency: it posted the highest run-to-run variance in the top tier and faded on ambiguous, underspecified prompts where it had to infer intent. Best when the codebase is huge and the task is well specified.

Source: Google ↗

Strengths

Handles very large contexts without truncation
Competitive completion on well-specified tasks

Weaknesses

Highest run-to-run variance in the top tier
Weaker on ambiguous, underspecified prompts

How it scored, by metric

Task completion 85

Edit accuracy 82

Tool-use reliability 84

Single-shot generation 88

Cost per task 66

Best for: Working across large, sprawling codebases

Rank 4

DeepSeek-V4

DeepSeek

The best score-per-dollar in the test; completion is solid but tool-use reliability lags the leaders.

DeepSeek's cost-efficient frontier model and the clear value leader in the test, posting by far the best score-per-dollar. It completes standard CRUD-style tasks reliably, so it suits high-volume, cost-sensitive workloads where the work is routine and the budget is the binding constraint. Under the agentic harness, though, its tool-call reliability and large-file edit accuracy both trailed the leaders, so it is a weaker choice for long, intricate multi-file jobs. Best for cheap, high-throughput coding at scale.

Source: DeepSeek ↗

Strengths

Far lower cost per task than the top three
Solid completion on standard CRUD tasks

Weaknesses

Lower tool-call reliability under the agentic harness
Weaker edit accuracy on large files

How it scored, by metric

Task completion 80

Edit accuracy 74

Tool-use reliability 71

Single-shot generation 82

Cost per task 94

Best for: High-volume, cost-sensitive coding workloads

Rank 5

Qwen3-Coder

Alibaba

Capable open-weight option; competitive generation, but completion drops on long task chains.

Alibaba's open-weight coding model, and the only self-hostable entry in the field — the reason to choose it is deployment control, offline use, or data-residency needs rather than topping the leaderboard. Its isolated-function generation is competitive for its tier, but completion fell off past roughly ten sequential steps and its diff formatting was inconsistent, both of which cost it on long agentic chains. Best for self-hosted and air-gapped deployments where weights must stay in-house.

Source: Alibaba ↗

Strengths

Open weights; self-hostable
Good isolated-function generation for its tier

Weaknesses

Completion falls off past ~10 sequential steps
Inconsistent diff formatting

How it scored, by metric

Task completion 74

Edit accuracy 70

Tool-use reliability 68

Single-shot generation 81

Cost per task 88

Best for: Self-hosted deployments and offline workflows

Analysis

The ranking above reflects the median of three runs per model on a fixed agentic-coding suite. The single largest separator at the top of the table is not how well a model writes new code in isolation but how reliably it edits code that already exists.

What the scores measure

Completion rate carries half the weight because, in practice, a coding model is judged by whether the task is done and the tests pass, not by whether one function looked clean. Edit accuracy is scored separately so that a model that completes a task by rewriting half a file is penalized against one that makes the minimal correct change.

Where the field separates

The top two models sit about four points apart on the overall score when the published per-metric scores are blended at the stated weights (92.05 against 88.00), and they trade places depending on the task mix. Claude Opus 4.7 leads on multi-step completion and diff discipline; GPT-5.5 leads on single-shot generation. Below them, the gap widens around tool-use reliability rather than code quality: every model in the table can write a correct function, but fewer can run twenty correct tool calls in a row.
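To make the overall comparison concrete, the sketch below blends the per-metric scores published in the ranking at the stated weights (50/25/15/10). It is an illustration only: the published overall score may be computed from unrounded inputs, and the names `WEIGHTS`, `SCORES`, and `overall` are not from the test bench.

```python
# Weights in percent, as stated in the test-suite section.
WEIGHTS = {
    "task_completion": 50,
    "edit_accuracy": 25,
    "tool_use_reliability": 15,
    "single_shot_generation": 10,
}

# Per-metric scores as published in the ranking, in the same order as WEIGHTS.
SCORES = {
    "Claude Opus 4.7": (92, 94, 93, 86),
    "GPT-5.5": (89, 84, 88, 93),
    "Gemini 3.5 Pro": (85, 82, 84, 88),
    "DeepSeek-V4": (80, 74, 71, 82),
    "Qwen3-Coder": (74, 70, 68, 81),
}


def overall(metric_scores):
    """Weighted blend of the four quality metrics, on a 0-100 scale."""
    total = sum(w * s for w, s in zip(WEIGHTS.values(), metric_scores))
    return total / sum(WEIGHTS.values())


ranking = sorted(SCORES, key=lambda name: overall(SCORES[name]), reverse=True)
for name in ranking:
    print(f"{name}: {overall(SCORES[name]):.2f}")
```

Under these assumptions the blend reproduces the published order. Cost is deliberately left out of `WEIGHTS`, matching the article's choice to report it alongside the quality score rather than inside it.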

Cost and latency

Cost is tracked on the same runs but kept out of the quality score, because a buyer optimizing for spend and a buyer optimizing for capability are answering different questions. DeepSeek-V4 posts the best cost-per-task score in the table; the two leaders post the highest absolute quality scores at a premium price.

Frequently Asked Questions

Q. Which AI coding model finished first?

Claude Opus 4.7 finished first on the overall score, carried by the highest end-to-end task-completion rate on multi-file work and the most accurate diffs in the field. GPT-5.5 ranked second, about four points behind when the published per-metric scores are blended at the stated weights, and led on single-shot generation. The two trade places depending on whether a task is mostly writing new code or editing code that already exists.

Q. How were these coding models tested?

Each model ran the same 220-task agentic suite three times in a sandboxed repository harness with a fixed toolset (read, write, run tests, search). For the headline metric, we reverted 220 closed pull requests across 14 open-source repositories and scored the share of tasks where the model's change made the repository's existing test suite pass with zero human edits. We report the median of three runs and re-ran any run that differed from its siblings by more than 6 points.

Q. What is the cheapest coding model in the test?

DeepSeek-V4 posted the best cost-per-task result in the test, well ahead of the top three, and completes standard CRUD-style tasks reliably. The trade-off shows up under the agentic harness, where its tool-call reliability and large-file edit accuracy both trailed the leaders, so it fits high-volume, cost-sensitive work better than long, intricate multi-file jobs.

Q. Is there an open-weight model here for self-hosting?

Qwen3-Coder is the only self-hostable entry in the field, which is the reason to choose it when deployment control, offline use, or data residency is the binding constraint. Its isolated-function generation is competitive for its tier, but task completion fell off past roughly ten sequential steps and its diff formatting was inconsistent, both of which cost it on long agentic chains.

The Analyst

Priya Raman

Lead Benchmark Analyst

Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.
