Best AI Coding Models, Ranked by Benchmark
We scored five frontier models on a fixed agentic-coding suite, weighting end-to-end task completion over single-shot code generation. The overall score combines task completion, edit accuracy, tool-use reliability, and isolated-function generation; cost is measured on the same runs but reported separately.
For multi-file, tool-using software work, Claude Opus 4.7 finishes first on completion rate, with GPT-5.5 a close second overall and the leader on raw generation quality. Below the top two, the field is separated mostly by how reliably each model edits existing files rather than how well it writes new code in isolation.
This leaderboard ranks frontier models on agentic software work — the multi-step, multi-file jobs a coding assistant is actually asked to do in a real repository — rather than on isolated function-writing prompts. The distinction matters because a model that writes a clean function in a vacuum can still fail when it has to read a codebase, plan a change, edit several files, and run the tests until they pass.
Every model ran the identical suite under the same harness, with the same tools and the same temperature. We report the median of three runs and score each model on every metric below, so the table shows not just where a model lands overall but exactly where it won points and where it lost them.
Each model ran the same 220-task agentic suite three times in a sandboxed repository harness with a fixed toolset (read, write, run tests, search). The overall score is a weighted blend of the four quality metrics below: task completion, edit accuracy, tool-use reliability, and isolated-function generation. Temperature was held at 0 where configurable, we report the median of three runs, and runs that differed by more than 6 points were re-run. Cost is measured on the same runs but reported separately and is not folded into the quality score.
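The run-aggregation rule described above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, the sample scores are made up, and "differed by more than 6 points" is read here as the gap between the highest and lowest run, which is an assumption.

```python
from statistics import median

RERUN_THRESHOLD = 6.0  # points, taken from the methodology text


def needs_rerun(run_scores, threshold=RERUN_THRESHOLD):
    """Return True when the runs disagree by more than the threshold.

    Assumption: disagreement is measured as max minus min across runs.
    """
    return max(run_scores) - min(run_scores) > threshold


def reported_score(run_scores):
    """Return the median of the runs, which is the figure that gets reported."""
    return median(run_scores)


# Made-up scores for one model, for illustration only.
runs = [71.0, 73.5, 72.0]
if needs_rerun(runs):
    print("spread too wide, re-run before reporting")
else:
    print("reported score:", reported_score(runs))
```

Using the median rather than the mean means one unusually good or bad run cannot move the reported figure on its own.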
We reverted 220 closed pull requests across 14 open-source repositories and gave each model the issue text plus repo access, then scored the share of tasks where the model's change made the repository's existing test suite pass with zero human edits. This is the headline metric and carries 50% of the overall weight.
On the subset of tasks that touch existing files, we diffed each model's output against the minimal correct patch and scored the ratio of necessary lines changed to total lines changed. A model that completes a task by rewriting half a file is penalized against one that makes the surgical change. Weighted 25%.
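As a rough sketch of the ratio described above, the example below treats each changed line as a `(path, line_number)` pair and counts a line as necessary when it also appears in the minimal correct patch. Both the representation and the function name `edit_accuracy` are assumptions made for illustration; the actual diffing procedure is not specified here.

```python
def edit_accuracy(model_changed, minimal_changed):
    """Share of the model's changed lines that the minimal patch also changes.

    Both arguments are sets of (path, line_number) pairs. This representation
    is an assumption made for the sketch.
    """
    if not model_changed:
        return 0.0
    necessary = model_changed & minimal_changed
    return len(necessary) / len(model_changed)


# Hypothetical task: the minimal correct patch touches two lines of a.py.
minimal = {("a.py", 10), ("a.py", 11)}

# A surgical edit changes exactly those two lines.
surgical = set(minimal)

# A sprawling edit changes the same two lines plus six unrelated ones.
sprawling = minimal | {("a.py", n) for n in range(20, 26)}

print(edit_accuracy(surgical, minimal))
print(edit_accuracy(sprawling, minimal))
```

Both edits would pass the tests, but the sprawling one scores a quarter of what the surgical one does, which is the penalty the metric is designed to apply.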
We logged every tool call (read, write, run tests, search) across all runs and scored the rate of well-formed, correctly sequenced calls that did not error or loop. The score is valid calls divided by total calls, measured over roughly 9,000 calls per model. Weighted 15%.
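One way to picture the valid-calls-over-total-calls ratio is the sketch below. The `ToolCall` record and its fields are hypothetical, and the loop check (a call identical to the one immediately before it) is a simplifying assumption; the checks for well-formed and correctly sequenced calls are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical log record for one tool call."""
    tool: str      # "read", "write", "run_tests", or "search"
    args: str      # serialized arguments
    errored: bool  # True if the call returned an error


def tool_reliability(calls):
    """Valid calls divided by total calls.

    Assumption: a call is invalid if it errored or if it exactly repeats
    the previous call, used here as a stand-in for loop detection.
    """
    if not calls:
        return 0.0
    valid = 0
    previous = None
    for call in calls:
        repeated = (
            previous is not None
            and (call.tool, call.args) == (previous.tool, previous.args)
        )
        if not call.errored and not repeated:
            valid += 1
        previous = call
    return valid / len(calls)


# Made-up log: one repeated read and one errored test run out of four calls.
log = [
    ToolCall("read", "a.py", False),
    ToolCall("read", "a.py", False),
    ToolCall("write", "a.py", False),
    ToolCall("run_tests", "", True),
]
print(tool_reliability(log))
```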
A held-out set of 60 isolated-function prompts with hidden unit tests, scored on first-attempt pass rate with no tool access and no retries — pure code generation in a vacuum. Weighted 10%.
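The four quality weights stated above (50%, 25%, 15%, 10%) sum to 100%, so the overall score is a plain weighted sum. The sketch below assumes each metric is already on a 0 to 100 scale; the dictionary keys and the sample values are illustrative, not measured results.

```python
# Weights as stated in the metric descriptions; key names are illustrative.
WEIGHTS = {
    "task_completion": 0.50,
    "edit_accuracy": 0.25,
    "tool_reliability": 0.15,
    "isolated_generation": 0.10,
}


def overall_score(metrics):
    """Weighted blend of the four quality metrics (cost is excluded)."""
    missing = WEIGHTS.keys() - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics: {sorted(missing)}")
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)


# Made-up metric values on a 0-100 scale, for illustration only.
example = {
    "task_completion": 80.0,
    "edit_accuracy": 60.0,
    "tool_reliability": 90.0,
    "isolated_generation": 70.0,
}
print(round(overall_score(example), 1))
```

Because completion carries half the weight, a ten-point gain there moves the overall score by five points, while the same gain in isolated-function generation moves it by one.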
We summed input and output tokens for every completed task at each vendor's list price and normalized so that a lower median cost-per-completed-task scores higher. Reported alongside the quality score, never folded into it.
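The cost calculation can be sketched as follows. The per-task cost follows directly from the description (tokens multiplied by list price); the normalization formula is not given in the text, so the version here, which gives the cheapest model 100 and scales the others proportionally, is an assumption. Model names, token counts, and prices are placeholders.

```python
def task_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars of one task, with list prices quoted per million tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


def cost_scores(median_cost_by_model):
    """Map each model's median cost per completed task to a 0-100 score.

    Assumption: the cheapest model scores 100 and the rest scale as
    cheapest / own cost, so a lower cost always scores higher.
    """
    cheapest = min(median_cost_by_model.values())
    return {
        model: 100.0 * cheapest / cost
        for model, cost in median_cost_by_model.items()
    }


# Placeholder figures: 200k input and 50k output tokens at $2 and $10 per million.
print(task_cost(200_000, 50_000, 2.0, 10.0))

# Placeholder median costs per completed task for two hypothetical models.
print(cost_scores({"model_a": 0.90, "model_b": 0.30}))
```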
Anthropic's flagship model, built for long, tool-using agentic work. In our suite it was the model that most often carried a multi-file task from issue to passing tests without intervention, and it held its tool-call sequences together across repeated runs better than any other entry. Its trade-offs are practical ones, not gaps in capability: it posts the slowest median latency in the top tier and the highest per-token output price, so it earns its rank on hard, multi-step jobs rather than on cost or speed. Best for agentic refactors and test-driven work; overkill for quick one-off generation.
Source: Anthropic
Strengths
- Top task-completion rate on multi-file work
- Most accurate diffs; rarely rewrites unrelated code
- Stable tool-call sequences across repeated runs
Weaknesses
- Slower median latency than the mid-tier models
- Premium price per million output tokens
How it scored, by metric
OpenAI's general-purpose frontier model and the broadest performer in the field across languages and task types. It produced the strongest isolated-function generation in the test, which makes it the easiest pick for greenfield code where the job is to write something new rather than surgically edit something old. It slips behind on long edit chains, where it over-edits large existing files and occasionally retries tool calls in ways that inflate latency. Best for one-shot implementation and broad language coverage.
Source: OpenAI
Strengths
- Strongest isolated-function generation score
- Broad language coverage
Weaknesses
- More frequent over-edits on large existing files
- Occasional tool-call retries inflate latency
How it scored, by metric
Google DeepMind's long-context model, and the one that read very large repositories without truncation in our runs. That makes it a natural fit for sprawling codebases where the real constraint is fitting the relevant files into context at all. Its weakness is consistency: it posted the highest run-to-run variance in the top tier and faded on ambiguous, underspecified prompts where it had to infer intent. Best when the codebase is huge and the task is well specified.
Source: Google
Strengths
- Handles very large contexts without truncation
- Competitive completion on well-specified tasks
Weaknesses
- Highest run-to-run variance in the top tier
- Weaker on ambiguous, underspecified prompts
How it scored, by metric
DeepSeek's cost-efficient frontier model and the clear value leader in the test, posting by far the best score-per-dollar. It completes standard CRUD-style tasks reliably, so it suits high-volume, cost-sensitive workloads where the work is routine and the budget is the binding constraint. Under the agentic harness, though, its tool-call reliability and large-file edit accuracy both trailed the leaders, so it is a weaker choice for long, intricate multi-file jobs. Best for cheap, high-throughput coding at scale.
Source: DeepSeek
Strengths
- Far lower cost per task than the top three
- Solid completion on standard CRUD tasks
Weaknesses
- Lower tool-call reliability under the agentic harness
- Weaker edit accuracy on large files
How it scored, by metric
Alibaba's open-weight coding model, and the only self-hostable entry in the field — the reason to choose it is deployment control, offline use, or data-residency needs rather than topping the leaderboard. Its isolated-function generation is competitive for its tier, but completion fell off past roughly ten sequential steps and its diff formatting was inconsistent, both of which cost it on long agentic chains. Best for self-hosted and air-gapped deployments where weights must stay in-house.
Source: Alibaba
Strengths
- Open weights; self-hostable
- Good isolated-function generation for its tier
Weaknesses
- Completion falls off past ~10 sequential steps
- Inconsistent diff formatting
How it scored, by metric
The ranking above reflects the median of three runs per model on a fixed agentic-coding suite. The single largest separator at the top of the table is not how well a model writes new code in isolation but how reliably it edits code that already exists.
What the scores measure
Completion rate carries half the weight because, in practice, a coding model is judged by whether the task is done and the tests pass, not by whether one function looked clean. Edit accuracy is scored separately so that a model that completes a task by rewriting half a file is penalized against one that makes the minimal correct change.
Where the field separates
The top two models are within two points on the overall score and trade places depending on the task mix. Claude Opus 4.7 leads on multi-step completion and diff discipline; GPT-5.5 leads on single-shot generation. Below them, the gap widens around tool-use reliability rather than code quality: every model in the table can write a correct function, but fewer can run twenty correct tool calls in a row.
Cost and latency
Cost is tracked on the same runs but kept out of the quality score, because a buyer optimizing for spend and a buyer optimizing for capability are answering different questions. DeepSeek-V4 posts the best cost-per-task score in the table; the two leaders post the highest absolute quality scores at a premium price.
- https://www.anthropic.com/claude/opus
- https://openai.com/index/gpt-5/
- https://deepmind.google/models/gemini/
- https://www.deepseek.com/
- https://qwenlm.github.io/
- https://www.swebench.com/
Q. Which AI coding model finished first?
Claude Opus 4.7 finished first on the overall score, carried by the highest end-to-end task-completion rate on multi-file work and the most accurate diffs in the field. GPT-5.5 ranked second, within two points overall, and led on single-shot generation. The two trade places depending on whether a task is mostly writing new code or editing code that already exists.
Q. How were these coding models tested?
Each model ran the same 220-task agentic suite three times in a sandboxed repository harness with a fixed toolset (read, write, run tests, search). The tasks are 220 closed pull requests, reverted across 14 open-source repositories, and the headline metric is the share of tasks where the model's change made the repository's existing test suite pass with zero human edits. We report the median of three runs and re-ran any run that differed from its siblings by more than 6 points.
Q. What is the cheapest coding model in the test?
DeepSeek-V4 posted the best cost-per-task result in the test, well ahead of the top three, and completes standard CRUD-style tasks reliably. The trade-off shows up under the agentic harness, where its tool-call reliability and large-file edit accuracy both trailed the leaders, so it fits high-volume, cost-sensitive work better than long, intricate multi-file jobs.
Q. Is there an open-weight model here for self-hosting?
Qwen3-Coder is the only self-hostable entry in the field, which is the reason to choose it when deployment control, offline use, or data residency is the binding constraint. Its isolated-function generation is competitive for its tier, but task completion fell off past roughly ten sequential steps and its diff formatting was inconsistent, both of which cost it on long agentic chains.
Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.