If your team scaled up AI coding tools this year and code review has gotten harder, not easier, you're not imagining it. Faros published a study last month (22,000 developers, 4,000 teams, two years of data) and it describes exactly what teams are experiencing. Throughput is up: task completion up 34%, epics per developer up 66%, code-specific tasks per team up 210%. Developers feel productive. Then you hit the quality column: bugs per developer up 54%, incidents per PR up 243%, median time in PR review up 441%, 31% more PRs merging with zero review.
They've called it the Acceleration Whiplash. The speed gains are real, nobody can argue against that, but downstream from them everything is essentially on fire.
The finding that should concern you most is that engineering maturity offers no protection. High DORA scores, mature DevOps practices, disciplined delivery — none of it helps. Faros is pretty direct about it: organisations with the strongest pre-AI engineering foundations are experiencing the same quality deterioration as everyone else.
We read that report, saw that our own internal numbers look different, and think we have the fix.
We have seen this before
Fifteen years ago, roughly, this exact dynamic played out.
Before CI/CD was standard, QA lived at the end of the cycle. You wrote the code, the team tested it, bugs surfaced, and developers fixed them. It worked until the volume of code outpaced the capacity of QA to absorb it. Bugs started making their way to prod. Teams responded by hiring more QA, writing more test plans, adding more checkpoints, which helped a little at the cost of making the cycle longer and slower.
The fix, when it arrived, was structural rather than additive. Stop catching problems at the end; catch them earlier, before they accumulate. Shift left. Write tests before the code. Run linters on commit, static analysis in CI. If the code can't pass a basic automated gate, it never reaches a human reviewer at all.
Faros is describing the same dynamic, replayed one layer up. AI generates code fast, but the catching still happens at the end: in review, in prod, and in incident response. The response most organisations are reaching for is more reviewers, stricter gates, longer QA. Faros is explicit that this doesn't work. "The instinct to tighten review is the wrong response," they write on page 22. "The answer is to raise the quality of code at the point of authoring."
What Faros is pointing at, though they don't name it this way, is a plan gate. Move the review to before the code exists.
What Faros found, in full
Nothing about their Acceleration Whiplash is an argument against using AI. The gains are real: task throughput per developer is up 33.7%, epics per developer up 66.2%, and code-specific tasks per team up 210%.
What is also real, and something you may have seen or felt yourself, is that the code being produced is not production-grade.
Bugs per developer up 54% (up from 9% in Faros's 2025 report; the acceleration is accelerating). Incidents per PR up 242.7%. Code churn up 861%, which puts it at roughly 9.6 times its previous level. And median time in PR review up 441.5%.
The 861% churn figure is one worth digging into. It means a large volume of the code being written is being removed again within weeks. Faros's best guess is that developers are accepting AI-generated code quickly, then returning to replace it when it doesn't hold up in practice. Throughput measures what shipped, not what stuck around.
The review time increase is where the senior engineer tax shows up. Faros puts it plainly on page 18:
"AI-generated code presents a particular challenge for reviewers. It is often superficially convincing: idiomatic, well-named, stylistically consistent with the surrounding codebase. The structural and logical failures, when they exist, are beneath the surface. Catching them requires the reviewer to read carefully, reason about intent, and reconstruct the problem the code was meant to solve. This is slow, expensive cognitive work. The engineers with the deepest knowledge of the system are spending their most valuable hours unraveling plausible-sounding code that should never have reached them in the state it did."
That's 22,000 developers and two years of telemetry behind that quote.
The fix Faros buries in its recommendations
The recommendations section starts on page 21. Most of it covers visibility and measurement: know your incident-to-PR ratio, track code churn by repo, watch work restarts as a signal. Nothing out of the ordinary. However, one recommendation is structural and stands apart from the rest.
They open with the diagnosis on page 21: "This is an authoring problem, not a review problem."
From there it gets more specific. Page 22: "The instinct to tighten review is the wrong response. More reviewers, stricter gates, longer QA cycles treat the symptom."
Page 23 has the actual prescription: "Best practice for agentic development follows a research → plan → implement sequence. The plan phase must produce a scope that results in a PR that can be deployed to production as a distinct, self-contained unit of value."
And then this on page 24, which is worth reading slowly: "Rising restarts under high AI adoption point directly to insufficient context at the authoring stage. The fix is upstream: better context provisioning before the agent begins, not manual correction after the restart."
They're describing a plan gate. Write a plan before the agent starts, have the plan reviewed and signed off, then use that as the definition of done when the build comes back. That's the structural fix; everything else in their recommendations is measurement and incremental improvement. Out of all the suggestions, this is the one change that addresses the root cause. How do we know? Well, it worked for us.
What our numbers look like
We built an internal tool called Brent that runs this workflow. A developer writes a plan, the team reviews and approves it, a Cursor cloud agent builds against it, and a separate adversarial agent checks the finished diff against the approved plan before anything merges. The checking agent treats the PR description and commit messages as potentially misleading. It flags anything that drifted: something the plan asked for that isn't in the diff, something built that the plan never mentioned, something built differently than specified. Each finding either gets fixed or acknowledged in writing before the check goes green.
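The three kinds of drift described above can be sketched as a comparison between what the plan asked for and what the diff contains. This is a minimal illustration, not how Brent works internally: `PlanItem`, `Finding`, `check_drift`, and `check_is_green` are hypothetical names, and the sketch assumes both the plan and the diff have already been reduced to keyed summaries. In practice that reduction is the hard part, and it is the job the adversarial agent does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanItem:
    """One unit of agreed work (hypothetical structure for illustration)."""
    key: str   # stable identifier for the item
    spec: str  # what the approved plan says should be built


@dataclass(frozen=True)
class Finding:
    kind: str  # "missing", "differs", or "unplanned"
    key: str
    detail: str


def check_drift(plan: list[PlanItem], built: dict[str, str]) -> list[Finding]:
    """Compare approved plan items against what the diff actually contains.

    `built` maps item keys to a summary of what was implemented. It is
    assumed to be derived from the diff itself, not from the PR description
    or commit messages, which are treated as potentially misleading.
    """
    findings = []
    planned = {item.key: item.spec for item in plan}

    for key, spec in planned.items():
        if key not in built:
            # The plan asked for it, the diff does not contain it.
            findings.append(Finding("missing", key, f"planned but not built: {spec}"))
        elif built[key] != spec:
            # Built, but not the way the plan specified.
            findings.append(Finding("differs", key, f"planned {spec!r}, built {built[key]!r}"))

    for key, summary in built.items():
        if key not in planned:
            # Built, but the plan never mentioned it.
            findings.append(Finding("unplanned", key, f"built but never planned: {summary}"))

    return findings


def check_is_green(findings: list[Finding], acknowledged: set[str]) -> bool:
    """Green only when every remaining finding has been acknowledged in writing.

    Fixed findings simply stop appearing in `findings` on the next run.
    """
    return all(f.key in acknowledged for f in findings)
```

The point of the sketch is that the comparison needs two inputs. Without an approved plan there is nothing to populate the first one.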
A caveat: we are one team with one repo, not 22,000 developers. But every metric we measured went the opposite way to Faros.
| Metric | Faros — AI, no plan gate | Us — AI + plan gate |
|---|---|---|
| PR review time (open → merge) | 5.4× slower (+441%) | 4.7× faster |
| Incidents / CI failures | 3.4× more (+243%) | 2.6× fewer (22.8% → 8.6%) |
| Code churn (code thrown away) | 9.6× more (+861%) | unchanged (revert rate flat) |
| Merges with no review | 1.3× more (+31%) | none (plan reviewed first) |
| Merged PR throughput | 1.2× more (+16%) | 1.8× more |
| Changes sent back for rework | not tracked | 2.8× more often (17% vs 6%) |
Faros: change from low to high AI adoption across 22,000 developers (their reported % in brackets). Ours: before/after the plan gate, one team, three windows averaged. Directional comparison only, not a like-for-like.
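The multipliers in the Faros column follow directly from their reported percentages: a rise of p% leaves the metric at 1 + p/100 times its starting level. A short sketch of that conversion, using only the percentages quoted in the table (the function name `pct_to_multiplier` is ours, chosen for illustration):

```python
def pct_to_multiplier(pct_increase: float) -> float:
    """Convert a percentage increase into a 'times the baseline' multiplier."""
    return 1 + pct_increase / 100


# Percent changes as quoted in the Faros column of the table above.
faros_pct = {
    "PR review time": 441,
    "Incidents per PR": 243,
    "Code churn": 861,
    "Merges with no review": 31,
    "Merged PR throughput": 16,
}

for metric, pct in faros_pct.items():
    print(f"{metric}: +{pct}% is {pct_to_multiplier(pct):.1f}x the baseline")
```

So a +861% change in churn means churn is about 9.6 times what it was, not 8.6 times.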
The feedback cost numbers are where the mechanism shows up most clearly. A round of feedback on a PR added 15.9 hours to delivery. That seems high until you consider what a round involves: reloading context, rewriting code that has already been built, getting the reviewer back, and waiting for CI again. On a plan, a round costs 0.7 hours, before any code exists. That gap is why we now send work back 17% of the time vs 6% before, 2.8× more often, all of it landing before a line is written.
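A back-of-envelope calculation shows why sending work back more often can still cost less. The sketch below uses only the four figures quoted above and makes two simplifying assumptions that are ours, not measurements: each sent-back change goes through exactly one round of feedback, and before the plan gate all feedback landed on PRs while after it all feedback lands on plans.

```python
# Figures quoted in the text above.
PR_FEEDBACK_HOURS = 15.9    # delay added by one round of feedback on a PR
PLAN_FEEDBACK_HOURS = 0.7   # delay added by one round of feedback on a plan
SENT_BACK_BEFORE = 0.06     # share of changes sent back before the plan gate
SENT_BACK_AFTER = 0.17      # share of changes sent back with the plan gate


def expected_rework_hours(send_back_rate: float, hours_per_round: float) -> float:
    """Average feedback delay per change, assuming one round per send-back."""
    return send_back_rate * hours_per_round


before = expected_rework_hours(SENT_BACK_BEFORE, PR_FEEDBACK_HOURS)
after = expected_rework_hours(SENT_BACK_AFTER, PLAN_FEEDBACK_HOURS)

print(f"Cost ratio per round, PR vs plan: {PR_FEEDBACK_HOURS / PLAN_FEEDBACK_HOURS:.1f}x")
print(f"Expected delay per change before the gate: {before:.2f} h")
print(f"Expected delay per change with the gate:   {after:.2f} h")
```

Under those assumptions, feedback arrives 2.8× more often but the average delay it adds per change drops from just under an hour to roughly seven minutes. Treat this as an illustration of the mechanism rather than a measured result.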
13 of 165 reviewed plans were rejected and dropped outright.² Zero code written, zero CI minutes, zero reviewer time. Faros can't see that in their telemetry as it never hits any metric they'd track.
Looking at a single week of Cursor agent builds, the numbers are even more striking: median time from approved plan to PR was around 14 minutes.¹ The building runs on Composer 2.5, at $0.07 to $0.44 per task on average. Opus would run $4 to $5 for the same thing.
Why the numbers go the other way
Faros states work restarts are up 13.8%. Their framing: tasks moving back to in-progress after another stage, which costs the developer the context they'd already set down. A restart on a PR costs around 16 hours of rebuilding context and rewriting code that was already written. On a plan, it's about 45 minutes, before a line of code exists.
That gap is why our feedback numbers look different. More feedback, arriving earlier. Fewer broken ideas get as far as the build stage. Fewer bad builds in production, less churn, and your senior engineers stop spending their afternoons reconstructing intent from a two-thousand-line diff that arrived without any.
There's another thing the plan enables: the adversarial check. You need a document that describes what done means before the agent runs — that's the thing you're checking against. Without a plan, you're back to reading the diff and hoping.
The gap between knowing and acting
Faros ends the report with: "The gap between knowing and acting is the only gap that matters now."
Their data tells you what's happening: AI is producing more code than your existing processes were built to absorb, and the quality gap is widening as adoption deepens. Adding more review, more QA, and more incident response at the end doesn't fix the structural problem. The fix is upstream, in the authoring stage, before the agent runs.
If your numbers look anything like theirs, you know where to start.
We wrote up how a week of this looks in practice here, and the case for why we review plans instead of code is at notyourpeer.com.
¹ Fastest around 5 minutes, slowest around 36. Most runs happened while I had gone off to do something else. The figure quoted is the median rather than the mean.
² Writing a plan that someone else would sign off on is, in a slightly uncomfortable way, harder than writing code. You have to know what you actually think before you start. Most rejected plans don't know what problem they're solving yet. Agents are much better at filling in gaps than at challenging your intent.
