I ran Cursor's cloud agents for a week. They saved me time and money

I've enjoyed using Cursor's cloud agents in my workflow these past couple of weeks. I hand an approved plan over, go and do something else, and come back to a finished pull request that's already been built and checked against what I asked for. What makes that work is the setup around it. As a company we have already moved our review onto the plan rather than the code, so by the time an agent starts building there's something we trust telling it what done means.

That phrase, what done means, is the whole conversation right now. You've probably seen the most recent version of it from the creator of Claude Code saying he barely writes prompts anymore, he writes loops. Point an agent at something, tell it not to stop until it's finished, and walk off. It went down about as well as you'd expect, half the replies cheering and half wanting him fired. I think he's onto something, but the bit his demos skip is that a loop running until it's "done" is only worth anything if you trust whatever is deciding it's done.

With agent code you can't make that call by reading the pull request. The diff is two different things at once: some of it is what I actually decided, the change I wanted and why, the calls that I'd defend to anyone. The rest is the thousand small mechanical choices the agent made along the way, the ones I never thought about and never would have. Reviewing the PR means judging both at once with no way to tell them apart, and not having to care about those small choices is the whole reason the agent is quick in the first place. The part I do have an opinion on ends up buried in a pile of code nobody really chose.

We've built an internal tool called Brent that moves the review forward onto the plan. Someone writes up what they're going to change and why, the rest of us read it and pick at it the way we used to pick at each other's code, and only once it's signed off does an agent go and build it. The code still gets checked, but checking a machine's work for bugs and for whether it stuck to the plan is a job you can hand to another machine. We don't call that peer review, simply because it isn't.

Last week the agent-built PRs that merged took about fourteen minutes at the median to go from an approved plan to a PR worth opening, the fastest around five and the slowest around thirty-six. The median doesn't tell you much on its own. What it meant in practice is that most of the building happened while I was off doing something else, even whilst I was asleep, which I enjoyed.

So you just let it merge whatever?

No. There are two feedback loops running, and the agent typing code is only the first of them.

That first loop is the Cursor agent on Composer 2.5, fixing its own work. It pushes, something goes red, it reads the failure and pushes again. You can see it in the commit history: most of our runs last week took a number of goes to come good, usually four or five, with the highest being nine. This number could probably be brought down by using Opus or Fable (RIP), but since none of those extra pushes are mine, just an agent wrestling CI and the linter until they go quiet, I'm not too bothered.
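To make the shape of that first loop concrete, here is a minimal sketch in Python. It is not how Cursor implements it; `run_until_green`, the `push` callback and the cap of nine pushes are all hypothetical names and values I've picked for illustration.

```python
from typing import Callable, Optional

# Hypothetical cap: nine was the most pushes any run needed last week.
MAX_PUSHES = 9


def run_until_green(
    push: Callable[[Optional[str]], Optional[str]],
    max_pushes: int = MAX_PUSHES,
) -> int:
    """Push, read the failure, push again, until the checks go quiet.

    `push` receives the previous failure log (None on the first go) and
    returns the new failure log, or None when CI and the linter pass.
    Returns how many pushes it took.
    """
    failure: Optional[str] = None
    for attempt in range(1, max_pushes + 1):
        failure = push(failure)
        if failure is None:
            return attempt
    raise RuntimeError(f"still red after {max_pushes} pushes: {failure}")


# Stand-in for the agent: fails three times, then comes good.
_logs = iter(["lint failed", "test failed", "test failed", None])
pushes_needed = run_until_green(lambda previous_failure: next(_logs))
print(pushes_needed)
```

The cap matters more than the loop itself: without one, "don't stop until it's finished" has no way to give up and ask for help.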

That's the easy part, and it's the one all the posts are about. The second loop is Brent, and it's the one that decides whether "done" is actually done. Every time the agent pushes, Brent lines the new diff up against the plan you approved and refuses to let it through until the two match.

How does Brent know the code matches the plan?

This is the bit that matters, and it's where "the agents write our code now" blogs don't have a good answer. It's the difference between an agent opening a PR and you being able to trust that PR without reading every line of it.

On every push to a plan-linked PR, Brent kicks off a separate agent whose only job is to compare the diff against the approved plan and write down everywhere the two disagree. It treats the whole PR as hostile: the description, the commit messages, the diff, the string literals, all of it.

It writes each disagreement down as a finding with one of four tags:

Missing — the plan asked for something and it isn't in the diff. The agent quietly dropped work.
Scope — the diff has grown a separate capability the plan never mentioned. The agent built something nobody asked for.
Changed — a planned thing built a different way than specified.
Beyond — a small deliberate extra in the same area.
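As a rough sketch, the four tags could be modelled like this in Python. The class and field names are assumptions made for illustration, not lifted from Brent's code, and the two example findings are made up.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tag(Enum):
    MISSING = "missing"  # the plan asked for it and the diff doesn't have it
    SCOPE = "scope"      # the diff grew a capability the plan never mentioned
    CHANGED = "changed"  # a planned thing, built a different way than specified
    BEYOND = "beyond"    # a small deliberate extra in the same area


@dataclass
class Finding:
    tag: Tag
    summary: str
    acknowledgement: Optional[str] = None  # the written reason, once given

    @property
    def blocking(self) -> bool:
        # A finding that is still listed blocks until someone acknowledges it.
        return self.acknowledgement is None


# Made-up findings, purely to show the shape of the data.
findings = [
    Finding(Tag.MISSING, "negative-path test from the plan is not in the diff"),
    Finding(Tag.BEYOND, "extra log line next to the planned change",
            acknowledgement="kept to help debug the new path"),
]
blocking = [f.summary for f in findings if f.blocking]
print(blocking)
```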

These are hard work to catch if you are just relying on reviewing the code. A dropped test doesn't show up as a line in the diff; you only notice it's gone if you're holding the whole plan in your head while you read every file. Something unplanned is the same problem in reverse.

The check on GitHub stays red until every finding is either fixed or acknowledged, and acknowledging one is deliberately a pain. You, or the agent, have to name the specific finding and explain it in writing: not "added tests" but "the negative path is covered in lister_test.go". Vague cover-alls like "integration and unit tests" get rejected even when the words look close, and when Brent is unsure it leaves the finding blocking and escalates to a human review.

When it's done, the check lands on one of three colours:

Green: the diff matches the plan. If the plan itself is approved, Brent approves the PR.
Amber: there are differences, but every one has been acknowledged with a reason. Brent stops deciding here. It pulls in the person who approved the plan and hands them a review packet consisting of each difference, the reason given for it, and the diff it touches, so they're ruling on two or three specific calls instead of re-reading the whole PR.
Red: there are differences nobody has acknowledged. Brent sits on it and waits for the author or the agent. No human gets pulled in yet.
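The three colours boil down to a small decision rule. Here is a sketch of it in Python, assuming each open finding carries either a written reason or nothing; the function names and the returned strings are hypothetical, and what happens on green when the plan is not yet approved is my guess rather than something described above.

```python
from enum import Enum
from typing import Optional, Sequence


class Colour(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


def check_colour(acknowledgements: Sequence[Optional[str]]) -> Colour:
    """One entry per open finding: its written reason, or None if it has none.

    An empty string counts as no reason at all.
    """
    if not acknowledgements:
        return Colour.GREEN  # the diff matches the plan
    if all(reason for reason in acknowledgements):
        return Colour.AMBER  # differences exist, each one has a reason
    return Colour.RED        # at least one difference nobody has acknowledged


def next_step(colour: Colour, plan_approved: bool) -> str:
    if colour is Colour.GREEN:
        # Assumption: with an unapproved plan, the PR simply waits.
        return "approve the PR" if plan_approved else "wait for plan approval"
    if colour is Colour.AMBER:
        return "send a review packet to the plan approver"
    return "wait for the author or the agent"


print(check_colour([]), check_colour(["covered in lister_test.go"]),
      check_colour(["covered in lister_test.go", None]))
```

Note that red never reaches a person: a human only gets pulled in once every difference has a reason attached.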

Producing a two-thousand-line PR is fairly trivial now; reading those two thousand lines closely enough to be sure they do everything the plan asked is not. One way or another, that reading had become most of the job. The deviation check does that comparison for me. By the time a PR lands in front of me the whole thing has shrunk to one question: do I support the two or three differences someone had to write a reason for? Every PR that merged last week got through this check, with Brent's own bot listed as a reviewer on each. About half also pulled in a human at the amber stage (the person who approved the plan, never the author) to sign off on a difference the author or the agent had acknowledged.

What about the bugs?

Bug-free? Nothing in life is really bug-free. However, each run had to get past a Cursor Bugbot review. It left comments on six of the twelve merged runs, sixteen in all, and every one was dealt with before merge; the other six it had nothing to say about. A Bugbot comment on its own never pulls in a person, though. The agent just fixes it, and a green plan check on an approved plan is enough to let the PR merge.

Bonus: save money by using a cheaper, slower model

This is where putting the plan first earns its money, literally. The expensive model does its work in the planning, helping a person think the change through rather than thinking for them, and what comes out is the plan the rest of us read and argue over. That's where the judgement sits, so that's where the strongest model and the most attention should go (currently Opus). Building is just carrying out a decision that's already been made and reviewed, which a fast, cheap model handles fine. The deviation check, which runs on a more powerful model, stands behind it in case the build wanders off the plan.

If we were to estimate based on Cursor's published per-task averages, not numbers metered off our own runs:

Composer 2.5 averages about $0.07 a coding task on the standard tier, around $0.44 on the fast tier.
A similar agentic task on a frontier model like Claude Opus 4.8 runs about $4 to $5.

So the building runs roughly ten to seventy times cheaper on Composer than it would on a frontier model, depending on the tier, and we still pay the premium where it counts, on the plan. Across last week's twelve merged builds that's about a dollar of Composer time on the standard tier, or around five on the fast tier, against fifty to sixty dollars if we'd driven the same work with a frontier model. Take these figures with a pinch of salt: it's a small sample, and they come from task-level list pricing, with nobody metering tokens per run. I plan to look into this in more detail in the future.
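For anyone who wants to check the sums, here is the arithmetic in Python using only the per-task averages quoted above and the twelve merged builds. The variable names are mine, and nothing here is metered from our own runs.

```python
# Published per-task averages quoted above, in US dollars.
COMPOSER_STANDARD = 0.07
COMPOSER_FAST = 0.44
FRONTIER_LOW = 4.00
FRONTIER_HIGH = 5.00

BUILDS = 12  # merged builds last week


def total(per_task: float, builds: int = BUILDS) -> float:
    """Estimated spend if every build cost the per-task average."""
    return per_task * builds


composer_range = (total(COMPOSER_STANDARD), total(COMPOSER_FAST))
frontier_range = (total(FRONTIER_LOW), total(FRONTIER_HIGH))

# Narrowest gap: cheapest frontier figure against the fast tier.
# Widest gap: dearest frontier figure against the standard tier.
ratio_range = (FRONTIER_LOW / COMPOSER_FAST, FRONTIER_HIGH / COMPOSER_STANDARD)

print(f"Composer: ${composer_range[0]:.2f} to ${composer_range[1]:.2f}")
print(f"Frontier: ${frontier_range[0]:.2f} to ${frontier_range[1]:.2f}")
print(f"Ratio:    {ratio_range[0]:.0f}x to {ratio_range[1]:.0f}x")
```

Which end of the range you land on depends mostly on the Composer tier, which is why the multiple swings so widely.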

And that's where we've landed. You approve a plan, and most of the time there's a PR waiting about fifteen minutes later. Bugbot has been over it, it matches what you agreed to, and anywhere it doesn't, someone had to write down why.

If you're running cloud agents on real code, I'd love to know how you're deciding when they're actually done. Drop it in the comments, because that's the part we're all still working out.
