Speed Has Numbers. Trust Has Error Bars.

Automate the patch, review the minor, audit the major. A solo developer's framework for managing trust debt at machine speed.

When I removed human code review and let CI gates decide what ships, I moved ten times faster. Then I had to figure out what I’d traded away — and whether I could get it back.


The experiment

I should be upfront about what this is and what it isn’t. This is one developer, working alone on a greenfield project, building a polyglot API ecosystem across eight repositories over forty-eight days. I am not Google. I am not even a team. I am the opposite end of the spectrum from the companies whose research I’m about to cite, and I want to be transparent about that.

The project — a set of language ports for IBM’s MQ REST Admin API — was deliberately chosen for its straightforward architecture. The code isn’t complex. It’s namespace mapping, convenience wrappers, and functional abstractions over a REST endpoint. I didn’t pick it to push the boundaries of what AI agents can build. I picked it to push the boundaries of how they build it. The workflow was the experiment, not the code.

Trust in auto-merge didn’t happen overnight. For the first few weeks, I reviewed and approved every pull request. The agent wrote better code than I expected — often better than I would have written — and my review comments got progressively thinner. I went from detailed review to skimming to rubber-stamping. At some point I had to ask myself: who am I performing this review for?

So I turned on auto-merge and let the CI gates decide.
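Mechanically, "letting the CI gates decide" reduces to a small predicate: every required gate green, or no merge. A minimal sketch with hypothetical gate names — in practice this policy lives in branch-protection settings, not in application code:

```python
# Minimal sketch of an auto-merge policy: a PR merges only when every
# required CI gate reports success. Gate names are hypothetical.
REQUIRED_GATES = {"build", "unit-tests", "integration-tests", "lint", "coverage"}

def is_auto_mergeable(check_results: dict) -> bool:
    """check_results maps gate name -> status ('success', 'failure', 'pending')."""
    # A pending or missing gate blocks the merge exactly like a failure.
    return all(check_results.get(gate) == "success" for gate in REQUIRED_GATES)

print(is_auto_mergeable({g: "success" for g in REQUIRED_GATES}))  # True
```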

The throughput increase was immediate and dramatic. Pull requests that previously sat in my review queue for hours now merged in minutes. But here’s the honest admission that frames everything that follows: the technical quality problems I face at my scale — defect detection, regression prevention, automation trust — are the same ones Google faces. The social quality problems — knowledge transfer, shared ownership, organizational coordination — are structurally absent when you’re a team of one. There is no one to transfer knowledge to. That’s not a smaller version of Google’s problem. It’s a categorically different situation, and I need to acknowledge that before making any claims about what worked.

What review actually does

Before I can argue that CI gates replaced my peer reviewer, I need to understand what peer review actually provides. The research here is unambiguous, and it surprised me.

Teams believe code review catches bugs. It mostly doesn’t. A Microsoft Research study found that only about 15% of review comments indicated a possible defect [1]. The vast majority concerned structure, style, and maintainability — not functional bugs. A Cisco case study analyzing 2,500 reviews across 3.2 million lines of code found that 61% of reviews discovered zero defects [2]. The share of review effort that actually catches bugs is smaller than most engineers assume.

So what is review doing? A landmark study by Bacchelli and Bird observed developers at Microsoft and found that while finding defects was the stated motivation, the actual outcomes were dominated by knowledge transfer, increased team awareness, and the discovery of alternative solutions [3]. A Google study of nine million reviewed changes reached the same conclusion: review’s primary value is knowledge distribution and architectural oversight, not defect detection [4].

Google requires human review for every change. No exceptions for seniority, team membership, or change size. They maintain the most sophisticated CI/CD infrastructure in the world — 4.2 million unique tests, 150 million test executions per day [5] — and they still won’t let code merge without a human saying “looks good to me.” If any organization could replace review with automation, it’s Google. They explicitly chose not to.

But there’s a counterpoint worth acknowledging. Research on review comment quality at Microsoft found that 20-44% of comments were rated “not useful” by the change authors [6]. The useful comments came from reviewers with deep contextual knowledge of the specific code area. The rest was noise. Review as commonly practiced is wasteful. The answer may be better review, not more review — but that’s a different argument than the one I was making when I turned on auto-merge.

The punchline: the defect-detection justification for human code review is weaker than most teams assume. But removing review eliminates benefits that have nothing to do with defects.

What I actually lost

At a team of one, the social benefits don’t apply. I’m not losing knowledge transfer because there’s no team to distribute knowledge across. I’m not losing shared ownership because there’s no one to share it with. The research is clear that these are emergent properties of team size [7] — they don’t exist in diminished form for a solo developer; they are structurally absent.

What I did lose was the forcing function to slow down. Review was a speed bump, and speed bumps exist for a reason. Without them, I moved fast — and architectural drift accumulated without anyone noticing.

The problems that slipped through weren’t code bugs. The CI gates caught those. What slipped through were cross-repository inconsistencies: the same API concept named differently in two ports, a bug fix applied to the reference implementation but not propagated to the others, tooling inefficiencies that compounded silently across eight repos. Process bugs. Consistency bugs. The kind of problems that no automated gate tests for because no one thought to write that test.
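In hindsight, at least some of those consistency bugs are checkable — someone just has to think to write the check. A hypothetical sketch of a cross-repo naming audit; the glossary, the variant spellings, and the file glob are all invented for illustration (a real polyglot version would glob per-language extensions):

```python
from pathlib import Path

# Canonical names for shared API concepts, mapped to known divergent spellings.
# All terms here are invented for illustration.
CANONICAL = {
    "queue_manager": ["qmgr", "queuemanager", "q_manager"],
}

def find_divergent_names(repo_root: Path) -> list:
    """Return (file, divergent term, canonical term) for each mismatch found."""
    hits = []
    for path in sorted(repo_root.rglob("*.py")):  # simplification: one language
        text = path.read_text(errors="ignore")
        for canonical, variants in CANONICAL.items():
            for variant in variants:
                if variant in text:
                    hits.append((str(path), variant, canonical))
    return hits
```

Run against all eight repos in a scheduled job, a check like this turns "no one thought to test for that" into a gate like any other.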

There was also a subtler problem. The AI agent’s context window is finite, and when it compacts prior conversation to make room, it loses information. Bug fixes discovered mid-task get abandoned. Problems identified but deferred never make it into the backlog. The agent forgets what it was working on three hours ago. At human-review speed, these gaps get caught in the next PR description or the reviewer’s “wait, what about that thing you mentioned earlier?” At auto-merge speed, they vanish.
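One mitigation I’ve since considered: sweep deferred work out of the conversation and into a file that survives compaction. A sketch, assuming the session transcript is available as plain text; the marker conventions are hypothetical:

```python
import re
from pathlib import Path

# Hypothetical mitigation: before the agent's context is compacted, sweep the
# session transcript for deferred work and persist it outside the context window.
DEFER_MARKERS = re.compile(r"(?:TODO|FIXME|deferred|follow-up):\s*(.+)", re.IGNORECASE)

def harvest_deferred_items(transcript: str) -> list:
    """Extract deferred-work notes so they survive context compaction."""
    return [m.group(1).strip() for m in DEFER_MARKERS.finditer(transcript)]

def append_to_backlog(items: list, backlog: Path) -> None:
    """Append each harvested item as an unchecked backlog entry."""
    with backlog.open("a") as f:
        for item in items:
            f.write(f"- [ ] {item}\n")
```

The point isn’t the regex; it’s that anything the agent defers mid-task needs a home outside the window that will eventually be compacted away.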

Auto-merge didn’t eliminate review. It deferred it. Every problem still surfaced eventually — just later, when it was tangled into more code and harder to unwind.

The asymmetric precision problem

This is where I think the real insight lives, and it’s not about my project. It’s about how we reason about trade-offs when the two sides of the equation aren’t measured the same way.

Lelek and Skowroński, in Vibe Engineering [8], introduced the concept of “trust debt” — the accumulated uncertainty about whether your automated systems are doing what you expect. The metaphor is deliberate: like technical debt, trust debt compounds silently, accumulates without visible symptoms, and presents the bill at the worst possible moment.

When I turned on auto-merge, I was implicitly making a trade-off: delivery speed for trust debt. Move faster now, pay for it later in uncertainty about what actually shipped. That trade-off is real and unavoidable. But here’s the problem with managing it: the two sides aren’t commensurable.

Delivery speed is measured with arbitrary precision. I can tell you that a pull request merged in 97 seconds. I can tell you the median time from commit to deploy. I can graph the throughput increase when auto-merge was enabled and point to the exact inflection. These are direct measurements with negligible uncertainty.
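That precision is trivial to reproduce: timestamps in, seconds out. A sketch with illustrative values:

```python
from datetime import datetime
from statistics import median

# Delivery speed is a direct measurement. Sample commit -> merge timestamp
# pairs; the values are illustrative, not from the project.
events = [
    ("2025-01-10T14:00:00", "2025-01-10T14:01:37"),
    ("2025-01-10T15:20:00", "2025-01-10T15:22:10"),
    ("2025-01-11T09:00:00", "2025-01-11T09:03:05"),
]

lead_times = [
    (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()
    for start, end in events
]
print(f"median lead time: {median(lead_times):.0f}s")  # prints: median lead time: 130s
```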

Trust is a derived quantity. Forty years of human factors research [9] [10] [11] has produced instruments for measuring trust in automated systems. But every instrument decomposes trust into sub-dimensions: perceived reliability, process transparency, calibration accuracy, familiarity. Each sub-dimension requires inference from imperfect proxies — self-reports, behavioral observations, physiological measures. Each proxy carries its own measurement uncertainty. And when you compose uncertain measurements, errors compound.
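A toy illustration of the compounding, assuming the composite is an unweighted mean of sub-dimension scores with independent errors — every number here is invented:

```python
from math import sqrt

# Illustrative, not measured: sub-dimension trust estimates on a 0-1 scale,
# each with its own measurement uncertainty (one standard deviation).
subscores = {
    "perceived_reliability": (0.90, 0.10),
    "process_transparency":  (0.70, 0.15),
    "calibration_accuracy":  (0.80, 0.20),
}

# Composite as an unweighted mean; for independent errors the uncertainty of
# the mean propagates as sqrt(sum(sigma_i^2)) / n.
values = [v for v, _ in subscores.values()]
sigmas = [s for _, s in subscores.values()]
trust = sum(values) / len(values)
trust_sigma = sqrt(sum(s * s for s in sigmas)) / len(sigmas)

print(f"trust      = {trust:.2f} ± {trust_sigma:.2f}")  # wide error bar
print("merge time = 97 s, measured directly")            # effectively no error bar
```

An uncertainty around ±0.09 on a 0-1 scale is an enormous error bar next to a timestamp — and that is with only three sub-dimensions and optimistic sigmas.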

The result is a quantity with error bars so wide that treating it as equivalent to a deployment timestamp is absurd. But that’s exactly what we do every time we make an implicit speed-vs-trust trade-off.

DORA — the organization that literally defined the four canonical delivery metrics (deployment frequency, lead time, change failure rate, mean time to recovery) — treats trust qualitatively [12]. Surveys and behavioral proxies, not metrics. If DORA could have added trust as a fifth key metric, they would have. The fact that they didn’t tells you something about the measurement problem.

Speed has numbers. Trust has error bars. Guess which one wins every trade-off.

This systematic asymmetry biases every decision toward velocity. Not because anyone consciously chooses to ignore trust, but because the case for speed is always crisp and quantified while the case for caution is always fuzzy and hand-wavy. The spreadsheet wins. The gut feeling loses. And trust debt accumulates.

Closing the gap

So how do you manage a trade-off where one side is measured in milliseconds and the other side is measured in “I think we’re probably fine”?

The DORA 2023 report found that code review speed was the single strongest factor in software delivery performance [13]. Not review elimination — review speed. Teams where reviews completed quickly had significantly higher deployment frequency and lower change failure rates. The optimization target was never to remove the human. It was to make the human faster.

That reframes what I did. Turning on auto-merge wasn’t a decision to eliminate review. It was an extreme optimization of review speed — to zero. And like most extreme optimizations, it worked brilliantly in the narrow case and created problems everywhere else.

The framework I’m adopting going forward maps review depth to semantic versioning:

Patch releases get full automation. CI gates are sufficient. These are minor bug fixes and small improvements — low risk, low ceremony. The gates earn this trust by catching defects reliably, and the blast radius of a missed bug is small.

Minor releases get a human checkpoint. When the version ticks from x.y.0 to x.(y+1).0, I review the cumulative diff across the entire development cycle — everything that changed between the two minor releases. Not the last patch. All of them. The point is to meta-review the aggregate: what do dozens of automated patches look like when you step back and see them as a single architectural change? Has the structure drifted? Are we duplicating patterns that should be abstracted? Did the agent introduce shortcuts that compound? This is the point where I look for the process and consistency bugs that CI gates don’t catch.

Major releases get a full architectural audit. Re-earn trust from scratch. Question the assumptions. Verify that what you think the system does is what it actually does.

Automate the patch, review the minor, audit the major.
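As policy, the framework is small enough to state as code. A sketch, assuming plain x.y.z version strings with no pre-release tags:

```python
# Hypothetical dispatch: map a semantic-version bump to the review depth it
# requires under "automate the patch, review the minor, audit the major".
def bump_kind(old: str, new: str) -> str:
    o, n = [tuple(map(int, v.split("."))) for v in (old, new)]
    if n[0] > o[0]:
        return "major"
    if n[1] > o[1]:
        return "minor"
    return "patch"

REVIEW_POLICY = {
    "patch": "auto-merge on green CI gates",
    "minor": "human review of the cumulative diff since the last minor",
    "major": "full architectural audit; re-earn trust from scratch",
}

def required_review(old: str, new: str) -> str:
    return REVIEW_POLICY[bump_kind(old, new)]
```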

This maps surprisingly well to Google’s own practices. Their Rosie system — the closest thing in the industry to automated auto-merge at scale — handles Large-Scale Changes: mechanical, pattern-based transformations applied across the entire monorepo [14]. Even Google auto-approves the mechanical. The principle scales down.

The risk that this framework addresses is trust decay. Bainbridge described it in 1983 [15]: the more reliable the automation becomes, the less the human operator practices the skills needed to intervene when it fails. Without formal checkpoints — without the minor-release review, without the major-release audit — the human gradually disengages. The trust gap widens not because the automation got worse, but because the human stopped checking.

The industry in transition

My micro-scale experiment reflects a macro-scale shift that the entire industry is navigating.

The DORA 2024 report found that AI adoption correlated with a 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability across 39,000 professionals [16]. That sounds alarming, but it’s consistent with a well-documented pattern. Robert Solow observed in 1987 that “you can see the computer age everywhere but in the productivity statistics” [17] — IT investment produced zero visible gains for fifteen years before the 1990s productivity boom. Brynjolfsson and colleagues formalized this as the productivity J-curve [18]: when organizations adopt a general-purpose technology, they divert labor into unmeasured intangible investment — training, process redesign, organizational restructuring — making productivity appear to decline before the investments pay off.

But the 2025 DORA report complicates the optimistic reading [19]. Individual developer productivity soared — 98% more pull requests merged — while organizational delivery metrics stayed flat. AI, it turns out, amplifies the strengths of high-performing organizations and the dysfunctions of struggling ones. Recovery isn’t automatic. It requires deliberate organizational adaptation, not just time.

There’s also the capable-model paradox. Research on inverse scaling [20] shows that the most powerful models can backfire on routine tasks — overthinking simple problems [21], getting creative where you want them to be deterministic, reasoning their way past correct answers into incorrect ones. The answer isn’t one model for everything. It’s the right model for the right task, at the right level of autonomy.

No major tech company has eliminated human review. Google, Microsoft, Meta — every one of them is optimizing review speed and reviewer quality, not removing humans from the loop. That should inform how we approach the question at any scale.

What I’m taking forward

Auto-merge stays on — for patches. The speed is real and the CI gates earn it. Over forty-eight days and hundreds of pull requests, the gates caught what they needed to catch. That trust was built empirically, not assumed.

Minor releases get a human checkpoint. Not because I don’t trust the automation, but because trust decays when it isn’t exercised. The review is for me as much as for the code.

The trust gap widened while I was moving fast. That’s not failure — it’s the expected shape of rapid iteration. You sprint to build, then you consolidate to verify. The mistake would be sprinting forever and calling the growing gap behind you someone else’s problem.

What I did is not scalable, not generalizable, and not a recommendation. It’s a data point. One developer, one project, one set of trade-offs. The research says the technical quality problems are the same at every scale — only the social problems change. The frameworks to manage them exist; Google, Microsoft, and the human factors community have spent decades building them. The hard part is mapping those frameworks down to your context without either dismissing them as “that’s for big companies” or adopting them wholesale at a cost you can’t afford.

Automate the patch. Review the minor. Audit the major.

And remember: the CI gates were never the peer reviewer. They were the safety net. The peer reviewer was always you. The question is whether you remembered to show up.


References

[1] J. Czerwonka, M. Greiler, and J. Tilford, “Code Reviews Do Not Find Bugs: How the Current Code Review Best Practice Slows Us Down,” ICSE 2015. https://www.microsoft.com/en-us/research/wp-content/uploads/2015/05/PID3556473.pdf

[2] SmartBear Software (J. Cohen et al.), Cisco Code Review Case Study, 2006/2009. https://static0.smartbear.co/support/media/resources/cc/book/code-review-cisco-case-study.pdf

[3] A. Bacchelli and C. Bird, “Expectations, Outcomes, and Challenges of Modern Code Review,” ICSE 2013, pp. 712-721. https://sback.it/publications/icse2013.pdf

[4] C. Sadowski, E. Soderberg, L. Church, M. Sipko, and A. Bacchelli, “Modern Code Review: A Case Study at Google,” ICSE-SEIP 2018. https://sback.it/publications/icse2018seip.pdf

[5] J. Micco, “The State of Continuous Integration Testing at Google,” ICSE-SEIP 2017. Google TAP: 4.2M unique tests, 150M executions/day.

[6] A. Bosu, M. Greiler, and C. Bird, “Process Aspects and Social Dynamics of Contemporary Code Review,” IEEE TSE, 2017.

[7] F. P. Brooks Jr., The Mythical Man-Month, Addison-Wesley, 1975/1995. Coordination costs grow as N*(N-1)/2; at N=1, they are zero.

[8] T. Lelek and A. Skowroński, Vibe Engineering, Manning (MEAP), 2026, ISBN 9781633434363, Section 1.4.2: “Trust: a new kind of debt.” https://livebook.manning.com/book/vibe-engineering/welcome/v-2

[9] J. D. Lee and K. A. See, “Trust in Automation: Designing for Appropriate Reliance,” Human Factors, 46(1), 50-80, 2004. https://journals.sagepub.com/doi/10.1518/hfes.46.1.50_30392

[10] J. Y. Jian, A. M. Bisantz, and C. G. Drury, “Foundations for an Empirically Determined Scale of Trust in Automated Systems,” International Journal of Cognitive Ergonomics, 4(1), 53-71, 2000.

[11] T. T. Kessler et al., “Measurement of Trust in Automation: A Narrative Review and Reference Guide,” Frontiers in Psychology, 12, 604977, 2021. https://pmc.ncbi.nlm.nih.gov/articles/PMC8562383/

[12] DORA Team, “Fostering Trust in AI,” Google Cloud, 2024-2025. https://dora.dev/research/ai/trust-in-ai/

[13] DORA Team, “Accelerate State of DevOps Report 2023,” Google Cloud. https://dora.dev/research/2023/dora-report/

[14] T. Winters, T. Manshreck, and H. Wright, Software Engineering at Google, O’Reilly, 2020, Chapter 22: “Large-Scale Changes.” https://abseil.io/resources/swe-book

[15] L. Bainbridge, “Ironies of Automation,” Automatica, 19(6), 775-779, 1983.

[16] DORA Team, “Accelerate State of DevOps Report 2024,” Google Cloud. https://dora.dev/research/2024/dora-report/

[17] R. Solow, “We’d better watch out,” New York Times Book Review, July 1987.

[18] E. Brynjolfsson, D. Rock, and C. Syverson, “The Productivity J-Curve: How Intangibles Complement General Purpose Technologies,” American Economic Journal: Macroeconomics, 13(1), 333-372, 2021. https://www.nber.org/papers/w25148

[19] DORA Team, “Accelerate State of DevOps Report 2025,” Google Cloud. https://dora.dev/research/2025/dora-report/

[20] I. R. McKenzie et al., “Inverse Scaling: When Bigger Isn’t Better,” Transactions on Machine Learning Research, October 2023. https://arxiv.org/abs/2306.09479

[21] X. Chen et al., “Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs,” December 2024. https://arxiv.org/abs/2412.21187