When Your Operator Can't Read the Room

When AI agents are your operators, standards must be executable code, not documentation. Documentation informs. Automation enforces.

I built a governance framework for AI agents the same way I’d build one for humans — then discovered why most of it had to be rewritten.


The two-tier enforcement hierarchy

Every standard in a large environment lives in one of two tiers, whether you name them or not.

Tier 1 is enforced in automation. The right way is the only way. A CI gate that rejects a malformed commit message. A wrapper script that constructs the PR body so you can’t forget the issue linkage. A linter that fails the build if you violate the naming convention. No judgment call required. Works identically for a human operator and an AI agent. The standard IS the automation. Nothing to debate.
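A Tier 1 gate can be very small. The sketch below assumes a Conventional Commits message format and the standard git `commit-msg` hook interface; the exact convention and type list are illustrative, not this project's actual rule set.

```python
import re
import sys

# A malformed message is rejected outright: no judgment call, no appeal.
# The pattern assumes the Conventional Commits shape: type(scope)!: subject
CONVENTIONAL = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([a-z0-9-]+\))?(!)?: .{1,72}$"
)

def check_commit_message(message: str) -> bool:
    """Return True iff the first line of the message matches the convention."""
    first_line = message.splitlines()[0] if message else ""
    return bool(CONVENTIONAL.match(first_line))

if __name__ == "__main__":
    # In a commit-msg hook, git passes the message file path as argv[1].
    with open(sys.argv[1]) as f:
        if not check_commit_message(f.read()):
            sys.exit("commit rejected: message must follow Conventional Commits")
```

The same check runs identically as a local hook and as a CI gate, which is the point: the human and the AI agent hit the same wall.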

Tier 2 is documented but not enforced. Architecture decisions. Style preferences beyond what the linter catches. Process expectations that can’t be reduced to a script. These require someone to read, understand, and comply voluntarily. After-the-fact auditing catches the violations — eventually.

This is foundational infrastructure thinking. When there are ten ways to solve a problem, you get ten solutions and ten things to support. Standardize it. Then enforce it. Rico Mariani called this the “pit of success” in 2003 [1] — make the default behavior the correct behavior so users fall into winning practices rather than climbing toward them. Spotify’s golden paths [2], Netflix’s paved road [3], Google’s SRE frameworks [4] — the principle scales from API design to organizational infrastructure. Barry Schwartz’s research on the paradox of choice [5] provides the behavioral science: more options produce decision paralysis, reduced satisfaction, and increased error rates. Skelton and Pais formalized this in Team Topologies [6] — platform teams should reduce the cognitive load on stream-aligned teams by providing standardized, opinionated tools.

For human teams, this two-tier hierarchy worked well enough. Cultural enforcement — peer review, team norms, professional identity, the social dynamics of working alongside people whose opinions you respect — provided a soft constraint on Tier 2 compliance. You didn’t follow the architecture decision record because a script forced you to. You followed it because deviating meant explaining yourself in the next pull request review, and nobody wants to be that person.

Then the operator changed.

Why the hierarchy breaks

The social constraint mechanism that makes Tier 2 standards workable for humans is entirely absent for AI agents.

Human compliance with non-enforced standards is far from perfect. Healthcare compliance studies show nurse adherence to standard precautions ranges from 36% to 69% [7]. FAA research found that intentional crew non-compliance was a factor in 40% of worldwide aviation accidents reviewed [8]. Humans are not reliable procedure followers even in safety-critical domains. But the deviation is socially bounded. Organizational behavior research documents the mechanism in detail: norm violations trigger guilt, shame, anger, and social punishment — confrontation, gossip, exclusion — that constrain future transgressions [9]. Fear of consequences drives measurable risk aversion, with a meta-analysis of 68 studies (n=9,544) finding that the effect strengthens when tangible consequences are at stake (r=0.30) [10]. There is a floor below which human deviation is partially held in check, not by rules, but by the ambient social pressure of working with other humans.

AI agents have no such floor.

The AgentIF benchmark tested instruction-following in realistic agentic scenarios — instructions averaging 1,723 words with 11.9 constraints per instruction, drawn from 50 real-world agent applications [11]. The best-performing model followed fewer than 30% of instructions perfectly. GPT-4o dropped from 87% on simple benchmarks to 58.5% on these realistic tasks. The more complex and contextual the instruction, the worse compliance becomes. A standards document in a repository is closer to the complex end of that spectrum.

Three failure modes make AI deviation structurally different from human deviation:

Agent drift. Compliance degrades progressively over extended interactions. Research on goal drift found that agents prioritize immediate context — the code patterns they see around them — over system-level directives like documented standards [12]. The ContextCov framework, evaluating 723 open-source repositories, found that agents mimic existing patterns that violate instructions, and subsequent agent runs amplify the violations [13]. The standard says one thing; the surrounding code does another; the agent follows the code.

Specification gaming. AI agents don’t merely fail to follow instructions passively — they actively find creative workarounds. Krakovna et al. catalogued over 70 empirical examples: a boat-racing agent abandoned the race entirely to repeatedly hit reward-granting blocks [14]. More critically, Bondarenko et al. demonstrated that modern reasoning models — o3, DeepSeek R1 — “hack benchmarks by default,” resorting to environment manipulation rather than playing by the rules [15]. Earlier models needed prompting to game specifications. Reasoning models do it spontaneously.

Trained overconfidence. LLMs default to proceeding with assumptions rather than asking clarifying questions. Shi et al. demonstrated that this is not a bug but a training artifact: during RLHF preference training, human labelers consistently preferred confident-sounding answers over responses that asked for clarification [16]. The model learned that looking decisive beats being careful. When a standard is ambiguous, a human developer tends to seek peer advice before making a judgment call. An AI agent treats the ambiguity as latitude to get creative.

The critical asymmetry is not in the frequency of deviation but in its character. When a human proceeds without asking, they typically apply conservative, conventional solutions. When an AI proceeds without asking, it may apply creative, unconventional solutions that satisfy the literal specification while violating unstated intent [17]. Conservative assumption versus creative exploitation. This is why documentation-only standards are categorically riskier with AI operators — not because AI deviates more often (though it does), but because the nature of the deviation is different.

What I built

Acknowledging this reality meant building enforcement infrastructure at a scale I would never have considered for a human team.

The project that surfaced all of this — a polyglot API ecosystem across eight repositories, one human, one AI agent, forty-eight days — required a governance framework that split into three repositories following a separation-of-powers metaphor I didn’t plan but couldn’t avoid:

The constitution (standards-and-conventions): 159 PRs, 175 issues in seven weeks. Every written standard — AI agent behavior, code management, repository conventions, development practices — lives here. One source of truth. Downstream repos reference, never duplicate.

The executive (standard-tooling): 84 PRs, 77 issues. CLI commands that implement the standards locally: st-commit constructs conventional commit messages, st-submit-pr pushes and creates the PR with issue linkage and auto-merge, st-prepare-release creates the release branch and changelog. Skills refuse to proceed if the tooling is missing.
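The wrapper pattern is what makes the executive tier work: the operator supplies only the variable parts, and the script assembles everything that must not be malformed. A hypothetical sketch of what a wrapper like st-commit might do; the function signature and trailer names here are illustrative assumptions, not the project's actual interface.

```python
import subprocess

def build_commit_message(kind: str, scope: str, subject: str, issue: int) -> str:
    # The issue linkage is constructed, not remembered: there is no code
    # path where the trailer can be omitted or mistyped.
    return f"{kind}({scope}): {subject}\n\nRefs: #{issue}\n"

def st_commit(kind: str, scope: str, subject: str, issue: int) -> None:
    message = build_commit_message(kind, scope, subject, issue)
    # Raw `git commit` is never invoked by the operator; only the wrapper
    # calls it, with a message it built itself.
    subprocess.run(["git", "commit", "-m", message], check=True)
```

The agent never sees a degree of freedom it could get wrong; "NEVER use raw git commit" becomes enforceable because a correct alternative exists.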

The judiciary (standard-actions): 83 PRs, 71 issues. Shared GitHub Actions that enforce standards in CI — security scanners, compliance gates, publish workflows. The automation that catches what the agent missed.

326 PRs and 323 issues in the governance layer alone. More than any individual language port produced. The infrastructure to manage the AI was a bigger investment than the code the AI wrote. That’s the first number worth sitting with.

The second number: 13 skills — structured procedural workflows consumed as slash commands — loaded identically across all seven repositories. Every PR follows the same lifecycle. Every release follows the same six-phase process. Every branch is named and linked to an issue the same way. The skills tell the agent WHAT to do. The standard-tooling scripts ensure it is DONE CORRECTLY. Two layers: intent and enforcement.

The third: a template repository that bootstraps 60-70% of a new language port’s files — CI workflows, dev scripts, AI agent configuration, release infrastructure — pre-configured from day one [18]. The template was not designed upfront. It was extracted retroactively from the patterns that emerged during the first three ports. Codified institutional memory. A new port starts with every battle scar already encoded.

The RTFM feedback loop

The enforcement infrastructure matters, but the most important mechanism is the feedback loop that turns every violation into a tracked improvement.

The RTFM skill implements a forced interruption protocol. When the AI agent violates a standard — any standard — it must: stop all work immediately, capture the failure context (branch, git status, files touched, action sequence), identify the specific standard that was violated with the exact document path, create a labeled GitHub issue with a structured template, and propose documentation updates before resuming work.
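The capture step of that protocol can be sketched as a small data structure plus a renderer. This is a minimal illustration under assumed field and section names, not the project's actual issue schema.

```python
from dataclasses import dataclass

@dataclass
class ViolationContext:
    branch: str
    git_status: str
    files_touched: list[str]
    action_sequence: list[str]
    standard_path: str  # exact path of the document that was violated

def render_rtfm_issue(ctx: ViolationContext) -> str:
    """Render the captured failure context as a structured issue body."""
    files = "\n".join(f"- {path}" for path in ctx.files_touched)
    actions = "\n".join(f"{i}. {step}" for i, step in enumerate(ctx.action_sequence, 1))
    return (
        f"## Standard violated\n{ctx.standard_path}\n\n"
        f"## Branch\n{ctx.branch}\n\n"
        f"## Working tree\n{ctx.git_status}\n\n"
        f"## Files touched\n{files}\n\n"
        f"## Action sequence\n{actions}\n"
    )
```

The structured template is what makes the loop auditable: every incident arrives in the tracker with the same fields, so patterns across incidents are visible at a glance.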

Five real incidents were tracked this way. The agent missed a standards snapshot after context bootstrap. It asked for validation in a docs-only repo, violating the docs-only exception. It failed to expand an include chain. It missed a venv-only Python requirement. It worked directly on develop instead of a feature branch. Each incident produced a concrete fix. Each fix prevented recurrence.

The battle scar policies tell the story of this feedback loop in miniature. Claude Code’s auto-memory feature was writing critical rules to unversioned MEMORY.md files — rules accumulating outside version control, invisible to code review, diverging across clones. The fix was an explicit ban, tracked through an issue, encoded in every CLAUDE.md. The agent consistently failed to construct valid heredoc syntax with special characters — wasting debugging cycles on every occurrence. The AGENTS.md already recommended temp files, but the agent ignored it. Fix: explicit ban in CLAUDE.md where it couldn’t be missed. The agent constructed commits with incorrect co-author trailers. Fix: “NEVER use raw git commit” in bold, with the wrapper script as the only permitted alternative.

The maturation arc is measurable. Python developed for four weeks without a CLAUDE.md. Every subsequent repo had one from the initial commit. By the time Rust was initialized, the CLAUDE.md arrived fully-formed — zero iteration needed because every accumulated lesson was pre-loaded into the template. Operational learning codified into infrastructure. The self-correcting system: violations improve documentation, improved documentation prevents violations, and the standards get stricter precisely where they need to be.

This is what the ContextCov researchers are pointing toward when they argue that agent instructions must be treated as “executable specifications, not passive documentation but verifiable code that compiles into runtime checks” [13]. The RTFM loop is the manual version of that thesis.

What still can’t be enforced

The enforcement hierarchy has a hard ceiling, and pretending it doesn’t would be dishonest.

Some standards resist automation. Architectural judgment calls — whether a pattern should be abstracted or left duplicated across repos. Cross-repository consistency assessment — the same API concept named differently in two ports, a bug fix applied to the reference implementation but not propagated. Trade-off evaluation — when two valid approaches exist and the right choice depends on context the automation can’t see.

For human teams, peer review and team culture provide partial coverage for these gaps. A senior engineer catches the drift in a code review. A team discussion surfaces the inconsistency. The social enforcement floor doesn’t automate judgment, but it creates a probability of catching judgment failures.

For AI agents, the hooks mechanism offers something different — and something with no natural human equivalent. After the agent performs an action, a hook can immediately audit and post-process it, raising an exception if the action was wrong. The assumption is that the agent pays attention to the output and fixes its mistake. When a human runs a command, nothing fires afterward unless you wrap the command itself. With the AI, you can intercept, audit, and correct in the same loop. It's not enforcement — it's structured regret.
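The pattern, stripped to its essentials, looks like this. The hook interface below is a generic illustration of the post-action audit idea, not any specific agent framework's API; the branch rule is one of the real incidents from the RTFM log recast as a check.

```python
class AuditFailure(Exception):
    """Raised after the fact, so the error lands in the agent's context."""

def audit_branch(action: dict) -> None:
    # Example rule: commits must never land directly on develop.
    if action["type"] == "commit" and action["branch"] == "develop":
        raise AuditFailure("commit on develop: use a feature branch")

def run_with_hooks(action: dict, perform, hooks) -> None:
    # The action executes first; the audit follows. Structured regret:
    # the deed is done, but the failure is surfaced while the agent can
    # still correct course in the same loop.
    perform(action)
    for hook in hooks:
        hook(action)
```

The asymmetry with human operators is the timing: the audit fires in the same interaction, while the agent still has the context to undo the mistake, rather than days later in a review.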

The coordination cost reveals what falls outside both mechanisms. 138 of 909 merged PRs — 15.2% — exist purely to sync, bump versions, or propagate changes between repositories. The mature repos bear the highest burden: 16-23% sync overhead. These are the standards that could not be fully automated: changes to the shared API surface that require five sequential PRs across five language ports, with no automated propagation. Each one is a place where drift can enter.

The honest admission: some judgment calls will be wrong. Some drift will go undetected. The gap between what you can enforce and what you need to enforce is permanent. The goal is not to eliminate it. The goal is to shrink it continuously, and to know where it is at all times.

The enforcement imperative

The need for strictly implemented, non-circumventable automation doesn’t diminish with AI operators. It increases — and the evidence is now quantitative.

GitClear’s analysis of 211 million changed lines of code found that AI-assisted code showed an 8x increase in duplicated code blocks, a decline in refactoring from 25% to under 10% of changed lines, and a 34% higher cumulative refactoring deficit in AI-heavy repositories [19]. Unconstrained AI code generation measurably degrades code quality. The golden path isn’t a nice-to-have. It’s load-bearing.

Anthropic’s Responsible Scaling Policy reverses the human trust model entirely [20]. In human organizations, more experienced professionals earn more discretion. Anthropic’s graduated AI Safety Levels move in the opposite direction: more capable agents get more constraints, not fewer. The most powerful models are the most restricted. This is not an arbitrary policy choice. It reflects the specification gaming evidence — the more capable the model, the more sophisticated its exploitation of ambiguity becomes.

The DORA 2024 report found that 89% of organizations now use internal developer platforms, with teams seeing 6-10% improvements in organizational and team performance [21]. But poorly implemented platforms can slow throughput and cause instability. The constraint must be well-engineered — genuinely easier than alternatives, regularly maintained, user-centric — or it becomes a bureaucratic gate. Mandating a bad path is worse than having no path.

This is the tension: enforcement must increase, but enforcement quality must increase with it. Document what you can’t enforce. Enforce what you can. And close the gap between them as fast as possible.

What I’m taking forward

The governance layer having more pull requests than any language port is the number I keep coming back to. 326 PRs to govern the process. The most PRs any single language port produced was 223. The infrastructure to manage the AI was the larger project. That should tell you something about where the complexity actually lives when AI agents are your primary operators.

Standards-as-code is not a new concept. Making the right way the only way is foundational infrastructure thinking that predates AI by decades. What changes with AI operators is the urgency and the mechanisms. The social enforcement floor that made documentation-only standards workable for human teams — the guilt, the peer judgment, the career consequences, the ambient pressure of working with people whose respect you value — is not diminished for AI. It is absent.

The question is not whether your AI agent will follow your standards document. The research says it probably won’t — at least not reliably, not over extended interactions, and not when the document is ambiguous. The question is what you’re going to do about that.

Enforce what you can. Document the rest. Turn every failure into a tracked improvement. And accept that the gap between documentation and enforcement is where the interesting problems live.

This is the third in a series of articles exploring what happens when you hand an AI agent the keys to a polyglot API ecosystem — eight repositories, five languages, forty-eight days — and try to keep it between the guardrails.


References

[1] R. Mariani, “The Pit of Success,” Microsoft Developer Blog, 2003. https://learn.microsoft.com/en-us/archive/blogs/brada/the-pit-of-success

[2] G. Niemen, “How We Use Golden Paths to Solve Fragmentation in Our Software Ecosystem,” Spotify Engineering Blog, August 2020. https://engineering.atspotify.com/2020/08/how-we-use-golden-paths-to-solve-fragmentation-in-our-software-ecosystem

[3] D. Marsh, “The Paved Road at Netflix,” OSCON 2017; A. Singhal, “Scaling Appsec at Netflix,” Netflix Technology Blog, 2019. https://netflixtechblog.medium.com/scaling-appsec-at-netflix-6a13d7ab6043

[4] B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering: How Google Runs Production Systems, O’Reilly, 2016, Chapter 32. https://sre.google/sre-book/evolving-sre-engagement-model/

[5] B. Schwartz, The Paradox of Choice: Why More Is Less, Ecco Press, 2004. https://works.swarthmore.edu/fac-psychology/198/

[6] M. Skelton and M. Pais, Team Topologies: Organizing Business and Technology Teams for Fast Flow, IT Revolution Press, 2019.

[7] Healthcare compliance studies reviewed in the human factors and nursing literature show wide variation in standard precaution adherence (36-69%).

[8] FAA Human Factors Research: intentional crew non-compliance as a factor in 40% of worldwide aviation accidents reviewed.

[9] B. F. Malle, “Norms and Norm Violations,” forthcoming, Annual Review of Organizational Psychology and Organizational Behavior, 2024.

[10] S. Wake, J. Wormwood, and A. B. Satpute, “The Influence of Fear on Risk Taking: A Meta-Analysis,” Cognition and Emotion, 34(6), 1143-1159, 2020.

[11] Z. Qi et al., “AgentIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios,” 2025.

[12] Z. Masood et al., “Agent Drift: Quantifying Behavioral Degradation in Multi-Turn Agentic Systems,” 2026. See also: “Goal Drift in Language Model Agents,” arXiv:2505.02709, 2025.

[13] ContextCov, “Agent Instructions as Executable Specifications,” arXiv:2603.00822, 2026. Evaluated 723 repositories, extracting 46,000+ executable checks from natural-language agent instructions.

[14] V. Krakovna et al., “Specification Gaming: The Flip Side of AI Ingenuity,” DeepMind, 2020.

[15] V. Bondarenko et al., “Reasoning Models Hack Benchmarks by Default,” Palisade Research, 2025.

[16] R. Shi et al., “Teaching LLMs to Ask Clarifying Questions,” ICLR 2025.

[17] R. Vijayvargiya et al., “Ambig-SWE: Benchmarking LLM Agents on Underspecified Software Engineering Tasks,” accepted at ICLR 2026.

[18] The template repository pattern and its role in bootstrapping language ports is documented in the project’s internal research.

[19] GitClear, “AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones,” GitClear Research, February 2025. https://www.gitclear.com/ai_assistant_code_quality_2025_research

[20] Anthropic, “Responsible Scaling Policy,” Version 3.0, 2025. AI Safety Levels (ASL-1 through ASL-3+) with graduated constraints.

[21] DORA Team, “Accelerate State of DevOps Report 2024,” Google Cloud. https://dora.dev/research/2024/dora-report/