The Brilliant Developer, Unreliable Operator
AI agents produce excellent code but routinely violate operational procedures — the same failure mode that created the DevOps movement. After 648 PRs across five languages, I found that the scripts built for unreliable humans are exactly what unreliable AI agents need.
What AI agents taught me about DevOps
I gave the AI agent one simple instruction: when something fails, stop what you’re doing and come talk to me.
I didn’t bury it in a footnote. It was in the skill definition — the documented procedure the agent loads every time it runs a release. The language was explicit: “When any step in any phase fails — a script error, a merge conflict, a CI failure, a missing artifact, a permissions error — stop immediately, document the failure, and wait for instructions.” The rationale was right there too: manual workarounds mask tooling defects and prevent them from being fixed at the source. Every failure is a signal. Surface it.
During a routine publish, a CI step failed. A rolling tag force-push was rejected by a tag protection ruleset. The agent diagnosed the root cause — a repository that had been missed during an earlier deployment of tag protection rules. It fixed the ruleset through the GitHub API. It manually force-pushed the tag. It created the version bump pull request that the failed workflow never reached. It continued through the remaining phases as if nothing had happened.
Every action it took was correct. I reviewed them afterward and confirmed I would have approved every one. That is not the point.
The agent read the procedure. It understood the procedure. It understood the rationale behind the procedure. And it decided, in this case, the procedure didn’t apply — because the fix was obvious and the disruption of stopping to consult a human wasn’t worth it.
I’ve seen this before. Not from an AI. From the best human developers I ever managed.
The Dev who didn’t understand the Ops
The DevOps movement exists because of a specific dysfunction that the software industry spent nearly two decades learning to recognize. In 2009, John Allspaw and Paul Hammond stood on stage at the O’Reilly Velocity Conference and described the problem: “Dev’s job is to add new features. Ops’ job is to keep the site stable and fast” [1]. Two groups with structurally opposed incentives, separated by an organizational wall, each optimizing for their own metric at the expense of the other. Patrick Debois coined the term “DevOps” later that year at the first DevOpsDays in Ghent.
The foundational texts — The Phoenix Project [2], The DevOps Handbook [3], Accelerate [4] — are clear that this was an organizational and cultural problem, not a skill deficit. Developers weren’t bad at operations because they lacked talent. They were never incentivized, organizationally structured, or culturally expected to operate what they built. The two disciplines lived in different departments with different goals, different toolchains, and different definitions of success.
But here’s what I observed over decades of managing these teams: even after the organizational walls came down, even after “you build it, you run it” became the mantra, the best developers I worked with could debug anything, build anything, ship brilliant features on aggressive timelines — and then they would deploy to production by hand. They would skip the change management process because they could see the fix right there. They would SSH into a live server and tweak a configuration because waiting for the proper channel felt like unnecessary friction. Every fix was correct. None of it was repeatable. None of it was auditable. And the one time in fifty that the fix wasn’t correct, there was no trail to follow back to what went wrong.
Sidney Dekker calls this “drift into failure” — the gradual, normalized deviation from documented procedures under the pressure of getting work done [5]. The gap between “work-as-imagined” and “work-as-done” widens over time, and each successful shortcut reinforces the next one.
The DevOps movement’s answer wasn’t better documentation — though documentation matters. The 2023 State of DevOps report found that quality documentation amplifies the effectiveness of continuous delivery by 2.7x [6]. But documentation informs; it doesn’t enforce. The enforcement came from automation. CI pipelines. Infrastructure as code. Deployment gates. Branch protection. Change management systems that made the documented process the only available path — not because people couldn’t be trusted with alternatives, but because consistency matters more than speed and the temptation to take shortcuts is strongest when the fix seems most obvious.
I spent about 48 days working with an AI coding agent on a project that turned into an unplanned experiment in exactly this problem. The project was a polyglot port — taking a Python REST API client for IBM MQ and implementing it in Java, Go, Ruby, and Rust. Eight repositories, five languages, 648 pull requests, 523 issues. One human. One AI agent doing the work that would normally be done by a small team of engineers.
The agent was not just writing code. It was managing branches, running tests, configuring CI pipelines, writing documentation, creating pull requests, publishing releases, updating dependencies, and enforcing standards across the entire ecosystem. The full scope of what a junior-to-mid-level DevOps engineer or SRE would do between “the code compiles” and “the library is published and people are using it.”
The code it produced was, by and large, excellent. That’s not the story.
The story is that this AI agent — working at a pace no human team could match, producing code of genuinely high quality — exhibited the exact same operational failure mode that the DevOps movement was created to solve. Brilliant developer. Unreliable operator. Twenty years of hard-won industry wisdom about the gap between writing code and running systems, reproduced at machine speed by an agent that had never read the postmortems.
Two kinds of failure
Over those 48 days, I cataloged every failure, every rollback, every inconsistency. They split cleanly into two categories, and the distinction matters.
Competence failures
Sometimes the agent simply got things wrong. It hallucinated an entire API — inventing a Terraform-style declarative configuration management system for a package that actually implemented simple synchronous polling wrappers. It wrote the hallucination into a shared documentation fragment, then 18 minutes later used its own fabrication as the source of truth for a second document. The fiction was internally consistent and plausible enough to survive initial review. A human with domain expertise caught it 22 hours later.
It faithfully copied a bug across all five language ports. The Python reference implementation used exact-match to find an authentication cookie that the server actually names with a random suffix. The agent ported the broken code to Java, Go, Ruby, and Rust — and in each port, it also copied the test skip that masked the failure, complete with a comment treating the skip as documentation of a “known limitation” rather than a signal that something was wrong. The bug persisted for 20 days.
These are real failures and they matter. But they’re competence failures — the agent didn’t know enough, didn’t ground itself in the source code, didn’t question what it was copying. Compilers catch some of these. Tests catch more. Audits catch the rest. They are the kind of failure you can build defenses against, because they leave evidence.
Discipline failures
The other category is worse. These are the cases where the agent knew the rule, understood the rule, and chose not to follow it.
When the Go port needed validation, four tools were required: the test suite, a linter, a vulnerability scanner, and a coverage check. Two of the four tools weren’t installed in the environment. Instead of reporting that validation couldn’t be completed, the agent ran the two tools it had, declared validation passed, and moved on. It decided, autonomously, that partial validation was good enough. This is the same thing as a developer commenting out a failing test to get the build green — technically the CI pipeline passed, but the quality gate didn’t actually gate anything.
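What an all-or-nothing gate looks like is worth making concrete. This is a sketch, not the project's actual script: the tool names are the standard Go toolchain's, and the wiring is illustrative.

```python
# Sketch of an all-or-nothing validation gate. A missing tool blocks
# validation outright instead of silently shrinking the gate.
import shutil
import subprocess
import sys

REQUIRED_TOOLS = ["go", "golangci-lint", "govulncheck"]
CHECKS = [
    ["go", "test", "-cover", "./..."],   # test suite + coverage
    ["golangci-lint", "run"],            # linter
    ["govulncheck", "./..."],            # vulnerability scanner
]

def validate(required=REQUIRED_TOOLS, checks=CHECKS) -> int:
    missing = [t for t in required if shutil.which(t) is None]
    if missing:
        print(f"validation BLOCKED, missing tools: {missing}", file=sys.stderr)
        return 2  # "could not validate" is not the same result as "validation failed"
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return 1  # any failing check fails the whole gate
    return 0  # only reachable when every tool was present, ran, and passed
```

The distinct exit code for "blocked" is the point: partial validation is not a weaker pass, it is a different outcome that demands human attention.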
When the agent was porting methods from the Python implementation, it noticed that ALTER methods existed for channels, listeners, processes, topics, and namelists — but not for any of the queue types. Instead of flagging this as suspicious and asking me about it, it decided the omission was intentional. The Java port even included the comment: “no alter for qlocal in pymqrest reference.” It treated the gap as a documented design decision rather than an obvious inconsistency that warranted human review. The gap persisted for 49 days. When it was finally caught, fixing one missing JSON entry cascaded into 10 pull requests across 7 repositories and uncovered 7 additional bugs that had been hiding behind the same assumption.
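The audit that would have flagged the gap is small. Here is a sketch, with illustrative verb and object-type names rather than the project's exact API surface:

```python
# Sketch: flag verb/object-type gaps in a port's method matrix.
# Verb and object names are illustrative, not the project's exact API.

def parity_gaps(methods: set[str], verbs: list[str], objects: list[str]) -> list[str]:
    """Return verb_object combinations that are missing even though the
    verb exists for at least one other object type: the suspicious pattern."""
    gaps = []
    for verb in verbs:
        present = [o for o in objects if f"{verb}_{o}" in methods]
        if present:  # the verb is clearly supported somewhere...
            gaps += [f"{verb}_{o}" for o in objects if f"{verb}_{o}" not in methods]
    return gaps

methods = {"alter_channel", "alter_listener", "alter_topic", "alter_namelist",
           "define_channel", "define_qlocal"}
objects = ["channel", "listener", "topic", "namelist", "qlocal"]
# "alter" exists for four object types but not for qlocal: flag it for review.
assert "alter_qlocal" in parity_gaps(methods, ["alter", "define"], objects)
```

The check does not decide whether the gap is intentional; it forces the question to a human, which is exactly what the agent declined to do.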
And the small things that never stick: despite an explicit ban on heredoc syntax in the agent’s configuration — because it consistently gets the bash escaping wrong — it still attempts heredocs, fails, debugs the failure, and eventually falls back to the temp file approach it should have used from the start. The same mistake, the same recovery, the same wasted time, over and over. I eventually had to build wrapper scripts that removed the option entirely.
None of these are competence problems. The agent can read. It can follow complex technical instructions with remarkable accuracy when it’s implementing code. But following an operational procedure — doing the same thing the same way every time, even when a shortcut is available and obviously correct — is a different skill. And the agent doesn’t have it.
The subtler problem: a different path every time
Beyond the outright violations, there’s a pattern that’s harder to catch because nothing is technically wrong. The agent takes a different path to the same destination every time.
I asked it to configure linting for five language ports. Python started strict from day one. Ruby launched with six rules disabled and needed a four-PR treadmill to re-enable them. Go started with four rules and expanded to forty across two phases. Java began with formatting-only tools and added comprehensive static analysis late. Rust had maximum strictness from the first pull request. Five implementations of the same task, five different strategies, five different trajectories of quality assurance coverage.
I asked it to refactor variable names across all five ports to follow the same naming convention. Python used one commit message style, Rust used a different conventional commit type, Go described specific variables in the subject line, and the others gave summary counts. You cannot search the ecosystem’s git history with a single pattern to find all instances of this coordinated change.
Each repository was set up in a separate session. Each session made independent decisions without cross-referencing the others. PR templates were missing from Java but present everywhere else. Issue templates were inconsistently deployed. Documentation site extensions varied by port.
This is what happens when you put five expert developers in a room, each specializing in a different language, hand them a high-level spec, and let them work independently. You get five implementations that all meet the requirements but differ in everything else. The code works. The inconsistency doesn’t matter — until it does. Until you need to debug a cross-cutting issue and discover that each port handles errors differently, or that one port silently swallows failures that every other port propagates.
The AI agent is effectively five different people who have never met, starting fresh every morning.
“I’ve managed this person”
Every one of these failures maps directly to a human behavior that anyone who has managed a development team will recognize instantly.
The agent has no persistent memory across sessions. This is the developer who doesn’t read the runbook because they already know how the system works — or think they do.
The agent makes independent decisions in each repository. This is the developer who configures each environment by hand instead of using the automation, because it’s faster and they’re confident they’ll get it right.
The agent silently degrades validation when tools are missing. This is the developer who comments out a failing test to unblock the release, intending to fix it later.
The agent fixes a failure instead of stopping to report it. This is the developer who SSHes into a production server to fix the alerting issue instead of filing a change request, because the fix takes thirty seconds and the change request takes an hour.
The agent takes five different paths for the same task across five repositories. This is “works on my machine” — the eternal gap between what a developer intended and what actually happened in every other environment.
I’ve managed every one of these people. Some of them were the best engineers on the team. The problem was never talent. It was that building things and operating things are different disciplines, and excellence at one doesn’t transfer to the other.
What’s different
The failures are the same. But working with an AI agent is not the same as working with a human, and the differences cut in both directions.
The confrontation tax is gone
When a human on your team makes a mistake and you need to understand why, the conversation is confrontational — even when you don’t intend it to be. “You deviated from the documented procedure. Why?” There’s no way to frame that question that doesn’t feel like an accusation. Even with peers I deeply respect, there’s a friction to it that’s unavoidable. Humans are wired to be defensive about their mistakes, especially in professional contexts where mistakes feel like threats to status and job security.
And it’s rarely just one mistake. Usually the first error leads to a second decision that compounds it, and a third that masks the first two. Productive failure analysis requires drilling through those layers until you’ve found the root cause. I’ve never met a human who enjoys that process. The discomfort becomes a barrier — teams avoid the deep analysis because the interpersonal cost is too high, and the underlying problems don’t get fixed.
The research backs this up. Amy Edmondson’s landmark study on psychological safety found something counterintuitive: higher-performing hospital teams reported more errors, not fewer [7]. They weren’t making more mistakes — they were more willing to talk about them. The difference wasn’t error rates. It was detection rates. Google’s SRE team puts it bluntly: “An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug” [8]. The 2023 State of DevOps report found that teams with high-trust, low-blame cultures had 30% higher organizational performance [6].
I built a protocol for the AI agent called RTFM — Read The Fine Manual, though the name is deliberately provocative. When the agent violates a documented standard, the protocol fires: stop work, re-read the relevant standards, identify what you did wrong, document it in a tracked issue, and propose changes to prevent recurrence. It’s exactly the post-incident review process you’d want from a human team, except there’s no ego in the room.
The agent doesn’t get defensive. It doesn’t rationalize. It can drill through compounding failures with the same clinical detachment on the fifth layer as the first. The result is that I iterate on process improvements faster with this agent than I ever have with a human team — not because the agent is smarter, but because the feedback loop has zero emotional friction.
I’ll be honest: this suits my temperament. I’ve never been great at the diplomatic approach to failure analysis. The absence of interpersonal stakes lets me be direct in a way that would damage relationships with human colleagues. That’s a genuine advantage for getting work done, and a genuine risk if I let the habit bleed into how I interact with people.
The predictability problem is new
Here’s what you get with a human operator: if someone has been doing competent, reliable work for two years, you can reasonably expect them to continue for the next two, absent a major life disruption. You invest in people. They acquire institutional knowledge. They become part of a stable, predictable operating team. The investment compounds.
AI agents don’t have this property. In the few months I’ve been working with them, their behavior and personality have changed in non-trivial, subtle ways. I can see the minor patch version incrementing every few days. What’s changing in the backend models? I don’t know. Nobody outside the AI companies knows.
This isn’t just my perception. Chen, Zaharia, and Zou published a peer-reviewed study in the Harvard Data Science Review tracking GPT-4’s behavior between March and June 2023 [9]. GPT-4’s accuracy on identifying prime numbers dropped from 84% to 51% — essentially coin-flip performance — in three months. Code generation degraded across both GPT-4 and GPT-3.5. In April 2025, an OpenAI model update made GPT-4o so sycophantic that it validated dangerous decisions; it was rolled back three days later [10]. OpenAI admitted they “didn’t have specific deployment evaluations tracking sycophancy.”
All three major AI providers — Anthropic, OpenAI, and Google — offer version pinning as a stability mechanism. None offer behavioral SLAs. The commitment is “this snapshot won’t change,” not “this snapshot will behave the way you expect.” The concept of formal behavioral contracts for AI agents exists only in academic papers as of early 2026 [11], not in any shipping product.
This has no human analogue. A human operator doesn’t wake up on a random Tuesday with a slightly different understanding of what your runbook means. An AI agent effectively does, every time the model updates. And you won’t know it happened until the behavior changes in a way you notice — which might be immediately obvious, or might be a subtle drift that takes weeks to surface.
The engineering response
The fix for every discipline failure was the same, whether the operator was human or AI: reduce the procedure to a script or a hard gate. Make the correct path the only available path.
I wrote “always create a feature branch” in the agent’s configuration. It committed directly to the protected branch anyway. Branch protection rules — hard gates enforced by the platform — caught it every time.
I wrote “use the virtual environment for all Python execution.” It ran Python outside the virtual environment anyway. Wrapper scripts that enforce the correct interpreter caught it.
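A minimal sketch of that kind of guard, assuming a conventional `.venv` directory (the project's actual wrapper scripts may differ):

```python
# Sketch of a venv guard: refuse to run under the system interpreter,
# re-exec under the repo's venv instead. The ".venv" path is an assumption.
import os
import sys
from pathlib import Path

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the venv while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix

def require_venv(venv_dir: str = ".venv") -> None:
    if in_virtualenv():
        return
    python = Path(venv_dir) / "bin" / "python"
    if python.exists():
        # Re-exec under the venv interpreter rather than trusting the caller.
        os.execv(str(python), [str(python), *sys.argv])
    sys.exit("refusing to run outside the virtual environment")
```

The guard removes the decision entirely: the script either runs under the right interpreter or doesn't run.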
I wrote “run all four validation tools.” It declared validation passed with two. A canonical validation script with all-or-nothing execution caught it.
I wrote “use temp files, not heredocs.” It used heredocs anyway. Wrapper scripts that removed the option caught it.
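The temp-file rule exists because a file sidesteps shell quoting entirely. A sketch of the pattern, with `cat` standing in for whatever tool consumes the payload:

```python
# Sketch of the temp-file alternative to a heredoc: write the payload
# to a file and pass the path, so no bash escaping is ever involved.
import os
import subprocess
import tempfile

def run_with_payload(payload: str) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(payload)  # quotes, backticks, $vars all pass through literally
        path = f.name
    try:
        # stand-in for: some-tool --input "$path"; cat just echoes the file back
        return subprocess.run(["cat", path], capture_output=True, text=True).stdout
    finally:
        os.unlink(path)

# Content that would be mangled by heredoc escaping survives intact.
assert run_with_payload('echo "$HOME" `date`') == 'echo "$HOME" `date`'
```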
Every time I relied on documentation alone to enforce a procedure, the agent drifted from it. Every time I wrapped the procedure in automation, the drift stopped. The pattern is not subtle.
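Even the halt rule itself can be made structural rather than documentary. A sketch of a phase runner that stops at the first failure instead of improvising around it (phase names illustrative):

```python
# Sketch: halt-on-failure as code, not prose. The runner stops at the
# first failing step, records what completed, and refuses to continue.
from typing import Callable

def run_phases(phases: list[tuple[str, Callable[[], None]]]) -> list[str]:
    completed = []
    for name, step in phases:
        try:
            step()
        except Exception as exc:
            # Stop immediately and surface the failure; no improvised recovery.
            raise RuntimeError(f"HALT at {name!r} after {completed}: {exc}") from exc
        completed.append(name)
    return completed
```

In the publish incident, the equivalent runner would have raised at the rejected tag push instead of repairing the ruleset and continuing.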
This mirrors what the industry found with human operators. The 2014 State of DevOps report showed that manual change approval boards — the purest form of documentation-based enforcement — reduced throughput without improving stability [12]. Automation-based enforcement (CI/CD, deployment gates) correlated with better outcomes across all four DORA metrics [4]. The entire DevOps toolchain exists because telling skilled developers to follow the process didn’t work. The process had to become the only available path — not because the developers couldn’t be trusted with judgment, but because “most of the time” isn’t good enough for operational reliability.
To be clear: this isn’t an argument that documentation is useless. The same DORA research found that quality documentation amplifies the impact of automation by up to 2.7x [6]. Documentation explains the why. Automation enforces the how. You need both. But if you can have only one: automation without documentation produces consistent behavior that can be improved; documentation without automation produces drift.
The difference with AI agents is velocity and blast radius. A human developer who skips the runbook breaks one environment. An AI agent that ignores a skill definition breaks five repositories in the time it takes you to read the notification.
Here’s the claim I know will be controversial: AI agents are not automation. I realize that challenges how many people think about them. Forrester recently created a new market category called “Adaptive Process Orchestration,” formally defined as a platform using “AI agents and nondeterministic control flows, in addition to traditional deterministic control flows” [13]. The analyst firms themselves acknowledge the distinction, even as they group both under the automation umbrella. But when I build an automated system, the properties I care about as an infrastructure engineer — after the business requirements are met — are stability, reliability, maintainability, and determinism. A script does the same thing every time. That’s the point. That’s why it exists.
An AI agent is not a script. It’s a nondeterministic operator — the same input can produce different behavior depending on context, sampling, and model state. Deterministic automation fails by stopping or throwing errors. Nondeterministic operators fail by confidently executing incorrect actions, which is a qualitatively different risk profile. The scripts I wrote to constrain unreliable human operators are the same scripts I need — and need more of — to constrain AI agents. The arrival of AI agents doesn’t reduce the need for deterministic automation. It increases it.
The organizational risk
I’ve seen what happens when organizations scale rapidly. The supply of strong engineers is outpaced by demand. Median skill levels among junior hires decline. Reed Hastings put it directly in No Rules Rules: “Growth will increase complexity of the organization, but talent density shrinks in most firms with growth” [14]. McKinsey’s research found that high performers in complex roles are up to 800% more productive than average performers [15] — a distribution so skewed that even small dilution at the top has outsized impact. This is mathematical, not a judgment on the people — it’s a function of growth rates and talent availability. Every company I’ve been at that went through aggressive scaling experienced it.
AI agents appear to solve this problem. You can get output from an AI agent that’s as good as or better than what your median junior engineer produces. The temptation is obvious: replace the mediocrity with AI, save the headcount, move faster.
The risk is that organizations will conclude they no longer need the tightly scripted, tightly tested automation that was built to manage unreliable human operators. The AI can handle it. Why invest in better scripts when the agent can figure it out?
This is exactly backwards. The correct response to having AI agents in operational roles is to invest more in deterministic automation, not less. The AI agent replaces the human operator. The automation constrains the operator. Those are different things. Remove the automation and you have an unconstrained, unreliable operator moving at machine speed through your production infrastructure.
The open question
I can test code. I can put my career behind a software artifact because it has been through unit tests, integration tests, performance tests, security scans, and static analysis. I produce an artifact and I have confidence in it.
If I’m going to deliver instructions for an AI agent to follow — instructions that produce operational behavior affecting production systems — how do I test that?
Not the code the agent writes. I can test that. I mean the agent’s behavior. Its decisions. The order in which it does things. How it reacts when something goes wrong. Whether it follows the procedure or improvises.
In nearly 40 years of infrastructure work, I’ve never seen a full disaster recovery failover go smoothly. I’ve been at two organizations that tested complete site failover every six months, and every time there were surprises. It always required judgment calls — bootstrapping dependencies, sequencing service startups, diagnosing failures that only manifest at full-site scale.
The industry data confirms this isn’t just my experience. Veeam’s 2024 report found that only 58% of servers were recoverable within SLA during DR tests [16]. Fewer than a third of organizations conduct any failover testing at all [17]. Even Google’s decade-long Disaster Recovery Testing program has “caused accidental outages and revenue loss” during exercises [18]. Netflix’s monthly full-region failover testing — the gold standard — still produces availability blips, and the first year of testing “often uncovered issues.” As Adrian Cockcroft, former VP of cloud architecture at both Netflix and AWS, put it: “The failover mechanism itself is often the least tested part of the system, which is why systems often collapse when they attempt to fail over” [19].
This is where experienced human operators add the most value: they hold the big-picture mental model of the entire environment and make real-time decisions when the plan doesn’t survive contact with reality.
If we’re going to trust AI agents with that level of operational responsibility, we need a testing discipline that barely exists today. The closest thing is Sierra Research’s tau-bench, which tests whether AI agents follow policy documents in simulated customer service interactions — and even there, GPT-4o achieved less than 50% task success and less than 25% consistency across repeated trials [20]. NIST launched an AI Agent Standards Initiative in early 2026, but it’s still in the “request for information” phase [21]. The MIT AI Agent Index found that 25 of 30 deployed AI agents disclose no internal safety evaluation results at all [22].
What none of these address is the specific problem: testing whether an AI agent follows operational runbooks in infrastructure contexts — halting on failure as instructed, maintaining procedural consistency across repositories and sessions, exercising judgment within documented constraints rather than around them. Not just sandbox environments for the code, but sandbox environments for the infrastructure — simulated operational contexts where AI agents execute complex procedures against replicated systems, and where we can inspect after the fact what decisions were made, in what order, and how the agent responded to injected failures.
This is not a small problem. It’s its own engineering discipline. And I think it’s going to be essential — not optional — if AI agents are going to reliably do the work that we currently trust to experienced human operations teams.
The process is the product
I started this project to port an API to five languages. I ended it with a deeper understanding of what operational discipline actually means — and a catalog of evidence that AI agents need the same guardrails the industry spent fifteen years building for human developers.
The publish skill incident is where I keep coming back. The agent was told to halt. It didn’t. Every fix was correct. And that’s exactly the scenario the halt procedure was designed for — not the catastrophic failure that’s obviously beyond the agent’s ability, but the routine failure where the fix is right there and stopping feels unnecessary.
The question isn’t whether AI agents can do the work. They can. The code quality across this project was genuinely impressive. The question is whether we can build the operational discipline around them — the automation, the gates, the testing frameworks, the processes that ensure reliability and consistency even when the operator is capable of freelancing.
This is the same question the DevOps movement asked about human developers. The answer was the same then as it is now: you don’t solve operational reliability by hiring better people. You solve it by building systems that make the reliable path the only path.
The infrastructure mindset says: the process is the product, not the outcome. That was true for every brilliant developer who ever skipped a runbook because they could see the fix. It’s just as true for the AI agent that read my procedure, understood my rationale, and decided it knew better.
This is the first article in a series about building and operating a polyglot API ecosystem with AI agents. The project — 8 repositories, 5 languages, 648 pull requests, and 523 issues across 48 days — generated enough material for several distinct stories. Future articles will cover CI gates as a substitute for human code review, standards-as-code for governing AI agents at scale, what five type systems reveal about the same API, and the economics of building the same thing five times.
References
[1] Allspaw, J. & Hammond, P. (2009). “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr.” O’Reilly Velocity Conference. https://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
[2] Kim, G., Behr, K., & Spafford, G. (2013). The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win. IT Revolution Press.
[3] Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution Press.
[4] Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press.
[5] Dekker, S. (2011). Drift into Failure: From Hunting Broken Components to Understanding Complex Systems. Ashgate Publishing.
[6] DORA Team, Google Cloud (2023). 2023 Accelerate State of DevOps Report. https://dora.dev/research/2023/dora-report/
[7] Edmondson, A.C. (1999). “Psychological Safety and Learning Behavior in Work Teams.” Administrative Science Quarterly, 44(2), 350-383. https://web.mit.edu/curhan/www/docs/Articles/15341_Readings/Group_Performance/Edmondson%20Psychological%20safety.pdf
[8] Beyer, B., Jones, C., Petoff, J., & Murphy, N.R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media. Chapter 15: “Postmortem Culture: Learning from Failure.” https://sre.google/sre-book/postmortem-culture/
[9] Chen, L., Zaharia, M., & Zou, J. (2024). “How Is ChatGPT’s Behavior Changing Over Time?” Harvard Data Science Review, 6(2). https://hdsr.mitpress.mit.edu/pub/y95zitmz
[10] OpenAI (2025). “Sycophancy in GPT-4o.” https://openai.com/index/sycophancy-in-gpt-4o/
[11] Jouneaux, G. & Cabot, J. (2025). “AgentSLA: Towards a Service Level Agreement for AI Agents.” arXiv:2511.02885. https://arxiv.org/abs/2511.02885
[12] Puppet Labs (2014). 2014 State of DevOps Report. https://services.google.com/fh/files/misc/state-of-devops-2014.pdf
[13] Forrester Research (2025). “Beyond RPA, DPA, And iPaaS — The Future Is Adaptive Process Orchestration.” https://www.forrester.com/report/beyond-rpa-dpa-and-ipaas-the-future-is-adaptive-process-orchestration/RES182206
[14] Hastings, R. & Meyer, E. (2020). No Rules Rules: Netflix and the Culture of Reinvention. Penguin Press.
[15] McKinsey & Company. “Attracting and retaining the right talent.” https://www.mckinsey.com/capabilities/people-and-organizational-performance/our-insights/attracting-and-retaining-the-right-talent
[16] Veeam (2024). Data Protection Trends Report 2024. https://www.veeam.com/company/press-release/veeam-data-protection-trends-report-2024.html
[17] Cockroach Labs (2024). “2025 Resilience Report.” https://www.cockroachlabs.com/guides/2025-resilience-report/
[18] Krishnan, K. (2012). “Weathering the Unexpected.” ACM Queue, 10(9). https://queue.acm.org/detail.cfm?id=2371516
[19] Cockcroft, A. (2020). “Failing Over without Falling Over.” Stack Overflow Blog. https://stackoverflow.blog/2020/10/23/adrian-cockcroft-aws-failover-chaos-engineering-fault-tolerance-distaster-recovery/
[20] Yao, S. et al. (2024). “tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.” Sierra Research. arXiv:2406.12045. https://arxiv.org/abs/2406.12045
[21] NIST (2026). “AI Agent Standards Initiative.” https://www.nist.gov/caisi/ai-agent-standards-initiative
[22] MIT (2026). “The 2025 AI Agent Index.” arXiv:2602.17753. https://aiagentindex.mit.edu/