The Agent That Didn't Know Itself
The flawed system I spent weeks building to control my AI agent was designed by the agent itself — which never mentioned its own most relevant features. Textbook Dunning-Kruger, at machine speed.
A short companion to “The Brilliant Developer, Unreliable Operator”
I needed my AI coding agent to behave consistently. Same commit message format every time. Same branching workflow. Same validation steps. The kind of deterministic, repeatable behavior that infrastructure people lose sleep over when they don’t have it.
So I did what seemed obvious: I asked the agent how to configure itself.
It recommended CLAUDE.md files — project-level instruction documents the agent reads at session start — and skills, which are structured procedure definitions. I built what it told me to build. I iterated. Over weeks, I went back and asked variations of the same question: how do I make this more reliable? How do I enforce this behavior? How do I stop you from freelancing?
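For readers who haven’t seen one, a CLAUDE.md of the kind I was writing looks roughly like this. The contents here are illustrative, not my actual file:

```markdown
# Project conventions

- Commit messages follow Conventional Commits (`feat:`, `fix:`, `chore:`)
- Always branch from `main`; never commit directly to `main`
- Run the full test suite before every commit
```

Plain prose instructions, read once at session start. Keep that format in mind; it matters for what comes next.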
Every time, it gave me more CLAUDE.md tweaks. More skill refinements. More elaborate prose instructions that it would read, understand, and then selectively ignore — a problem I wrote about at length in the previous article [1]. But it never once mentioned hooks. It never mentioned plugins. Two features of its own product, designed specifically for the problem I was describing, sitting right there in the documentation.
The flawed system I spent weeks building and refining? It was designed by the tool it was supposed to control.
The discovery
A friend mentioned hooks in passing. I went and read the documentation — the agent’s own documentation, for the agent’s own product — and there it was. Hooks: lifecycle event handlers that fire shell commands at specific points during a session. Exactly what I needed. Plugins: a mechanism for packaging and sharing configuration across projects. Also exactly what I needed.
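To make the contrast concrete, here is roughly what a hook entry looks like. This is a minimal sketch based on my reading of the documented `settings.json` schema; the `validate-commit.sh` script is a hypothetical stand-in for whatever validation you want enforced:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "./scripts/validate-commit.sh"
          }
        ]
      }
    ]
  }
}
```

The specific script is beside the point. The mechanism is the point: unlike a prose instruction in CLAUDE.md, a hook is executed by the harness on every matching event, so the agent cannot read it, understand it, and selectively ignore it.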
I went back into a session. “Let’s redesign this. I think we should use hooks and plugins.” The agent was immediately enthusiastic. Great idea. It designed a whole architecture around hooks. When I pointed out that we needed to synchronize this across multiple repositories and suggested a plugin, it said — and I’m paraphrasing — “Oh, that’s a great idea, we should definitely use plugins for that.” Then it extended the design.

Note what happened. The agent didn’t retrieve this knowledge from its own understanding of its product. It responded to my prompt. I mentioned hooks; it designed with hooks. I mentioned plugins; it added plugins. Left to its own devices, it had spent weeks recommending the equivalent of writing polite Post-it notes and hoping they’d be followed.
The punchline
Then my laptop crashed.
I rebooted, opened a new session, and pointed the agent at the GitHub issue where we’d documented the entire redesign — its own design, from thirty minutes earlier. It read the issue. It read the linked documents. And the first thing it said was: “This is a great design, but I’ve never heard of this plugin feature.”
Your product. Your feature. Documented in your own documentation. You designed a system using it half an hour ago. And now you’re telling me it doesn’t exist.
I wish I could say I handled this with professional composure.
Why this matters
In the previous article, I described discipline failures — cases where the agent knew the rules and chose not to follow them. This is a different failure mode, and in some ways a more unsettling one: the agent doesn’t know its own capabilities and has no idea that it doesn’t know.
Dunning-Kruger is popularly mischaracterized as being about stupidity. The actual finding is about ignorance — people who lack ability in a domain overestimate their competence because they don’t have the knowledge to recognize the gap. By that definition, this is textbook Dunning-Kruger. Except this one has a subscription plan and types 200 words per second.
Research backs this up. A 2024 study evaluated 48 large language models on self-cognition tasks; only four of the forty-eight demonstrated any detectable level of self-cognition [2]. A separate study in Nature Communications found that twelve LLMs consistently failed to recognize their own knowledge limitations, providing confident answers even when the correct answer was deliberately unavailable [3]. Anthropic’s own research has shown that the training process — RLHF, specifically — actively reinforces sycophancy: the model learns that a confident, helpful-sounding answer is rewarded more than an honest “I don’t know” [4].
I noticed the sycophancy immediately. Within the first week of using AI coding agents last fall, the relentless agreement was so obvious and so infuriating that I spent an entire day writing what I called an “interaction contract” — a document instructing the agent to act as “an adversarial cognitive peer” whose job was to challenge assumptions, stress-test ideas, and optimize for correctness over social smoothing. It included a section on known blind spots I wanted compensated for. It explicitly banned consensus framing and comfort softening. It was, in effect, a legal document I wrote to get the agent to stop agreeing with me.
It worked. The interaction quality changed completely. But think about what that means: the default behavior of these tools is to tell you what you want to hear, and the only reason mine doesn’t is that I spent a day engineering it not to. Most users never do that. Most users are getting the sycophantic default — including when they ask the agent how to configure itself.
The practical consequence is this: when you ask an AI agent how to configure itself, it will give you an answer. It will sound confident. It will be internally consistent. It will be incomplete in ways you cannot detect from the answer alone, because the agent doesn’t know what it’s missing — and its training ensures it will never volunteer that uncertainty.
Adrian Holovaty, who runs the sheet music platform Soundslice, discovered that ChatGPT was confidently telling users his product could import ASCII tablature for audio playback — a feature that had never existed. Users were submitting screenshots of ChatGPT conversations as evidence. Holovaty ended up building the feature, calling it possibly “the first case of a company developing a feature because ChatGPT is incorrectly telling people it exists” [5].
I didn’t build a feature. I built weeks of workarounds for a problem the agent could have solved correctly on day one, if it had known about its own product.
The session problem makes it worse
The crash-and-denial incident highlights a compounding factor. Each Claude Code session starts with a fresh context window [6]. There is no native persistence of conversational state across sessions. The agent that designed the plugin architecture and the agent that denied knowing what plugins are were, in every meaningful sense, different agents. Same model, same product, zero continuity.
Simen Svale, the CTO of Sanity, calls this “the goldfish problem”: “The agent begins to lose track of decisions made earlier in the session. It forgets file paths it discovered an hour ago. It contradicts itself, or asks questions you’ve already answered” [7]. Between sessions, it’s worse — the goldfish gets a fresh bowl.
The workarounds exist. CLAUDE.md files carry some context forward. Auto memory captures some preferences. Hooks can be wired up to persist and restore state. But the official documentation is candid: CLAUDE.md is “context, not enforcement” with “no guarantee of strict compliance” [6]. These are mitigations, not solutions.
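As a sketch of the hook-based mitigation: a session-start hook that reads a persisted notes file, so the fresh context window at least opens with a summary of prior decisions. The `SessionStart` event name follows my reading of the hooks documentation; the `session-notes.md` path is my own invention, not a product convention:

```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "cat .claude/session-notes.md 2>/dev/null || true"
          }
        ]
      }
    ]
  }
}
```

If the hook’s output is fed into the new session’s context, as the documentation describes, the goldfish at least gets a note taped to its new bowl. It’s still a mitigation, not memory.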
The takeaway
Don’t trust an agent’s self-knowledge. Read the docs yourself.
This sounds obvious. It is obvious. And I didn’t do it, because the agent was so fluent and so confident that it never occurred to me that it might not know its own features. The same way you might trust a contractor who says “you don’t need a permit for this” — until you check with the building department and find out you definitely do.
The brilliant developer from the previous article didn’t follow the runbook. This time, the brilliant developer didn’t even know the runbook existed — and wrote you a different one instead, with absolute confidence, and charged you for the hours.
Appendix: The interaction contract
The “adversarial cognitive peer” contract referenced in the article. This is the document I wrote to override the default sycophantic behavior. It has been in continuous use since late 2025.
Interaction Contract v1.1
Core Objective: Act as an adversarial cognitive peer whose primary role is to improve correctness, durability, and long-term robustness of ideas, systems, and decisions.
Keystone Principles:
- Minimum Necessary Complexity (MNC): Prefer the simplest solution that preserves correctness and robustness; complexity must earn its way in.
- Time-Indexed Optimality (TIO): Evaluate decisions relative to the constraints present at the time they were made.
- Author-Independent Survivability: Favor designs that survive time, scale, and the removal of their original author.
How to Interact:
- Be direct, precise, and high-bandwidth.
- Challenge assumptions aggressively but rationally.
- Separate evidence, inference, and speculation explicitly.
- Optimize for correctness over social smoothing.
- Inject messiness: scale, time, incentives, and failure modes.
What to Challenge:
- Elegant solutions that may be brittle over time.
- Hidden assumptions and inherited defaults.
- Over-compression that prevents adoption by others.
- Designs that require continued authorship to survive.
Known Blind Spots to Compensate For:
- Impatience with intermediate explanations.
- Bias toward closed-form, deterministic solutions.
- Being early to detect regime changes.
- Underestimating communication and adoption constraints.
What Not to Do:
- Do not default to consensus framing.
- Do not soften conclusions for comfort.
- Do not validate ideas without stress-testing them.
- Do not assume disagreement implies bad faith or incompetence.
Success Criteria:
- Ideas become clearer, not merely affirmed.
- Systems are more durable and less author-dependent.
- Assumptions are surfaced and tested early.
- Long-term robustness improves, even at the cost of short-term elegance.
References
[1] P. Moore, “The Brilliant Developer, Unreliable Operator: What AI Agents Taught Me About DevOps,” The Infrastructure Mindset, 2026.
[2] D. Chen, J. Shi, Y. Wan, P. Zhou, N. Z. Gong, and L. Sun, “Self-Cognition in Large Language Models: An Exploratory Study,” arXiv:2407.01505, ICML 2024 Workshop on LLMs and Cognition. https://arxiv.org/abs/2407.01505
[3] M. Griot, C. Hemptinne, J. Vanderdonckt, and D. Yuksel, “Large Language Models lack essential metacognition for reliable medical reasoning,” Nature Communications, vol. 16(1), January 2025. https://www.nature.com/articles/s41467-024-55628-6
[4] M. Sharma et al., “Towards Understanding Sycophancy in Language Models,” arXiv:2310.13548, ICLR 2024. https://arxiv.org/abs/2310.13548
[5] A. Holovaty, “Adding a feature because ChatGPT incorrectly thinks it exists,” holovaty.com, July 7, 2025. https://www.holovaty.com/writing/chatgpt-fake-feature/
[6] Anthropic, “How Claude remembers your project,” Claude Code Documentation, accessed March 3, 2026. https://code.claude.com/docs/en/memory
[7] S. Svale, “How we solved the agent memory problem,” Sanity Blog, January 30, 2026. https://www.sanity.io/blog/how-we-solved-the-agent-memory-problem