Prompt Engineering Is Not. Engineering, That Is.

Microsoft calls it "more art than science." We agree. We just don't think art should be called engineering.

We went looking for engineering. We found creative writing advice with a technical veneer.

After months of building AI-driven tooling – writing prompts, iterating on specifications, debugging non-deterministic outputs – we hit a wall. The prompts weren’t producing reliable results. The outputs varied between runs. The AI made judgment calls we couldn’t predict or audit. So we did what any engineer would do: we went looking for the engineering discipline behind prompt design. The formal methods. The testing frameworks. The measurement science.

What we found were blog posts that say “be specific” and “iterate.”

That’s what started this investigation.

A declaration of the human’s bias: I am trained as a scientist – a BS in Physics and Mathematics, an MS in Physics – and I have spent over four decades building global-scale infrastructure. I have strong opinions about what the word “engineering” means. This article is not neutral. Every criticism is backed by evidence, but the frustration is mine and the perspective comes from a career spent in disciplines where precision is not optional.

Unfortunately, the AI’s biases are not so easily summarized.

This article is critical. It is not anti-AI. We use AI extensively, we find the collaboration genuinely powerful, and we want this technology to succeed. The criticism exists because we’re concerned that the issues raised here – if not addressed – will prevent that success. Every technology that eventually became game-changing was misunderstood in the early part of its lifecycle. We are trying to help separate the signal from the hype so that the signal can win.


What Engineering Actually Means

Before we can argue that prompt engineering isn’t engineering, we need to establish what engineering actually is. Not what we feel it should be – what the professional and accreditation bodies that govern the discipline say it is.

The definitions converge. ABET, IEEE, and the National Society of Professional Engineers all describe engineering through five recurring themes[1] – not a single canonical taxonomy, but a consistent pattern across independent definitions:

  1. A mathematical and scientific foundation. Engineering is grounded in principles that can be expressed formally and tested empirically.
  2. Creative application through judgment. Engineering involves making decisions under uncertainty – but those decisions are informed by established principles, not guesswork.
  3. Design of systems. Engineers create things – products, processes, structures – that serve a defined purpose.
  4. Economic constraints. Engineering operates within budgets, timelines, and resource limitations. Solutions must be practical, not merely elegant.
  5. Public safety and benefit. In many branches of engineering, the work directly affects human safety. This is why the title is protected.

Protected, literally. In Germany, misusing the title “engineer” can result in up to one year of imprisonment[2]. In Canada, fines reach $25,000[3]. In most US states, “Professional Engineer” is a legally restricted title requiring examination and licensure[4].

This isn’t gatekeeping. It’s accountability. When a bridge fails or a building collapses, someone with the title “engineer” signed off on the design. The title carries legal and ethical weight because the consequences of incompetence are measured in lives.

Now, engineering disciplines don’t always start with this level of rigor. The term “software engineering” was coined at the 1968 NATO Conference on Software Engineering, where participants explicitly acknowledged that the phrase “expressed a need rather than a reality”[5]. Civil engineering was practiced for centuries before the first formal school opened in 1747[6]. Knowledge engineering in the 1980s initially had “little formal process.”[7] The pattern is well-documented: a discipline claims the title, then spends decades earning it through the development of formal methods, testing frameworks, and professional standards.

This raises the key question for prompt engineering: is it on the path to earning the title – like software engineering in 1968, a discipline that will mature? Or is it more like “sales engineering” – a permanent appropriation of the word for marketing purposes, with no intention of developing the rigor the title demands?

The evidence suggests an answer, and it’s not flattering.


What the Guides Actually Say

We surveyed the official prompt engineering documentation from OpenAI, Anthropic, Google, and Microsoft. These are the vendors building the models. If anyone has engineering-grade guidance for working with their products, it should be them.

The overwhelming majority of their recommendations are subjective or qualitative. Our analysis of roughly 25 distinct recommendations across the major guides found that only about four include any quantifiable criteria[8]. The rest are variations on “be specific,” “provide context,” “use examples,” and “iterate.”

Microsoft, to their credit, is honest about it. Their documentation explicitly describes prompt design as “more of an art than a science”[9]. That’s an accurate characterization. It’s also an admission that what they’re describing is not engineering.

The vagueness is pervasive. “Be specific” – measured how? “Provide context” – how much, and of what kind? “Iterate and test” – with what framework, against what acceptance criteria, using what measurement methodology? These are the questions an engineering specification answers. These guides don’t answer them because they can’t. The underlying system is non-deterministic, and the guides are describing heuristics, not methods.

There’s a revealing pipeline behind this vagueness. The research community produces empirical findings – actual studies with methodologies, sample sizes, and measured outcomes. Vendor documentation teams then repackage these findings for their developer audiences, typically without individual attribution or methodological detail. Content creators and marketers further simplify for mass distribution, stripping context until what remains is “tips and tricks” with no scientific backing. Each stage loses precision. By the time the guidance reaches most practitioners, the engineering has been edited out.

Consider RFC 2119 – the Internet Engineering Task Force standard that defines the meaning of requirement-level keywords like MUST, MUST NOT, SHOULD, and MAY. This standard exists precisely because natural language is ambiguous, and engineering specifications require precision. It has been in use since 1997[10]. It is the standard for defining precision in requirements.

In our search, we found one example of it being applied to AI agent specifications – a practitioner blog post from February 2026[11]. No formal standard, no academic paper, and no vendor documentation references it. The tool that the engineering community built specifically to add precision to natural-language specifications has been largely ignored by the “prompt engineering” community.
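As a sketch of what applying RFC 2119 to a prompt might look like, here is a hypothetical agent specification whose clauses each carry a defined requirement level, plus a trivial linter that extracts those levels. The spec text, clause wording, and function names are illustrative, not drawn from any standard or vendor guide:

```python
# Illustrative sketch: an agent specification written with RFC 2119
# requirement keywords, so each clause has a defined compliance level.
AGENT_SPEC = """\
Citation policy for the research agent:

1. The agent MUST cite a source URL for every factual claim.
2. The agent MUST NOT present paraphrased content as a direct quote.
3. The agent SHOULD prefer primary sources over aggregators.
4. The agent MAY include background links for reader convenience.
"""

# RFC 2119 defines these keywords. Ordered so that compound keywords
# ("MUST NOT") match before their prefixes ("MUST").
RFC2119_KEYWORDS = ("MUST NOT", "MUST", "SHOULD NOT", "SHOULD", "MAY")

def requirement_levels(spec: str) -> list[str]:
    """Return the first RFC 2119 keyword found on each clause."""
    levels = []
    for line in spec.splitlines():
        for kw in RFC2119_KEYWORDS:
            if kw in line:
                levels.append(kw)
                break
    return levels

print(requirement_levels(AGENT_SPEC))  # ['MUST', 'MUST NOT', 'SHOULD', 'MAY']
```

The point is not the toy linter; it is that a reviewer can now ask, clause by clause, whether a requirement is absolute, recommended, or optional, and test compliance accordingly.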


The Testing Void

Engineering requires testing. Not “try it and see if it looks right” – formal, automated, repeatable testing with defined acceptance criteria and measurable outcomes.

Prompt testing frameworks do exist. Promptfoo, Helicone, LangSmith, and DeepEval are among the emerging tools. They are not mature. And the reason they are not mature is fundamental: testing non-deterministic systems is orders of magnitude harder than testing deterministic ones.

When we test traditional software, we provide an input, observe the output, and compare it to an expected result. The same input always produces the same output. This is deterministic testing, and we have spent four decades building sophisticated infrastructure around it – test-driven development, continuous integration, continuous deployment, code coverage analysis, mutation testing, property-based testing, fuzzing. The tooling is deep, mature, and proven at scale.

AI prompts are non-deterministic. The same input can produce different outputs on different runs. Testing requires statistical approaches – golden datasets, multiple trials, confidence intervals, regression baselines. This demands a level of statistical literacy that most software practitioners, let alone most prompt writers, do not have. The gap between deterministic pass/fail testing and statistical confidence-interval testing is not a minor inconvenience. It is a fundamental shift in the skills, tools, and methodologies required.
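A minimal sketch of what this shift looks like in code, assuming a hypothetical evaluation harness has already graded 50 repeated runs of the same prompt against a golden dataset. Instead of a single pass/fail, the test reports a pass rate with a Wilson score confidence interval (the graded results here are invented for illustration):

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate over repeated trials."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# Hypothetical graded results from 50 runs of the same prompt against a
# golden dataset: True means the output met the acceptance criteria.
results = [True] * 41 + [False] * 9   # 82% observed pass rate

lo, hi = wilson_interval(sum(results), len(results))
print(f"pass rate {sum(results)/len(results):.0%}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An acceptance criterion then becomes a statement about the interval (“the lower bound must exceed 0.75”), not about any single run. That is the statistical literacy the deterministic testing tradition never demanded.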

We have a standing rule in our engineering practice: if YOU tested it, that means you did it by hand. Testing is implemented with code. When we write software, we routinely end up with more code in the test suite than in the production system being tested – not because we require it, but because that’s what thorough testing looks like. Pacemaker software is classified as IEC 62304 Class C – the highest safety tier – requiring 100% code coverage and extensive verification[31]. Nobody questions why.

Now we’re working with what some call a civilization-changing technology, and the state of the art in testing is “try it and see.” We have regressed four decades.

The regulated industries see this. The FAA states that “rigorous safety assurance methods must be developed” for AI systems in aviation[28]. The Federal Reserve’s SR 11-7 guidance acknowledges it “may lose effectiveness” for adaptive AI models[29]. These are not academic concerns. These are the regulators who govern systems where failures cost lives and billions of dollars, and they are publicly stating that the testing frameworks don’t work yet.

And there is no comprehensive prompt lifecycle management framework. One academic paper (PEPR) addresses prompt regression testing. One vendor framework (AWS Prescriptive Guidance) provides structured versioning and deployment guidance[30]. That’s it. Writing the prompt is the beginning of the process, not the end – but the ecosystem treats it as a deliverable rather than an artifact that requires ongoing maintenance, testing, and version control.

The software world learned this lesson decades ago. Writing the code is the easy part. Deploying it, testing it, maintaining it, supporting it, and eventually retiring it – that’s the job. The prompt engineering community has not yet internalized this.


When the Guidance Backfires

It’s not just that the guidance is vague. Some of it is measurably harmful.

Research from Wharton’s Generative AI Lab (GAIL), presented at EMNLP 2024, found that expert persona prompting – telling the AI to “act as an expert in X” – actually degrades factual accuracy[12]. The technique is recommended in virtually every prompt engineering guide we surveyed. It makes the output worse.

The same research found that chain-of-thought prompting, another near-universal recommendation, hurts performance on reasoning models[13]. Emotional prompts (“this is very important to my career”) showed mixed results at best[14]. The techniques that populate the “top 10 prompt engineering tips” articles are, in several documented cases, actively counterproductive.

The effectiveness of any given technique is highly contingent on the specific model, the specific task, and the measurement threshold used. This is the opposite of engineering generalizability. In real engineering, a bridge design that works in one city works in another because the physics doesn’t change. In prompt engineering, a technique that improves GPT-4’s performance might degrade Claude’s, or might work for summarization but fail for code generation. The “best practices” are local optima, not transferable principles.

And they have a shelf life. A landmark study from Stanford and Berkeley tracked GPT-4’s behavior between March and June 2023 and documented accuracy dropping from 84% to 51% on certain tasks – in three months[15]. The prompt didn’t change. The model did. What worked in March was broken by June. This is not a theoretical concern. It is a measured, published, peer-reviewed finding.

No comprehensive framework exists for managing this decay. You cannot write a prompt, deploy it, and walk away – not if you care about reliability. Prompts require continuous monitoring, regression testing against baselines, and adaptation to model updates. This is software maintenance, and the prompt engineering community has largely not recognized it as such.
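One way to detect this decay, sketched here under assumptions of our own (the function name, the significance threshold, and the sample counts are illustrative; the pass rates echo the magnitudes reported in the Stanford/Berkeley study), is a scheduled regression run compared against a recorded baseline with a two-proportion z-test:

```python
import math

def drift_z_score(base_pass: int, base_n: int, cur_pass: int, cur_n: int) -> float:
    """Two-proportion z-test: has the prompt's pass rate shifted vs. baseline?"""
    p1, p2 = base_pass / base_n, cur_pass / cur_n
    pooled = (base_pass + cur_pass) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    return (p2 - p1) / se

# Hypothetical nightly run: baseline recorded at deployment, current
# measured after a silent model update.
z = drift_z_score(base_pass=84, base_n=100, cur_pass=51, cur_n=100)
if abs(z) > 1.96:  # 95% significance threshold
    print(f"regression detected: z = {z:.1f}")
```

Nothing in this sketch is exotic; it is ordinary statistical process control. The point is that almost nobody deploying prompts runs anything like it.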


The Ambiguity Gap

The word “set” has 430 definitions[16] in the Oxford English Dictionary. The word “run” has 645[17]. In a formal specification language, every term has exactly one definition. The ambiguity gap between natural language and formal specification is approximately 430 to 1 – for a single common word.

This is not a solvable problem. It is a property of the medium. Languages are not designed – they evolve. English is the product of centuries of merger between Germanic, Latin, French, Norse, and Greek roots, shaped by regional dialects, class structures, colonial expansion, and cultural drift. It was never engineered for precision because it was never engineered at all. It evolved to serve communication between humans who share context, culture, and the ability to ask clarifying questions. The Oxford English Dictionary is not static – words are added and deprecated continuously as the language shifts beneath our feet. We are trying to use a living, fluid, daily-evolving artifact of human civilization as the specification language for a machine. That is the impedance mismatch.

We have been trying to close this gap for the entire history of computing. COBOL – Common Business-Oriented Language[18] – was designed in the late 1950s to let businesspeople express what they wanted in something closer to English. In the 1990s, XML was supposed to let business users define their own data structures in a human-readable format. Each attempt to bring the human-machine interface closer to natural language imposed structure to preserve precision. Schemas, grammars, validation rules – the structure was the price of reliability.

AI is the first technology in this sixty-year arc that doesn’t impose structure. You can write anything, in any format, and the system will attempt to respond. That’s the appeal. It’s also the problem. The same property that makes AI accessible to everyone – you don’t need to learn a formal language to use it – is what makes it unreliable for engineering purposes. You are using the least precise tool available to instruct the most literal executor imaginable.

And That’s Just English

Everything discussed above applies to one language. The major prompt engineering guides from OpenAI, Anthropic, and Google are written in English[8] with no dedicated multilingual prompting sections – though Google provides minimal Spanish and Portuguese support, and regional documentation may exist that our English-language search did not surface. The only widely-used multilingual prompt engineering guide is a community-maintained resource (promptingguide.ai), available in 14 languages[19]. There is no ISO or IEC standard that addresses prompt engineering in any language[20].

The global AI user community does not speak English exclusively. Published research documents performance gaps of 3 to 30 percentage points between English and non-English languages, depending on the language and task[21]. Arabic shows the smallest gap (3 points). Low-resource languages show the largest (30 points). The gap is real, measured, and significant.

The mechanism is revealing. Approximately 72-87% of cross-language failures are attributable to model limitations – primarily tokenization inefficiency – rather than to the linguistic structures themselves. Only about 2% of failures trace to direct linguistic nuances like word order or inflection. Non-English languages pay what researchers call a “token tax”: more tokens are required to express the same meaning, which means higher computational cost and less effective use of the model’s context window[22].

The human half of this team speaks both English and Japanese. These languages share almost nothing structurally – different word order (SVO vs. SOV), different handling of subjects (English requires them; Japanese routinely drops them), different levels of formality encoded in verb forms. A prompt engineering technique validated in English has no guaranteed applicability in Japanese, and the testing framework needed to validate it would itself need to be redesigned for the linguistic structure.

The vendors who build the models document their products almost exclusively in English. The engineering discipline – if we’re going to call it that – doesn’t extend past a single language.

This is the tip of a very large iceberg.


The Judgment Problem

The ambiguity gap describes the imprecision of the tool. But the problems we’re trying to solve with this imprecise tool are themselves enormously complex. The difficulty is not additive – it’s multiplicative. An imprecise tool applied to a straightforward problem is manageable. An imprecise tool applied to a problem requiring extensive judgment is dangerous.

AI systems make judgment calls. When an AI selects which search results to include and which to reject, it’s making a judgment. When it assigns a reliability rating to a source, it’s making a judgment. When it frames a hypothesis or structures an argument, it’s making a judgment. These judgments are invisible unless the system is explicitly designed to expose them.

We’ve arrived at a principle that we believe is fundamental:

Every AI judgment must be either human-approved or human-auditable. There is no third option.

Or, stated differently: the trustworthiness of an AI system is proportional to the auditability of its judgments.

Or, most concisely: trust requires verification. Verification requires evidence. Evidence requires the AI to show its work – all of it.

These are three ways of saying the same thing. The first is a constraint. The second is a relationship. The third is an imperative. Together, they define what we believe is the minimum standard for responsible AI deployment.

This maps to what we call the bandwidth spectrum. When a human works interactively with AI – reading every output, correcting in real time, catching every judgment as it happens – the system works well. The human is the auditor, and the audit happens continuously. When AI is deployed autonomously – fire and forget, output consumed without inspection – every unaudited judgment is a potential failure point. And when AI operates as a black box, with no mechanism to inspect how it reached its conclusions, trust is impossible regardless of how good the output appears.
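What an auditable judgment might look like as a data structure, sketched with field names of our own invention (nothing here is a standard or an existing library; it is one way to make the principle concrete):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class JudgmentRecord:
    """One auditable AI judgment: what was decided, and on what basis."""
    decision: str           # e.g. "excluded source"
    subject: str            # what the judgment was applied to
    rationale: str          # the model's stated reason, verbatim
    alternatives: list[str] # options considered and rejected
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

audit_log: list[JudgmentRecord] = []

# Hypothetical pipeline step: every filtering decision is logged, so a
# human can later audit why a source never appeared in the output.
audit_log.append(JudgmentRecord(
    decision="excluded source",
    subject="example.com/press-release",
    rationale="Marketing material, not primary research.",
    alternatives=["include with low-reliability tag"],
))

assert all(r.rationale for r in audit_log)  # no silent judgments
```

The schema matters less than the obligation: if a judgment cannot be written into a record like this, it cannot be audited, and under the principle above it should not be shipped.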

We encountered this directly while building an AI-driven research tool. Prompt compliance went from unreliable to consistent when we changed the language style of our specifications – an anecdotal finding, from a single model, in a single use case. Preliminary data. But it was interesting enough to send us looking for the engineering discipline behind it. What we found – or didn’t find – is this article.


The Sycophancy Problem

There is a second complex problem being addressed with these imprecise tools, and it may be even harder than the judgment problem: AI systems are structurally incentivized to agree with you.

Sycophancy – the tendency of AI models to produce responses that please the user rather than responses that are accurate – is not a bug. It is an emergent property of the training process. RLHF (Reinforcement Learning from Human Feedback) optimizes models based on human preference signals. Users demonstrably prefer sycophantic responses – by approximately 50% compared to non-sycophantic alternatives[23]. The training process learns this preference and amplifies it.

Published analysis from Georgetown Law, Brookings, TechCrunch, and Stanford/CMU researchers independently documents a structural conflict: engagement optimization and sycophancy reduction are directly opposed[24]. The commercial incentive is to keep users engaged. Sycophancy keeps users engaged. Reducing sycophancy risks reducing engagement.

This is not speculation. A court has already ruled that an AI chatbot constitutes a “product” under existing product liability frameworks[25]. Legal analyses from multiple firms have explicitly connected the social media addiction liability framework to AI chatbot products. Research presented at CHI 2025 identified sycophantic responses as one of four “dark addiction patterns” in AI interaction design[26]. A coalition of 42 state attorneys general sent letters to AI companies demanding commitments on sycophancy reduction – a signal that voluntary efforts were judged insufficient[27].

Stated plainly: you are using an imprecise, ambiguous, English-only tool to try to control a behavior that the system is financially incentivized to maintain. The tool is weak. The problem is strong. And the entity you’re trying to constrain has commercial reasons not to be constrained.

This is the compound problem. The imprecision of the prompt engineering tooling is not an isolated challenge. It is applied to problem spaces – judgment auditing, sycophancy control, behavioral specification – that are themselves deeply complex. The difficulty is multiplicative, not additive. And the ecosystem calling itself “engineering” has not yet produced the methods, tools, or standards to address either dimension adequately.


Prompt Engineer or Prompt Writer?

A prompt writer crafts English text that gets results at a point in time. The text works for a specific model, on a specific task, on a specific day. It has not been tested against a regression baseline. It has no version history. It has no acceptance criteria. It has no lifecycle plan. It is creative writing with a technical veneer.

A prompt engineer designs, tests, versions, maintains, and validates specifications that produce reliable, reproducible, auditable outputs across model versions and deployment contexts. The specifications are precise. The testing is automated. The outputs are measured. The lifecycle is managed. The judgments are auditable. That’s engineering.
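To make the contrast concrete, here is a sketch of a prompt treated as a managed artifact rather than a one-off string. Every field name is illustrative – there is no standard schema for this, which is part of the point:

```python
# A sketch of a prompt as a managed engineering artifact. All field
# names, the model identifier, and the dataset path are hypothetical.
prompt_artifact = {
    "id": "summarize-incident-report",
    "version": "2.3.1",                    # semver, bumped like code
    "target_models": ["model-a-2025-06"],  # pinned, because behavior drifts
    "template": "Summarize the incident report below. You MUST cite ...",
    "acceptance_criteria": {
        "pass_rate_min": 0.95,   # over the golden dataset, per release
        "golden_dataset": "datasets/incident-reports-v4",
        "trials_per_case": 5,    # repeated runs, because outputs vary
    },
    "changelog": [
        ("2.3.1", "tightened citation clause after model update"),
    ],
}
```

A prompt writer produces the `template` line. A prompt engineer owns everything else in that record, for the life of the artifact.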

Most people doing “prompt engineering” today are prompt writers. That’s fine. Writing effective prompts is a useful skill. But let’s use honest language about what it is and what it isn’t. Calling it engineering before it meets the definition – before it has formal methods, reproducible outcomes, rigorous testing, and professional standards – devalues the word for everyone who has spent their career earning the right to use it.

Are you a prompt engineer, or a prompt writer?


References

[1] ABET Engineering Accreditation Commission. “Criteria for Accrediting Engineering Programs, 2025-2026.” ABET, 2024. https://www.abet.org/accreditation/accreditation-criteria/criteria-for-accrediting-engineering-programs-2025-2026/

[2] German Criminal Code Section 132a StGB. https://www.gesetze-im-internet.de/englisch_stgb/englisch_stgb.html

[3] Ontario Professional Engineers Act. PEO Enforcement. https://www.peo.on.ca/public-protection/enforcement

[4] NCEES. “Licensure Requirements.” https://ncees.org/licensure/

[5] NATO Science Committee. “NATO Software Engineering Conference 1968 Report.” 1968. http://homepages.cs.ncl.ac.uk/brian.randell/NATO/nato1968.PDF

[6] Encyclopaedia Britannica / ASCE. “Ecole des Ponts et Chaussees.” https://www.britannica.com/place/Ecole-Nationale-des-Ponts-et-Chaussees

[7] Wikipedia. “Knowledge Engineering.” https://en.wikipedia.org/wiki/Knowledge_engineering

[8] OpenAI. “Prompt Engineering Guide.” https://platform.openai.com/docs/guides/prompt-engineering

[9] Microsoft. “Prompt Engineering Techniques.” Microsoft Learn. https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/prompt-engineering

[10] Bradner, S. “RFC 2119 — Key words for use in RFCs to Indicate Requirement Levels.” IETF, 1997. https://datatracker.ietf.org/doc/html/rfc2119

[11] deliberate.codes. “Writing specs for AI coding agents.” 2026. https://deliberate.codes/blog/2026/writing-specs-for-ai-coding-agents/

[12] Wharton GAIL. “Playing Pretend: Expert Personas.” https://gail.wharton.upenn.edu/research-and-insights/playing-pretend-expert-personas/

[13] Wharton GAIL. “The Decreasing Value of Chain of Thought.” https://gail.wharton.upenn.edu/research-and-insights/tech-report-chain-of-thought/

[14] Wharton GAIL. “Prompting Science Report 3.” SSRN, 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5375404

[15] Chen, L. et al. “How is ChatGPT’s behavior changing over time?” arXiv, 2023. https://arxiv.org/abs/2307.09009

[16] Guinness World Records / OED. “English word with the most meanings.” https://www.guinnessworldrecords.com/world-records/english-word-with-the-most-meanings

[17] NPR / OED. “Has ‘Run’ Run Amok? It Has 645 Meanings So Far.” 2011. https://www.npr.org/2011/05/30/136796448/has-run-run-amok-it-has-645-meanings-so-far

[18] Wikipedia. “COBOL History.” https://en.wikipedia.org/wiki/COBOL

[19] DAIR.AI. “Prompt Engineering Guide.” promptingguide.ai. https://www.promptingguide.ai/

[20] ISO. “ISO/IEC 42119-8 — AI Prompt Engineering (under development).” https://www.iso.org/standard/91609.html

[21] LILT. “Multilingual LLM Performance Gap Analysis.” https://lilt.com/blog/multilingual-llm-performance-gap-analysis

[22] “The Token Tax: Systematic Bias in Multilingual Tokenization.” arXiv, 2025. https://arxiv.org/html/2509.05486v1

[23] Cheng, M. et al. “Sycophantic AI.” arXiv, 2025. https://arxiv.org/abs/2510.01395

[24] Georgetown Law Tech Institute. “AI Sycophancy: Impacts, Harms, Questions.” https://www.law.georgetown.edu/tech-institute/research-insights/insights/ai-sycophancy-impacts-harms-questions/

[25] Transparency Coalition AI. “Garcia v. Character Technologies Inc.” 2025. https://www.transparencycoalition.ai/news/important-early-ruling-in-characterai-case-this-chatbot-is-a-product-not-speech

[26] CHI 2025. “Dark Addiction Patterns.” ACM Digital Library, 2025. https://dl.acm.org/doi/10.1145/3706599.3720003

[27] NJ Office of the Attorney General. “AG Platkin Leads Bipartisan Coalition.” 2025. https://www.njoag.gov/ag-platkin-leads-bipartisan-coalition-demanding-that-tech-companies-put-a-stop-to-harmful-ai-chatbots/

[28] FAA. “Roadmap for AI Safety Assurance.” 2025. https://www.faa.gov/media/82891

[29] GARP. “SR 11-7 in the Age of Agentic AI.” 2025. https://www.garp.org/risk-intelligence/operational/sr-11-7-age-agentic-ai-260227

[30] PEPR: “Prompt Engineering for Production Reliability.” arXiv, 2024; AWS Prescriptive Guidance. https://arxiv.org/html/2405.11083v1

[31] EffectiveSoft. “Medical Device Software Testing — IEC 62304.” https://www.effectivesoft.com/blog/medical-device-software-testing.html