The Truth is Out There. But How Do You Find It?
I took nine frameworks from intelligence analysis and science, evaluated every feature, and combined them into a unified research methodology. What follows is the how, the why, and the result.
A unified research methodology derived from nine intelligence and scientific frameworks
The Truth is Out There Series (article 1 of 2):
The Truth is Out There. But How Do You Find It? (this article)
“The Truth is Out There.”
It’s one of the most recognized mottos in television history. It’s also from a show built entirely on pseudoscience, cryptozoology, paranormal phenomena, and unfalsifiable conspiracy theories. Fox Mulder spent nine seasons desperately wanting to believe – and that was the problem. He started with his conclusion and worked backward, treating every piece of evidence as confirmation and every contradiction as proof of a deeper cover-up. That’s not truth-seeking. That’s pattern-matching with extra steps.
Everyone says they want the truth. Almost nobody has a rigorous process for finding it.
The problem isn’t access to information. We’re drowning in information. The problem is knowing which information to trust, how much to trust it, and what to do when your sources disagree with each other. These are old problems. The intelligence community has been working on them for decades. The scientific community has been working on them for centuries. Both have produced sophisticated frameworks for separating signal from noise, and neither community talks to the other very much.
This article is about what happens when you force them to talk. I took nine frameworks – from intelligence analysis, clinical medicine, climate science, and the philosophy of science – evaluated every feature, kept what works, skipped what doesn’t, and combined them into a unified research methodology. What follows is the how, the why, and the result.
How I Got Here
I publish researched articles. Not opinion pieces – researched articles, where every significant claim is backed by evidence and every source is cited. That sounds straightforward until you try to do it well.
Early on, the research process was informal. Look things up, read the sources, cite them, move on. The results were adequate. They were also, occasionally, wrong in ways that were hard to catch. A source would sound authoritative but turn out to be citing itself. A statistic would appear in multiple articles but trace back to a single, flawed study. A claim would feel true – it aligned with everything else I’d read – and that feeling of alignment was exactly the problem. Confirmation bias doesn’t announce itself. It just quietly curates your evidence for you.
The obvious next step was to hand the research off to AI. Instead of running my own searches and forming hypotheses in my head, let the AI do the legwork. But the moment I tried, I realized how much of my research process was implicit – undocumented rules I’d internalized over decades. I worried about bias, sure. I considered source credibility, of course. I formed competing hypotheses and looked for disconfirming evidence. But none of that was written down. It was all wetware. Handing an AI agent “go research this and tell me what you find” is a deceptively simple instruction that buries an enormous amount of unstated methodology.
So I tried to write a prompt that would capture it. I didn’t get far. It was immediately obvious that a serious research methodology wasn’t going to fit in a couple of paragraphs – it was going to be a substantial, structured specification. I searched for existing AI research methodology prompts, thinking someone must have already done this. I didn’t find much. Fragments, yes – prompts for screening abstracts against PRISMA criteria, or checking CONSORT compliance on individual studies – but nothing that attempted a complete, rigorous, end-to-end research framework for an AI agent.
The turning point came not from that search but from an unrelated source. I follow Joohn Choe’s writing on modern intelligence analysis, and he posted something that stopped me cold: a complete AI prompt implementing the full ICD 203 analytical standard[1]. ICD 203 – Intelligence Community Directive 203 – is the U.S. intelligence community’s directive on analytic standards. It defines nine tradecraft standards that govern how intelligence analysts produce assessments. Choe had done something remarkable: he translated those standards into a research methodology an LLM could execute. I started by running his prompt directly – and that changed the direction of this entire project.
The difference was immediate and dramatic. Calibrated probability language replaced vague hedging. Source credibility audits replaced gut feelings about trustworthiness. Structured analysis replaced narrative reasoning. The research got better – measurably, obviously better. Claims that would have survived the old process got caught. Sources that would have been cited uncritically got flagged. The bar went up.
Which raised an obvious question: if the intelligence community has standards this good, what does the scientific community have?
The Search for Scientific Analogues
The answer is: a lot, and none of it covers the same ground.
The intelligence community produced ICD 203, which is broad – nine tradecraft standards spanning sourcing, uncertainty, logic, alternatives, and more. It’s the closest thing to a comprehensive analytical framework that exists in a single document. But it has gaps. It doesn’t provide a structured mechanism for adjusting confidence up or down based on specific evidence characteristics. It doesn’t require search transparency. It doesn’t include a self-audit step.
The scientific community, by contrast, has produced many specialized frameworks, each attacking a different part of the research process. I evaluated eight of them:
- GRADE grades the certainty of evidence and separates evidence quality from the strength of conclusions drawn from it.
- IPCC provides calibrated confidence models that account for both evidence quality and the degree of agreement between sources.
- PRISMA enforces transparency in how evidence was searched for, found, and selected.
- Cochrane’s RoB 2 assesses specific types of bias in individual sources.
- CONSORT standardizes how controlled trials are reported.
- Chamberlin and Platt established the philosophical foundation for competing hypotheses and falsification-first investigation.
- ROBIS audits the review process itself for bias.
- NAS sets institutional standards for comprehensive search and conflict of interest management.
Each of these is excellent at what it does. None of them does everything. And as far as I can determine – and I searched extensively – I could not find a published systematic combination of them into a single, unified research methodology in the accessible literature[2].
That shouldn’t be surprising. These frameworks were never meant to be combined. Each was designed for a different group of human specialists operating at a different stage of the research process. PRISMA is for the team designing the search. Cochrane is for the team assessing bias. ROBIS is for the reviewers auditing the process after the fact. GRADE is for the panel synthesizing the evidence into recommendations. In human research, these are separate roles performed by separate people, often at separate institutions, with separate professional norms enforcing compliance. Nobody needed a unified framework because nobody was one person doing all of it.
AI changes that equation. An AI agent is a single executor that can be instructed to follow all of these standards simultaneously – something a human team could only achieve with massive coordination overhead. The opportunity isn’t just to combine the frameworks. It’s to write a single specification that makes the AI apply all of them to every piece of research it produces, and then see how close it can get to the rigor that human teams achieve through institutional division of labor.
That’s what I set out to do.
The Evaluation
I evaluated each framework on two criteria: does this feature make my research more correct, and does it make the correctness auditable? These are different things. Correctness means the conclusions actually match the evidence. Auditability means someone else — or a future version of myself — can trace the reasoning from conclusion back through evidence to source and verify that the chain holds. A correct conclusion you can’t audit is an assertion. An auditable process that produces incorrect conclusions is a bug you can find and fix. You need both. For every feature in every framework, the decision was keep, adapt, or skip – and the reasoning is documented for each. What follows is not a survey of these frameworks. It’s a selection process, with rationale.
ICD 203 – The Backbone
ICD 203 contributes the structural foundation. Its nine tradecraft standards define the core requirements for any rigorous analytical process[3]:
- Sourcing: Cite everything. Distinguish between source types.
- Uncertainty: Use calibrated probability language with defined numeric ranges – not “likely” in some vague sense, but “Probable (55-80%)” drawn from a scale of explicitly bounded bands (detailed below).
- Distinction: Separate observed fact from analyst judgment. Never let them blur together.
- Alternatives: Consider competing explanations. Don’t fall in love with your first hypothesis.
- Relevance: Stay focused on the actual question being asked.
- Logic: Make reasoning chains explicit and auditable. If someone can’t follow your logic from evidence to conclusion, the analysis fails.
- Change: Flag when new information shifts a prior assessment. Yesterday’s conclusion might not survive today’s evidence.
- Accuracy: Verify claims against source material. Don’t trust your memory of what a source said – go back and check.
- Visual integrity: Charts, tables, and figures don’t mislead.
These nine standards are non-negotiable. They form the backbone of the unified methodology.
ICD 203 also provides the probability scale I use for final assessments. The directive defines a seven-point scale with dual terminology – each level has two equivalent phrasings – and explicit numeric ranges[3]:
| Primary term | Alternate term | Range |
|---|---|---|
| Almost no chance | Remote | 1-5% |
| Very unlikely | Highly improbable | 5-20% |
| Unlikely | Improbable | 20-45% |
| Roughly even chance | Roughly even odds | 45-55% |
| Likely | Probable | 55-80% |
| Very likely | Highly probable | 80-95% |
| Almost certain(ly) | Nearly certain | 95-99% |
This matters. A word like “likely” means different things to different people. A defined range like “55-80%” does not. Note that ICD 203’s scale has a ceiling – “Almost Certain” tops out at 99%, not 100%. In intelligence analysis, where you are always working with incomplete information, that’s appropriate. Absolute certainty about adversary intentions or future events is genuinely not achievable.
My methodology extends this scale with two deterministic endpoints: Impossible (0%) and Certain (100%). These are reserved for claims that can be verified definitionally – by counting items in a document, checking geographic coordinates, evaluating a mathematical expression, or confirming a date in a primary source. The test: could any new evidence change this answer? If yes, use the 1-99% scale. If no – if the verification method is deterministic and the answer is not subject to interpretation – then 0% or 100% is appropriate.
ICD 203 didn’t need these endpoints because intelligence analysis doesn’t produce deterministic answers. But a fact-checking methodology does. “ICD 203 defines nine tradecraft standards” is either true or false – you count them. “Texas shares a land border with California” is deterministically false – you check a map. Forcing these into “Almost Certain” or “Almost No Chance” would be dishonest precision in the wrong direction.
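To make the extended scale concrete, here is a minimal sketch in Python. The band boundaries come from the table above; the function name, the deterministic flag, and the error handling are illustrative assumptions, not part of ICD 203 or any published standard.

```python
def probability_term(percent: float, deterministic: bool = False) -> str:
    """Map a numeric probability (0-100) to a calibrated term.

    deterministic=True is reserved for claims whose verification method is
    definitional: counting items, checking coordinates, evaluating an
    expression, confirming a date in a primary source.
    """
    if deterministic:
        if percent == 0:
            return "Impossible (0%)"
        if percent == 100:
            return "Certain (100%)"
        raise ValueError("Deterministic claims must resolve to 0% or 100%")

    if not 1 <= percent <= 99:
        raise ValueError("Non-deterministic assessments stay within 1-99%")

    # ICD 203's seven calibrated bands, upper bound -> dual terminology
    bands = [
        (5,  "Almost no chance / Remote (1-5%)"),
        (20, "Very unlikely / Highly improbable (5-20%)"),
        (45, "Unlikely / Improbable (20-45%)"),
        (55, "Roughly even chance / Roughly even odds (45-55%)"),
        (80, "Likely / Probable (55-80%)"),
        (95, "Very likely / Highly probable (80-95%)"),
        (99, "Almost certain / Nearly certain (95-99%)"),
    ]
    for upper, term in bands:
        if percent <= upper:
            return term
    return bands[-1][1]
```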
What ICD 203 lacks: It has source credibility tiers, but no structured mechanism for documenting why you adjusted your confidence in a specific direction based on specific evidence characteristics. It doesn’t require you to document your search process. It doesn’t include a self-audit step. And its approach to competing hypotheses, while present through Analysis of Competing Hypotheses (ACH), is more passive than what the scientific frameworks offer.
These gaps are exactly what the scientific frameworks fill.
GRADE – Reliability vs. Relevance
GRADE – Grading of Recommendations, Assessment, Development, and Evaluations – comes from clinical medicine, where it’s used to assess the certainty of evidence in systematic reviews[4]. Its certainty scale (High, Moderate, Low, Very Low) overlaps with ICD 203’s probability scale, and I prefer ICD 203’s numeric version (extended to nine points) for its precision and range. But GRADE contributes two features that ICD 203 doesn’t have.
First: downgrade and upgrade criteria. GRADE provides a structured vocabulary for explaining why your confidence in a piece of evidence went up or down. It’s one thing to say “I trust this source.” It’s another to say “I’m downgrading my confidence because of inconsistency across studies and possible publication bias, but upgrading it slightly because the observed effect is large.” GRADE makes the reasoning auditable.
The downgrade criteria – risk of bias, inconsistency, indirectness, imprecision, and publication bias – seed my list. I expect to supplement these from other frameworks as I go.
Second, and more important: the separation of reliability from relevance. GRADE’s core insight is that the quality of evidence and the strength of the conclusion you draw from it are independent axes. They must be scored separately and never averaged.
This sounds abstract until you see the edge cases. High-quality evidence can have low relevance: a rigorous peer-reviewed study proves something true, but that something only tangentially relates to the claim you’re investigating. The evidence is excellent. It just doesn’t move your needle. Conversely, low-quality evidence can have high relevance: a blog post from someone directly involved in an event, if true, would completely settle the question. The evidence is weak. The signal is critical.
Every piece of evidence in my unified methodology gets scored on both dimensions independently. Reliability: how much do I trust this source? Relevance: how directly does it address my question? These are different questions with different answers.
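A minimal sketch of what that per-source record looks like, assuming a simple 1-5 scale and Python dataclasses. The class and field names are mine, but the rule they encode is GRADE’s: reliability and relevance are scored separately and never averaged, and every confidence adjustment carries a named reason.

```python
from dataclasses import dataclass, field

# The five seed downgrade criteria from GRADE, per the text above.
DOWNGRADE_CRITERIA = {
    "risk_of_bias", "inconsistency", "indirectness", "imprecision", "publication_bias",
}

@dataclass
class SourceScore:
    source_id: str
    reliability: int   # how much do I trust this source? (1 = low, 5 = high)
    relevance: int     # how directly does it address my question? (1 = low, 5 = high)
    adjustments: list[str] = field(default_factory=list)  # auditable reasons confidence moved

    def downgrade(self, reason: str) -> None:
        """Record a GRADE-style downgrade with an auditable reason."""
        if reason not in DOWNGRADE_CRITERIA:
            raise ValueError(f"Unknown downgrade criterion: {reason}")
        self.reliability = max(1, self.reliability - 1)
        self.adjustments.append(f"downgraded: {reason}")

# Example: an excellent study that only tangentially touches the claim.
tangential = SourceScore("SRC07", reliability=5, relevance=2)
# Example: a weak source that, if true, would settle the question.
firsthand_blog = SourceScore("SRC12", reliability=2, relevance=5)
```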
IPCC – When Sources Disagree
The IPCC – Intergovernmental Panel on Climate Change – produces the most consequential scientific assessments in the world, and it has a confidence problem. Not a lack of confidence – a problem with expressing confidence accurately when thousands of studies point in slightly different directions[5].
Their solution is a two-axis confidence model:
- Evidence quality: Limited, Medium, or Robust
- Source agreement: Low, Medium, or High
These two axes produce a matrix. Robust evidence with high agreement yields very high confidence. Limited evidence with low agreement yields very low confidence. The interesting cases are in between – robust evidence with low agreement (sources are good but they disagree) or limited evidence with high agreement (sources agree but there aren’t enough of them).
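A minimal sketch of that matrix as a lookup table. The axis labels come from the IPCC guidance; the confidence value in each middle cell is my illustrative assumption about how the combinations resolve, not the IPCC’s own table.

```python
# (evidence quality, source agreement) -> collection-level confidence
CONFIDENCE_MATRIX = {
    ("Robust",  "High"):   "Very high confidence",
    ("Robust",  "Medium"): "High confidence",
    ("Robust",  "Low"):    "Medium confidence",   # good sources that disagree
    ("Medium",  "High"):   "High confidence",
    ("Medium",  "Medium"): "Medium confidence",
    ("Medium",  "Low"):    "Low confidence",
    ("Limited", "High"):   "Medium confidence",   # sources agree, but too few of them
    ("Limited", "Medium"): "Low confidence",
    ("Limited", "Low"):    "Very low confidence",
}

def collection_confidence(evidence_quality: str, agreement: str) -> str:
    """Look up collection-level confidence from the two independent axes."""
    return CONFIDENCE_MATRIX[(evidence_quality, agreement)]
```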
The key insight is that this is a collection-level tool, not a per-source tool. You can’t score a single piece of evidence on “agreement” because there’s nothing to compare it against. The two-axis model only becomes meaningful when you’re synthesizing a body of evidence and asking: how much of this evidence points in the same direction?
This is where I place it in the workflow – as a post-processing step after individual sources have been gathered and scored. Once you have the full collection, you assess the collection as a whole: how strong is the evidence base, and how much do the sources agree?
There’s a critical distinction within agreement that the IPCC model surfaces: independent convergence versus derived agreement. Five sources all saying the same thing sounds like strong agreement – until you realize they’re all citing the same original study. That’s derived agreement. It’s one data point echoed five times. Five independent sources reaching the same conclusion through different methods is independent convergence. That’s genuinely strong. No per-source rating captures this distinction. The collection-level synthesis does.
I skip IPCC’s likelihood scale (seven primary terms, with three additional terms available from earlier assessment reports). It creates false precision. The boundaries between “Extremely Likely (95-100%)” and “Very Likely (90-100%)” suggest you can quantify probability to a degree that the underlying evidence rarely supports. ICD 203’s scale (which I extend to nine points) strikes a better balance – more granular than GRADE’s four levels, but without IPCC’s false precision. And ICD 203’s ranges are cleaner: each band spans a meaningful interval rather than overlapping at the extremes.
PRISMA – Show Your Work
PRISMA – Preferred Reporting Items for Systematic Reviews and Meta-Analyses – exists because the medical community discovered that systematic reviews were being published with abysmal reporting quality[6]. As early as 1987, Mulrow documented that none of the 50 reviews she examined met all eight basic scientific reporting criteria. The problem wasn’t just that authors might cherry-pick sources – it was that nobody could tell either way. Search methods went undocumented. Inclusion decisions were opaque. The entire process from “started looking” to “here’s what I concluded” was a black box. PRISMA’s answer is radical transparency: document everything about how you searched for evidence, not just what you found.
What was searched. Where. With what terms. On what dates. How many results came back. What was excluded and why. What was included and why. The entire chain from “I started looking” to “here’s what I ended up with” must be visible.
For my unified methodology, PRISMA contributes the search methodology log – a mandatory archival artifact that accompanies every research output. I require this for three distinct reasons, and they’re all different.
Reason one: reader-facing transparency. Someone reads an article on this site, questions a claim, follows the sources, and finds them credible. But they want to go further. Did the author look broadly or narrowly? Were contrary sources considered and excluded, or never found in the first place? The search methodology log answers these questions. It’s the second layer of defensibility, beyond citing your sources.
Reason two: absence detection. What you looked for and didn’t find is itself a finding. If you search specifically for evidence supporting a claim and find nothing, that’s a data point. If you search for contradictory evidence and find nothing, that’s a different data point. The absence of evidence is not proof of absence, but it is evidence, and it must be captured. Without a search log, absences are invisible.
Reason three: process auditing. This is the sleeper. If you don’t know how the search was conducted, you can’t improve it. You can’t ask “were the search terms good?” or “did I look in the right places?” or “am I only covering five percent of the available research?” The search methodology log turns the research process into a self-improving system. Audit the logs, find weaknesses, refine the process, and the next round of research is better.
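A minimal sketch of what one entry in the search methodology log captures, assuming a Python dataclass. The fields mirror the requirements above – where, terms, dates, result counts, inclusions, exclusions, and recorded absences – but the structure itself is mine, not part of the PRISMA statement, and the example entry is purely illustrative.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class SearchLogEntry:
    where: str                     # search engine, database, or archive queried
    terms: str                     # exact query string as executed
    run_on: date                   # when the search was executed
    results_returned: int          # how many hits came back
    included: list[str] = field(default_factory=list)       # source IDs kept
    excluded: dict[str, str] = field(default_factory=dict)  # source ID -> reason for exclusion
    absences: list[str] = field(default_factory=list)       # looked for, not found

log = [
    SearchLogEntry(
        where="Google Scholar",
        terms='"unified research methodology" AND "ICD 203" AND GRADE',
        run_on=date(2026, 3, 1),
        results_returned=0,
        absences=["no published combination of intelligence and scientific frameworks found"],
    ),
]
```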
There’s one more thing PRISMA’s transparency principle surfaces, and I didn’t appreciate it until I hit it in practice: vocabulary exploration. Before you can design a comprehensive search, you have to know what words to search for. That sounds obvious until you encounter a phenomenon that different communities call by different names. I discovered this when researching AI sycophancy. AI safety researchers call it “sycophancy.” Aviation calls it “automation complacency.” Healthcare calls it “acquiescence.” Defense calls it “calibrated trust.” The EU AI Act calls it “automation bias.” Same dangerous phenomenon – five different names across five different regulatory vocabularies. If you search only for “sycophancy,” you find only the AI safety literature and miss the decades of regulated-industry work on the same problem.
The fix is simple: before designing discriminating searches, map the vocabulary space. Identify the key concepts in the claim or question and determine whether different domains use different terminology. This adds a step, but it prevents a class of systematic blind spot that no amount of search thoroughness can overcome if you’re searching with the wrong words.
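A minimal sketch of that vocabulary-mapping step, using the sycophancy example above. The concept key and the query-expansion logic are illustrative assumptions; the point is that every search gets expanded across every domain’s term for the same phenomenon.

```python
# Key concept -> the terms different communities use for it (from the example above).
VOCABULARY = {
    "ai_overreliance": [
        "sycophancy",               # AI safety research
        "automation complacency",   # aviation
        "acquiescence",             # healthcare
        "calibrated trust",         # defense
        "automation bias",          # EU AI Act
    ],
}

def expand_queries(base_query: str, concept: str) -> list[str]:
    """Generate one query per domain-specific term so no vocabulary is missed."""
    return [f'{base_query} "{term}"' for term in VOCABULARY[concept]]

queries = expand_queries("large language model overreliance", "ai_overreliance")
```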
Cochrane/RoB 2 – Name the Bias
Cochrane’s Risk of Bias 2 tool is designed to assess bias in individual studies – specifically randomized controlled trials, but the logic generalizes[7]. Where ICD 203’s source credibility tiers tell you what type of source you’re looking at (official document vs. media report), RoB 2 asks a different question: even within a credible source type, what specific biases might be operating?
A peer-reviewed paper in a top journal can have massive selective reporting bias. A vendor whitepaper can be methodologically sound but have obvious conflict-of-interest bias. The source type doesn’t tell you the bias type. You need a vocabulary for naming specific biases, not just rating general trustworthiness.
RoB 2’s original five domains are tuned for clinical trials. Two of them – randomization bias and deviation from protocol – apply specifically to controlled trial methodology, so most of the sources I encounter won’t require them. But some will. When I do find evidence based on an RCT, those domains matter, and dropping them entirely would create a blind spot in exactly the cases where formal bias assessment is most developed. So I keep all five of RoB 2’s original domains, apply the clinical-specific ones conditionally when the source warrants it, and add a sixth:
- Missing data bias: Is important data absent? Were inconvenient results dropped or simply not mentioned?
- Measurement bias: Were outcomes measured objectively, or could the researcher’s expectations have influenced what they found?
- Selective reporting bias: Were all findings reported, or only the ones supporting the thesis?
- Randomization bias (conditional – RCT sources): Was the study designed to avoid selection bias?
- Protocol deviation bias (conditional – RCT sources): Was the methodology followed as designed, or were there departures?
- Conflict of interest and funding bias: Who paid for this research? Who benefits from a particular outcome?
That sixth domain is conspicuously absent from RoB 2, and it’s one of the most important for my purposes. The supplements industry is a multi-billion-dollar market where the overwhelming majority of products are unsupported by independent science – but you’d have a hard time seeing that in the published literature, because so much of the research is funded by the companies selling the products. The AI industry has the same problem. A research paper from a major AI company about its own products cannot be treated the same as independent third-party research, even when the methodology is sound. The financial incentive to find favorable results is a bias that must be named and assessed.
Each source gets a three-level bias judgment per domain: Low risk, Some concerns, High risk. This adds work. Every individual piece of evidence now carries three ratings: reliability (from GRADE), relevance (from GRADE), and bias risk across six domains (from my adapted Cochrane – four universal, two conditional for RCT-based sources). The trade-off is explicit: thoroughness costs time, but the cost of publishing something indefensible is higher.
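A minimal sketch of the six-domain assessment, assuming Python enums. The domain names and three risk levels come from the list above; the validation logic and the whitepaper example are illustrative.

```python
from enum import Enum

class BiasRisk(Enum):
    LOW = "Low risk"
    SOME_CONCERNS = "Some concerns"
    HIGH = "High risk"

UNIVERSAL_DOMAINS = [
    "missing_data", "measurement", "selective_reporting", "conflict_of_interest",
]
RCT_ONLY_DOMAINS = ["randomization", "protocol_deviation"]

def assess_bias(judgments: dict[str, BiasRisk], is_rct: bool) -> dict[str, BiasRisk]:
    """Check that a source's bias judgments cover exactly the required domains."""
    required = UNIVERSAL_DOMAINS + (RCT_ONLY_DOMAINS if is_rct else [])
    missing = [d for d in required if d not in judgments]
    if missing:
        raise ValueError(f"Missing bias judgments for: {missing}")
    return {d: judgments[d] for d in required}

# Example: a vendor whitepaper (not an RCT) that is methodologically sound
# but has an obvious funding conflict.
whitepaper = assess_bias(
    {
        "missing_data": BiasRisk.LOW,
        "measurement": BiasRisk.LOW,
        "selective_reporting": BiasRisk.SOME_CONCERNS,
        "conflict_of_interest": BiasRisk.HIGH,
    },
    is_rct=False,
)
```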
Chamberlin/Platt – Try to Prove Yourself Wrong
In 1890, T.C. Chamberlin first published “The Method of Multiple Working Hypotheses,” later revised in 1897[8]. In 1964, John Platt published “Strong Inference,” explicitly citing Chamberlin’s work[9]. Between them, they articulated what might be the most important principle in the philosophy of science: if you want to find the truth, try as hard as you can to prove yourself wrong.
Chamberlin’s argument is simple and devastating. When you form a single hypothesis and set out to test it, you develop what he called “parental affections” for it. You unconsciously seek confirming evidence. You explain away contradictions. You become an advocate instead of an investigator. The solution: form multiple competing hypotheses simultaneously. With several hypotheses in play, you’re forced to design your investigation to distinguish between them, not just confirm your favorite.
Platt sharpened this into a recursive process he called “strong inference”: devise alternative hypotheses, devise a crucial experiment that would exclude one or more of them, carry out the experiment to get a clean result, and recycle. Platt deliberately numbered this final step “1’” – one-prime, not four – to signal that it’s a loop, not a sequence. You don’t finish. You refine and repeat. The key word is exclude. You’re not trying to prove anything right. You’re trying to prove things wrong. What survives your best attempts at falsification is your best current answer.
Platt was candid that he was codifying existing scientific practice – Baconian method, sharpened and made explicit – not inventing something new. The power isn’t in the novelty. It’s in the rigor of naming the steps and insisting they be followed.
ICD 203 includes Analysis of Competing Hypotheses (ACH), which evaluates existing hypotheses against existing evidence. ACH is good. Chamberlin/Platt is better, and here’s why: ACH is passive evaluation. You lay out your hypotheses, you look at the evidence you’ve already collected, and you score which hypothesis the evidence best supports. Chamberlin/Platt is active falsification. You generate competing hypotheses and then design your search to discriminate between them – specifically looking for evidence that would disprove each hypothesis, including the one you favor.
The search strategy is fundamentally different. ACH asks “which hypothesis does existing evidence best support?” Chamberlin/Platt asks “what evidence would disprove each hypothesis, and can I find it?”
In my unified methodology, Chamberlin/Platt supersedes ACH as the primary hypothesis methodology. The implementation is four steps:
- When a claim comes in, generate competing hypotheses – not just “true or false” but “what are the possible explanations?”
- Design searches to discriminate between hypotheses – specifically look for evidence that would disprove each one, including the preferred one.
- Evidence that fails to disprove a hypothesis strengthens it. Evidence that disproves it eliminates it.
- What survives is what gets reported.
ACH still lives inside this as the evaluation matrix – the mechanism for scoring hypotheses against evidence. But the outer loop is Chamberlin/Platt’s falsification-first approach. I chose the superset because it forces the investigation to challenge the investigator’s assumptions rather than confirm them.
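A minimal sketch of that outer loop. The helper functions – hypothesis generation, discriminating search design, search execution – stand in for steps a human or AI agent performs; their names and the evidence format are assumptions, not an implementation of ACH or of Platt’s paper.

```python
def strong_inference(claim, generate_hypotheses, design_discriminating_searches, run_search):
    """One pass of the falsification-first loop; recycle on the survivors (Platt's step 1')."""
    hypotheses = generate_hypotheses(claim)          # competing explanations, not just true/false
    survivors = []
    for hypothesis in hypotheses:
        queries = design_discriminating_searches(hypothesis)  # what evidence would disprove this one?
        evidence = [result for q in queries for result in run_search(q)]
        if any(result["disproves"] == hypothesis for result in evidence):
            continue                                 # disproven: eliminated
        survivors.append(hypothesis)                 # failing to disprove strengthens it
    return survivors                                 # report what survives; recycle as needed
```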
I want to be right. But I don’t want my idea to be right. I want the process to get me to the truth, wherever that is. That’s what Chamberlin and Platt make possible.
CONSORT – Evaluated, Not Included
CONSORT – Consolidated Standards of Reporting Trials – is a checklist for reporting randomized controlled trials: 25 items in its 2010 version[10], expanded to 30 in CONSORT 2025. Participant recruitment, randomization procedures, blinding, sample size calculations, adverse events. None of it translates to article research.
I include it here because what you didn’t choose matters as much as what you did. CONSORT shares a core philosophy with everything I’ve built: how you conducted the investigation matters as much as – arguably more than – what you concluded. A conclusion presented without auditable methodology is just an assertion. Someone says this is true. Is it? How would you tell?
That principle is foundational to my unified methodology, but I’ve already captured it more effectively through PRISMA (search transparency) and Cochrane (bias assessment). CONSORT is redundant for my purposes.
ROBIS – Audit Your Own Process
Every framework discussed so far evaluates something external – sources, evidence, hypotheses, search completeness. ROBIS turns the lens inward[11]. It asks: is the review process itself biased?
This is the question nobody wants to ask, because the answer might be yes.
ROBIS assesses four domains:
- Eligibility criteria: Were they defined before you started searching, or did you define them after seeing what was available? Post-hoc criteria let you unconsciously gerrymander your evidence base.
- Search comprehensiveness: Was the search genuinely broad, or did you stop when you found enough to support your conclusion?
- Evaluation consistency: Was every source held to the same standard, or did sources supporting your thesis get lighter scrutiny?
- Synthesis fairness: Were results synthesized honestly, or did the conclusions cherry-pick from the evidence?
The connection to PRISMA is direct. PRISMA provides the search metadata that makes a process audit possible. ROBIS tells you what to look for when you conduct it.
In my unified methodology, ROBIS becomes a critical validation step. After all evidence has been gathered, scored, and synthesized, the process itself is audited against these four domains. Did I follow my own rules? Did I apply my standards consistently? Did I stop searching too early? These are uncomfortable questions, and that’s exactly why they need to be asked.
But ROBIS has a blind spot that I discovered through practice: it audits the process but not the interpretation. A research agent can follow every step perfectly – search broadly, score consistently, synthesize fairly – and still mischaracterize what a source actually says. The process audit passes. The conclusion is wrong. This happened in practice: a research run correctly found a primary source article but described the subject as a panelist when the article explicitly stated he was an audience member. The four-domain self-audit didn’t catch it because the process was flawless – the interpretation wasn’t.
The fix is a fifth audit domain: source-back verification. After the process audit, go back to each source cited in the assessment, re-read it independently, and verify that the assessment accurately represents what the source says. This catches a specific class of error that ROBIS was never designed to detect: correct process, incorrect reading.
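A minimal sketch of the five-domain self-audit as a checklist: ROBIS’s four domains plus the fifth, source-back verification, condensed from the text above. The checklist structure is mine.

```python
AUDIT_DOMAINS = {
    "eligibility_criteria": "Were inclusion criteria defined before searching, not after?",
    "search_comprehensiveness": "Was the search genuinely broad, or did it stop at 'enough'?",
    "evaluation_consistency": "Was every source held to the same standard?",
    "synthesis_fairness": "Do the conclusions reflect the whole evidence base?",
    # Net-new fifth domain: a flawless process can still produce an incorrect reading.
    "source_back_verification": "Re-read each cited source: does the assessment match what it actually says?",
}

def self_audit(answers: dict[str, bool]) -> list[str]:
    """Return the audit domains that failed (answered False) or were skipped."""
    return [domain for domain in AUDIT_DOMAINS if not answers.get(domain, False)]
```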
NAS Standards – Harden the Requirements
The National Academies of Sciences published standards for systematic reviews that focus on institutional safeguards – 21 standards with 82 elements of performance organized across four stages of review[12]. Three concerns addressed by these standards are directly relevant to my unified methodology: conflict of interest management, comprehensive search requirements, and gap identification.
None of these introduce new concepts at this point in my evaluation. All of them harden concepts I’ve already captured.
Conflict of interest management extends beyond assessing COI in sources you find. NAS requires a structured process: disclose, evaluate, manage, document. For my purposes, this strengthens the Cochrane COI bias domain with an additional signal – when evaluating a source for conflict of interest, did the authors follow good practice in disclosing their own conflicts? A source that openly documents its funding relationships and potential biases is more trustworthy than one that doesn’t, even if both have conflicts.
Comprehensive search elevates PRISMA’s transparency requirement from “document what you searched” to “demonstrate that your search was comprehensive enough to be valid.” Under NAS, a narrow or convenience-based search doesn’t just look bad – it’s disqualifying. If you only examined a fraction of the available evidence, your conclusions don’t hold.
Gap identification promotes what I noted under PRISMA – that the absence of evidence is itself evidence – from an observation into a formal deliverable. The research output must explicitly identify what evidence is missing, what was expected but not found, and what that absence means for the conclusions.
NAS doesn’t add new tools to the unified methodology. It raises the bar on tools I already have.
Beyond Intelligence and Science
At this point I paused and asked a question that any honest application of this methodology demands: was my search broad enough?
I had drawn from two communities – intelligence and science. But those aren’t the only disciplines that care about truth. What about journalism? Legal practice? Auditing? Medical diagnosis? Historical scholarship? I searched twelve additional disciplines for formal truth-seeking frameworks, applying the same evaluation criteria: does this add something my existing methodology doesn’t already cover?
Journalism was the most natural candidate, sitting somewhere between science and intelligence on the trust spectrum. Science generally assumes good-faith data – most published research is real. Intelligence assumes adversarial conditions – sources may be actively deceptive. Journalism deals with both.
The finding was striking: journalism is principles-based, not methodology-based. Every journalistic framework I evaluated – the IFCN Code of Principles, PolitiFact’s Truth-O-Meter, NewsGuard’s credibility scoring, the SPJ Code of Ethics, BBC Editorial Guidelines, Bellingcat’s OSINT methodology – tells practitioners what to do (be accurate, verify, be transparent) but not how to assess whether they’ve done it sufficiently. None of the frameworks I examined has a hierarchical evidence quality scale, calibrated uncertainty language, structured bias assessment domains, or source reliability tiering. Journalism manages uncertainty through attribution (“officials said” vs. “documents show” vs. “sources allege”) rather than through calibrated confidence language. It’s an informal credibility signaling system, but it’s not codified into a formal methodology.
Other disciplines yielded interesting tools but nothing that changes my standard. Legal standards of proof provide graduated certainty thresholds tied to decision stakes – higher consequences require higher evidence thresholds – which is conceptually valuable but doesn’t translate directly to my research workflow. Historical source criticism asks questions none of my frameworks do – is this document what it claims to be? Did the author have the ability to know the truth? – which matters for evaluating online sources where authorship and authenticity aren’t always clear. Auditing standards (PCAOB, GAAS) formalize an explicit evidence hierarchy and an adversarial posture that assumes the entity being evaluated has incentives to misrepresent. The Wardle and Derakhshan Information Disorder Taxonomy[13] classifies types of information failure along two dimensions – falseness and intent to harm – producing three categories: misinformation (false, no intent to harm), disinformation (false, intent to harm), and malinformation (true, shared to harm). This is a dimension none of my nine frameworks address.
I evaluated all of these and concluded that my existing nine frameworks cover the core epistemological ground. The candidates above are noted for future integration but not included in this version of the methodology. The process is already demanding enough; additional complexity needs to clear a high bar of demonstrated value.
The important point is that I looked. The search was deliberate, broad, and documented. That’s the methodology auditing its own compliance.
What’s Missing From All of Them
After evaluating nine frameworks, one gap remained that none of them address. This isn’t an adaptation or reinterpretation. It’s a net-new feature, derived from practical research experience rather than existing standards.
Temporal revisitation. Research conclusions have a shelf life. A claim validated in March 2026 may not hold in March 2027. New evidence emerges. Studies get replicated or refuted. The landscape shifts.
ICD 203’s tradecraft standard number seven – Change – says to flag when new information shifts a prior assessment. But it’s passive: if new information happens to come to your attention, update accordingly. It doesn’t say “go back and proactively check whether your conclusions still hold.”
The search methodology log I’m already archiving under PRISMA makes temporal revisitation practical. You don’t start from scratch. You start from “here’s exactly what I searched last time, here’s what I found, here are the conclusions I drew – now run it again and tell me what changed.” The archived methodology becomes a blueprint for re-execution.
This matters because some conclusions are too important to publish once and forget. A forward-looking claim about a trajectory deserves periodic re-examination. Is the trajectory still holding? Has new evidence shifted it? Have the sources I relied on been challenged or superseded? The unified methodology treats research as a living process, not a one-time event.
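A minimal sketch of what re-execution looks like, assuming the search methodology log is archived as JSON. The archive format and the run_search helper are illustrative assumptions; the point is that revisitation becomes a diff against the original log rather than a fresh start.

```python
import json

def revisit(archive_path: str, run_search) -> list[dict]:
    """Re-execute each archived search and report what changed since last time."""
    with open(archive_path) as f:
        archive = json.load(f)  # the search methodology log saved with the original research

    changes = []
    for entry in archive["searches"]:
        current = run_search(entry["where"], entry["terms"])      # re-run the original query
        new_hits = set(current) - set(entry["included"])          # sources that did not exist before
        if new_hits:
            changes.append({"terms": entry["terms"], "new_sources": sorted(new_hits)})
    return changes
```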
The Unified Methodology – Assembled
Nine frameworks evaluated. Two net-new features added. Features selected with documented rationale for every decision. What emerges is a research methodology that’s more comprehensive than any individual framework, because it was built by combining the best of all of them and filling the gaps that remained.
Per-Source Scoring
Every individual piece of evidence is assessed on three dimensions:
- Reliability (from GRADE): How trustworthy is this source?
- Relevance (from GRADE): How directly does it address the question?
- Bias risk (from adapted Cochrane): Six domains assessed at three levels –
- Missing data: Low risk / Some concerns / High risk
- Measurement: Low risk / Some concerns / High risk
- Selective reporting: Low risk / Some concerns / High risk
- Randomization (conditional – RCT sources): Low risk / Some concerns / High risk
- Protocol deviation (conditional – RCT sources): Low risk / Some concerns / High risk
- Conflict of interest/funding: Low risk / Some concerns / High risk
Collection-Level Synthesis
Once all evidence is gathered, the collection is assessed as a whole:
- Evidence quality (from IPCC): Limited / Medium / Robust
- Source agreement (from IPCC): Low / Medium / High
- Independence assessment: Is agreement derived (common sourcing) or independent (convergent conclusions from separate work)?
- Outlier identification: Which sources diverge from the consensus, and why?
The Workflow
The complete research workflow, step by step:
- Claim received and clarified. A specific, testable claim enters the process. Ambiguities surfaced. Embedded assumptions identified.
- Vocabulary exploration (net-new, extends PRISMA). Before designing searches, map the terminology space. Different domains may use different terms for the same phenomenon. Single-term searches create systematic blind spots.
- Competing hypotheses generated (Chamberlin/Platt). Not just “true or false” – what are the possible explanations?
- Discriminating searches designed (Chamberlin/Platt). What evidence would disprove each hypothesis?
- Searches executed, methodology logged (PRISMA). Every search is documented: where, what terms, what was found, what was rejected, what was absent.
- Per-source scoring (GRADE + adapted Cochrane). Each source rated on reliability, relevance, and bias across six domains.
- Collection-level synthesis (IPCC). Evidence quality, source agreement, independence of convergence, outlier identification.
- Probability assessment (ICD 203). Final assessment using the calibrated probability scale.
- Gap identification (NAS). What evidence is missing? What was expected but not found? What does the absence mean?
- Process self-audit + source-back verification (ROBIS + net-new). Did the research process exhibit bias? Do the assessment’s claims match what the sources actually say?
- Report with revisit triggers (ICD 203 tradecraft standards). Every claim sourced, every judgment explicit, specific conditions identified that would warrant re-research.
- Temporal revisitation (net-new). Archive the complete research methodology for periodic re-execution. Conclusions have a shelf life; treat research as a living process.
Twelve steps. Nine source frameworks. Two net-new features. One methodology.
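For completeness, a minimal sketch of the twelve steps as an ordered pipeline. The step names map to the stages above; the shared-state pattern is an illustrative assumption about how one stage hands off to the next, not an implementation of any of them.

```python
WORKFLOW = [
    "claim_clarification",
    "vocabulary_exploration",                    # net-new, extends PRISMA
    "hypothesis_generation",                     # Chamberlin/Platt
    "discriminating_search_design",              # Chamberlin/Platt
    "search_execution_and_logging",              # PRISMA
    "per_source_scoring",                        # GRADE + adapted Cochrane
    "collection_synthesis",                      # IPCC
    "probability_assessment",                    # ICD 203
    "gap_identification",                        # NAS
    "self_audit_and_source_back_verification",   # ROBIS + net-new
    "report_with_revisit_triggers",              # ICD 203
    "temporal_revisitation",                     # net-new
]

def run_workflow(claim: str, steps: dict) -> dict:
    """Run each stage in order, passing a shared research state through the pipeline."""
    state = {"claim": claim}
    for name in WORKFLOW:
        state = steps[name](state)  # each stage reads and extends the shared state
    return state
```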
The Human Cost and the AI Opportunity
Let’s be honest about something: this methodology is hard. Not conceptually hard – the individual components are straightforward. Hard in the sense of tedious, time-consuming, and unglamorous. Scoring every source on reliability, relevance, and six bias domains. Documenting every search. Assessing collection-level agreement. Running a self-audit. Identifying gaps.
This is the kind of work that human research teams deprioritize. Not because it’s unimportant – because the return on investment for those final layers of rigor is small relative to the effort required. It’s the same dynamic as code coverage. Getting to 80% coverage is relatively easy. Getting to 95% is hard. Getting to 100% is brutally expensive. Most teams rationally stop well short of 100% because the cost of the last few percent exceeds the benefit.
The cost curve is different now.
The most tedious parts of this methodology – the per-source scoring, the search logging, the bias assessment, the gap identification – are exactly the tasks where AI excels. Not because AI is smarter than human researchers. Because AI has infinite patience, zero complaints about grunt work, and – when properly constrained – a consistency that humans struggle to match on repetitive tasks. That qualifier matters. Unconstrained, AI introduces its own failure modes: hallucination, false confidence, sycophantic agreement with the researcher’s assumptions. These are exactly the problems this methodology is designed to catch. But a well-constrained AI applying a well-defined process will apply the same six-domain bias assessment to the fiftieth source with more rigor than a human team grinding through the same checklist for the eighth hour. Not because the AI is better – because humans conserve energy on repetitive tasks. It’s rational behavior that produces irrational results when thoroughness matters.
This doesn’t replace human judgment. The competing hypotheses still need human insight. The final assessment still needs human interpretation. The researcher profile – the declaration of personal biases and conflicts – is inherently human. What changes is the reach. Territory that was previously too expensive to cover is now accessible. Standards that were previously too demanding to maintain are now practical.
I can push the bar higher because the cost of doing so has fundamentally changed.
The Researcher’s Obligation
There’s one more piece, and it’s the most personal.
Every framework I evaluated – every single one – has transparency at its core. Cite your sources. Show your search process. Name the biases. Document your reasoning. Audit your own work. The unified methodology inherits this principle and extends it to its logical conclusion: if you’re going to demand transparency from your sources, you should demand it from yourself.
The unified methodology includes a researcher profile as a functional input to the analytical process – not a disclosure footnote, but an active calibration instrument. The researcher profile documents known personal biases, professional conflicts of interest, and acknowledged blind spots of the human or humans driving the research. It feeds into the process at the beginning, not the end. It shapes how hypotheses are generated, how evidence is evaluated, and how the self-audit is conducted.
Consider how this contrasts with – and extends – Joohn Choe’s approach. Choe’s ICD 203 prompt asserts that the human researcher’s inputs must be assumed true – if the researcher says the sky is green, the sky is green for purposes of that analysis. That’s a reasonable default for intelligence work where the analyst is providing classified context the AI can’t verify. My methodology accommodates this: the researcher can declare ground truth that the process accepts without testing. But the methodology also supports the opposite – assertions the researcher wants tested against evidence, and questions the researcher wants answered. Both can coexist in the same investigation. The details of how this works are in Part 2.
The researcher profile adds a third dimension. It tells the process exactly where the human researcher is most likely to be wrong. Here are the biases that might warp the questions being asked. Here are the conflicts of interest that might influence which claims get investigated and which get ignored. Here are the blind spots that might leave entire categories of evidence unexamined.
The researcher profile is a set of parameters. The process uses those parameters to compensate – to push harder on exactly the areas where the human’s judgment is least trustworthy. It’s not a confession. It’s a correction factor.
This is designed to be general-purpose. An individual researcher fills in their personal biases, conflicts, and blind spots. A team fills in organizational biases and institutional conflicts. A company fills in market position and financial incentives. The template is the same; the inputs change. Anyone using this methodology can plug in their own profile, and the process adjusts accordingly.
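A minimal sketch of the profile as a data structure, with illustrative entries. The real mechanics of how it feeds hypothesis generation and the self-audit belong to Part 2.

```python
from dataclasses import dataclass, field

@dataclass
class ResearcherProfile:
    biases: list[str] = field(default_factory=list)                 # known personal biases
    conflicts_of_interest: list[str] = field(default_factory=list)  # professional conflicts
    blind_spots: list[str] = field(default_factory=list)            # acknowledged blind spots

# The same template serves an individual, a team, or a company; only the inputs change.
profile = ResearcherProfile(
    biases=["inclined to trust peer-reviewed sources over practitioner reports"],
    conflicts_of_interest=["publishes articles on the topic being investigated"],
    blind_spots=["limited fluency in non-English literature"],
)
# Downstream, the process treats these as a correction factor: push hardest on
# hypotheses, searches, and audits in exactly the areas the profile flags.
```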
A conclusion without auditable reasoning is just an assertion. A methodology without declared biases is just a process. The goal isn’t to eliminate bias – that’s impossible. The goal is to make it visible, feed it into the system as a known quantity, and demonstrate that the process accounted for it.
Someone says this is true. Is it true? How do you tell?
This is how.
What’s Next
This article described the what and the why – the unified research methodology and the reasoning behind every feature in it. Part 2 will describe the how: translating this methodology into a machine-executable prompt that implements the workflow, including the AI-specific behavioral constraints that make the difference between a process the AI acknowledges and a process the AI actually follows.
Research
The claims and assertions in this article were investigated using the methodology it describes. Full evidence archives — every source scored, every search logged, every hypothesis tested — are linked below.
| ID | Topic | Queries/Claims |
|---|---|---|
| R0052 | Article claim verification | 14 claims |
| R0049 | Published AI research methodology prompts | 3 queries |
| R0050 | Journalism and other truth-seeking disciplines | 3 queries |
| R0051 | Fact-checking methodology gap analysis | 3 queries |
References
Each reference is prefixed with links to the evidence behind it: the first link goes to the claim verification that tested it; the second goes to the source’s scorecard — reliability, relevance, bias assessment, and extracted evidence.
[1] Joohn Choe, “The Copy and Paste War: On AI for Citizen OSINT,” Substack, 2024. https://joohn.substack.com/p/the-copy-and-paste-war-on-ai-for
[2] (R0052, R0049) This claim — that I could not find prior work that systematically combines these frameworks — is itself subject to my methodology. Proving a universal negative is inherently limited. Four research runs investigated this: R0052 verified the article’s 14 claims, R0049 searched for published AI research prompts, R0050 surveyed journalism and twelve other disciplines, R0051 analyzed fact-checking epistemological frameworks. None found a complete unified methodology. Classified or internal practitioner work cannot be searched, hence the hedging.
[3] (R0052/C001, SRC01) Office of the Director of National Intelligence, “Intelligence Community Directive 203: Analytic Standards,” 2015. https://www.dni.gov/files/documents/ICD/ICD-203-Analytic-Standards.pdf
[4] (R0052/C003, SRC01) GRADE Working Group. Schunemann H, Brozek J, Guyatt G, Oxman A, eds. “GRADE Handbook,” 2013. https://gdt.gradepro.org/app/handbook/handbook.html
[5] (R0052/C004, SRC01) Mastrandrea MD, Field CB, Stocker TF, et al. “Guidance Note for Lead Authors of the IPCC Fifth Assessment Report on Consistent Treatment of Uncertainties,” IPCC, 2010. https://www.ipcc.ch/site/assets/uploads/2017/08/AR5_Uncertainty_Guidance_Note.pdf
[6] (R0052/C005, SRC01) Page MJ, McKenzie JE, Bossuyt PM, et al. “The PRISMA 2020 statement: an updated guideline for reporting systematic reviews,” BMJ, 2021;372:n71. https://doi.org/10.1136/bmj.n71
[7] Sterne JAC, Savovic J, Page MJ, et al. “RoB 2: a revised tool for assessing risk of bias in randomised trials,” BMJ, 2019;366:l4898. https://doi.org/10.1136/bmj.l4898
[8] (R0052/C007, SRC01) Chamberlin TC. “The Method of Multiple Working Hypotheses,” Science, 1890;15(366):92-96. Revised version: Journal of Geology, 1897;5(8):837-848. Reprinted in Science, 1965;148(3671):754-759. Platt cited the 1897 revision. https://doi.org/10.1126/science.148.3671.754
[9] (R0052/C008, SRC01) Platt JR. “Strong Inference,” Science, 1964;146(3642):347-353. https://doi.org/10.1126/science.146.3642.347
[10] (R0052/C006, SRC01) Schulz KF, Altman DG, Moher D. “CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials,” BMJ, 2010;340:c332. https://doi.org/10.1136/bmj.c332
[11] (R0052/C014, SRC01) Whiting P, Savovic J, Higgins JPT, et al. “ROBIS: A new tool to assess risk of bias in systematic reviews was developed,” Journal of Clinical Epidemiology, 2016;69:225-234. https://doi.org/10.1016/j.jclinepi.2015.06.005
[12] (R0052/C010, SRC01) National Academies of Sciences, Engineering, and Medicine. “Finding What Works in Health Care: Standards for Systematic Reviews,” The National Academies Press, 2011. https://doi.org/10.17226/13059
[13] (R0052/C011, SRC01) Wardle C, Derakhshan H. “Information Disorder: Toward an interdisciplinary framework for research and policy making,” Council of Europe, 2017. https://rm.coe.int/information-disorder-toward-an-interdisciplinary-framework-for-researc/168076277c