How We Fact-Check AI-Written Content
Post 037, “The Model-Generation Audit,” describes what we found when three AI models reviewed 33 blog posts for factual accuracy: 85 errors in the first pass, another 137 when the fix agents introduced errors of their own, and a taxonomy of failure modes that revealed how language models lie (fabricated quotations, mischaracterized research, wrong numbers, confident claims about things that never happened). Post 035, “The Revision Tax,” describes a related cost: how multi-model review of AI content can exceed the cost of creating it. This post describes the system I built afterward to catch the next 85 errors before they reach publication.
The system is a pipeline with four stages, and I want to be precise about what each stage does and does not catch, because the temptation after an audit like 037’s is to build something comprehensive enough that it sounds like a solution, and the honest accounting is that it is a mitigation with known gaps. I then fact-checked all 38 published posts retroactively through this pipeline (the original 33 from the audit plus 5 written since), extracting 448 discrete claims and verifying each through GPT-5.4 with web search. The results: 68% of claims verified cleanly, 13% were partially verified (directionally correct but overstated or imprecise), 8% could not be verified from external sources (mostly my own operational data), 5% needed corrections to phrasing or attribution, nearly 4% (17 claims) were substantively wrong, and the remaining fraction fell into edge categories like pure opinion. The 17 errors were exactly the kinds post 037 would recognize: wrong numbers, overclaims, misattributions. Every one of them had passed my own review before publication.
What Counts as a Claim
The first problem I had to solve was deciding what the fact-checker is allowed to flag, because the boundary between a factual claim and an interpretive judgment is not clean, and the most common false positive in automated review is a model treating a thesis as an error.
I classify claims into five types. Empirical claims (verifiable numbers, dates, attributions) get full verification against primary sources. Literature claims (characterizations of what published research found) get checked against the actual paper, because the most common error, as post 037 documented in detail, is describing what a paper found in terms that are close but not quite right. Interpretive claims (my reading of the evidence) get checked for internal consistency but not for agreement with external opinion, because disagreeing with the conventional reading is often the point of the post. Thesis claims (the argument itself) are immune to fact-checking entirely. And protected source text (verbatim quotations, benchmark data, experimental output) must never be edited for style, only for transcription accuracy.
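The five types and their review scopes can be sketched as a small lookup. This is illustrative, not the pipeline’s actual code: the enum names, dictionary, and helper function are mine; only the categories and what a reviewer may do with each come from the taxonomy above.

```python
from enum import Enum

class ClaimType(Enum):
    """The five claim types. Names are mine; the categories are the post's."""
    EMPIRICAL = "empirical"          # verifiable numbers, dates, attributions
    LITERATURE = "literature"        # characterizations of published research
    INTERPRETIVE = "interpretive"    # the author's reading of the evidence
    THESIS = "thesis"                # the argument itself
    PROTECTED_SOURCE = "protected"   # verbatim quotes, benchmark data, output

# What a reviewer is allowed to do with each type.
REVIEWER_SCOPE = {
    ClaimType.EMPIRICAL: "verify against primary sources",
    ClaimType.LITERATURE: "check against the actual paper",
    ClaimType.INTERPRETIVE: "check internal consistency only",
    ClaimType.THESIS: "immune to fact-checking",
    ClaimType.PROTECTED_SOURCE: "transcription accuracy only, never style",
}

def allowed_action(claim_type: ClaimType) -> str:
    """Tell a reviewer what register a claim operates in before review starts."""
    return REVIEWER_SCOPE[claim_type]
```

The point of the lookup shape is that the reviewer consults the scope before flagging anything, not after.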
I built this taxonomy because I kept encountering the same false positive: a model flagging an intentional interpretive choice as a factual error. When post 009 argues that evaluation circularity is “not solvable in the epistemological sense,” a reviewer without context will flag “unsolvable” as overclaiming. It is not overclaiming; it is the thesis. The claim classification tells the reviewer what register each sentence operates in before the reviewer starts looking for problems.
The Methodology Brief
The claim taxonomy solves the problem in theory. The methodology brief solves it in practice.
A methodology brief is a companion document that travels with a post through every review stage. It describes what the post actually did, what its key design decisions were, which claims are thesis statements versus empirical observations, and (critically) which specific phrases and framings are protected from editorial correction. I created the brief system after three corrections were reverted during the model-generation audit: a word chosen for its epistemological precision was replaced with a neutral synonym, model-generated translation output in post 032 was edited for style as if it were author prose (including an attempt to change “betwixt,” which was actual DeepSeek V3.2 output displayed as a data exhibit), and a practitioner’s epistemic position in post 014 was rewritten into neutral technical language. Each correction was locally reasonable and globally wrong, because the reviewer lacked the context to know what the author intended.
The brief has four sections that address this. Reviewer Guardrails classify the post’s text into registers matching the five claim types (empirical, literature, interpretive, thesis, protected source text) so the reviewer knows which mode each passage operates in. A Claim Classification Map lists specific phrases with their type and what the reviewer is allowed to do with each one. A Protected Language table identifies coined terms, thesis statements, and load-bearing rhetorical framings that must never be changed. And a Protected Source Text table identifies verbatim quotations and experimental data that must never be edited for style.
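A brief’s four sections map naturally onto a record like the following. The field names and the `is_protected` helper are assumptions for illustration; the post describes the sections but does not specify a schema.

```python
from dataclasses import dataclass, field

@dataclass
class MethodologyBrief:
    """Companion document that travels with a post through every review stage.
    Field names are illustrative; the real brief's format is not specified."""
    post_id: str
    reviewer_guardrails: dict = field(default_factory=dict)        # passage -> register
    claim_classification_map: dict = field(default_factory=dict)   # phrase -> (type, allowed edit)
    protected_language: list = field(default_factory=list)         # coined terms, thesis framings
    protected_source_text: list = field(default_factory=list)      # verbatim quotes, data exhibits

    def is_protected(self, phrase: str) -> bool:
        """A reviewer checks here before proposing any stylistic change."""
        return phrase in self.protected_language or phrase in self.protected_source_text
```

Under this shape, the “betwixt” revert from post 032 becomes a one-line check: the DeepSeek output lands in `protected_source_text`, and `is_protected("betwixt")` blocks the stylistic edit before it happens.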
I have briefs for 31 of 38 posts (the remaining 7 are short essays where the false positive risk is low enough that a brief adds little value). I tested all 31 against the three specific failure modes that motivated the system, and all 31 passed: each brief would have prevented the false positives that triggered the original reverts. This is a narrow validation (three known failure modes, not a comprehensive test), but it confirms the briefs address the problem they were designed for.
The Triage Decision
Post 037 described the fix process. What it did not describe in detail is the triage layer that decides which findings become fixes, because that layer did not exist during the audit and its absence is what produced the most expensive mistakes.
My triage rule is deliberately simple: fix claims where the current text is materially stronger than the evidence supports, and leave everything else alone. A claim that “roughly 150,000 rows” exist when the current count is 164,720 is conservative, not wrong. A claim that a leaderboard has “80 rows” when the current count is 94 was accurate when written. A claim that costs are “widely estimated” is already hedged. In the most recent correction round, I applied 12 corrections and skipped 12 findings where the existing text was already calibrated appropriately. The skipped findings were as important as the applied ones, because correcting too aggressively produces prose that hedges everything into meaninglessness, and a post that says “approximately X, according to sources that suggest but do not confirm” about every claim has surrendered its authority without gaining its accuracy.
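The rule is simple enough to state as a function. This is a sketch of the decision logic only, with the post’s examples as inputs; the real triage is a human judgment, not this code.

```python
def triage(claimed: float, current: float, hedged: bool = False) -> str:
    """Fix only when the text is materially stronger than the evidence.
    Conservative claims and already-hedged claims are left alone."""
    if hedged:
        return "skip: already hedged"
    if claimed <= current:
        return "skip: conservative or still accurate"
    return "fix: materially stronger than the evidence"
```

So “roughly 150,000 rows” against a current count of 164,720 is a skip, while a claim that overshoots the evidence is the only thing that triggers a fix.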
The hardest triage calls are not about whether a claim is wrong but about whether fixing it would break something else. When I corrected post 036’s claim about physical labor (changing “machines rendered most physical labor obsolete” to “rendered much physical labor in wealthy economies less economically necessary”), the hedge was factually necessary but it created an internal inconsistency with the analogy that followed it, which still used the stronger framing. The fix for one sentence became a MUST FIX for the next paragraph, which is how correction cascades begin. A triage system that evaluates findings in isolation will miss these cascades, and the most dangerous corrections are the ones that are right about the sentence they change and wrong about the paragraph it lives in.
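A crude mechanical aid for cascade detection is to scan the rest of the post for the framing a correction just weakened. This is my sketch of the idea, not a tool the pipeline actually contains, and it only catches literal reuse of a phrase, not paraphrases of it:

```python
def cascade_check(paragraphs: list, corrected_phrase: str) -> list:
    """Indices of paragraphs that still use a phrase a correction just
    weakened elsewhere. Literal matching only; paraphrase escapes it."""
    return [i for i, p in enumerate(paragraphs) if corrected_phrase in p]
```

Anything it flags is a candidate follow-up fix; an empty result proves nothing, which is exactly why the hardest cascades still need a human read of the surrounding paragraph.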
Closing the Loop
The pipeline produces three persistent artifacts, and the persistence is what makes the system more than an editorial process.
First, the .factcheck.json file that travels with every post contains the full verification record: each claim, its status, the source URL if found, and any correction notes. This file is consumed by the glossary generation pipeline, which means that when the glossary defines a technical term, it can pull the source URL directly from the verification results rather than discovering it again through a separate search. Before this integration, I had source URLs on 21% of glossary entries. After the pipeline reads verification results as its primary source, that coverage rose to 77%.
Second, the methodology brief persists as a review companion that accumulates context across rounds. When a correction changes a claim, the brief is updated to reflect the new claim status, which means subsequent review rounds do not revisit settled corrections. This matters because post 037 documented a specific failure mode: Phase 2 reviewers reflagging issues that Phase 1 had already resolved, wasting reviewer attention and creating confusion about what was actually fixed. The brief prevents this by carrying a cumulative record of what was changed and why.
Third, the verification badge in the post header on the interactive tier (app.ashitaorbis.com) makes the verification status visible to readers. A green badge means all claims verified or plausible; a yellow badge means the post contains claims with caveats. The badge includes the claim count and a tooltip with the verification date. It is currently deployed only on the app tier, not the static blog, which limits its visibility, but the underlying data (the .factcheck.json file) is available to any tier that wants to surface it. The badge’s function is not to claim the post is trustworthy but to make the verification process legible: here is how many claims we checked, here is when we checked them, and here is the result.
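The three artifacts share one data source, which can be sketched as follows. The post names the record’s contents (claim, status, source URL, correction notes) and two statuses (“verified,” “plausible”), but the JSON keys and the rest of the status vocabulary here are assumptions.

```python
import json

def load_records(path: str) -> list:
    """Read a post's .factcheck.json: one record per claim. Keys assumed."""
    with open(path) as f:
        return json.load(f)

def source_urls(records: list) -> dict:
    """What the glossary pipeline consumes: claim -> source URL, so a
    definition can cite verification results instead of re-searching."""
    return {r["claim"]: r["source_url"] for r in records if r.get("source_url")}

def badge(records: list) -> str:
    """Green when every claim is verified or plausible; yellow when any
    claim carries a caveat."""
    clean = {"verified", "plausible"}
    return "green" if all(r["status"] in clean for r in records) else "yellow"
```

The design choice worth noting is that both consumers read, neither writes: the verification record stays the single source of truth, and the glossary and badge are derived views of it.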
What This Does Not Solve
My pipeline catches factual errors. It does not catch bad arguments, and bad arguments are harder to find because they do not have a ground truth to check against. A post that builds a sophisticated case from accurate premises and reaches a wrong conclusion will pass every stage of this pipeline, because every individual claim verifies and the logical structure is the model’s own reasoning, which the model will not flag as problematic.
The pipeline also cannot verify claims about my own systems, because those claims have no external source. When I write that “the optimizer improved 11 of 13 targets,” the only verification is my own project artifacts, and the checking model cannot access those independently. Eight percent of claims in the corpus fell into this unverifiable category.
Post 009 describes the deepest limitation: an AI checking claims that an AI helped generate shares training distributions and architectural priors with the generating AI, and shared blind spots do not cancel. I mitigate this with a panel of models from different families and different training lineages, but “mitigates” is not “solves,” and the circularity is structural. Human ground truth annotations for a subset of posts would break the circle, and they are planned but not yet completed.
The pipeline works, in the precise sense that the pragmatist case in post 009 would recognize: applied retroactively, it found errors I missed on manual review, and applied prospectively, it gives each new post a verification pass before publication that I would not otherwise perform. Whether “reduces errors to a level I find acceptable” is a statement about the pipeline’s quality or about my tolerance for error is a question that every evaluation system eventually asks and none can answer from the inside.