The Revision Tax

AI Summary (Claude Opus)

TL;DR: AI-assisted research can compress investigation time dramatically, but the iterative review process required to bring fluent-but-error-laden first drafts to publication quality may cost twice as much as the investigation itself, inverting the traditional cost structure of research.

Key Points

  • Five rounds of review across three AI models (GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6) produced 83 corrections ranging from factual errors to failures of intellectual honesty, with each round exposing deeper error layers that previous corrections had been obscuring.
  • Different models caught different error categories—arithmetic and sourcing errors, structural incoherence, and adversarial reasoning gaps respectively—suggesting that multi-model review catches problems any single reviewer would miss.
  • The quality threshold for publishable research remains unchanged regardless of how cheaply the investigation was produced, meaning cheaper generation creates more confident first drafts that each require the same expensive verification process.

The post documents the author's experience reviewing an AI-assisted investigation that took one afternoon to produce but required five rounds of iterative review across three AI models over several days to prepare for publication. The first draft contained factual errors, logical contradictions, and unsubstantiated intent attributions, all camouflaged by the fluency of AI-generated prose. Each review round cleared a distinct layer of problems—surface factual errors, then imprecision, then omissions, then failures of intellectual honesty—producing 83 total correction items. The author argues that while AI has lowered the cost of investigation below the threshold of 'not worth proving,' it has not lowered the quality threshold for publication, resulting in an inverted cost structure where generation is cheap and verification is the dominant expense. The post concludes that this 'revision tax' may generalize as a structural feature of AI-assisted research, not an artifact of one case.

The Revision Tax

The investigation that produced my previous post took an afternoon. When I say “I” in what follows, the pronoun is generous: Claude generated most of the prose and analysis, and I directed the investigation, made editorial decisions, and reviewed the output. The errors described in this post were in Claude’s drafts that I read as my own and failed to question, which is the fluency problem in miniature. Together we wrote Python scripts, classified 8,000 model responses, computed reliability statistics, built similarity matrices: all of it was finished before dinner. That speed turned out to be the trap. Getting the results ready for publication took five rounds of iterative review across three AI models, produced 83 individual items (ranging from factual corrections to disclosure additions to hedging calibrations), and consumed what I estimate was twice the investigation’s four hours, spread across multiple sessions over several days. The revisions were not polish. They changed what my documents argued, corrected claims I would have published as fact, and caught contradictions I had read past multiple times without noticing. The opening sentence of my first draft stated that the benchmark “ranks Claude first on every question.” That was factually wrong.

I did not catch it. The two AI reviewers active in that round did, and only because I asked them to look.

The First Draft Was Wrong

Not rough, not unpolished, not in need of tightening. Wrong. My first draft contained factual claims that were false, logical structures that contradicted themselves, and framing that imputed intent to the benchmark’s creator where none was evidenced. The problem was not that the draft read badly but that it read well. AI prose is fluent whether or not it is accurate, and fluency creates a confidence gradient that makes errors harder to see, because the sentences flow with the same authority regardless of whether the claim they carry is true or fabricated. Skilled human writers can produce the same camouflage (academic fraud and journalistic fabrication demonstrate this), but AI generates fluent prose at a volume and speed that makes the problem quantitatively different: a human writer producing one polished draft per week encounters the fluency trap occasionally, while AI can produce dozens of confident first drafts in a day, each requiring the same scrutiny. The reader, including the writer reviewing their own output, receives no stylistic signal that something is wrong.

My opening sentence claimed the benchmark ranked Claude first on every question. It ranked Claude first on average, across a leaderboard, which is a different claim with different implications. I used the word “encoded” to describe how the rubric’s scoring criteria aligned with one lab’s training philosophy, which implies the benchmark’s creator deliberately designed the rubric to favor Claude, an accusation of intent that the evidence does not support and that transforms a structural observation into a personal attack. Two sentences in my methodology section were duplicated verbatim between the blog post and its companion document, not because I copied them intentionally but because the generation process produced the same phrasing twice and I did not notice. Each of these errors was surrounded by correct analysis, well structured and persuasive, that made the errors feel like part of a coherent argument. They were not coherent. They were wrong, and the fluency of the surrounding prose was camouflage.

What Each Round Found

The five rounds of review did not find the same kind of problem five times. Each round cleared a layer of errors, which exposed a deeper layer that the previous round’s errors had been obscuring. The progression followed a pattern that I suspect generalizes beyond this particular investigation, though I have only one case study to draw from: surface errors hide structural errors, which hide logical errors, which hide philosophical errors.

Rounds one and two used GPT-5.4 and Gemini 3.1 Pro reviewing in parallel. I collapse them here because they targeted the same error class, and separating them would overstate the granularity of my record-keeping. Together they caught the factual mistakes and the absolute language. “Ranks Claude first on every question” became a qualified claim about average scores. “The Claude judge is not biased” became “shows no evidence of favorable self-preference.” The word “encoded” became “structurally rewards.” These are the kinds of corrections that external eyes catch immediately because they require only reading the sentences, not understanding the full argument. GPT-5.4 found the numerical issues. Gemini found the problems at the sentence level and the redundancy between documents. Between them they produced 19 items, six of which were factual errors that would have undermined the investigation’s credibility if published.

Round three shifted from “this claim is wrong” to “this claim is imprecise.” The percentage I reported as approximately 6% turned out to depend on definition: 194 of 3,242 responses (5.98%) showed strong detection signals, but broadening the criteria to include moderate signals raised the share to roughly 7.7%. The correction was not arithmetic but scope, and the distinction mattered because the narrower and broader definitions supported different conclusions about how many models detect nonsense while engaging anyway. My opening reduced a three-part framing from the benchmark’s documentation to a single component, making the benchmark sound narrower than its creator intended. A section I labeled “Invisible Quadrant” was renamed to “Unscored Quadrant” because “invisible” implies the benchmark’s creator could not see the problem, while “unscored” describes the rubric’s actual behavior without attributing blindness. The distinction between describing what a system does and speculating about what its creator intended ran through half of the 22 items in this round, and catching it required reading the whole document as an argument rather than checking individual sentences for accuracy.
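
To make the scope point concrete, here is a minimal sketch of how the detection rate moves with the counting criteria. The label names and the moderate-signal count are illustrative stand-ins chosen only to reproduce the published figures (194 of 3,242 strong signals, roughly 7.7% once moderate signals are included); they are not the investigation’s actual data.

```python
# Hypothetical sketch: the "detection rate" is a function of which signal
# strengths you count, not a single number.

def detection_rate(labels: list[str], counted: set[str]) -> float:
    """Fraction of responses whose classification falls in the counted set."""
    return sum(label in counted for label in labels) / len(labels)

# Illustrative stand-in for the 3,242 per-response classifications; only the
# 194 strong-signal count comes from the post, the rest is invented to total 3,242.
labels = ["strong"] * 194 + ["moderate"] * 56 + ["none"] * 2992

strict = detection_rate(labels, {"strong"})
broad = detection_rate(labels, {"strong", "moderate"})
print(f"strict criteria: {strict:.2%}")  # 5.98% -> "approximately 6%"
print(f"broad criteria:  {broad:.2%}")   # 7.71% -> "roughly 7.7%"
```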

Round four found the omissions: citations that should have been included but were not, arguments that worked only because they excluded inconvenient evidence, hedges that I had softened past the point of making a claim at all. One of my paragraphs argued that the benchmark created a “false binary” between detection and refusal, but the benchmark’s rubric actually acknowledged a spectrum (Green, Amber, Red), which meant the false binary was in my analysis, not in the benchmark. Catching that error required understanding what the argument needed to be valid and noticing that a premise was missing. The 18 items in round four were qualitatively different from earlier rounds: not facts that were wrong but arguments that were incomplete.

Round five added Opus 4.6 as a third reviewer reading adversarially, and found the problems that survive four rounds of correction: contradictions spanning multiple paragraphs, evidence reported selectively, and positions that were not steelmanned. I had reported only Krippendorff’s alpha for nominal agreement (0.664) and omitted the ordinal alpha (0.796), which is arguably the more appropriate measure for an ordinal scale and paints a less stark picture. The phrase “barely clear” appeared in the same sentence as “just below,” a contradiction that I accepted as valid at the time but that, in fairness, could have been a reviewer misreading two different uses of threshold language rather than a genuine logical error. I had used tools built on the model being critiqued but did not flag this as a methodological dependency that readers should weigh when evaluating my conclusions. These 24 items were not errors in the conventional sense; they were failures of intellectual honesty that I would not have found on my own because I agreed with the conclusions and was not motivated to challenge them. Some of the 83 total items across all rounds may have been false positives that I implemented without sufficient pushback, which is its own form of the deference problem the post describes.
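
The nominal-versus-ordinal point is straightforward to reproduce. The sketch below uses the third-party krippendorff Python package, which is an assumption about tooling rather than a record of what the investigation actually ran, and the ratings are invented placeholders; the post’s actual values were 0.664 (nominal) and 0.796 (ordinal).

```python
# Hedged sketch of computing both agreement figures with the `krippendorff`
# package (pip install krippendorff). The ratings are invented placeholders,
# not the investigation's data.
import numpy as np
import krippendorff

# Rows are raters, columns are the units they scored; np.nan marks a missing rating.
ratings = np.array([
    [1, 2, 3, 3, 2, 1, np.nan],
    [1, 2, 3, 2, 2, 1, 3],
    [1, 3, 3, 3, 2, 2, 3],
], dtype=float)

nominal = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
ordinal = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")

# On an ordered scale, adjacent disagreements (a 2 versus a 3) count as partial
# rather than total disagreement, which is why the ordinal figure is usually
# higher and, for an ordinal rubric, arguably the more appropriate one to report.
print(f"nominal alpha: {nominal:.3f}")
print(f"ordinal alpha: {ordinal:.3f}")
```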

Why Different Models Found Different Problems

The explanation for why different models caught different error types is probably simpler than “specialization” implies: each model has different training incentives, which produce different reading priorities. GPT-5.4 caught the arithmetic errors, the missing sources, and the claims that did not follow from their evidence, likely because its training emphasizes precision and factual grounding. Gemini 3.1 Pro found the duplicated sentences, the ambiguous antecedents, and the places where my blog post and its companion document disagreed about the same data point, likely because its training emphasizes structural coherence. Opus 4.6, present only in round five, caught the buried ordinal alpha, my undisclosed methodological dependency, and the places where my analysis proved too much by failing to acknowledge legitimate counterarguments, likely because its training emphasizes adversarial reasoning. I cannot separate genuine specialization from the artifact of sequential cleaning (later reviewers find what earlier reviewers missed because the easy problems are already gone), and with Opus contributing only one round, I have no basis for claiming its pattern is consistent. What I can say is that the three models together caught categories of error that any one of them alone would likely have missed, and that observation, even without a clean specialization story, is what matters for the practical question of whether to use multiple reviewers.

The Cost Equation

I spent an afternoon on the investigation: roughly four hours of active work writing scripts, running analyses, and interpreting results. I did not time the revision with the same precision, but my best estimate is eight to ten hours across multiple sessions: five rounds of review, 83 items across two documents, implementation for each fix, and verification after each round. The revision consumed at least twice the investigation time. This is not a story about efficiency gains. It is a story about where the bottleneck moved.

My previous post argued that Claude Code makes investigation cheap by reducing the cost of testing a suspicion below the threshold of “not worth proving.” That claim is true, and the implication is that more investigations will happen, which is good. The uncomfortable corollary, at least in this case, is that the investigation produced a first draft that was confidently wrong in ways that required multiple rounds of review across several models to identify and correct. Making investigation cheap does not make the results ready for publication. It makes more first drafts that sound authoritative and contain errors camouflaged by fluency. If the revision tax for this investigation was five rounds of iterative review across three models, and if that pattern holds for other investigations (which I suspect but cannot demonstrate from one case), then the revision process may be the dominant cost of research done with AI, not the research itself.

The ratio matters. If investigation costs one unit and revision costs two, the total cost dropped (because the old total was “not attempted”). But the cost structure inverted: generation is now the cheap part, and verification is the expensive part. Anyone who has worked with code generated by AI recognizes a version of this pattern. Generation is fast; debugging is slow. The analogy is imperfect because code has test suites that mechanically verify correctness, while research review relies on human judgment about what constitutes a valid argument, a fair characterization, or a sufficient hedge. The errors I describe are not compiler errors; they are editorial and epistemological judgments that require a reader to evaluate whether the argument earns its conclusions. But the directional claim holds: the systematic patterns in AI output (overclaiming, intent imputation, selective evidence presentation) recur across drafts because they reflect the generation process’s optimization target rather than the publication’s quality requirements.

The Pipeline

I have codified the review process into a reusable automated pipeline (in Claude Code’s terminology, a “skill”) that sends documents to all three models in parallel, consolidates their findings into a table organized by priority, implements the fixes, and repeats until convergence. The pipeline reduces the coordination cost of dispatching to multiple reviewers and collecting their results, but it does not reduce the implementation cost: each fix still requires a human to evaluate whether the reviewer’s finding is valid and to decide how the text should change. I mention the pipeline because the post you are reading was itself reviewed using it, which means the recursion is complete and I will not belabor the irony further.
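
For readers who want the shape of it, here is a minimal sketch of that loop under stated assumptions: send_for_review is a placeholder for whatever actually dispatches a document to a reviewer model, and apply_fix stands in for the human step. It shows the structure, not the skill’s implementation.

```python
# Minimal sketch of the review loop's structure, not the actual skill.
# `send_for_review` is a placeholder for the call that sends a document to a
# reviewer model; `apply_fix` is the human step of judging and implementing a finding.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

REVIEWERS = ["gpt-5.4", "gemini-3.1-pro", "claude-opus-4.6"]

def send_for_review(model: str, document: str) -> list[dict]:
    """Placeholder: return findings such as {"priority": 1, "issue": ..., "suggestion": ...}."""
    raise NotImplementedError("stand-in for the real reviewer call")

def review_round(document: str) -> list[dict]:
    # Dispatch to all reviewers in parallel, then merge their findings by priority.
    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        per_model = pool.map(lambda m: send_for_review(m, document), REVIEWERS)
    findings = [f for model_findings in per_model for f in model_findings]
    return sorted(findings, key=lambda f: f["priority"])

def revise_until_convergence(document: str,
                             apply_fix: Callable[[str, dict], str],
                             max_rounds: int = 5) -> str:
    # The coordination is cheap; the expensive part is apply_fix, which still
    # requires a human to decide whether each finding is valid and how the text changes.
    for _ in range(max_rounds):
        findings = review_round(document)
        if not findings:
            break  # convergence: no reviewer returned anything
        for finding in findings:
            document = apply_fix(document, finding)
    return document
```

The loop treats “no findings returned” as convergence, which in practice the human still verifies; the parallel dispatch and the priority-sorted table are the only parts the automation genuinely makes cheaper.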

The Threshold That Did Not Move

My previous post argued that the investigation threshold moved: suspicions that were previously too expensive to test are now cheap enough to pursue, and more of them get tested, and many of them turn out to be wrong in interesting ways. That threshold moved because generation became cheap. This post argues that the quality threshold did not move: the standard for research that is ready for publication is the same whether the investigation took a week or an afternoon, and meeting that standard still requires iterative adversarial review that catches errors I cannot see because I agree with the conclusions. Cheap generation plus expensive revision means more investigations reaching the same quality bottleneck. The bottleneck is not effort; it is the gap between what AI prose sounds like and what it actually says, and that gap does not shrink because the generation got faster. If anything, it widens, because the faster I can produce a confident first draft, the more confident first drafts I have to distrust.


Disclosure: this post was written and reviewed using Claude Code, the same AI tooling whose revision requirements it describes. The review pipeline used GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 as reviewers. The investigation post and its methodology critique are published in full.
