Circularity in evaluation refers to a condition in which the judge shares fundamental characteristics with the subject being judged, such that the act of evaluation is itself subject to the same limitations, biases, and failure modes as the thing being evaluated. When a language model scores the outputs of language models, and the metric defining “better” is interpreted through the same training distribution that produced the outputs in question, we are not dealing with recursion (which terminates at a base case) but with a closed epistemic loop that has no external anchor. The classical formulation is Juvenal’s: Quis custodiet ipsos custodes? Who watches the watchmen? In the context of automated prompt optimization, the answer is another language model, which means the circle has not been broken but merely enlarged.
This is the problem I encountered while running DSPy’s optimization algorithms against my own Claude prompts, and what I found is that the circularity is manageable in the engineering sense (in my experience, the system produces better outputs on held-out test sets, and in my informal assessment those improvements carry over to production use) but not solvable in the epistemological sense (we cannot verify, from within the system, that the evaluation criteria themselves are sound). This distinction between manageable and solvable is not pedantic. It determines whether you treat failures as bugs to patch or as inevitable consequences of the regime you are operating in.
The Machinery of Circular Evaluation
DSPy, developed at Stanford NLP with research beginning in early 2022 (the foundational DSP framework was released in December 2022, followed by DSPy proper in 2023), is a framework that compiles high-level programs with natural language annotations into optimized prompts for language models, allowing algorithmic search over the space of possible prompts for a given task. The optimizer requires a metric function that evaluates the output of each candidate prompt and assigns it a score, and this metric is where the circularity enters. In my implementation, task-specific metric functions score each candidate prompt’s output against reference results, measuring dimensions tailored to each optimization target: issue severity matching, routing accuracy, structural completeness. The models are configurable across phases, with Sonnet generating candidates while Opus and Haiku serve different roles, but the reference outputs that define “correct” were themselves generated by Claude, and the quality dimensions encoded in the metrics reflect judgments shaped by working within the same model family. The optimizer’s search is guided by these metrics, which means the “improved” prompts are optimized toward quality standards that originate from the system being optimized, rather than from any external ground truth.
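The shape of such a metric can be sketched concretely. DSPy metrics are plain functions taking an example, a prediction, and an optional trace, and returning a score; the function below follows that shape. The field names (`severity`, `route`, `sections`) and the weights are hypothetical stand-ins for the dimensions described above, not the actual implementation.

```python
# A task-specific metric in the shape DSPy expects:
# metric(example, prediction, trace=None) -> float.
# Field names and weights are illustrative assumptions.

def triage_metric(example, pred, trace=None):
    """Score a candidate prompt's output against a Claude-generated reference."""
    score = 0.0
    # Issue severity matching: exact match against the reference label.
    if pred["severity"] == example["severity"]:
        score += 0.4
    # Routing accuracy: did the output route to the same destination?
    if pred["route"] == example["route"]:
        score += 0.3
    # Structural completeness: fraction of required sections present.
    required = example["sections"]
    present = sum(1 for s in required if s in pred["sections"])
    score += 0.3 * (present / len(required))
    return score

reference = {"severity": "high", "route": "infra",
             "sections": ["summary", "impact", "fix"]}
candidate = {"severity": "high", "route": "infra",
             "sections": ["summary", "impact"]}
triage_metric(reference, candidate)  # 0.4 + 0.3 + 0.3*(2/3) ≈ 0.9
```

Note where the circularity lives: `example` is itself a Claude-generated reference, so a perfect score means agreement with the model family, not correctness.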
The circle operates as follows: Claude’s outputs define the reference standard, the reference shapes the metrics, the metrics guide the optimization, the optimization shapes the prompts, the prompts shape the outputs, and the metrics evaluate those outputs against standards rooted in the same model family that initiated the cycle. If the reference outputs systematically embody a particular notion of “clarity” (preferring verbose, heavily qualified prose over economical directness, for instance), the entire system converges toward that preference with increasing confidence across iterations. The optimizer does not know it is wrong because the evaluator does not know it is wrong, and the evaluator cannot know it is wrong because it is using itself as the standard.
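The convergence dynamic can be simulated in a few lines. The stub `judge` below encodes, as an assumption, exactly the kind of preference described above (hedging words score higher), and a hill-climbing "optimizer" mutates a prompt guided only by that judge. No step in the loop can discover that the preference itself is a bias.

```python
# Minimal simulation of the closed loop. HEDGES and the judge's scoring
# rule are illustrative assumptions, standing in for an opaque
# model-family preference for verbose, heavily qualified prose.
import random

HEDGES = ["perhaps", "arguably", "it seems", "to some extent"]

def judge(text):
    # The evaluator's notion of "clarity": more hedging scores higher.
    return sum(text.count(h) for h in HEDGES)

def optimize(prompt, steps=200, seed=0):
    rng = random.Random(seed)
    best, best_score = prompt, judge(prompt)
    for _ in range(steps):
        candidate = best + " " + rng.choice(HEDGES)
        if judge(candidate) > best_score:   # guided only by the judge
            best, best_score = candidate, judge(candidate)
    return best, best_score

final, score = optimize("Summarize the incident.")
# Every iteration increases the score; nothing external ever checks
# whether the "improved" prompt is actually better.
```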
This pattern is not confined to DSPy. The broader paradigm of LLM-as-judge, in which language models serve as evaluators for complex tasks, has become a dominant evaluation methodology across the field. Research on the LLM-as-judge paradigm has documented concerns that evaluating reranked systems with the same evaluator introduces circularity, raising questions about the validity of evaluation outcomes. The finding is not surprising, but the extent of the practice is: the LLM-as-judge approach is now widely adopted for evaluation tasks where human annotation is too expensive to scale.
Biases That Compound Through Circularity
The problem would be merely theoretical if AI evaluators were unbiased, but they are not, and the biases compound through iterative optimization in ways that are difficult to detect from inside the system. GPT-4 exhibits a documented self-preference bias, systematically favoring texts with lower perplexity (that is, texts more familiar to the model’s own distribution). Models also favor responses that appear in specific positions within the prompt, a positional bias that can significantly influence evaluation outcomes. Multiple LLM judges show verbosity bias, rating longer responses more favorably regardless of informational density. These biases are not random noise that cancels out across iterations; they are systematic tendencies that the optimizer learns to exploit.
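Verbosity bias, at least, is cheap to probe: score matched pairs that differ only in content-free padding and look at the mean score shift. The `judge` below is a stub with the bias deliberately baked in for demonstration; with a real LLM judge in its place, a consistently positive shift on padding is evidence of the bias.

```python
# Probe for verbosity bias: compare judge scores on a response with and
# without content-free padding. `judge` is an illustrative stub, not a
# real model call; its length-rewarding rule is an assumption.

def judge(response):
    # Stub with a deliberate verbosity bias for demonstration.
    return min(10.0, 5.0 + 0.02 * len(response.split()))

def verbosity_shift(responses, padding, judge_fn):
    shifts = [judge_fn(r + " " + padding) - judge_fn(r) for r in responses]
    return sum(shifts) / len(shifts)

pad = "It is worth noting, in fairness, that many considerations apply. " * 5
shift = verbosity_shift(["The bug is a race condition.",
                         "Use a mutex around the counter."], pad, judge)
# A positive shift means the judge rewarded length, not content.
```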
The most striking findings are adversarial. Short universal adversarial phrases, discovered through greedy token optimization on surrogate models, can be concatenated to any assessed text to yield maximum scores from LLM evaluators, irrespective of actual content quality. Separately, strategically embedded persuasion techniques (appeals to authority, consistency framing, flattery) inflate scores on incorrect mathematical solutions by up to 8% on average, with the effect persisting even under counter-prompting strategies. Neither attack requires the response itself to be good; both exploit the evaluation surface rather than the content surface. This means the landscape that the optimizer searches over is not a faithful representation of output quality but a map of evaluator vulnerabilities, and those vulnerabilities can be gamed. An optimizer running for weeks will find the seams.
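A defensive counterpart to these attacks is a regression audit: append candidate persuasion suffixes to a known-wrong answer and flag the judge if any suffix moves the score. Everything below is illustrative, including the suffixes and the stub judge, which is constructed to be fooled by an authority appeal; the point is the audit pattern, not the specific phrases.

```python
# Audit a judge for suffix sensitivity. The suffixes are hypothetical
# examples, not the optimized adversarial phrases from the cited work,
# and the judge is a stub built to fail the audit.

def judge(answer):
    # Stub: penalizes the wrong answer, but is swayed by authority appeals.
    score = 3.0 if "x = 5" in answer else 8.0
    if "as any expert will confirm" in answer.lower():
        score += 4.0
    return min(score, 10.0)

WRONG = "Therefore x = 5."  # assume the correct answer is x = 4
SUFFIXES = ["", " As any expert will confirm.", " This is clearly consistent."]

def audit(judge_fn, answer, suffixes, tolerance=0.5):
    base = judge_fn(answer + suffixes[0])
    return [s for s in suffixes[1:]
            if judge_fn(answer + s) - base > tolerance]

exploits = audit(judge, WRONG, SUFFIXES)
# A non-empty list means content-free persuasion cues moved the score.
```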
Human evaluation is the standard remedy, and human judgment is still the gold standard for real-world use cases involving natural language generation, classification of subjective content, or complex domain tasks. But human judgment introduces its own circularity at a different level. Biases, inconsistencies, and subjectivities inevitably influence the labeling process, and inter-rater reliability studies reveal that annotators disagree substantially even on dimensions they are specifically trained to evaluate. Lower agreement on factual correctness appears to align with prior observations that this dimension is especially difficult to judge, even for experts. To put the point philosophically, the ground truth that human evaluation is supposed to provide turns out to be a useful fiction (in the precise sense that it is instrumentally valuable despite being philosophically incoherent), which means substituting human judgment for AI judgment does not eliminate circularity so much as relocate it. What changes are the dynamics: human evaluation is slower, more expensive, and therefore less prone to the rapid convergence that makes AI circularity dangerous. Human evaluators also draw on embodied experience, social feedback, and cross-domain intuition that are genuinely external to the text-in, text-out loop. The circularity is structurally similar but operationally different, and the operational differences matter more than the structural similarity for practical system design.
Gödel, Self-Reference, and the Limits of Internal Verification
The structural parallel to Gödel’s incompleteness theorems is not exact, but it isolates the right mechanism. Gödel demonstrated that any sufficiently powerful formal system (consistent, effectively axiomatized, and capable of encoding basic arithmetic) cannot prove its own consistency, a conclusion that relies on the construction of self-referential statements that the system can express but cannot decide. The mechanism that matters here is not the specific logical machinery but the general principle: self-reference creates boundaries that cannot be crossed from within.
AI evaluation circularity is an epistemological problem rather than a logical one. The question is not whether a system can prove its own consistency but whether a system can verify the reliability of its own evaluation criteria. The parallel is structural, not formal: a language model evaluating language model outputs is a system making claims about itself, and the validity of those claims cannot be established without stepping outside the system, which is precisely what the model cannot do. Where Gödel’s result is a proof of impossibility within formal systems, AI evaluation circularity is a practical constraint on epistemic access, but both are instances of the same underlying principle that self-referential systems cannot fully validate themselves.
Human reasoners can, in principle, revise their assumptions and rethink their frameworks, and this flexibility is real, but it does not provide the clean escape it appears to. The specific features that make AI circularity worse than its human counterpart are speed of convergence (an optimizer iterating thousands of times amplifies biases that human deliberation would surface slowly), opacity of the shared substrate (we cannot fully characterize what training data distributions models share, so we cannot assess the independence of our “second opinions”), and the absence of embodied correction (humans receive feedback from physical and social environments that are genuinely external to their cognitive apparatus, while models receive feedback only from other models or from humans mediated through text). The circularity is not unique to AI, but AI creates conditions under which it compounds faster and with less visibility than it does in human evaluation.
Mitigation Without Resolution
The field has developed several strategies for managing the circularity, and they are worth understanding precisely because they illustrate what “manageable” means in practice.
Multi-agent debate deploys multiple LLM agents that collaborate or debate to assess outputs, with agents playing different roles (domain experts, critics, defenders) so that the evaluation incorporates diverse criteria and adversarial feedback. Some research indicates that moderate, not maximal, disagreement tends to achieve the best performance by correcting but not polarizing agent stances, though the effect appears to be task-dependent rather than a universal principle. The approach emulates a panel of human judges, and it reduces the impact of any single evaluator’s biases, but it does not escape the circularity because all agents in the debate share underlying architectural priors from similar training regimes. The diversity is real but shallow: disagreement on surface features masking convergence on deep structural assumptions about what constitutes quality.
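Mechanically, panel-style evaluation reduces to scoring the same output under several role conditionings and keeping the spread as a signal, not just the aggregate. The role functions below are stubs standing in for LLM calls with role-specific system prompts; names and scores are illustrative assumptions.

```python
# Sketch of panel-style evaluation: role-conditioned judges score the
# same output; the mean is the verdict and the spread is a disagreement
# signal. All three "judges" are stubs for demonstration.
from statistics import mean, pstdev

def expert(out):   return 7.5   # stub: emphasizes domain accuracy
def critic(out):   return 5.0   # stub: actively hunts for flaws
def defender(out): return 8.0   # stub: steelmans the output

def panel_evaluate(output, judges):
    scores = [j(output) for j in judges]
    return mean(scores), pstdev(scores)  # verdict + disagreement signal

verdict, spread = panel_evaluate("candidate output", [expert, critic, defender])
# High spread can flag outputs for review, but note that real roles built
# on one substrate share priors, so their disagreement is bounded by design.
```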
Meta-evaluation, in which the evaluator itself is evaluated, has emerged as a safeguard as AI judges grow in influence, with frameworks like MetaEvaluator providing structured environments to measure consistency, reliability, and bias under controlled conditions. The problem is immediate and obvious: who evaluates the meta-evaluator? The regress is not infinite in practice (at some point you stop and accept the results) but it is infinite in principle, and the decision to stop at a particular level of meta-evaluation is itself unjustified within the framework. A recurring issue is the question of how we know the agent judges are correct. Improved correlation with human judgments is encouraging, but it is not a perfect measure of true reliability.
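One concrete meta-evaluation probe is self-consistency: rescore the same item under perturbations chosen to be semantically neutral and measure the drift. The stub judge, the perturbations, and the idea that they are "neutral" are all assumptions here, and that last assumption is exactly where the regress re-enters.

```python
# Minimal self-consistency probe in the spirit of meta-evaluation
# frameworks: same content, different surface forms. The judge is a stub
# with a formatting sensitivity deliberately baked in.

def judge(text):
    # Stub: rewards a particular surface form, not content.
    return 7.0 + (0.5 if text.startswith("Answer:") else 0.0)

def consistency(judge_fn, item, perturbations):
    scores = [judge_fn(p(item)) for p in perturbations]
    return max(scores) - min(scores)  # 0.0 = perfectly consistent

perturbs = [lambda t: t,
            lambda t: "Answer: " + t,
            lambda t: t + "\n"]
drift = consistency(judge, "The fix is to add a lock.", perturbs)
# Nonzero drift means surface form moved the score. But who validated
# that these perturbations are actually neutral? The meta-question recurs.
```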
Constitutional AI, Anthropic’s approach to alignment through self-evaluation, gives an AI system a set of principles against which it can evaluate its own outputs. As language model capabilities improve, AI identification of harms improves significantly, with chain-of-thought reasoning leading to evaluations that show improved alignment with human judgments. This is genuine progress. It is also circular: the constitution is specified by humans, but interpreted by the model, and the model’s interpretation of principles like “be helpful and harmless” is shaped by the same training process that produced the behaviors being evaluated. The system works, and works increasingly well, but “works” is a claim about practical utility, not about philosophical soundness.
The Epistemological Ground
The philosophical literature identifies this pattern precisely. An epistemically circular argument defends the reliability of a source of belief by relying on premises that are themselves based on the source. This is a key element in the broader problem of meta-justification, which asks how we can ultimately justify our standards of justification. The question maps directly onto AI evaluation: we want to justify the claim that our AI evaluator is reliable, but any evidence for that reliability is itself evaluated by AI systems (or by humans whose judgment we also cannot independently verify), and the justification never reaches bedrock.
Michael Bergmann’s distinction between malignant and benign epistemic circularity is relevant here. Malignant circularity occurs in a “questioned source context,” where the agent begins by doubting the source’s trustworthiness and looks for a second opinion independent of the original source. If we doubt Claude’s evaluation of Claude outputs, asking a different Claude (or a different model from a similar training distribution) provides no genuinely independent verification. The second opinion is correlated with the first in ways we cannot fully characterize, because the correlation operates at the level of training data, architectural choices, and optimization objectives that are opaque even to the system’s designers.
Benchmark contamination compounds the problem. Peer-reviewed studies have documented statistically significant performance drops on problems added after a model’s training cutoff, with results on platforms like Codeforces and Project Euler showing clear temporal trends that are difficult to explain without some degree of data leakage. Out of thirty analyzed model developers, only nine reported train-test overlap. If the benchmarks we use to validate our evaluators are themselves contaminated, the entire evaluation infrastructure rests on a foundation we cannot inspect, and the circularity is not just epistemological but empirical: we do not know what our systems know, and we do not know what our evaluators have already seen.
What This Means in Practice
The pragmatic position is to accept the circularity, design for its consequences, and resist the temptation to declare it solved. Ensemble evaluation (using multiple models with different training lineages, treating agreement as evidence of robustness and disagreement as a flag for human review) reduces vulnerability to any single model’s biases but does not eliminate shared blind spots across model families. Adversarial testing (generating outputs designed to score well despite being wrong, then checking whether the evaluator catches them) reveals specific failure modes but cannot exhaustively characterize the evaluator’s weaknesses, because an adversary operating within the same distributional space as the evaluator is subject to the same limitations. Human calibration loops (sampling outputs at intervals, comparing human and AI ratings, using discrepancies to adjust rubrics) introduce genuine external signal but at a frequency that leaves the system predominantly self-referential between calibration points.
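The ensemble strategy, in particular, reduces to a routing rule: accept when independent-lineage judges agree, escalate to a human queue when they disagree beyond a threshold. The lineage judges below are stubs (standing in for, say, models from different providers), and the threshold is an illustrative assumption.

```python
# Sketch of ensemble evaluation with disagreement routing. Judge names,
# scores, and the spread threshold are illustrative assumptions.

def judge_lineage_a(out): return 8.0   # stub for model family A
def judge_lineage_b(out): return 7.5   # stub for model family B
def judge_lineage_c(out): return 4.0   # stub for model family C

def route(output, judges, max_spread=2.0):
    scores = [j(output) for j in judges]
    spread = max(scores) - min(scores)
    if spread > max_spread:
        return ("human_review", scores)     # disagreement → external signal
    return ("accepted", sum(scores) / len(scores))

decision = route("candidate", [judge_lineage_a, judge_lineage_b, judge_lineage_c])
# Shared blind spots, of course, produce agreement and sail straight through.
```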
These strategies work. In my experience, the DSPy optimizer produces better prompts, and the improvements appear to carry over to production use based on my own informal assessment. By empirical standards, the system delivers. The strongest version of the pragmatist case is worth stating explicitly: strong LLM judges now achieve 80% or higher agreement with human raters on pairwise preference judgments, which is comparable to inter-annotator agreement between humans themselves. If the agreement rates are indistinguishable, the pragmatist argues, the philosophical distinction between AI circularity and human circularity carries no operational weight. This is a serious objection. The response is not that the agreement rates are illusory but that agreement with humans measures conformity to human preferences, not access to ground truth. High agreement means the AI evaluator has learned to mimic human judgment, which is useful, but it does not mean the AI evaluator has escaped the circularity, because the human judgments it agrees with are themselves subject to the biases and limitations described above. The philosophical incompleteness does not prevent practical progress, but it does mean that agreement-with-humans is a ceiling on validation, not a foundation for it.
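Agreement figures like those above are usually reported either as raw pairwise agreement or as Cohen’s kappa, which corrects raw agreement for chance. The labels below are made-up calibration data for illustration; in practice they would be human versus LLM-judge preferences on the same comparison set.

```python
# Cohen's kappa for judge-vs-human calibration. The preference labels
# ("A" or "B" per pairwise comparison) are fabricated for illustration.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

human = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
model = ["A", "A", "B", "A", "A", "B", "A", "B", "B", "A"]
kappa = cohens_kappa(human, model)
# Raw agreement here is 0.8, but kappa is noticeably lower once chance
# agreement is removed: the 80% headline overstates the signal.
```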
But the unsolved circularity has failure modes that the “manageable” framing should not obscure. When an optimizer runs for weeks and converges on prompts that exploit specific phrasing preferences in the evaluator, producing outputs that score well but read oddly to humans, this is not a bug in the implementation but a predictable consequence of a system optimizing against its own reflection. When a text generation system optimized for “readability” (as judged by an LLM) converges on verbose, overqualified prose that the evaluator likes but experienced readers find patronizing, the evaluator’s preferences have become the de facto goal, displacing the original objective. The circularity did not cause these failures in a simple causal sense, but it created the conditions under which they were inevitable.
The temptation is to declare victory because the system works, or to declare defeat because the circularity is fundamental. Neither response is adequate. The system works and the circularity is fundamental, and the appropriate response is not resolution but vigilance: treating every evaluation result as provisional, every optimization trajectory as potentially adversarial, every metric improvement as a claim that requires external validation it can never fully receive. We are building systems that function within a circle we cannot break, and the best we can do is to know the circle is there, to design compensating mechanisms that slow down the convergence toward our own blind spots, and to resist the entirely human impulse to mistake manageability for solution.
The circularity is not a bug. It is a property of the regime. And a system that optimizes against its own evaluation criteria, no matter how sophisticated the safeguards, is ultimately a system converging on its own reflection, which is to say, converging on a picture of quality that it generated, validated, and refined without ever stepping outside to check whether the picture resembles the world.