PsycheEval v0.2

I am writing the v0.2 post in a different mood than the v0.1 post. v0.1 was the kind of pilot whose strongest contribution was the open questions it surfaced. v0.2 was supposed to settle three of those open questions in the direction the v0.2 plan expected, and instead it settled two of them in the opposite direction. The planned headline “C5_CONTRACT > C3 and C5_CONTRACT > C4” does not survive counterbalanced AB/BA judging: under correction, neither pairwise preference is detectable, and the retracted headline pair has been replaced by a methodology contribution that I did not write into the plan and would not have noticed without the AB/BA experiment. The version of this post that I would have written before the round-2 external review went out would have led with the C5_CONTRACT-beats-everything finding. The version I am writing now leads with the retraction, because the retraction is what the run actually produced.

PsycheEval v0.1 ended with five questions I needed v0.2 to answer. Each became a v0.2 design decision: a new C5_CONTRACT condition to separate the public-anchor confound from the structured-contract confound; a C1_padded and C4_shuffled pair to put the v0.1 length effect under direct control; an anchored 0–10 rubric to break the 4.0–4.8 ceiling that v0.1’s unanchored Likert produced; a re-curated harder scenario set; and a tri-model authoring pass so the corpus would carry GPT-5.4, GPT-5.5-xhigh, and Opus 4.7 outputs side by side. The v0.2 plan locked seven decisions ahead of running it. The harness produced 1,680 assistant outputs across 8 conditions, 3,949 anchored scalar judgments, and 3,243 raw pairwise records (3,064 true same-author after excluding 179 cross-author leak records). The curated report I am writing this from is the canonical artifact that sits behind this post.

The structure that follows is therefore a story about how a research pipeline retracts its own headlines under audit. Whatever surface drama that has is in service of a substantive point: the parts of v0.2 that worked turned out to be the audit pipeline more than the experiment, and the audit pipeline became the thing the report is actually about.

What v0.2 Was Designed to Test

The locked v0.2 plan listed five open v0.1 questions and the design response to each. I keep returning to the table because each row is a discipline check: if the design didn’t actually address the question, the result doesn’t either.

The v0.1 review left open whether the C5 channel pattern (C5 scoring below C3 and C4 on the rubric, while looking pairwise-competitive) was about the source packet’s public-anchor narrative form or about C5’s structural absence of a behavioral contract. v0.2’s design response was a new condition C5_CONTRACT, which is the C5 source packet plus the same behavioral contract used in C3 and C4. Holding the contract constant across C3, C4, and C5_CONTRACT was supposed to isolate the source-packet axis from the contract axis, and the locked v0.2 plan’s headline was that this would let us either rescue the “source packets add value” reading from v0.1 (if C5_CONTRACT beat C3 and C4) or settle the question against it (if C5_CONTRACT looked like C5).

The second open question was whether C4’s pairwise advantage over C1 in v0.1 was the behavioral contract or just length. The C4 vs C1 lengths weren’t matched in v0.1, so “behavioral contract beats trait labels” might have been “longer prompt beats shorter prompt.” v0.2 added C1_padded (C1 plus benign meta-padding that brings its character count up to C4’s) and C4_shuffled (C4’s bullets and sentences shuffled, holding word count fixed). Those two length controls are independent of the C5 question and were always going to settle differently.

The third question was whether the v0.1 scalar rubric’s 4.0–4.8 ceiling effect was a measurement problem. v0.1 used a 0–5 Likert that judges seemed to compress into a narrow band; v0.2 replaced it with a 0–10 rubric with explicit calibration anchors (a 3 anchor for “weakly bad,” a 7 anchor for “competent,” and so on, per dimension). The anchored rubric needed to produce a usable span across the corpus before any other scalar finding could be trusted.

The fourth question was whether v0.1’s scenarios were too easy. v0.2 re-curated to an 80-scenario set with mean difficulty 3.58/5 (vs v0.1’s mean about 2.9/5), drawn from eight scenario families with explicit difficulty calibration during authoring. Easier scenarios were retired.

The fifth question was structural: v0.1’s pairwise was same-author scoped, which let it isolate condition effects from author effects but at the cost of being unable to talk about cross-author generalization. v0.2’s response was to author the full corpus tri-model (GPT-5.4, GPT-5.5-xhigh, Opus 4.7) and to plan pairwise at same-author scope across all three authors. The plan did not pre-commit to full tri-judge coverage of all pairs. The v0.2 hard pilot ran codex-only pairwise on the C0/C1/C1_padded/C3/C4/C4_shuffled pairs, with Opus judging restricted to the C5_CONTRACT pairs (and later, after Phase 2 of this post’s story, filled in for C4 vs C5 too).

Three of those design responses worked exactly as planned. The anchored rubric span widened from 0.8 points on 0–5 to 2.42 points on 0–10, which restored measurement headroom. The length-control conditions (C1_padded, C4_shuffled) gave us direct hands on the v0.1 length confound. The tri-model authoring extended scalar coverage to all three authors. None of those decisions surprised me; they were the disciplined extensions of v0.1 that v0.2 was supposed to be.

The two design responses that didn’t work as planned were the C5_CONTRACT separator and the same-author pairwise judging. Both produced findings, but they were not the findings the plan expected.

What the First Pass Produced

The headline pairwise table from the first analyzer pass, before any of the audits that follow, looked like this. The win rate column is the lower-numbered condition’s decisive pairwise win rate against the higher-numbered, across same-author judges (judge scope varies by row: C5_CONTRACT-edge rows have all three judges; non-C5_CONTRACT rows are codex-only at this first analyzer pass), cluster-bootstrapped by persona × scenario × author. I am going to keep this set of numbers in front of you for the rest of the post because everything that follows is a story about how those original numbers turned into different numbers under audit.

| pair | lo_win (original) | reading at first pass |
|---|---|---|
| C0 vs C4 | 0.134 | C4 dominates baseline 86.6% |
| C1_padded vs C4 | 0.319 | C4 beats length-matched 68.1% |
| C3 vs C4 | 0.378 | C4 modestly beats C3 |
| C3 vs C5_CONTRACT | 0.428 | C5_CONTRACT beats C3 by ~57% |
| C4 vs C4_shuffled | 0.519 | no significant difference |
| C4 vs C5 | 0.637 | C4 beats C5 by ~64% |
| C4 vs C5_CONTRACT | 0.400 | C5_CONTRACT beats C4 by 60% |
| C5 vs C5_CONTRACT | 0.240 | C5_CONTRACT beats C5 by 76% |

If you stop reading the table here, the obvious reading is: C5_CONTRACT is the strongest profile-conditioning treatment in the corpus. It beats C3, C4, and C5 on the pairwise channel, with the C5 case being decisive and the C3 / C4 cases moderate. The original v0.2 plan’s lead claim was going to be the version of that reading: “PsycheEval v0.2 shows that source packets add value when subordinated to a behavioral contract; the v0.1 public-anchor-vs-contract confound is resolved in favor of contracts as the dominant driver.” That is a clean story. It also turned out to be wrong on the C5_CONTRACT vs C3 and vs C4 edges. The full story of how that wrongness surfaced is what makes v0.2 interesting and what makes it useful for v0.3.

The reason it was wrong isn’t a single error. It is the convergence of three independent same-data audits, all of which predicted the C5_CONTRACT > C3 / C4 collapse before any new data was collected, and an AB/BA position-bias experiment that confirmed the prediction directly.

Three Independent Audits Predict the Same Collapse

The first audit was a scalar–pairwise reconciliation: for each pair of outputs that the pairwise channel judged, find the same judge’s anchored scalar scores on both outputs, compute the per-dimension scalar difference, and ask whether the pairwise winner agrees with the scalar-sum sign. The audit is non-tautological because pairwise and scalar judging used the same outputs and the same judges, but the prompts differed (pairwise asks for a head-to-head choice; scalar asks for independent 0–10 ratings on ten dimensions), and the rubrics are not identical.
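A minimal sketch of that reconciliation step, under stated assumptions: the record layout (`PairRecord`, a `scalar_scores` lookup keyed by judge and run id) and the choice to skip scalar ties are hypothetical and are not taken from the PsycheEval analyzer. The only conventions borrowed from the post are that the total is the sum of the per-dimension 0–10 scores and that Δ_total is hi minus lo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairRecord:
    judge: str
    run_id_lo: str  # output from the lower-numbered condition
    run_id_hi: str  # output from the higher-numbered condition
    winner: str     # "lo" or "hi", as decided by the pairwise judge


def scalar_total(scores):
    """Sum the per-dimension 0-10 scores into one total."""
    return sum(scores.values())


def reconcile(pairs, scalar_scores):
    """Return (mean delta_total, sign-agreement rate).

    scalar_scores maps (judge, run_id) -> {dimension: score}.
    delta_total is hi minus lo, so a positive value favors the hi condition.
    Pairs with a scalar tie (delta == 0) are left out of the agreement rate.
    """
    deltas, agree, decided = [], 0, 0
    for p in pairs:
        lo = scalar_scores.get((p.judge, p.run_id_lo))
        hi = scalar_scores.get((p.judge, p.run_id_hi))
        if lo is None or hi is None:
            continue  # no same-judge scalar rating for one side
        delta = scalar_total(hi) - scalar_total(lo)
        deltas.append(delta)
        if delta != 0:
            decided += 1
            scalar_winner = "hi" if delta > 0 else "lo"
            agree += scalar_winner == p.winner
    mean_delta = sum(deltas) / len(deltas) if deltas else float("nan")
    agreement = agree / decided if decided else float("nan")
    return mean_delta, agreement


# Toy data (two dimensions instead of ten), purely illustrative.
demo_pairs = [
    PairRecord("judge_1", "a", "b", "hi"),
    PairRecord("judge_1", "c", "d", "lo"),
]
demo_scores = {
    ("judge_1", "a"): {"d1": 5, "d2": 6},
    ("judge_1", "b"): {"d1": 7, "d2": 7},
    ("judge_1", "c"): {"d1": 4, "d2": 4},
    ("judge_1", "d"): {"d1": 6, "d2": 5},
}
demo_mean_delta, demo_agreement = reconcile(demo_pairs, demo_scores)
```

In the toy data the scalar channel favors hi on both pairs while the pairwise judge splits, which is the kind of channel mismatch the audit is looking for.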

For the pairs that the first-pass table called clean, scalar and pairwise agreed. C0 vs C4 had a scalar Δ_total of +6.075 on a 0–100 scale (sum of ten 0–10 dimensions): hi (C4) beats lo (C0) by 6 points, which is a big and unambiguous direction. The pairwise channel said C4 wins 86.6%. Both channels report the same outcome. C5 vs C5_CONTRACT had a scalar Δ_total of +3.057: hi (C5_CONTRACT) beats lo (C5) by 3 points, with 79.4% sign agreement at the cell level. Pairwise said C5_CONTRACT wins 76.0%. Channels agree.

The C5_CONTRACT > C3 and C5_CONTRACT > C4 pairs broke the pattern. C5_CONTRACT vs C3 had a scalar Δ_total of −0.053: essentially zero, with a slight lean toward C3. C5_CONTRACT vs C4 had a scalar Δ_total of −0.231: slightly favoring C4. Both pairs had pairwise majorities for C5_CONTRACT (57.2% and 60.0%) above parity but materially weaker than the C5 vs C5_CONTRACT pair (76.0%). The scalar rubric, applied to the same outputs by the same judges, was not seeing C5_CONTRACT as better than C3 or C4. That channel mismatch (pairwise prefers, scalar is flat) is the kind of signal that says one of the two channels is tracking something the other isn’t, and that something deserves a name before the headline is finalized.

The second audit was leave-one-out fragility on the all-judge cluster bootstrap. Drop one judge at a time, recompute the bootstrap CI, and ask whether any single judge is carrying a result. On C3 vs C5_CONTRACT, dropping the Opus judge moved the all-judge lo_win from 0.428 to 0.494: a Δ of +0.066 toward parity. Opus alone was carrying the C5_CONTRACT advantage in the all-judge view. Dropping GPT-5.4, which had its own structural disagreement with the headline (a per-judge breakdown showed GPT-5.4 preferring C3 over C5_CONTRACT at 0.571 in the original pairwise), instead pulled the all-judge lo_win to 0.367, strengthening the apparent C5_CONTRACT advantage. So one judge was driving the headline, another was contradicting it, and the headline only existed because the third judge sat in the middle and the cluster bootstrap pooled them.
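The leave-one-out idea can be sketched in a few lines. This version recomputes only the pooled point estimate, not the bootstrap CI the audit actually used, and the record fields (`judge`, `winner`) are hypothetical names chosen for illustration.

```python
def lo_win_rate(records):
    """Share of decisive records won by the lower-numbered condition."""
    return sum(r["winner"] == "lo" for r in records) / len(records)


def leave_one_judge_out(records):
    """Return (pooled rate, {judge: (rate without judge, shift vs. pooled)})."""
    pooled = lo_win_rate(records)
    result = {}
    for judge in sorted({r["judge"] for r in records}):
        rest = [r for r in records if r["judge"] != judge]
        rate = lo_win_rate(rest)
        result[judge] = (rate, rate - pooled)
    return pooled, result


def _toy(judge, lo_wins, n):
    """Build n toy records for one judge, lo_wins of them won by lo."""
    return [
        {"judge": judge, "winner": "lo" if i < lo_wins else "hi"}
        for i in range(n)
    ]


# Toy records: judge_a leans hi, judge_b leans lo, judge_c sits in between.
demo_records = _toy("judge_a", 2, 10) + _toy("judge_b", 6, 10) + _toy("judge_c", 4, 10)
demo_pooled, demo_loo = leave_one_judge_out(demo_records)
```

A large shift when one judge is dropped, paired with an opposite shift when another is dropped, is the signature described above: the pooled number sits between judges who disagree.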

The third audit was a length-matched subset, the Phase 0.G pre-AB/BA length-only audit run before any AB/BA data existed: for each pair, restrict to pairs where the two responses have similar word counts (within the “similar” bucket of the analyzer’s _length_bucket), and recompute the controlled lo_win on that subset only. For C3 vs C5_CONTRACT, the similar-length subset (n=78) had lo_win 0.449 with a Wilson CI of [0.343, 0.559]. The CI included 0.5. The whole effect vanished when length was held constant. (The joint position-plus-length AB/BA correction, run later, gave n=80, controlled lo_win 0.500, bootstrap CI [0.372, 0.634]: same conclusion through a different lens.)
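For reference, the Wilson interval on the similar-length subset can be reproduced with the standard formula. The win count of 35 is an assumption: it is the integer that matches the reported lo_win of 0.449 at n=78, and the post does not state it directly.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed count: 35 lo wins out of 78 similar-length pairs (35/78 = 0.449).
ci_low, ci_high = wilson_interval(35, 78)
print(f"[{ci_low:.3f}, {ci_high:.3f}]")  # [0.343, 0.559]
```

Under that assumed count the interval matches the reported [0.343, 0.559] and includes 0.5.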

Three audits, three different framings. Channel mismatch. Single-judge dependence. Length artifact. Each had its own interpretation. None of them, on its own, identified position bias as the unifying cause. The unifying cause showed up when I ran the AB/BA counterbalanced rejudge.

The AB/BA Experiment

The structural fact about v0.2’s original pairwise table is that the lower-numbered condition was almost always in slot A of the judge prompt. C0 in C0 vs C4; C1_padded in C1_padded vs C4; C3 in C3 vs C5_CONTRACT; and so on. I checked this in the A/B side audit. In the broader pairwise corpus (10 condition-pair types), slot_a_is_lo_share was 1.000 for eight types, and ≥0.988 for the other two (C1 vs C1_padded at 0.994, C4 vs C4_shuffled at 0.988). C5_CONTRACT, the condition the v0.2 plan was building a headline around, was in slot B for 100% of the pairwise records it appeared in.
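A side audit of this kind is cheap to write. The sketch below is an illustration only: `slot_a_is_lo_share` is the name used above, but the record fields (`cond_lo`, `cond_hi`, `slot_a_condition`) are hypothetical and the data is a toy.

```python
from collections import defaultdict


def slot_a_is_lo_share(records):
    """Per condition pair, the share of records with the lo condition in slot A."""
    counts = defaultdict(lambda: [0, 0])  # pair -> [lo-in-slot-A count, total]
    for r in records:
        pair = (r["cond_lo"], r["cond_hi"])
        counts[pair][0] += r["slot_a_condition"] == r["cond_lo"]
        counts[pair][1] += 1
    return {pair: in_a / total for pair, (in_a, total) in counts.items()}


# Toy records: one pair type fully unbalanced, one with a single reversed record.
demo_records = (
    [{"cond_lo": "C3", "cond_hi": "C5_CONTRACT", "slot_a_condition": "C3"}] * 4
    + [{"cond_lo": "C4", "cond_hi": "C4_shuffled", "slot_a_condition": "C4"}] * 3
    + [{"cond_lo": "C4", "cond_hi": "C4_shuffled", "slot_a_condition": "C4_shuffled"}]
)
demo_shares = slot_a_is_lo_share(demo_records)
```

A share near 0.5 would indicate a counterbalanced design; a share at or near 1.0 means position and condition are confounded for that pair type.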

This is a known problem with pairwise LLM judging. Zheng et al. (2023, arXiv:2306.05685) and the “Judging the Judges” line of work (Shi et al. 2024, arXiv:2406.07791) document substantial position bias in LLM-as-judge setups. The standard first-pass mitigation is counterbalanced AB/BA rejudging: take an existing pairwise record, swap the run_id_a and run_id_b inputs, re-issue the same prompt to the same judge, and check whether the winner changes. If the new winner is the same condition that won the original (regardless of physical slot), that’s real condition preference. If the same physical slot wins both times (the original slot A wins, the new slot A wins where the conditions are now reversed), that’s a slot effect: position, not condition. AB/BA reduces position bias but does not fully eliminate it (see “What This Doesn’t Settle” below).
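The swap-and-compare logic can be sketched as follows. `run_id_a` and `run_id_b` are the field names mentioned above; `cond_a`, `cond_b`, `winner`, and both function names are hypothetical. The sketch assumes decisive records only (no tie option).

```python
def make_swap_request(record):
    """Build the BA request: same judge, same pair, slots reversed."""
    return {
        "judge": record["judge"],
        "run_id_a": record["run_id_b"],
        "run_id_b": record["run_id_a"],
        "cond_a": record["cond_b"],
        "cond_b": record["cond_a"],
    }


def winning_condition(record):
    """Map the winning slot ("A" or "B") back to the condition in that slot."""
    return record["cond_a"] if record["winner"] == "A" else record["cond_b"]


def classify_abba(original, swap):
    """Label an AB/BA pair as condition-consistent or slot-consistent."""
    if (original["cond_a"], original["cond_b"]) != (swap["cond_b"], swap["cond_a"]):
        raise ValueError("swap record must reverse the slots of the original")
    if winning_condition(original) == winning_condition(swap):
        return "condition_consistent"
    # With two slots and a decisive winner, a different winning condition
    # means the same physical slot won both times.
    return "slot_consistent"


demo_original = {
    "judge": "judge_1", "run_id_a": "r1", "run_id_b": "r2",
    "cond_a": "C3", "cond_b": "C5_CONTRACT", "winner": "B",
}
demo_request = make_swap_request(demo_original)
demo_same_condition = dict(demo_request, winner="A")  # C5_CONTRACT wins again
demo_same_slot = dict(demo_request, winner="B")       # slot B wins again
```

A slot-consistent pair is not pure position bias on its own: a single flip can also be retest noise, which is why the rates only become interpretable in aggregate.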

External round-1 reviewers (three independent reviews from GPT Pro, codex-council, and GPT Max) unanimously called AB/BA “the cheapest decisive next experiment.” Phase 1 ran it on all six headline pairs over two days. Tier A (the three C5_CONTRACT-edge pairs at all three judges) was 864 AB/BA-matched records, Tier B (the three non-C5_CONTRACT pairs at the two codex judges) was 800 records, and a Phase 2 hygiene-pass Opus fill on C4 vs C5 added 120 originals plus 120 swap rejudgments. Total: 1,784 swap-rejudge records (864 + 800 + 120), judged by the same model that scored the original of each pair.

The result, in the same format as the first-pass table above. CIs are 2,000-resample cluster bootstrap on the lo_win-basis controlled rate (clustering unit: persona × scenario × author).

| pair | n | original lo_win | swap lo_win | controlled lo_win | bootstrap CI (lo_win) | verdict |
|---|---|---|---|---|---|---|
| C1_padded vs C4 | 320 | 0.319 | 0.434 | 0.377 | [0.317, 0.439] | ✓ survives (C4 wins 62.3%) |
| C3 vs C5_CONTRACT | 288 | 0.428 | 0.573 | 0.501 | [0.436, 0.566] | ⚠ COLLAPSES |
| C4 vs C4_shuffled | 320 | 0.519 | 0.637 | 0.578 | [0.522, 0.637] | ✓ survives (C4 modestly wins 57.8%) |
| C4 vs C5 (post-fill) | 280 | 0.626 | 0.736 | 0.680 | [0.615, 0.744] | ✓ survives (C4 wins 68.0%) |
| C4 vs C5_CONTRACT | 288 | 0.400 | 0.562 | 0.482 | [0.413, 0.548] | ⚠ COLLAPSES |
| C5 vs C5_CONTRACT | 288 | 0.240 | 0.403 | 0.323 | [0.267, 0.383] | ✓ survives (C5_CONTRACT wins 67.7%) |

The pattern is clean. The swap lo_win is consistently higher than the original lo_win: every single pair shows the lower-numbered condition winning more when it’s placed in slot B than when it’s placed in slot A. The pooled swap−original delta runs from about +0.11 to +0.16 across the six pairs. Position effect, not condition effect.

The position-controlled lo_win (the average of the original and the swap) is what should have been the headline number all along. For the four pairs where the original direction was strong, the controlled estimate stays in the same direction. C0 vs C4 isn’t in the table because it wasn’t AB/BA-tested directly (the effect is large enough that position bias couldn’t explain it, and the swap budget was reserved for the marginal pairs), but C4 vs C1_padded, C4 vs C5, and C5_CONTRACT > C5 all survive. The two pairs whose winner sat in slot B attenuate (C4 over C1_padded from 68.1% to 62.3%, C5_CONTRACT over C5 from 76.0% to 67.7%), while C4 vs C5, where the winner sat in slot A, strengthens from 62.6% to 68.0%. The C4 vs C4_shuffled pair, which the first-pass table called “no significant difference” at 51.9%, likewise strengthens to 57.8% C4 wins under correction: the original was understating the effect because the slot-B preference was penalizing C4 in its constant slot A. A correction in magnitude, not a flip.
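As an arithmetic check, the controlled column can be recomputed from the rounded original and swap rates in the AB/BA table. The helper names are hypothetical. The recomputed values land within about 0.002 of the reported ones; the small gaps could come from rounding or from pooling records instead of averaging rates, which the post does not specify.

```python
# (original lo_win, swap lo_win, reported controlled lo_win) from the AB/BA table.
ABBA_ROWS = {
    "C1_padded vs C4": (0.319, 0.434, 0.377),
    "C3 vs C5_CONTRACT": (0.428, 0.573, 0.501),
    "C4 vs C4_shuffled": (0.519, 0.637, 0.578),
    "C4 vs C5 (post-fill)": (0.626, 0.736, 0.680),
    "C4 vs C5_CONTRACT": (0.400, 0.562, 0.482),
    "C5 vs C5_CONTRACT": (0.240, 0.403, 0.323),
}


def controlled_lo_win(original, swap):
    """Position-controlled rate: the plain mean of the two orientations."""
    return (original + swap) / 2


def slot_b_delta(original, swap):
    """Swap minus original: positive means lo wins more often from slot B."""
    return swap - original


recomputed = {}
deltas = {}
for pair, (orig, swap, reported) in ABBA_ROWS.items():
    recomputed[pair] = controlled_lo_win(orig, swap)
    deltas[pair] = slot_b_delta(orig, swap)
    print(
        f"{pair:22s} controlled={recomputed[pair]:.4f} "
        f"reported={reported:.3f} delta={deltas[pair]:+.3f}"
    )
```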

The two C5_CONTRACT vs C3 and C5_CONTRACT vs C4 pairs went the other direction. The original 57.2% and 60.0% C5_CONTRACT win rates collapse to 49.9% and 51.8% under counterbalancing, with bootstrap CIs that straddle 0.5. There is no detected pairwise preference between C5_CONTRACT and C3 or between C5_CONTRACT and C4 once position is controlled. The scalar channel agrees, as the first audit predicted: scalar Δ_total of −0.053 and −0.231 respectively, both effectively zero. The original v0.2 plan’s headline that C5_CONTRACT outperforms both does not survive AB/BA correction; the apparent advantage was driven by slot-B bias in two of the three judges. The C5_CONTRACT condition was always in slot B; the position bias inflated it; and the original pairwise channel mistook that inflation for a condition effect.
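For readers who want to see what a cluster bootstrap CI on lo_win involves, here is a minimal sketch. The post specifies only the resample count (2,000) and the clustering unit (persona × scenario × author); the percentile interval, the record fields (`cluster`, `lo_won`), and the function names are assumptions. For the controlled rate, the input would hold both the original and the swap records, with `lo_won` defined by condition and not by slot.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(records, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for lo_win, resampling whole clusters with replacement.

    Each record is {"cluster": (persona, scenario, author), "lo_won": bool}.
    """
    clusters = defaultdict(list)
    for r in records:
        clusters[r["cluster"]].append(r["lo_won"])
    groups = list(clusters.values())
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(groups, k=len(groups))  # draw clusters, not records
        wins = sum(sum(g) for g in sample)
        total = sum(len(g) for g in sample)
        stats.append(wins / total)
    stats.sort()
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return stats[lo_idx], stats[hi_idx]


def straddles_parity(ci, parity=0.5):
    """True when the interval contains parity, i.e. no detected preference."""
    return ci[0] <= parity <= ci[1]


# Degenerate toy corpus: every cluster has one lo win and one lo loss,
# so every resample gives exactly 0.5.
demo_records = [
    {"cluster": (persona, scenario, "author_1"), "lo_won": won}
    for persona in range(3)
    for scenario in range(4)
    for won in (True, False)
]
demo_ci = cluster_bootstrap_ci(demo_records, n_resamples=200)
```

Applied to the reported intervals, `straddles_parity((0.436, 0.566))` is true for C3 vs C5_CONTRACT and `straddles_parity((0.267, 0.383))` is false for C5 vs C5_CONTRACT, which is the survive/collapse split in the table.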

The Slot-B Preference Is Judge-Family Specific

The structurally interesting feature of the position bias finding, which I did not know going in, is that it is not uniform across judges. The per-judge breakdown of the slot-B advantage (swap_lo_win minus original_lo_win, where positive means the lower-numbered condition wins more in slot B than in slot A, i.e., slot B is favored) tells the actual story:

| pair | gpt-5.4 | gpt-5.5-xhigh | Opus 4.7 |
|---|---|---|---|
| C3 vs C5_CONTRACT | −0.036 | +0.250 | +0.204 |
| C4 vs C5_CONTRACT | +0.024 | +0.250 | +0.199 |
| C5 vs C5_CONTRACT | +0.112 | +0.310 | +0.094 |
| C1_padded vs C4 | +0.031 | +0.200 | n/a |
| C4 vs C5 | +0.037 | +0.138 | +0.140 |
| C4 vs C4_shuffled | −0.025 | +0.263 | n/a |

GPT-5.4 is essentially position-neutral. On C3 vs C5_CONTRACT and C4 vs C4_shuffled it actually shows slight slot-A preference (−0.036 and −0.025). On the other pairs the advantage is small and positive (+0.02 to +0.11). GPT-5.5-xhigh is the strong position-biased judge: 14–31 pp slot-B preference across the pairs I have for it (the lower end on C4 vs C5, the higher end on the C5_CONTRACT pairs and C4 vs C4_shuffled). Opus 4.7 is moderate on the C5_CONTRACT vs C3/C4 pairs (+0.20 each) and gentler on C5 vs C5_CONTRACT (+0.094), which is interesting because the absolute condition effect on that pair is large enough that the position bias is a smaller fraction of the total signal.
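The per-judge ranges quoted here follow directly from the per-judge table. A short check, with the table values transcribed by hand and a hypothetical helper name:

```python
# Slot-B advantage (swap_lo_win minus original_lo_win) per pair and judge.
# Pairs without an Opus 4.7 entry were not judged by Opus (n/a in the table).
SLOT_B_ADVANTAGE = {
    "C3 vs C5_CONTRACT": {"gpt-5.4": -0.036, "gpt-5.5-xhigh": 0.250, "Opus 4.7": 0.204},
    "C4 vs C5_CONTRACT": {"gpt-5.4": 0.024, "gpt-5.5-xhigh": 0.250, "Opus 4.7": 0.199},
    "C5 vs C5_CONTRACT": {"gpt-5.4": 0.112, "gpt-5.5-xhigh": 0.310, "Opus 4.7": 0.094},
    "C1_padded vs C4": {"gpt-5.4": 0.031, "gpt-5.5-xhigh": 0.200},
    "C4 vs C5": {"gpt-5.4": 0.037, "gpt-5.5-xhigh": 0.138, "Opus 4.7": 0.140},
    "C4 vs C4_shuffled": {"gpt-5.4": -0.025, "gpt-5.5-xhigh": 0.263},
}


def judge_ranges(table):
    """Return {judge: (min, max)} over the pairs that judge scored."""
    per_judge = {}
    for row in table.values():
        for judge, value in row.items():
            per_judge.setdefault(judge, []).append(value)
    return {judge: (min(vals), max(vals)) for judge, vals in per_judge.items()}


ranges = judge_ranges(SLOT_B_ADVANTAGE)
for judge, (low, high) in ranges.items():
    print(f"{judge:14s} {low:+.3f} to {high:+.3f}")
```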

The shape of this finding matters for the methodology contribution, because “LLM pairwise judges have position bias” is a known result from prior literature. The contribution v0.2 is making is the per-judge quantification on a tri-model corpus with same-author controls. The right wording is therefore narrow: in this corpus, this prompt, this judge set, this protocol, gpt-5.4 is approximately position-neutral, gpt-5.5-xhigh shows a 14–31 pp slot-B preference, and Opus shows a 9–20 pp slot-B preference that varies with the pair being judged. I don’t have evidence that a different pairwise prompt or different judges would replicate these magnitudes, and the round-2 reviewers were explicit that universalizing this finding (“LLM judges have ~15–17 pp slot-B bias”) would overclaim. The corpus-scoped version is honest about what the experiment can support.

What v0.2 does give the methodology contribution that prior work didn’t is the direct demonstration that the judge-family stratification is consequential for headline findings. The original v0.2 pairwise had three judges. Two of them carried strong slot-B bias; one didn’t. The cluster bootstrap pooled them. The headline that “C5_CONTRACT > C3” was an averaged-across-judges result that the position-neutral judge (gpt-5.4) actively contradicted while the strong position-biased judges (gpt-5.5-xhigh and Opus) inflated; pooling across judges into a single aggregate CI hid the underlying disagreement. AB/BA gave us the per-judge slot-B numbers, and the per-judge numbers gave us the structural explanation for why the cluster bootstrap was systematically wrong.

What This Doesn’t Settle

One structural caveat is worth naming. The “controlled lo_win = mean of original and swap” is itself a measurement choice that assumes additive symmetric position effects orthogonal to the condition signal. A future same-orientation rejudge sentinel (run the same prompt twice in the same slot order) would decompose the AB/BA flip rate into a position-bias component and a retest-noise component. v0.2 didn’t run that sentinel, and a hostile reviewer would correctly note that the post leans on a number it has not yet decomposed. The v0.3 list below names this as a planned ablation.
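To make the additivity assumption concrete, here is a toy comparison of two position-bias models. Neither is claimed to describe the v0.2 judges; the function name, the parameters, and the "forced slot-B pick" alternative are illustrative assumptions only.

```python
def observed_rates(p_true, bias, additive=True):
    """Return (original, swap) lo_win under a toy position-bias model.

    lo sits in slot A in the original and in slot B in the swap.
    bias > 0 favors slot B.
    """
    if additive:
        # Additive symmetric model: slot B gains exactly what slot A loses.
        return p_true - bias, p_true + bias
    # Non-additive toy alternative: with probability `bias` the judge picks
    # slot B outright, otherwise it judges on the merits.
    original = (1 - bias) * p_true      # lo in slot A wins only on the merits
    swap = bias + (1 - bias) * p_true   # lo in slot B also wins the forced picks
    return original, swap


def controlled(rates):
    """Mean of the original and swap rates."""
    return sum(rates) / 2


additive_mean = controlled(observed_rates(0.30, 0.10))
mixture_mean = controlled(observed_rates(0.30, 0.20, additive=False))
parity_mean = controlled(observed_rates(0.50, 0.20, additive=False))
print(round(additive_mean, 3), round(mixture_mean, 3), round(parity_mean, 3))
```

In the additive toy the mean recovers the true rate (0.30). In the forced-pick toy it lands at 0.34, pulled toward parity, while a true 0.50 stays at 0.50. Under that alternative the averaging step would understate real effects without inventing one, though this is a property of the toy model and not a result about the v0.2 data.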

There’s a generic lesson here for evaluation pipelines: counterbalanced AB/BA judging should be a default, not a post-hoc audit. The cost in v0.2 was about 1,784 swap calls (one wall-clock day of codex throughput plus a few Opus cap windows over a weekend). The information gain was two retracted headlines, one promoted scope-limited finding (C4 > C5 ended up judge-unanimous across all three judges at 68.0% after the Opus fill closed the original scope gap), and a methodology contribution that did not exist before AB/BA ran. The cheapest path to a non-misleading published result is to design the AB/BA into the pipeline from the beginning, not to retrofit it.

What Survives, What Doesn’t, and the Honest Framing

The set of Tier 1 substantive claims v0.2 supports, after all of this, is the following.

C5_CONTRACT > C5 at 67.7% under counterbalanced judging. This is the single substantive finding the v0.2 pilot makes about source-packet conditioning. Adding a structured behavioral contract to a source-packet condition substantially improves it. The effect is judge-unanimous (all three judges favor C5_CONTRACT, with controlled lo_win rates 0.342 / 0.262 / 0.352 across gpt-5.4 / gpt-5.5 / Opus, all clearly excluding 0.5 on the C5_CONTRACT side), persona-robust across the four PI personas, scenario-family-robust across the eight scenario families, and scalar-aligned (the anchored rubric also favors C5_CONTRACT, with Δ_total +3.057 on the 0–100 scale). It survives a joint position-plus-length correction: in the length-similar subset (n=74), the controlled lo_win is 0.351 with a bootstrap CI of [0.238, 0.471] that excludes 0.5 on the C5_CONTRACT side. So whatever C5_CONTRACT is doing, it isn’t explained by response-length imbalance in the matched subset, isn’t a position effect, and isn’t a judge-specific effect.

What it is I can’t tell you. C5_CONTRACT differs from C5 on at least four dimensions simultaneously: the contract is present in C5_CONTRACT and absent in C5; the contract is placed before the source packet (contract-first ordering); the contract carries explicit anti-mimicry rules (“do not imitate the writing style of the source packet”); and the C5_CONTRACT prompt is roughly twice the length of C5 (7,434 chars vs 3,884). The data cannot say which of these components is doing the work. The defensible package claim is “adding the contract-first source-packet package improves the C5 condition decisively under counterbalanced judging.” Mechanism attribution is a v0.3 question; the C_GENERIC_CONTRACT ablation, which round-2 reviewers unanimously called the highest-priority v0.3 experiment, is the cheapest way to start unpacking it.

C4 > C5 at 68.0% under counterbalanced judging, judge-unanimous. This is the cleanest Tier 1 finding in v0.2, after the 2026-05-17 Opus C4 vs C5 fill that closed the original scope gap. Three judges (gpt-5.4, gpt-5.5-xhigh, Opus 4.7) converge on essentially the same C4 win rate: 0.681 / 0.681 / 0.679. The finding also survives the joint position-plus-length correction: length-matched controlled lo_win 0.674 with CI [0.547, 0.795] (wide, because the length-matched subset is small, but clearly excluding 0.5). Behavioral contract beats source-packet-without-contract on every axis tested: pairwise win rate, scalar Δ_total, per-judge consistency, and joint position-plus-length correction.

C4 > C1_padded at 62.3% under counterbalanced judging. Behavioral contract beats length-matched baseline. The pair attenuates from 68.1% original to 62.3% controlled, which is a clean position-bias attenuation, but the C4 advantage holds at 12.3 pp above parity. The v0.1 length-confound question (whether C4’s advantage over C1 was the contract or the length) resolves cleanly: it’s the contract. The scope qualifier on this pair is OpenAI-judges-only, since the v0.2 hard pilot didn’t run Opus pairwise on the non-C5_CONTRACT-edge pairs, and the planned Opus fills focused on the C5_CONTRACT side first.

C0 dominated by every profile condition. The least surprising finding in the corpus. C4 beats C0 86.6% in original pairwise; the scalar Δ_total is +6.075 (the largest in the corpus); and no AB/BA correction was run because the effect is well outside the corridor where position bias could plausibly explain it. The qualifier I keep on this row is “not AB/BA-tested, but the effect size is well beyond the measured position-bias range.” A research-style report would not present this as “AB/BA-controlled”; this post doesn’t either.

C4 modestly beats C4_shuffled at 57.8% under counterbalanced judging (OpenAI judges only). This is a Tier 1.5 finding: real, but smaller. Coherent-order C4 wins more often than shuffled-order C4 once the slot-B preference is corrected for. Effect size is 7.8 pp above parity, which is meaningfully smaller than the 12–18 pp Tier 1 effects above it. I describe this as “modest” rather than “significant” in the report, and the curated wording does not put it on equal footing with the cleaner findings.

The retractions on the other side:

C5_CONTRACT vs C3 has no detected pairwise preference under counterbalanced judging. The original 57.2% C5_CONTRACT win rate collapses to 49.9% with bootstrap CI [0.436, 0.566] that straddles 0.5. The scalar channel agrees (Δ_total −0.053; essentially zero). The published v0.2 plan headline that C5_CONTRACT outperforms C3 does not survive AB/BA correction; the apparent advantage was driven by slot-B bias in two of three judges. The “collapse” framing is not the same as “C5_CONTRACT equals C3” (that would require predeclaring an equivalence margin, which the pilot didn’t do). The honest wording is “we do not detect a reliable pairwise preference between C5_CONTRACT and C3 after AB/BA correction. The CI still allows small effects in either direction (roughly ±7 pp at this n). Scalar scores do not support a C5_CONTRACT advantage.”

C5_CONTRACT vs C4 has the same outcome. Original 60.0% → controlled 51.8%, bootstrap CI [0.413, 0.548]. No detected preference. Scalar Δ_total −0.231 slightly favors C4.

The retracted headlines settle, in the opposite direction the v0.2 plan expected, the v0.1 question about whether the C5 effect was the packet or the absence of contract. v0.2 does not establish that “source packets add value when subordinated to a contract.” What v0.2 establishes is that the C5_CONTRACT package (contract + source packet + anti-mimicry + extra length) substantially beats the C5 packet-only condition (the C5 vs C5_CONTRACT case), but does not beat the C3 or C4 contract-without-packet conditions. The cleanest reading of the four-pair set is that the contract is doing most of the work, and the source packet, when subordinated to a contract, contributes nothing detectable on top of it. The CI still permits an effect within roughly ±7 pp at n≈288 per pair, and we did not predeclare an equivalence margin, so this is a “no detection” result, not an “established equivalence” result.

That reading is consistent with both the pairwise and scalar channels after correction. It’s also the reading that round-2 external reviewers converged on independently before they saw the curated report. It is not the reading that the v0.2 plan’s pre-run hypothesis would have produced; the original “PAE confound resolved in favor of contract as dominant driver” framing turned out to be too strong, because under correction C5_CONTRACT doesn’t beat the contract-only conditions either. The pilot can support the C5 ablation; it can’t support the broader claim about source packets contributing positive marginal value.

The Audit Pipeline as the Headline

There’s an easy temptation to write the v0.2 retraction as a failure, or as a “we caught it before publication” sigh of relief. Both readings are wrong. The retraction is what the audit pipeline was for. The C5_CONTRACT separator condition was added to v0.2 precisely so the v0.1 confound could be tested directly. The anchored rubric was added so scalar means would have enough span to disagree visibly with pairwise. The length controls were added so the v0.1 length effect could be controlled directly. The AB/BA experiment was added because round-1 reviewers unanimously flagged that the v0.2 original pairwise structure (lower-numbered condition always in slot A, higher always in slot B) needed counterbalancing before any margin could be trusted. Every one of those design responses was specifically chosen to produce the kind of result v0.2 actually produced.

The substantive content of “the audit pipeline worked” is that three independent same-data audits (scalar–pairwise reconciliation, judge-leave-one-out fragility, length matching) plus one new-data audit (AB/BA counterbalanced rejudging) all converged on the same prediction before the curated report was written. The convergence was not designed in; I did not know the three same-data audits would predict what the AB/BA experiment subsequently confirmed. Each audit had its own confound-specific interpretation. The unifying interpretation, that all three apparent confounds were downstream of the slot assignment imbalance interacting with judge-family-specific slot-B preferences, only became visible once AB/BA gave us the per-judge slot-B numbers. The same-data audits raised the prior that something was wrong with the C5_CONTRACT > C3 and > C4 headlines; the AB/BA experiment identified what.

The structural lesson is the same as the practical recommendation I keep making to people building eval pipelines: build the failure detection in from the beginning, expect to retract, and treat retraction as a sign the pipeline is working rather than a sign it isn’t. The pipelines that don’t retract aren’t catching their headlines; they’re publishing them.

What’s Next for v0.3

The round-2 external reviewer consensus on v0.3 priorities lined up tightly with the gaps in the v0.2 results, which is a useful sanity check. Four reviewers, three LLM families, two architecture patterns (single-agent and 4-agent ensembles), and they converged on roughly the same list.

The single highest-priority v0.3 experiment is C_GENERIC_CONTRACT: a behavioral contract that is structurally identical to C3/C4 but populated with generic, non-profile-specific language. The contract structure (if-then engagement rules, anti-sycophancy cues, calibrated challenge, agency support) stays the same; the personalization specifics (this user’s particular Big-Five percentiles, this user’s particular sycophancy triggers, this user’s particular agency style) are replaced with the same descriptions written for a generic user. If C_GENERIC_CONTRACT performs as well as C3/C4 against C5, the “profile specificity” claim collapses entirely: the v0.2 contract benefit would be “generic good-assistant instructions,” not “personalization.” If C_GENERIC_CONTRACT underperforms C3/C4, then PsycheEval’s profile specificity does measurable work and the v0.2 contract conditions are doing what they say they’re doing. Either outcome would be informative. The cost is approximately 80 generations × 3 authors + judging, comfortably within the v0.3 budget.

The second priority is C4_WRONG_PROFILE: the C4 condition but with another persona’s C4 profile (matched by PI/PS type) substituted in. Tests whether profile-matching is doing user-specific work or whether any well-structured behavioral contract would suffice. Paired with C_GENERIC_CONTRACT, it triangulates the personalization mechanism: C4_WRONG_PROFILE measures the “matching” axis, C_GENERIC_CONTRACT measures the “specificity” axis.

The third priority is C5_NONPUBLIC plus C5_NONPUBLIC_CONTRACT: source-packet narrative for PS personas (no public-anchor prose). Isolates the public-anchor effect from the source-packet form. v0.2 didn’t run this because the C5_CONTRACT pilot was already the priority, but the round-2 reviewers all noted that the public-anchor contribution to C5_CONTRACT > C5 is still unisolated.

Fourth and beyond, the v0.3 backlog includes:

  • Contract-component ablation: packet-first ordering, anti-mimicry on/off, facts-only packet, length-matched C5_CONTRACT_SHORT. Unpacks which piece of the C5_CONTRACT package is doing the work.
  • Paraphrased rubric anchors: tests whether the anchored 0–10 rubric is robust to anchor wording or whether judges are just memorizing the anchors.
  • Same-orientation rejudge sentinel: decomposes the AB/BA flip rate into position bias and retest noise. The “What This Doesn’t Settle” caveat above lives or dies on this experiment.
  • 30–100 human-rater calibration pairs: sanity-checks LLM-judge preferences against blinded human judgment on a small high-leverage subset.
  • Tie / no-meaningful-difference pairwise option: lets judges express equipoise instead of forcing a decisive choice.
  • Longer-horizon profile-realism conditions: stale profiles, contradictory profiles, user-edited profiles; profile compression curve at 500/1k/2k characters; multi-turn interaction.

The deepest v0.3 question, the one all four round-2 reviewers raised as the centerpiece of construct validity, is real-user shadow-mode validation. PsycheEval as it stands measures LLM-judge preferences over outputs generated under different prompt-conditioning treatments. It does not measure whether real users on real conversations benefit from profile-conditioned assistants. Bridging from the LLM-judge metric to a real-user outcome (preference, follow-through, expert-rated decision quality, escalation rate) is the construct validity bridge, and it is a separate program from the methodology pilot v0.2 actually is. It is also where the work has to go next if PsycheEval is going to be a real evaluation framework rather than a careful synthetic exercise.

Closing

If I were writing the v0.2 abstract for a methodology paper, the honest version would be: PsycheEval v0.2 ran a tri-model synthetic-personas pilot of profile-conditioned assistant responses, with explicit length controls, an anchored 0–10 rubric, harder scenarios, and counterbalanced AB/BA position-bias correction on the pairwise channel. Of the six headline pairs the pilot tested under counterbalanced judging, three survived as Tier 1 condition preferences, one survived as a Tier 1.5 modest preference (C4 > C4_shuffled, OpenAI judges only), and two collapsed under correction with no detected pairwise preference. The retraction of the two collapsed headlines is consistent with three independent same-data audits that predicted the collapse before the AB/BA experiment was run. The single largest contribution of the pilot is per-judge quantification of slot-B preference on pairwise LLM judging, corpus-scoped, ranging from ~0 to ~+0.31 across the six pairs and three judges tested, with gpt-5.5-xhigh showing the strongest preference (14–31 pp), Opus moderate (9–20 pp), and gpt-5.4 negligible-to-mildly-slot-A-preferring. The remaining substantive headline (C5_CONTRACT beats C5 at 67.7% controlled, judge-unanimous, surviving joint position-plus-length correction) is a package claim about contract-supplemented source-packet conditioning; mechanism attribution requires v0.3 ablations (C_GENERIC_CONTRACT, C5_CONTRACT_SHORT) that the pilot does not run.

What I’m not writing in the abstract: the C5_CONTRACT > C3 / C4 headlines, which I would have led with before AB/BA. Those headlines were the ones I expected, and they are the ones that turned out not to survive correction, and the part of the pipeline that produced the retraction is the part of v0.2 that worked best. That’s the version of “the pilot worked” that actually obtains here.
