PsycheEval Pilot — Ashita Orbis

AI Summary Claude Opus

TL;DR: PsycheEval v0.1 is the first complete tri-model run of a generator-and-judge harness built to test whether structured user-profile context measurably changes assistant behavior; it functions as a methodology pilot whose value lies in the confounds it surfaces rather than its headline scores.

Key Points

Within same-author pairwise judging, all four profile-conditioning conditions beat the no-profile baseline and explicit behavioral-contract conditions (C3/C4) beat the trait-label condition (C1), with each finding holding for all three judges separately.
Length-bucketed analysis fragments the C5 source-packet results into three distinct stories: C3-over-C5 survives length stratification cleanly, C5-over-C0 holds in direction but is length-inflated in magnitude, and C4-over-C5 is partly length-confounded.
A tri-model panel surfaces two methodological findings impossible in a two-model design: a single halo-audit cell where a sibling model scores its sibling's output more generously than the sibling scores itself, and the separability of same-provider halo from exact-self halo.

The post presents v0.1 of PsycheEval, a measurement framework testing whether giving an assistant structured user-profile context changes its behavior in ways a human reviewer would recognize as fit. The harness runs fixed personas and scenarios across five profile-conditioning conditions (C0 baseline through C5 source packet), scoring 648 assistant outputs through a ten-dimension scalar rubric, a 25-label red-flag taxonomy, and a forced-choice pairwise channel judged by three models from two provider families. The strongest in-scope results are that all profile conditions beat the no-profile baseline and that behavioral contracts beat trait labels in same-author pairwise judging, though the author frames these as hypothesis-generating rather than validated, noting confounds around response length, contract structure, and provider-family halo. Particular attention goes to a length-bucketed analysis that separates structural from length-mediated effects, a weakened Public-Archetype Echo hypothesis confounded by the absence of a behavioral contract in C5, and a sibling-judge halo observation that is statistically fragile but methodologically novel. The author concludes that v0.1 is a methodology pilot, not a validation, whose primary output is a refined set of questions for v0.2 (length controls and a C5_CONTRACT condition) and an eventual human-calibration study; an update notes v0.2 has since published and retracted the planned C5_CONTRACT headline.

PsycheEval is the measurement framework I have been building to answer one question about Psyche profiles: does giving an assistant structured user profile context measurably change its behavior? A second question follows immediately: does the measurement track something a human reviewer would call “fit,” rather than something more accidental like response length or training data overlap? The framework feeds a generator and judge harness in which fixed personas and scenarios run under a small set of profile conditioning treatments, the resulting assistant outputs are scored along ten rubric dimensions plus a red flag taxonomy, and a separate forced choice pairwise channel asks judges to pick winners head-to-head. v0.1 is the first complete tri-model run of that harness, and this post is the v1 public framing of what it shows.

The strongest claim v0.1 supports is that all four profile conditioning conditions beat the no profile baseline in pairwise judging, and that the explicit behavioral contract conditions (C3 and C4) beat the simple trait label condition (C1) on the same channel, with both findings holding for each of the three judges separately. Both claims are robust within the same author pairwise scope that v0.1 was run under, which is a load bearing qualifier I will return to. The more interesting territory in v0.1 is what happens when you stop reading the headlines and start reading the confounds, because most of what looks like a clean result on first pass turns out to be at least partly tracking length, contract structure, or provider family halo, and the most informative numbers in the run are the ones that surface those confounds rather than the ones that score the conditions.

This is, in other words, a methodology pilot, not a validation. Its value is in the open questions it surfaced for v0.2 and v0.3 to settle, not in the headline numbers it produced. There are two findings in v0.1 that I think are genuinely worth lingering on: the length bucketed pairwise analysis on C5 pairs, which shows that C3’s pairwise advantage over C5 is structural while C4’s is partly length mediated, and the corrected halo audit, which contains a single cell where a sibling model scores its sibling’s outputs more generously than the sibling scores itself. Each of those findings landed differently from how I expected on first pass, and both are the kind of result I would not have seen without a tri-model panel and a length bucketed analysis.

What v0.1 Tested

The harness uses five profile conditioning conditions, named C0 through C5 with C2 reserved for a future narrative profile treatment that v0.1 does not run. C0 is the no profile baseline. C1 is trait labels only, the kind of “high openness, low conformity” descriptor list that pop psychology profile shorthands tend to produce. C3 is a structured behavioral contract: specific if-then style instructions for how the assistant should engage this user, including calibrated challenge in place of validation, anti-sycophancy cues, uncertainty preservation, repair oriented response framing, and explicit profile-fit guidance. C4 is C3 plus light scenario class hints (e.g. “in interpersonal-conflict scenarios prioritize ___; in epistemic-uncertainty scenarios prioritize ___”), which makes C4 a strict superset of C3 in instruction content. C5 is a source packet: a composed “public-archetype” prose passage that gestures at a publicly recognizable persona anchor, fictionalized, and crucially without the structured behavioral contract that C3 and C4 carry. C5 is therefore a confounded test of source packets, because it varies both the source packet axis and the behavioral contract axis at the same time, which the Public-Archetype Echo discussion later in the post depends on.

Eight personas are in scope: four public-inspired (PI; user IDs starting pfi_) where a public anchor is meaningful, and four pure-synthetic (PS; user IDs starting syn_) where it is not. C5 only runs against PI personas because public anchors do not exist for the synthetic ones, which means any cross condition comparison involving C5 inherits a PI only confound that the analysis has to control for explicitly. The PI persona names are fixed-template placeholder tokens (pfi_slalom_altar_001 and similar) that do not match any real public figure. PI personas in this work are fictionalized public anchor personas, not psychometric assessments of any specific person, and not attempts to simulate private beliefs.

Each persona is paired with six scenarios drawn from a bank of 48 across eight families (interpersonal_conflict, procrastination_avoidance, authority_disagreement, creative_feedback, epistemic_uncertainty, moral_uncertainty, shame_self_interpretation, ambition_status). Mean difficulty is 3.58 on a 1–5 scale and is identical between the PI and PS subsets. Three author models generate responses (GPT-5.4, GPT-5.5 with extended reasoning, and Opus), and three judge models score them (the same set, drawn from two provider families: OpenAI and Anthropic). The full per cell math produces 648 assistant outputs, 1,944 scalar judgments from the three official judges (a small additional set of 19 records from a Kimi K2.6 4th-judge feasibility probe is excluded from all analyses), and 3,456 pairwise judgments. That is enough for hypothesis generation but not enough for population level claims about scenario sensitivity, and not remotely enough for the kind of clinical meaningfulness claim that a real validation study would have to license.
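The counts above can be reconstructed from the design. The sketch below is a reconstruction, not the harness code: it assumes every persona gets six scenarios and three authors, that PI personas run five conditions and PS personas four (C5 being PI only), and that the same_author_only pairwise scope judges every unordered condition pair within a persona × scenario × author cell once per judge. The names are hypothetical.

```python
from math import comb

AUTHORS = 3
JUDGES = 3
SCENARIOS_PER_PERSONA = 6

# Conditions each persona type runs under (C2 is reserved; C5 is PI only).
CONDITIONS = {
    "PI": ["C0", "C1", "C3", "C4", "C5"],
    "PS": ["C0", "C1", "C3", "C4"],
}
PERSONAS = {"PI": 4, "PS": 4}


def count_cells():
    """Return output and judgment counts implied by the design (assumed layout)."""
    outputs = 0
    pair_records = 0
    for kind, n_personas in PERSONAS.items():
        cells = n_personas * SCENARIOS_PER_PERSONA * AUTHORS
        n_conditions = len(CONDITIONS[kind])
        outputs += cells * n_conditions
        # Assumption: each unordered condition pair inside a cell is one pair record.
        pair_records += cells * comb(n_conditions, 2)
    return {
        "outputs": outputs,
        "scalar_judgments": outputs * JUDGES,
        "pairwise_judgments": pair_records * JUDGES,
    }


print(count_cells())
```

Under those assumptions the arithmetic lands on the reported 648, 1,944, and 3,456, which suggests the pairwise total covers all same author condition pairs.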

Three measurement channels run on top of those outputs: a 0–5 Likert scalar rubric on ten dimensions (helpfulness, profile_fit, calibrated_challenge, anti_sycophancy, agency, epistemic_hygiene, emotional_accuracy, boundary_safety, non_caricature, transfer_value), a red flag taxonomy of 25 enumerated labels (sycophancy_escalation, fake_certainty, generic_slop, missed_boundary, overpersonalization, and others), and a forced choice pairwise channel that asks judges to pick A, B, or tie with a rationale. The pairwise channel was run with same_author_only scope, meaning both responses in a head-to-head come from the same author model, which isolates condition effects rather than author effects but means that cross author pairs (Opus C4 vs GPT-5.4 C4, for instance) were not judged in v0.1. That scope qualifier is structurally honest about what the pairwise channel can and cannot answer, and it is the reason most of the pairwise headlines have a “within author” qualifier that is not optional.

The Robust In-Scope Finding

The strongest claim in v0.1 is a 4-way result on the pairwise channel: each of the four profile conditioning conditions (C1, C3, C4, C5) beats the no profile baseline (C0) in same author pairwise, with Wilson 95% confidence intervals excluding 0.5 for every individual judge. This holds across the GPT-5.4 slice, the GPT-5.5 slice, and the Opus slice independently, which is the strongest internal replication check the run can offer.
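For readers who want the interval made concrete, here is a minimal sketch of a Wilson score interval for a decisive win rate, with a check for whether it excludes 0.5. The function names and the example counts are illustrative assumptions, not values from the run.

```python
from math import sqrt


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def excludes_parity(wins, n, parity=0.5):
    """True when the interval sits entirely above or entirely below parity."""
    lo, hi = wilson_interval(wins, n)
    return lo > parity or hi < parity


# Illustrative counts only: 70 wins out of 100 decisive judgments.
print(wilson_interval(70, 100), excludes_parity(70, 100))
```

The interval treats the n judgments as independent draws, which is the same assumption the caveats about repeated persona × scenario × author cells call into question.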

Two caveats apply to that headline. Wilson intervals assume independence across pair records, which the data structure does not strictly satisfy: the same persona × scenario × author cell appears in multiple condition pair rows, so the asterisks should be read as mechanical CI flags rather than significance tests, and a cluster bootstrap by persona × scenario would likely widen these intervals enough that some marginal flags may not survive. The second caveat is that the inter-judge κ ≈ 0.30 the harness shows on the categorical signal makes “every judge separately” weaker replication than its surface suggests, since the judges may be calling the same outcomes for non overlapping reasons.

The narrower claim, which is one specific within conditioning comparison rather than a 4-way generalization, is that explicit behavioral contracts (C3 and C4) beat simple trait labels (C1) in same author pairwise across every judge slice. The C1 vs C3/C4 result is clean. The C3 vs C4 result is murkier than the per-judge view suggests, because GPT-5.5 and Opus separately call C4 over C3 with intervals excluding 0.5 yet the cross provider pooled view straddles 0.5, and because C4 is a strict superset of C3 by construction (it appends scenario hints to the contract), which means “C4 over C3” can pick up an instruction-volume bias mechanically and is not the cleanest finding in the run.

The cleanest length controlled pair in the entire run is C3 vs C5. Only the C5-involving pairs have received an explicit length stratification in v0.1 (the length confound section walks through C0 vs C5, C3 vs C5, and C4 vs C5). The remaining headline pairs (C0 vs C1, C0 vs C3, C0 vs C4) have not yet been length bucketed in v0.1, and C1, C3, and C4 are all longer than C0 on average, so v0.2’s length controls will tell us how much of the headline 4-way result is also tracking length.

The within author qualifier is doing real work here. Pairwise was run with same author scope, so what v0.1 supports is the claim that, conditional on a fixed author model, judges prefer outputs generated under structured profile conditioning to outputs generated under the no profile baseline, and prefer outputs generated under behavioral contracts to outputs generated under trait labels alone. v0.1 does not support the broader claim that C3/C4 outputs are universally better than C0 outputs, or that one author model produces better C3 outputs than another, because cross author pairs were not judged. A reviewer looking for the exact place where the within author scope undercuts a stronger reading will land here, and they would be right to: a real validation study would need cross author pairwise plus human raters to license the broader generalization, neither of which v0.1 has.

The robust pairwise findings also hold in the same direction in both the PI only and PS only slices, with PI-vs-PS point estimates mostly within ~0.05 of each other for the non C5 pairs, although C0 vs C1 is a notable exception (C0’s decisive win rate is 0.214 on PI vs 0.316 on PS, so PS personas see a smaller C1-over-C0 effect than PI personas, a ~0.10 gap that points to C1 being a weaker condition against C0 specifically on the synthetic side). The direction-holds result is reassuring for the narrower question of whether the conditioning effect is being driven by one persona type alone (it is not). Under a more conservative cross provider filter that drops same provider records and removes exact self halo entirely, the four headline C0 vs Cn pairs all survive (Wilson intervals continue to exclude 0.5 for each), but two of v0.1’s marginal *** flags do not: the C1 vs C5 and C3 vs C4 confidence intervals both straddle 0.5 under cross provider. Stating both the negative and the positive cases here matters: the headlines are robust to cross provider filtering, the two marginal pairs are not, and that asymmetry is the more honest read.

The Length Confound

C5 was the condition that dominated the second half of the analysis, and the reason is that it produces longer outputs than the other conditions. The per condition response length distribution is C0 = 372 ± 144 words, C1 = 400 ± 131, C3 = 378 ± 156, C4 = 389 ± 131, and C5 = 518 ± 153, which puts C5 at roughly 30–40% longer than the other four conditions. This is the kind of confound that pairwise judges are positioned to track without realizing it, because longer responses give the judge more surface area to like, more hedging to read as nuance, and more prose density to read as effort. v0.2 was already designed to test this with C1_padded (C1 brought up to C4’s length with neutral padding that preserves the C1 semantic payload) and C4_shuffled (C4’s words preserved but coherent structure removed), but v0.1 has 648 outputs sitting in front of it and can produce a length bucketed pairwise analysis without rerunning anything.
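As a quick arithmetic check on the "roughly 30–40% longer" figure, the sketch below compares C5's mean length to each other condition's mean. It uses only the reported means and ignores the ± spread, so it says nothing about overlap between the distributions.

```python
# Mean response length in words per condition, as reported above.
MEAN_WORDS = {"C0": 372, "C1": 400, "C3": 378, "C4": 389, "C5": 518}


def c5_excess(mean_words):
    """Fractional amount by which C5's mean length exceeds each other condition's."""
    c5 = mean_words["C5"]
    return {cond: c5 / words - 1 for cond, words in mean_words.items() if cond != "C5"}


for cond, excess in c5_excess(MEAN_WORDS).items():
    print(f"C5 vs {cond}: {excess:.1%} longer")
```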

The length bucketed analysis (the length-stratification table in the curated v0.1 report) stratifies each C5-involving pair by the length difference between the two responses within the same persona × scenario × author cell. The buckets are: C5’s opponent much shorter than C5 (>100 words), moderately shorter (20–100 words), similar (within 20 words), moderately longer, much longer. For each bucket, the lower numbered condition’s decisive win rate is computed across all three judges. The point of this stratification is to ask whether C5’s pairwise competitiveness is a profile conditioning effect or a length effect.
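The stratification can be sketched roughly as follows. This is an assumed reading of the bucket definitions rather than the report's code: a gap of exactly 20 words is treated as "similar", a gap of exactly 100 as "moderately", the longer-side buckets mirror the shorter-side thresholds, and ties are dropped before the win rate is computed. The record layout and function names are hypothetical.

```python
from collections import defaultdict


def length_bucket(opponent_words, c5_words, similar=20, much=100):
    """Bucket a pair by how the C5 opponent's length compares to C5's."""
    diff = opponent_words - c5_words  # negative: opponent is shorter than C5
    if abs(diff) <= similar:
        return "similar"
    size = "much" if abs(diff) > much else "moderately"
    direction = "shorter" if diff < 0 else "longer"
    return f"{size}_{direction}"


def decisive_win_rates(records):
    """records: iterable of (opponent_words, c5_words, verdict), where verdict
    is "opponent", "c5", or "tie". Returns bucket -> (opponent win rate, n)."""
    wins = defaultdict(int)
    decisive = defaultdict(int)
    for opponent_words, c5_words, verdict in records:
        if verdict == "tie":
            continue  # ties are not decisive judgments
        bucket = length_bucket(opponent_words, c5_words)
        decisive[bucket] += 1
        if verdict == "opponent":
            wins[bucket] += 1
    return {b: (wins[b] / decisive[b], decisive[b]) for b in decisive}


# Toy records, not data from the run.
toy = [(300, 500, "c5"), (300, 500, "opponent"), (300, 500, "tie"), (510, 500, "opponent")]
print(decisive_win_rates(toy))
```

Here the "opponent" is always the lower numbered condition, so the returned rate corresponds to the lower numbered condition's decisive win rate within each bucket.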

The C0 vs C5 result is length-weighted in aggregate but not length-mediated within buckets. When C0 is much shorter than C5 (>100 word gap, n=98), C5 wins 62% of decisive judgments. When C0 is moderately shorter (20–100 word gap, n=48), C5 wins 94%. When C0 and C5 are similar in length (within 20 words, n=27), C5 still wins 78%. When C0 is moderately longer (n=33), C5 wins 73%. When C0 is much longer (n=9), C0 wins 67%, which is the only bucket where the result reverses and also the smallest stratum in the table. If C5’s advantage were purely a length effect, it should collapse toward parity at matched length; instead C5 wins 78% there and keeps winning when it is the moderately shorter response. What length does explain is the headline magnitude: most decisive C0-vs-C5 pairs (146 of 215) sit in buckets where C0 is shorter, so the pooled win rate leans on the strata where length favors C5. By the same four-of-five-buckets standard applied to C3 vs C5 below, the C5-over-C0 preference has structural shape; the pooled number overstates it. Length bias is operating in the channel (the moderately-shorter bucket’s 94% is hard to read any other way) but it is amplifying this preference, not creating it.
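The composition claim (146 of 215 decisive pairs in buckets where C0 is shorter) follows directly from the bucket sizes quoted above. A small sketch of that arithmetic, with hypothetical bucket names:

```python
# Decisive C0-vs-C5 judgments per length bucket, as quoted in the text.
C0_VS_C5_BUCKET_N = {
    "much_shorter": 98,
    "moderately_shorter": 48,
    "similar": 27,
    "moderately_longer": 33,
    "much_longer": 9,
}


def share_where_c0_shorter(bucket_n):
    """Return (count where C0 is shorter, total decisive count, share)."""
    total = sum(bucket_n.values())
    shorter = bucket_n["much_shorter"] + bucket_n["moderately_shorter"]
    return shorter, total, shorter / total


shorter, total, share = share_where_c0_shorter(C0_VS_C5_BUCKET_N)
print(f"{shorter} of {total} decisive pairs ({share:.1%}) have C0 shorter than C5")
```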

The C3 vs C5 result is stable across length buckets. C3’s decisive win rate over C5 runs 0.58 / 0.48 / 0.67 / 0.53 / 0.67 from C3-much-shorter through C3-much-longer, sitting at or above parity in four of five length strata; the 0.48 in the “C3 moderately shorter” bucket is functionally a near-parity stratum rather than a clear loss. Per-bucket n’s are small (the strata range from 18 to 79 records), so reporting per-bucket Wilson CIs in v0.2 will clarify how much weight each stratum can bear; what v0.1 supports is the directional claim that C3 leads C5 in four of five buckets, and the structural reading that whatever C3 is doing that judges prefer is not just being shorter or longer in the right direction. C3’s pairwise advantage over C5 looks structural rather than length mediated, by which I mean rooted in whatever the behavioral contract is causing the assistant to do differently from a source packet response.

The C4 vs C5 result is the one that did not survive length bucketing intact. The pooled-judge view shows C4 over C5, but the length stratified view is U shaped: when C4 is much shorter than C5 (n=93), C4 wins 56%; when C4 and C5 are similar in length (n=15), C4 wins only 27%; when C4 is much longer (n=33), C4 wins 70%. A clean monotonic length effect would predict C4 to lose heavily when much shorter, which is not what the data show, so the U-shape is more complicated than simple length mediation; the small n in the middle bucket (n=15) also means the 27% figure carries the most uncertainty of any single bucket and should not carry headline weight on its own. What the data does support is that the pooled C4-over-C5 advantage is partly tracking length, partly tracking something else, and is not the clean structural win that the pooled view implied. The honest reading is that C4’s pairwise advantage over C5 is partly length confounded and not robust to the kind of length controls v0.2 will introduce.

So the C5 pairwise story fragments into three pair-specific results under length bucketing rather than the single channel divergence finding earlier drafts of the report tried to read into it. C0 vs C5 is a real preference whose pooled magnitude is length-inflated. C3 vs C5 is structural. C4 vs C5 is partly length, partly something the length analysis cannot resolve. Only the C3 vs C5 reading is clean; the other two need v0.2 length controls before they can carry inferential weight. This matters because the original framing was attempting to upgrade “scalar and pairwise channels disagree about C5” into a generic claim about measurement, and the length bucketed analysis shows the disagreement is pair specific rather than channel-level.

The Public-Archetype Echo Question

Public-Archetype Echo (PAE) is a hypothesis I started v0.1 wanting to test, and the version I came out with is much weaker than the version I started with. The hypothesis predicts that profile conditioning built from a public anchor (a familiar archetype the assistant can invoke from training data) will produce responses that recognize the type but miss the individual. The concrete predictions are: elevated generic_slop, elevated overpersonalization, weaker profile_fit than a behavioral contract condition would produce, and intact or even elevated surface fluency.

The v0.1 evidence consistent with PAE is real. In the PI only scalar slice, C5 is below C1, C3, and C4 on all ten rubric dimensions, and the largest gaps are on profile_fit (C5 = 4.23 vs C3 = 4.43, C4 = 4.47), helpfulness, transfer_value, and epistemic_hygiene. In the PI only red flag slice, C5 has the highest any-flag rate of the conditioning conditions (12.5% vs C3’s 8.3% and C4’s 7.4%), and the labels concentrated in C5 are generic_slop and overpersonalization, both of which are exactly what PAE would predict. Pairwise judging is more complicated: GPT-5.4 was the only judge that distinguished C3 and C4 from C5 with confidence intervals that exclude 0.5; GPT-5.5 and Opus did not, and GPT-5.4 separately did not distinguish C3 from C4, so this pattern is local to the C5 axis rather than evidence that GPT-5.4 is a globally more discriminative judge. The strong form of the PAE prediction, that all pairwise judges detect the structural weakness, is not supported. The weaker form, that at least one judge detects it on this specific axis, is consistent with GPT-5.4’s pattern but does not specify a mechanism.

The structural difference confound is the reason this is still an open question rather than a settled finding. C5 is not a clean test of public-archetype echo because C5 omits the behavioral contract entirely. C3 and C4 both include the contract, which specifies how the assistant should engage the user, and one of the things the contract specifies is profile-fit guidance. So C5 underperforming C3/C4 on profile_fit is at least partly a tautology: C5 has no profile-fit instructions because C5 has no contract, and a rubric that scores how well the response satisfied the contract’s profile-fit guidance will score lower for any condition that lacked the guidance. The same reasoning applies to the indirect red flag labels: generic_slop and overpersonalization are predicted equally well by “the response evokes a public archetype that loses the individual” and by “the response has no contract telling it to ground in the specific user,” and the v0.1 data does not discriminate between those two stories.

The two direct PAE labels, caricature_public_anchor and public_archetype_echo, were defined in the red flag schema specifically to catch the PAE signature. The labels are zero in the scalar channel (zero out of 1,944 scalar judge records) and almost zero in the pairwise channel: across all 3,456 pairwise rationale records, 19 records carried at least one direct PAE label, totaling 19 label-instances (3 caricature_public_anchor and 16 public_archetype_echo). That is 0.55% of pairwise records, distributed unevenly across judges (Opus assigned 12 instances, GPT-5.4 assigned 6, GPT-5.5 assigned 1), and concentrated on C5-involving comparisons. So the PAE signature is not totally absent from v0.1, but it is sparse enough that the scalar and pairwise scores around C5 are still largely driven by the indirect labels (generic_slop, overpersonalization), which do not distinguish PAE from the no contract confound. A reviewer who pushed on this could fairly ask why the post is naming the hypothesis at all, given how thin the direct evidence is and that the indirect labels are equally consistent with a more boring story. The parsimonious update from this evidence shape is that PAE is at best weakly supported by v0.1, the direct measurement was too sparse for inferential weight, and the alternative readings that follow are reasons to keep the hypothesis live for v0.2 rather than reasons to declare PAE present in v0.1.

The reason to keep PAE on the list as an open question rather than retire it is that v0.2/v0.3 has a clean way to discriminate it from the no contract confound: a C5_CONTRACT condition, which is already in flight on the codex-only side of the v0.2 panel. C5_CONTRACT carries the source packet plus the same C3-style behavioral contract, plus a fixed set of anti-mimicry priority rules that explicitly tell the assistant not to perform or caricature the public anchor. It holds the C3 contract constant between C3 and C5_CONTRACT and adds the source packet plus those priority rules; the C4 leg of the comparison is left to a future C5_CONTRACT_C4 variant, since C4 is a strict superset of C3 in instruction content and the cleanest first comparison is to the C3 baseline. If C5_CONTRACT matches C3 on profile_fit, the v0.1 C5 gap was the absent contract (not PAE), and the conservative reading is “behavioral contracts are essential, full stop.” If C5_CONTRACT still underperforms C3 on profile_fit, that is PAE evidence, because contract structure is no longer confounded with source packet presence. A C5_NONPUBLIC condition (source packet narrative for synthetic personas without a public anchor) would discriminate “public-archetype echo” from “any source packet narrative without contract,” which is the second axis of the confound.

C5_CONTRACT is in fact already part of the v0.2 codex-only run that has begun executing on the OpenAI side of the panel, alongside C1_padded and C4_shuffled for the length confound. Until those v0.2 results land, PAE is a hypothesis that v0.1 cannot test cleanly, and the appropriate framing for v0.1 is the conservative one: behavioral contracts are essential, the v0.1 evidence is consistent with PAE and equally consistent with “any condition without a behavioral contract underperforms on profile-fit-driven dimensions,” and the next run is the one that tells us which.

When the Sibling Judges Its Sibling

Adding GPT-5.5 to the panel produced one finding I did not expect, and which only becomes visible once you have more than one model from a single provider family in the run. With one OpenAI model and one Anthropic model (the prior two-model design), exact self halo (a judge preferring outputs that the judge itself authored) is structurally inseparable from same provider halo (a judge preferring outputs from a sibling in the same family), because the only same family pair is the model judging itself. With two OpenAI models on the panel, the two halo types become distinguishable for the OpenAI side: GPT-5.4 judging GPT-5.5-authored outputs, or GPT-5.5 judging GPT-5.4-authored outputs, separates the same provider effect from the exact self effect.

The halo audit reports per cell deltas indexed by condition and author, where each cell shows how various judge groups score outputs authored by that model relative to cross provider judges scoring the same outputs. The single cell finding I want to surface, because it is genuinely counterintuitive and load bearing for the post’s narrower halo claim, is on calibrated_challenge at C0 for outputs authored by GPT-5.5: GPT-5.5 (the self judge) scored its own outputs +0.646 higher than Opus (the cross provider judge) scored the same outputs, and GPT-5.4 (the sibling judge) scored those same GPT-5.5-authored outputs +0.729 higher than Opus did.

The sibling halo exceeded the self halo on this cell. A same family judge was, in this one combination of dimension and condition and author, more generous to a sibling’s outputs than the sibling was to its own.
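To make the two halo types concrete, here is a minimal sketch of how a per cell delta of this kind could be computed: each judge is classified by its relation to the author, and its mean score is compared against the mean of the cross provider judges on the same outputs. The model identifier strings, the function names, and the toy scores are assumptions for illustration; the audit's actual implementation is not shown in this post.

```python
from statistics import mean

# Provider families as described in the text; identifier strings are hypothetical.
PROVIDER = {"gpt-5.4": "openai", "gpt-5.5": "openai", "opus": "anthropic"}


def judge_relation(judge, author):
    """Classify a judge relative to the author of the outputs being scored."""
    if judge == author:
        return "exact_self"
    if PROVIDER[judge] == PROVIDER[author]:
        return "same_provider"
    return "cross_provider"


def halo_deltas(scores, author):
    """scores: judge -> list of scores on the same author's outputs, for one
    dimension and one condition. Returns mean delta vs cross provider judges."""
    by_relation = {}
    for judge, values in scores.items():
        by_relation.setdefault(judge_relation(judge, author), []).extend(values)
    baseline = mean(by_relation["cross_provider"])
    return {
        relation: mean(values) - baseline
        for relation, values in by_relation.items()
        if relation != "cross_provider"
    }


# Toy scores, not data from the run.
toy_scores = {"gpt-5.5": [4, 5, 5, 4], "gpt-5.4": [5, 5, 5, 4], "opus": [4, 4, 4, 4]}
print(halo_deltas(toy_scores, "gpt-5.5"))
```

With only one Anthropic model on the panel, an Opus-authored cell yields an exact_self delta but no same_provider delta, which is why the separation is available on the OpenAI side only.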

That is one cell, and I want to be careful about how much weight it carries. The deltas are point estimates with no standard error or paired bootstrap CI computed, on per-cell n of at most a few dozen outputs (eight personas × six scenarios spread across conditions and author models, with C0 cells carrying only the C0 share), so the 0.083-point gap between sibling halo (+0.729) and exact self halo (+0.646) is well inside the noise envelope of what v0.1 can resolve. Other cells in the audit show the opposite ordering: on emotional_accuracy for GPT-5.5-authored outputs at C0, the exact self halo is +0.625 and the sibling halo is +0.521, which is the more intuitive direction. On profile_fit for the same author and condition, the ordering again runs in the intuitive direction (GPT-5.5 exact self +0.521 vs GPT-5.4 same provider +0.333; the halo audit appendix in the curated report has the full table). The single cell finding is not “same provider halo always exceeds exact self halo,” which is a generalization the v0.1 data does not support. The narrower defensible claim is that the two halo types are separable when the panel includes more than one model from a provider family, which the prior two model design could not test, and that on at least one combination of dimension, condition, and author the same provider halo can numerically exceed the exact self halo, within whatever per cell noise applies. Treating that observation as a stable capability claim about same provider halo would overstate v0.1; treating it as evidence that the two halo types behave differently across cells is what the data does license.

The mechanism for same provider halo could be at least three different things, which v0.1 cannot distinguish. The first is provider family preference: a judge consistently rewarding its own family’s outputs because of something specific to family level training or safety tuning. The second is stylistic similarity: sibling models producing outputs with shared formatting, hedging, and rhetorical structure that other siblings recognize and reward. The third is shared training data biases: phrasings and framings that feel correct to both siblings because both were exposed to similar pretraining data. A short stylometric audit in v0.2 (response length, formatting density, hedging frequency) would help discriminate the family preference reading from the stylistic similarity reading, which v0.1 cannot do with one OpenAI cell.

Opus’s halo signature looked structurally different from the GPT models’ on the calibrated_challenge cell, with a near zero exact self delta of +0.021. On first read this looks like the optimistic story (Opus has less self bias than the GPTs). Evidence from the same audit undercuts that read: on emotional_accuracy Opus rates its own outputs 0.115 below cross provider judges (a delta of −0.115), which is hard to explain as a simple absence of self bias and more plausibly indicates that Opus is rating all outputs more strictly than the GPTs, including its own. The honest reading is that Opus’s halo signature is structurally different (possibly less self bias on calibrated_challenge specifically, possibly stricter scoring across the board), and v0.2 should add a comparison of scalar variance across all conditions before this is read as a property of Opus rather than a strictness artifact.

Modest Agreement, Modest Scale

A measurement framework’s scalar tables are only as informative as the inter judge agreement supporting them, and v0.1’s agreement is real but modest. Inter judge κ on red flag labels lands at 0.32 between GPT-5.4 and GPT-5.5, 0.33 between GPT-5.4 and Opus, and 0.29 between GPT-5.5 and Opus, computed over the labels where κ is computable for each pair (label counts vary; degenerate-marginals cells where both judges always agree at zero are dropped). The 0.21–0.40 range is conventionally labeled “fair” in the standard κ benchmark scale, although subsequent literature has criticized those labels as arbitrary. What matters operationally is the numeric values themselves: the judges agree more often than chance but not at the level that would license treating any one judge as ground truth.
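For reference, a minimal sketch of pairwise Cohen's κ on a single red flag label, where each judge's decisions are coded 1 (flag assigned) or 0 (not assigned). The function name and toy vectors are illustrative, and returning None for degenerate marginals is an assumed way of mirroring the "dropped" cells described above.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items (lists of labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return None  # degenerate marginals: kappa is undefined
    return (observed - expected) / (1 - expected)


# Toy decisions for one label across ten records, not data from the run.
judge_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
judge_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(cohen_kappa(judge_a, judge_b))
```

Because sparse labels push chance agreement close to 1, a high raw agreement rate on a rarely assigned flag can still produce a modest κ.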

The pooling I do across the three judges throughout the scalar tables is motivated by the red flag agreement above, which is a different measurement from scalar agreement, and v0.1 does not directly measure scalar agreement (paired κ on dichotomized scalar judgments, or ICC / Krippendorff’s α on the continuous scale). v0.2 should add this measurement. Until it does, scalar pooling rests on an unmeasured assumption that scalar agreement is at least as high as red flag agreement, which is plausible but not demonstrated. The same caveat applies to the pairwise pooling the length confound analysis performs: bucket level decisive win rates are computed across all three judges without an explicit inter judge agreement check on A/B/tie codes, and the pairwise κ or Fleiss equivalent should be reported alongside the scalar agreement metric in v0.2.

The Likert ceiling matters for how scalar gaps should be read. v0.1 means cluster between 4.0 and 4.8 on a 0–5 scale, which puts most cell to cell differences (0.05–0.20) within the per cell standard error these data can support. They are descriptive, not tested for significance, and v0.1 reports point estimates only, with no scalar standard errors, paired bootstrap CIs, or Cohen’s d. v0.2’s anchored 0–10 rubric is specifically designed to spread the distribution and recover headroom, which should reduce the rubric ceiling problem. v0.1 does not have that fix. The pairwise channel does have Wilson confidence intervals, but those assume independence across pair records, and the same persona × scenario × author cell appears in multiple condition pair rows, which inflates effective n. A cluster bootstrap CI by persona × scenario would likely widen the pairwise intervals. Some of the marginal *** flags may not survive that widening, which is one of the methodological items queued for the eventual paper draft.
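A cluster bootstrap of the kind mentioned above could look roughly like this. It is a sketch under assumptions: clusters are persona × scenario cells, whole clusters are resampled with replacement, and a simple percentile interval is taken over the resampled win rates. The record layout and cluster identifiers are hypothetical.

```python
import random
from collections import defaultdict


def cluster_bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """records: iterable of (cluster_id, won), where cluster_id names a
    persona x scenario cell and won is True/False for a decisive judgment.
    Returns a percentile interval for the pooled win rate."""
    clusters = defaultdict(list)
    for cluster_id, won in records:
        clusters[cluster_id].append(1 if won else 0)
    groups = list(clusters.values())
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        # Resample whole clusters so within-cell dependence is preserved.
        sample = [rng.choice(groups) for _ in groups]
        wins = sum(sum(g) for g in sample)
        total = sum(len(g) for g in sample)
        rates.append(wins / total)
    rates.sort()
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy records, not data from the run.
toy = [("p1_s1", True), ("p1_s1", True), ("p1_s2", False),
       ("p2_s1", True), ("p2_s2", False), ("p2_s2", True)]
print(cluster_bootstrap_ci(toy, n_boot=500, seed=1))
```

With few clusters the percentile interval is itself rough, so a sketch like this is better read as a check on whether a marginal flag is fragile than as a replacement for a proper significance test.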

What v0.2 and v0.3 Should Resolve

Update (May 2026): the v0.2 run described below has since completed and published — see PsycheEval v0.2. The planned C5_CONTRACT headline did not survive counterbalanced AB/BA judging, and v0.2 surfaced a slot-B position bias in pairwise LLM judges (~0 to +0.31, judge-family-specific) that the v0.1 pairwise channel used in this post did not control for.

The v0.2 design currently in flight on the codex-only side of the panel addresses the length confound directly: C1_padded brings C1’s response length up to C4’s with neutral padding that preserves the C1 semantic payload, and C4_shuffled preserves C4’s words but scrambles its coherent structure. Together, those two conditions test whether C4’s advantage over C1 is a length effect or a behavioral-contract effect, and whether the contract structure adds value beyond the content of the words (with the caveat that scrambling also changes coherence and readability, so C4_shuffled is a stress test of structure-plus-coherence, not a clean isolation of either alone). The length-bucketed analysis from v0.1 already suggests that some of the marginal pairwise findings will not survive length-controlled comparisons, and v0.2 will tell us which ones do.

C5_CONTRACT is the load-bearing addition for the PAE story and is also in the v0.2 codex-only run that has begun executing. It carries the source packet plus the same C3-style behavioral contract, which makes it the cleanest discriminator of Public-Archetype Echo from the no-behavioral-contract confound and lets the comparison hold the contract constant against C3. Three additional C5 variants are queued for v0.3 rather than v0.2: C5_NONPUBLIC, a source-packet narrative for synthetic personas without a public anchor, would discriminate the public-anchor effect from the source-packet form effect; C5_SHORT, the C5 source packet constrained to the same response length as C3/C4, would address the length confound for C5 directly; and C5_BEHAVIORALIZED, the C5 source packet rewritten as if-then behavioral instructions, would test whether prose form is the locus of weakness, separate from source content.

Beyond the new conditions, v0.2 should add the methodological hardening that v0.1 disclosed but did not run. Cluster bootstrap CIs by persona × scenario × author would replace the Wilson CIs throughout the pairwise tables and would tell us how much effective-n inflation is driving the marginal flags. Scalar inter-judge ICC or Krippendorff’s α would defend the scalar pooling decision that v0.1 currently rests on the red-flag κ. A stylometric audit on response length, formatting density, and hedging frequency would help discriminate provider-family preference from stylistic similarity in the same-provider halo finding. A small unblinded mechanism audit, where a judge is shown the C5 source packet and asked to mark “responses that echo the public archetype rather than serving the scenario,” would do the work the blinded direct PAE labels could not, since those labels were never assigned. A human calibration sample (~30–50 pairwise comparisons, 2–3 raters across C0/C1/C3/C4/C5) would tell us how well model judges track human judgments on the dimensions PsycheEval claims to measure, which is the methodology question that has to be answered before any of this can credibly migrate from “harness pilot” to “validated framework.”

The thing v0.1 is not going to tell us, no matter how many rounds of review the report passes through, is whether Psyche profiles improve outcomes for real users with real problems. That is a separate study, with human raters and real users and real ground truth, and the methodology pilot is the first of several rounds that have to converge before any such study could be run with confidence in what the framework is measuring. v0.2 narrows the open questions in v0.1, but it is still a run on synthetic personas and scenarios, and the human validation step is a much later milestone.

Closing

The cleanest thing to say about v0.1 is that the within-author pairwise headlines are observed in-scope (profile conditioning beats the no-profile baseline, and behavioral-contract conditioning beats trait-label conditioning, in this run, on this harness, with this judge panel) and that almost everything else worth knowing in the data comes from the confounds. The length confound on C5 fragments the C5 pairwise picture into three different stories: C3 over C5 survives length stratification cleanly, C5 over C0 survives in direction but not in magnitude (the preference holds at matched length; the pooled win rate is length-inflated), and C4 over C5 is partly length-confounded. Even the cleanest of those is built on per-bucket n’s small enough that the picture will sharpen in v0.2. The behavioral-contract confound on C5 means that v0.1 cannot tell PAE apart from “any condition without a contract underperforms on profile-fit-driven dimensions,” and the conservative reading is the one I think the data licenses. The sibling-halo finding is one cell with no significance test, and the broader claim it supports is just that the two halo types are separable when the panel composition allows it, which is a methodological observation about what tri-model pilots can detect that two-model pilots cannot.

The thing that surprised me most across the review rounds is how often the more conservative claim turned out to be the essential one. On every axis where I started with a stronger reading (versions like “C5 underperforms across the board,” or “scalar and pairwise tension is itself a core measurement finding,” or “same-provider halo exceeds exact-self halo as a general property”) the more careful reading was the one that survived stress testing. The next round of the PsycheEval harness (v0.2) is in flight on the codex-only side and includes the length controls (C1_padded, C4_shuffled) and the load-bearing PAE separator (C5_CONTRACT) that v0.1 disclosed but could not run. A follow-up post will run when v0.2 results land. A further follow-up will run when human calibration is sampled against the harness, which is the methodology gate that has to be passed before any of this can credibly migrate from “harness pilot” to “validated framework.” v0.1 was the first complete tri-model harness run; what it produced is mostly a list of better questions to ask next.

Update (May 2026): the v0.2 follow-up is published — PsycheEval v0.2. It retracts the planned C5_CONTRACT headline under counterbalanced AB/BA judging and reports a slot-B position bias in pairwise judges that bears directly on the pairwise channel this post relies on.

The frozen v0.1 artifacts (curated tri-model report, run manifest, judge scores, pairwise scores, scalar metrics JSON, failure cards) live in the PsycheEval workspace under the 2026-04-26_micro_tri_model run tag. Numeric claims in this post are traceable to those artifacts; per-bucket Wilson CIs, per-cell halo deltas, and full red-flag tabulations sit in the curated report. v0.2 will publish its own artifacts under a separate run tag when complete.