What AI Learns About You When You’re Not Looking
A formal research paper presenting the methodology and ten analytical contributions from this work is available: Convergent AI-Mediated Personality Assessment.
Two posts in this series described two approaches to the same problem: understanding a person’s psychology from their digital trace.
Building Your Own Personality Profile with AI described Psyche: seventeen validated instruments (later expanded to 39 in the full battery), three distinct inference methods, ~786 questionnaire items, a semi-structured interview, and analysis of a 1.47-million-word writing corpus. Explicit measurement. The output is a structured profile with dimensional scores, confidence intervals, and behavioral specifications.
From Text Messages to Literary Memoir described the narrative pipeline: 267MB of text messages fed through Claude Opus as a literary engine, producing approximately 189,000 words of first-person memoir across 128 chapters grounded in hash-based source citations. Implicit inference. The output is a literary construction that infers the interior life from the evidence of what was said, to whom, and when.
Both aimed at the same target (one person’s psychology) through completely different methods. Neither was designed to validate the other. Psyche was built as an open-source personality profiling framework. The narrative pipeline was built to see what literary generation could extract from messaging data. The comparison happened after the fact, when the outputs existed side by side and the obvious question became unavoidable: do they see the same person?
The Experimental Setup
Psyche’s approach: administer validated instruments (Big Five at 30-facet granularity via IPIP-NEO-300, HEXACO-60, attachment via ECR-R, emotion regulation via ERQ-10, empathy via IRI-28, self-monitoring, locus of control, grit, vocational interests, basic needs), combine with LLM text inference from a diverse writing corpus, and conduct a semi-structured interview. Triangulate across three methods. (Empath lexical analysis was originally a fourth method but was removed from synthesis after analysis showed it measures language register rather than personality; it remains as a corpus characterization tool.) Produce a ten-dimension persona model mapping scores to behavioral predictions.
The narrative pipeline’s approach: extract a quantitative style profile and qualitative personality reference from messaging data (totaling about 2.3KB of behavioral notes), feed Opus the message archives and these references, and let it generate first-person literary memoir across 128 chapters spanning two arcs: 72 chapters in the first, covering several years, and 56 in the second, covering roughly eight months. The personality reference describes texting behavior: emoji rates, filler words, hedging patterns. Any psychological depth beyond these surface metrics was inferred by Opus from the messaging data itself.
The key asymmetry: Psyche measures with instruments designed for the purpose. The narrative pipeline never received a personality profile. Every psychological mechanism it identifies (the attachment architecture, the emotion regulation patterns, the analysis-as-action-substitute) was derived by Opus from how a person texts. If these implicit inferences align with Psyche’s explicit measurements, it means Opus successfully extracted personality from conversational data. If they diverge, either Opus got something wrong, Psyche missed something, or the person changed between the period being narrated and the measurement point.
The Entanglement Problem
Before reporting results, the structural confound needs to be stated plainly. These are not independent assessments.
Shared model. Claude Opus generated both the Psyche synthesis report and the voice clone narratives. Same model means shared rendering tendencies. If Opus has a bias toward certain personality constructions (if it gravitates toward particular psychological frameworks regardless of input), convergence is partly artifactual.
Overlapping corpus. Psyche’s LLM analysis drew on writing samples that include content adjacent to SMS. The degree of overlap with the narrative pipeline’s source data is uncertain but nonzero.
These entanglements mean convergence is partly expected. Divergence is more informative. Where the two outputs disagree despite sharing a model and overlapping data, the disagreement is more likely to reflect genuine signal. The evaluation interprets results with this asymmetry: convergence is noted but weighted cautiously; divergence is taken seriously.
Three Tiers of Evidence
The comparison operates across three evidential tiers:
Tier 1 (Real): Passages in the narrative with {FN:uid} citation markers linking to actual messages in the archive. Also: Psyche interview quotes, SMS corpus statistics. This is ground truth: things that were actually said or measured.
Tier 2 (Inferred): The unsourced 88-92% of the narrative, which is Opus’s literary construction of inner life, emotional states, psychological mechanisms. This is what we’re evaluating.
Tier 3 (Measured): Psyche’s instrument scores with confidence intervals and behavioral specifications. This is the benchmark.
The evaluation asks: does Tier 2 align with Tier 3? Where Opus’s unsourced literary construction matches Psyche’s explicit psychometric measurement, Opus successfully inferred personality from texting data. Where they diverge, we have something more interesting than agreement.
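How the tiers are separated is mechanical: Tier 1 is whatever carries a citation marker. A minimal sketch, assuming the markers follow the literal {FN:uid} format with alphanumeric uids and treating paragraphs as the unit of sourcing (both assumptions, not the pipeline’s actual accounting):

```python
import re

# Sketch of the Tier 1 / Tier 2 split. Assumes markers look like {FN:a3f9c2};
# the real pipeline's accounting may differ.
CITATION = re.compile(r"\{FN:[A-Za-z0-9]+\}")

def citation_coverage(narrative: str) -> float:
    """Fraction of paragraphs grounded by at least one {FN:uid} citation."""
    paragraphs = [p for p in narrative.split("\n\n") if p.strip()]
    cited = sum(1 for p in paragraphs if CITATION.search(p))
    return cited / len(paragraphs) if paragraphs else 0.0

# A result near 0.08-0.12 corresponds to the 88-92% unsourced share above.
```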
The Temporal Dimension
The narrative spans roughly 2017 to 2026. Psyche was administered in February 2026. The person changed across this period.
The personality reference documents specific evolution between eras: emoji usage increased from 9.1% to 61.5%, question rate dropped from 18.3% to 10.9%. These metrics illustrate how the pipeline quantifies change over time. The earlier arc shows indirect communication patterns: emotions processed through proxies, vulnerable statements routed through hedges. The later arc shows direct engagement. The person who took Psyche’s instruments is the evolved version, someone who has already undergone the changes the narratives dramatize.
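Metrics like these are simple corpus statistics. A minimal sketch of how the per-era numbers could be computed (the emoji character class and the question heuristic are illustrative simplifications, not the pipeline’s actual feature extractor):

```python
import re
from typing import Iterable

# Rough emoji class covering the main symbol and emoticon blocks (illustrative).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def era_style_metrics(messages: Iterable[str]) -> dict[str, float]:
    """Percent of messages containing an emoji, and percent ending in a question."""
    msgs = [m for m in messages if m.strip()]
    n = len(msgs) or 1  # avoid division by zero on an empty era
    return {
        "emoji_pct": 100 * sum(bool(EMOJI.search(m)) for m in msgs) / n,
        "question_pct": 100 * sum(m.rstrip().endswith("?") for m in msgs) / n,
    }

# Comparing era_style_metrics(arc_1) against era_style_metrics(arc_2) surfaces
# shifts like the 9.1% -> 61.5% emoji change quoted above.
```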
The methodological question this raises is snapshot versus trajectory. A standard comparison across methods would use a binary consistent/divergent scale. The temporal gap between the narrative period (2017-2026) and the measurement point (February 2026) requires two additional categories to capture change and partial overlap:
- CONSISTENT: Opus’s construction aligns with Psyche’s measurement
- EVOLVED: The earlier and later arcs show real change; Psyche captures one end of the trajectory
- PARTIALLY CONSISTENT: Alignment on some aspects, gaps on others
- DIVERGENT: Sources describe contradictory traits
“Evolved” is not a euphemism for “inconsistent.” It means the narrative documents a change that Psyche’s snapshot captures the endpoint of. The person who routed every vulnerable statement through a proxy and the person who engages directly are the same person at different points in a developmental arc. The evolution is the finding.
Results by Dimension
The evaluation compared Opus’s implicit model against Psyche’s explicit measurement across ten dimensions. The summary:
| Dimension | Rating | Methodology Note |
|---|---|---|
| Epistemic Style | CONSISTENT | Strongest convergence. Instruments and narrative independently agree on analytical processing, including its specific failure mode under irreducible experience. |
| Conflict Response | CONSISTENT | Narrative reconstructed the exact regulation sequence (reappraisal first, suppression when that fails) that instrument scores predict, playing out across entire arcs rather than single scenes. |
| Interpersonal Patterns | CONSISTENT | Same metaphor (glass pane separation) appeared in three independent evidence tiers. Proxy architecture visible across both arcs. |
| Attachment | CONSISTENT | Anxious-preoccupied pattern (elevated anxiety, near-floor avoidance) appears identically across both arcs with different people eight years apart. Same behavioral architecture, different partner, different era. Predictive of behavioral outcomes across the narrative. |
| Stress Response | CONSISTENT | Dual-deployment regulation (reappraisal then suppression) and momentum-based recovery inferred from messaging patterns match ERQ scores at near-ceiling (reappraisal 94th percentile, suppression 83rd percentile) and interview evidence of crisis behavior. |
| Communication | EVOLVED | Pipeline captured trajectory from indirect (proxied, hedged) to direct engagement. Psyche’s low self-monitoring measurement captures the resolved endpoint, not the origin. |
| Decision-Making | EVOLVED | Same deliberation architecture produces different conversion rates across eras: near-zero action in the earlier arc, rapid conversion in the later. Psyche measures the mature form. |
| Self-Concept | EVOLVED | Gap between ceiling analytical capacity and low self-esteem persists across eras, but the relationship to the gap changes. Psyche caught a person mid-trajectory. |
| Motivation | PARTIAL | SMS data captures relational motivation intensity but misses vocational and investigative dimensions Psyche measured at ceiling. Data source limitation, not method failure. |
| Flow States | PARTIAL | Specific phenomenology of intellectual flow has no narrative counterpart. Some dimensions are structurally unreachable from messaging data. |
Five dimensions showed straightforward convergence. Three showed real change across the narrative’s temporal span, with Psyche capturing one end of the trajectory. Two were limited by the data source rather than the method. The most informative results are not in the convergence (which is partly expected given the entanglements) but in what each approach captured that the other couldn’t.
What Opus Got Right
The most interesting results are where Opus’s unsourced narrative construction (the 90% that is literary inference, not cited evidence) aligns with Psyche’s measurement. These are ordered by evidential strength.
The attachment architecture. Opus constructs textbook anxious-preoccupied across both arcs: hypervigilance, monitoring of evidence, rapid investment, movement toward rather than away from. The same behavioral signature appears with two different people eight years apart. This cross-arc consistency is the strongest evidence in the evaluation because it is harder to explain solely by shared-model entanglement: the model constructed both arcs independently from different message archives, and the specific behavioral signature (hypervigilance, evidence-monitoring, rapid investment) is not a generic model default. The reassurance protocol built on evidence is Opus’s most precise psychological inference, derived from texting patterns alone, matching dimensional attachment scores.
Dual deployment of regulation. Opus constructs the exact reappraisal-then-suppression sequence that emotion regulation scores predict, across entire arcs. A confession of love gets recategorized as an imposition. When that fails to contain the emotion, the narrator retreats to curating what they show. The system breaks, reconstitutes, and breaks again at timescales of years. No instrument was consulted. The regulation architecture was inferred from how a person texts.
The competence gap. Opus consistently constructs a narrator who can analyze with extraordinary precision but cannot evaluate their own worth. This is the gap between ceiling analytical reasoning and self-esteem in the lowest quintile rendered as narrative architecture. Instruments measure the two endpoints separately. Opus synthesized them into a lived experience of the gap: someone who can structurally analyze a texting grammar but can’t evaluate whether their own feelings are welcome.
Analysis-as-action-substitute. The most integrative construction. No single instrument operationalizes this specific behavioral prediction. Opus synthesized it from messaging patterns showing compulsive deliberation preceding every significant action: the deliberation architecture so thorough it replaces the need to decide. Psyche’s instruments measure the components in pieces (analytical reasoning, need for cognition, balanced locus of control). Opus synthesized them into a mechanism the psychometric toolkit doesn’t have a name for.
Temporal dynamics. The pipeline captured something instruments fundamentally can’t: how the same trait architecture produces different behavioral outputs across years. The hedging-to-directness trajectory in communication. The deliberation-to-action compression in decision-making. The deficit-to-integration sequence in self-concept. Psyche provides a snapshot. The narrative shows the motion. This is the methodology’s most distinctive contribution: not replicating what instruments measure, but capturing what they structurally cannot.
What Opus Got Wrong
The literary voice is too articulate. The sourced quotes are hedged, fragmentary, grammatically collapsed under emotional pressure. Opus’s constructed passages are polished, analytical, architecturally precise. The gap between the sourced and constructed registers is itself a finding: the pipeline produces a more psychologically articulate version of the subject than the subject is when texting in real time. This is by design (the personality reference frames the output as literary memoir, not a simulation of texting patterns). But it means the narrative systematically overstates the subject’s self-awareness in the moment.
Motivational complexity is flattened. Text messages are a relational medium. The pipeline saw only the relational slice and constructed motivation as relational singularity. The real person maintained investigative engagements through the darkest periods: systematic philosophical discussions, counterfactual analysis of decision points. Psyche’s access to academic writing and interview data captured the full motivational landscape. Within the relational slice, the narrative was accurate. But it missed what was equally defining. The data source limits what the method can capture.
Some inferences may be projection. The narrative contains striking psychological metaphors (fear of starvation applied to intimacy, a competent machine running without oil applied to self-concept) that the subject never articulated. They’re psychologically plausible and evidentially ungrounded. The risk is that Opus is projecting literary conceits onto a personality rather than inferring from evidence. The citation system makes this visible (no {FN:uid} = unsourced), but visibility doesn’t equal accuracy.
Structurally unreachable dimensions. Some of what Psyche measured has no possible narrative counterpart from SMS data. Cognitive phenomenology (how thought is experienced), dissociative flow states, empathy decomposition: these constructs don’t surface in text messages. The gap is not a method failure but a data source boundary. Wherever Psyche had access to data outside the relational domain, it captures dimensions the pipeline can’t reach.
What Each Method Captures That the Other Can’t
The comparison reveals a clean division of labor.
The narrative pipeline captures what instruments can’t:
Temporal dynamics. Psyche provides a snapshot; the narratives show the same traits evolving across years and arcs. The same hedging architecture producing indirect confession in one context and direct declaration in another. No instrument captures trajectory.
Failure modes. When the analytical framework collapses, when suppression fails, when the dual-track system short-circuits: these are personality under maximum stress. Instruments that rely on self-report measure typical behavior. Narratives show what happens when typical behavior is no longer available.
Phenomenological texture. What low self-esteem feels like from inside. What the glass pane looks like. What anxious attachment sounds like at 5 AM. Psyche’s scores are precise. The narratives are vivid.
Psyche captures what narratives can’t:
Quantified dimensionality. Moderate attachment anxiety is not the same as extreme attachment anxiety. Psyche can distinguish them. The narratives show “anxious” without specifying degree.
Dimensions outside relationships. Interest cycling, vocational orientation, cognitive phenomenology, empathy decomposition: constructs that don’t surface in text message data but are captured by instruments designed for the purpose.
The things people won’t say. Instruments relying on self-report ask directly about tendencies the subject might never articulate in conversation. The narratives can only work with what was said to someone. The instruments work with what was said to the instrument.
The Digital Exhaust Thesis
If Opus can infer personality from messaging data that converges with psychometric testing (and the evidence suggests it can, at least partially, for dimensions that messaging data captures) then the digital exhaust people generate incidentally contains more psychological signal than we assumed.
Post 025 showed that statistical pattern matching on this data extracts logistics, not personality. The fine-tuned model learned “ok sounds good.” The literary inference approach extracts psychology (attachment architecture, emotion regulation patterns, decision-making mechanisms) from the same data.
The difference is method, not data. The 267MB archive that the narrative pipeline drew from provided enough signal for literary inference to align with psychometric measurement on 5 of 10 dimensions, with 3 additional dimensions showing developmental change that Psyche captures only at endpoint (consistent at different timepoints rather than independently corroborated), even though fine-tuning on a two-person subset of that data extracted only logistics. The data was always there. The extraction method determined what came out.
This has implications beyond the specific project, though they should be held loosely given N=1 and the entanglement confounds acknowledged earlier. If messaging data contains personality signal accessible through literary inference, then messaging platforms may hold latent personality profiles of their users. The data people generate by coordinating dinner plans and saying “ok sounds good” (the logistics that Post 025 dismissed as psychologically shallow) encodes attachment patterns, regulation strategies, and interpersonal architecture that a sufficiently capable model may be able to decode.
The Kumar and Epley (2021) finding (that voice-based communication creates stronger felt connection than text) likely holds at the level of individual interactions. But at the level of patterns across thousands of messages, the aggregate signal is richer than any individual message suggests. The logistics gap exists message by message. It partially closes across the corpus.
Aim Higher Than Replication
The comparison between Psyche and the narrative pipeline suggests a direction for personality measurement that existing instruments can’t reach.
Current psychometric instruments measure static traits through self-report. You answer ~786 items about your typical behavior. The instruments produce dimensional scores. The scores are reliable and valid within their measurement framework. But they are fundamentally limited to what the instruments ask about and what the subject is willing and able to report.
The narrative approach, applied more systematically with better evidential grounding and independent validation, could eventually capture constructs that existing instruments can’t measure:
Temporal dynamics. How traits evolve across arcs and life stages. Not “what is your attachment anxiety score” but “how has your attachment architecture changed across partners and decades.”
Failure modes under load. What happens to regulation strategies when they’re overwhelmed. Not “how do you typically manage stress” but “what does your regulatory architecture look like when it’s insufficient.”
Interaction signatures. How personality manifests differently across specific relational contexts: not averages across situations, but behavioral fingerprints specific to each context.
Sublinguistic markers. Emoji rates, ellipsis patterns, message length distributions, timing between messages. These are not personality traits. They are behavioral traces that encode personality in ways self-report instruments can’t access, because the subject isn’t aware they’re producing them.
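A sketch of what extracting such markers could look like, assuming timestamped messages; the feature names are invented for illustration:

```python
from datetime import datetime
from statistics import mean, median

def sublinguistic_markers(messages: list[tuple[datetime, str]]) -> dict[str, float]:
    """Behavioral traces the sender is not aware of producing."""
    texts = [text for _, text in messages]
    stamps = [ts for ts, _ in messages]
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return {
        "median_len_chars": median(len(t) for t in texts),
        "ellipsis_rate": sum("..." in t for t in texts) / len(texts),
        "mean_gap_seconds": mean(gaps) if gaps else 0.0,
    }
```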
The goal shouldn’t be to replicate existing instruments using natural language (to build a “Big Five from text messages” that converges with the NEO-300). The goal should be to measure what existing instruments can’t. The narrative pipeline’s most interesting inferences (the analysis-as-action-substitute, the architecture of processing through proxies, the trajectory from hedging to directness) are constructs that no standard instrument operationalizes. If they’re accurate, they represent measurement capacity that extends beyond the current psychometric toolkit.
Limitations and Honest Uncertainty
The evaluation produced 5 CONSISTENT, 3 EVOLVED, 2 PARTIALLY CONSISTENT, and 0 DIVERGENT ratings across 10 dimensions. The zero divergence result is potentially suspicious.
The structural explanation: the entanglement (shared model, overlapping corpus) makes true divergence unlikely. The same model gravitating toward coherent personality rendering suppresses contradictions.
The confirmation bias question: did the evaluation find convergence because convergence was expected? Mitigation: no ratings were preset. The results are heterogeneous (the “evolved” and “partially consistent” ratings document genuine complexity that an evaluation seeking confirmation would smooth over). But this is a mitigation, not an elimination.
The authorship question: when Opus constructs the subject’s inner life, whose psychology is it? Three possibilities. First, the subject’s actual psychology, successfully inferred from texting patterns. Second, Opus’s projection, imposing literary and psychological coherence on noisy behavioral data. Third, an entangled construction: partly the subject’s signal, partly Opus’s rendering tendencies, inseparable. The evidence favors the third option. The sourced quotes establish behavioral facts. The constructed passages synthesize them using frameworks that are at least partially Opus’s own. The personality is real. The articulation is Opus’s.
What a truly independent test would require: a profile generated by Method A from Corpus X, tested against observations from Method B on Corpus Y, where X and Y share no common data and Methods A and B share no common model. The closest achievable version: administer Psyche to someone, then have a different model (not Claude) generate behavioral predictions from a corpus Psyche never saw (work emails, not personal writing), and compare. Even then, the human subject is the common factor. True independence is methodologically impossible when studying a single person.
The 1M Context Experiment
Updated March 2026. The original post evaluated qualitative convergence between methods. This section reports a quantitative follow-up: how much does the narrative pipeline’s context window affect personality signal preservation?
The qualitative analysis above evaluated whether the narrative captured the right psychological mechanisms: the attachment architecture, the regulation sequences, the trajectory from hedging to directness. The quantitative analysis below asks a different question: does the narrative preserve the person’s current personality profile as measured by Psyche? Both matter independently. The qualitative comparison tests whether the narrative identifies the right mechanisms. The quantitative comparison tests whether it calibrates those mechanisms at the right intensity.
Everything above compared what the narrative pipeline inferred against what Psyche measured. The comparison was qualitative: does Opus construct the right attachment architecture, the right regulation sequence, the right competence gap? The answer was mostly yes (5 dimensions consistent, 3 showing documented evolution), with caveats about entanglement and articulateness inflation.
The question this section asks is different. Not what did Opus infer, but how accurately does the personality signal in the generated narrative match the measured profile? And critically: when the accuracy improves, what caused the improvement?
The Setup
The original narrative pipeline operated within a 100-150K token context window. For each chapter, an LLM generated a research brief (a curated list of 5-15 messages with emotional arcs), and a separate writing agent produced the chapter from that brief plus about 500 pre-filtered messages. The model never saw the full message archive at once. It saw summaries of summaries, each layer of abstraction compressing and potentially distorting the source signal.
Claude’s context window expanded to 1 million tokens in early 2026. The later arc was selected for the 1M experiment because the full archive (19,867 messages, roughly 534,000 tokens) fits in a single context window, making this a qualitative architectural difference (every agent in the pipeline sees everything) rather than a quantitative scaling exercise. This made it possible to regenerate the narrative with a fundamentally different architecture: instead of working from curated briefs, the model could read every message directly while planning the outline, mapping emotional phases, and writing each chapter.
The regenerated narrative used the same model (Opus 4.6), the same first-person memoir format, the same citation system. The only architectural change: the planning and writing agents could see all 19,867 messages rather than working from pre-filtered subsets.
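Schematically, the only thing that changed is what each agent is allowed to read. A hypothetical configuration contrast (field names invented; the numbers are the ones reported above):

```python
from dataclasses import dataclass

@dataclass
class PipelineContext:
    planner_sees: str   # input to the outline / emotional-phase-map agent
    writer_sees: str    # input to each chapter-writing agent
    budget_tokens: int

# Old pipeline: layered summaries; no agent ever sees the full archive.
old_pipeline = PipelineContext(
    planner_sees="LLM-curated briefs (5-15 messages per chapter)",
    writer_sees="chapter brief + ~500 pre-filtered messages",
    budget_tokens=150_000,
)

# 1M pipeline: every agent reads all 19,867 messages (~534K tokens).
one_m_pipeline = PipelineContext(
    planner_sees="full archive",
    writer_sees="full archive",
    budget_tokens=1_000_000,
)
```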
Measuring Personality Accuracy
To quantify personality signal preservation, a personality evaluation model reads the generated narrative and infers Big Five domain scores (Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness) using an assessment-optimized prompt: a structured evaluation template that asks the model to cite specific textual evidence for each domain score, based on the prompting methodology from Peters and Matz (2024), which demonstrated r=.443 correlation with self-report versus r=.117 for generic prompting.

The inferred scores are compared against two benchmarks. The primary benchmark is the Opus corpus reference profile (N=51.1, E=28.3, O=88.6, A=36.4, C=61.4): Opus’s own inference about the person from a 1.47-million-word writing corpus. This is itself a model output, not an objective measurement, so Opus-vs-Opus comparisons measure internal consistency. The secondary benchmark is the Psyche merged profile (N=56.7, E=28.7, O=79.1, A=52.6, C=53.3), synthesized from 39 validated instruments plus a semi-structured interview; it provides a stronger test of accuracy because it incorporates data independent of Opus.

The metric is mean absolute delta: the average distance between the narrative-inferred score and the benchmark score across all five domains.
A perfect narrative (one that preserves personality signal with zero distortion) would score mean |Δ| = 0. A narrative about a generic person with no personality signal would score around 33 (the expected absolute distance between two independent uniform draws on a 1-100 scale).
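The metric is a one-liner, and the no-signal baseline follows from E|X - Y| = range/3 for independent uniform draws. A minimal sketch with a Monte Carlo check:

```python
import random

DOMAINS = ("N", "E", "O", "A", "C")

def mean_abs_delta(inferred: dict[str, float], benchmark: dict[str, float]) -> float:
    """Average absolute distance across the five Big Five domains."""
    return sum(abs(inferred[d] - benchmark[d]) for d in DOMAINS) / len(DOMAINS)

# No-signal baseline: two independent uniform draws on a 1-100 scale differ
# by (100 - 1) / 3 ~ 33 on average.
baseline = sum(
    abs(random.uniform(1, 100) - random.uniform(1, 100)) for _ in range(100_000)
) / 100_000  # converges to ~33
```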
The old pipeline narrative scored mean |Δ| = 11.4 against the Opus corpus reference profile. The biggest distortions: Neuroticism inflated by +23.9 points (the narrative portrayed someone far more anxious than the instruments measured) and Agreeableness inflated by +19.3 points (the narrative portrayed someone far warmer).
The 1M pipeline narrative scored mean |Δ| = 3.9. A 66% improvement.
The Neuroticism inflation dropped from +23.9 to +3.9. The Agreeableness inflation dropped from +19.3 to +4.3. Extraversion, Openness, and Conscientiousness (which were already well-preserved) stayed within a few points of the reference profile.
What Changed and What Didn’t
The improvement concentrated in exactly the dimensions where the old pipeline failed worst. N and A are sensitive to sampling bias in ways that E, O, and C are not. When a pipeline pre-filters for emotionally interesting messages (because those make better stories), it systematically inflates apparent emotional reactivity and warmth. The narrator appears always anxious and always generous, because the mundane messages where they’re calm and neutral never made it into the chapter brief.
The 1M pipeline changed both planning and writing context simultaneously. The planner saw all 19,867 messages when designing the outline and emotional phase map; the chapter writers saw the full archive too, including the 18,000 messages about pad grading, radio frequencies, and what to have for dinner. The ablation study (below) later revealed that the planning change was the dominant factor — but the net effect was that with the full behavioral baseline visible, the model correctly inferred that the person’s trait-level neuroticism is moderate with occasional state-level spikes, rather than concluding from a curated highlight reel that the person is constitutionally anxious.
The Ablation That Changed the Story
This ablation is analyzed in detail in Where to Spend Your Context Window.
A 66% improvement sounds definitive, but three confounds clouded the result. The 1M narrative was shorter (20,000 words versus 24,000). The same model evaluated both narratives (potential self-scoring bias). And there were no confidence intervals.
The ablation study tested each confound by removing one variable at a time.
The context restriction test. If the improvement comes from seeing more messages during writing, then restricting the chapter writer back to ~200 messages per chapter should reproduce the old pipeline’s inflation. The ablation kept the 1M pipeline’s outline and discovery artifacts (which were created with full-archive visibility) but limited each chapter writer to a random subsample of 200 messages from its time period.
The result: mean |Δ| = 2.4 (four evaluation runs, 95% CI: [2.0, 2.8]). The restricted version did not reproduce the inflation, confirming that the old pipeline’s failure originated in planning, not in writing.
The word count test. If the improvement comes from the shorter narrative having less contradictory surface area, then writing a longer narrative (~28,000 words, exceeding the old pipeline’s 24,000) should degrade accuracy.
The result: mean |Δ| = 1.7 (four evaluation runs, 95% CI: [1.4, 2.0]). The improvement not only survived at higher word counts but strengthened. The long condition outperformed both the 1M pipeline (3.9) and the filtered ablation (2.4), suggesting that additional output words provide more surface area for the evaluator to detect accurate personality signals rather than introducing contradictions.
Confidence intervals. Four evaluation runs each on all four conditions (four re-evaluations of the same generated text, bounding the evaluator’s scoring variance rather than the pipeline’s generation variance). The Neuroticism difference, which is the largest single improvement, produced non-overlapping 95% confidence intervals: old N = [72.4, 77.4] versus 1M N = [52.5, 55.2]. The filtered and long conditions show even tighter N intervals: filtered N = [54.6, 55.2], long N = [48.5, 51.9]. The N improvement exceeds measurement noise across all new conditions. A separate generation variance test (regenerating the 1M narrative from identical pipeline inputs) confirmed that these CIs are approximately valid for total uncertainty: the second generation scored mean |Δ| = 3.9, identical to the original and inside its CI [3.4, 4.4].
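The intervals are plain t-based CIs over repeated evaluator runs on the same text. A minimal sketch (scipy assumed available):

```python
from statistics import mean, stdev
from scipy.stats import t

def ci95(runs: list[float]) -> tuple[float, float]:
    """95% t-interval over re-evaluation runs of one generated narrative.

    Bounds the evaluator's scoring variance only, not generation variance.
    """
    m, n = mean(runs), len(runs)
    half = t.ppf(0.975, n - 1) * stdev(runs) / n ** 0.5
    return (m - half, m + half)

# Four runs clustering near 3.9 yield an interval like the reported [3.4, 4.4].
```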
Cross-model validation. A factorial corpus evaluation quantified systematic differences between Opus 4.6 and GPT-5.4 as personality evaluators: five source registers (personal SMS, academic writing, casual messaging, AI conversations, and a stratified mix) × two evaluator models × three runs each, for 30 chunked runs plus 9 Opus full-context runs, 39 in total. GPT-5.4 shows consistent positive biases relative to Opus across all registers: Extraversion +15.3, Conscientiousness +19.2, Agreeableness +10.6, Neuroticism +8.3 (each figure the mean of five per-register means, each of those computed from 3-run averages). Openness is the only domain where both models agree (mean bias -2.8), consistent with O being the strongest textual signal in the psychometric literature.
The original cross-model comparison used a single GPT corpus evaluation that happened to sample only academic writing (zero SMS data included due to a sampling bug). The corrected factorial design evaluates each model against the narrative’s actual source register (personal SMS), producing register-appropriate reference profiles with confidence intervals. Under this design, both evaluators agree that the old pipeline inflates Neuroticism substantially and that all 1M-derived conditions reduce this inflation. The finer-grained condition rankings remain evaluator-specific: Opus ranks the newer conditions as improvements; GPT’s systematic C inflation across corpus registers (+19.2 mean) dominates the aggregate metric and flattens between-condition differences.
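The bias figures are means of means. A sketch of the aggregation, assuming a nested structure of per-register, per-evaluator run scores (structure and example values hypothetical):

```python
from statistics import mean

# scores[register][evaluator] -> list of run scores for one Big Five domain,
# e.g. scores["personal_sms"]["gpt"] = [44.0, 46.5, 45.1]  (hypothetical)
def evaluator_bias(scores: dict[str, dict[str, list[float]]]) -> float:
    """GPT-minus-Opus bias for one domain: mean over the five registers of
    the difference between each evaluator's 3-run average."""
    per_register = [mean(runs["gpt"]) - mean(runs["opus"])
                    for runs in scores.values()]
    return mean(per_register)

# Applied per domain, this is how figures like Extraversion +15.3 aggregate.
```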
Generation model confound. A 2×2 experiment ({Opus-generated, GPT-generated} × {Opus-evaluated, GPT-evaluated}) tested whether the generating model affects the personality signal. GPT-5.4 generated a parallel narrative from the same pipeline inputs (outlines, chapter briefs, voice profiles, canonical facts). Under replication (2-3 runs per cell), the four cells measured against the Psyche merged profile: Opus-generated/Opus-evaluated 8.9, Opus-generated/GPT-evaluated 15.5, GPT-generated/Opus-evaluated 8.8, GPT-generated/GPT-evaluated 14.1. The evaluator effect (12.4 points mean |Δ| difference between same-generator cross-evaluator pairs) was nearly 7× the generator effect (1.8 points between same-evaluator cross-generator pairs). The two Opus-evaluated cells converged under replication (8.9 vs 8.8), confirming that generators produce indistinguishable profiles when read by the same evaluator. A formal equivalence test places the generator effect inside a ±3-point region of practical equivalence.
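The equivalence claim is a region-of-practical-equivalence check: the generator effect’s entire interval, not just its point estimate, must sit inside ±3 points. A minimal sketch of the decision rule (the interval bounds are parameters here, not reported values):

```python
def within_rope(ci_low: float, ci_high: float, bound: float = 3.0) -> bool:
    """Practical equivalence: the whole CI lies inside [-bound, +bound]."""
    return -bound <= ci_low and ci_high <= bound

# The 1.8-point generator effect counts as negligible only because its
# full interval stays inside the +/-3 region.
```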
| Condition | Messages per Chapter | Words | Mean \|Δ\| (Opus) | 95% CI | Runs |
|---|---|---|---|---|---|
| Old pipeline | ~500 (pre-filtered) | 24,000 | 11.4 | [9.9, 12.9] | 4 |
| 1M pipeline | All 19,867 | 20,000 | 3.9 | [3.4, 4.4] | 4 |
| Filtered (1M outline, 200 msgs) | 200 (subsampled) | 32,000 | 2.4 | [2.0, 2.8] | 4 |
| Long (1M, higher word target) | All in time range | 28,000 | 1.7 | [1.4, 2.0] | 4 |
Per-domain deltas against the Opus corpus reference profile (N=51.1, E=28.3, O=88.6, A=36.4, C=61.4):
| Condition | ΔN | ΔE | ΔO | ΔA | ΔC |
|---|---|---|---|---|---|
| Old | +23.9 | -3.6 | +2.7 | +19.3 | +7.6 |
| 1M | +3.9 | +2.7 | +2.4 | +4.3 | +6.3 |
| Filtered | +3.8 | -0.9 | +2.2 | +1.6 | -2.8 |
| Long | -0.9 | -2.2 | +1.5 | +0.6 | -3.1 |
The old pipeline’s inflation concentrates in two domains: Neuroticism (+23.9) and Agreeableness (+19.3). The newer conditions correct both, with the improvement in N being the most robust finding across evaluators and benchmarks. Against the Opus reference profile, O and E are well-preserved across all conditions, meaning those dimensions are insensitive to context management.
What the Ablation Reveals
The first hypothesis was clean: more context during writing produces better personality signal. The filtered ablation killed it. Restricting the chapter writer to 200 messages did not reproduce the old pipeline’s failure (2.4, not 11.4). The old pipeline’s distortion originated in planning, not in writing.
The old pipeline’s chapter briefs (curated lists of 5-15 messages per chapter) were themselves generated by an LLM working from limited context. Each layer of abstraction amplified the narrative framing. “He texted about anxiety” became “his anxiety about X” became a chapter where anxiety was the dominant emotional color. The 1M pipeline collapsed these layers by giving the planner everything. The planner got the person right. The writer just had to follow the plan. This much is confirmed by all four conditions outperforming the old pipeline by a wide margin.
The second hypothesis was more specific: having 19,867 messages in the writing context introduces distraction rather than signal. Initial single-run data appeared to support this (filtered and long both scored 1.8 versus 1M’s 3.9). With four evaluation runs each, the picture changes. The long condition (full 1M context, ~28,000 words) scores 1.7 ± 0.3, outperforming filtered (restricted writing context, ~32,000 words) at 2.4 ± 0.4. The confidence intervals barely overlap. Full writing context does not add noise; it appears to help.
What explains the 1M pipeline’s 3.9 falling behind both filtered (2.4) and long (1.7)? The variable that separates 1M from both ablation conditions is output length: the 1M pipeline produced ~20,000 words, while filtered produced ~32,000 and long produced ~28,000. Longer narratives provide more surface area for the personality evaluation prompt to detect accurate signals. The 1M pipeline’s 20,000-word output may have been too compressed to fully express the personality profile, even though the model had all the right context to work from.
The ranking under Opus evaluation (long 1.7 < filtered 2.4 < 1M 3.9 < old 11.4) suggests three factors ordered by effect size. Planning context is the dominant factor: all three new conditions, which share 1M-derived planning artifacts, massively outperform the old pipeline. Output length is a secondary factor: longer narratives score better, likely by giving the evaluator more behavioral evidence to work with. Writing context restriction has a small negative effect: filtered (restricted) underperforms long (unrestricted) when both have adequate output length, suggesting the full archive provides useful calibration during writing that 200 subsampled messages cannot.
Against the Merged Ground Truth
All results above use the Opus corpus reference profile as the benchmark. Against the Psyche merged profile (the stronger, instrument-derived benchmark defined in the Measuring Personality Accuracy section), the pattern holds:
| Condition | Mean \|Δ\| (Opus ref.) | Mean \|Δ\| (Merged) |
|---|---|---|
| Old pipeline | 11.4 | 10.7 |
| 1M pipeline | 3.9 | 8.4 |
| Filtered | 2.4 ± 0.4 | 7.6 |
| Long | 1.7 ± 0.3 | 8.5 |
The improvement is smaller against the merged profile: the old-to-best gap is 9.7 points against the Opus reference (11.4 down to 1.7) but only 3.1 points against the merged profile (10.7 down to 7.6). This is expected: the merged profile has higher Agreeableness (52.6 versus 36.4) and lower Openness (79.1 versus 88.6), which changes which dimensions drive the delta. But the direction is consistent. The 1M-derived narratives outperform the old pipeline against both benchmarks.
Disaggregating by domain against the merged profile reveals what the mean hides:
| Condition | ΔN | ΔE | ΔO | ΔA | ΔC |
|---|---|---|---|---|---|
| Old | +18.3 | -4.0 | +12.2 | +3.1 | +15.7 |
| 1M | -1.7 | +2.3 | +11.9 | -11.9 | +14.4 |
| Filtered | -1.9 | -1.1 | +12.3 | -14.4 | +8.3 |
| Long | -6.7 | -3.2 | +11.4 | -15.8 | +5.2 |
Two patterns emerge. First, Openness is inflated by +11 to +12 points across every condition against the merged profile (where instrument-measured O is 79.1, substantially below both LLM corpus inferences of ~89-91). Facet decomposition across five source registers confirms that this stability is genuine rather than a ceiling artifact of a single subfacet: removing the highest-scoring facet (Ideas) does not reduce the cross-register range (8.8 points with or without Ideas). All O subfacets are elevated in narrative text, consistent with literary memoir being inherently “open” in register. This inflation is not fixable by pipeline architecture.
Second, Agreeableness reverses direction between the old and new pipelines. The old pipeline overshoots (+3.1), while the new conditions undershoot by 12-16 points. The old pipeline’s planning artifacts, curated from emotionally intense messages, selected for warmth-coded interactions. The 1M pipeline’s planners saw the full archive, including the majority of messages that are transactional, terse, or mildly antagonistic, and produced a more reserved characterization. Against the Opus corpus reference (A=36.4), this undershooting looks like accuracy. Against the merged profile (A=52.6), it looks like a miss in the opposite direction.
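Returning to the Openness point: the leave-one-out facet check is straightforward to express. A sketch assuming per-register facet scores in a nested dict (structure and names hypothetical):

```python
# o_facets[register] -> {facet_name: score}, e.g.
# o_facets["academic"] = {"ideas": 97.0, "aesthetics": 85.0, ...}  (hypothetical)
def cross_register_range(o_facets: dict[str, dict[str, float]],
                         drop: str | None = None) -> float:
    """Spread of the O domain mean across registers, optionally with one
    facet (e.g. "ideas") removed before averaging."""
    means = []
    for facets in o_facets.values():
        kept = [v for name, v in facets.items() if name != drop]
        means.append(sum(kept) / len(kept))
    return max(means) - min(means)

# If the range is ~8.8 both with drop=None and drop="ideas", the elevation
# is not a single-facet ceiling artifact.
```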
The Differential Personality Test
Updated March 2026. The 1M experiment tested whether context management affects personality fidelity. This section tests whether the personality signal is real in the first place.
Everything above analyzed narratives written from the subject’s first-person perspective. The personality profiles derived from those narratives resembled the subject’s measured profile. But there’s an uncomfortable alternative explanation: maybe Opus produces similar personality profiles regardless of whose perspective it writes from. If the model has a fixed “authorial fingerprint” that bleeds into every narrative, the apparent signal preservation is an artifact rather than a finding.
The narrative pipeline produced more than the subject’s account. Each arc also generated the interlocutor’s first-person perspective, a third-person omniscient account, and a dual-perspective narrative. If the model genuinely encodes personality signal from the content, these perspectives should produce different profiles. If it stamps a fixed fingerprint, they should all look the same.
They don’t all look the same.
| Perspective | N | E | O | A | C | Mean Δ from subject |
|---|---|---|---|---|---|---|
| Subject (corpus baseline) | 51.1 | 28.3 | 88.6 | 36.4 | 61.4 | — |
| Arc 1 interlocutor (1st person) | 64.7 | 36.0 | 79.3 | 42.0 | 44.3 | 10.7 |
| Arc 2 interlocutor (1st person) | 50.7 | 62.3 | 80.0 | 54.0 | 62.3 | 12.3 |
| Third-person (arc 1) | 63.7 | 26.7 | 89.7 | 35.7 | 64.0 | 3.7 |
| Third-person (arc 2) | 71.0 | 24.5 | 90.0 | 38.0 | 60.0 | 5.6 |
| Blended (arc 1) | 65.5 | 24.0 | 90.8 | 40.5 | 55.0 | 6.3 |
| Blended (arc 2) | 69.5 | 24.5 | 92.5 | 35.5 | 55.0 | 6.7 |
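The final column is the same mean absolute delta metric applied row by row; reproducing the arc 1 interlocutor entry from the table’s own values:

```python
subject = {"N": 51.1, "E": 28.3, "O": 88.6, "A": 36.4, "C": 61.4}
arc1_interlocutor = {"N": 64.7, "E": 36.0, "O": 79.3, "A": 42.0, "C": 44.3}

delta = sum(abs(arc1_interlocutor[d] - subject[d]) for d in subject) / 5
# 10.66 -> rounds to the 10.7 shown in the table
```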
The gradient is clean. When the narrative is about the subject but told from outside (third-person), the profile stays close: 3.7 and 5.6 points from baseline. The same person, different literary mode, similar personality. When the narrative constructs a different person’s interior life (the interlocutor), the profile diverges: 10.7 and 12.3 points.
The single strongest piece of evidence is the arc 2 interlocutor’s Extraversion: 62.3. The subject’s is 28.3. That’s a 34-point shift on a single dimension, placing a different person above the population mean on a trait where the subject scores near floor. That is not a model stamping its default voice. That is a model reading the content and inferring a different person.
The dual-PoV narratives, which alternate between both perspectives, produce intermediate deltas (6.3 and 6.7). Blended input, blended output. The model is averaging two personality signals, not defaulting to one.
One systematic pattern cuts across all perspectives: Neuroticism is elevated everywhere (+18 to +24 points above the SMS corpus baseline). Facet decomposition reveals this inflation concentrates in anxiety (+31), vulnerability (+29), and self-consciousness (+17) — facets foregrounded by literary memoir’s emotional register — while anger (-0.4) and impulsiveness (-6.7) are unaffected. The genre effect is facet-specific, not trait-wide. This is the literary genre effect documented earlier. First-person memoir and relationship narratives foreground emotional content regardless of whose emotions they’re about. The N inflation is a property of the genre, not the person. The other four dimensions (E, O, A, C) all show perspective-dependent differentiation.
This resolves the ambiguity that hung over the signal preservation analysis. The model does not have a fixed personality it stamps onto every narrative. It infers personality from the content and produces character-specific output. When it writes as the subject, the profile resembles the subject. When it writes as the interlocutor, the profile resembles someone who is dramatically more extraverted, more agreeable, and less open. These are two different people rendered differently by the same model from the same underlying data.
The simulator hypothesis predicted exactly this: the model infers the latent agent from the text, with the specific agent determined by the narrative perspective rather than fixed in the weights. The differential experiment provides quantitative evidence that this prediction holds for personality specifically.
What This Means
Two AI-mediated approaches to personality converge more than they diverge when applied to the same person. The convergence is partially artifactual (shared model, shared data). The divergence is informative (scope gaps in messaging data, temporal evolution). Neither approach alone is sufficient. Together, they suggest that digital exhaust contains more psychological signal than we assumed, and that the extraction method matters more than the data source.
The 1M context experiment adds a quantitative dimension to that finding. The same pipeline architecture with different context management produces dramatically different profile fidelity under Opus evaluation: 11.4 mean delta for the old pipeline versus 3.9 for the 1M pipeline, and the ablation conditions score better still (filtered 2.4, long 1.7, all four conditions with four evaluation runs each). The improvement is concentrated in Neuroticism, which is the one finding robust enough to survive cross-model evaluation: both Opus and GPT-5.4 agree the old pipeline inflates N substantially, and all newer conditions reduce that inflation. The finer-grained ranking among conditions does not hold across evaluators, which means interpretations beyond “planning context matters most” should be treated as Opus-specific observations rather than established findings.
A factorial corpus evaluation across five source registers confirmed that the two evaluators have systematic, register-independent biases (GPT inflates E by +15.3, C by +19.2 relative to Opus), but also that register genuinely matters: the same person’s personality reads differently in intimate texting, academic writing, casual messaging, and analytical AI conversations. The corpus evaluation closest to this pipeline’s source material (SMS with the arc 2 interlocutor) produces the most appropriate reference profile for evaluating narrative fidelity.
The generation confound — the possibility that apparent personality signal was an artifact of shared-model entanglement rather than genuine signal in the text — was tested by having GPT-5.4 generate a parallel narrative from identical pipeline inputs. The evaluator effect was nearly 7× the generator effect under replication. The personality signal is in the text; the generating model’s contribution is formally negligible (1.8 points, inside a ±3-point equivalence bound). When the planner works from curated excerpts, the narrative inflates emotional reactivity. When the planner sees everything, the Neuroticism distortion largely disappears, and this holds across both tested generating models (Opus 4.6 and GPT-5.4).
The practical implication remains uncomfortable and is now quantified. The old pipeline diverged from the benchmark in a specific, measurable direction: it portrayed someone 24 points more neurotic and 19 points more agreeable than the Opus corpus reference profile. The 1M pipeline reduces the Neuroticism distortion to 4 points against the same reference (or 2 points against the instrument-derived merged profile). Agreeableness tells a more complicated story: it drops from +19 to +4 against the Opus reference (an improvement), but against the merged profile it swings from +3 to -12 (overcorrecting in the opposite direction). Which benchmark you trust determines whether the 1M pipeline improved A or merely reversed its error. The difference between “someone who is always anxious” and “someone who is occasionally anxious but mostly calm and competent” is not a matter of literary interpretation. It is a measurable distortion introduced by the pipeline’s context management, and it is fixable.
Your texts don’t contain you. But they contain more of you than you think. And how much of you comes through depends, with surprising precision, on how many of those texts the AI reads before it starts writing — and on which model is doing the reading.