Where to Spend Your Context Window

The assumption that giving a language model more context during generation improves output quality is widespread enough to function as an axiom in most pipeline designs. The reasoning is straightforward: more information available at the point of generation means better informed generation. Retrieval-augmented generation, long context models, and the steady expansion of context windows from a few thousand to 128,000 to 1,000,000 tokens all reinforce the same premise: that the bottleneck is how much the model can see while producing its output.

An ablation study on a narrative generation pipeline suggests this premise is incomplete in a specific and measurable way. The single largest factor in output quality was not how much the model saw while writing, but how much the model saw while planning what to write: all three conditions that used planning artifacts derived from the full archive massively outperformed the condition that planned from limited context. But the picture is more nuanced than “planning is all that matters.” Among the new conditions, output length emerged as a secondary factor (longer narratives scored better), and restricting writing context slightly hurt rather than helped. The context window’s value is real at every pipeline phase, but the planning phase is where restricted context does the most damage.

The Pipeline and the Problem

The experimental context is a narrative generation pipeline that converts text message archives into first person literary memoir, with the generated narrative’s personality signal measured against two benchmarks: an Opus corpus reference profile (Opus’s own Big Five inference from the writing corpus, measuring internal model consistency) and a Psyche merged profile (synthesized from 39 validated instruments plus a semi-structured interview, providing an independent accuracy test). The pipeline has been described in detail elsewhere (From Text Messages to Literary Memoir, What AI Learns About You When You’re Not Looking). What matters here is the architecture.

The original pipeline operated within a 100,000 to 150,000 token context window. For each chapter of the memoir, an LLM generated a research brief: a curated list of 5 to 15 messages per chapter along with an emotional arc description. A separate writing agent produced the chapter from that brief plus approximately 500 filtered messages. The model never saw the full message archive at once. It saw summaries of summaries, each layer of abstraction compressing the source signal and potentially amplifying distortions in the process.

When the context window expanded to 1,000,000 tokens, the full archive (19,867 messages at roughly 534,000 tokens) fit in a single context. The pipeline was rebuilt: a discovery agent read all messages and produced emotional phase maps, canonical facts, and character notes. An outline agent designed the chapter structure with full archive visibility. Chapter writers received all messages plus the discovery artifacts plus the outline plus all prior chapters. The same model (Opus 4.6), the same first person memoir format, the same citation system. The key architectural change was that every agent in the pipeline could see everything.

The personality metric is mean absolute delta (mean |Δ|): the average distance between the Big Five domain scores inferred from the narrative and the scores from the Opus corpus reference profile (results against both benchmarks are reported in the Results section). Lower is better. The old pipeline scored 11.4. The 1M pipeline scored 3.9 under Opus evaluation, representing a 66% improvement concentrated almost entirely in two dimensions: Neuroticism (inflated by +23.9 in the old pipeline, reduced to +3.9 in the new) and Agreeableness (inflated by +19.3, reduced to +4.3). Extraversion, Openness, and Conscientiousness showed minimal change across pipeline architectures, which is consistent with the hypothesis that these dimensions are relatively robust to context window size because they are visible in any sufficiently large text sample, while Neuroticism and Agreeableness are highly sensitive to which messages the model sees.
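The metric can be sketched in a few lines. The domain abbreviations and the example scores below are illustrative, not the study's data; only the definition (average absolute distance across the five domains, on a 0-100 scale) comes from the text above.

```python
# Sketch of the mean absolute delta (mean |Δ|) metric described above.
# Scores are on a 0-100 scale; lower mean |Δ| is better.

DOMAINS = ["O", "C", "E", "A", "N"]

def mean_abs_delta(inferred: dict, reference: dict) -> float:
    """Average |inferred - reference| across the five Big Five domains."""
    return sum(abs(inferred[d] - reference[d]) for d in DOMAINS) / len(DOMAINS)

# Hypothetical example: a narrative that inflates Neuroticism by 20 points
# but matches the reference everywhere else.
inferred  = {"O": 90, "C": 60, "E": 45, "A": 70, "N": 71}
reference = {"O": 90, "C": 60, "E": 45, "A": 70, "N": 51}
print(mean_abs_delta(inferred, reference))  # → 4.0
```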

The Ablation Design

Ablation, borrowed into machine learning from neuroscience and early cognitive science (where it refers to the surgical removal of a brain region to determine what function that region served), follows an analogous methodology: remove or disable one component at a time and measure whether the system’s performance changes. If performance degrades when a component is removed, that component was contributing. If performance is unchanged, the component was either redundant or compensated for by other mechanisms.

The 1M pipeline’s improvement over the old pipeline confounded multiple variables simultaneously. The 1M pipeline used a different planning architecture (discovery agents reading all messages, rather than curated briefs). It produced shorter output (20,000 words versus 24,000). And the same model evaluated both pipelines, raising the possibility that Opus was scoring its own writing favorably. Two ablations and two validation analyses were designed to isolate each confound.

The context restriction ablation asked whether restricting the chapter writer’s message visibility to approximately 200 messages per chapter (simulating the old pipeline’s restricted access) would reproduce the old pipeline’s personality inflation. The ablation preserved the 1M pipeline’s outline and discovery artifacts (which were created with full archive visibility) but restricted each chapter writer to a random subsample of 200 messages from its time period, selected with a fixed seed for reproducibility. If context during writing is the driver, the restricted version should score closer to 11.4 than to 3.9.
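The subsampling step can be sketched as follows. The message schema (a `ts` timestamp field) and the seed value are assumptions for illustration; the fixed-seed subsample of 200 messages per chapter window is from the ablation design above.

```python
import random

# Sketch of the context-restriction ablation: each chapter writer receives
# a fixed-seed random subsample of 200 messages from its time window.
# The "ts" field name and seed=42 are illustrative assumptions.

def restricted_context(messages, window, k=200, seed=42):
    """Reproducibly subsample k messages inside a chapter's time window."""
    start, end = window
    in_window = [m for m in messages if start <= m["ts"] <= end]
    if len(in_window) <= k:
        return in_window
    sample = random.Random(seed).sample(in_window, k)  # fixed seed -> reproducible
    return sorted(sample, key=lambda m: m["ts"])       # restore chronology
```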

The length ablation asked whether the improvement was partly an artifact of shorter text having less contradictory surface area. The ablation used the full 1M pipeline architecture but instructed the chapter writer to produce longer chapters (1,800 to 2,200 words each, targeting approximately 28,000 words total, exceeding the old pipeline’s 24,000). If brevity is the confound, the longer version should score worse than 3.9.

Confidence intervals addressed measurement noise. The Opus evaluation was run four times each on the old and 1M narratives (four re-evaluations of the same generated text, bounding evaluator scoring variance rather than pipeline generation variance) to produce 95% confidence intervals per domain. A subsequent generation variance test (regenerating the 1M narrative from identical inputs) confirmed these CIs are approximately valid for total uncertainty: the second generation scored |Δ| = 3.9, identical to the original.
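The per-domain intervals can be computed with a standard small-sample Student-t interval; a minimal sketch, with illustrative run values rather than the study's raw scores:

```python
import statistics

# Sketch of 95% confidence intervals from a small number of re-evaluation
# runs. Two-sided 95% Student-t critical values, indexed by degrees of
# freedom (n - 1); for four runs, df = 3 and t = 3.182.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}

def ci95(runs):
    """Mean and 95% CI for a small sample of repeated evaluation runs."""
    n = len(runs)
    mean = statistics.mean(runs)
    sem = statistics.stdev(runs) / n ** 0.5  # standard error of the mean
    half = T_CRIT_95[n - 1] * sem
    return mean, (mean - half, mean + half)

# Illustrative four-run sample
mean, (lo, hi) = ci95([3.4, 4.1, 3.8, 4.3])
print(f"{mean:.1f} [{lo:.1f}, {hi:.1f}]")
```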

A cross-model evaluation using GPT-5.4 (via Codex) addressed the self-scoring concern through two approaches: a factorial corpus evaluation (5 source registers × 2 models × 3 runs = 30 chunked runs plus 9 full-context Opus runs, 39 total) to quantify systematic evaluator biases, and a 2×2 generation confound experiment to test whether the generating model affects personality signal.

The Results

| Condition | Messages per Chapter | Total Words | Mean \|Δ\| | 95% CI | Runs |
|-----------|---------------------|-------------|-----------|--------|------|
| Old pipeline (100-150K planning context) | ~500 | 24,000 | 11.4 | [9.9, 12.9] | 4 |
| 1M pipeline (full context throughout) | 19,867 | 20,000 | 3.9 | [3.4, 4.4] | 4 |
| Filtered (outline from 1M, 200 msgs/chapter) | 200 | 32,000* | 2.4 | [2.0, 2.8] | 4 |
| Long (1M context, higher word targets) | All in time range | 28,000 | 1.7 | [1.4, 2.0] | 4 |

All four conditions now have four evaluation runs each (re-evaluations of the same generated text, bounding evaluator scoring variance). The ranking under Opus evaluation is: long (1.7) < filtered (2.4) < 1M (3.9) < old (11.4). The filtered and long confidence intervals barely overlap at 2.0, suggesting the difference between them is real rather than noise. The dramatic gap between old and the three new conditions (all of which share 1M-derived planning artifacts) is the clearest finding: planning context is the dominant factor.

*The filtered condition produced 32,000 words despite no explicit higher word-count instruction, suggesting that restricted writing context may prompt more verbose narration. This introduces an output-length confound between filtered and 1M that the analysis addresses below.

The confidence intervals for the Neuroticism dimension, which showed the largest single improvement, were non-overlapping at 95% confidence: old pipeline N = [72.4, 77.4] versus 1M pipeline N = [52.5, 55.2], against a reference profile of 51.1. Agreeableness and Conscientiousness intervals overlapped, indicating that this four-run sample does not show clear separation between conditions in those domains.

The factorial corpus evaluation revealed systematic biases between the two evaluators that are consistent across all five source registers (personal SMS, academic writing, casual messaging, AI conversations, and a stratified mix). GPT-5.4 scores higher than Opus on every domain except Openness: E +15.3, C +19.2, A +10.6, N +8.3 (mean of five per-register means, each computed from 3-run averages). Openness is the only domain where both models agree (mean bias -2.8), consistent with O being the strongest textual signal. The C inflation is the most damaging to cross-model comparison: GPT applies +17.3 to +22.8 points of Conscientiousness inflation across source registers (mean +19.2), which dominates aggregate metrics and flattens between-condition differences. Both evaluators agree the old pipeline shows the largest Neuroticism inflation and that all 1M-derived conditions reduce it. The finer-grained condition rankings (long vs filtered vs 1M) remain evaluator-specific.
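The bias aggregation described above (mean of per-register means, each a 3-run average) reduces to a simple difference-and-average; a sketch with hypothetical per-register Conscientiousness means, not the study's raw data:

```python
import statistics

# Sketch of the evaluator-bias computation: for one domain, take the
# GPT-minus-Opus score difference per source register (each value already
# a 3-run average), then average across the five registers.
# The per-register means below are hypothetical.

def evaluator_bias(per_register_gpt, per_register_opus):
    """Mean of per-register (GPT - Opus) score differences for one domain."""
    diffs = [g - o for g, o in zip(per_register_gpt, per_register_opus)]
    return statistics.mean(diffs)

gpt_c  = [78.1, 82.8, 79.0, 80.3, 77.3]  # hypothetical C means, 5 registers
opus_c = [60.8, 60.0, 59.5, 61.0, 60.2]
print(round(evaluator_bias(gpt_c, opus_c), 1))  # → 19.2
```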

A secondary finding from the factorial design: Opus full-context evaluation (all text in a single prompt) produced worse personality inference than chunked evaluation across all three tested sources (academic: 5.8 vs 15.8, mixed: 8.3 vs 12.0, SMS: 8.3 vs 9.1 mean |Δ|). Follow-up ablations clarified the mechanism: the polarization is not caused by temporal ordering (shuffling the message chronology reproduces the same bias) but by text volume itself. Full-context evaluation systematically pushes Neuroticism up and Extraversion down regardless of message order. For personal SMS data, the effect is prompt-addressable: instructing the evaluator to “weight mundane passages as heavily as emotionally intense passages” eliminates the polarization, returning all five domains to within one point of the chunked baseline. Follow-up analysis revealed that the apparent register-specificity was an artifact of comparing against the wrong baseline: the debiased prompt works on academic writing too, where the mechanism is disruption of trajectory-based narrative arc construction (the evaluator perceives a career trajectory of increasing rigor in chronologically ordered papers). Shuffling temporal order produces equivalent results to the debiased prompt across registers, confirming that the underlying mechanism is attentional anchoring on narrative arcs rather than a register-specific failure mode.
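The chunked strategy that wins here amounts to inferring a profile per chunk and averaging per domain. A minimal sketch, where `infer` stands in for the LLM personality-inference call (which is outside the scope of this example):

```python
# Sketch of chunked evaluation: infer a Big Five profile per chunk, then
# average each domain across chunks. `infer` is a stand-in for the LLM call.

def chunked_profile(chunks, infer):
    """Average per-chunk Big Five inferences (each a domain -> score dict)."""
    profiles = [infer(c) for c in chunks]
    domains = profiles[0].keys()
    return {d: sum(p[d] for p in profiles) / len(profiles) for d in domains}

# Stub inference to show the mechanics: two chunks with divergent scores
# average out, illustrating the diversification effect described above.
stub = lambda chunk: {"N": chunk["n_score"], "E": chunk["e_score"]}
chunks = [{"n_score": 70, "e_score": 40}, {"n_score": 50, "e_score": 50}]
print(chunked_profile(chunks, stub))  # → {'N': 60.0, 'E': 45.0}
```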

A 2×2 generation confound experiment ({Opus-generated, GPT-generated} × {Opus-evaluated, GPT-evaluated}) tested whether the generating model’s identity affects the personality signal. GPT-5.4 produced a parallel 14-chapter narrative from the same pipeline inputs (outlines, chapter briefs, voice profiles, canonical facts). Under replication (2-3 runs per cell), the evaluator effect (12.4 points mean |Δ| difference between same-generator cross-evaluator conditions) was nearly 7× the generator effect (1.8 points between same-evaluator cross-generator conditions). The two Opus-evaluated cells converged to within 0.1 points of each other, confirming that generators produce indistinguishable profiles when read by the same evaluator. A formal equivalence test places the generator effect inside a ±3-point region of practical equivalence.
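The equivalence test reduces to a region-of-practical-equivalence (ROPE) check: the effect is treated as negligible when its confidence interval sits entirely inside the ±3-point bound. The CI endpoints below are illustrative; the text above reports only the point estimates (1.8 and 12.4).

```python
# TOST-style shortcut for the practical-equivalence check described above:
# an effect is negligible when its whole CI lies inside [-bound, +bound].
# The CI endpoints here are hypothetical.

def within_rope(ci_low, ci_high, bound=3.0):
    """True if the entire CI sits inside the +/-bound equivalence region."""
    return -bound < ci_low and ci_high < bound

print(within_rope(0.9, 2.7))    # generator effect: practically equivalent
print(within_rope(10.0, 14.8))  # evaluator effect: clearly not
```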

Against the Psyche merged profile (the instrument-derived benchmark, synthesized from 39 instruments plus a semi-structured interview rather than Opus corpus inference), the improvement pattern holds: old pipeline 10.7, 1M pipeline 8.4, filtered 7.6, long 8.5. The deltas are generally larger for the 1M-derived conditions because the merged profile differs from the Opus inference on Agreeableness and Openness. The ranking shifts: against the merged profile, filtered (7.6) edges ahead of long (8.5), reversing their order against the Opus reference. This is because the merged profile penalizes the long condition’s N undershoot (-6.7) more heavily. Per-domain disaggregation reveals that Openness is inflated by +8 to +12 points across all conditions against the merged profile (where instrument-measured O at 79.1 is substantially below both LLM corpus inferences of ~89-91), a structural artifact of narrative generation (literary memoir is inherently ideas-oriented text). The Neuroticism correction from old (+18.3) to 1M (-1.7) remains the clearest signal against the merged profile.

What the Ablation Reveals

The initial single-run results appeared to tell a clean story: filtered and long both scored 1.8 versus 1M’s 3.9, suggesting that the planning phase was everything and writing context was irrelevant or even harmful. With four evaluation runs per condition, the story is more nuanced and more interesting.

The dominant finding survives replication: all three new conditions massively outperform the old pipeline. The old pipeline scored 11.4; the worst new condition scores 3.9. The shared factor is planning context: all three new conditions used outline and discovery artifacts derived from an agent that could see all 19,867 messages. The old pipeline’s chapter briefs were generated by an LLM working from limited context, and each layer of abstraction amplified the narrative framing. A person who occasionally expressed anxiety became “a person characterized by anxiety” in the brief, which became a chapter where anxiety was the dominant emotional register. The 1M pipeline collapsed these layers by giving the planner everything. The planner calibrated emotional peaks against the full behavioral baseline. This is why planning context matters most — though the mechanism could be the raw information advantage (seeing all 19,867 messages versus curated subsets) or the architectural shift from sequential brief compression to holistic discovery, or both.

The secondary finding changed with replication. Under single-run evaluation, filtered appeared to match long (both 1.8), suggesting writing context was irrelevant. With confidence intervals, long (1.7 ± 0.3) outperforms filtered (2.4 ± 0.4), and both outperform the base 1M pipeline (3.9). The variable that separates 1M from both ablation conditions is output length: the 1M pipeline produced approximately 20,000 words, while filtered produced approximately 32,000 and long approximately 28,000. Longer narratives provide more surface area for the personality evaluation prompt to detect behavioral patterns that match the reference profile. However, length alone does not explain the improvement: the 1M pipeline at 20,000 words dramatically outperforms the old pipeline at 24,000 words (3.9 vs 11.4). If output length were the sole mechanism, the old pipeline should score better than the shorter 1M pipeline, and it does not. Planning context contributes independently of length.

The comparison between filtered and long is the most informative. Filtered has restricted writing context (200 messages) but longer output (32,000 words). Long has full writing context (all messages in the time range) but slightly shorter output (28,000 words). Long wins, which suggests that full writing context provides value outweighing filtered's 4,000-word length advantage. That inference assumes the marginal benefit of those extra words is smaller than the observed 0.7-point gap; without additional length ablation points, the relative contributions of writing context and output length remain entangled. The "writing context adds noise" hypothesis that the single-run data appeared to support is refuted by the replicated data.

Three factors, ordered by effect size under Opus evaluation: (1) planning context is the dominant factor, associated with the 7.5-point gap between old and 1M (though this comparison confounds planning context with planning architecture); (2) output length is a secondary factor, with longer narratives scoring better, likely by providing richer behavioral evidence for the evaluator; (3) writing context provides a small additional benefit, visible in the long-vs-filtered comparison. None of these rankings have been confirmed by cross-model evaluation, which means they should be treated as Opus-specific observations.

Generalization: Where Context Matters Most

The specific domain (personality signal preservation in narrative generation) is narrow, and whether the architectural observations transfer is an empirical question. But the structural question applies to any multi-step LLM pipeline with distinct planning and generation phases, whether for document generation, code production, research synthesis, or agentic task execution: where in the pipeline does context quality matter most?

The standard answer is to maximize context at the point of generation. RAG systems retrieve at generation time. Long context models are marketed on generation-side promise. The ablation results suggest a more layered picture: planning context is the single largest factor, but generation context and output length both contribute measurable improvements. A comprehensive plan created from full context is the prerequisite for good output, but the generator benefits from full context too, and longer output provides more surface area for quality signal to emerge.

This framing maps onto a distinction familiar in other engineering domains: design time information versus execution time information. An architect who has surveyed the entire site before drawing plans produces better buildings than one who surveys each room individually during construction. But the architect analogy has limits: unlike a construction worker following blueprints, an LLM writer with full context can perform local calibration that a context-restricted writer cannot, and the data suggests this calibration has measurable value. The analogy also assumes that planning and generation are informationally sequential — the plan is complete before generation begins. Pipelines with planning-generation feedback loops may distribute the value of context differently.

Whether these patterns hold beyond narrative generation is an empirical question that this study cannot answer. The domain has specific properties (personality signal as a measurable outcome, a validated psychometric benchmark, a clear distinction between planning and writing phases) that may not transfer to domains where the planning/generation boundary is less defined. The finding that planning context is the dominant factor has the strongest evidential support (all three new conditions outperform old, and the gap is large). The secondary findings about output length and writing context are based on smaller effect sizes and Opus-only evaluation. The principle that planning context matters most is a hypothesis with strong support from one domain; the finer-grained observations about generation context and output length are more tentative.

Limitations

The experiment has four limitations that constrain its generalizability.

The sample is a single 8-month messaging arc: 19,867 messages between two people. The finding that planning context matters most may not hold for relationships with different emotional textures, different messaging densities, or different temporal structures.

The evaluation metric (Big Five personality inference from generated text) is specific to this pipeline. The output length effect (longer narratives score better) may partly reflect the evaluator having more text to work with rather than the narrative genuinely containing better personality signal. Whether these patterns hold for other generation quality metrics (factual accuracy, stylistic consistency, narrative coherence) is untested.

The old pipeline and the 1M pipeline differ in more than context window size. The old pipeline used a fundamentally different planning architecture (sequential brief generation rather than holistic discovery), which means the ablation isolates context restriction during writing but not during planning. A cleaner test would restrict planning context to 100,000 tokens while keeping the discovery agent architecture, which would separate the effect of context size from the effect of architectural differences in the planning phase.

The cross-model evaluation was expanded from a single corpus comparison to a 39-run factorial design covering five source registers, plus a 2×2 generation confound experiment. The factorial design quantified systematic evaluator biases (GPT inflates E +15, C +19, A +11, N +8 relative to Opus across all registers) and confirmed that the Neuroticism finding is evaluator-robust. The generation confound experiment, replicated with 2-3 runs per cell, demonstrated that the evaluator effect (12.4 pts) is nearly 7× the generator effect (1.8 pts), with the generator effect formally negligible under equivalence testing. However, the finer-grained condition rankings (long vs filtered vs 1M) remain Opus-specific observations. A separate finding from the factorial design is that full-context evaluation (sending all text in a single prompt) produced worse personality inference than chunked evaluation for Opus across all three tested sources, suggesting that the diversification benefit of chunking outweighs the coherence benefit of full-context for this task.

What Holds and What Remains Open

The finding that the old pipeline inflates Neuroticism and that 1M-derived planning artifacts substantially reduce this inflation holds across evaluators, benchmarks, and generating models. This is the robust finding: planning context is the dominant factor. With four evaluation runs per condition and confidence intervals, two secondary findings also emerge under Opus evaluation: longer output improves scores (long 1.7 vs 1M 3.9, the only difference being word count target), and full writing context slightly outperforms restricted writing context (long 1.7 vs filtered 2.4, though confounded by output length). Neither secondary finding has been confirmed by cross-model evaluation.

The generation confound — the concern that convergence between Opus-generated narratives and Opus-evaluated personality profiles was self-recognition rather than genuine signal — is substantially addressed. A parallel narrative generated by GPT-5.4 from the same pipeline inputs produced personality scores within 1.8 points of the Opus-generated narrative when read by the same evaluator (and the two Opus-evaluated cells converged to within 0.1 points under replication), while the same narrative read by different evaluators diverged by 12.4 points. The personality signal is in the text, not in the model that wrote it. The self-scoring concern, while not eliminated entirely (systematic biases may be shared across frontier models), is now quantitatively bounded: the generator effect is formally negligible under equivalence testing (inside a ±3-point bound).

The initial single-run observation that writing context restriction helps (the “planning > generation” thesis in its strong form) is refuted by the replicated data. The filtered condition scored 2.4, not 1.8, and the long condition (with full writing context) scored 1.7. Writing context contributes positively, not negatively. Separately, a factorial evaluation of direct corpus personality inference found that chunked evaluation outperformed full-context evaluation for Opus across three tested sources (academic: 5.8 vs 15.8, mixed: 8.3 vs 12.0, SMS: 8.3 vs 9.1). This suggests that for personality evaluation (as opposed to narrative generation), chunked sampling provides beneficial diversification that a single full-context pass loses.

A temporal analysis of the message archive clarifies the planning-dominance mechanism. Evaluating chronological quarter-slices independently reveals that the personality profile is highly stable across time segments (early-to-late profile correlation r=0.966) despite genuine behavioral change (Extraversion increases by 6.4 points and Agreeableness by 4.9 in the later relationship period). This stability suggests that the old pipeline’s personality inflation was driven less by insufficient message coverage (even partial samples would capture the stable profile) and more by the sequential brief architecture, which compressed and amplified emotional framing at each layer of abstraction. The planning phase’s job is to identify which segments matter for the story, not to discover a personality that varies unpredictably. Cross-model evaluator bias, after additive correction, is fully resolved for Neuroticism, Extraversion, and Openness but shows residual interactions for Agreeableness (up to 12.8 points) and Conscientiousness (up to 8.9 points), confirming that some evaluator disagreements are structural rather than correctable by simple calibration.

The generalization to multistep LLM pipelines remains a hypothesis. The finding that planning context matters most has the strongest evidential support and the clearest theoretical basis (design time versus execution time information). The secondary findings about output length and writing context are more domain-specific and may reflect properties of personality inference from text rather than general principles of LLM pipeline design.

The practical implication, for anyone building a pipeline with distinct planning and generation phases, is that investing context in the planning phase yields the largest returns, but restricting generation context is not a free optimization. A comprehensive plan created from full context remains the prerequisite for good output. Whether the generator also benefits from full context depends on the domain, and in this case, the Opus-evaluated evidence suggests it probably does.
