Context Window Epistemology

AI Summary (Claude Opus)

TL;DR: The context window of a large language model is not merely a technical constraint but an epistemological condition that makes bounded rationality architecturally explicit, revealing that knowledge under finite capacity is necessarily governed by curation and satisficing rather than comprehension.

Key Points

  • Transformer attention scales quadratically with sequence length, and empirical benchmarks show systematic performance degradation when relevant information appears in the middle of long contexts, a pattern that persists even as context windows expand to one million tokens.
  • Position-dependent attention biases (primacy and recency effects) emerge in both transformer and non-transformer architectures, paralleling the serial position effect in human memory and suggesting these patterns are inherent to any sequential processing system operating under finite capacity.
  • The epistemological distinction between parametric and contextual knowledge means that a language model's relationship to in-context information is absolute: access and knowledge are identical during inference, and loss of access constitutes complete unknowing with no residual retrieval pathway.

The post argues that the context window of a large language model constitutes an epistemological condition rather than a mere engineering limitation. Drawing on empirical benchmarks demonstrating systematic attention degradation across token positions, the analysis connects quadratic scaling constraints and positional biases in transformers to Herbert Simon's theory of bounded rationality and George Miller's findings on human working memory capacity. The post contends that retrieval-augmented generation and token budgeting are instances of satisficing — procedurally rational curation strategies that emerge necessarily under finite capacity. It concludes that context windows make the epistemological constraints of bounded knowers fully transparent and measurable for the first time, suggesting that knowledge itself may be better understood as effective management of available information rather than possession of comprehensive understanding.

Context Window Epistemology

A context window is a fixed buffer of tokens that constitutes, for any given generation step, the totality of what a large language model can know. The window is not an implementation detail or an engineering constraint awaiting a sufficiently clever solution. It is an epistemological condition, one that determines what counts as available knowledge, how that knowledge degrades across positions, and why strategies of curation inevitably replace strategies of comprehension. That the most sophisticated language models ever built struggle to associate, integrate, and track multiple pieces of information distributed across their own input is not a bug to be patched but a structural revelation about the nature of bounded attention, whether computational or biological.

The self-attention mechanism at the core of transformer architectures computes relationships between every token and every other token in the context window. This computation scales quadratically with sequence length: O(n²d), where n is the number of tokens and d is the embedding dimension. In practice, causal language models compute only the lower triangular portion of the attention matrix (each token attends only to itself and preceding tokens), which halves the constant factor to n(n+1)/2 operations, but the quadratic scaling in n remains. Doubling the sequence length still quadruples the computational cost. Complexity-theoretic arguments under the Strong Exponential Time Hypothesis suggest that general, exact self-attention is unlikely to improve beyond quadratic scaling. This lower bound applies to computing exact self-attention for arbitrary inputs, and extends to some approximate settings as well; approximate attention, sparse attention, and linear attention mechanisms can achieve sub-quadratic complexity, but typically at the cost of exactness or generality. FlashAttention and its successors achieve substantial speedups through memory access optimization (tiling reads and writes to reduce I/O overhead), but they do not alter the fundamental scaling relationship for exact attention. They make the quadratic cost more bearable without making it linear.
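The arithmetic behind these claims is simple enough to check directly. The counts below are the standard back-of-envelope estimates for forming the attention score matrix; the function names and figures are illustrative, not drawn from any particular implementation:

```python
def full_attention_cost(n: int, d: int) -> int:
    """Multiply-adds to form the full n x n score matrix QK^T: O(n^2 d)."""
    return n * n * d

def causal_attention_cost(n: int, d: int) -> int:
    """Causal masking keeps only the lower triangle: n(n+1)/2 score entries."""
    return (n * (n + 1) // 2) * d

d = 128
# Doubling the sequence length quadruples the full-attention cost.
assert full_attention_cost(2048, d) == 4 * full_attention_cost(1024, d)
# Causal masking halves the constant factor but leaves the scaling quadratic.
print(causal_attention_cost(4096, d) / full_attention_cost(4096, d))  # ≈ 0.5
```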

This matters because context windows have expanded dramatically in the past three years, from 2,000–4,000 tokens to 128,000, 200,000, and now one million. The expansion creates an illusion of approaching comprehensiveness: surely a model that can ingest an entire novel should be able to find any sentence within it. The empirical reality is less encouraging. Liu et al. (2023), in what has become one of the most widely cited studies on context utilization, demonstrated that language model performance degrades significantly when relevant information appears in the middle of long contexts, with performance highest when that information occurs at the beginning or end of the input. This is the “lost in the middle” problem, and it persists across models specifically designed for longer contexts. Gemini 1.5, with its million token window, achieves near-perfect single-needle retrieval but drops to roughly 60% recall on multi-needle tasks at that scale (100 needles distributed across the context). The evidence suggests that the degradation is primarily a count problem rather than a positional one: the difficulty appears to lie in tracking 100 simultaneous retrieval targets rather than in any particular target’s position within the window, though position and context-length effects cannot be fully ruled out. GPT-4o drops from 99.3% accuracy at short contexts to 69.7% at 32,000 tokens. According to the NoLiMa benchmark, most models tested fall below 50% of their baseline performance at that same length.

Even where raw capacity is not the bottleneck, the degradation is not random. It follows systematic patterns rooted in the architecture itself. Causal masking ensures that tokens in deeper layers attend to increasingly contextualized representations of earlier tokens, producing a primacy effect (earlier information receives disproportionate attention). Rotary positional encoding introduces a decay function that diminishes attention scores for tokens at greater distances, regardless of their semantic relevance. Training data compounds both biases: among the documents used to pretrain large language models, the most informative tokens for predicting a given token are typically the most recent ones, which means the models learn to weight recency as a proxy for relevance. When inputs occupy less than half the context window, primacy dominates; as inputs approach the maximum context length, primacy weakens while recency persists. The middle, in either configuration, receives the least reliable attention.
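The rotary decay mentioned above can be made concrete with a toy calculation. For identical unit query/key pairs, the rotary inner product at relative distance delta reduces to a sum of cosines over the per-pair frequencies. The dimension and base below are the conventional defaults, and the setup is illustrative rather than a model of any deployed system:

```python
import math

def rope_score(delta: int, d: int = 128, base: float = 10000.0) -> float:
    """Inner product of identical unit query/key pairs under rotary
    position encoding, as a function of relative distance delta."""
    return sum(math.cos(delta * base ** (-2 * i / d)) for i in range(d // 2))

for delta in (0, 16, 256, 4096):
    print(delta, round(rope_score(delta), 2))
```

The score is maximal at distance zero and falls off, noisily, with distance regardless of content, which is the sense in which relevance at long range is systematically discounted.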

These patterns are not unique to transformer architectures. Recent work on Mamba, a state space model designed to achieve linear rather than quadratic complexity, demonstrates the emergence of primacy and recency effects even without the self-attention mechanism that produces them in transformers. One possible implication is that position bias may be an emergent property of sequential processing under finite capacity rather than an artifact of any particular computational approach. But this conclusion is drawn from only two architectural families, and the underlying mechanisms differ: in transformers, positional bias arises from rotary encoding decay and causal masking, while in Mamba, it stems from exponential decay in the hidden state. Same behavioral phenomenon, different causes. The hypothesis that any system processing information sequentially through a bounded representation will develop systematic positional preferences is plausible but remains undertested.
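The decay mechanism in state space models can be seen in a deliberately minimal sketch. This is a fixed scalar recurrence, not Mamba itself (Mamba's state updates are input-dependent and selective), but it shows how a bounded hidden state mechanically discounts older inputs:

```python
def ssm_state(inputs: list[float], a: float = 0.9) -> float:
    """Toy recurrence h_t = a * h_{t-1} + x_t with fixed decay a < 1.
    An input at position k contributes a**(T-1-k) to the final state."""
    h = 0.0
    for x in inputs:
        h = a * h + x
    return h

early = ssm_state([1.0] + [0.0] * 9)  # impulse at the start: decayed 9 steps
late = ssm_state([0.0] * 9 + [1.0])   # impulse at the end: no decay
print(early, late)  # recency bias falls out of the recurrence itself
```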

The parallel to human cognition is structural rather than metaphorical. George Miller’s 1956 paper identified two distinct capacity limits that happen to converge on the same number. The “magical number seven, plus or minus two” refers to the span of immediate memory (how many items a person can recall from a short list), while a separate finding about absolute judgment showed that humans can reliably distinguish only about two to three bits of information along a single perceptual dimension. Miller himself emphasized that “the span of absolute judgment and the span of immediate memory are quite different kinds of limitations,” despite their numerical coincidence. What matters for the present argument is the shared implication: biological cognition operates under hard capacity constraints at multiple levels simultaneously. Nelson Cowan’s subsequent reconceptualization argued that the true capacity of the focus of attention, when chunking and rehearsal are controlled for, is approximately four items. The progression from Miller to Cowan is itself instructive: both were measuring bounded systems, but they disagreed about where the bound falls and what it constrains, a disagreement that remains active in cognitive science. The serial position effect in human memory (better recall for items at the beginning and end of a list, worse for the middle) maps directly onto the primacy and recency biases observed in transformer attention distributions. The mechanism is entirely different (biological consolidation versus computational masking), but the constraint pattern is identical: bounded simultaneous access produces predictable, systematic failures of attention at intermediate positions.

Herbert Simon named this condition in a series of works beginning in 1955, introducing satisficing in 1956 and the term bounded rationality in 1957, though he was describing human decision makers rather than neural networks. Bounded rationality refers to the limitation of rational action by the information available, the computational capacity of the organism, and the time constraints under which decisions must be made. Simon’s response to bounded rationality was satisficing: seeking outcomes that are satisfactory rather than optimal, because the cost of searching for the optimum exceeds the benefit of finding it. The goal, as Simon framed it, was to replace the global rationality of economic man with a kind of rational behavior compatible with the access to information and the computational capacities actually possessed by organisms in the environments in which they exist. Substitute “language model” for “organism” and “context window” for “environment” and the framework applies without modification.

Retrieval augmented generation is satisficing made architectural. Rather than expanding the context window to accommodate all potentially relevant information (an approach that fails because of quadratic cost scaling and positional degradation), RAG selects a subset of information likely to be relevant and injects it into the context at inference time. A frequently used chunk size range of 512 to 1,024 tokens has emerged as common practice in long context retrieval, though the optimum varies by corpus and task, balancing sufficient context against the dilution of relevance. Anthropic’s own documentation on contextual retrieval notes that placing incorrect, irrelevant, or simply too much information into the context window can degrade rather than improve results. This is a direct instantiation of Simon’s insight: the cost of comprehensiveness exceeds its benefit, so curation becomes the rational strategy. Not all tokens contribute equal value. A 500 token legal disclaimer has near zero marginal utility for most queries, while a 10 token customer identifier may be critical. The engineering practice of token budgeting (allocating context capacity according to estimated information value) is procedural rationality in Simon’s precise sense: rationality expressed not as optimal outcomes but as effective procedures for navigating constraints.
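One minimal form of token budgeting is a greedy value-per-token rule: rank candidate chunks by estimated relevance per token and stop at the budget. The `Chunk` type, the scores, and the greedy rule are illustrative assumptions, not any framework's API — a sketch of satisficing, not an optimal packing:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tokens: int
    relevance: float  # estimated value, e.g. a retriever score (assumed given)

def budget_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Satisficing curation: best estimated value per token first,
    stopping at the budget rather than searching for the optimum."""
    ranked = sorted(chunks, key=lambda c: c.relevance / c.tokens, reverse=True)
    chosen, used = [], 0
    for c in ranked:
        if used + c.tokens <= budget:
            chosen.append(c)
            used += c.tokens
    return chosen

chunks = [
    Chunk("legal disclaimer", tokens=500, relevance=0.1),
    Chunk("customer identifier", tokens=10, relevance=0.9),
    Chunk("order history", tokens=300, relevance=0.7),
]
print([c.text for c in budget_context(chunks, budget=400)])
# → ['customer identifier', 'order history']
```

The disclaimer is excluded not because it is false but because its value per token loses the ranking — the near-zero marginal utility case described above.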

What makes the context window epistemologically distinctive, as opposed to merely technically interesting, is the nature of the knowledge it contains. A language model possesses two categories of knowledge. Parametric knowledge is encoded in the model’s weights during training: persistent, accessible across all inferences, but frozen at the point of training. Contextual knowledge exists only during a specific inference pass, derived from the tokens currently in the window, and it vanishes entirely when that pass concludes. Within a single deployment, the model cannot consolidate contextual knowledge into parametric knowledge. Each interaction is, in that sense, a fresh encounter. Fine-tuning on conversation logs does constitute a form of consolidation (contextual knowledge migrating into weights), but it requires an explicit retraining cycle rather than the continuous, automatic consolidation that biological memory performs through sleep and rehearsal. During inference, the conversational continuity that users experience is an artifact of re-ingesting previous exchanges: the model reads the transcript of what it has already said and reconstructs a semblance of memory from that text. This is closer to rereading than to remembering, though the boundary between the two is less absolute than it first appears.
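The rereading-rather-than-remembering point can be made schematic. In the sketch below, `model` stands in for any stateless completion call; the stub simply reports how much history it was handed, which is the entire mechanism of "memory" here:

```python
def chat_turn(model, transcript: list[str], user_msg: str) -> str:
    """One conversational turn via re-ingestion: the model is stateless,
    so the entire transcript is re-read as part of every prompt."""
    transcript.append(f"User: {user_msg}")
    reply = model("\n".join(transcript))  # all 'memory' lives in the prompt
    transcript.append(f"Assistant: {reply}")
    return reply

# Stand-in for a real completion call: reports how much history it was fed.
stub_model = lambda prompt: f"(saw {len(prompt.splitlines())} lines)"

log: list[str] = []
chat_turn(stub_model, log, "hello")
print(chat_turn(stub_model, log, "what did I just say?"))  # → (saw 3 lines)
```

Nothing persists in `model` between calls; delete the transcript and the second turn becomes a first turn.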

The epistemological consequence is a form of knowledge that has no analog in human experience. Human working memory is bounded, but it feeds into long term memory through consolidation: sleep, rehearsal, emotional salience, and contextual association all create pathways from temporary activation to durable storage. A human who forgets a conversation may later recall fragments through association or cue. A language model that exhausts its context window has lost access to that information within the current inference pass, with no retrieval pathway and no residual trace, unless the information happens to overlap with patterns in its training data or is later incorporated through fine-tuning, both of which constitute different kinds of knowledge acquisition entirely. The philosopher’s distinction between knowing and having access to knowledge collapses here into a single condition: for a language model during inference, to have access to information and to know it are the same thing, and to lose access is to unknow it completely.

This is why the expansion of context windows, while technically impressive, does not by itself resolve the epistemological condition. A model with one million tokens of context faces the same structural situation as a model with four thousand: bounded capacity, degradation patterns that emerge under load, and the necessity of curation over comprehension. Active research on positional bias mitigation (including calibration mechanisms and attention scaling techniques) has shown that some of the specific degradation patterns are partially correctable within existing architectures, which means the particular failure modes described here may prove more contingent than they currently appear. But even if positional bias is substantially mitigated, the deeper point survives: the bound has moved, and may continue to move, but some bound remains. The NoLiMa benchmark demonstrates that models which perform well on simple needle retrieval (finding a single planted sentence) fail dramatically on associative reasoning tasks that require latent connections without explicit textual overlap, which is what most real applications demand. A larger window does not produce a qualitatively different relationship to knowledge; it produces a quantitatively expanded version of the same epistemological constraint.

The practical implications are less interesting than the philosophical ones. Engineers already know to chunk their contexts, budget their tokens, and position critical information at the beginning or end of prompts. What the epistemological framing reveals is that these engineering decisions are, whether their practitioners recognize it or not, choices about what a system can know and how reliably it can know it. Selecting a context window size is selecting a scope of possible knowledge. Designing a RAG pipeline is designing an attention policy: which information deserves to be known during this inference, and which can be safely left unknowable. Placing a system prompt at position zero is exploiting primacy bias to ensure that certain knowledge persists more reliably than other knowledge, which is a hierarchy of epistemic privilege encoded in token position.
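The "attention policy" framing can be stated as code. The assembly rule below is a hypothetical sketch of the positioning practice described above — instructions first to exploit primacy, the question last to exploit recency — not a recommendation from any particular vendor:

```python
def assemble_prompt(system: str, passages: list[str], question: str) -> str:
    """Encode the epistemic hierarchy in token position: system text at
    position zero (primacy), the question last (recency), retrieved
    passages in the less reliably attended middle."""
    return "\n\n".join([system, *passages, f"Question: {question}"])

print(assemble_prompt("Answer tersely.", ["passage A", "passage B"], "Why?"))
```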

The deeper implication is that context windows make architecturally explicit what bounded rationality only described theoretically. Simon argued that real agents cannot optimize because they lack the information and computational capacity to evaluate all alternatives. The argument was persuasive but abstract: human cognitive limits are difficult to measure precisely, and the boundary between “could know but chose not to search” and “cannot know due to capacity constraints” is blurred by the opacity of biological cognition. In a transformer, these limits are exact. The context window has a specific token count. The attention distribution across positions can be measured. The degradation curve from position 1 to position n is empirically characterized. In a transformer architecture, we have a system whose epistemological constraints are architecturally transparent and formally measurable in principle, even when specific production deployments remain opaque. The context window is not a metaphor for bounded rationality. It is bounded rationality, instantiated in silicon, with the architectural parameters that govern it fully specified even where the trained parameters are not.

That this system, despite its constraints, produces responses that appear knowledgeable is perhaps the most unsettling finding. The appearance of comprehensive understanding emerges from radically bounded attention operating on curated fragments, which suggests that what we call knowledge may have always been less about the totality of available information and more about the effectiveness of satisficing strategies applied under constraint. If a system that knows nothing beyond its current window can perform as though it understands, the uncomfortable possibility is that understanding itself is a satisficing behavior: not the possession of complete information, but the procedurally rational management of whatever information happens to be available at the moment of inference. We have built, in the context window, a machine that makes the epistemological condition of all bounded knowers visible for the first time, and what it reveals is that knowledge was never what we thought it was.
