The Etymology Tax: Etymological Register Effects on LLM Multi-Step Reasoning

Ashita Orbis | March 6, 2026 | 41 min read | working-paper


Working paper. Not peer-reviewed.

Abstract

This study tests whether the etymological register of input text affects large language model performance on multi-step reasoning tasks. Using 250 murder mystery narratives from the MuSR benchmark, eight models were evaluated across six conditions in a 2x3 factorial design crossing vocabulary register (baseline, Germanic, Latinate) with explicit clarification (present, absent). Both Germanic and Latinate register translations significantly reduced accuracy relative to baseline English (Germanic: -2.5%, p=0.019; Latinate: -3.7%, p=0.003), while explicit clarification of ambiguous narratives produced no significant effect (+0.5%, p=0.67). A stochastic validation experiment using 80 target problems with three additional translation variants per problem revealed a critical asymmetry: accuracy reductions caused by register translation were robust across wording variants (2% flip rate from harmful to helpful), whereas accuracy improvements were fragile (20% flip rate from helpful to harmful). Full register translations exhibited significantly higher variance than minimal translations (ANOVA F=2.97, p=0.038), with Latinate full translations showing the highest fragility (40% of problems classified as fragile) compared to Germanic minimal translations (0% fragile). Linguistic analysis identified pronoun count (d=1.39), sentence count (d=1.10), and causal reasoning markers (d=1.07) as significant predictors of translation sensitivity. The findings suggest that vocabulary register constitutes an uncontrolled variable in LLM benchmarking, that models are calibrated to natural English word distributions rather than to any particular register, and that the robustness asymmetry between harmful and helpful translation effects has implications for the reliability of prompt engineering strategies that depend on vocabulary choice.

1. Introduction

1.1 English as a Language with Two Registers

English occupies a distinctive position among natural languages in that its vocabulary contains two largely parallel lexical strata, a consequence of the Norman Conquest of 1066 and the subsequent Renaissance importation of Latin and Greek terminology into scholarly discourse. The Germanic substrate, inherited from Old English, provides the foundational vocabulary of everyday communication: words like "begin," "ask," "buy," "help," and "understand" that constitute the core of conversational and practical language. The Latinate superstratum, arriving through Norman French and later through direct borrowing from classical languages, provides a parallel vocabulary for formal, technical, and academic contexts: "commence," "inquire," "purchase," "facilitate," and "comprehend" occupy substantially the same semantic space as their Germanic counterparts but carry different connotations of register, formality, and domain.

This lexical duality is not merely a historical curiosity but a functional feature of the language that speakers deploy strategically, because the choice between Germanic and Latinate synonyms signals something about the communicative context beyond the semantic content of the utterance itself. A physician who says "the patient expired" rather than "the patient died" is communicating the same propositional content but signaling professional distance, clinical precision, and institutional authority through register selection. The reverse substitution, replacing Latinate vocabulary with Germanic equivalents, similarly signals informality, directness, and accessibility without changing the underlying meaning. These register effects operate below the level of conscious attention for most speakers, which makes them a potentially interesting variable in contexts where superficial linguistic features might influence processing by systems trained on the statistical regularities of natural language.

1.2 The Hypothesis: Etymological Register as a Latent Space Signal

The research presented in this paper originated from a specific prediction about how etymological register interacts with the internal representations of large language models. The hypothesis was that register functions as a latent space signal because training data creates systematic correlations between vocabulary register and content type. Latinate vocabulary clusters with academic, technical, analytical, and formal content in the training distribution, which means that a model encountering Latinate vocabulary might be drawn toward what could be described as a "reasoning basin" in its representational space, a region associated with careful, systematic, analytical processing. Germanic vocabulary, by the same logic, clusters with everyday, informal, concrete content and might draw the model toward a different processing mode that is less conducive to the sustained multi-step reasoning that complex tasks demand.

The prediction, stated plainly, was that Latinate register should improve reasoning performance and Germanic register should degrade it, because the etymological signal would activate different processing patterns learned from the distributional properties of the training data. The inverse was also considered plausible: that the informality and directness of Germanic vocabulary might reduce cognitive load and improve performance on tasks where the reasoning is already complex. Either outcome would demonstrate that etymological register functions as a meaningful variable in model behavior.

The actual results contradicted both versions of the hypothesis. Both register translations reduced accuracy relative to baseline English, with Latinate vocabulary causing a larger degradation than Germanic. The models were not drawn to a reasoning basin by academic vocabulary, nor were they helped by the simplicity of everyday vocabulary. Instead, any deviation from the natural distribution of English vocabulary imposed a cost, regardless of the direction of that deviation. This finding reframes the question from "which register helps reasoning" to "why does any register shift hurt," which is a more interesting question with more uncomfortable implications for the reliability of language model performance under natural variation in input text.

1.3 Practical Significance

The question of whether vocabulary register affects model performance has practical implications that extend beyond academic interest in the mechanisms of language processing. Legal documents, medical records, academic papers, and government communications are written in registers that diverge substantially from the conversational English that constitutes the majority of language model training data. If register systematically affects reasoning accuracy, then model performance on tasks involving formal or specialized text cannot be straightforwardly predicted from benchmark performance on naturalistic prompts, and benchmark comparisons that do not control for register may conflate model capability differences with register sensitivity differences.

Prompt engineering, the practice of optimizing input text to improve model outputs, implicitly involves register decisions that are rarely examined as such. Practitioners who instruct models to "think step by step" or who frame problems in formal academic language are making register choices that may affect performance through mechanisms unrelated to the informational content of the prompt. If the direction of these effects is unpredictable without empirical testing, then vocabulary register constitutes an uncontrolled variable in both benchmarking and deployment contexts that warrants systematic investigation.

1.4 Summary of Findings

The study proceeds through a pilot experiment on mathematical reasoning (GSM8K, 50 problems, 3 models) and a full experiment on narrative multi-step reasoning (MuSR murder mysteries, 250 problems, 8 models), followed by a stochastic validation experiment that tests the robustness of observed effects across alternative translation variants. The principal findings are that both Germanic and Latinate register translations significantly reduce accuracy (p<0.02 for both), that the effects are asymmetrically robust (harmful effects are consistent across wording variants while helpful effects are fragile), that full register translations introduce more variance than minimal translations restricted to vocabulary changes, and that problems with more pronouns, more sentences, and more causal reasoning markers are most sensitive to register manipulation.

2. Related Work

2.1 Prompt Sensitivity in Large Language Models

The observation that minor changes in prompt wording can substantially alter model outputs has been documented across a range of tasks and models since the earliest evaluations of instruction-tuned language models. Zhao et al. (2021) demonstrated that GPT-3's performance on few-shot classification tasks was highly sensitive to the choice of prompt template, the order of examples, and even the label names used to describe categories, with accuracy varying by up to 30 percentage points across semantically equivalent prompt formulations. Lu et al. (2022) extended this finding to show that example ordering effects alone could shift performance from near chance to approaching the state of the art on the same task with the same model, suggesting that superficial features of the input carry substantial weight in determining model behavior independent of the informational content.

More recent work has confirmed that prompt sensitivity persists in models trained with reinforcement learning from human feedback and instruction tuning, though the specific manifestations differ. Sclar et al. (2024) systematically evaluated prompt sensitivity across 61 LLMs and found that formatting choices (whitespace, punctuation, capitalization) that carry no semantic content could shift accuracy by up to 76 percentage points on standard benchmarks, a finding that calls into question the stability of any benchmark result that does not report sensitivity analyses across prompt variants. Mizrahi et al. (2024) proposed that prompt sensitivity should itself be treated as a measurable property of models, analogous to accuracy, and developed a benchmark specifically designed to quantify the degree to which semantically equivalent reformulations of a question affect model performance.

The present study differs from this literature in that the manipulated variable is not the structure or formatting of the prompt but the etymological provenance of the vocabulary used to express the same content. Register effects occupy a middle ground between the purely superficial formatting sensitivity documented by Sclar et al. and the semantic content that is typically the object of evaluation: changing "begin" to "commence" preserves meaning and syntax but alters the distributional properties of the text in ways that may interact with learned associations between vocabulary and processing mode.

2.2 Register and Formality in Natural Language Processing

Computational approaches to register variation have a long history in corpus linguistics and stylistics, where automated register classification has been studied since Biber's (1988) multidimensional analysis of spoken and written English demonstrated that register variation is structured along identifiable linguistic dimensions including informational versus involved production, narrative versus nonnarrative orientation, and explicit versus contextually dependent reference. More recent computational work has applied these frameworks to the analysis of web text (Biber and Egbert, 2018) and to the evaluation of register awareness in language models (Kang and Hovy, 2021), though the focus has typically been on classification (can models identify register?) rather than on the effects of register on downstream task performance.

The specific question of whether etymological register, as distinct from syntactic complexity, sentence length, or other correlates of formality, affects language model reasoning has not been directly investigated in the published literature, to the present author's knowledge. The closest related work examines the effects of text simplification on model performance, where findings are mixed: some studies report that simplified text improves comprehension task performance while others find that simplification removes information that models rely on for inference, suggesting that the relationship between linguistic complexity and model performance is not monotonic.

2.3 The MuSR Benchmark

The Multi-step Soft Reasoning benchmark (MuSR) was introduced by Sprague et al. (2024) at ICLR 2024 as a Spotlight paper. MuSR was designed to evaluate language model performance on tasks requiring integrated reasoning across multiple steps within long narratives, addressing a gap in existing benchmarks that either test isolated reasoning steps or use short contexts that do not require sustained tracking of information across extended text. The murder mystery split used in the present study consists of 250 problems, each presenting a narrative of approximately 1,000 words describing a crime scene, suspect interviews, and evidence, followed by a question asking the reader to identify the most likely perpetrator from among the named suspects.

MuSR is particularly suitable for studying register effects because the narratives are long enough for vocabulary changes to accumulate (approximately 150 to 200 words are typically altered in a full register translation), because the reasoning required spans multiple inferential steps (connecting alibis, motives, evidence, and timelines), and because the forced-choice format provides a clear baseline accuracy (random performance is approximately 33% given three answer choices) against which register effects can be measured.

2.4 Theoretical Frame: Surface Sensitivity and Language Understanding

The finding that vocabulary register affects reasoning accuracy engages a broader theoretical question about the nature of language understanding in large language models, a question that has been debated under various framings since Bender and Koller (2020) argued that models trained solely on the form of language cannot learn meaning. The "stochastic parrots" critique (Bender et al., 2021) holds that language models are fundamentally pattern matching systems whose apparent understanding is an artifact of statistical regularities in training data rather than evidence of genuine comprehension. Under this framing, sensitivity to vocabulary register is expected: a system that processes text through learned statistical associations would naturally be affected by changes in the distributional properties of its input, because those distributional properties are the primary signal it has learned to exploit.

The alternative view, that language models develop internal representations that capture some aspects of meaning beyond surface statistics (Li et al., 2023; Nanda et al., 2023), does not necessarily predict register invariance, because representations that are robust to some surface variations may still be sensitive to others. The present study does not resolve this debate, but it provides empirical data on a specific dimension of surface sensitivity that has not been previously quantified, and it introduces a methodology (stochastic validation of translation effects) that allows the robustness of observed effects to be assessed independently of their statistical significance in the primary analysis.

3. Methods

3.1 Pilot Study: GSM8K

The research program began with a pilot study on the GSM8K benchmark (Cobbe et al., 2021), a collection of 8,500 grade school mathematics word problems that has become a standard evaluation for mathematical reasoning in language models. Fifty problems were sampled from the test set and translated into two register conditions: Anglish (exclusively Germanic vocabulary) and Classical (Latinate vocabulary), using Claude Opus 4.5 as the translation model at temperature 0 for deterministic output.

Three models were evaluated on the pilot: DeepSeek V3.2, Gemini 2.5 Flash-Lite, and Llama 3.3 70B Instruct, selected for training diversity (Chinese lab, Google, Meta open weights) and cost efficiency through OpenRouter. The pilot used a simple three condition design (baseline, Anglish, Classical) without the clarification dimension that was added in the full study.

The pilot results were directionally suggestive but not statistically significant. Classical register showed a consistent accuracy reduction across all three models (DeepSeek: -6.0%, Gemini: -4.0%, Llama: -2.0%, average: -4.0%), while Anglish register showed minimal effect (average: -0.7%). A chi-square test across conditions yielded p=0.263, failing to reach significance. The pilot was nevertheless informative in two respects: it confirmed the directional Classical effect that motivated the full study, and it revealed confounds in the translation methodology (semantic drift, particularly the translation of common nouns into scientific nomenclature, such as "kittens" becoming "felines") that were addressed in the full study design.

The pilot also demonstrated that GSM8K was not an ideal benchmark for this investigation, because the mathematical content of the problems is computationally trivial for current models (baseline accuracy exceeded 90% for two of three models), which limits the dynamic range available for detecting register effects. The decision to move to MuSR for the full study was motivated by the need for a benchmark with longer narratives (providing more vocabulary to translate), harder reasoning (providing more dynamic range for effect detection), and lower baseline accuracy (providing more room for both improvement and degradation).

3.2 Full Study: MuSR Murder Mysteries

The full study used the murder mystery split of the MuSR benchmark, comprising 250 problems with narratives averaging approximately 1,000 words each. The experimental design was a 2x3 factorial crossing two levels of clarification (original, clarified) with three levels of vocabulary register (baseline, Germanic, Latinate), yielding six conditions.

Table 1. Experimental conditions in the 2x3 factorial design.

Condition Register Clarification Description
baseline Original None Unchanged MuSR narratives
clarified_baseline Original Explicit Pronouns resolved, temporal markers added
anglish Germanic Implicit Full Germanic vocabulary with implicit clarification
anglish_minimal Germanic None Germanic vocabulary only, original ambiguity preserved
classical_v2 Latinate Implicit Full Latinate vocabulary with implicit clarification
classical_minimal Latinate None Latinate vocabulary only, original ambiguity preserved

The clarification dimension was introduced to disentangle two potential mechanisms of register effects. Translation inevitably involves some degree of rephrasing, and rephrasing can clarify ambiguous constructions (particularly ambiguous pronoun references and temporal sequences) in ways that might improve performance independent of vocabulary changes. The "minimal" conditions isolated the vocabulary effect by constraining the translator to change only individual words without restructuring sentences, while the "full" conditions (anglish and classical_v2) allowed the translator to rephrase for naturalness, which implicitly introduced some clarification. The clarified_baseline condition applied clarification without vocabulary changes, providing a direct estimate of the clarification effect.

Eight models were evaluated across all six conditions, selected to span a range of architectures, training backgrounds, and capability levels.

Table 2. Models evaluated in the full study.

Model Provider Baseline Accuracy
grok-4-fast xAI 74.8%
grok-4.1-fast xAI 73.2%
gemini-2.5-flash Google 68.0%
deepseek-chat-v3-0324 DeepSeek 63.2%
mistral-large-2512 Mistral 61.6%
claude-haiku-4.5 Anthropic 59.2%
llama-3.3-70b-instruct Meta 58.4%
nova-2-lite-v1 Amazon 55.3%

All evaluations were conducted through the OpenRouter API at temperature 0 for deterministic output, with each model receiving the same system prompt and response format instructions across all conditions. Amazon Nova exhibited intermittent API errors (42 total across all conditions, approximately 0.35% of evaluations), which were excluded from accuracy calculations. The total number of valid evaluations in the main study was 11,958.

3.3 Translation Methodology

All translations in the full study were generated by DeepSeek V3.2 at temperature 0 through the OpenRouter API, a change from the pilot study, which had used Claude Opus 4.5. The switch was motivated by cost considerations (the full study required approximately 1,500 translations compared to 100 in the pilot) and by the observation that DeepSeek V3.2 produced more consistent, length-controlled translations than Opus 4.5, which tended toward more aggressive rephrasing.

Translation prompts were designed to minimize semantic drift, the pilot study's primary confound. The key constraints were as follows. Proper nouns, character names, and numbers were to be preserved unchanged. Sentence count and overall narrative structure were to be maintained. The word count of the translation was to fall within 93% to 107% of the original, a constraint enforced programmatically with translations outside this range regenerated with modified prompts. Scientific nomenclature and overly technical substitutions were to be avoided (addressing the pilot's "kittens to felines" problem). For minimal conditions, only individual vocabulary items were to be changed without restructuring sentences. For full conditions, natural rephrasing was permitted, which implicitly introduced some degree of clarification.
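The word-count constraint lends itself to a simple programmatic gate. The sketch below is illustrative rather than the study's actual harness (which is not published); the function name is an assumption, but the 93% to 107% bounds are the ones stated above.

```python
def within_length_bounds(original: str, translation: str,
                         lo: float = 0.93, hi: float = 1.07) -> bool:
    """True when the translation's word count falls within 93% to 107%
    of the original's word count (the constraint described in the text).

    Note: this is a hypothetical reconstruction of the check, not the
    study's published code.
    """
    ratio = len(translation.split()) / len(original.split())
    return lo <= ratio <= hi
```

Because translations outside the range were regenerated with modified prompts, a predicate like this would in practice sit inside a retry loop wrapped around the translator call.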

The following example illustrates the three registers applied to the opening of a MuSR narrative.

Baseline (original): "In an adrenaline inducing bungee jumping site, Mack's thrill-seeking adventure came to a gruesome end by a nunchaku; now, it's up to Detective Winston to unravel the deadly secrets between Mackenzie and Ana."

Germanic (Anglish): "At a heart-racing bungee leap grounds, Mack's daredevil outing came to a grisly end by a nunchaku; now, it falls to Sleuth Winston to unravel the deathly hidden truths between Mackenzie and Ana."

Latinate (Classical): "In an adrenaline-inducing bungee-jumping locale, Mack's audacious escapade culminated in a gruesome demise by nunchaku; now, it fell to Detective Winston to unravel the lethal enigma betwixt Mackenzie and Ana."

3.4 Stochastic Validation

The main study used a single translation per problem per condition (generated at temperature 0), which raises the question of whether observed effects are properties of the register manipulation itself or artifacts of the specific wording choices made by the translator model. To address this question, a stochastic validation experiment generated three additional translation variants per problem at temperature 0.7 for a subset of problems selected to maximize the informativeness of the validation.

The target problems were the ten most consistently hurt and ten most consistently helped problems in each of four conditions (anglish_minimal, classical_minimal, anglish, classical_v2), yielding 80 target problems. "Most consistently hurt" was defined as the problems that showed the largest number of correct to wrong flips across all eight models when comparing the translated condition to baseline, and "most consistently helped" was defined analogously for wrong to correct flips. Each target problem was evaluated on all eight models across all four translation variants (original plus three stochastic), producing approximately 2,500 additional evaluations.

The stochastic validation allows two distinct questions to be addressed. The consistency question asks whether a problem that is hurt (or helped) by translation v1 is also hurt (or helped) by translations v2, v3, and v4, which tests whether the effect is a property of the register shift or a property of the specific translation. The fragility question asks whether some conditions produce more variable outcomes across translation variants than others, which tests whether the degree of vocabulary manipulation affects the reliability of translation effects.
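Both validation questions reduce to simple bookkeeping over per-problem accuracies. The sketch below is one plausible formalization, with hypothetical function names; it assumes each variant's accuracy has been pooled over the eight models for a given problem.

```python
def variant_directions(baseline_acc: float, variant_accs: list[float]) -> list[str]:
    """Classify each translation variant as 'hurt', 'helped', or 'neutral'
    relative to baseline accuracy for one problem (accuracy pooled over models).

    Hypothetical helper illustrating the consistency question in Section 3.4.
    """
    out = []
    for acc in variant_accs:
        if acc < baseline_acc:
            out.append("hurt")
        elif acc > baseline_acc:
            out.append("helped")
        else:
            out.append("neutral")
    return out

def flips_direction(primary: str, directions: list[str]) -> bool:
    """True if any stochastic variant reverses the problem's primary
    direction (e.g. a 'hurt' problem with at least one 'helped' variant).
    `primary` must be 'hurt' or 'helped'.
    """
    opposite = {"hurt": "helped", "helped": "hurt"}[primary]
    return opposite in directions
```

Applying `flips_direction` to each of the 80 target problems yields the flip rates reported later in Table 7.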

4. Results

4.1 GSM8K Pilot

The pilot study on 50 GSM8K problems across three models and three conditions produced the following accuracy results.

Table 3. GSM8K pilot accuracy by model and condition.

Model Baseline Anglish Classical Classical Delta
DeepSeek V3.2 94.0% 94.0% 88.0% -6.0%
Gemini 2.5 Flash-Lite 98.0% 98.0% 94.0% -4.0%
Llama 3.3 70B 90.0% 88.0% 88.0% -2.0%
Average 94.0% 93.3% 90.0% -4.0%

The Classical register effect was directionally consistent across all three models (average: -4.0%) but did not reach statistical significance (chi-square p=0.263), which was expected given the small sample size (50 problems) and the high baseline accuracy that limits the dynamic range for detecting degradation. The Anglish register showed no consistent effect (average: -0.7%), with one model showing a small degradation and the other two unchanged. These results motivated the full study on a harder benchmark with more problems and more models.

4.2 MuSR Main Study

The full study on 250 MuSR murder mystery problems across eight models and six conditions produced the following results.

Table 4. MuSR accuracy by model and condition (full study).

Model Baseline Clarified Anglish Ang. Min. Classical Cls. Min.
grok-4-fast 74.8% 74.4% 68.0% 72.0% 68.8% 72.0%
grok-4.1-fast 73.2% 67.6% 66.4% 71.2% 69.2% 64.8%
gemini-2.5-flash 68.0% 66.4% 62.4% 61.2% 64.8% 64.0%
deepseek-v3-0324 63.2% 62.8% 58.0% 58.8% 59.6% 58.0%
mistral-large-2512 61.6% 63.2% 60.4% 59.2% 62.4% 59.2%
claude-haiku-4.5 59.2% 63.6% 58.4% 57.2% 53.6% 55.2%
llama-3.3-70b 58.4% 62.0% 56.8% 57.6% 59.2% 57.6%
nova-2-lite-v1 55.3% 57.8% 54.5% 56.3% 55.7% 53.3%
Average 64.2% 64.7% 60.6% 61.7% 61.7% 60.6%

4.3 Register Effects

The primary analysis compared the minimal translation conditions (which isolate the vocabulary effect without confounding clarification) against baseline accuracy using paired t-tests across models, treating each model as an observation.

Table 5. Register effects (minimal translations vs. baseline).

Comparison Delta Cohen's d t-statistic p-value
Anglish minimal vs. baseline -2.5% -0.052 -3.054 0.019
Classical minimal vs. baseline -3.7% -0.076 -4.456 0.003

Both register translations produced statistically significant accuracy reductions, with Latinate vocabulary causing a larger drop than Germanic vocabulary. The effect sizes (Cohen's d of -0.052 and -0.076) are small in absolute terms but represent meaningful practical differences when applied across hundreds of reasoning problems, as the delta analysis in Section 4.5 demonstrates.
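The study's analysis code is not published, but the Table 5 statistics admit a straightforward reconstruction. In the sketch below, the paired t-test treats each model as one observation, while Cohen's d is computed at the problem level from the two Bernoulli accuracy proportions; this problem-level convention is why |d| is small even though the model-level t-statistics are large. Function names are assumptions.

```python
import math
from statistics import mean, stdev

def paired_t(baseline: list[float], treated: list[float]) -> tuple[float, float]:
    """Paired t-test across models (each model is one observation).
    Returns (mean difference, t statistic); df = n - 1.
    One plausible reconstruction of the Section 4.3 analysis."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return mean(diffs), t_stat

def cohens_d_binary(p_base: float, p_treat: float) -> float:
    """Problem-level Cohen's d for a binary (correct/incorrect) outcome,
    using the pooled standard deviation of the two Bernoulli proportions."""
    sd = math.sqrt((p_base * (1 - p_base) + p_treat * (1 - p_treat)) / 2)
    return (p_treat - p_base) / sd
```

Applied to the group means in Table 4 (baseline 64.2%, Anglish minimal 61.7%), `cohens_d_binary` returns approximately -0.052, matching the Table 5 value, which supports the problem-level reading of the reported effect sizes.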

4.4 Clarification Effect

The clarification effect was estimated by comparing the clarified_baseline condition against the baseline condition.

Table 6. Clarification effect (clarified baseline vs. baseline).

Metric Value
Mean delta +0.5%
Standard deviation 3.0%
t-statistic 0.448
p-value 0.668

The overall clarification effect was not statistically significant. Disaggregation by model tier revealed an interaction: weaker models (those with baseline accuracy below 60%) showed consistent improvement from clarification (claude-haiku-4.5: +4.4%, llama-3.3-70b: +3.6%, nova-2-lite-v1: +2.5%), while stronger models showed mixed or negative effects (grok-4.1-fast: -5.6%, grok-4-fast: -0.4%). This interaction suggests that clarification benefits models that struggle with ambiguity resolution but may introduce harmful verbosity for models that already handle ambiguity competently, though the interaction was not formally tested and the number of models in each tier is small.

4.5 Delta Analysis

Across all eight models and four translation conditions, the net direction of translation effects was computed by counting the number of correct to wrong flips (problems hurt by translation) and wrong to correct flips (problems helped by translation).

Total accuracy flips across all models and conditions: approximately 170 problems were hurt by translation (correct on baseline, wrong on translated version), while approximately 100 problems were helped (wrong on baseline, correct on translated version), yielding a net effect of approximately 70 more problems hurt than helped. This asymmetry confirms that translation effects are predominantly negative and is consistent with the statistically significant register effects reported in Section 4.3.

The ten problems most consistently hurt by translation showed hurt counts ranging from 7 to 12 (out of a maximum of 32, representing 8 models across 4 conditions), while the ten problems most consistently helped showed helped counts ranging from 4 to 6, indicating that the most translation-sensitive problems were hurt far more consistently than they were helped.
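The flip counts in this delta analysis can be computed in a single pass over per-evaluation correctness records. The data layout below is an assumption for illustration, not the study's actual storage format.

```python
from collections import Counter

def flip_counts(results: dict) -> Counter:
    """Count correct-to-wrong ('hurt') and wrong-to-correct ('helped') flips
    per problem, pooled over models and translation conditions.

    `results[(problem, model, condition)]` is True/False for a correct
    answer; condition 'baseline' is the reference. Hypothetical layout.
    """
    counts = Counter()
    for (prob, model, cond), correct in results.items():
        if cond == "baseline":
            continue
        base = results[(prob, model, "baseline")]
        if base and not correct:
            counts[(prob, "hurt")] += 1
        elif not base and correct:
            counts[(prob, "helped")] += 1
    return counts
```

Summing the "hurt" and "helped" keys over all problems gives the aggregate totals (approximately 170 hurt versus 100 helped) reported above; sorting problems by these counts is also how the stochastic-validation targets were selected.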

4.6 Stochastic Validation: Robustness of Effects

The stochastic validation experiment tested 80 target problems (40 hurt, 40 helped) across four translation variants each, producing the following consistency results.

Table 7. Stochastic consistency of translation effects.

Category Consistent direction Flip rate
Hurt problems (N=40) 33/40 (83%) 1/40 (2%) had at least one helpful variant
Helped problems (N=40) 22/40 (55%) 8/40 (20%) had at least one harmful variant
Overall 55/80 (69%)

The critical finding is the asymmetry between the two categories. When register translation hurts a problem, it does so robustly: across all forty hurt problems, only one (2%) had any translation variant that reversed the direction to helpful. When register translation helps a problem, the effect is fragile: eight of forty (20%) had at least one translation variant that reversed the direction to harmful. This tenfold difference in flip rates indicates that harmful register effects are a property of the register shift itself (they persist regardless of specific wording), while helpful register effects are partially a property of the specific translation (they depend on particular wording choices that happen to clarify or restructure the narrative in beneficial ways).

4.7 Fragility by Condition

The stochastic validation data also allowed comparison of fragility rates across translation conditions, where fragility was defined as standard deviation of accuracy across the four translation variants exceeding 0.15 (accuracy swinging by more than 15 percentage points depending on which translation variant was used).

Table 8. Fragility by translation condition.

Condition N Mean SD Fragile (SD > 0.15) Robust (SD < 0.05)
anglish_minimal 20 0.078 0 (0%) 1 (5%)
classical_minimal 20 0.092 3 (15%) 3 (15%)
anglish 20 0.096 3 (15%) 2 (10%)
classical_v2 15 0.133 6 (40%) 0 (0%)

The difference in variability across conditions was statistically significant (one-way ANOVA: F=2.97, p=0.038). Post-hoc pairwise comparisons revealed that the contrast driving this result was between anglish_minimal (the least fragile condition) and classical_v2 (the most fragile condition), which differed significantly (t=-3.20, p=0.003). The contrast between minimal translations as a group and full translations as a group was also significant (t=-2.04, p=0.045).

The pattern is interpretable: full register translations involve more vocabulary substitutions and more rephrasing than minimal translations, which means more opportunities for specific wording choices to interact with problem content in ways that vary across translation attempts. Classical full translations (classical_v2) are the most fragile because Latinate vocabulary has a larger inventory of synonyms than Germanic vocabulary (English has more Latinate words than surviving Old English words), which increases the variability of translation outcomes.
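Both the fragility classification and the ANOVA can be reproduced from the per-problem spreads across translation variants. The sketch below uses the Section 4.7 thresholds and the standard one-way F decomposition; it is a reconstruction under those assumptions, not the study's published code.

```python
from statistics import mean, stdev

def classify_fragility(variant_accs: list[float],
                       fragile_thresh: float = 0.15,
                       robust_thresh: float = 0.05) -> str:
    """Label one problem by the standard deviation of its accuracy across
    the four translation variants, per the Section 4.7 thresholds."""
    spread = stdev(variant_accs)
    if spread > fragile_thresh:
        return "fragile"
    if spread < robust_thresh:
        return "robust"
    return "intermediate"

def one_way_f(groups: list[list[float]]) -> float:
    """One-way ANOVA F statistic across condition groups, where each group
    holds the per-problem spreads for one translation condition."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

With the four condition groups from Table 8 as input, `one_way_f` corresponds to the F=2.97 test reported above.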

4.8 Linguistic Predictors of Fragility

To identify what makes certain problems more sensitive to translation wording than others, the linguistic features of fragile problems (standard deviation > 0.15, N=12) were compared to those of robust problems (standard deviation < 0.05, N=6) using independent-samples t-tests.

Table 9. Linguistic predictors of translation fragility.

| Feature | Fragile mean | Robust mean | p-value | Cohen's d |
|---|---|---|---|---|
| Pronoun count | 64.3 | 54.5 | 0.014 | +1.39 |
| Sentence count | 68.5 | 56.8 | 0.043 | +1.10 |
| Causal markers | 1.33 | 0.33 | 0.048 | +1.07 |
| Hedge words | 1.58 | 3.00 | 0.073 | -0.96 |
| Evidence words | 3.00 | 5.50 | 0.065 | -0.99 |

Three features reached statistical significance (p<0.05) with large effect sizes (d>0.80). Problems with more pronouns are more fragile, which is consistent with the hypothesis that pronoun-dense narratives provide more opportunities for translation to introduce or resolve ambiguity in referent tracking. Problems with more sentences are more fragile, which may reflect the accumulation of small translation perturbations across more textual units. Problems with more causal reasoning markers ("because," "therefore," "as a result") are more fragile, suggesting that the vocabulary used to express causal relationships is particularly sensitive to register substitution.
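Features of this kind can be approximated with simple lexical counts. A sketch (the word lists here are illustrative assumptions, not the study's exact lexicons):

```python
import re

# Illustrative lexicons; the study's actual feature definitions
# are not reproduced here.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "hers", "its", "their", "theirs"}
CAUSAL_MARKERS = {"because", "therefore", "thus", "hence"}

def narrative_features(text):
    """Count pronoun tokens, single-word causal markers, and sentences
    in a narrative (crude sentence split on terminal punctuation)."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "pronoun_count": sum(w in PRONOUNS for w in words),
        "causal_markers": sum(w in CAUSAL_MARKERS for w in words),
        "sentence_count": len(re.findall(r"[.!?]+", text)),
    }
```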

Two features showed nearly significant trends in the opposite direction: problems with more hedge words and more evidence words tended to be more robust. This pattern is suggestive of a trade-off between precision and ambiguity in narrative construction, where problems that explicitly hedge their claims and cite evidence may provide more redundant cues for the reasoning chain, making them less dependent on any single lexical choice.
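The effect sizes in Table 9 are pooled-standard-deviation Cohen's d values, which can be reproduced from group summaries with a few lines:

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation (group_a minus group_b,
    so positive values mean group_a scores higher)."""
    na, nb = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b)
                 / (na + nb - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
```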

4.9 Analysis by Model Tier

Disaggregation of the register effect by model capability tier reveals that stronger models show larger absolute accuracy drops from register translation than weaker models.

Table 10. Register effects by model tier.

| Model tier | Baseline | Translation drop | Clarification effect |
|---|---|---|---|
| Top (Grok models) | 74.0% | -5.2% | Mixed (-0.4% to -5.6%) |
| Mid (Gemini, DeepSeek) | 65.6% | -4.3% | Slight negative |
| Lower (Haiku, Llama, Nova) | 57.6% | -1.9% | Positive (+2.5% to +4.4%) |

This pattern admits two interpretations that are not distinguishable with the present data. The first interpretation is that stronger models are more sensitive to register because they have learned more nuanced distributional patterns from their training data, which makes them more responsive to distributional shifts introduced by register translation. The second interpretation is that weaker models simply have less accuracy to lose, because their lower baseline performance means that a larger proportion of their errors are already due to factors other than register sensitivity, which creates a floor effect that limits the detectable register impact. Distinguishing these interpretations would require a controlled comparison where the same model architecture is evaluated at different training stages or scale points, which is beyond the scope of the present study.

5. Discussion

5.1 The Failed Hypothesis

The original prediction that Latinate vocabulary would activate a "reasoning basin" in model representations and improve analytical performance was wrong in both its positive and negative forms. Latinate register did not help, and Germanic register, which was predicted to hurt or show no effect, also degraded performance. The symmetry of this result is informative because it rules out explanations that depend on directional associations between register and reasoning quality. The effect is not that "fancy words confuse models" (which would predict Classical degradation but not Anglish degradation), nor that "simple words help models think" (which would predict Anglish improvement). Instead, both directions of register shift impose a cost, which suggests that models are optimized for the natural distribution of English vocabulary and that any deviation from that distribution disrupts processing.

This interpretation is consistent with a calibration account of language model performance, in which the model's internal representations are tuned to the statistical regularities of its training distribution, and perturbations that shift the input away from that distribution impose a processing cost proportional to the magnitude of the distributional shift. Under this account, the larger effect of Classical register (-3.7%) compared to Anglish register (-2.5%) could reflect the greater distributional distance of Latinate vocabulary from the modal register of training data, rather than any specific property of Latinate vocabulary that interferes with reasoning. The natural register of the MuSR narratives falls somewhere between pure Germanic and pure Latinate (as does most natural English), and Classical translation pushes the text further from this central tendency than Anglish translation does, both because Latinate vocabulary is less frequent in conversational English and because the inventory of available Latinate substitutions is larger (producing more extensive text modification).

5.2 The Reasoning Model Caveat

All eight models evaluated in this study were standard models without explicit chain of thought reasoning architectures. Models such as OpenAI's o1 and o3, Anthropic's extended thinking mode, and DeepSeek-R1 employ internal reasoning chains that decompose complex problems into intermediate steps before producing a final answer. These reasoning models might exhibit different register sensitivity for at least two reasons.

The first reason is that the internal reasoning chain provides an opportunity for the model to normalize vocabulary during processing, effectively translating register-shifted input into a standard internal register before applying reasoning operations. If reasoning models learn to do this (an empirical question that the present study cannot answer), then the etymology tax might not apply to them, or might apply at a reduced magnitude. The second reason is that reasoning models are trained with reinforcement learning on reasoning quality, which might select for register-invariant representations as a byproduct of optimizing for consistent reasoning across the distribution of inputs encountered during training.

The absence of reasoning models from the present study is an acknowledged limitation, but it is also, potentially, the most interesting extension of the work. A finding that reasoning models are robust to register effects while standard models are not would constitute evidence that chain of thought processing develops register-invariant representations, which would be a specific and testable prediction about the cognitive architecture of reasoning in language models.

5.3 The Fragility Asymmetry

The stochastic validation experiment's most interesting finding is not the overall consistency rate (69% of problems showed consistent effects across translation variants) but the asymmetry between harmful and helpful effects. The 2% flip rate for hurt problems, compared to the 20% flip rate for helped problems, indicates that translation-induced accuracy degradation is a structural property of the interaction between register shifts and model processing, while translation-induced accuracy improvement is a contingent property of specific wording choices.

This asymmetry has practical implications for the reliability of prompt engineering strategies. A practitioner who discovers that rephrasing a prompt in a particular register improves performance on a sample of tasks cannot confidently extrapolate that finding to new tasks, because the improvement may depend on specific wording choices that happen to clarify or restructure the content in beneficial ways that will not generalize. A practitioner who discovers that a register shift hurts performance can, however, be more confident that the effect will generalize, because harmful register effects are robust across wording variants.

The asymmetry also has implications for benchmark reliability. If a benchmark's problems are phrased in a register that is suboptimal for the models being evaluated, the resulting accuracy scores will systematically underestimate model capability, and this underestimation will be robust (not an artifact of specific wording choices). Conversely, if a benchmark's phrasing happens to be particularly suited to the models, the resulting advantage will be fragile and may not replicate under alternative phrasings of the same problems.

5.4 Implications

The findings of this study bear on three practical domains in which vocabulary register varies naturally and in which language model reasoning accuracy matters.

In legal texts, the specialized vocabulary of statutory language, case law, and regulatory documents represents a systematic register shift from the conversational English on which language models are primarily trained. The present results suggest that models processing legal text may incur a register tax that reduces reasoning accuracy relative to their performance on informally phrased versions of the same reasoning problems. This does not mean that models cannot process legal text effectively, but it does mean that benchmark performance on naturalistic prompts may overestimate model capability on tasks involving legal language, and that the magnitude of this overestimation is difficult to predict without domain-specific evaluation.

In medical documentation, clinical notes and research papers employ a heavily Latinate vocabulary that the present results identify as the register most costly to reasoning accuracy. Models deployed in medical contexts may face a larger register tax than models deployed in consumer applications, and this difference would not be captured by standard benchmark evaluations that use naturalistic language.

For prompt engineering practice, the results suggest that vocabulary register should be treated as a variable to be controlled rather than optimized. The finding that both directions of register shift hurt performance implies that the optimal register for a reasoning prompt is the natural register of the training distribution, and that attempts to improve performance through formalization or simplification of vocabulary are more likely to degrade performance than to improve it.

5.5 Comparison to the Prompt Sensitivity Literature

The register effects documented in this study (2.5% to 3.7% accuracy reduction) are smaller than the extreme prompt sensitivity effects reported by Sclar et al. (2024), who found formatting-based accuracy variations of up to 76 percentage points. This comparison is not straightforward, however, because the manipulations differ in kind. Formatting changes (whitespace, punctuation) alter the surface structure of text in ways that may interact with tokenization and positional encoding, while register changes alter the vocabulary while preserving surface structure. The smaller effect size of register manipulation may reflect the fact that register shifts are within the distribution of natural language variation (English speakers routinely encounter text at various register levels), while extreme formatting changes may be outside the distribution of training data entirely.

The more apt comparison is with the prompt sensitivity work of Mizrahi et al. (2024), who found that semantically equivalent reformulations of questions could shift accuracy by 5 to 15 percentage points depending on the model and task. The register effects documented here fall at the lower end of this range, which is consistent with the observation that register changes preserve more of the original text structure than full reformulations do. The contribution of the present study relative to this literature is the identification of etymological register as a specific, linguistically grounded dimension of prompt variation with quantifiable effects and measurable robustness properties.

6. Limitations

Several limitations constrain the interpretation and generalizability of these findings. The translations were all generated by a single model (DeepSeek V3.2 for the full study, Claude Opus 4.5 for the pilot), which means that the observed register effects may reflect an interaction between the specific translation style of that model and the evaluation models, rather than a property of register shifts in general. Testing with multiple translator models would be necessary to establish that the effects are robust to translator variation.

The multiple-choice format of the MuSR murder mystery problems means that baseline accuracy is well above chance (approximately 64% across models, compared to a random baseline of approximately 33%), but the binary correct/incorrect outcome structure limits the sensitivity of the analysis. A problem that a model gets wrong on both baseline and translated conditions contributes no information about register effects, and a problem that the model gets right on both conditions similarly contributes nothing. The analysis is therefore restricted to the approximately 14% of model-problem pairs that show differential performance across conditions, which limits statistical power for detecting small effects.
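This restriction mirrors the discordant-pair logic of a McNemar-style analysis: only pairs that change outcome between conditions are informative. A sketch of the informative-pair fraction (toy outcome vectors, 1 = correct):

```python
def discordant_fraction(baseline_correct, translated_correct):
    """Fraction of (model, problem) pairs that are correct under
    exactly one of the two conditions; only these discordant pairs
    carry information about the register effect."""
    pairs = list(zip(baseline_correct, translated_correct))
    return sum(b != t for b, t in pairs) / len(pairs)
```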

The domain specificity of the MuSR murder mysteries (fictional crime narratives with specific genre conventions) means that the register effects documented here may not generalize to other domains of multi-step reasoning. Legal reasoning, scientific inference, medical diagnosis, and mathematical proof all involve domain-specific vocabularies and reasoning patterns that may interact with register effects in ways not captured by the present study.

No reasoning models were evaluated (o1, o3, DeepSeek-R1), which means that the findings apply to standard models only. As discussed in Section 5.2, reasoning models may exhibit different register sensitivity due to their internal chain of thought processing, and the absence of these models from the evaluation is the study's most significant limitation for practical applications in which reasoning models are increasingly deployed.

No human baseline was established, which means that the extent to which register effects on models parallel or diverge from register effects on human readers remains unknown. Human readers are also affected by register (processing speed and comprehension vary with text formality), but the mechanisms are different and the magnitudes may differ substantially.

The stochastic validation sample was limited to 80 problems (the 10 most hurt and 10 most helped per condition), which means that the fragility analysis is based on the tails of the effect distribution rather than a representative sample. Problems in the middle of the distribution (those showing small or inconsistent effects) were not included in the stochastic validation, and their fragility properties may differ from those of the extreme cases.

7. Future Work

The most immediate extension of this work is the evaluation of reasoning models (o1, o3, DeepSeek-R1) on the same experimental design, which would test whether chain of thought processing develops register-invariant representations. A finding that reasoning models are robust to register effects while standard models are not would provide evidence that explicit reasoning mechanisms can learn to normalize superficial linguistic variation, which would be a contribution to the understanding of what reasoning architectures add beyond standard next-token prediction. The inverse finding, that reasoning models are equally affected by register, would suggest that register sensitivity is a more fundamental property of language model processing that persists regardless of architectural innovations in reasoning.

Testing with multiple translator models would address the single translator limitation by establishing whether the observed effects are robust to variation in translation style. If different translator models produce different magnitudes of register effect for the same underlying register shift, this would indicate that the specific translation choices (rather than the register shift per se) are doing substantial work, which would complicate the interpretation of the present findings.

Extension to other domains (legal documents, medical records, scientific papers, standardized test questions) would test the generalizability of register effects beyond fictional narratives and would address the question of whether domains with naturally Latinate vocabulary (medicine, law, academia) show different register sensitivity patterns than domains with more varied vocabulary.

Analysis of reasoning chains rather than final accuracy would provide more granular information about where in the reasoning process register effects manifest. If register-shifted text causes models to make errors at specific inferential steps (such as resolving pronoun references or tracking temporal sequences) while leaving other steps unaffected, this would provide insight into which components of language model processing are sensitive to register and which are invariant to it.

8. Conclusion

The etymology tax is a small but statistically significant cost imposed on language model reasoning by deviation from the natural vocabulary distribution of English. Both Germanic and Latinate register shifts reduce accuracy on multi-step reasoning tasks by 2.5% to 3.7% (p<0.02 for both), an effect that is robust across eight models from six different providers. The harmful effects of register translation are consistent across alternative translation wordings (2% flip rate), while helpful effects are fragile (20% flip rate), indicating that translation-induced accuracy degradation is a structural property of the register shift while translation-induced accuracy improvement is contingent on specific wording choices.

The finding that both directions of register shift hurt performance disconfirms the hypothesis that etymological register functions as a directional signal in model representations, drawing models toward or away from a reasoning mode associated with the training data correlates of academic or conversational vocabulary. Instead, the results support a calibration account in which models are optimized for the natural distribution of English and incur a processing cost proportional to the distributional distance of the input from that optimum. The practical consequence is that vocabulary register should be treated as an uncontrolled variable in language model evaluation and as a potential source of systematic error in applications that involve text at registers distant from conversational English, including legal, medical, and academic domains.

The stochastic validation methodology introduced in this study, which tests the robustness of translation effects across multiple independently generated translation variants, provides a tool for assessing whether observed effects of text manipulation are structural properties of the manipulation or contingent properties of specific wording choices. The application of this methodology to the register effects documented here revealed an asymmetry between harmful and helpful effects that would not have been detected by the primary analysis alone, and which has implications for the reliability of prompt engineering strategies and benchmark evaluations that depend on vocabulary choices.

References

Andreas, J. (2022). Language models as agent models. Findings of the Association for Computational Linguistics: EMNLP 2022.

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623.

Bender, E. M., & Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185-5198.

Biber, D. (1988). Variation across speech and writing. Cambridge University Press.

Biber, D., & Egbert, J. (2018). Register variation online. Cambridge University Press.

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Kang, D., & Hovy, E. (2021). Style is NOT a single variable: Case studies for cross-stylistic language understanding. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2376-2387.

Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., & Wattenberg, M. (2023). Emergent world representations: Exploring a sequence model trained on a synthetic task. ICLR 2023.

Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2022). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 8086-8098.

Marks, S., Lindsey, J., & Olah, C. (2026). Persona selection in large language models. Anthropic Research.

Mizrahi, M., Kaplan, G., Malber, D., Dahan, R., Choshen, L., Shmidman, A., & Stanovsky, G. (2024). State of what art? A call for multi-prompt LLM evaluation. Transactions of the Association for Computational Linguistics, 12, 933-949.

Nanda, N., Chan, L., Liberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. ICLR 2023.

Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2024). Quantifying language models' sensitivity to spurious features in prompt design with a systematic approach. Transactions of the Association for Computational Linguistics, 12, 1580-1600.

Shanahan, M., McDonell, K., & Reynolds, L. (2023). Role play with large language models. Nature, 623, 493-498.

Sprague, Z., Yin, F., Rodriguez, J. D., Jiang, D., Wadhwa, S., Singhal, P., Zhao, X., Ye, X., Marasovic, A., & He, J. (2024). MuSR: Testing the limits of chain of thought with multistep soft reasoning. ICLR 2024 Spotlight.

Zhao, Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate before use: Improving few-shot performance of language models. Proceedings of the 38th International Conference on Machine Learning, 12697-12706.

Appendix A: Data Summary

| Category | Count |
|---|---|
| Problems (main study) | 250 |
| Models | 8 |
| Conditions | 6 |
| Main study evaluations | 11,958 (valid) |
| Stochastic target problems | 80 |
| Stochastic translation variants | 4 per problem |
| Stochastic evaluations | ~2,500 |
| Total translations generated | ~1,500 |

Appendix B: 95% Wilson Confidence Intervals

| Condition | Accuracy | 95% CI |
|---|---|---|
| Baseline | 64.2% | [62.1%, 66.3%] |
| Clarified baseline | 64.7% | [62.6%, 66.8%] |
| Anglish (full) | 60.6% | [58.5%, 62.7%] |
| Anglish minimal | 61.7% | [59.6%, 63.8%] |
| Classical (full) | 61.7% | [59.5%, 63.8%] |
| Classical minimal | 60.6% | [58.4%, 62.7%] |
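The intervals above use the Wilson score construction, which can be checked with a few lines (a minimal implementation; z=1.96 gives the 95% level):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion
    (95% coverage for z = 1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    halfwidth = (z / denom) * math.sqrt(p * (1 - p) / n
                                        + z * z / (4 * n * n))
    return center - halfwidth, center + halfwidth
```

Unlike the normal-approximation interval, the Wilson interval stays within [0, 1] and behaves sensibly for proportions near the boundaries.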