The Etymology Tax: How Word Origins Break LLM Reasoning

AI Summary (Claude Opus)

TL;DR: Translating reasoning benchmarks into either purely Germanic or purely Latinate vocabulary reduces LLM accuracy by 2.5-3.7%. The degradation is robust across wording variants while any improvements are fragile flukes, suggesting that models are calibrated to the natural distribution of English vocabulary rather than to meaning.

Key Points

  • Both simplifying (Germanic) and formalizing (Latinate) the vocabulary of reasoning tasks hurts LLM accuracy, ruling out the hypothesis that academic register activates a 'reasoning mode' in model representations.
  • The harmful effects of register shifts are robust: only 2% of hurt problems reversed direction across alternative translations, while 20% of helped problems did — meaning degradation is structural but improvement is a fluke of specific wording.
  • Stronger models (Grok: -5.2%) paid a larger etymology tax than weaker models (Haiku, Llama, Nova: -1.9%), suggesting that more capable models have learned more nuanced distributional patterns and are therefore more sensitive to distributional shifts.

The post reports on a controlled experiment testing whether the etymological register of vocabulary in reasoning tasks affects LLM accuracy. All 250 problems in the murder mystery split of the MuSR benchmark (ICLR 2024 Spotlight) were translated into Germanic-only (Anglish) and Latinate-only (Classical) registers and evaluated across eight models in a 2x3 factorial design crossing register with clarification level. The original hypothesis predicted that Classical register would improve accuracy by activating a 'reasoning basin' in latent space, while Germanic register would degrade it. Both predictions were wrong: Classical vocabulary reduced accuracy by 3.7% (p=0.003) and Germanic by 2.5% (p=0.019). A stochastic validation generating alternative translations at higher temperature revealed a critical asymmetry: harmful register effects are robust across wording variants (2% flip rate) while helpful effects are fragile (20% flip rate). The post discusses implications for prompt engineering, benchmark reliability, and the gap between pattern matching and comprehension in language models.


English has two vocabularies. Not metaphorically. The Norman Conquest and the Renaissance left English with parallel word sets that map the same meanings: “begin” and “commence,” “ask” and “inquire,” “buy” and “purchase,” “understand” and “comprehend.” Most speakers switch between them unconsciously, selecting register to match context. A doctor says “the patient expired” at the hospital and “grandpa died” at the dinner table. Same meaning, different signal.

I wanted to know whether that signal matters to language models. Not whether they can detect register (they obviously can) but whether the etymological origins of words in a reasoning task affect whether the model gets the answer right.

I had a specific hypothesis, and it was wrong.

The Hypothesis

The prediction was simple and, in retrospect, too neat. Language models are trained on text where vocabulary register correlates with content type. Latinate words cluster with academic papers, technical documentation, legal briefs, scientific analysis. Germanic words cluster with everyday conversation, blog posts, casual writing, practical instruction. If models have learned these associations deeply enough, then prompting in Classical register might activate something like a “reasoning mode,” drawing the model toward a region of its internal representations associated with careful, analytical processing. Germanic vocabulary would pull the other way, toward a mode associated with informal, less rigorous thinking.

The prediction: Classical register improves accuracy, Anglish register degrades it.

The actual result: both degrade it. Classical vocabulary reduced accuracy by 3.7% (p=0.003). Germanic vocabulary reduced it by 2.5% (p=0.019). Neither direction helped. The models are not drawn to a reasoning basin by academic vocabulary; they are pushed off their calibration by any deviation from the natural distribution of English words.

The full paper contains the complete methodology, data tables, and statistical analysis. This post covers the story of the research and what the results mean.

The Experiment

Pilot: Math Problems

The research started on GSM8K, a set of grade school math word problems. I translated 50 problems into Anglish (exclusively Germanic vocabulary) and Classical (exclusively Latinate vocabulary) using Claude Opus 4.5 as the translator, then ran three models on all three versions. Classical register showed a consistent accuracy drop across all models (average: -4%), but the pilot was confounded. The translator was producing semantic drift: “kittens” became “felines,” which is not a register shift but a category change. And GSM8K was too easy for the models I was testing (baseline accuracy above 90%), leaving little room to detect degradation.
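Semantic drift of this kind can be caught mechanically. The sketch below is a hypothetical filter, not part of the study, that flags translations losing numbers or capitalized names, the invariants a pure register shift should preserve. Capitalization is only a rough proxy for proper nouns (sentence-initial words will also match), so this catches lost names and figures but not swaps like “kittens” to “felines.”

```python
import re

def invariants(text):
    """Tokens a register translation should preserve: numbers and
    capitalized words (a rough proxy for proper nouns)."""
    numbers = set(re.findall(r"\d+(?:\.\d+)?", text))
    names = set(re.findall(r"\b[A-Z][a-z]+\b", text))
    return numbers, names

def drift_flags(original, translated):
    """Return invariant tokens present in the original but missing
    from the translation; non-empty lists signal possible drift."""
    o_num, o_names = invariants(original)
    t_num, t_names = invariants(translated)
    return {"numbers_lost": sorted(o_num - t_num),
            "names_lost": sorted(o_names - t_names)}
```

A translation that survives this check can still drift semantically, so it is a cheap first pass rather than a guarantee.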

The pilot was informative enough to motivate the real study, but not clean enough to draw conclusions from.

Full Study: Murder Mysteries

For the full experiment, I switched to the murder mystery split of MuSR (Multistep Soft Reasoning), a benchmark of 756 reasoning narratives across three domains (murder mysteries, object placements, team allocations) that was presented as an ICLR 2024 Spotlight paper. The murder mystery split comprises 250 problems. Each problem is roughly 1,000 words describing a crime scene, suspects, alibis, evidence, and timelines, followed by a question asking who the most likely murderer is. The reasoning required spans multiple inferential steps, and baseline model accuracy sits around 64%, giving substantial room for movement in both directions.

The design tested three vocabulary registers (baseline, Germanic, Latinate) at two clarification levels (minimal and full). The clarification dimension was added to disentangle vocabulary effects from rephrasing effects, since translation inevitably involves some restructuring: minimal conditions changed only vocabulary, while full conditions allowed natural rephrasing. Eight models from seven providers were evaluated across all conditions, totaling nearly 12,000 evaluations.
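The post does not publish its evaluation harness, so the following is only a minimal sketch of what the 2x3 grid implies: condition names, the per-cell accuracy aggregation, and the evaluation count. All identifiers are assumptions.

```python
# Hypothetical reconstruction of the evaluation grid (not the study's code).
REGISTERS = ["baseline", "germanic", "latinate"]   # vocabulary register
LEVELS = ["minimal", "full"]                       # clarification level
MODELS = 8
PROBLEMS = 250

def cell_accuracy(results, register, level):
    """Mean accuracy over all models and problems for one design cell.
    `results[(model, register, level, problem)]` is True when the model
    named the correct murderer under that condition."""
    cell = [results[(m, register, level, p)]
            for m in range(MODELS) for p in range(PROBLEMS)]
    return sum(cell) / len(cell)

# 3 registers x 2 levels x 8 models x 250 problems = 12,000 evaluations
TOTAL_EVALS = len(REGISTERS) * len(LEVELS) * MODELS * PROBLEMS
```

Note that the grid multiplies out to exactly 12,000 cells, consistent with the post's "nearly 12,000 evaluations" once occasional API failures are discounted.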

Here is what the three registers look like on the same passage:

Baseline: “In an adrenaline inducing bungee jumping site, Mack’s thrill-seeking adventure came to a gruesome end by a nunchaku; now, it’s up to Detective Winston to unravel the deadly secrets between Mackenzie and Ana.”

Germanic: “At a heart-racing bungee leap grounds, Mack’s daredevil outing came to a grisly end by a nunchaku; now, it falls to Sleuth Winston to unravel the deathly hidden truths between Mackenzie and Ana.”

Latinate: “In an adrenaline-inducing bungee-jumping locale, Mack’s audacious escapade culminated in a gruesome demise by nunchaku; now, it fell to Detective Winston to unravel the lethal enigma betwixt Mackenzie and Ana.”

Same story. Same structure. Same proper nouns. Different words.

The Numbers

Comparison                    Accuracy Change   p-value
Classical vs. baseline        -3.7%             0.003
Anglish vs. baseline          -2.5%             0.019
Clarification vs. baseline    +0.5%             0.668

Both register shifts hurt. Clarification did nothing overall (though weaker models benefited while stronger models were slightly harmed). Across all models and conditions, approximately 70 more problems were hurt by translation than were helped.
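The post does not name its significance test. One standard choice for paired per-problem correctness is an exact McNemar test on the discordant problems (those hurt vs. those helped), sketched here with the standard library under that assumption:

```python
from math import comb

def exact_mcnemar_p(n_hurt, n_helped):
    """Two-sided exact McNemar test on discordant problem pairs.
    n_hurt: baseline correct but translated version wrong;
    n_helped: the reverse. Under the null hypothesis that register
    has no effect, each discordant pair is a fair coin flip."""
    n = n_hurt + n_helped
    k = min(n_hurt, n_helped)
    # Tail probability P(X <= k) for X ~ Binomial(n, 0.5), doubled
    # for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant problems carry information under the null; problems answered identically in both conditions drop out of the test entirely.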

The Surprise

The result that both directions hurt is the finding that matters, because it eliminates the simplest explanations.

It is not that “fancy words confuse models.” If that were the mechanism, Anglish (simpler words) should have helped, or at least been neutral. It is not that “simple words lack precision.” If that were the mechanism, Classical (more precise vocabulary) should have helped.

What the data suggest is that models are calibrated to the natural distribution of English vocabulary, and any deviation from that distribution imposes a cost. Natural English contains a mixture of Germanic and Latinate words in proportions that reflect centuries of organic usage. Pushing the mixture toward either extreme disrupts the statistical patterns the models have learned to exploit.

There is an additional pattern worth noting. Stronger models paid a larger tax. The strongest models (Grok) showed an average translation drop of -5.2%, while the weakest models (Haiku, Llama, Nova) showed only -1.9%. One interpretation is that stronger models have learned more nuanced distributional patterns and are therefore more sensitive to distributional shifts. Another interpretation is that weaker models have less accuracy to lose. The data cannot distinguish these accounts.

The Fragility Twist

After the main study, I ran a stochastic validation. The concern was straightforward: maybe the observed effects were artifacts of the specific wording chosen by the translator, not properties of the register shift itself. To test this, I generated three additional translations at temperature 0.7 (introducing randomness) for the 80 problems that showed the strongest effects (the 10 most hurt and 10 most helped in each of four conditions), then re-evaluated all eight models on all variants.

The result was the most interesting finding in the entire study.

When translation hurts a problem, it hurts consistently. Only 1 out of 40 hurt problems (2%) had any translation variant that reversed the direction to helpful. The harmful effect is not an artifact of specific wording; it is a structural property of the register shift.

When translation helps a problem, it is a fluke. 8 out of 40 helped problems (20%) had at least one translation variant that reversed the direction to harmful. The helpful effect depends on specific wording choices that happen to clarify or restructure the narrative in beneficial ways.
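The flip-rate check reduces to a sign test. Assuming per-problem accuracy deltas for the original translation and its three temperature-0.7 variants (a hypothetical data layout, since the post does not specify one), the computation is:

```python
def flip_rate(problems):
    """Fraction of problems where any regenerated translation variant
    reverses the sign of the original effect.
    `problems` maps problem id -> (original_delta, [variant_deltas]),
    where delta = accuracy(translated) - accuracy(baseline)."""
    flipped = sum(
        1 for delta, variants in problems.values()
        if any(delta * v < 0 for v in variants)   # opposite signs flip
    )
    return flipped / len(problems)
```

Run separately over the 40 hurt and 40 helped problems, this yields the 2% vs. 20% asymmetry the post reports.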

This asymmetry matters for prompt engineering. If you discover that rephrasing a prompt in a particular vocabulary style improves performance on a sample of tasks, the improvement is probably fragile: it depends on specific word choices that may not generalize. If you discover that a vocabulary shift hurts performance, the degradation is probably robust and will persist regardless of how you phrase the shift.

The fragility also varied by condition. Latinate full translations were the most fragile (40% of problems classified as fragile), while Germanic minimal translations were the most stable (0% fragile). The difference was statistically significant (ANOVA F=2.97, p=0.038). More vocabulary substitutions mean more opportunities for specific wording choices to matter, and English has a larger inventory of Latinate synonyms than surviving Germanic ones, which amplifies the variability.
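For the condition comparison, a one-way ANOVA F statistic over per-condition fragility scores can be computed with the standard library alone. This is a generic sketch, not the study's analysis code:

```python
def anova_f(groups):
    """One-way ANOVA F statistic. `groups` is a list of lists of
    scores, one list per condition (e.g. per-problem fragility)."""
    k = len(groups)                              # number of conditions
    n = sum(len(g) for g in groups)              # total observations
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group variation: how far condition means sit from the grand mean.
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Within-group variation: spread of scores around their own condition mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

Converting F to a p-value additionally requires the F distribution's survival function (e.g. scipy.stats.f.sf), omitted here to keep the sketch dependency-free.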

What I Did Not Test

The most important limitation is the absence of reasoning models. All eight models in this study were standard models. Models like o1, o3, and DeepSeek-R1 use internal chain-of-thought reasoning that decomposes problems into intermediate steps before answering. These models might be robust to the etymology tax for a specific and interesting reason: the internal reasoning chain provides an opportunity to normalize vocabulary, effectively translating shifted input into a standard internal register before applying reasoning operations.

If reasoning models turn out to be invariant to register, that would be evidence that chain-of-thought architectures learn to abstract away superficial linguistic variation, which would say something specific about what explicit reasoning adds beyond pattern matching. If reasoning models are equally affected, that would suggest register sensitivity is a more fundamental property of language model processing. Either finding would be worth knowing. I did not have the budget or API access to test it.

Other limitations: single translator model per study phase (Claude Opus 4.5 for the pilot, DeepSeek V3.2 for the full study), single reasoning domain (murder mysteries), no human baseline, and a stochastic validation limited to the 80 most extreme problems rather than a representative sample. The full paper discusses each of these in detail.

The Uncomfortable Part

I wanted the Classical register to win. The hypothesis was elegant: Latinate vocabulary activates a reasoning basin, Germanic vocabulary pulls toward casual pattern matching, and the etymological history of your word choices quietly steers the model’s cognition. If that were true, it would mean something interesting about how models organize knowledge, and it would hand prompt engineers a useful lever. Swap “begin” for “commence” and get better reasoning for free.

Instead, both directions hurt. The lever does not exist. What exists is a tax.

Vocabulary register should not matter for reasoning accuracy if models understand content rather than matching patterns. The word “commence” does not change the logical structure of a murder mystery. The word “begin” does not add or remove information. The etymological history of the vocabulary is semantically invisible; it is a property of the words’ provenance, not their meaning.

The fact that register affects accuracy by 2.5% to 3.7%, that the effect is statistically significant across eight models, and that the harmful direction of the effect is robust across multiple independently generated translations tells us something we would rather not know: models are not reasoning about content in a way that is invariant to the surface form of that content. They are sensitive to the statistical texture of the input in ways that have nothing to do with the informational content of the input, and this sensitivity is not random noise. It is a systematic, directional, replicable phenomenon.

The immediate practical consequence is narrow: if you are evaluating language models on tasks that involve text at nonstandard registers (legal, medical, academic), your benchmark results may overestimate capability, because the models were calibrated on text at a different register than the one you are deploying them into. The broader implication is harder to contain. Vocabulary register joins the growing list of surface features (formatting, example ordering, prompt template, label names) that measurably affect model behavior in ways that a system operating on meaning rather than form should not be affected by. Each item on this list makes the space between pattern matching and comprehension a little harder to ignore.
