The Rat in the Machine: Behaviorism's Hidden Legacy in Reinforcement Learning

AI Summary (Claude Opus)

TL;DR: The intellectual lineage from behaviorist psychology to modern reinforcement learning is direct and documented, yet the field's founders explicitly distance themselves from behaviorism — a disavowal complicated by the fact that training large language models via RLHF functionally reproduces the black-box methodology behaviorism was rejected for.

Key Points

  • Reinforcement learning emerged from two independent threads — psychological trial-and-error learning (Thorndike, Skinner, Tolman) and mathematical optimal control (Bellman's dynamic programming) — that converged in the 1980s through a chain of personal connections between Klopf, Sutton, and Barto.
  • The field's founders acknowledge RL's psychological heritage while explicitly rejecting behaviorism's epistemological constraint of studying only observable inputs and outputs, yet training neural networks via RLHF recreates precisely that black-box methodology at scale.
  • The post-hoc discovery that dopamine neurons encode reward prediction error — mathematically equivalent to temporal difference learning — validated the convergence between behavioral psychology and computational RL, though the direction of influence ran from behavior to algorithm to biology, not the reverse.

The post traces reinforcement learning's dual heritage: a psychological thread running from Thorndike's puzzle boxes through Skinner's operant conditioning and Tolman's cognitive maps, and a mathematical thread originating in Bellman's dynamic programming at the RAND Corporation. These threads converged in the 1980s when Richard Sutton, a Stanford psychology graduate influenced by Harry Klopf's theory of hedonistic neurons, formalized temporal difference learning under Andrew Barto at UMass Amherst. The post examines what it calls the field's central tension: Sutton and Barto explicitly disclaim behaviorism's black-box methodology while acknowledging its conceptual contributions, yet large-scale RLHF training — where prompts serve as stimuli, completions as responses, and human preferences as reinforcement — functionally reproduces that same methodology. The post concludes that behaviorism provided RL's problem formulation rather than its algorithms, and that the perceived anticipation is partly retrospective illusion, with the genuine intellectual debt coexisting alongside fundamental methodological divergence.

Reinforcement learning is the process by which an agent learns to act in an environment by receiving rewards and punishments for its behavior. The definition sounds modern, computational, precise. It also describes, almost word for word, what Edward Thorndike observed in 1898 when he put hungry cats in puzzle boxes and watched them stumble onto the latch or loop of string that opened the box and let them reach food. Any behavior followed by a satisfying result, he found, is more likely to be repeated. He would later name this the Law of Effect. We call it a reward signal. The math is different. The principle is identical.

This is not merely a metaphor or a loose analogy. The intellectual lineage from Skinner boxes and Hull’s behavioral equations to modern RL algorithms is direct, documented, and acknowledged by the people who built the field. Richard Sutton, who formalized temporal difference learning in 1988, has a bachelor’s degree in psychology from Stanford. He found his way to reinforcement learning through Harry Klopf, a scientist at the Air Force Cambridge Research Laboratories at Hanscom Field who argued in the 1970s that neurons are hedonists, seeking rewards. Sutton read Klopf’s work, met him for lunch, and joined a research project at UMass Amherst designed to test Klopf’s ideas about reward and learning. That project, initially directed by Michael Arbib, William Kilmer, and Nico Spinelli, is where Sutton began working alongside Andrew Barto, who later supervised his PhD. It became the foundation of modern reinforcement learning.

But here is where the story gets uncomfortable. Sutton and Barto explicitly distance their work from behaviorism.

The Dual Heritage

Reinforcement learning did not descend from a single tradition. It emerged from two threads that developed independently for decades before someone thought to braid them together.

The first thread is psychological. Trial and error learning. Thorndike’s cats escaping puzzle boxes. Skinner’s rats in operant conditioning chambers, where the rate of lever pressing could be precisely controlled by manipulating reinforcement schedules. Tolman’s rats building cognitive maps of mazes even when no reward was offered, demonstrating that learning happens in the absence of reinforcement, which strict behaviorists said was impossible. This thread gave RL its problem formulation: an agent in an environment, learning through interaction, whether or not explicit consequences are present.

The second thread is mathematical. Richard Bellman at the RAND Corporation in the early 1950s, developing dynamic programming as an approach to optimal control. Bellman introduced the concept of a value function, formalized the Markov Decision Process framework (publishing “A Markovian Decision Process” in 1957, with Ronald Howard’s 1960 “Dynamic Programming and Markov Processes” extending the treatment), and derived the recursive equation that bears his name, which defines the optimal value of a state in terms of immediate rewards and the discounted value of successor states. In a frequently cited anecdote, Bellman later recounted that he chose the name “dynamic programming” partly to shield his work from Secretary of Defense Charles Wilson, who had what Bellman described as “a pathological fear and hatred of the word research.” The name was, in Bellman’s telling, “something not even a congressman could object to.” This thread gave RL its mathematical machinery: Bellman equations, value functions, policy optimization.
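To make the recursion concrete, here is a minimal value-iteration sketch on a toy two-state MDP. Every number in it (states, rewards, transition probabilities) is invented for illustration:

```python
# Value iteration on a toy two-state MDP (all numbers invented for illustration).
# Bellman optimality: V*(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V*(s') ]

GAMMA = 0.9
STATES, ACTIONS = range(2), range(2)
# P[s][a][s'] = transition probability, R[s][a] = immediate reward
P = [[[0.8, 0.2], [0.5, 0.5]],
     [[0.1, 0.9], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]

V = [0.0, 0.0]
for _ in range(1000):  # apply the Bellman operator until it reaches a fixed point
    V_new = [max(R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in STATES)
                 for a in ACTIONS)
             for s in STATES]
    if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < 1e-12:
        break
    V = V_new

# Greedy policy with respect to the converged value function
policy = [max(ACTIONS, key=lambda a: R[s][a] +
              GAMMA * sum(P[s][a][t] * V[t] for t in STATES))
          for s in STATES]
```

Each sweep replaces every state's value with the best one-step lookahead; because the operator is a contraction, the loop converges to the unique fixed point regardless of initialization.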

The two threads ran in parallel for thirty years. Operations research developed its techniques under names like “dynamic programming” and “Markov decision processes.” Psychology called the underlying phenomenon “operant conditioning.” Few researchers worked across both traditions, and the connections were slow to develop. Minsky noted the overlap as early as 1961, Andreae in 1969, and Werbos in 1977, but the vocabulary gap kept the two communities largely separate.

Hull’s Beautiful Failure

Clark Hull published “Principles of Behavior” in 1943, one of the most cited books in psychology during that decade. He attempted something radical for the social sciences: a complete mathematical theory of behavior. Across his 1943 book and its successor, 1952’s “A Behavior System,” Hull developed equations for what he called reaction potential, combining factors like habit strength, drive, incentive motivation, and stimulus clarity multiplicatively, subtracting inhibitory factors, and adding random noise. The equations were meant to predict the probability and speed of a response, which resembles, in broad structural terms, a value function in RL where state values are computed from combinations of rewards, discounts, and learned weights.
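For readers who want the shape of it, Hull's reaction-potential formula, as standardly reconstructed in textbook presentations of his 1952 system (notation varies across sources), looks roughly like:

```latex
{}_{s}E_{R} = {}_{s}H_{R} \times D \times V \times K - \left({}_{s}I_{R} + I_{R}\right) \pm {}_{s}O_{R}
```

where ${}_sH_R$ is habit strength, $D$ drive, $V$ stimulus-intensity dynamism, $K$ incentive motivation, ${}_sI_R$ and $I_R$ conditioned and reactive inhibition, and ${}_sO_R$ behavioral oscillation, the random-noise term.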

Hull’s system was the dominant learning theory from the 1940s through the 1950s. Then it collapsed. Researchers found that the equations failed to predict behavior reliably even as terms were added and parameters retuned. A system this simple could not account for the complexity of animal behavior, let alone human behavior. By the mid-1960s, Hull’s mathematical behaviorism was largely abandoned.

I find the failure instructive. Hull’s equations failed because they attempted closed-form prediction of behavior from a handful of variables. Reinforcement learning succeeds because it replaces closed-form solutions with iterative approximation. Both attempt to mathematize the relationship between stimulus, response, and reward. Hull tried to write the answer. RL learned to converge toward it. The insight was correct. The method was premature.

Skinner’s Box and the Agent Environment Loop

Skinner’s operant conditioning chamber is, functionally, a reinforcement learning environment. An animal (the agent) exists in a constrained space (the environment) where specific actions (lever pressing, key pecking) produce consequences (food delivery, shock avoidance). The experimenter controls the reward function. The animal learns a policy.

The mapping is more than approximate:

Operant Conditioning        Reinforcement Learning
Stimulus                    State
Response                    Action
Reinforcement               Reward signal
Reinforcement schedule      Reward structure
Extinction                  Reward omission

Charles Ferster and Skinner’s 1957 work on schedules of reinforcement revealed that the pattern of reward delivery mattered as much as the reward itself. Variable ratio schedules (reinforcement after an unpredictable number of responses) produced high response rates resistant to extinction. Fixed interval schedules produced characteristic scalloping, with response rates accelerating as the interval elapsed. These findings describe how reward structure shapes learned behavior, which is precisely what RL researchers study when they design reward functions.
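Ferster and Skinner’s schedules translate directly into reward-function design. Here is a hypothetical sketch (the function names and parameters are mine, not any standard library’s) contrasting a fixed-ratio schedule, which pays out deterministically, with a variable-ratio schedule, which pays out with fixed probability:

```python
import random

def fixed_ratio(n):
    """Reward every n-th response, like an FR-n schedule."""
    count = 0
    def reward(_response):
        nonlocal count
        count += 1
        if count == n:
            count = 0
            return 1.0
        return 0.0
    return reward

def variable_ratio(n, rng=random.Random(0)):
    """Reward each response with probability 1/n: on average every n-th
    response, but unpredictably, like a VR-n schedule."""
    def reward(_response):
        return 1.0 if rng.random() < 1.0 / n else 0.0
    return reward

fr = fixed_ratio(5)
rewards = [fr("lever press") for _ in range(20)]
# FR-5 pays out exactly on presses 5, 10, 15, and 20
```

Same average payout rate, entirely different predictability, which is the variable Ferster and Skinner showed the animals were sensitive to.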

The Skinner box is also, inadvertently, a demonstration of why simple environments produce clean learning and complex environments don’t. A rat in a box with one lever faces a tractable optimization problem. The same rat in the wild faces a problem so large that trial and error becomes infeasible. This scaling problem is the central challenge of reinforcement learning, and it’s the same problem that killed Hull’s equations: real behavior is too complex for the frameworks that work in constrained settings.

The Heretic and the Cognitive Map

Tolman complicates the narrative. His experiments with latent learning in the 1930s demonstrated that rats could learn the layout of a maze without any reinforcement at all. After ten sessions of unrewarded exploration, rats that were then offered food navigated the maze almost as quickly as rats that had been rewarded from the start, catching up within a session or two. They had built internal representations during the unrewarded trials, what Tolman would formally conceptualize in 1948 as cognitive maps.

This was heresy. Strict behaviorism held that learning required reinforcement, that internal mental processes were either nonexistent or irrelevant, and that the only proper object of scientific study was observable behavior. Tolman’s cognitive maps implied internal representations, and internal representations were exactly what behaviorism defined itself against.

In reinforcement learning, Tolman’s heresy became orthodoxy. The distinction between model-free RL (direct state-to-action mappings, closer to Skinner’s stimulus-response framework) and model-based RL (agents that build internal models of their environment, closer to Tolman’s cognitive maps) is one of the field’s fundamental divisions. Including Tolman in the “behaviorist heritage of RL” is accurate only if you acknowledge that he was a behaviorist who broke behaviorism.
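The two sides of that division can be written side by side. In this toy sketch (all names and numbers invented), the model-free agent adjusts its action values directly from experience, while the model-based agent stores a learned model of the environment and replays simulated transitions from it, in the spirit of Dyna-style planning:

```python
import random

# Model-free: tabular Q-learning. Adjust Q(s, a) directly from experienced
# rewards, never representing the environment itself (the Skinner side).
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Model-based: learn a transition/reward model, then plan over it
# (the Tolman side). The toy model here is deterministic.
def model_update(model, s, a, r, s_next):
    model[(s, a)] = (r, s_next)

def plan(Q, model, n_steps=50, alpha=0.1, gamma=0.9, rng=random.Random(0)):
    # Replay transitions sampled from the learned model as simulated experience
    for _ in range(n_steps):
        (s, a), (r, s_next) = rng.choice(list(model.items()))
        q_update(Q, s, a, r, s_next, alpha, gamma)
```

The model-based agent can keep improving its values without touching the real environment, which is the computational analogue of Tolman's rats learning the maze before any food appeared.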

The Synthesis

The threads converged in the 1980s through a chain of personal connections that feels more like intellectual accident than inevitable progress.

Harry Klopf, working at an Air Force base, published “The Hedonistic Neuron” in 1982, arguing that supervised learning was insufficient for AI or for explaining intelligence. What was needed was trial and error learning driven by hedonic impulses, the drive to achieve desired outcomes and avoid undesired ones. Neurons, he claimed, are hedonists seeking rewards.

Richard Sutton, a psychology graduate from Stanford, found Klopf’s work and joined the research project at UMass where he began working with Barto. By 1981, Barto and Sutton had shown that an adaptive temporal learning model could explain certain behaviors that the existing Rescorla-Wagner model couldn’t. By 1988, Sutton had formalized this line of work into the temporal difference learning algorithm. By 1989, Chris Watkins had published his Cambridge PhD thesis on Q-learning, which fully integrated the trial and error thread from psychology, the optimal control thread from operations research, and the temporal difference thread from AI into what we recognize as modern reinforcement learning.
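Sutton's formalization reduces to a single update rule: move the value of the current state toward the reward received plus the discounted value of the state that follows. A minimal sketch on an invented three-state chain:

```python
# TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
# Toy chain: states 0 -> 1 -> 2 (terminal), reward 1.0 on reaching the end.

def td0_episode(V, alpha=0.1, gamma=0.9):
    for s, r, s_next in [(0, 0.0, 1), (1, 1.0, 2)]:  # one pass down the chain
        delta = r + gamma * V[s_next] - V[s]          # the TD error
        V[s] += alpha * delta

V = [0.0, 0.0, 0.0]  # V[2] is terminal and stays 0
for _ in range(1000):
    td0_episode(V)
# V[1] converges toward 1.0 and V[0] toward gamma * V[1] = 0.9
```

Notice that state 0 is never rewarded directly: its value is learned entirely by bootstrapping from the prediction at state 1, which is what distinguishes temporal difference learning from simple reward averaging.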

Sutton and Barto’s textbook “Reinforcement Learning: An Introduction” acknowledges all of this history. The second edition includes chapters on RL’s relationship to psychology and neuroscience, describing close and widely acknowledged links between classical conditioning and temporal difference learning.

Sutton has noted elsewhere that formally, RL is unrelated to behaviorism, or at least to the aspects of behaviorism that are widely viewed as undesirable.

The Disclaimer That Reveals Everything

What are the “aspects that are widely viewed as undesirable”? Behaviorism’s refusal to consider internal mental processes. Its insistence that the only legitimate objects of study were observable inputs and outputs. Its black box methodology.

Reinforcement learning, as Sutton has put it, is “all about the algorithms and processes going on inside the agent.” Value functions, policy networks, internal representations, learned models of the environment. Everything the behaviorists refused to look at.

So the field’s founders acknowledge the psychological heritage while explicitly rejecting the psychological methodology. RL inherited the problem formulation (agents learning from rewards in environments) while discarding the epistemological constraint (you can only study what you can observe from outside). This is not a contradiction. It is precisely how intellectual progress works: you take the insight and discard the ideology that surrounded it.

But I think the disclaimer reveals something the field would rather not examine too carefully. If RL is “all about the algorithms and processes going on inside the agent,” then what happens when the agent is a neural network with 70 billion parameters and nobody can explain what the internal processes actually are? When we train a large language model through RLHF, we are, functionally, running a very expensive Skinner box. The stimulus is a prompt. The response is a completion. The reinforcement is a human preference rating. We cannot see inside the box. Interpretability researchers are trying to open it, but the RLHF practitioner works from the outside: observing inputs and outputs and adjusting the reward function.
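The structural claim is easy to write down. This is a schematic of the RLHF outer loop, not any framework's real API; every name in it is a placeholder:

```python
# Schematic RLHF loop, read as a Skinner box: prompt = stimulus,
# completion = response, preference-derived score = reinforcement.
# All names are placeholders; no real training framework is assumed.

def rlhf_step(policy, reward_model, prompts, update_policy):
    completions = [policy.generate(p) for p in prompts]   # responses
    rewards = [reward_model.score(p, c)                   # reinforcement
               for p, c in zip(prompts, completions)]
    # The practitioner observes only (prompt, completion, reward) triples;
    # the policy's internal computation stays a black box.
    update_policy(policy, prompts, completions, rewards)
    return rewards
```

Nothing in the loop's signature refers to anything inside the policy, which is exactly the point: the interface is stimulus, response, and consequence.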

The methodology the field explicitly rejected has become, at scale, the methodology the field implicitly practices.

The Biological Punchline

In the 1990s, neuroscientists discovered that dopamine neurons in the ventral tegmental area and substantia nigra encode reward prediction error, the difference between predicted and received rewards. More reward than expected activates the neurons. Expected reward produces baseline activity. Less reward than expected depresses activity.

The pattern closely matches temporal difference learning. The brain implements, in biological hardware, something that looks remarkably like the algorithm Sutton formalized in 1988. Evolution arrived at the same solution that computer scientists did, which either means the algorithm is fundamental or means that both systems stumbled into the same local optimum. I’m not sure which possibility is more interesting.
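The three dopamine response patterns are just the three signs of the temporal difference error. A minimal sketch with invented numbers:

```python
def td_error(reward, v_current, v_next=0.0, gamma=0.9):
    """Reward prediction error: delta = r + gamma * V(s') - V(s)."""
    return reward + gamma * v_next - v_current

# The animal predicts a reward worth 1.0 in the current state.
td_error(reward=2.0, v_current=1.0)  # more than expected  -> positive delta
td_error(reward=1.0, v_current=1.0)  # exactly as expected -> delta of zero
td_error(reward=0.0, v_current=1.0)  # less than expected  -> negative delta
```

Positive delta corresponds to phasic dopamine firing, zero to baseline activity, negative to the dip below baseline.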

The discovery validated RL after the fact, not before. Sutton didn’t look at dopamine neurons and derive TD learning. He looked at psychological theories of reward and formalized them computationally. Then neuroscientists looked at the formalization and said: the brain does that too. The convergence is genuine. But the direction of influence runs from behavior to algorithm to biology, not the other way around.

What “Anticipated” Actually Means

The conventional framing is that behaviorists anticipated reinforcement learning. I’ve been arguing something like that for the last two thousand words. Now I want to push back on my own argument.

Thorndike anticipated the Law of Effect. He did not anticipate Q-learning. Skinner anticipated operant conditioning. He did not anticipate policy gradient methods. Hull anticipated the mathematical formalization of behavior. His formalization failed. The specific contributions of RL (temporal difference learning, Bellman equations applied to learning, convergent approximation of value functions) came from computer science and operations research, not from psychology.

What behaviorism provided was not algorithms. It was a way of thinking about learning that turned out to be more correct than anyone knew at the time, including the behaviorists themselves. The problem formulation survives. The methodology is dead. The insights are everywhere. The framework that generated them is discredited.

This is the normal relationship between a scientific tradition and its descendants. You don’t credit Aristotle with modern physics, even though his questions about motion were the right questions. You don’t credit the alchemists with chemistry, even though their interest in transformation was prescient. You acknowledge the conceptual debt and note that the intellectual ancestors would not recognize their ideas in their current form.

Thorndike’s cats, pawing at latches in puzzle boxes in 1898, were running a reinforcement learning algorithm. They just didn’t know it, and neither did Thorndike, and the formalization that made the connection visible required sixty years of mathematics, twenty years of computer science, and one psychologist who wandered into an Air Force scientist’s lunch meeting. The heritage is real. The anticipation is a retrospective illusion. Both of these statements are true, and the tension between them is where the interesting history lives.
