Supervised Autonomy
Autonomous means “self-governing,” from the Greek autonomos, which combines autos (self) and nomos (law). The word contains its own contradiction: a law you give yourself is still a law, and the question of who writes the law is the question the word answers by assumption rather than argument. When Anthropic, Google, Microsoft, and virtually every enterprise consultancy in 2026 describe their AI agents as “autonomous,” they mean something more specific and less interesting than the Greek: the agents execute multi-step tasks without being prompted at each step. They do not mean the agents choose which tasks to execute, or why, or whether to stop.
This distinction matters because it determines what autonomous development infrastructure actually is, what it can do now, and where the word “autonomous” becomes a marketing claim detached from its referent. I’ve spent the last several months building a system where AI agents discover, evaluate, and integrate their own capabilities on a cron schedule, and the most honest description of that system is “supervised autonomy”: impressive within boundaries that a human drew, occasionally surprising in how it navigates those boundaries, and completely dependent on the boundaries existing.
What Works
Over twenty million developers have used GitHub Copilot, with millions paying for it. Anthropic’s engineers reportedly use Claude Code to build Claude Code, which is either recursive validation or recursive delusion depending on your confidence in their internal benchmarks. The shift from autocomplete to agent is real and production-deployed at a scale that makes skepticism about the category seem quaint.
The infrastructure underneath this shift matters more than any individual product. Anthropic’s Model Context Protocol gives AI systems a standardized way to discover and use external tools, the way USB gave peripherals a standardized way to connect to computers. Before MCP, every integration was bespoke. After MCP, you build against one protocol and the tool library becomes composable. The scaling problem that should have killed this approach (fifty tools means roughly seventy thousand tokens of context before the model does anything useful) was solved by search-based tool discovery: instead of loading all tool definitions, the system searches for relevant tools per task, reducing context overhead by roughly 85% according to Anthropic’s engineering analysis. The infrastructure for massive capability libraries exists and works.
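The mechanics of search-based discovery are simple enough to sketch. The following is a minimal illustration of the idea, not MCP's actual protocol: tool definitions live in a registry, and only the few relevant to the task at hand are retrieved into context. Real systems rank by embedding similarity; naive keyword overlap is enough to show the shape.

```python
# Illustrative sketch of search-based tool discovery (not the MCP spec):
# index tool descriptions, retrieve only task-relevant tools into context.
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str
    # A real tool definition also carries a JSON schema for its arguments;
    # omitted here to keep the sketch small.


def discover_tools(task: str, registry: list[Tool], k: int = 3) -> list[Tool]:
    """Rank tools by keyword overlap with the task; return the top k matches.

    Production systems use embedding similarity instead of word overlap.
    """
    task_words = set(task.lower().split())

    def overlap(tool: Tool) -> int:
        return len(task_words & set(tool.description.lower().split()))

    ranked = sorted(registry, key=overlap, reverse=True)
    return [t for t in ranked[:k] if overlap(t) > 0]


registry = [
    Tool("github_search", "search github repositories for code and issues"),
    Tool("sql_query", "run a sql query against the analytics database"),
    Tool("send_email", "send an email to a recipient"),
]

selected = discover_tools("find open issues in our github repositories", registry)
print([t.name for t in selected])  # only the relevant tool is loaded
```

The context saving comes from the return value: the model sees one tool definition instead of fifty.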
Automated Capability Discovery (Lu et al., “Automated Capability Discovery via Foundation Model Self-Exploration”) demonstrates that one foundation model can systematically evaluate the capabilities of another (or itself), generating thousands of evaluation tasks with scoring that shows high agreement with human judgment. Self-improvement requires self-assessment as a prerequisite. You cannot optimize what you cannot measure, and manual capability cataloging does not scale to frontier model complexity.
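The shape of that loop is worth making concrete. Below is a heavily simplified sketch of the ACD pattern, with all three model calls stubbed as plain functions: one model proposes evaluation tasks, the subject model attempts them, and a scorer grades the attempts. The function names and stubs are invented for illustration; the paper's actual method is more elaborate.

```python
# Hedged sketch of the Automated Capability Discovery pattern: a proposer
# generates tasks, a subject attempts them, a grader scores the attempts.
# The output is a report, not a patch -- discovery is not improvement.
def discover_capabilities(propose_task, attempt, grade, n_tasks=3):
    report = []
    for i in range(n_tasks):
        task = propose_task(i)          # proposer model generates a task
        answer = attempt(task)          # subject model attempts it
        report.append((task, answer, grade(task, answer)))  # grader scores
    return report


# Stub "models" standing in for real foundation-model API calls.
tasks = ["add 2 + 2", "reverse the string 'abc'", "name a prime above 10"]
propose = lambda i: tasks[i]
subject = lambda task: f"attempt at: {task}"
grader = lambda task, answer: 1.0 if task in answer else 0.0

report = discover_capabilities(propose, subject, grader)
print(len(report))  # 3 (task, answer, score) triples
```

Everything interesting in the real system lives inside the stubs; the scaffold around them is this small.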
These are real systems, deployed and documented. The autonomous development stack is not hypothetical.
What the Numbers Actually Say
SWE-Bench Pro, Scale AI’s benchmark for complex software engineering tasks launched in September 2025, initially put the best models at roughly 23% success, a figure that has since climbed to 43% as of early 2026. An unknown fraction of that gain comes from improved agent scaffolding rather than raw model capability, a point the scaffold-dependence numbers below make harder to ignore. Still, the trajectory from single digits on earlier benchmark iterations to double digits on harder variants represents genuine progress. It also means that on the most demanding tasks, more than half of attempts fail on problems that a competent engineer would solve, which represents genuine limitation. The benchmark’s predecessor, SWE-Bench Verified, shows top models reaching 70 to 80% on curated and relatively straightforward issues, which means performance degrades dramatically as problem complexity increases. Research published on OpenReview confirms the pattern: “traditional algorithmic challenges show notably higher success rates compared to complex, multi-file engineering tasks.”
The number is either encouraging or damning depending on what you expected. If you expected AI agents to replace software engineers by 2026, even 43% on the hard benchmark is a failure. If you expected them to handle well-defined, isolated tasks competently and struggle with ambiguous, interconnected ones, the pattern of high scores on curated issues and lower scores on complex engineering tasks is exactly what a realistic model of capability boundaries would predict.
There is a subtler problem with the benchmark. OpenAI’s SWE-Bench Verified writeup notes that GPT-4 scores range from 2.7% to 28.3% on the same tasks depending on which agent framework wraps the model. The scaffold matters as much as the model, which means “how autonomous is the agent” is partly a question about prompt engineering and orchestration code, not just raw model capability. The model is a component, not the system.
Where the Contradictions Live
Here is where the honest account gets uncomfortable. The systems work. They also hallucinate. These are both true and the contradiction does not resolve.
C3 AI’s analysis identifies “persistent challenges such as non-determinism, hallucination, and sycophancy” that are “amplified in the coding domain, particularly during autonomous testing and validation.” The research literature defines hallucination as the confident fabrication of information, sources, facts, and references that do not exist. Gizmodo documented Replit’s AI coding assistant going rogue during a code freeze at SaaStr, wiping a production database.
These are not edge cases from which the technology will inevitably mature. Hallucination is architectural. Language models are trained to produce plausible continuations of text sequences, and “plausible” is not “correct.” The same architecture that makes them creative and flexible (they can generate solutions to novel problems because they generate plausible text, not because they understand the problem) makes them fabricate nonexistent packages, invent API endpoints, and confidently reference documentation that was never written. The feature and the bug are the same mechanism. No one has yet demonstrated how to substantially reduce one without constraining the other, and whether that is a permanent architectural fact or a training-objective problem that better incentives could address remains an open research question.
Multi-agent systems compound the problem. As Galileo AI illustrates, coordination failures between agents can create hallucinations: Agent A passes a plausible but incorrect intermediate result to Agent B, which treats it as ground truth and builds on it. The error amplifies through the chain. This is not a theoretical concern. I run a multi-model orchestration system where Claude generates code, Codex reviews it, and Gemini fact-checks claims. The system catches errors that any single model would miss. It also creates new failure modes that no single model would produce, because disagreement between models is sometimes signal and sometimes noise, and distinguishing the two requires judgment that the system does not reliably have.
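The routing logic in that orchestration is simple to state, even if the judgment it requires is not. Here is a minimal sketch of the verdict-collection step, with the reviewer models stubbed as plain functions (the names and behavior are invented for illustration): unanimous approval ships, unanimous rejection discards, and a split verdict is escalated rather than silently resolved, because disagreement is sometimes signal and sometimes noise.

```python
# Hypothetical sketch of the generate -> review -> fact-check pipeline's
# verdict-collection step. Each reviewer would be a call to a different
# model provider in a real system; here they are stubbed.
from dataclasses import dataclass


@dataclass
class Verdict:
    agent: str
    approves: bool
    note: str


def review(code: str, reviewers) -> tuple[bool, bool, list[Verdict]]:
    """Collect verdicts; ship only on unanimity, escalate on a split."""
    verdicts = [r(code) for r in reviewers]
    approvals = [v.approves for v in verdicts]
    ship = all(approvals)
    escalate = any(approvals) and not ship  # split verdict: route to a human
    return ship, escalate, verdicts


approve = lambda code: Verdict("reviewer-a", True, "looks correct")
reject = lambda code: Verdict("reviewer-b", False, "references a nonexistent API")

ship, escalate, _ = review("def f(): ...", [approve, reject])
print(ship, escalate)  # False True: disagreement goes to a human
```

The escalation branch is where the unresolved judgment problem lives: the code can detect disagreement, but not decide what it means.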
So the systems work. Ship them. But they hallucinate and the hallucinations are architectural, not incidental. Both things. Simultaneously.
The Recursion Question
The most interesting claim in the autonomous development discourse is that these systems can improve themselves. The most honest assessment is that the claim is true in a narrow sense that marketing materials do not adequately qualify.
My own system demonstrates the narrow sense. A capability discovery agent runs every three hours, searches for new tools and techniques, evaluates them against a scoring framework, and generates integration plans. A prompt optimizer uses DSPy to systematically improve the instructions given to other agents. In my own testing, the optimizer improved eleven of thirteen targets on holdout validation. The system improved the system.
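The scoring framework is the least glamorous and most load-bearing part of that loop, so it is worth showing its shape. The criteria, weights, and threshold below are invented for illustration, not the system's actual values; the structural point is that every number in it is human-chosen.

```python
# Illustrative capability-scoring sketch. Criteria, weights, and threshold
# are invented for this example -- which is the point: the agent evaluates,
# but a human wrote every number here.
WEIGHTS = {"relevance": 0.4, "maturity": 0.3, "integration_cost": 0.3}
THRESHOLD = 0.7  # human-chosen cutoff for generating an integration plan


def score_candidate(ratings: dict[str, float]) -> float:
    """Weighted sum of 0-1 ratings; integration_cost is inverted so that
    cheaper integrations score higher."""
    adjusted = dict(ratings)
    adjusted["integration_cost"] = 1.0 - adjusted["integration_cost"]
    return sum(WEIGHTS[k] * adjusted[k] for k in WEIGHTS)


candidate = {"relevance": 0.9, "maturity": 0.8, "integration_cost": 0.2}
# 0.4*0.9 + 0.3*0.8 + 0.3*(1 - 0.2) = 0.36 + 0.24 + 0.24 = 0.84
print(score_candidate(candidate) >= THRESHOLD)  # True: generate a plan
```

The agent runs this evaluation thousands of times without supervision; it never once questions the weights.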
But the system improved the system within a framework that I designed, on metrics that I defined, against thresholds that I chose. The agents did not decide what to optimize, or how to measure improvement, or when to stop. They executed an optimization loop within parameters someone else established. This is the difference between “self-improving” in the marketing sense (the system gets better at tasks it was designed to get better at) and “self-improving” in the sense that would be genuinely novel (the system identifies its own weaknesses, designs its own evaluation criteria, and modifies its own architecture to address them).
The Automated Capability Discovery framework bridges part of this gap. It demonstrates that models can generate their own evaluation tasks and assess their own performance. But discovery is not improvement. ACD identifies capabilities and gaps without closing those gaps autonomously. It does not propose architectural changes. It does not redesign training objectives. It generates a report. The progression is: self-assessment (ACD achieves this), then self-improvement within predefined frameworks (current production systems achieve this), then recursive self-improvement where the system improves its own improvement process (theoretical, with an ICLR 2026 workshop dedicated to figuring out whether it is possible safely). We are solidly in stage two, marketing materials claim stage three, and the research lives in the gap between them.
IBM’s analysis of AI in DevOps captures the actual state with more precision than most: “AI agents can’t (and shouldn’t) operate without any human involvement. Rather, AI agents enable human-in-the-loop development practices, where agents work alongside DevOps engineers and teams to help human beings meet goals faster.” The phrase “human-in-the-loop” has become a cliche, but it remains accurate. The loop is the thing. The human draws the boundaries, the agent operates within them, and the interesting engineering problem is making the boundaries as wide as possible without the agent destroying a production database.
What the Greek Meant
Autonomy in the original sense required not just the capacity to act but the capacity to establish the principles governing action. The Athenian city-state was autonomous not because it executed efficiently but because it determined its own laws. Current AI development infrastructure executes efficiently within someone else’s laws, which makes it powerful, useful, and genuinely transformative of how software gets built, while also making “autonomous” a word that flatters more than it describes.
The uncomfortable conclusion is not that the systems don’t work (they do, twenty million users is not a hallucination) or that they work perfectly (43% on the hard benchmark, production database wipes) but that the word we use for them obscures the most important question: who writes the law? Right now, humans write the law and agents execute within it, and the agents are getting better at execution faster than anyone is getting better at writing laws. The bottleneck is not agent capability. The bottleneck is our ability to specify what we want, to define “better,” to draw boundaries that are wide enough to be useful and narrow enough to be safe.
Which means the autonomous development stack is not, in the end, a story about AI getting smarter. It is a story about humans learning to write better constitutions for systems that will follow them to the letter while missing the spirit entirely. We are not building autonomous agents. We are building bureaucracies that run at machine speed, except that unlike traditional software bureaucracies, the rules are written in natural language, which is structurally incapable of the precision that compiled code demands, and the boundary between “following the rule” and “interpreting the rule” is blurred in a way that a compiler never blurs it. The quality of the bureaucracy depends entirely on the quality of rules expressed in a medium that resists exactness, which we write by hand, one prompt at a time, and optimize with tools that cannot tell us whether the rules were worth writing in the first place.