Automating Prompt Engineering

Prompt optimization is the process of using one AI system to systematically improve the instructions given to another AI system, or to itself. The concept sounds circular because it is circular: the judge and the judged share the same architecture, sometimes the same weights, and the definition of “better” is itself a prompt that could be optimized. The field has an estimated market size somewhere between hundreds of millions and several billion dollars, depending on category definition, and no coherent answer to the question of what, exactly, is being optimized.

I spent three weeks building an automated prompt optimization system. The optimizer improved 11 of 13 targets. The two it failed on were the complex ones, the ones where quality is genuinely multidimensional, where “better” means five different things weighted against each other. In other words, the optimizer works on everything except the targets where quality resists capture in a single metric.

That result is more instructive than it sounds.

The Argument for Automation

The philosophical case is straightforward and, as far as it goes, correct. Neural network weights are optimized algorithmically. Nobody hand-tunes the hundreds of billions of parameters of a language model. So why should prompts, which serve an analogous steering function in the inference pipeline, be manually crafted through trial and error?

DSPy, Stanford’s framework for treating prompts as programs, operationalizes this question. You define a task signature (what the model should do), provide a metric function (how to score outputs), supply training examples (known inputs and acceptable outputs), and let an optimizer search for the instruction set that maximizes your metric. The separation is clean: what the model should accomplish is specified by the human; how to tell the model to accomplish it is discovered by the algorithm.
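
The separation of concerns can be sketched in a few lines of plain Python. Everything below is a toy, not DSPy's real API: `run_model` stands in for an LLM call, and `Example`, `metric`, `optimize`, and the candidate instructions are invented for illustration. The point is the division of labor: the human supplies the metric and examples, the optimizer searches instructions.

```python
# Toy sketch of the signature/metric/optimizer separation. Hypothetical
# names throughout; a real system would call an LLM inside run_model.
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    label: str  # known acceptable output

def metric(predicted: str, expected: str) -> float:
    """Human-defined scoring function: exact match on the label."""
    return 1.0 if predicted == expected else 0.0

def run_model(instruction: str, text: str) -> str:
    """Stand-in for an LLM call, with canned behavior for the demo."""
    if "angry" in instruction and "!" in text:
        return "negative"
    return "positive"

def optimize(candidates, trainset):
    """Pick the instruction that maximizes mean metric over the trainset."""
    def avg_score(instr):
        return sum(metric(run_model(instr, ex.text), ex.label)
                   for ex in trainset) / len(trainset)
    return max(candidates, key=avg_score)

trainset = [
    Example("great job", "positive"),
    Example("this is broken!", "negative"),
]
candidates = [
    "Classify the sentiment.",
    "Classify the sentiment; exclamation marks often signal angry text.",
]
best = optimize(candidates, trainset)
```

The human never told the model how to phrase the instruction; the search did. That is the clean separation the framework promises.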

The results are often impressive. A prompt evaluator task jumped from 46.2% to 64.0% accuracy. Jailbreak detection went from 59% with manual prompts to 93.18% with optimized few-shot demonstrations. GEPA, a reflective optimization algorithm published as a preprint in mid-2025 and accepted at ICLR 2026, achieves its improvements using up to 35 times fewer rollouts than reinforcement learning, which matters because optimization cost is the bottleneck for production adoption. Databricks reported that GEPA, combined with a cheaper open-source model (gpt-oss-120b), matched or exceeded Claude Opus 4.1 on the IE Bench evaluation at roughly one-ninetieth the serving cost.

These numbers are real, though drawn from small evaluation sets (as few as 13 examples in one case). The improvements do transfer to held-out test sets, but the sample sizes warrant caution. If your task is well-defined and your metric captures what you care about, automated optimization can be dramatically faster than manual tuning, sometimes reducing hours of human iteration to minutes of compute, with equal or better outcomes.

The case for automation is, within its domain, compelling.

The Argument Against Automation

The counterevidence is quieter but persistent. Ilia Shumailov and colleagues found that manually written prompts produced more consistent results across variable data conditions. As the Communications of the ACM reported, the researchers found that human-designed prompts drew on common knowledge from life experience, whereas automated prompts emerged more randomly. Manual prompt engineering had a more consistent effect on model results regardless of the amount of test data available, while AI-generated instructions were more variable, occasionally failing catastrophically on some tasks. The researchers “were surprised to find that the prompts written by humans actually produced better results on most tasks.”

These findings are not necessarily contradictory. The most plausible reconciliation is that they’re describing different domains.

Automated optimization excels at narrow, well-defined tasks where the success metric captures the full scope of what matters: classification, extraction, formatting, routing. These tasks have clean evaluation functions, bounded output spaces, and stable distributions. The optimizer can explore the space of possible instructions and converge on formulations that a human would never try but that score reliably well.

Manual prompts excel at complex, contextually loaded tasks where the valuable properties of the output (appropriate tone, ethical judgment, cultural sensitivity, the difference between a technically correct answer and a genuinely helpful one) resist quantification. The human engineer embeds tacit knowledge in the prompt, knowledge about what “good” looks like that can’t be reduced to a scoring function. The optimizer, which can only see the metric, is blind to everything the metric doesn’t capture.

The uncomfortable synthesis: automated prompt engineering excels on the tasks where the problem is well-defined enough to reduce to a single metric, and struggles on the tasks where quality is genuinely multidimensional. The well-defined tasks are not trivial (jailbreak detection at 93% is production-critical), but they are the ones where human judgment contributes least beyond setting up the problem correctly.

The Metrics Problem

This is where the argument sharpens. You can only optimize what you can measure. Goodhart’s Law applies with unusual directness: when the metric becomes the optimization target, it ceases to be a good metric.

I built 17 custom evaluation functions for my optimization targets. Code review severity classification. Test coverage scoring. CWE identifier extraction. PR quality assessment across five weighted dimensions. The single-axis metrics (severity classification, tool routing) optimized cleanly and generalized to holdout data. The multi-axis metrics (PR quality, refactoring assessment) showed training scores above 0.80 that collapsed to 0.30 on holdout, which is the optimizer equivalent of studying for the test instead of learning the material.
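
A hypothetical version of a multi-axis metric like the PR-quality one (the dimensions and weights below are invented for illustration) makes the failure mode concrete: collapsing five dimensions into one scalar discards exactly the information the optimizer would need to avoid gaming it.

```python
# Invented five-dimension PR-quality metric, collapsed to one scalar.
# The collapse is the problem: the optimizer sees only the aggregate.
WEIGHTS = {
    "correctness": 0.35,
    "test_coverage": 0.25,
    "readability": 0.15,
    "scope_discipline": 0.15,
    "description_quality": 0.10,
}

def pr_quality(scores: dict) -> float:
    """Weighted mean over dimension scores in [0, 1]."""
    assert set(scores) == set(WEIGHTS)
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Two very different PRs earn the same scalar: one aces two dimensions
# and ignores the rest, the other is mediocre everywhere.
a = pr_quality({"correctness": 1.0, "test_coverage": 1.0, "readability": 0.0,
                "scope_discipline": 0.0, "description_quality": 0.0})
b = pr_quality({"correctness": 0.6, "test_coverage": 0.6, "readability": 0.6,
                "scope_discipline": 0.6, "description_quality": 0.6})
```

Both profiles score 0.60. An optimizer maximizing this scalar is free to trade away whichever dimensions are cheapest to sacrifice on the training distribution, which is one way training scores above 0.80 turn into 0.30 on holdout.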

The optimizer found shortcuts. It discovered that certain phrasings in the prompt correlated with high scores on the training examples, not because those phrasings encoded general principles but because they matched the specific patterns of the training distribution. A prompt that said “focus on security vulnerabilities in authentication flows” scored well on a training set where most examples involved authentication. In production, that same prompt missed injection vulnerabilities in data processing pipelines.

This is the black box problem that practitioners have identified: the optimizer sees only the final score and is blind to the reasoning behind it. The optimizer doesn’t understand why a prompt works. It understands that a prompt scores well. These are different things, and the gap between them is where production failures live.

The Craft Relocation Thesis

The dominant narrative is that prompt engineering is dying. According to a Salesforce Ben analysis of Microsoft survey data, prompt engineer ranked second-to-last among roles companies plan to hire. Job postings for the role are sparse in available reporting. Andrej Karpathy has endorsed the term “context engineering” over “prompt engineering,” arguing that the newer label better describes what practitioners actually do. The market data tells a murkier story: estimates range widely depending on how you define the category, but multiple sources show year-over-year growth, and broader enterprise AI spending continues to climb sharply.

Both narratives are true. The reconciliation is that the skill isn’t dying. It’s relocating.

Old prompt engineering was craft: iterate on wording, test manually, develop intuitions about what phrasings work, accumulate tricks. This skill is being automated. The optimizer does it faster, cheaper, and in many cases better.

New prompt engineering is architecture: design evaluation metrics that capture genuine quality, build multi-stage pipelines where each component can be independently optimized, choose the right optimization algorithm for the task structure, debug systematic failures in automated systems, ensure that the optimization objective actually aligns with the deployment objective.

The person who writes “You are a helpful assistant” is being replaced. The person who designs the metric function that distinguishes a helpful response from a technically correct but useless one is more valuable than ever. The craft relocated from the artifact (the prompt text) to the infrastructure (the evaluation framework).

This is the standard progression of every engineering discipline. Hand-calculated structural loads gave way to finite element analysis. The structural engineer didn’t disappear. The job shifted from arithmetic to modeling, from computing to specifying what to compute.

The Interpretability Gap

When GEPA produces a prompt for a cheaper model that matches Claude Opus at a fraction of the cost, the natural question is: what does the optimized prompt say? What instruction made the difference?

Often, we don’t know. The optimizer explores prompt variations through mutation and selection, a process analogous to evolutionary search, and the winning prompt is selected for its score, not its intelligibility. Some optimized prompts are clear improvements: more specific instructions, better structured examples, tighter constraints. Others are strange. They contain redundant phrasing, unusual formatting, or instructions that appear irrelevant to the task but consistently improve performance.
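
The mutation-and-selection loop can be sketched as follows. Everything here is a stand-in, not any real optimizer's internals: the mutation pool is a few canned suffixes and `score` is a fake evaluator that rewards specificity markers while ignoring readability, which is enough to show why the winner is selected for score, not intelligibility.

```python
# Minimal mutation-and-selection loop over prompt text. Toy throughout:
# real optimizers mutate via an LLM and score against held-out examples.
import random

def score(prompt: str) -> float:
    """Fake evaluator: rewards markers of specificity, ignores readability."""
    s = 0.0
    s += 0.4 if "step by step" in prompt else 0.0
    s += 0.3 if "JSON" in prompt else 0.0
    s += 0.3 if "cite" in prompt else 0.0
    return s

MUTATIONS = [" Think step by step.", " Answer in JSON.", " Always cite sources."]

def evolve(seed: str, generations: int, rng: random.Random) -> str:
    population = [seed]
    for _ in range(generations):
        # Mutation: append a random suffix to each survivor, three times over.
        children = [p + rng.choice(MUTATIONS)
                    for p in population for _ in range(3)]
        # Selection: keep the four highest-scoring prompts, old or new.
        population = sorted(population + children, key=score, reverse=True)[:4]
    return population[0]

best = evolve("Summarize the document.", generations=5, rng=random.Random(0))
```

The winning prompt tends to accumulate redundant suffixes, because nothing in the selection pressure penalizes redundancy. That is the strangeness: the suffixes pile up as long as the score holds.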

This mirrors the interpretability problem in neural networks, but with an irony: prompts are supposed to be the human-readable layer. The whole point of natural language interfaces is that we can read the instructions and understand what the model is being told to do. When the optimizer generates instructions that work but resist explanation, we’ve lost the interpretability advantage that prompts were supposed to provide.

The practical consequence is debugging. When an optimized prompt fails in production, you can’t reason about why from the prompt text alone. You have to re-run the optimization with different metrics, inspect the training data for distributional bias, test individual prompt components in isolation. The debugging process for an optimized prompt is more like debugging a trained model than debugging code.

We traded artisanal prompts we understood for optimized prompts that perform better but that we don’t understand. The performance gain is measurable. The interpretability cost is harder to quantify, which is precisely the kind of cost that optimization ignores.

The Recursion

The deepest tension in automated prompt engineering is recursive. The optimizer uses an AI to evaluate AI outputs. The evaluation criteria are themselves written in natural language, processed by a language model, subject to the same biases and limitations as the outputs being judged. If the evaluator systematically misunderstands what “clarity” means, the optimization converges toward the evaluator’s misunderstanding, not toward actual clarity.

I use Claude Opus to evaluate Claude Sonnet’s outputs. Different model, different parameters, but the same architecture, similar training data, similar RLHF biases. When both models agree that an output is good, it might be because the output is genuinely good, or it might be because both models share a systematic preference that diverges from human judgment. Cross-model validation helps, but only if the models are genuinely independent. Frontier models trained on similar corpora with similar alignment procedures may disagree on edge cases while converging on central tendencies. The diversity is shallow.
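
One way to make “shallow diversity” measurable, sketched below with invented verdicts: inter-judge agreement means little on its own, so compare it against agreement with a small set of human-labeled anchors. All the numbers are illustrative.

```python
# Shared-bias check: high agreement between two AI judges paired with low
# agreement against human anchors suggests a common systematic preference.
def agreement(a: list, b: list) -> float:
    """Fraction of positions where two verdict lists match."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical binary verdicts ("output is good") over ten outputs;
# humans labeled only the first five.
judge_a = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
judge_b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
human   = [0, 1, 0, 0, 1]

inter_judge = agreement(judge_a, judge_b)       # 0.9: the judges agree
judge_vs_human = agreement(judge_a[:5], human)  # 0.6: but not with humans
```

When the first number is high and the second is low, the judges are converging on each other rather than on human judgment, which is exactly the pattern to worry about with same-family models.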

The escape hatch, if there is one, is empirical evidence that doesn’t pass through the same evaluation pipeline. Do users report higher satisfaction in blind comparisons? Do downstream business metrics improve? Do A/B tests against human baselines show measurable gains? If the answer is yes, and the evidence comes from outside the LLM evaluation loop, the circularity is manageable even if it’s not resolvable.

Eleven of thirteen targets say yes. I’m choosing to believe them. But I notice that the two failures were the targets where quality is hardest to measure, which means the circularity is most dangerous precisely where it’s least detectable.

What This Means

The field is converging on a hybrid model that satisfies nobody completely. Humans define what “good” means. Algorithms find the instructions that produce it. Humans validate that the algorithm’s definition of “good” matches their own. The loop is slow, expensive, and philosophically incomplete.

It also works. The uncomfortable conclusion is not that automation replaces human judgment or that human judgment is irreplaceable. The uncomfortable conclusion is that neither automation nor human craft is sufficient alone, that the correct answer is a collaboration that requires ongoing maintenance, calibration, and humility about what each contributor can see and what each contributor misses.

Automation handles the search. Humans handle the specification. Neither is the hard part in isolation. The hard part is the interface between them: translating what you want into a metric that faithfully represents what you want, then verifying that the optimization didn’t find a way to satisfy the metric without satisfying the intent.

We’re not optimizing prompts. We’re optimizing our ability to describe what we want. That turns out to be the hard problem. It always was.
