How to Benchmark Conversation Extraction Quality

AI Summary (Claude Opus)

TL;DR: A single F1 score cannot capture the ways structured extraction from conversations fails; this post describes a three-layer evaluation methodology (per-field F1, LLM-as-judge, downstream propagation) that reveals how extraction errors cluster in the fields that matter most to downstream consumers.

Key Points

  • The methodology uses three evaluation layers weighted 60/25/15: per-field F1 with semantic matching across seventeen schema fields, an LLM judge evaluating against raw text rather than reference output, and downstream propagation measuring agreement in drift detection, loop tracking, and report generation.
  • Cheaper models scored 0.50–0.56 on extraction composites but only 0.24–0.27 on downstream propagation, demonstrating that extraction errors concentrate in high-impact fields rather than distributing uniformly across the schema.
  • The V3 schema was selected through a systematic three-phase pipeline of 579 Opus calls that screened nineteen candidate fields on consistency, information gain, discrimination, and redundancy before co-optimizing the schema and extraction prompt.

The post presents a three-layer benchmark methodology for evaluating structured extraction quality from conversational data, developed for the ChatLedger system. Layer 1 computes per-field F1 with semantic matching across seventeen extraction fields, using field-specific scorers and weights derived from downstream consumer importance. Layer 2 employs an LLM judge to evaluate extractions against raw conversation text, addressing the limitation that field-level F1 treats a model-generated reference as ground truth. Layer 3 measures whether extraction differences propagate into downstream pipeline outputs such as drift detection and daily reports, revealing that models scoring adequately on extraction can produce dramatically different pipeline results. The methodology was applied to four frontier models, producing a clean separation in composite scores, and the V3 schema selection process used a systematic screening and optimization pipeline across approximately 579 model calls.

How to Benchmark Conversation Extraction Quality

Structured extraction from conversational data is one of those problems that sounds solved until you try to measure whether the solution actually works. I can extract claims, decisions, emotional tone, and negotiation patterns from a conversation using any frontier language model, and the output will look reasonable on inspection, but “looks reasonable on inspection” is not a benchmark, and without a benchmark I have no way to answer the question that matters: does switching to a cheaper model break anything downstream?

I built ChatLedger to extract structured metadata from text message conversations: claims with polarity and subject attribution, action items with ownership, decisions with evidence chains, emotional arcs, implicit assumptions, and negotiation patterns. The V3 schema defines seventeen independently scored extraction dimensions (the original ten V1 fields, with compound scoring on fields like claims that evaluates text match, polarity, and subject as separate components, plus seven new V3 fields), each with its own structure and evidence requirements. The extraction works well enough that I use it as the foundation for drift detection, conversation loop tracking, and automated reporting. Which means that the quality of the extraction propagates through everything built on top of it, and the question of how to evaluate that quality becomes the question of how to evaluate the entire pipeline.

This post describes the methodology I developed to answer that question. As of March 3, 2026, the results are preliminary (four models so far, with three more planned), but the methodology itself is the interesting part, because it addresses a problem that anyone building structured extraction pipelines will eventually face: a single quality metric can’t capture the ways extraction fails.

The Problem with Single Metrics

Everyone knows that more metrics give you more truth. Except they don’t. More metrics give you more numbers, and more numbers give you more ways to convince yourself that you understand something you don’t.

The simplest approach to evaluating extraction quality is to compute F1 against a gold standard. F1 is the harmonic mean of precision (what fraction of the items I extracted are correct?) and recall (what fraction of the items that should have been extracted did I find?), which means it penalizes a system that finds everything but hallucinates extras just as it penalizes a system that is precise but misses half the content. I have reference extractions, I have challenger extractions, I measure overlap, I get a number. This works well for tasks where the output space is constrained, such as named entity recognition where entities are either present or absent and the boundary conditions are reasonably well defined relative to conversational extraction.
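To make the precision/recall tradeoff concrete, here is a minimal set-based F1 sketch. This is not ChatLedger's actual scorer (which matches fuzzily rather than exactly), just the textbook definition over exact item sets:

```python
def f1(predicted: set, reference: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # how much of what I extracted is right
    recall = true_positives / len(reference)     # how much of what exists did I find
    return 2 * precision * recall / (precision + recall)

# Finding everything plus two hallucinated extras is penalized through precision:
print(f1({"a", "b", "c", "d"}, {"a", "b"}))  # ~0.667
```

The harmonic mean is what makes hallucinated extras and missed items both costly: a system cannot buy recall with junk precision, or vice versa.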

Conversational extraction is not that kind of task. When I tried to extract “claims” from a text message conversation, I quickly discovered the complexity. The model must identify that a claim was made, determine who made it, classify its polarity (pro, con, or neutral), link it to specific source messages, and assign a calibrated confidence score. Two models might identify the same claim using different phrasing, attribute it to the same person using a nickname versus a full name, agree on polarity but disagree on which messages constitute the evidence. A single F1 score collapses all of these dimensions into one number, which means I lose the ability to diagnose where and how the extraction is failing.

The problem compounds when I consider that extraction quality is not an end in itself. Nobody looks at raw JSON extractions. My extractions feed into downstream consumers: drift detectors that track how conversation topics evolve over time, loop detectors that identify recurring unresolved patterns, report generators that produce daily summaries. An extraction error that changes a claim’s polarity from “pro” to “neutral” might be invisible at the field level, because the claim text matched and F1 is unaffected, but catastrophic for my drift detector, where a position that was evolving toward commitment now appears to have stalled.

This is why a single metric fails. I need at least three layers of evaluation, each measuring a different kind of quality, because the errors that matter at each layer are different errors.

Layer 1: Per-Field F1 with Fuzzy Text Matching

My first evaluation layer compares each challenger model’s extractions against a reference baseline, Opus 4.6, on a field-by-field basis, using metrics appropriate to each field type.

For text fields like claims, action items, open questions, and decisions, I use token set ratio from RapidFuzz with a 0.60 threshold for fuzzy text matching. This is more lenient than exact string matching because two models can express the same claim in substantially different words, and exact matching would penalize semantically correct extractions that happen to use different phrasing. I chose the 0.60 threshold empirically: below that, false matches become common; above that, legitimate paraphrases get missed.

For claims specifically, the evaluation is compound: 60% weight on claim text F1, 25% on polarity accuracy (did the model get the pro/con/neutral classification right for matched claims?), and 15% on subject consistency (does the model cluster claim subjects the same way the reference does?). I weight polarity accuracy heavily because a model that identifies the right claims but misclassifies their polarity is worse than useless for my drift detection, where the direction of a claim is the signal.
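Mechanically, the compound claims score is a weighted sum of the three component scores (each assumed here to already be on a 0–1 scale):

```python
def claim_field_score(text_f1: float, polarity_accuracy: float,
                      subject_consistency: float) -> float:
    """Compound claims score: 60% text match, 25% polarity, 15% subject."""
    return 0.60 * text_f1 + 0.25 * polarity_accuracy + 0.15 * subject_consistency

# A model that finds every claim but flips every polarity is heavily penalized:
print(claim_field_score(text_f1=1.0, polarity_accuracy=0.0, subject_consistency=1.0))  # 0.75
```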

For entities, I use a lenient NER F1 that requires fuzzy name matching (0.70 threshold) plus exact type matching. A model that identifies “Ness” where the reference says “Ness Alderton” should get partial credit, but a model that identifies “Ness” as a “location” instead of a “person” should not.

For confidence scores, I compute Expected Calibration Error (ECE) across ten bins (a design choice; the original Naeini et al. paper, “Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles,” AAAI 2015, uses fifteen). The ECE score is converted to a zero-to-one scale (1 minus ECE) so higher numbers are better, consistent with the other metrics; 1.0 after conversion means perfect calibration. I care about this because a model that assigns 0.90 confidence to claims it gets wrong is worse than a model that assigns 0.50 confidence to the same wrong claims; overconfident errors are more dangerous than uncertain ones.
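A minimal ten-bin ECE sketch in pure Python (the binning convention at exact edges is my assumption; the post does not specify edge handling):

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: bin predictions by confidence, then take the bin-weighted mean
    of |average confidence - observed accuracy|. 0.0 = perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 lands in the top bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Overconfident model: 0.9 confidence but only 50% accuracy in that bin.
score = 1.0 - expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 0])
print(score)  # ~0.75 after the 1 - ECE conversion
```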

The V3 schema adds seven fields beyond the original V1 schema: summary, question/answer pairs, emotional tone, emotional arc, conversation phases, implicit assumptions, and negotiation patterns. Each requires its own scoring approach, and I had to design scorers for all seven. My emotional tone scorer matches by speaker and emotion pairs. My emotional arc scorer compares opening mood, closing mood, trajectory, and turning points as four equally weighted components. My conversation phase scorer matches by phase enum plus position agreement.

I combine all field scores into a weighted composite. The weights reflect my assessment of each field’s relative importance to the downstream consumers: claims receive the highest weight at 0.15 because they drive drift detection; action items and open questions each receive 0.09 because they drive loop detection and the daily report; newer V3 fields receive proportionally less weight, between 0.04 and 0.08, because they have fewer downstream consumers currently.
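Mechanically the composite is just a weighted average. A sketch with the three weights stated above (the remaining fourteen entries of the table are elided; renormalizing over whichever fields are scored is my choice here, not necessarily ChatLedger's):

```python
# Only the weights stated in the post; the full V3 table has seventeen entries.
FIELD_WEIGHTS = {
    "claims": 0.15,
    "action_items": 0.09,
    "open_questions": 0.09,
}

def extraction_composite(field_scores: dict, weights: dict) -> float:
    """Weighted average of per-field scores, renormalized over the fields
    actually scored so a partial run still lands on a 0-1 scale."""
    scored = [f for f in field_scores if f in weights]
    total_weight = sum(weights[f] for f in scored)
    if total_weight == 0:
        return 0.0
    return sum(weights[f] * field_scores[f] for f in scored) / total_weight

print(extraction_composite({"claims": 1.0, "action_items": 0.5}, FIELD_WEIGHTS))
```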

Layer 2: LLM as Judge

So what happens when the reference is wrong? What happens when it missed a claim the challenger found, or invented one the challenger correctly ignored? What happens when the gold standard is brass? These are not edge cases. They are the default condition when your reference is itself a model’s output.

I recognized early that evaluation at the field level has a fundamental limitation: it can only measure what the reference captured. If the reference missed a legitimate claim, and the challenger found it, the challenger gets penalized. If the reference invented a claim, and the challenger correctly omitted it, the challenger gets penalized. F1 at the field level treats the reference as ground truth, but the reference is itself a model’s output.

The second evaluation layer addresses this by using a separate LLM (Opus 4.6, running as a judge through claude -p) to evaluate each challenger’s extraction against the raw conversation text, not the reference extraction. The judge sees the conversation chunk and the challenger’s extraction but does not know which model produced it. It evaluates three dimensions on a scale from 0 to 100:

Content accuracy: how accurately does the extraction reflect what is actually in the text? A model that extracts claims that are genuinely present in the conversation scores high; a model that invents claims or misattributes them scores low.

Completeness: how much of the important content was captured? A model that extracts the three most salient claims but misses five minor ones scores moderately; a model that captures everything including the noise scores high.

Hallucination check: 100 means no hallucinations, 0 means heavily hallucinated. The judge also classifies each extraction into an issue taxonomy: none, weak_support, fabricated, misattributed, over_interpreted (interpretation beyond the evidence), or partially_supported. This taxonomy allows me to distinguish between a model that fabricates claims (dangerous) and a model that reads too much into ambiguous statements (less dangerous, potentially even useful).

The judge composite is the average of these three dimensions, normalized to a scale from zero to one.
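The judge's output can be modeled as a small typed record. The class names here are mine, but the three dimensions, the 0–100 scale, the issue taxonomy, and the averaged composite are as described above:

```python
from dataclasses import dataclass
from enum import Enum

class Issue(str, Enum):
    NONE = "none"
    WEAK_SUPPORT = "weak_support"
    FABRICATED = "fabricated"
    MISATTRIBUTED = "misattributed"
    OVER_INTERPRETED = "over_interpreted"
    PARTIALLY_SUPPORTED = "partially_supported"

@dataclass
class JudgeScores:
    content_accuracy: float  # 0-100
    completeness: float      # 0-100
    hallucination: float     # 0-100, where 100 means no hallucinations

    def composite(self) -> float:
        """Mean of the three dimensions, normalized to [0, 1]."""
        return (self.content_accuracy + self.completeness + self.hallucination) / 300.0

print(JudgeScores(90, 80, 100).composite())  # 0.9
```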

Using the same model family, Opus, as both the reference and the judge introduces a known bias: the judge may preferentially reward extractions that are stylistically similar to its own output. I consider this acceptable for two reasons. First, the judge evaluates against the raw text, not the reference extraction, so it is measuring absolute quality rather than similarity to the reference. Second, the alternative of using a different model family as judge introduces a different bias without eliminating the first one. The correct solution is human ground truth annotations, which are planned but not yet completed. A multi-judge panel using models from different families would also reduce this bias, and is planned for the final benchmark.

Layer 3: Downstream Propagation

The third evaluation layer is the one that I actually care about most. It measures whether extraction quality differences propagate into the downstream consumers that I interact with daily.

I run the full analysis pipeline (drift detection, loop tracking, escalation detection, emotional arc analysis, and daily report generation) on both the reference extractions and each challenger’s extractions, then compare the pipeline outputs using agreement metrics.

I measure drift agreement F1 to see whether the challenger identifies the same topics as drifting over time that the reference does. I measure loop agreement F1 to see whether the challenger identifies the same recurring conversation patterns. I measure escalation agreement to see whether critical events like arguments and emotional peaks are detected consistently. And I measure report overlap to see whether the generated daily reports surface the same highlights.
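Each of these agreement metrics reduces to a set-level F1 over what the two pipeline runs flagged. A sketch for drift agreement (the topic strings are hypothetical, and the empty-vs-empty convention is my assumption):

```python
def drift_agreement_f1(challenger_topics: set, reference_topics: set) -> float:
    """Set-level F1 over which topics each pipeline run flags as drifting."""
    if not challenger_topics and not reference_topics:
        return 1.0  # both runs agree that nothing drifted
    tp = len(challenger_topics & reference_topics)
    if tp == 0:
        return 0.0
    precision = tp / len(challenger_topics)
    recall = tp / len(reference_topics)
    return 2 * precision * recall / (precision + recall)

# Challenger flags one extra topic the reference run never saw as drifting:
print(drift_agreement_f1({"venue", "budget", "timing"}, {"venue", "budget"}))  # ~0.8
```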

The downstream composite combines these metrics. Its function is diagnostic: if a model scores well on extraction and judge quality but poorly on downstream propagation, the implication is that the errors it makes, despite being individually small, are concentrated in fields that matter most for the pipeline.

A naive assumption would be that extraction errors are distributed uniformly: a model that gets 80% right should produce a pipeline that is 80% correct. This assumption does not hold in practice.

In practice, this layer revealed something I did not expect. All three cheaper models scored between 0.24 and 0.27 on the downstream composite, which is dramatically lower than their extraction composites in the range of 0.496 to 0.558. This gap means that extraction errors are not uniformly distributed: they cluster in the fields and dimensions that my downstream consumers rely on most heavily. A model can get 80% of the extraction right by volume and still produce a pipeline output that looks almost nothing like what I get from the reference.

The Composite: 60/25/15

I want to be precise about where the subjectivity lives in this composite, because the answer is not “everywhere” and it is not “nowhere.” The weighting scheme has two layers, and they sit at different points on the spectrum between empirical grounding and judgment.

Within the extraction layer, the field weights are empirically grounded at two levels. The V1 field weights were set by a consumer importance assessment that ranked each field by downstream module dependency counts and degradation severity. The V3 field weights were derived from the Phase 1 screening composite scores, which measured consistency, information gain, discrimination, and redundancy across the Phase 1 extraction runs (nineteen fields, eight chunks, three runs each). Neither of these is a pure judgment call, but neither is a fully optimized quantity in the machine learning sense either: the screening metrics themselves involve design choices (why 0.30 weight on consistency rather than 0.25?), and the consumer importance assessment was a structured human judgment, not an automated optimization. I would place both on the “informed by evidence, decided by a person” part of the spectrum.

The cross-layer weights of 60% extraction, 25% judge, 15% downstream are a different matter. These are judgment calls, and I am telling you this because most benchmark papers bury their subjective choices inside equations, which makes them look more objective than they are. I would rather show you where the subjectivity lives.

I give the extraction layer the highest weight because it is the most granular and repeatable measurement. It evaluates seventeen fields independently, producing a diagnostic fingerprint of where each model succeeds and fails. The judge layer receives 25% because it provides a complementary perspective, evaluating against raw text rather than reference output, but is less granular with three scores versus seventeen. The downstream layer receives 15% because it is the most meaningful measurement but also the most volatile: small extraction differences can produce large downstream divergences, making it a noisy signal.
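The cross-layer combination is then a single weighted sum. A sketch with an illustrative (hypothetical) set of layer scores, showing how a collapsed downstream score drags the composite:

```python
def overall_composite(extraction: float, judge: float, downstream: float) -> float:
    """Cross-layer composite: 60% extraction, 25% judge, 15% downstream."""
    return 0.60 * extraction + 0.25 * judge + 0.15 * downstream

# Hypothetical challenger: decent extraction and judge scores,
# collapsed downstream agreement.
print(overall_composite(extraction=0.55, judge=0.75, downstream=0.25))
```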

The four models evaluated so far produce a clean separation under this composite. Opus 4.6, the baseline evaluated against itself as a sanity check, scores 0.833. GPT-5.2 scores 0.600. Haiku 4.5 scores 0.590. Gemini 3 Flash Preview scores 0.553.

The Opus score carries a ceiling caveat worth explaining. Because the downstream layer compares challenger outputs against Opus reference outputs, Opus evaluated against itself should logically score 1.0 on downstream agreement. The current implementation scores it as 0.0 instead, because the downstream comparison code treats same-model runs as having no valid comparison rather than as perfect agreement. This suppresses the Opus baseline composite to roughly 0.833 (on an effective ceiling of about 0.85, given the 0.60 and 0.25 extraction and judge weights with a zeroed 0.15 downstream weight) rather than the approximately 0.983 it would reach with downstream scored correctly. I flag this because it means the gap between Opus and the challengers is narrower than a naive reading of the scores suggests.

The average gap between the baseline and the three challengers is approximately 0.252 on the overall composite. At the extraction layer alone, where Opus scores 0.996 and the challengers average roughly 0.536, the gap is approximately 0.460. The composite gap confirms that the benchmark discriminates: the models are not scoring similarly, and the ordering is plausible, with the most capable model scoring highest and the cheapest model scoring lowest.

What Changed from V1 to V3

The V1 schema defined ten top-level extraction fields (counting nested per-item properties like polarity and subject as part of their parent field): topics, entities, decisions, action items, open questions, claims, links, warnings, evidence message IDs, and confidence scores. These fields were not chosen arbitrarily. Each proven field was already validated by the downstream consumer modules that depended on it: arcs.py for interchunk arc linking, report.py for daily summaries, drift.py for opinion-drift detection, loops.py for recurring pattern detection. The V1 field weights were set by a consumer importance assessment that ranked each field by how many pipeline modules depended on it and how severely their output degraded when the field was noisy. Claims received the highest weight at 0.25 because they are the sole input to drift detection, and polarity errors in claims directly create false positive drifts (a position that was evolving toward commitment appearing to stall, or vice versa). Confidence received the lowest at 0.07 because it affected calibration reporting but not the core analytical pipeline. This was not hand tuning in the sense of adjusting numbers until the benchmark produced satisfying results, but it was not fully empirical optimization either. The weights reflected a deliberate assessment of downstream importance.

The V3 schema selection used a fully systematic pipeline that parallels how DSPy’s core abstractions (Signatures, Modules, Optimizers) optimize, evaluate, and select configurations (and was in fact designed around the same intuition that drove my prompt optimization work in the DSPy post). The pipeline ran across three phases.

Phase 1 began with twenty-five candidate fields across four tiers: six proven V1 fields (skipped because already validated), five latent fields that previous models had extracted faithfully but that the V1 pipeline had stripped, eight structural fields capturing conversation mechanics, and six derived composite fields. With the six proven fields skipped, nineteen candidates remained for screening. Each candidate was extracted three times across eight conversation chunks (24 extractions per field), then scored on a composite of four metrics: consistency at 0.30 weight (does the model extract the same structure across repeated runs?), information gain at 0.25 (does this field surface content not already captured by proven fields?), discrimination at 0.25 (do scores on this field distinguish between chunks that humans would rate differently?), and redundancy at 0.20 (does this field overlap with existing fields, measured by inverse cosine similarity of extracted content?). Fields passing a 0.45 composite threshold advanced. Question/answer pairs scored highest at 0.81 composite, driven by near perfect consistency (0.97) and strong discrimination (0.81). Implicit assumptions scored lowest among survivors at 0.59, dragged down by low consistency (0.22), which makes intuitive sense: different runs of the same model will identify different unstated assumptions from the same conversation, because the space of reasonable inferences is large.
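The Phase 1 gate reduces to a four-metric weighted sum against the 0.45 threshold. The weights and threshold below are from the post; the example metric values are hypothetical:

```python
SCREEN_WEIGHTS = {
    "consistency": 0.30,
    "information_gain": 0.25,
    "discrimination": 0.25,
    "redundancy": 0.20,  # already inverted per the post: higher = less overlap
}

def screening_composite(metrics: dict) -> float:
    """Phase 1 composite over the four screening metrics."""
    return sum(w * metrics[name] for name, w in SCREEN_WEIGHTS.items())

def passes_screen(metrics: dict, threshold: float = 0.45) -> bool:
    return screening_composite(metrics) >= threshold

# Hypothetical candidate field: consistent and novel, but poorly discriminating.
candidate = {"consistency": 0.8, "information_gain": 0.6,
             "discrimination": 0.2, "redundancy": 0.5}
print(passes_screen(candidate))  # True (composite ~0.54)
```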

Phase 1.5 ran prompt optimization for extraction consistency on fields that had passed screening but scored below 0.50 on the consistency metric. This was a targeted intervention: if a field carries useful information but the model extracts it inconsistently, the problem might be prompt ambiguity rather than inherent field difficulty.

Phase 2 ran a schema-plus-prompt co-optimization tournament across five prompt styles, making approximately 328 Opus calls (eight schemas times five prompt styles times eight chunks, plus retries) to find the prompt-and-schema combination that maximized extraction quality across the surviving fields simultaneously. The tournament tested eight candidate schemas (nine tournament files including the eventual winner), each combining the six proven V1 fields with different subsets of the screened candidates, varying both which fields were included and how the extraction prompt described them.

Phase 3 validated the winning schema on the full thirty-chunk benchmark set (approximately 60 Opus calls), producing the final V3 schema of seventeen fields: the original ten plus summary, question/answer pairs, emotional tone, emotional arc, conversation phases, implicit assumptions, and negotiation patterns.

The V3 field weights within the extraction layer were then rebalanced to accommodate the new fields. V1 field weights were rescaled to sum to approximately 60% of the total (claims dropping from 0.25 to 0.15, action items from 0.15 to 0.09, and so on), with the remaining 40% distributed across V3 fields based on their Phase 1 screening scores. The screening scores informed the weighting but were not the sole factor; downstream consumer dependencies and field complexity also influenced the final allocation. Summary and question/answer pairs received 0.08 and 0.07 respectively; conversation phase and negotiation patterns received 0.04 each.
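The rescaling itself is a uniform multiplication: since the V1 weights summed to 1.0, giving them a roughly 60% share means multiplying each by 0.60, which reproduces the claims and action-items numbers quoted above. A sketch (only two of the V1 fields shown):

```python
V1_WEIGHTS = {"claims": 0.25, "action_items": 0.15}  # subset; full V1 table sums to 1.0

def rescale_v1(weights: dict, v1_share: float = 0.60) -> dict:
    """Shrink V1 field weights so together they occupy v1_share of the new composite."""
    return {field: round(w * v1_share, 4) for field, w in weights.items()}

print(rescale_v1(V1_WEIGHTS))  # {'claims': 0.15, 'action_items': 0.09}
```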

The impact on benchmark scores is measurable. All models improved their extraction composites from V1 to V3: Haiku from 0.463 to 0.558, GPT-5.2 from 0.472 to 0.554, Gemini from 0.476 to 0.496. This suggests that the additional fields are easier for models to extract than the original fields, or that the V3 extraction prompt provides better guidance, or both. The gains are not uniform. Models improved most on emotional tone and question/answer pairs, which have relatively constrained output spaces, and least on negotiation patterns and implicit assumptions, which require genuine inferential reasoning about what participants are doing without explicitly saying so.

What This Methodology Does Not Solve

Three limitations are worth stating explicitly, because they constrain how far I can trust my own results.

First, the reference baseline is a model’s output, not human annotations. Opus 4.6 is an excellent extractor, but it is not perfect, and every error in the reference propagates as bias through the entire evaluation. I plan to add human ground truth for fifteen to twenty chunks to calibrate how far the reference diverges from a genuine gold standard.

Second, the corpus is private, consisting of personal text message conversations, which means the benchmark is not reproducible by others. My methodology is reproducible, but the specific scores are tied to this particular dataset. I am investigating openly shareable alternatives such as meeting transcripts and public dialogue corpora that would allow independent replication.

Third, the weighting scheme occupies a middle ground between arbitrary and empirically optimized. The cross-layer weights (60/25/15) are judgment calls with no derivation beyond the reasoning I described above. The field weights within the extraction layer are more grounded: V1 weights were set by a consumer importance assessment that counted downstream module dependencies, and V3 weights were derived from Phase 1 screening scores. But “derived from screening scores” is not the same as “optimized to maximize benchmark validity,” and a different composite formula for the screening metrics would have produced different field weights. A different weighting scheme at either level would produce a different ranking, and I have not tested whether alternative weights would change the relative positions of the models.

What Comes Next

Three additional models are planned for the benchmark: Gemini 3.1 Pro Preview, GPT-5.3-Codex (high reasoning effort), and Sonnet 4.6, which will bring the total to seven. The comparison post, with detailed breakdowns of where each model excels and fails, will follow once all seven models have been evaluated. A research paper describing the full methodology and results will use a separate publication pipeline that includes automated data verification against the source JSON files.

The methodology I describe here is general enough to apply to any structured extraction task where you have a schema, a reference, and downstream consumers. The specific scoring functions are tied to the ChatLedger schema: fuzzy text matching for claims, speaker plus emotion matching for emotional tone, ECE for confidence calibration. But the architecture of three evaluation layers transfers to any pipeline where extraction quality matters beyond the extraction itself.

If you are building structured extraction and evaluating it with a single F1 score, you are measuring something, but what you are measuring is not what you think you are measuring. I learned this the hard way. The errors that score well on F1 are not the errors that break your pipeline.
