A recurring class of high-stakes problems asks us to compress a qualitative record (a court opinion, a clinical note, a field dispatch) into an empirical verdict: violation or not, cured or not, escalation or not. Large language models read such records fluently, but two failures block their use as trustworthy predictors here: they fabricate, and they degrade when allowed to judge their own improvements. We present an architecture that confronts both. An LLM is confined to two roles, proposing a feature schema and performing verbatim-quote-grounded extraction, while a small, calibrated classical model makes the actual prediction; any extracted value lacking a verbatim evidence span is dropped and counted against a measured hallucination rate. The system modifies itself only when both a domain expert and a frozen gold set, which the improvement loop never fits to, endorse a change; the comparison is adjudicated by a permutation test and is subject to a hard subgroup-fairness veto. We evaluate not by chasing a benchmark but by running a known-answer test: across four real legal corpora, with six decades of empirical judicial-politics findings as ground truth, the system independently re-derives the established signals (judge party predicts decision direction, court ideology adds incremental lift, litigant “repeat-player” status predicts settlement) and, equally importantly, returns the established nulls: judge attributes do not predict whether parties settle, and appellate panel composition washes out. It also exhibits the honesty its design targets: it vetoes one of its own expert-approved escalations on held-out evidence, self-rejects features that don't help, and on real European Court of Human Rights texts extracts ~1,200 feature values with a measured-zero hallucination rate yet lands at chance on the outcome. That is the correct answer to a documented leakage trap, in which prior text-only work reported ~79% accuracy by reading court-authored facts written after the decision.
Consider three questions from three professions. A human-rights lawyer asks whether a set of facts will be found to violate the Convention. An oncologist asks whether a patient described across pages of notes will be readmitted. An analyst asks whether a region's dispatches portend escalation.
Each compresses a rich qualitative record into a small empirical verdict, under conditions that make naive machine learning treacherous: the data are scarce (hundreds to thousands of labeled cases, not millions), the stakes are high, the output must be auditable by a domain expert who will be accountable for it, and that expert's judgment is itself the scarcest signal available.
Large language models are obvious candidates — they read qualitative text well — but two documented failures stand in the way. First, they fabricate. Second, and less widely appreciated, they cannot reliably improve themselves by judging their own work: a growing literature shows that “intrinsic” self-correction without an external signal does not help and often degrades performance (Huang et al. 2023), that apparent self-improvement is usually an external verifier recognizing an already-present answer (Stechly et al. 2024), and that LLM judges systematically favor their own generations (Panickssery et al. 2024). A system that rewrites its own logic and grades the result with the same model family is building an echo chamber.
The legal-prediction literature supplies a sharp cautionary instance. Text-only prediction of European Court of Human Rights outcomes reported ~79% accuracy (Aletras et al. 2016), with the case “Facts” section as the strongest predictor — but those facts are written by the Court after it decides, so a later survey reframed much of the field as retrospective identification rather than genuine forecasting (Medvedeva et al. 2023). The headline numbers partly measure leakage. The more honest, leakage-resistant benchmark — predicting U.S. Supreme Court outcomes from features available before decision — sits around 70% (Katz et al. 2017).
We take these failures as design constraints rather than caveats.
data (record text) ─► EXTRACT ─► VALIDATE ─► MODEL ─► DECISION

| Stage | What it does |
|---|---|
| EXTRACT | LLM proposes schema + grounded extraction (every quote verified verbatim) |
| VALIDATE | grounding / leakage / fairness gates |
| MODEL | small calibrated classifier + conformal |
| DECISION | prediction + calibrated proba, abstain / escalate, named-feature reasons |

The pipeline is governed by a versioned Policy. Improvement loop: diagnose → propose → test on a FROZEN gold set → expert AND gold must agree → accept / roll back.

The LLM does two things and is trusted for neither without verification. It proposes a schema (quantifiable, domain-meaningful features, each tagged with the section of the record it may legally be read from), and it extracts each feature, returning a value, a calibrated confidence, and one or more verbatim evidence spans. The validator then checks that each span is an exact substring of the source; any value whose span is not found is dropped and counted against a per-run hallucination rate. The LLM never sees the outcome label and never makes the prediction. This is stricter than its closest predecessors (CHiLL, FeatLLM): the verbatim-quote requirement makes grounding a hard structural precondition, not a post-hoc citation or an LLM-judged faithfulness score with its own error.
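The grounding check described above can be sketched in a few lines. This is a minimal illustration, not the system's actual validator: the `Extraction` record and the `validate_extractions` function are hypothetical names, and the rule that every claimed span must be found (rather than at least one) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Extraction:
    """Hypothetical record for one LLM-extracted feature value."""
    feature: str
    value: object
    confidence: float
    spans: List[str]  # quotes the LLM claims appear verbatim in the source


def validate_extractions(
    source: str, extractions: List[Extraction]
) -> Tuple[List[Extraction], float]:
    """Keep only values whose evidence spans are exact substrings of the source.

    Returns the surviving extractions and the per-run hallucination rate,
    defined here as dropped values divided by all extracted values.
    Assumption: a value with no spans, an empty span, or any unmatched span
    is treated as ungrounded.
    """
    kept: List[Extraction] = []
    dropped = 0
    for ex in extractions:
        grounded = bool(ex.spans) and all(s and s in source for s in ex.spans)
        if grounded:
            kept.append(ex)
        else:
            dropped += 1
    rate = dropped / len(extractions) if extractions else 0.0
    return kept, rate


if __name__ == "__main__":
    text = "The applicant was detained for 14 days without judicial review."
    candidates = [
        Extraction("detention_days", 14, 0.9, ["detained for 14 days"]),
        Extraction("had_counsel", True, 0.7, ["represented by counsel"]),
    ]
    kept, rate = validate_extractions(text, candidates)
    print([e.feature for e in kept], rate)
```

Because the check is a plain substring test, it needs no second model and has no error rate of its own, which is the property the paragraph above relies on.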
The prediction is made by a small, calibrated classical model over the named features — a sparse logistic regression by default, yielding signed coefficients we can show the user, with evidence-gated escalation to gradient boosting only when a cross-validated probe shows it earns its keep. Calibration is a gate: expected calibration error above 0.10 is treated as not-deployable regardless of AUC (Guo et al. 2017). Abstention is first-class — a conformal layer (Angelopoulos & Bates 2021) emits “uncertain” rather than a forced label when both outcomes are plausible, and genuinely hard or high-stakes cases are escalated to a human. That a small model on a few legible features can match an opaque proprietary one is not new (Dressel & Farid 2018 on COMPAS); the contribution is where the features come from and how the system is gated.
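The calibration gate and the abstention rule can be illustrated with a short sketch. It assumes a binary outcome, uses one common binned form of expected calibration error, and uses split conformal prediction with nonconformity score 1 − p(label); the function names, the bin count, and the choice to abstain whenever the prediction set does not contain exactly one label are assumptions, not the system's documented behavior. Only the 0.10 threshold comes from the text.

```python
import math
from typing import List, Sequence, Union

ECE_GATE = 0.10  # threshold stated in the text


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for a binary classifier: weighted mean of
    |mean predicted probability - observed positive rate| per bin."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            ece += (len(b) / n) * abs(mean_p - pos_rate)
    return ece


def deployable(probs: Sequence[float], labels: Sequence[int]) -> bool:
    """Calibration acts as a gate, independent of AUC."""
    return expected_calibration_error(probs, labels) <= ECE_GATE


def conformal_threshold(
    cal_probs: Sequence[float], cal_labels: Sequence[int], alpha: float = 0.2
) -> float:
    """Split-conformal quantile of the scores 1 - p(true label)."""
    scores = sorted(1 - (p if y == 1 else 1 - p) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points: every label stays plausible
        return 1.0
    return scores[k - 1]


def predict_or_abstain(p: float, qhat: float) -> Union[int, str]:
    """Return a label only when exactly one outcome is in the prediction set."""
    plausible = [label for label, score in ((1, 1 - p), (0, p)) if score <= qhat]
    return plausible[0] if len(plausible) == 1 else "uncertain"
```

With a held-out calibration split, `conformal_threshold` is computed once and `predict_or_abstain` is applied per case; cases that come back `"uncertain"` are the natural candidates for human escalation.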
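The defining choice is who is allowed to approve a self-modification. Given the evidence that LLMs cannot reliably self-evaluate, the system keeps the LLM in the proposal role but moves every accept/reject verdict to two non-LLM arbiters: a domain expert, and a frozen gold set that the improvement loop never fits to, with the comparison adjudicated by a permutation test. Both must endorse a change before it is accepted; otherwise it is rolled back.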
A hard subgroup-fairness veto sits over both: a change that worsens a protected subgroup's calibration or error gap past threshold is rejected even if overall accuracy improves. Because the fairness criteria provably trade off (Kleinberg et al. 2017; Chouldechova 2017), the chosen operating point is made an explicit, auditable decision enforced by the veto rather than an implicit one. Crucially, the evaluator is not in the loop's edit set: the system improves the engine but never rewrites the judge to agree with itself.
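The acceptance logic can be sketched as a single gate function. This is an illustration under stated assumptions: the paired sign-flip permutation test, the significance level `alpha`, the `max_gap_increase` threshold, the order of the checks, and all function names are hypothetical choices, since the text specifies only that an expert and a frozen gold set must both endorse a change and that a fairness veto overrides them.

```python
import random
from typing import Dict, Sequence


def paired_permutation_pvalue(
    base_loss: Sequence[float],
    cand_loss: Sequence[float],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """One-sided sign-flip test on per-example loss differences.

    Null hypothesis: the candidate is no better than the baseline on the
    frozen gold set, so each paired difference is equally likely to carry
    either sign.
    """
    rng = random.Random(seed)
    diffs = [b - c for b, c in zip(base_loss, cand_loss)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if permuted >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def accept_change(
    expert_approves: bool,
    base_loss: Sequence[float],
    cand_loss: Sequence[float],
    base_gap: Dict[str, float],
    cand_gap: Dict[str, float],
    alpha: float = 0.05,             # hypothetical significance level
    max_gap_increase: float = 0.02,  # hypothetical fairness threshold
) -> str:
    """Return 'accept', 'reject_expert', 'reject_gold', or 'veto_fairness'."""
    # The fairness veto sits over both arbiters, so it is checked first.
    for group, gap in cand_gap.items():
        if gap - base_gap.get(group, 0.0) > max_gap_increase:
            return "veto_fairness"
    if not expert_approves:
        return "reject_expert"
    # The gold set is only ever read here; nothing in the loop fits to it.
    if paired_permutation_pvalue(base_loss, cand_loss) >= alpha:
        return "reject_gold"
    return "accept"
```

Note that a candidate with identical per-example losses yields a p-value of 1.0 and is rejected even with expert approval, which mirrors the "no gain, no churn" behavior the gate is meant to enforce.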
Leakage is the trap that silently destroys these systems. The architecture enforces, mechanically: a provenance firewall (features written after the outcome are quarantined from the model); a univariate leakage audit (any single feature that alone separates the classes too well is flagged); a grounding-span scan (features whose evidence contains outcome-revealing language are dropped); and split hygiene — real data is cross-validated chronologically or grouped by entity (judge, court, party), never naively, because random splits leak temporal and entity correlations. Every empirical result below uses such a leakage-safe split.
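Two of these checks, entity-grouped splitting and the univariate leakage audit, are simple enough to sketch with the standard library. The function names, the greedy fold assignment, and the 0.95 AUC cutoff are assumptions for illustration; in practice a library splitter such as scikit-learn's `GroupKFold` would serve the same purpose.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def grouped_folds(groups: Sequence[Hashable], n_folds: int = 5) -> List[List[int]]:
    """Assign row indices to folds so that no entity (judge, court, party)
    appears in more than one fold."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    # Greedy balancing: largest entities first, each into the smallest fold.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        min(folds, key=len).extend(by_group[g])
    return folds


def univariate_auc(values: Sequence[float], labels: Sequence[int]) -> float:
    """AUC of a single feature used as a score (pairwise Mann-Whitney form).
    Assumes both classes are present."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def flag_leaky(
    features: Dict[str, Sequence[float]],
    labels: Sequence[int],
    max_auc: float = 0.95,  # hypothetical cutoff for "separates too well"
) -> List[str]:
    """Flag any feature that alone separates the classes suspiciously well,
    in either direction."""
    flagged = []
    for name, values in features.items():
        auc = univariate_auc(values, labels)
        if max(auc, 1 - auc) > max_auc:
            flagged.append(name)
    return flagged
```

A flagged feature is not necessarily leaked, so the audit is best read as a prompt for provenance review rather than an automatic deletion rule.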
We test not by topping a benchmark but by asking whether the system behaves honestly where the answers are independently known. Six decades of empirical judicial politics supply ground truth: party shapes how judges decide (the attitudinal model; Martin-Quinn ideal points), capability shapes who wins and settles (Galanter's repeat-player theory), and panel effects are real but domain-specific (Sunstein et al. 2004). A trustworthy adaptive agent must recover the real effects, decline the absent ones, and be honest about its own machinery. The corpora are four real legal datasets — the FJC Integrated Database (federal civil terminations), the Carp-Manning U.S. District Court Database (~111k coded decisions), the Supreme Court Database joined to Martin-Quinn ideology, and the Songer Courts of Appeals Database — plus real ECtHR texts.
| Relationship (corpus) | CV split | Effect | Verdict |
|---|---|---|---|
| Judge party → decision direction (Carp-Manning, civil-liberties) | unseen-judge | ΔAUC +0.034 [0.015, 0.053], p=0.002 | signal |
| Court ideology → SCOTUS outcome, incremental (SCDB + Martin-Quinn) | chronological | +0.026 (linear) / +0.012 (boosted), p=0.002 | signal |
| Litigant repeat-player status → settlement (FJC) | district-grouped | +0.016 [0.007, 0.025], p=0.002 | signal |
| Stakes × case type → duration (FJC, capacity gain) | district-grouped | +0.076 [0.064, 0.089] paired error↓, p=0.002 | signal |
| Judge attributes → settlement (FJC) | judge-grouped | ΔAUC ≈ 0, p > 0.4 | null |
| Appellate panel composition → reversal (Songer) | lead-judge-grouped | −0.003, p=0.84 | null |
| Appellate panel composition → direction (Songer) | lead-judge-grouped | +0.001, p=0.37 | null |
The pattern is coherent judicial politics: ideology shows up where an individual judge decides alone, attenuates on three-judge panels (the panel-moderation effect; the nulls match Sunstein et al.'s finding that panel effects concentrate in a few hot-button domains and wash out across a broad corpus), aggregates at the court level — and never reaches into whether litigants settle, which is litigant behavior, not judicial ideology. A system that reported only its hits would have buried the nulls; here the significance gates make them first-class results.
| Behavior | Result |
|---|---|
| Frozen gold vetoes an expert-approved model escalation (Carp-Manning) | Δ=+0.000, vetoed — no churn |
| Capacity probe flips as informative features accrue (FJC duration) | boosted≈linear early → +0.076 paired gain later |
| Financial-status features self-rejected (SEC-EDGAR match) | 3.6% coverage, marginal −0.013 — declined |
| Session-LLM extraction hallucination rate (real ECtHR, ~1,200 values) | 0.0000 — every quote verbatim |
| ECtHR outcome prediction (the honest answer to the leakage trap) | AUC 0.504 — chance |
| Conformal interval coverage on sealed real data (FJC duration) | 80.7% vs 80% target |
The last two rows are the paper's spine. On real ECtHR texts the extraction was faithful by measurement — zero fabricated quotes across ~1,200 extracted values — yet the downstream prediction landed at chance. We present this as the correct result, not a failure. It is the honest answer to the leakage trap: the architecture cleanly separates two questions the field routinely conflates — did the model extract honestly? (yes, measurably) and do those features predict the outcome? (no, on the pre-judgment framing) — and refuses to manufacture the second from the first. Where prior text-only work reported ~79% by reading court-authored facts, an honest, leakage-careful pipeline lands at chance, and says so.
Every number above is re-derived from scratch by an automated verifier that recomputes each claim deterministically and diffs it against a traceability table; at the time of writing it reports 8/8 headline numbers reproduced with zero drift, and it caught two of our own secondary-number slips during drafting.
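The shape of such a verifier can be sketched as follows. The traceability table is modeled here as a mapping from a claim identifier to its stated value, and each claim is paired with a deterministic function that recomputes it; the names `Check` and `verify`, the tolerance, and the toy claims in the usage example are hypothetical and do not reproduce any number from this paper.

```python
from typing import Callable, Dict, List, NamedTuple


class Check(NamedTuple):
    claim_id: str
    stated: float
    recomputed: float
    ok: bool


def verify(
    traceability: Dict[str, float],
    recompute: Dict[str, Callable[[], float]],
    tol: float = 1e-9,
) -> List[Check]:
    """Recompute every stated claim and compare it with the table entry.

    A claim with no recomputation function is reported as a failure rather
    than silently skipped.
    """
    checks: List[Check] = []
    for claim_id, stated in sorted(traceability.items()):
        fn = recompute.get(claim_id)
        if fn is None:
            checks.append(Check(claim_id, stated, float("nan"), False))
            continue
        fresh = fn()
        checks.append(Check(claim_id, stated, fresh, abs(fresh - stated) <= tol))
    return checks


if __name__ == "__main__":
    # Toy claims for illustration only.
    table = {"toy_mean": 2.0, "toy_rate": 0.25}
    functions = {
        "toy_mean": lambda: sum([1, 2, 3]) / 3,
        "toy_rate": lambda: 1 / 5,  # deliberately drifts from the stated 0.25
    }
    report = verify(table, functions)
    passed = sum(c.ok for c in report)
    print(f"{passed}/{len(report)} reproduced")
    for c in report:
        if not c.ok:
            print("drift:", c.claim_id, c.stated, c.recomputed)
```

Keeping the stated values in a table separate from the code that recomputes them is what lets a slip in either the prose or the pipeline show up as a visible diff.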
The pieces exist separately: LLM-as-feature-engineer for a small interpretable model (CHiLL, FeatLLM); verbatim-quote grounding as “verifiable by design” (Quote-Tuning) and fine-grained faithfulness measurement (AIS, FActScore, ALCE); the now-robust finding that self-correction needs an external signal (STaR and Reflexion work because they have one; intrinsic self-correction fails); and calibration, conformal abstention, learning-to-defer, and the fairness-impossibility results. To our knowledge no prior system combines LLM-proposed grounded extraction, a calibrated classical decider, and an automated held-out statistical and fairness gate on its own self-modifications, and validates the whole as a known-answer test across real corpora. The contribution is less a higher accuracy than an honesty-and-calibration layer that the benchmark-chasing branch of this field has lacked, and the external-gate answer that the self-correction-fails literature demands.
Outcome prediction reproduces historical patterns, including injustice; the fairness veto, abstention, and escalation reduce but do not remove this. The system is decision-support only — an architectural refusal layer blocks prohibited contexts (bail, sentencing, recidivism, immigration enforcement) regardless of confidence. The real-data samples are in the thousands, not the full corpora; some party names in the source data are truncated, capping one financial join; the expert gate is only as good as the expert, and authoring the frozen gold set is the binding human cost — not eliminated, but bounded and made auditable. As noted at the top, a human audit of a sample of machine extractions and a final citation pass are still pending; this is a working paper.
The recipe is domain-agnostic. Any setting that compresses a qualitative record into a categorical or continuous verdict under scarcity, high stakes, a need for auditability, and available expert judgment fits the same four disciplines and the same known-answer validation. Law is the worked example here; medicine (readmission, response — the CHiLL setting) and forecasting (escalation, default) are the natural next targets. The valuable and dangerous problems all share a shape — violation or not, cured or not, war or not — and the question worth answering is not “can a model predict X” but “can we trust what it says about X,” demonstrated where the answers are already known.
This write-up shares an ongoing research system for public knowledge. Numbers are reproducible from a deterministic verifier; the system is decision-support and research only, not a tool for predictions about individuals. Comments welcome — see the portfolio for contact.