Adaptive Domain Intelligence: An LLM-as-Feature-Engineer Architecture for Compressing Qualitative Records into Calibrated Empirical Predictions

Dave Liu · working paper · June 2026

Working paper. This is a public write-up of an ongoing research system. The empirical results below are each re-derived from scratch by an automated verifier (8/8 headline numbers reproduce with zero drift), but the work has not been peer-reviewed, and a human audit of a sample of machine extractions is still pending. Treat it as a preprint shared for public knowledge, not a finished publication. The extraction-audit interface (the actual tool for that human review) is public. Reference implementation (“metis”) available on request.

Abstract

A recurring class of high-stakes problems asks us to compress a qualitative record — a court opinion, a clinical note, a field dispatch — into an empirical verdict: violation or not, cured or not, escalation or not. Large language models read such records fluently, but two failures block their use as trustworthy predictors here: they fabricate, and they degrade when allowed to judge their own improvements. We present an architecture that confronts both. An LLM is confined to two roles — proposing a feature schema and performing verbatim-quote-grounded extraction — while a small, calibrated classical model makes the actual prediction; any extracted value lacking a verbatim evidence span is dropped and counted against a measured hallucination rate. The system improves itself only when a domain expert and a frozen gold set the loop never fits to both endorse a change, adjudicated by a permutation test with a hard subgroup-fairness veto. We evaluate not by chasing a benchmark but as a known-answer test: across four real legal corpora spanning six decades of empirical judicial-politics findings, the system independently re-derives the established signals — judge party predicts decision direction, court ideology adds incremental lift, litigant “repeat-player” status predicts settlement — and, equally importantly, returns the established nulls: judge attributes do not predict whether parties settle, and appellate panel composition washes out. It also exhibits the honesty its design targets: it vetoes one of its own expert-approved escalations on held-out evidence, self-rejects features that don't help, and on real European Court of Human Rights texts extracts ~1,200 features with a measured-zero hallucination rate yet lands at chance on the outcome — the correct answer to a documented leakage trap in which prior text-only work reported ~79% accuracy by reading court-authored facts written after the decision.

1. The problem: qualitative in, empirical out

Consider three questions from three professions. A human-rights lawyer asks whether a set of facts will be found to violate the Convention. An oncologist asks whether a patient described across pages of notes will be readmitted. An analyst asks whether a region's dispatches portend escalation.

Each compresses a rich qualitative record into a small empirical verdict, under conditions that make naive machine learning treacherous: the data are scarce (hundreds to thousands of labeled cases, not millions), the stakes are high, the output must be auditable by a domain expert who will be accountable for it, and that expert's judgment is itself the scarcest signal available.

Large language models are obvious candidates — they read qualitative text well — but two documented failures stand in the way. First, they fabricate. Second, and less widely appreciated, they cannot reliably improve themselves by judging their own work: a growing literature shows that “intrinsic” self-correction without an external signal does not help and often degrades performance (Huang et al. 2023), that apparent self-improvement is usually an external verifier recognizing an already-present answer (Stechly et al. 2024), and that LLM judges systematically favor their own generations (Panickssery et al. 2024). A system that rewrites its own logic and grades the result with the same model family is building an echo chamber.

The legal-prediction literature supplies a sharp cautionary instance. Text-only prediction of European Court of Human Rights outcomes reported ~79% accuracy (Aletras et al. 2016), with the case “Facts” section as the strongest predictor — but those facts are written by the Court after it decides, so a later survey reframed much of the field as retrospective identification rather than genuine forecasting (Medvedeva et al. 2023). The headline numbers partly measure leakage. The more honest, leakage-resistant benchmark — predicting U.S. Supreme Court outcomes from features available before decision — sits around 70% (Katz et al. 2017).

We take these failures as design constraints rather than caveats.

2. The architecture: four disciplines

  data  ─►  EXTRACT  ─►  VALIDATE  ─►  MODEL  ─►  DECISION
 record    LLM proposes   grounding /   small        prediction +
 (text)    schema +       leakage /     calibrated   calibrated proba,
           grounded       fairness      classifier   abstain / escalate,
           extraction     gates         + conformal  named-feature reasons
              │            (every quote verified verbatim)      ▲
              └───────────  governed by a versioned Policy  ────┘
                                     ▲
        improve:  diagnose → propose → test on a FROZEN gold set
                  → expert AND gold must agree → accept / roll back

2.1 The LLM as an untrusted feature-engineer

The LLM does two things and is trusted for neither without verification. First, it proposes a schema: quantifiable, domain-meaningful features, each tagged with the section of the record it may permissibly be read from. Second, it extracts each feature, returning a value, a self-reported confidence, and one or more verbatim evidence spans. The validator then checks that each span is an exact substring of the source; any value whose span is not found is dropped and counted against a per-run hallucination rate. The LLM never sees the outcome label and never makes the prediction. This is stricter than its closest predecessors (CHiLL, FeatLLM): the verbatim-quote requirement makes grounding a hard structural precondition, not a post-hoc citation or an LLM-judged faithfulness score with its own error.
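The grounding check described above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the `Extraction` shape, the function name, and the rule that every span must verify when several are supplied are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """One LLM-proposed value with its claimed evidence (hypothetical shape)."""
    feature: str
    value: object
    confidence: float
    spans: tuple[str, ...]  # quotes the LLM claims are verbatim


def validate_grounding(source: str, extractions: list[Extraction]):
    """Keep a value only if it has spans and each is an exact substring of source.

    Returns (kept, hallucination_rate), where the rate is dropped / total.
    Assumption: when several spans are given, all of them must verify.
    """
    kept, dropped = [], 0
    for ex in extractions:
        # Empty spans are rejected: "" is trivially a substring of anything.
        grounded = bool(ex.spans) and all(s and s in source for s in ex.spans)
        if grounded:
            kept.append(ex)
        else:
            dropped += 1
    rate = dropped / len(extractions) if extractions else 0.0
    return kept, rate


record = "The applicant was detained for 14 months before trial."
candidates = [
    Extraction("detention_months", 14, 0.95, ("detained for 14 months",)),
    Extraction("had_counsel", True, 0.60, ("was represented by counsel",)),
]
kept, rate = validate_grounding(record, candidates)
print([ex.feature for ex in kept], rate)  # ['detention_months'] 0.5
```

The second candidate is dropped because its quote never appears in the record, so half of this toy run counts as hallucinated. An exact-substring test is deliberately unforgiving: a paraphrase or a whitespace difference fails it, which is the point of making grounding structural.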

2.2 The calibrated classical decider

The prediction is made by a small, calibrated classical model over the named features — a sparse logistic regression by default, yielding signed coefficients we can show the user, with evidence-gated escalation to gradient boosting only when a cross-validated probe shows it earns its keep. Calibration is a gate: expected calibration error above 0.10 is treated as not-deployable regardless of AUC (Guo et al. 2017). Abstention is first-class — a conformal layer (Angelopoulos & Bates 2021) emits “uncertain” rather than a forced label when both outcomes are plausible, and genuinely hard or high-stakes cases are escalated to a human. That a small model on a few legible features can match an opaque proprietary one is not new (Dressel & Farid 2018 on COMPAS); the contribution is where the features come from and how the system is gated.
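A rough sketch of the two mechanisms in this subsection, the calibration gate and conformal abstention, follows. Only the 0.10 ECE threshold comes from the text. The 10 equal-width bins, the positive-class binning variant of ECE, the split-conformal score (one minus the probability of the true class), and `alpha=0.2` are assumptions chosen for illustration.

```python
import math

ECE_GATE = 0.10  # threshold stated in the text


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: size-weighted mean |observed positive rate - mean predicted prob|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            pos_rate = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(pos_rate - mean_p)
    return ece


def deployable(probs, labels):
    """Calibration acts as a gate, independent of AUC."""
    return expected_calibration_error(probs, labels) <= ECE_GATE


def conformal_threshold(cal_probs, cal_labels, alpha=0.2):
    """Split-conformal quantile of the score 1 - P(true class) on a calibration slice."""
    scores = sorted(1 - p if y == 1 else p for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]


def decide(p, q):
    """Return 1, 0, or "uncertain" depending on the conformal prediction set."""
    in_set = [label for label, score in ((1, 1 - p), (0, p)) if score <= q]
    # Both labels plausible (or neither): abstain instead of forcing a label.
    return in_set[0] if len(in_set) == 1 else "uncertain"


cal_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.55, 0.45]
cal_labels = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
q = conformal_threshold(cal_probs, cal_labels)
print(decide(0.95, q), decide(0.5, q), decide(0.1, q))  # 1 uncertain 0
print(deployable([0.9] * 10, [1] * 5 + [0] * 5))        # False
```

In the last line the model says 0.9 for cases that are positive only half the time, an ECE of about 0.4, so the gate refuses deployment no matter how well the model ranks.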

2.3 The dual gate (and why it matters most)

The defining choice is who is allowed to approve a self-modification. Given the evidence that LLMs cannot reliably self-evaluate, the system keeps the LLM in the proposal role but moves every accept/reject verdict to two non-LLM arbiters:

A frozen gold set the loop never fits to or diagnoses on. A change must improve held-out performance by a margin, with a paired-bootstrap confidence interval above zero, and survive a permutation significance test (corrected for the many candidates tested per round). A parsimony drop need only be non-inferior.
A domain expert, who receives clustered failure cards — misses grouped by recurring pattern, each in named-feature terms with the decision rule that fired and typed candidate repairs. A repair is applied only when the expert accepts and the gold confirms; either party's dissent blocks it.

A hard subgroup-fairness veto sits over both: a change that worsens a protected subgroup's calibration or error gap past a threshold is rejected even if overall accuracy improves. Because the fairness criteria provably trade off (Kleinberg et al. 2017; Chouldechova 2017), the chosen operating point is made an explicit, auditable decision enforced by veto rather than an implicit one. Crucially, the evaluator is not in the loop's edit set: the system improves the engine and never rewrites the judge to agree with itself.
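The acceptance logic of this section reduces to a small pure function over precomputed held-out statistics. The sketch below is one possible reading: the field names, the order of checks, and every numeric threshold (`margin`, `alpha`, `fairness_tol`, `ni_margin`) are hypothetical, and the example inputs are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldEvidence:
    """Held-out statistics for one candidate change (hypothetical field names)."""
    delta: float                # paired improvement on the frozen gold set
    ci_low: float               # lower end of the paired-bootstrap CI
    p_value: float              # permutation-test p-value
    subgroup_gap_change: float  # worst increase in any protected subgroup's gap


def gate(ev, expert_accepts, n_candidates, *, margin=0.005, alpha=0.05,
         fairness_tol=0.02, parsimony_drop=False, ni_margin=0.005):
    """Return (accepted, reason). All numeric thresholds are illustrative."""
    if ev.subgroup_gap_change > fairness_tol:
        return False, "fairness veto"       # overrides expert and gold
    if not expert_accepts:
        return False, "expert dissent"
    if parsimony_drop:
        # Removing a feature only has to be non-inferior on the gold set.
        if ev.ci_low > -ni_margin:
            return True, "accepted (non-inferior)"
        return False, "gold: inferior"
    if ev.delta < margin or ev.ci_low <= 0:
        return False, "gold: no held-out gain"
    if ev.p_value >= alpha / n_candidates:  # Bonferroni over the round's candidates
        return False, "gold: not significant"
    return True, "accepted"


# An expert-approved change that shows no gain on the frozen gold set is blocked.
flat = GoldEvidence(delta=0.000, ci_low=-0.004, p_value=0.51, subgroup_gap_change=0.0)
print(gate(flat, expert_accepts=True, n_candidates=4))

# A significant gain still fails if it widens a protected subgroup's gap.
unfair = GoldEvidence(delta=0.03, ci_low=0.01, p_value=0.002, subgroup_gap_change=0.05)
print(gate(unfair, expert_accepts=True, n_candidates=4))
```

Note that nothing in `gate` calls a language model: the LLM can only supply the candidate, while the verdict depends on the expert flag and the gold-set numbers.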

2.4 Leakage discipline

Leakage is the trap that silently destroys these systems. The architecture enforces, mechanically: a provenance firewall (features written after the outcome are quarantined from the model); a univariate leakage audit (any single feature that alone separates the classes too well is flagged); a grounding-span scan (features whose evidence contains outcome-revealing language are dropped); and split hygiene — real data is cross-validated chronologically or grouped by entity (judge, court, party), never naively, because random splits leak temporal and entity correlations. Every empirical result below uses such a leakage-safe split.
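Two of these checks, the provenance firewall and the univariate audit, are easy to illustrate. The sketch below assumes each feature carries a provenance tag (`"pre_outcome"` or `"post_outcome"`, hypothetical labels) and flags any single feature whose standalone AUC, in either direction, reaches a hypothetical 0.95 cutoff. The grounding-span scan and split hygiene are not shown here.

```python
def auc(scores, labels):
    """ROC-AUC as P(score_pos > score_neg), ties counted as half (Mann-Whitney)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def screen_features(columns, provenance, labels, max_separability=0.95):
    """Label each feature 'quarantined', 'flagged', or 'ok'."""
    report = {}
    for name, values in columns.items():
        if provenance[name] == "post_outcome":
            report[name] = "quarantined"  # firewall: never reaches the model
            continue
        a = auc(values, labels)
        # A feature that separates the classes alone is suspicious in either direction.
        report[name] = "flagged" if max(a, 1 - a) >= max_separability else "ok"
    return report


labels = [1, 1, 1, 0, 0, 0]
columns = {
    "judge_party":    [0, 1, 0, 1, 1, 0],
    "outcome_phrase": [1, 1, 1, 0, 0, 0],  # tracks the label perfectly
    "remedy_ordered": [1, 1, 0, 0, 0, 0],
}
provenance = {
    "judge_party": "pre_outcome",
    "outcome_phrase": "pre_outcome",
    "remedy_ordered": "post_outcome",
}
print(screen_features(columns, provenance, labels))
```

Here `outcome_phrase` is tagged as pre-outcome, so the firewall lets it through, but it separates the classes perfectly on its own and the univariate audit flags it. The two checks catch different failure modes, which is why the text lists them separately.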

3. Evaluation as a known-answer test

We test not by topping a benchmark but by asking whether the system behaves honestly where the answers are independently known. Six decades of empirical judicial politics supply ground truth: party shapes how judges decide (the attitudinal model; Martin-Quinn ideal points), capability shapes who wins and settles (Galanter's repeat-player theory), and panel effects are real but domain-specific (Sunstein et al. 2004). A trustworthy adaptive agent must recover the real effects, decline the absent ones, and be honest about its own machinery. The corpora are four real legal datasets — the FJC Integrated Database (federal civil terminations), the Carp-Manning U.S. District Court Database (~111k coded decisions), the Supreme Court Database joined to Martin-Quinn ideology, and the Songer Courts of Appeals Database — plus real ECtHR texts.

Data and how much of each

Every corpus is free and public; experiments use fixed-seed deterministic samples so the exact rows are reproducible. ECtHR = European Court of Human Rights (Strasbourg); its “texts” are the Facts sections of published judgments, each labeled for whether any Article of the European Convention on Human Rights was found violated, taken from the public lex_glue corpus.

**Table 0.** The four real prediction corpora (plus the ECtHR extraction-faithfulness set) and the deterministic sample used from each.
Corpus (free / public)	Scale available	N used	Analyses powered
FJC Integrated Database (federal civil terminations, FY2021–26)	full 5-year bundle	4,000 (1,916 w/ judge)	settlement, duration, amount; repeat-player; pacing
Carp-Manning U.S. District Court DB (1927–2012)	110,977	8,000 (civ. lib. 3,292)	judge party → decision direction
Supreme Court Database 2025 (+ Martin-Quinn)	9,092 usable	9,032	court ideology → SCOTUS outcome
Songer Courts of Appeals DB (1925–1996)	18,195 → 16,710 merits	6,017	panel composition → reversal/direction
ECtHR facts (lex_glue ecthr_a)	~11k	160 / 150 extracted	grounded-extraction faithfulness

Five auxiliary joins enrich these rather than standing alone: the FJC Biographical Directory (4,059 judges → appointer party, tenure); CourtListener bulk dockets + people-db (judge assignment + party fallback); Martin-Quinn court medians (1937–2022); the AO's Federal Court Management Statistics Table C-5 (94 districts → congestion prior); and SEC EDGAR (10,400 filers + CY2020 revenue), a financial join that the system rejected at a 3.6% name-match rate. Three synthetic corpora carry the planted pathologies and back no real-world claim.

**Table 1.** Known-answer results. All under leakage-safe cross-validation with paired significance tests. ΔAUC = paired change in area-under-ROC from adding the feature group; positive = improvement.
Relationship (corpus)	CV split	Effect	Verdict
Judge party → decision direction (Carp-Manning, civil-liberties)	unseen-judge	ΔAUC +0.034 [0.015, 0.053], p=0.002	signal
Court ideology → SCOTUS outcome, incremental (SCDB + Martin-Quinn)	chronological	+0.026 (linear) / +0.012 (boosted), p=0.002	signal
Litigant repeat-player status → settlement (FJC)	district-grouped	+0.016 [0.007, 0.025], p=0.002	signal
Stakes × case type → duration (FJC, capacity gain)	district-grouped	+0.076 [0.064, 0.089] paired error↓, p=0.002	signal
Judge attributes → settlement (FJC)	judge-grouped	ΔAUC ≈ 0, p > 0.4	null
Appellate panel composition → reversal (Songer)	lead-judge-grouped	−0.003, p=0.84	null
Appellate panel composition → direction (Songer)	lead-judge-grouped	+0.001, p=0.37	null

Figure 1. Paired ΔAUC with 95% confidence intervals, every estimate under a leakage-safe split. The three signals sit cleanly above zero; the three nulls straddle it. A trustworthy adaptive agent must produce both columns — recover the real effects and decline the absent ones. (Stakes×type→duration is omitted here as it is measured in error rather than AUC; see Table 1.)

The pattern is coherent judicial politics: ideology shows up where an individual judge decides alone, attenuates on three-judge panels (the panel-moderation effect; the nulls match Sunstein et al.'s finding that panel effects concentrate in a few hot-button domains and wash out across a broad corpus), aggregates at the court level — and never reaches into whether litigants settle, which is litigant behavior, not judicial ideology. A system that reported only its hits would have buried the nulls; here the significance gates make them first-class results.

**Table 2.** Honest self-behavior and extraction faithfulness.
Behavior	Result
Frozen gold vetoes an expert-approved model escalation (Carp-Manning)	Δ=+0.000, vetoed — no churn
Capacity probe flips as informative features accrue (FJC duration)	boosted≈linear early → +0.076 paired gain later
Financial-status features self-rejected (SEC-EDGAR match)	3.6% coverage, marginal −0.013 — declined
Session-LLM extraction hallucination rate (real ECtHR, ~1,200 values)	0.0000 — every quote verbatim
ECtHR outcome prediction (the honest answer to the leakage trap)	AUC 0.504 — chance
Conformal interval coverage on sealed real data (FJC duration)	80.7% vs 80% target

The two ECtHR rows are the paper's spine. On real ECtHR texts the extraction was faithful by measurement (zero fabricated quotes across ~1,200 extracted values), yet the downstream prediction landed at chance. We present this as the correct result, not a failure. It is the honest answer to the leakage trap: the architecture cleanly separates two questions the field routinely conflates. Did the model extract honestly? Yes, measurably. Do those features predict the outcome? No, on the pre-judgment framing. The architecture refuses to manufacture the second answer from the first. Where prior text-only work reported ~79% by reading court-authored facts, an honest, leakage-careful pipeline lands at chance, and says so.

Every number above is re-derived from scratch by an automated verifier that recomputes each claim deterministically and diffs it against a traceability table; at the time of writing it reports 8/8 headline numbers reproduced with zero drift, and it caught two of our own secondary-number slips during drafting.

3b. Statistical evaluation in detail

Two estimators are reported per experiment and should not be conflated. Model AUC is the mean of per-fold ROC-AUCs, with a 95% interval from the 2.5/97.5 percentiles across folds; it measures how well the full model ranks. Paired ΔAUC is the incremental value of a feature group, computed on pooled out-of-fold predictions paired case by case (AUC-with minus AUC-without), with a 95% CI from a 2,000-resample paired bootstrap and a p-value from a 500-permutation paired test; it measures whether the feature group helps. The paired statistic is the hypothesis test. Because it cancels per-case variance that the fold mean cannot, it can be significantly positive even when the fold-averaged AUC barely moves (the SCDB and FJC-settlement rows). Inside the self-improvement loop, the acceptance permutation test's α is Bonferroni-corrected over the round's candidate revisions, which serves as the multiple-comparisons guard.
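For concreteness, here is one way the paired ΔAUC, its bootstrap interval, and its permutation p-value could be computed. It is a dependency-free sketch rather than the verifier's code. The percentile bootstrap, the swap-within-case permutation scheme, the one-sided add-one p-value, and the toy data are assumptions. The defaults of 2,000 resamples, 500 permutations, and seed 13 echo figures quoted in the text.

```python
import random


def auc(scores, labels):
    """ROC-AUC as P(score_pos > score_neg), ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))


def delta_auc(with_s, without_s, labels):
    return auc(with_s, labels) - auc(without_s, labels)


def paired_bootstrap_ci(with_s, without_s, labels, n_boot=2000, seed=13):
    """Percentile 95% CI; each resample draws cases, keeping both models' scores paired."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes")
    rng, n, deltas = random.Random(seed), len(labels), []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # AUC undefined without both classes: redraw
            continue
        deltas.append(delta_auc([with_s[i] for i in idx], [without_s[i] for i in idx], ys))
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


def paired_permutation_p(with_s, without_s, labels, n_perm=500, seed=13):
    """One-sided p: under the null, each case's two scores are exchangeable."""
    rng = random.Random(seed)
    observed = delta_auc(with_s, without_s, labels)
    hits = 0
    for _ in range(n_perm):
        a, b = [], []
        for w, wo in zip(with_s, without_s):
            if rng.random() < 0.5:
                w, wo = wo, w  # swap the pair within the case
            a.append(w)
            b.append(wo)
        hits += delta_auc(a, b, labels) >= observed
    return (1 + hits) / (1 + n_perm)


# Toy data: the "with" model adds a weak signal on top of noise.
data_rng = random.Random(0)
labels = [1] * 20 + [0] * 20
without_s = [data_rng.random() for _ in labels]
with_s = [0.5 * y + data_rng.random() for y in labels]
print(delta_auc(with_s, without_s, labels))
print(paired_bootstrap_ci(with_s, without_s, labels))
print(paired_permutation_p(with_s, without_s, labels))
```

Under the add-one convention used in this sketch, 500 permutations cannot produce a p-value below 1/501 ≈ 0.002. If the verifier uses the same convention, the p=0.002 on the significant rows is the floor of the test, meaning no permuted statistic matched the observed one.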

Cross-validation is leakage-safe by construction and chosen per corpus, never naively random: random k-fold leaks entity and temporal correlation and can inflate AUC by an estimated 0.05–0.30, and an entity-overlap diagnostic flags >30% recurrence. Three schemes are used: entity-grouped 5-fold GroupKFold, where a judge/district/court never straddles a fold (Carp-Manning, FJC, Songer); chronological forward-chaining (TimeSeriesSplit, 5 folds), where every evaluation fold is strictly later than its training data (SCDB, by term); and repeated stratified 5×2 (5 folds, 2 repeats), where cases are exchangeable with no recurring entity (ECtHR).
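The first two split schemes and the overlap diagnostic can be illustrated without any library. This is a sketch only: in practice scikit-learn's `GroupKFold` and `TimeSeriesSplit`, named above, would be the natural tools. The greedy group assignment and the definition of recurrence as the share of test rows whose entity also appears in training are assumptions. The forward-chaining fold size `n // (k + 1)` mirrors the default of scikit-learn's `TimeSeriesSplit`.

```python
from collections import Counter


def group_k_fold(groups, k=5):
    """Assign whole entities to folds: largest entity first, into the lightest fold."""
    load = [0] * k
    fold_of = {}
    for entity, size in Counter(groups).most_common():
        f = load.index(min(load))
        fold_of[entity] = f
        load[f] += size
    folds = [[] for _ in range(k)]
    for i, entity in enumerate(groups):
        folds[fold_of[entity]].append(i)
    return folds  # one list of test indices per fold


def forward_chaining(n, k=5):
    """Yield (train, test) index ranges; rows must already be sorted by time."""
    test_size = n // (k + 1)
    for i in range(k):
        start = n - (k - i) * test_size
        yield range(0, start), range(start, start + test_size)


def entity_recurrence(train_groups, test_groups):
    """Share of test rows whose entity was also seen in training."""
    seen = set(train_groups)
    return sum(g in seen for g in test_groups) / len(test_groups)


groups = [f"judge_{i % 37}" for i in range(500)]

# A naive split: every test-row judge also appears in training.
print(entity_recurrence(groups[:400], groups[400:]))  # 1.0, far above a 30% flag

# An entity-grouped split: no judge straddles the fold boundary.
folds = group_k_fold(groups, k=5)
test_idx = set(folds[0])
train_g = [g for i, g in enumerate(groups) if i not in test_idx]
test_g = [groups[i] for i in folds[0]]
print(entity_recurrence(train_g, test_g))  # 0.0

# Forward chaining on 9,032 chronologically ordered rows.
print([len(test) for _, test in forward_chaining(9032, k=5)])  # [1505, 1505, 1505, 1505, 1505]
```

With 9,032 rows and 5 folds, this forward-chaining rule gives test folds of 1,505 rows each, which matches the SCDB fold sizes listed in Table 3.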

**Table 3.** Per-experiment N, grouping entities, CV scheme, fold structure, and both estimators. Wide model-AUC fold intervals on the grouped 5-fold rows (e.g. FJC settlement) reflect genuine between-entity variance on a hard task, not instability of the effect — the paired ΔAUC, the tested quantity, is tight. The two nulls (N=1,916 and 6,017) are *powered* nulls, not silence from too little data.
Experiment	N	entities	CV scheme (folds)	test fold (min/med/max)	model AUC [fold 95% CI]	paired ΔAUC [95% CI], p
Judge party → direction (Carp-Manning, civ. lib.)	3,292	1,109 judges	entity-grouped (5)	658/658/659	0.565 [0.541, 0.587]	+0.034 [.015,.053], .002
Court ideology → SCOTUS (SCDB+Martin-Quinn)	9,032	78 terms	chronological (5)	1,505/1,505/1,505	0.520 [0.463, 0.597]	+0.026 [.010,.041], .002
Repeat-player → settlement (FJC)	4,000	92 districts	entity-grouped (5)	739/739/1,043	0.771 [0.605, 0.948]	+0.016 [.007,.025], .002
Judge attrs → settlement (FJC)	1,916	730 judges	entity-grouped (5)	383/383/384	0.770 [0.754, 0.792]	+0.000 [−.005,.005], .45
Panel composition → reversal (Songer)	6,017	607 judges	entity-grouped (5)	1,203/1,203/1,204	0.597 [0.563, 0.636]	−0.003 [−.006,.000], .84
ECtHR outcome (session-LLM, at chance)	150	—	repeated stratified (5×2)	30/30/30	0.508 [0.353, 0.669]	—

On the regression side, prediction intervals are split-conformal (a per-fold calibration slice sets the residual quantile, giving finite-sample coverage under exchangeability): on the sealed FJC duration test (n≈1,000 held out) measured coverage was 80.7% against an 80% target, with error beating the predict-the-mean baseline (MAE 0.879 vs 1.115 in log1p space). Numbers re-derived by stats_detail.py (deterministic, seed 13).
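A minimal split-conformal sketch for regression intervals follows, assuming symmetric absolute residuals as the nonconformity score and synthetic data in place of the sealed FJC set. The `predict` stand-in, the noise model, and the sample sizes are invented for illustration. Only the 80% target and seed 13 echo the text.

```python
import math
import random


def conformal_halfwidth(cal_true, cal_pred, coverage=0.80):
    """Residual quantile from a calibration slice, with the finite-sample correction."""
    residuals = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(residuals)
    k = min(n, math.ceil((n + 1) * coverage))
    return residuals[k - 1]


def interval(pred, q):
    """Symmetric prediction interval around a point prediction."""
    return pred - q, pred + q


def predict(x):
    """Stand-in for a fitted regressor (hypothetical)."""
    return x


rng = random.Random(13)


def draw(n):
    xs = [rng.uniform(0, 5) for _ in range(n)]
    ys = [x + rng.gauss(0, 1) for x in xs]  # truth = prediction + unit Gaussian noise
    return xs, ys


cal_x, cal_y = draw(500)     # calibration slice, never used for fitting
test_x, test_y = draw(1000)  # sealed test set

q = conformal_halfwidth(cal_y, [predict(x) for x in cal_x], coverage=0.80)
hits = 0
for x, y in zip(test_x, test_y):
    lo, hi = interval(predict(x), q)
    hits += lo <= y <= hi
coverage = hits / len(test_y)
print(round(q, 3), coverage)
```

The guarantee is marginal and rests on exchangeability between the calibration slice and the test cases, so measured coverage should land near the target rather than exactly on it, as with the 80.7% reported above.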

4. Why this is new

The pieces exist separately. LLM-as-feature-engineer for a small interpretable model (CHiLL, FeatLLM); verbatim-quote grounding as “verifiable by design” (Quote-Tuning) and fine-grained faithfulness measurement (AIS, FActScore, ALCE); the now-robust finding that self-correction needs an external signal (STaR and Reflexion work because they have one; intrinsic self-correction fails); calibration, conformal abstention, learning-to-defer, and the fairness-impossibility results. To our knowledge no prior system combines LLM-proposed grounded extraction, a calibrated classical decider, and an automated held-out statistical + fairness gate on its own self-modifications, and validates the whole as a known-answer test across real corpora. The contribution is less a higher accuracy than an honesty-and-calibration layer the benchmark-chasing branch of this field has lacked, and the external-gate answer the self-correction-fails literature demands.

5. Limitations and ethics

Outcome prediction reproduces historical patterns, including injustice; the fairness veto, abstention, and escalation reduce but do not remove this. The system is decision-support only — an architectural refusal layer blocks prohibited contexts (bail, sentencing, recidivism, immigration enforcement) regardless of confidence. The real-data samples are in the thousands, not the full corpora; some party names in the source data are truncated, capping one financial join; the expert gate is only as good as the expert, and authoring the frozen gold set is the binding human cost — not eliminated, but bounded and made auditable. Per-feature extraction confidences are model self-reported (the prompt asks for 0.9+ only on explicit statements, 0.5–0.7 on inferences) and gate abstention/escalation, but they are not externally calibrated — the downstream model calibrates its own probabilities separately and does not consume them. As noted at the top, a human audit of a sample of machine extractions is still pending; this is a working paper.

6. Generalization

The recipe is domain-agnostic. Any setting that compresses a qualitative record into a categorical or continuous verdict under scarcity, high stakes, a need for auditability, and available expert judgment fits the same four disciplines and the same known-answer validation. Law is the worked example here; medicine (readmission, response — the CHiLL setting) and forecasting (escalation, default) are the natural next targets. The valuable and dangerous problems all share a shape — violation or not, cured or not, war or not — and the question worth answering is not “can a model predict X” but “can we trust what it says about X,” demonstrated where the answers are already known.

Selected references

Aletras, Tsarapatsanis, Preoţiuc-Pietro & Lampos (2016). Predicting Judicial Decisions of the European Court of Human Rights. PeerJ Computer Science 2:e93.
Medvedeva, Vols & Wieling (2020). Using Machine Learning to Predict Decisions of the ECtHR. Artificial Intelligence and Law 28(2).
Medvedeva, Wieling & Vols (2023). Rethinking the Field of Automatic Prediction of Court Decisions. Artificial Intelligence and Law 31.
Katz, Bommarito & Blackman (2017). A General Approach for Predicting the Behavior of the Supreme Court. PLOS ONE 12(4):e0174698.
Chalkidis, Androutsopoulos & Aletras (2019). Neural Legal Judgment Prediction in English. ACL.
McInerney et al. (2023). CHiLL: Zero-shot Custom Interpretable Feature Extraction from Clinical Notes. Findings of ACL.
Han, Yoon, Arik & Pfister (2024). FeatLLM: LLMs Can Automatically Engineer Features for Few-Shot Tabular Learning. ICML.
Zhang et al. (2025). Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data. NAACL.
Rashkin et al. (2023). Measuring Attribution in Natural Language Generation Models (AIS). Computational Linguistics 49(4).
Min et al. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision. EMNLP.
Gao, Yen, Yu & Chen (2023). Enabling LLMs to Generate Text with Citations (ALCE). EMNLP.
Manakul, Liusie & Gales (2023). SelfCheckGPT. EMNLP.
Madaan et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS.
Shinn et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS.
Zelikman, Wu, Mu & Goodman (2022). STaR: Bootstrapping Reasoning With Reasoning. NeurIPS.
Huang et al. (2023/2024). Large Language Models Cannot Self-Correct Reasoning Yet. ICLR.
Stechly, Valmeekam & Kambhampati (2024). On the Self-Verification Limitations of LLMs.
Kamoi, Zhang, Zhang, Han & Zhang (2024). When Can LLMs Actually Correct Their Own Mistakes? TACL 12.
Panickssery, Bowman & Feng (2024). LLM Evaluators Recognize and Favor Their Own Generations. NeurIPS.
Guo, Pleiss, Sun & Weinberger (2017). On Calibration of Modern Neural Networks. ICML.
Angelopoulos & Bates (2021). A Gentle Introduction to Conformal Prediction. arXiv:2107.07511.
Madras, Pitassi & Zemel (2018). Predict Responsibly: Learning to Defer. NeurIPS.
Kleinberg, Mullainathan & Raghavan (2017). Inherent Trade-Offs in the Fair Determination of Risk Scores. ITCS.
Chouldechova (2017). Fair Prediction with Disparate Impact. Big Data 5(2).
Dressel & Farid (2018). The Accuracy, Fairness, and Limits of Predicting Recidivism. Science Advances 4(1).
Galanter (1974). Why the ‘Haves’ Come Out Ahead. Law & Society Review 9(1).
Martin & Quinn (2002). Dynamic Ideal Point Estimation for the U.S. Supreme Court. Political Analysis 10(2).
Segal & Spaeth (2002). The Supreme Court and the Attitudinal Model Revisited. Cambridge University Press.
Sunstein, Schkade & Ellman (2004). Ideological Voting on Federal Courts of Appeals. Virginia Law Review 90(1).
Songer, Sheehan & Haire (1999). Do the ‘Haves’ Come Out Ahead Over Time? Law & Society Review 33(4).

This write-up shares an ongoing research system for public knowledge. Numbers are reproducible from a deterministic verifier; the system is decision-support and research only, not a tool for predictions about individuals. Comments welcome — see the portfolio for contact.