Publications

Research on multi-agent AI systems, model selection, and machine learning under distribution shift.

2026 · Independent Research · CC BY 4.0
Weighted Multi-Expert Synthesis for High-Stakes Decision Support: A Multi-Agent LLM Framework with Dissent Preservation
David Liu
We present Meta Council, a multi-agent LLM framework in which N expert agents—each with a unique professional persona and analytical framework—analyze queries in parallel, then a weighted synthesis step produces structured decision documents with confidence scores, dissent preservation, and risk matrices. Evaluated across 750+ benchmark runs spanning six domains and five models (3B to frontier-class), we find that weighted synthesis outperforms single-best selection by 29–58% on free-text tasks (p<0.0001, d=2.16), that synthesis amplifies model quality non-linearly, and that the optimal aggregation method is domain-dependent.
Key Results: Synthesis outperforms single-best by 29–58% (p<0.0001) · 80% categorical accuracy vs 50% for single-best · Synthesis amplification: 2.99x for mid-tier models · Domain-dependent: synthesis wins in business (100%), single-best wins in legal (75%)
Multi-Agent Systems LLM Decision Support Dissent Preservation Weighted Synthesis Confidence Calibration
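The aggregation step described above can be illustrated with a minimal sketch. All names here (ExpertOpinion, synthesize, DISSENT_THRESHOLD) and the specific weighting rule (expert weight × self-reported confidence) are illustrative assumptions, not the paper's actual implementation:

```python
# Illustrative sketch: weighted multi-expert synthesis with dissent preservation.
# The weighting scheme and threshold are assumptions, not the published method.
from dataclasses import dataclass

@dataclass
class ExpertOpinion:
    persona: str       # e.g. "risk analyst"
    answer: str        # categorical recommendation
    confidence: float  # self-reported, in [0, 1]
    weight: float      # per-domain weight assigned to this expert

DISSENT_THRESHOLD = 0.5  # assumed cutoff: confident minority views are kept

def synthesize(opinions):
    """Score each answer by sum of (expert weight x confidence); keep dissents."""
    scores = {}
    for op in opinions:
        scores[op.answer] = scores.get(op.answer, 0.0) + op.weight * op.confidence
    total = sum(scores.values())
    consensus, top = max(scores.items(), key=lambda kv: kv[1])
    # Dissent preservation: record confident minority positions rather than
    # discarding them, so the decision document shows where experts disagreed.
    dissents = [(op.persona, op.answer, op.confidence) for op in opinions
                if op.answer != consensus and op.confidence >= DISSENT_THRESHOLD]
    return {"decision": consensus, "confidence": top / total, "dissents": dissents}

opinions = [
    ExpertOpinion("risk analyst",  "delay launch", 0.9, 1.0),
    ExpertOpinion("growth lead",   "launch now",   0.8, 0.8),
    ExpertOpinion("legal counsel", "delay launch", 0.7, 1.2),
]
result = synthesize(opinions)
```

Here "delay launch" wins (weighted score 1.74 vs. 0.64), but the growth lead's confident minority view survives into the output rather than being averaged away, which is the property the dissent-preservation mechanism targets.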
2026 · Independent Research · CC BY 4.0
Stability Bonus Regularization for Model Selection Under Positive-Class Distribution Shift
David C. Liu
When positive-class training data is a biased subset of the true positive population—common in hiring, medical screening, and credit scoring—standard cross-validation selects models that overfit to the observed cluster. We propose Stability Bonus (SB) regularization, which rewards hyperparameter configurations with small train–validation gaps, favoring wider decision boundaries that generalize to unseen positive subgroups. On hard synthetic benchmarks with controlled distribution shift, SB improves test AUC by +6.9% (p<0.0001, d=3.48) and unseen-subgroup AUC by +7.1%. Class weighting, the standard remedy, is the worst performer under shift.
Key Results: +6.9% test AUC on hard benchmarks (p<0.0001) · +7.1% on unseen positive subgroups · Class weighting is worst under distribution shift · SB selects more regularized models with wider decision boundaries · Does NOT help when signal is strong (honest negative result)
Model Selection Distribution Shift Cross-Validation Class Imbalance Regularization Fairness
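The selection rule can be sketched in a few lines. The gap-penalty form and the weight `lam` below are illustrative assumptions; the paper's exact bonus formula may differ:

```python
# Illustrative sketch of Stability Bonus (SB) model selection: reward
# hyperparameter configs with small train-validation gaps. The penalty
# form and lam=0.5 are assumptions for illustration, not the paper's values.
def sb_score(train_score, val_score, lam=0.5):
    """Validation score minus a penalty on the train-validation gap."""
    gap = max(0.0, train_score - val_score)
    return val_score - lam * gap  # equivalently, a bonus for stability

def select(configs, lam=0.5):
    """configs: list of (name, train AUC, val AUC); return the SB winner."""
    return max(configs, key=lambda c: sb_score(c[1], c[2], lam))[0]

configs = [
    # (hypothetical hyperparameter config, train AUC, validation AUC)
    ("C=10 (sharp boundary)", 0.99, 0.90),  # large gap: overfit to observed cluster
    ("C=0.1 (wide boundary)", 0.91, 0.89),  # small gap: more regularized
]
best = select(configs)
```

Plain validation AUC would pick the sharp-boundary config (0.90 > 0.89); SB instead favors the wider, more stable boundary (0.88 vs. 0.855 after the gap penalty), which is the behavior the abstract credits for better generalization to unseen positive subgroups.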