Extraction Audit

This interface is used to manually audit a random sample of the session-LLM extractions behind the paper.

An automated check has already confirmed that every quote is a verbatim substring of its case (hallucination rate 0.000). What it cannot check is the semantic step: does the green quote actually justify the extracted value? Read each item and pick a verdict. Everything is local: your verdicts autosave in this browser and leave your machine only if you export them. Use Export to download them as a JSON record, and Import to resume on another device. This is a static page: there is no server, by design.
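
For orientation, the verbatim-substring check described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the `Extraction` record, its field names (`case_text`, `quote`, `value`), and both function names are assumptions made for the example. It also assumes "verbatim" means an exact, case-sensitive match with no whitespace normalization, and that an empty quote counts as a miss.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    # Hypothetical record shape; the real schema may differ.
    case_text: str  # full text of the source case
    quote: str      # supporting quote returned by the model
    value: str      # extracted value the quote is meant to justify


def is_verbatim(item: Extraction) -> bool:
    """Return True when the quote is a non-empty, exact substring of its case."""
    # Assumption: exact, case-sensitive match; no whitespace normalization.
    return bool(item.quote) and item.quote in item.case_text


def hallucination_rate(items: list[Extraction]) -> float:
    """Return the fraction of items whose quote is not found verbatim."""
    if not items:
        return 0.0  # assumption: an empty sample reports a rate of zero
    misses = sum(1 for item in items if not is_verbatim(item))
    return misses / len(items)
```

A rate of 0.000 from a check like this says only that each quote exists in its case. Whether the quote supports the extracted value is the judgment left to the human reviewer.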

Note on “conf”: the confidence shown per item is the extracting model's self-reported score, following the extraction prompt's rubric (0.9+ only for explicit statements, 0.5–0.7 for inferences). It is not an externally calibrated probability, and the downstream classifier does not consume it as one. Treat it as the model's own hedge: useful context for your judgment, not ground truth.