Researchers at Stanford University have developed a platform called VeriFact that pulls clinical data from a patient's EHR and uses a large language model to determine whether AI-generated documentation about that patient is accurate.
According to a study published in NEJM AI, the researchers sought to test how accurately LLM-generated clinical text reflects a patient's actual medical record.
Researchers created VeriFact, a system that pulls relevant data from the EHR and analyzes it, using an "LLM-as-a-judge" approach to evaluate whether the generated statements are factually supported by the EHR data.
The researchers also introduced a clinician‑annotated benchmark dataset, VeriFact‑Brief Hospital Course (VeriFact‑BHC), that breaks hospital discharge narratives into individual claims and labels whether each claim is supported by the actual EHR.
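As a rough illustration of the workflow described above (break a narrative into claims, pull relevant EHR data, then have a judge decide whether each claim is supported), here is a minimal Python sketch. It is not VeriFact's code: the function names, the keyword-overlap retrieval, the label strings, and the `toy_judge` stand-in for the LLM are all assumptions made for this example, and the patient data is invented.

```python
import re
from typing import Callable, List, Set, Tuple


def split_into_claims(narrative: str) -> List[str]:
    """Naive sentence split; a hypothetical stand-in for claim decomposition."""
    parts = re.split(r"(?<=[.!?])\s+", narrative)
    return [p.strip() for p in parts if p.strip()]


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, used only for the toy overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(claim: str, ehr_facts: List[str], k: int = 2) -> List[str]:
    """Return the k EHR facts sharing the most tokens with the claim.

    Assumption: a real system would use a proper retrieval method;
    token overlap is used here only to keep the sketch self-contained.
    """
    claim_tokens = tokenize(claim)
    ranked = sorted(
        ehr_facts,
        key=lambda fact: len(claim_tokens & tokenize(fact)),
        reverse=True,
    )
    return ranked[:k]


def toy_judge(claim: str, evidence: List[str]) -> str:
    """Placeholder for the LLM-as-a-judge call (assumption, not a real model).

    Labels a claim "supported" only if every claim token appears in a single
    retrieved fact; otherwise "not_supported".
    """
    claim_tokens = tokenize(claim)
    for fact in evidence:
        if claim_tokens <= tokenize(fact):
            return "supported"
    return "not_supported"


def verify(
    narrative: str,
    ehr_facts: List[str],
    judge: Callable[[str, List[str]], str] = toy_judge,
) -> List[Tuple[str, str]]:
    """Label each claim in the narrative against retrieved EHR facts."""
    return [
        (claim, judge(claim, retrieve(claim, ehr_facts)))
        for claim in split_into_claims(narrative)
    ]


# Invented example data, for illustration only.
ehr_facts = [
    "Patient was started on metformin 500 mg.",
    "Patient has type 2 diabetes.",
]
narrative = "Patient was started on metformin. Patient was started on insulin."
results = verify(narrative, ehr_facts)
for claim, label in results:
    print(f"{label}: {claim}")
```

In this sketch the judge is passed in as a parameter, so the toy rule could be replaced by a call to an actual language model without changing the rest of the pipeline.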
VeriFact achieved 93.2% agreement with clinicians.
The highest interrater agreement among clinicians was 88.5%, indicating that VeriFact can produce more consistent fact verification than humans.
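For readers unfamiliar with agreement figures like these, the sketch below shows one simple way such a number can be computed: the percentage of claims on which two raters give the same label. The article does not specify how the study calculated agreement, so the formula, the function name `percent_agreement`, and the labels are assumptions for illustration, and the data is invented.

```python
from typing import Sequence


def percent_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Share of items (as a percentage) on which two raters give the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both raters must label the same number of claims.")
    if not labels_a:
        raise ValueError("At least one labeled claim is required.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Invented labels for five claims, for illustration only.
system_labels = ["supported", "supported", "not_supported", "supported", "not_supported"]
clinician_labels = ["supported", "supported", "not_supported", "not_supported", "not_supported"]

agreement = percent_agreement(system_labels, clinician_labels)
print(f"{agreement:.1f}% agreement")  # 4 of 5 labels match -> 80.0% agreement
```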