Here, we conduct a highly powered study to determine whether LLMs can provide clinical recommendations for three tasks (admission status, radiological investigation(s) request status, and antibiotic prescription status) using clinical notes from the Emergency Department. For each of the three decisions, we compiled a set of 1,000 ED visits to analyze from an archive of more than 251,000 visits. We first evaluated performance on balanced datasets (i.e., with equal numbers of positive and negative outcomes) to examine the sensitivity and specificity of GPT recommendations, before determining overall model accuracy on an unbalanced dataset that reflects the real-world distribution of patients presenting to the Emergency Department. We found that both GPT-4-turbo and GPT-3.5-turbo performed poorly compared to a resident physician, with accuracy scores on average 8% and 24% lower than the physician's, respectively. Both LLMs tended to be overly cautious in their recommendations, achieving high sensitivity at the cost of specificity.
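As a concrete illustration of the metrics just described, below is a minimal Python sketch, not the authors' actual evaluation pipeline, of how sensitivity, specificity, and accuracy are computed from binary outcome labels and model recommendations. The vectors `labels` and `preds` and the helper names are hypothetical stand-ins for the study's ground-truth decisions and an LLM's outputs.

```python
# Minimal sketch (assumed, not the study's code) of the evaluation metrics:
# sensitivity and specificity on a balanced set, plus overall accuracy.
# Convention: 1 = positive outcome (e.g. "admit"), 0 = negative outcome.

def confusion_counts(labels, preds):
    """Return (true positives, true negatives, false positives, false negatives)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(labels, preds):
    tp, tn, fp, fn = confusion_counts(labels, preds)
    return {
        "sensitivity": tp / (tp + fn),   # recall on positive outcomes
        "specificity": tn / (tn + fp),   # recall on negative outcomes
        "accuracy": (tp + tn) / len(labels),
    }

# Hypothetical example of an overly cautious model -- the failure mode the
# study attributes to both GPT models: it over-predicts the positive class,
# yielding high sensitivity but low specificity.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds  = [1, 1, 1, 1, 1, 1, 0, 0]
print(evaluate(labels, preds))
# {'sensitivity': 1.0, 'specificity': 0.5, 'accuracy': 0.75}
```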