Identifying who has long COVID in the USA: a machine learning approach using N3C data

Updated 05/23/2022

Because the definition, clinical guidelines, and documentation practices for long COVID are still evolving, there is not yet a gold standard for validating computable phenotypes or for training machine learning models.

Using the National COVID Cohort Collaborative's (N3C) electronic health record repository, the authors developed XGBoost machine learning models to identify potential patients with long COVID.
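To make the idea of a gradient-boosted classifier over health record features more concrete, here is a minimal sketch in plain Python. It is not the study's pipeline and it is not XGBoost itself: it is a toy stand-in that boosts one-split decision stumps on log loss, which is the basic idea that XGBoost implements in a far more sophisticated form. The feature names, the eight synthetic rows, and the labelling rule are all invented for illustration and have no connection to N3C data.

```python
import math
from itertools import product


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split that best fits the residuals (least squares)."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue  # a split must put rows on both sides
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no feature varies, so no split is possible")
    return best[1:]


def stump_value(stump, row):
    """Contribution of one stump to the raw (log-odds) score of a row."""
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value


def fit_boosted_stumps(X, y, n_rounds=30, learning_rate=0.5):
    """Gradient boosting on log loss: each stump fits the current residuals y - p."""
    scores = [0.0] * len(X)
    model = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, threshold, left_mean, right_mean = fit_stump(X, residuals)
        stump = (j, threshold, learning_rate * left_mean, learning_rate * right_mean)
        model.append(stump)
        scores = [s + stump_value(stump, row) for s, row in zip(scores, X)]
    return model


def predict_proba(model, row):
    """Predicted probability of the positive class for one row."""
    return sigmoid(sum(stump_value(stump, row) for stump in model))


# Hypothetical binary features; the names are assumptions, not N3C fields.
FEATURES = ["post_covid_dyspnea", "unrelated_code", "hospitalised"]

# Synthetic rows: every combination of the three flags.
X = [list(bits) for bits in product([0, 1], repeat=3)]
# Invented labelling rule, purely so the model has a pattern to learn.
y = [1 if row[0] or row[2] else 0 for row in X]

model = fit_boosted_stumps(X, y)
```

Calling `predict_proba(model, [1, 0, 1])` returns the toy model's probability for a row with both informative flags set. In practice one would use the `xgboost` library (for example its `XGBClassifier`) rather than hand-written boosting, with features engineered from real records and labels from a carefully chosen proxy.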

Although not every feature can be easily categorised, four themes emerged across the features and models: (1) post-COVID-19 respiratory symptoms and associated treatments, (2) non-respiratory symptoms widely reported as part of long COVID and associated treatments, (3) pre-existing risk factors for greater acute COVID severity, and (4) proxies for hospitalisation.

This study has several limitations. Electronic health record data are skewed towards patients who make more use of health-care systems, and especially towards high utilisers. It is essential to acknowledge whose data are less likely to be represented: for example, uninsured patients, patients with restricted access to or ability to pay for care, and patients seeking care at small practices or community hospitals with limited data exchange capabilities.
