Development and Performance of Text-Mining Algorithms to Extract Socioeconomic Status from De-Identified Electronic Health Records

Brittany M. Hollister¹, Nicole A. Restrepo², Eric Farber-Eger³, Dana C. Crawford², Melinda C. Aldrich⁴, Amy Non⁵

¹Vanderbilt Genetics Institute, Vanderbilt University
²Institute for Computational Biology, Department of Epidemiology and Biostatistics, Case Western Reserve University
³Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center
⁴Department of Thoracic Surgery and Division of Epidemiology, Vanderbilt University Medical Center
⁵Department of Anthropology, University of California, San Diego
Email: Brittany.M.Hollister@Vanderbilt.edu, nrestrepo@case.edu, eric.h.farber-eger@vanderbilt.edu, dana.crawford@case.edu, melinda.aldrich@vanderbilt.edu, alnon@ucsd.edu

Pacific Symposium on Biocomputing 22:230-241(2017)

© 2017 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.


Abstract

Socioeconomic status (SES) is a fundamental contributor to health and a key factor underlying racial disparities in disease. However, SES data are rarely included in genetic studies, due in part to the difficulty of collecting these data when studies were not originally designed for that purpose. The emergence of large clinic-based biobanks linked to electronic health records (EHRs) provides research access to large patient populations with longitudinal phenotype data captured in structured fields as billing codes, procedure codes, and prescriptions. SES data, however, are often not explicitly recorded in structured fields, but rather in the free text of clinical notes and communications. The content and completeness of these data vary widely by practitioner. To enable gene-environment studies that consider SES as an exposure, we sought to extract SES variables from racial/ethnic minority adult patients (n=9,977) in BioVU, the Vanderbilt University Medical Center biorepository linked to de-identified EHRs. We developed several measures of SES using information available within the de-identified EHR, including broad categories of occupation, education, insurance status, and homelessness. Two hundred patients were randomly selected for manual review to develop a set of seven algorithms for extracting SES information from de-identified EHRs. The algorithms consist of 15 categories of information, with 830 unique search terms. SES data extracted from manual review of 50 randomly selected records were compared to data produced by the algorithms, resulting in positive predictive values of 80.0% (education), 85.4% (occupation), 87.5% (unemployment), 63.6% (retirement), 23.1% (uninsured), 81.8% (Medicaid), and 33.3% (homelessness), suggesting that some categories of SES data are easier to extract in this EHR than others. The SES data extraction approach developed here will enable future EHR-based genetic studies to integrate SES information into statistical analyses. Ultimately, incorporation of measures of SES into genetic studies will help elucidate the impact of the social environment on disease risk and outcomes.
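
As a rough illustration of the evaluation described above, the following Python sketch flags free-text notes with category-specific search terms and then computes a positive predictive value (PPV) against manual review. It is not the authors' implementation: the search terms, the record IDs, the note text, and the function names (`compile_patterns`, `flag_note`, `positive_predictive_value`) are all hypothetical, since the abstract does not list the 830 terms or describe the matching logic.

```python
import re

# Hypothetical search terms for two of the SES categories named in the abstract.
# The actual terms used in the study are not given here.
SEARCH_TERMS = {
    "unemployment": ["unemployed", "out of work"],
    "medicaid": ["medicaid"],
}


def compile_patterns(terms):
    """Build one case-insensitive, word-bounded regex per category."""
    return {
        category: re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
        for category, words in terms.items()
    }


def flag_note(note, patterns):
    """Return the set of categories whose search terms appear in the note."""
    return {cat for cat, pattern in patterns.items() if pattern.search(note)}


def positive_predictive_value(flagged, confirmed):
    """PPV = true positives / all records flagged by the algorithm."""
    if not flagged:
        return None  # undefined when nothing was flagged
    return len(flagged & confirmed) / len(flagged)


# Toy notes (invented for illustration, not real patient data).
notes = {
    "r1": "Patient is currently unemployed and lives with family.",
    "r2": "Denies being unemployed; works as a mechanic.",
    "r3": "Has been out of work since March.",
    "r4": "Covered by Medicaid.",
}

# Records a hypothetical manual reviewer confirmed as unemployed.
manual_unemployed = {"r1", "r3"}

patterns = compile_patterns(SEARCH_TERMS)
flagged_unemployed = {
    record_id
    for record_id, text in notes.items()
    if "unemployment" in flag_note(text, patterns)
}
ppv = positive_predictive_value(flagged_unemployed, manual_unemployed)
print(f"Flagged: {sorted(flagged_unemployed)}, PPV: {ppv:.3f}")
```

In this toy data, record `r2` is a false positive because a plain keyword match cannot see the negation ("Denies being unemployed"). That is one plausible reason a keyword-based category could show a low PPV, though the abstract does not say which error types drove the lower values reported for the uninsured and homelessness categories.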

