Automated disease cohort selection using word embeddings from Electronic Health Records

Benjamin S. Glicksberg1,2,†, Riccardo Miotto1,2,†, Kipp W. Johnson1,2, Khader Shameer1,2, Li Li1,2, Rong Chen1, Joel T. Dudley1,2

1Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai
2Institute for Next Generation Healthcare, Icahn School of Medicine at Mount Sinai
†Authors contributed equally to this work

Pacific Symposium on Biocomputing 23:145-156(2018)

© 2018 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.


Accurate and robust cohort definition is critical to biomedical discovery using Electronic Health Records (EHR). As in prospective study designs, high-quality EHR-based research requires rigorous selection criteria to designate case/control status particular to each disease. Electronic phenotyping algorithms, which are manually built and validated per disease, have been successful in filling this need. However, these approaches are time-consuming, so algorithms have been developed for only a relatively small number of diseases. Methodologies that automatically learn features from EHRs have been used for cohort selection as well. To date, however, there has been no systematic analysis of how these methods perform against current gold standards. Accordingly, this paper compares the performance of a state-of-the-art automated feature learning method for extracting research-grade cohorts for five diseases against their established electronic phenotyping algorithms. In particular, we use word2vec to create unsupervised embeddings of the phenotype space within an EHR system. Using medical concepts as a query, we then rank patients by their proximity in the embedding space and automatically extract putative disease cohorts via a distance threshold. Experimental evaluation shows promising results, with an average F-score of 0.57 and AUC-ROC of 0.98. However, results varied considerably between diseases, so further investigation and/or phenotype-specific refinement of the approach is needed before it can be readily deployed across all diseases.
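
The query-and-threshold step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how patients are represented or which distance is used, so this example assumes a patient vector is the mean of the patient's concept embeddings and that proximity is cosine distance. The concept names, the three-dimensional vectors (stand-ins for word2vec output), the helper names, and the threshold value are all hypothetical.

```python
import numpy as np


def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    query_unit = query / np.linalg.norm(query)
    row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return (matrix / row_norms) @ query_unit


def select_cohort(query_vec, patient_vecs, patient_ids, max_distance):
    """Rank patients by cosine distance to the query and keep those within
    max_distance. Returns (patient_id, distance) pairs, nearest first."""
    distances = 1.0 - cosine_similarity(query_vec, patient_vecs)
    order = np.argsort(distances)
    return [
        (patient_ids[i], float(distances[i]))
        for i in order
        if distances[i] <= max_distance
    ]


# Toy concept embeddings (assumption: in practice these would be learned
# by word2vec from sequences of medical concepts in the EHR).
concept_vecs = {
    "type_2_diabetes": np.array([1.0, 0.0, 0.0]),
    "metformin": np.array([0.9, 0.1, 0.0]),
    "asthma": np.array([0.0, 1.0, 0.0]),
    "albuterol": np.array([0.1, 0.9, 0.0]),
}

# Hypothetical patient records: each patient is a list of observed concepts.
patient_records = {
    "patient_a": ["type_2_diabetes", "metformin"],
    "patient_b": ["asthma", "albuterol"],
    "patient_c": ["metformin", "albuterol"],
}

# Assumption: a patient is embedded as the mean of their concept vectors.
patient_ids = list(patient_records)
patient_vecs = np.array([
    np.mean([concept_vecs[c] for c in concepts], axis=0)
    for concepts in patient_records.values()
])

# Query with a disease concept and extract the putative cohort.
cohort = select_cohort(
    concept_vecs["type_2_diabetes"], patient_vecs, patient_ids, max_distance=0.2
)
for patient_id, distance in cohort:
    print(patient_id, round(distance, 4))
```

In this toy setting a tight threshold keeps only the patient whose record is dominated by diabetes-related concepts, while a looser one also admits the patient with a mixed record. That trade-off is the kind of disease-specific tuning the abstract suggests may be needed.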
