Alejandro Schuler1, Vincent Liu1, Joe Wan2, Alison Callahan1, Madeleine Udell3, David E. Stark1, Nigam H. Shah1
1Center for Biomedical Informatics Research, Stanford University
2Computer Science, Stanford University
3Center for the Mathematics of Information, California Institute of Technology
Email: aschuler@stanford.edu, vinliu@stanford.edu, joewan@stanford.edu, acallaha@stanford.edu, udell@caltech.edu, dstark@stanford.edu, nigam@stanford.edu
Pacific Symposium on Biocomputing 21:144-155(2016)
© 2016 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.
The practice of medicine is predicated on discovering commonalities or distinguishing characteristics among patients to inform corresponding treatment. Given a patient grouping (hereafter referred to as a phenotype), clinicians can implement a treatment pathway accounting for the underlying cause of disease in that phenotype. Traditionally, phenotypes have been discovered by intuition, experience in practice, and advancements in basic science, but these approaches are often heuristic, labor intensive, and can take decades to produce actionable knowledge. Although our understanding of disease has progressed substantially in the past century, there are still important domains in which our phenotypes are murky, such as in behavioral health or in hospital settings. To accelerate phenotype discovery, researchers have used machine learning to find patterns in electronic health records, but have often been thwarted by missing data, sparsity, and data heterogeneity. In this study, we use a flexible framework called Generalized Low Rank Modeling (GLRM) to overcome these barriers and discover phenotypes in two sources of patient data. First, we analyze data from the 2010 Healthcare Cost and Utilization Project National Inpatient Sample (NIS), which contains upwards of 8 million hospitalization records consisting of administrative codes and demographic information. Second, we analyze a small (N=1746), local dataset documenting the clinical progression of autism spectrum disorder patients using granular features from the electronic health record, including text from physician notes. We demonstrate that low rank modeling successfully captures known and putative phenotypes in these vastly different datasets.