Nicole Tignor1, Pei Wang1, Nicholas Genes1,2, Linda Rogers3, Steven G. Hershman4, Erick R. Scott1, Micol Zweig1, Yu-Feng Yvonne Chan1,2, Eric E. Schadt1
1Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai
2Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai
3Department of Medicine, Pulmonary, Critical Care and Sleep Medicine, Icahn School of Medicine at Mount Sinai
4LifeMap Solutions, Inc
Email: pei.wang@mssm.edu, eric.schadt@mssm.edu
Pacific Symposium on Biocomputing 22:300-323(2017)
© 2017 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.
In our recent Asthma Mobile Health Study (AMHS), thousands of asthma patients across the country contributed medical data daily through the iPhone Asthma Health App over an extended period. The collected data included daily self-reported asthma symptoms, symptom triggers, and real-time geographic location information. The AMHS is just one of many studies taking place amid the many thousands of mobile health apps that aim to improve wellness and better manage chronic disease by leveraging passive and active data collection from handheld smart devices. The ability to identify patient groups or symptom patterns that might predict adverse outcomes, such as asthma exacerbations or hospitalizations, from these large, prospectively collected data sets would be of significant general interest. However, conventional clustering methods cannot be applied to such longitudinally collected data, especially survey data actively collected from app users, because of heterogeneous patterns of missing values arising from: 1) varying survey response rates among users, 2) varying survey response rates over time for each user, and 3) non-overlapping periods of enrollment among users. To handle this complicated missing-data structure, we proposed a probability imputation model to infer missing data. We also employed a consensus clustering strategy in tandem with the multiple imputation procedure. In simulation studies under a range of scenarios reflecting real data conditions, the proposed method performed favorably compared with other strategies that impute missing values through low-rank matrix completion. Applying the proposed method to asthma triggers and symptoms collected as part of the AMHS, we identified several patient groups with distinct phenotype patterns.
With further validation, the methods described in this paper might be used to identify clinically important patterns in large data sets with complicated missing-data structure, improving the ability to use such data sets to identify at-risk populations for potential intervention.
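To make the general idea of pairing multiple imputation with consensus clustering concrete, the following is a minimal Python sketch under simplifying assumptions. It is not the probability imputation model proposed in this paper: here each missing binary report is simply drawn from a Bernoulli distribution whose probability is the user's observed rate shrunk toward the population rate, each imputed data set is clustered with a basic k-means on per-user averages, and the final groups are the connected components of users who co-cluster in more than half of the imputations. All function names, parameters (such as `prior_weight`), and the toy data are hypothetical and chosen only for illustration.

```python
import random

def impute_once(records, rng, prior_weight=1.0):
    """Fill each missing (None) binary entry with one Bernoulli draw.

    Assumption for illustration: the draw probability is the user's observed
    rate for that variable, shrunk toward the population-wide observed rate.
    """
    n_vars = len(records[0][0])
    pop = []
    for v in range(n_vars):
        obs = [day[v] for user in records for day in user if day[v] is not None]
        pop.append(sum(obs) / len(obs) if obs else 0.5)
    filled = []
    for user in records:
        probs = []
        for v in range(n_vars):
            obs = [day[v] for day in user if day[v] is not None]
            probs.append((sum(obs) + prior_weight * pop[v]) / (len(obs) + prior_weight))
        filled.append([[day[v] if day[v] is not None else int(rng.random() < probs[v])
                        for v in range(n_vars)] for day in user])
    return filled

def user_profiles(filled):
    """Summarize each user by the mean of every variable over their own days."""
    return [[sum(col) / len(user) for col in zip(*user)] for user in filled]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=10):
    """Basic k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(sq_dist(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def consensus_clusters(records, k=2, n_imputations=20, seed=0):
    """Cluster each imputed data set, then group users by co-clustering frequency."""
    rng = random.Random(seed)
    n = len(records)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_imputations):
        labels = kmeans(user_profiles(impute_once(records, rng)), k)
        for i in range(n):
            for j in range(n):
                counts[i][j] += labels[i] == labels[j]
    co = [[c / n_imputations for c in row] for row in counts]
    # Final groups: connected components of the graph linking users that
    # fall in the same cluster in more than half of the imputations.
    final, next_label = [-1] * n, 0
    for i in range(n):
        if final[i] != -1:
            continue
        final[i], stack = next_label, [i]
        while stack:
            u = stack.pop()
            for w in range(n):
                if final[w] == -1 and co[u][w] > 0.5:
                    final[w] = next_label
                    stack.append(w)
        next_label += 1
    return final, co

def toy_user(value, n_days, missing):
    """Hypothetical toy user: constant reports with some (day, variable) cells missing."""
    return [[None if (d, v) in missing else value for v in range(2)] for d in range(n_days)]

# Users differ in enrollment length and in which survey items they skipped.
records = [
    toy_user(1, 8, {(2, 0), (5, 1)}),
    toy_user(1, 6, {(0, 0), (0, 1)}),
    toy_user(1, 10, {(3, 1), (7, 0), (8, 0)}),
    toy_user(0, 8, {(1, 0), (4, 1)}),
    toy_user(0, 5, {(2, 0)}),
    toy_user(0, 9, {(0, 1), (6, 1)}),
]
labels, co = consensus_clusters(records, k=2)
print(labels)
```

Storing each user as their own list of days lets enrollment lengths differ without padding, which loosely mirrors the non-overlapping enrollment periods described above. Averaging cluster co-membership over many imputations, rather than clustering a single filled-in data set, is what lets the uncertainty of the imputed values carry through to the final grouping.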