Haoran Zhang1,2,3,*, Elisa Candido3, Andrew S. Wilton3, Raquel Duchen3, Liisa Jaakkimainen3, Walter Wodchis3,4,5, Quaid Morris1,2,6,7,*
1Department of Computer Science, University of Toronto
2Vector Institute for Artificial Intelligence, Toronto
3ICES
4Institute of Health Policy, Management, and Evaluation, University of Toronto
5Institute for Better Health, Trillium Health Partners
6Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto
7Department of Molecular Genetics, University of Toronto
*Corresponding author
Email: haoran@cs.toronto.edu, quaid.morris@utoronto.ca
Pacific Symposium on Biocomputing 25:127-138 (2020)
© 2020 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.
Identifying patients at risk of becoming High Cost Users (HCUs), and intervening before they do, presents an opportunity to improve outcomes while also yielding significant savings for the healthcare system. In this paper, we predict the 2016 HCU status of patients using free-form text from the 2015 cumulative patient profiles within the electronic medical records of family care practices in Ontario. These unstructured notes make substantial use of domain-specific spellings and abbreviations; we show that word embeddings derived from the same context provide more informative features than pre-trained ones based on Wikipedia, MIMIC, and PubMed. We further demonstrate that a model using features derived from aggregated word embeddings (EmbEncode) provides a significant performance improvement over the bag-of-words representation (82.48±0.35% versus 81.85±0.36% held-out AUROC, p = 3.2 × 10⁻⁴), while using far fewer input features (5,492 versus 214,750) and fewer non-zero coefficients (1,177 versus 4,284). The future HCUs of greatest interest are the transitional ones who are not already HCUs, because they offer the greatest scope for intervention. Predicting these new HCUs is challenging because most HCUs recur. We show that removing recurrent HCUs from the training set improves EmbEncode's ability to predict new HCUs, while only slightly decreasing its ability to predict recurrent ones.
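The core idea of aggregating word embeddings into a fixed-length note representation can be sketched as follows. This is a minimal illustration using mean pooling over a toy embedding table; the tokens, vectors, and pooling choice here are assumptions for illustration, not the paper's actual EmbEncode pipeline, whose embeddings would be trained on the clinical notes themselves so that domain-specific abbreviations receive useful vectors.

```python
# Sketch: turn a free-text note into a fixed-length feature vector by
# averaging the embeddings of its tokens, as an alternative to a sparse
# bag-of-words vector with one dimension per vocabulary word.
import numpy as np

# Toy 4-dimensional embedding table (illustrative values only).
EMBED = {
    "htn":      np.array([0.9, 0.1, 0.0, 0.2]),  # hypertension
    "dm2":      np.array([0.8, 0.2, 0.1, 0.0]),  # type 2 diabetes
    "followup": np.array([0.0, 0.7, 0.3, 0.1]),
}
DIM = 4

def embed_note(text):
    """Mean-pool the embeddings of known tokens; unknown tokens are skipped."""
    vecs = [EMBED[t] for t in text.lower().split() if t in EMBED]
    if not vecs:
        return np.zeros(DIM)
    return np.mean(vecs, axis=0)

# The output dimension is fixed (here 4) regardless of note length,
# which is why the embedding-based representation needs far fewer
# input features than bag-of-words over the full vocabulary.
features = embed_note("HTN DM2 followup")
print(features.shape)
```

A downstream classifier (e.g. logistic regression) would then be fit on these dense note vectors instead of the high-dimensional bag-of-words counts.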