Evaluation of patient re-identification using laboratory test orders and mitigation via latent space variables

Kipp W. Johnson1,†, Jessica K. De Freitas1,†, Benjamin S. Glicksberg2, Jason R. Bobe1, Joel T. Dudley1,*


1Institute for Next Generation Healthcare, Department of Genetics and Genomics Sciences, Icahn School of Medicine at Mount Sinai
2Bakar Computational Health Sciences Institute, The University of California San Francisco
†Authors contributed equally to this work
*Corresponding author
Email: joel.dudley@mssm.edu

Pacific Symposium on Biocomputing 24:415-426(2019)

© 2019 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.


Abstract

Anonymized electronic health records (EHR) are often used for biomedical research. One persistent concern with this type of research is the risk of re-identification of patients from their purportedly anonymized data. Here, we use the EHR of 731,850 de-identified patients to demonstrate that the average patient is unique from all others 98.4% of the time simply by examining which laboratory tests have been ordered for them. By the time a patient has visited the hospital on two separate days, they are unique in 72.3% of cases. We further present a computational study to identify how accurately the records from a single day of care can be used to re-identify patients from a set of 99 other patients. We show that, given a single visit's laboratory orders (even without result values) for a patient, we can re-identify the patient at least 25% of the time. Furthermore, we can place this patient among the top 10 most similar patients 47% of the time. Finally, we present a proof-of-concept technique using a variational autoencoder to encode laboratory results into a lower-dimensional latent space. We demonstrate that releasing latent-space encoded laboratory orders significantly improves privacy compared to releasing raw laboratory orders (<5% re-identification), while preserving information contained within the laboratory orders (AUC of >0.9 for recreating encoded values). Our findings have potential consequences for the public release of anonymized laboratory tests to the biomedical research community. We note that our findings do not imply that laboratory tests alone are personally identifiable. In the attack scenario presented here, re-identification would require a threat actor to already possess an external source of laboratory values that are linked to personal identifiers.
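
As an illustration of the uniqueness measure described in the abstract, the following minimal sketch computes the fraction of patients whose set of ordered laboratory tests is shared by no other patient. It is not the authors' code: the function name `fraction_unique`, the patient identifiers, and the toy test names are all hypothetical, and treating each patient's orders as an unordered set is an assumption made for this example.

```python
from collections import Counter


def fraction_unique(orders_by_patient):
    """Fraction of patients whose set of ordered tests matches no other patient.

    orders_by_patient: dict mapping a (hypothetical) patient ID to a list of
    ordered test names. Result values are deliberately not used.
    """
    if not orders_by_patient:
        return 0.0
    # Assumption: a patient's "signature" is the unordered set of tests ordered.
    signatures = {pid: frozenset(tests) for pid, tests in orders_by_patient.items()}
    # Count how many patients share each signature.
    counts = Counter(signatures.values())
    unique = sum(1 for sig in signatures.values() if counts[sig] == 1)
    return unique / len(signatures)


# Toy data, invented for illustration only.
toy = {
    "p1": ["CBC", "BMP"],
    "p2": ["BMP", "CBC"],    # same set as p1, so neither is unique
    "p3": ["CBC", "HbA1c"],
    "p4": ["TSH"],
}
print(fraction_unique(toy))  # 0.5
```

In this toy cohort, two of the four patients have an order set that nobody else shares, so the function returns 0.5. A real analysis would also need to decide how to group orders by visit day, which this sketch does not attempt.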

