Poster Board # Presenting Poster Author First Name Presenting Poster Author Last Name Session/Workshop Area Abstract Type Last Name of First Author Abstract Title Authors  Authors Affiliations/Institutions Abstract (300 words or less) Poster DOI or URL
1 Michelle Holko Digital health technology data in biocomputing: Research efforts and considerations for expanding access Accepted proceedings paper with poster presentation Master How Fitbit data are being made available to registered researchers in All of Us Research Program Hiral Master, Aymone Kouame, Kayla Marginean, Melissa Basford, Paul Harris, Michelle Holko Vanderbilt University Medical Center, Vanderbilt University Medical Center, Vanderbilt University Medical Center, Vanderbilt University Medical Center, Vanderbilt University Medical Center, Google Public Sector The National Institutes of Health’s (NIH) All of Us Research Program aims to enroll at least one million US participants from diverse backgrounds; collect electronic health record (EHR) data, survey data, physical measurements, biospecimens for genomics and other assays, and digital health data; and create a researcher database and tools to enable precision medicine research [1]. Since inception, digital health technologies (DHT) have been envisioned as essential to achieving the goals of the program [2]. A “bring your own device” (BYOD) study for collecting Fitbit data from participants’ devices was developed with integration of additional DHTs planned in the future [3]. Here we describe how participants can consent to share their digital health technology data, how the data are collected, how the data set is parsed, and how researchers can access the data.

2 Krithika Krishnan Graph Representations and Algorithms in Biomedicine Accepted proceedings paper with poster presentation Krishnan Integrated Graph Propagation and Optimization with Biological Applications Krithika Krishnan, Tiange Shi, Han Yu, Rachael Hageman Blair Institute of Artificial Intelligence and Data Science, Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA; Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA; Department of Biostatistics and Bioinformatics, Roswell Park Comprehensive Cancer Center, Buffalo, NY 14263, USA; Institute of Artificial Intelligence and Data Science, Department of Biostatistics, University at Buffalo, Buffalo, NY 14214, USA
Mathematical models that utilize network representations have proven to be valuable tools for investigating biological systems. Often dynamic models are not feasible due to their complex functional forms that rely on unknown rate parameters. Network propagation has been shown to accurately capture the sensitivity of nodes to changes in other nodes without the need for dynamic systems and parameter estimation. Node sensitivity measures rely solely on network structure and encode a sensitivity matrix that serves as a good approximation to the Jacobian matrix. The use of a propagation-based sensitivity matrix as a Jacobian has important implications for network optimization. This work develops Integrated Graph Propagation and OptimizatioN (IGPON), which aims to identify optimal perturbation patterns that can drive networks to desired target states. IGPON embeds propagation into an objective function that aims to minimize the distance between a current observed state and a target state. Optimization is performed using Broyden's method with the propagation-based sensitivity matrix as the Jacobian. IGPON is applied to simulated random networks, DREAM4 in silico networks, and over-represented pathways from STAT6 knockout data and YBX1 knockdown data. Results demonstrate that IGPON is an effective way to optimize directed and undirected networks and is robust to uncertainty in the network structure.
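The numerical core described in this abstract (Broyden's method with a propagation-based sensitivity matrix as the initial Jacobian) can be sketched as follows. This is a minimal illustration under simplifying assumptions — a personalized-PageRank-style sensitivity matrix and a generic state-mismatch function — not the authors' IGPON implementation:

```python
import numpy as np

def propagation_sensitivity(W, alpha=0.5):
    """Propagation-based sensitivity matrix S = (1 - alpha) * (I - alpha*W)^-1.

    Encodes how a perturbation of one node propagates to the others and
    serves as an approximation to the Jacobian of the network state.
    """
    n = W.shape[0]
    return (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * W)

def broyden_to_target(f, x0, J0, tol=1e-8, max_iter=100):
    """Broyden's method: drive f(x) (current state minus target) to zero,
    starting from the propagation-based sensitivity matrix as the Jacobian."""
    x, J = x0.astype(float), J0.copy()
    fx = f(x)
    for _ in range(max_iter):
        if np.linalg.norm(fx) < tol:
            break
        dx = np.linalg.solve(J, -fx)   # Newton-like step with current Jacobian
        x_new = x + dx
        fx_new = f(x_new)
        # Broyden rank-one Jacobian update: no derivative recomputation needed
        J += np.outer(fx_new - fx - J @ dx, dx) / (dx @ dx)
        x, fx = x_new, fx_new
    return x
```

Because the rank-one update avoids recomputing derivatives, each iteration costs only one function evaluation plus a linear solve.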
3 Chang Liu Graph Representations and Algorithms in Biomedicine Accepted proceedings paper with poster presentation Liu Improving target-disease association prediction through a graph neural network with credibility information Chang Liu, Cuinan Yu, Yipin Lei, Kangbo Lyu, Tingzhong Tian, Qianhao Li, Dan Zhao, Fengfeng Zhou, and Jianyang Zeng. Tsinghua University, Jilin University, Tsinghua University, Tsinghua University, Tsinghua University, Silexon AI Technology Co. Ltd., Tsinghua University, Jilin University, Tsinghua University. Identifying effective target-disease associations (TDAs) can alleviate the tremendous cost incurred by clinical failures of drug development. Although many machine learning models have been proposed to predict potential novel TDAs rapidly, their credibility is not guaranteed, thus requiring extensive experimental validation. In addition, it is generally challenging for current models to predict meaningful associations for entities with less information, hence limiting the application potential of these models in guiding future research. Based on recent advances in utilizing graph neural networks to extract features from heterogeneous biological data, we develop CreaTDA, an end-to-end deep learning-based framework that effectively learns latent feature representations of targets and diseases to facilitate TDA prediction. We also propose a novel way of encoding credibility information obtained from literature to enhance the performance of TDA prediction and predict more novel TDAs with real evidence support from previous studies. Compared with state-of-the-art baseline methods, CreaTDA achieves substantially better prediction performance on the whole TDA network and its sparse sub-networks containing the proteins associated with few known diseases. Our results demonstrate that CreaTDA can provide a powerful and helpful tool for identifying novel target-disease associations, thereby facilitating drug discovery.
4 Karl Keat Overcoming health disparities in precision medicine Accepted proceedings paper with poster presentation Keat Leveraging Multi-Ancestry Polygenic Risk Scores for Body Mass Index to Predict Antiretroviral Therapy-Induced Weight Gain Karl Keat, Daniel Hui, Brenda Xiao, Yuki Bradford, Zinhle Cindi, Eric S. Daar, Roy Gulick, Sharon A. Riddler, Phumla Sinxadi, David W. Haas, Marylyn D. Ritchie University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Pennsylvania, University of Cape Town, Lundquist Institute at Harbor-UCLA Medical Center, Weill Cornell Medicine, University of Pittsburgh, University of Cape Town, Vanderbilt University Medical Center and Meharry Medical College, University of Pennsylvania The widespread availability of antiretroviral therapies (ART) for HIV-1 has generated considerable interest in understanding the pharmacogenomics of ART. In some individuals, ART has been associated with excessive weight gain, which disproportionately affects women of African ancestry. The underlying biology of ART-associated weight gain is poorly understood, but some genetic markers that modify weight-gain risk have been suggested, with more genetic factors likely remaining undiscovered. To overcome limitations in available sample sizes for genome-wide association studies (GWAS) in people with HIV, we explored whether a multi-ancestry polygenic risk score (PRS) derived from large, publicly available non-HIV GWAS for body mass index (BMI) can achieve high cross-ancestry performance for predicting baseline BMI in diverse, prospective ART clinical trial datasets, and whether that PRS_BMI is also associated with change in BMI over 48 weeks on ART. We show that PRS_BMI explained ~5-7% of variability in baseline (pre-ART) BMI, with high performance in both European and African genetic ancestry groups, but that PRS_BMI was not associated with change in BMI on ART.
This study argues against a shared genetic predisposition for baseline (pre-ART) BMI and ART-associated weight gain.

Keywords: HIV; AIDS; Polygenic Risk Scores; BMI; Pharmacogenomics.
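At its core, the PRS used in studies like the one above is a weighted sum of an individual's risk-allele counts, with per-SNP weights taken from GWAS effect sizes. A minimal sketch with made-up genotypes and effect sizes (not the study's data or weights):

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """PRS for each individual: weighted sum of risk-allele counts.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.
    weights:   per-SNP effect sizes (e.g., betas from a BMI GWAS).
    """
    return genotypes @ weights

# Illustrative only: 4 individuals, 3 SNPs, made-up effect sizes.
G = np.array([[0, 1, 2],
              [1, 1, 0],
              [2, 0, 1],
              [0, 2, 2]])
beta = np.array([0.05, -0.02, 0.10])
prs = polygenic_score(G, beta)
```

In practice the weights come from clumped or multi-ancestry GWAS summary statistics, and the raw scores are standardized within the target cohort before association testing.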

5 Kristel Van Steen Overcoming health disparities in precision medicine Accepted proceedings paper with poster presentation Chaichoompu Fine-scale subpopulation detection via an SNP-based unsupervised method: A case study on the 1000 Genomes Project resources Kridsadakorn Chaichoompu, Alisa Wilantho, Pongsakorn Wangkumhang, Sissades Tongsima, Bruno Cavadas, Luísa Pereira, Kristel Van Steen National Biobank of Thailand, Universidade do Porto, University of Liege SNP-based information is used in several existing clustering methods to detect shared genetic ancestry or to identify population substructure. Here, we present a methodology, called IPCAPS, for unsupervised population analysis using iterative pruning. Our method, which can capture fine-level structure in populations, supports ordinal data and thus can readily be applied to SNP data. Although haplotypes may be more informative than SNPs, especially in fine-level substructure detection contexts, the haplotype inference process often remains too computationally intensive. In this work, we investigate the scale of the structure we can detect in populations without knowledge about haplotypes; our simulated data do not assume the availability of haplotype information while comparing our method to existing tools for detecting fine-level population substructure. We demonstrate experimentally that IPCAPS can achieve high accuracy and can outperform existing tools in several simulated scenarios. In an application to the 1000 Genomes Project data, the fine-level structure detected by IPCAPS underlines the heterogeneity of its subjects.
6 Saurav Aryal Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Accepted proceedings paper with poster presentation Aryal Acoustic-Linguistic Features for Modeling Neurological Task Score in Alzheimer's Saurav K. Aryal, Howard Prioleau, Legand Burge The average life expectancy is increasing globally due to advancements in medical technology, preventive health care, and a growing emphasis on gerontological health. Therefore, developing technologies that detect and track aging-associated diseases affecting cognitive function among older adult populations is imperative. In particular, research related to the automatic detection and evaluation of Alzheimer's disease (AD) is critical given the disease's prevalence and the cost of current methods. As AD impacts the acoustics of speech and vocabulary, natural language processing and machine learning provide promising techniques for reliably detecting AD. We compare and contrast the performance of ten linear regression models for predicting Mini-Mental Status Exam scores on the ADReSS challenge dataset. We extracted 13,000+ handcrafted and learned features that capture linguistic and acoustic phenomena. Using a subset of 54 top features selected by two methods, (1) recursive elimination and (2) correlation scores, we outperform a state-of-the-art baseline on the same task. Upon scoring and evaluating the statistical significance of each feature in the selected subset for each model, we find that, for the given task, handcrafted linguistic features are more significant than acoustic and learned features.
7 Tiffany Callahan Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Accepted proceedings paper with poster presentation Callahan Knowledge-Driven Mechanistic Enrichment of the Preeclampsia Ignorome Tiffany J. Callahan, Adrianne L. Stefanski, Jin-Dong Kim, William A. Baumgartner Jr., Jordan M. Wyrwa, Lawrence E. Hunter Columbia University, University of Colorado Anschutz Medical Campus, Research Organization of Information and Systems, University of Colorado Anschutz Medical Campus, University of Colorado Anschutz Medical Campus, University of Colorado Anschutz Medical Campus Preeclampsia is a leading cause of maternal and fetal morbidity and mortality. Currently, the only definitive treatment of preeclampsia is delivery of the placenta, which is central to the pathogenesis of the disease. Transcriptional profiling of human placenta from pregnancies complicated by preeclampsia has been extensively performed to identify differentially expressed genes (DEGs). The decisions to investigate DEGs experimentally are biased by many factors, causing many DEGs to remain uninvestigated. A set of DEGs which are associated with a disease experimentally, but which have no known association with the disease in the literature, is known as the ignorome. Preeclampsia has an extensive body of scientific literature, a large pool of DEG data, and only one definitive treatment. Tools facilitating knowledge-based analyses, which are capable of combining disparate data from many sources in order to suggest underlying mechanisms of action, may be a valuable resource to support discovery and improve our understanding of this disease. In this work, we demonstrate how a biomedical knowledge graph (KG) can be used to identify novel preeclampsia molecular mechanisms.
Existing open-source biomedical resources and publicly available high-throughput transcriptional profiling data were used to identify and annotate the function of currently uninvestigated preeclampsia-associated DEGs. Experimentally investigated genes associated with preeclampsia were identified from PubMed abstracts using text-mining methodologies. The relative complement of the text-mined and meta-analysis-derived lists was identified as the uninvestigated preeclampsia-associated DEGs (n=445), i.e., the preeclampsia ignorome. Using the KG to investigate relevant DEGs revealed 53 novel clinically relevant and biologically actionable mechanistic associations.
8 Pengfei Zhang Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Accepted proceedings paper with poster presentation Zhang PiTE: TCR-epitope Binding Affinity Prediction Pipeline using Transformer-based Sequence Encoder Pengfei Zhang, Seojin Bang, Heewook Lee Arizona State University, Arizona State University, Arizona State University Accurate prediction of TCR binding affinity to a target antigen is important for the development of immunotherapy strategies. Recent computational methods were built on various deep neural networks and used the evolution-based BLOSUM distance matrix to embed amino acids of TCR and epitope sequences into numeric values. A pre-trained language model of amino acids is an alternative embedding method in which each amino acid in a peptide is embedded as a continuous numeric vector. Little attention has yet been given to summarizing the amino-acid-wise embedding vectors into sequence-wise representations. In this paper, we propose PiTE, a two-step pipeline for TCR-epitope binding affinity prediction. First, we use an amino-acid embedding model pre-trained on a large number of unlabeled TCR sequences to obtain a real-valued representation from a string representation of amino acid sequences. Second, we train a binding affinity prediction model that consists of two sequence encoders and a stack of linear layers predicting the affinity score of a given TCR and epitope pair. In particular, we explore various types of neural network architectures for the sequence encoders in the two-step binding affinity prediction pipeline. We show that our Transformer-like sequence encoder achieves state-of-the-art performance and significantly outperforms the others, perhaps due to the model's ability to capture contextual information between amino acids in each sequence.
Our work highlights that an advanced sequence encoder on top of a pre-trained representation significantly improves the performance of TCR-epitope binding affinity prediction.
9 Daniel Hui SALUD: Scalable Applications of cLinical risk Utility and preDiction Accepted proceedings paper with poster presentation Hui Quantifying factors that affect polygenic risk score performance across diverse ancestries and age groups for body mass index Daniel Hui 1*, Brenda Xiao 1*, Ozan Dikilitas 2, Robert R. Freimuth 3, Marguerite R. Irvin 4, Gail P. Jarvik 5, Leah Kottyan 6, Iftikhar Kullo 7, Nita A. Limdi 8, Cong Liu 9, Yuan Luo 10, Bahram Namjou 11, Megan J. Puckelwartz 12, Daniel Schaid 13, Hemant Tiwari 14, Wei-Qi Wei 15, Shefali Verma 16, Dokyoon Kim 17, Marylyn D. Ritchie 18**
1 Graduate Program in Genomics and Computational Biology, University of Pennsylvania, Philadelphia, PA, USA
2 Department of Internal Medicine, Department of Cardiovascular Medicine, Clinician-Investigator Training Program, Mayo Clinic, Rochester, MN, USA
3 Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA
4 Department of Epidemiology, University of Alabama at Birmingham, Birmingham, AL, USA
5 Departments of Medicine and Genome Sciences, University of Washington, Seattle, WA, USA
6 Center for Autoimmune Genomics and Etiology, Department of Pediatrics, University of Cincinnati, Cincinnati, OH, USA
7 Division of Cardiovascular Diseases, Mayo Clinic, Rochester, MN 55905, USA
8 Department of Neurology & Epidemiology, University of Alabama at Birmingham, Birmingham, AL, USA
9 Department of Biomedical Informatics, Columbia University, New York, NY, USA
10 Department of Preventive Medicine (Health and Biomedical Informatics), Northwestern University, Chicago, IL, USA
11 Department of Pediatrics, University of Cincinnati, Cincinnati, OH, USA
12 Department of Pharmacology, Northwestern University, Chicago, IL, USA
13 Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
14 Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
15 Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
16 Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
17 Department of Biostatistics, Epidemiology and Informatics, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
18 Department of Genetics, Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Polygenic risk scores (PRS) have led to enthusiasm for precision medicine. However, it is well documented that PRS do not generalize across groups differing in ancestry or sample characteristics, e.g., age. Quantifying the performance of PRS across different groups of study participants, using genome-wide association study (GWAS) summary statistics from multiple ancestry groups and sample sizes, and using different linkage disequilibrium (LD) reference panels may clarify which factors are limiting PRS transferability. To evaluate these factors in the PRS generation process, we generated body mass index (BMI) PRS in the Electronic Medical Records and Genomics (eMERGE) network (N=75,661). Analyses were conducted in two ancestry groups (European and African) and three age ranges (adults, teenagers, and children). For PRS calculations, we evaluated five LD reference panels and three sets of GWAS summary statistics of varying sample size and ancestry. PRS performance increased for both African and European ancestry individuals using cross-ancestry GWAS summary statistics compared to European-only summary statistics (6.3% and 3.7% relative R^2 increase, respectively; p_African=0.038, p_European=6.26x10^-4). The effects of LD reference panels were more pronounced in African ancestry study datasets. PRS performance degraded in children; R^2 was less than half that of teenagers or adults. The effect of GWAS summary statistic sample size was small when modeled with the other factors. Additionally, the potential of using a PRS generated for one trait to predict risk for comorbid diseases is not well understood, especially in the context of cross-ancestry analyses. We therefore explored clinical comorbidities from the electronic health record associated with the PRS and identified significant associations with type 2 diabetes and coronary atherosclerosis.
In summary, this study quantifies the effects that ancestry, GWAS summary statistic sample size, and LD reference panel have on PRS performance, especially in cross-ancestry and age-specific analyses.
10 Ji Ae Park Towards Ethical Biomedical Informatics Accepted proceedings paper with poster presentation Park VdistCox: Vertically distributed Cox proportional hazards model with hyperparameter optimization Ji Ae Park, Yu Rang Park Yonsei University College of Medicine Vertically partitioned data are distributed data in which information about a patient is spread across multiple sites. In this study, we propose a novel algorithm (referred to as VdistCox) for the Cox proportional hazards model (Cox model), a widely used survival model, in a vertically distributed setting without data sharing. Using a single-hidden-layer feedforward neural network trained via an extreme learning machine, VdistCox builds an efficient vertically distributed Cox model. VdistCox can tune hyperparameters, including the number of hidden nodes, activation function, and regularization parameter, with a single communication between the master site (the site set to act as the server in this study) and the other sites. In addition, we explored the randomness of hidden-layer input weights and biases by generating multiple random weights and biases. The experimental results indicate that VdistCox is an efficient distributed Cox model that reflects the characteristics of truly centralized vertically partitioned data and enables hyperparameter tuning without sharing patient-level information or requiring additional communication between sites.
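The extreme-learning-machine idea the abstract above relies on (random, untrained hidden-layer weights; only the output layer fit in closed form) can be sketched on a toy regression problem. This is an illustrative, centralized sketch, not the VdistCox algorithm itself, which fits a Cox model across vertically partitioned sites:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_features(X, n_hidden=50, activation=np.tanh):
    """Hidden layer of an extreme learning machine: input weights and biases
    are drawn at random and never trained; only the output layer is fit."""
    n_features = X.shape[1]
    # Scale weights so pre-activations stay in tanh's responsive range.
    W = rng.normal(size=(n_features, n_hidden)) / np.sqrt(n_features)
    b = rng.normal(size=n_hidden)
    return activation(X @ W + b)

def fit_output_layer(H, y, reg=1e-2):
    """Closed-form ridge solve for the output weights (the only trained part)."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)

# Toy regression to illustrate the mechanics (not survival data).
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)
H = elm_features(X)
beta = fit_output_layer(H, y)
pred = H @ beta
```

Because the hidden layer is fixed, the only fitted quantities are the output weights, which is what makes a single round of communication sufficient for hyperparameter tuning in the distributed setting.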
11 Karthik Soman Graph Representations and Algorithms in Biomedicine Accepted proceedings paper with oral presentation Soman Time-aware Embeddings of Clinical Data using a Knowledge Graph Karthik Soman, Charlotte A. Nelson, Gabriel Cerono, Sergio E. Baranzini University of California San Francisco Meaningful representation of clinical data using embedding vectors is a pivotal step in invoking any machine learning (ML) algorithm for data inference. In this article, we propose a time-aware approach for embedding electronic health records onto a biomedical knowledge graph to create machine-readable patient representations. This approach not only captures the temporal dynamics of patient clinical trajectories but also enriches them with additional biological information from the knowledge graph. To gauge the predictivity of this approach, we propose an ML pipeline called TANDEM (Temporal and Non-temporal Dynamics Embedded Model) and apply it to the early detection of Parkinson's disease. TANDEM achieves a classification AUC score of 0.85 on an unseen test dataset. These predictions are further explained by providing biological insight using the knowledge graph. Taken together, we show that temporal embeddings of clinical data can be a meaningful predictive representation for downstream ML pipelines in clinical decision-making.
12 Oluwatosin Oluwadare Graph Representations and Algorithms in Biomedicine Poster only Hovenga 3D Chromosome Structure Reconstruction Using Graph Convolutional Neural Networks Van Hovenga, Jugal Kalita, Oluwatosin Oluwadare University of Colorado Colorado Springs Chromosome conformation capture (3C) is a method of measuring chromosome topology in terms of loci interaction. The Hi-C method is a derivative of 3C that allows for genome-wide quantification of chromosome interaction. From such interaction data, it is possible to infer the three-dimensional (3D) structure of the underlying chromosome. In this paper, we developed a novel method, HiC-GNN, for predicting the three-dimensional structures of chromosomes from Hi-C data. HiC-GNN is unique among methods for chromosome structure prediction in that the models learned by HiC-GNN can be generalized to unseen data. To the authors' knowledge, this generalizing capability is not present in any existing method. HiC-GNN uses a node embedding algorithm and a graph neural network to predict the 3D coordinates of each genomic locus from the corresponding Hi-C contact data. Unlike other chromosome structure prediction methods, our method allows for the storage of pre-trained parameters, which enables prediction on unseen data. We show that these predictions on unseen data are accurate, thereby showing that our method is generalizable. We show that our method can generalize a single model across Hi-C resolutions, multiple restriction enzymes, and multiple cell populations while maintaining reconstruction accuracy across three Hi-C datasets. Our algorithm outperforms the state-of-the-art methods in accuracy of prediction and runtime, and it introduces a novel method for 3D structure prediction from Hi-C data. All our source code and data are available at , and the method is made available as a containerized application that can be run on any platform.
13 John Van Horn Graph Representations and Algorithms in Biomedicine Poster only Newman Extracellular water characterizes sex-specific differences in Autism Spectrum Disorder Benjamin T. Newman, Zachary Jacokes, John Van Horn University of Virginia Autism spectrum disorder (ASD) is a complex, multifaceted disorder with a significant sex bias in both diagnosis rate and behavioral symptom profile. Previous studies have found structural differences in neuronal white matter (WM), with ASD being associated with lower fractional anisotropy (FA) and higher diffusivity. Studies on sex differences have suggested that ASD males and females may lack the sex-specific differences seen in typically developing individuals. This study aims to characterize sex-specific differences in ASD using advanced models of brain microstructure, including a metric of extracellular water.
14 Sagi Shaier Graph Representations and Algorithms in Biomedicine Poster only Shaier Extreme Multi-hop Question Answering on a Massive Biomedical Knowledge Graph Sagi Shaier, Kevin Bennett, Lawrence Hunter, Katharina Kann CU Boulder Biomedical questions are often complex and require multiple facts to answer. However, existing multi-hop question answering (MHQA) datasets focus on much simpler questions in comparison, which are limited to roughly three facts. Additionally, such datasets use generic knowledge graphs (KGs), such as ConceptNet or DBpedia. These limitations in turn hinder the development of much-needed biomedical QA systems that can handle the realistically complex questions that people ask. Here, we present an extreme MHQA dataset that contains questions with up to 15 hops on a massive biomedical KG (120M triples). We create such questions using 60 unique graph structures, followed by sub-graph matching over the KG to find matching paths for each structure. These paths were then sent to medical experts for filtering and question creation. Our questions are unique, biomedically relevant, and require models to reason over long, diverse, and complex chains of knowledge.
15 Hoyin Chu Overcoming health disparities in precision medicine Accepted proceedings paper with oral presentation Chu Using Association Rules to Understand the Risk of Adverse Pregnancy Outcomes in a Diverse Population Hoyin Chu, Rashika Ramola, Shantanu Jain, David M. Haas, Sriraam Natarajan, Predrag Radivojac Northeastern University Racial and ethnic disparities in adverse pregnancy outcomes (APOs) have been well documented in the United States, but the extent to which the disparities are present in high-risk subgroups has not been studied. To address this problem, we first applied association rule mining to the clinical data derived from the prospective nuMoM2b study cohort to identify subgroups at increased risk of developing four APOs (gestational diabetes, hypertension acquired during pregnancy, preeclampsia, and preterm birth). We then quantified racial/ethnic disparities within the cohort as well as within high-risk subgroups to assess potential effects of risk-reduction strategies. We identify significant differences in the distributions of major risk factors across racial/ethnic groups and find surprising heterogeneity in APO prevalence across these populations, both in the cohort and in its high-risk subgroups. Our results suggest that risk-reducing strategies that simultaneously reduce disparities may require targeting of high-risk subgroups with consideration for the population context.
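Association rule mining of the kind applied above scores candidate rules (risk factors implying an outcome) by support, confidence, and lift. A minimal sketch with hypothetical risk-factor/outcome records, not the nuMoM2b data:

```python
def rule_stats(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent.

    transactions: list of sets of items (e.g., risk factors and outcomes).
    """
    n = len(transactions)
    a = sum(antecedent <= t for t in transactions)            # antecedent count
    both = sum((antecedent | consequent) <= t for t in transactions)
    c = sum(consequent <= t for t in transactions)            # consequent count
    support = both / n
    confidence = both / a if a else 0.0
    lift = confidence / (c / n) if c else 0.0
    return support, confidence, lift

# Toy clinical-style records (hypothetical feature names, illustrative only).
records = [
    {"obesity", "age>35", "gdm"},
    {"obesity", "gdm"},
    {"age>35"},
    {"obesity", "age>35", "gdm"},
    {"age>35", "preterm"},
]
s, conf, lift = rule_stats(records, {"obesity"}, {"gdm"})
```

Rules whose confidence substantially exceeds the baseline outcome prevalence (lift > 1) flag candidate high-risk subgroups, within which disparities can then be quantified.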
16 Emily Bartusiak Overcoming health disparities in precision medicine Poster only Bartusiak Machine learning approaches for diverse and admixed populations Emily Bartusiak, Miriam Barrabes i Torrella, Margarita Geleta, Maria Perera Baro, Benet Oriol Sabat, Richa Rastogi, Daniel Mas Montserrat, Xavi Giro-i-Nieto, Alexander Ioannidis Recent advances in artificial intelligence, particularly in the fields of machine learning and deep learning, have significantly improved the accuracy and efficiency of population genetics techniques. In this work, we examine the application of these techniques to the characterization of population structure and phenotypic information from genotype data in both humans and canids. We demonstrate that neural networks are highly effective for providing high-resolution estimates of ancestry or breed composition, as well as for predicting phenotypic labels. These methods not only offer improved accuracy but also provide a significant speed increase compared to traditional techniques, allowing for efficient processing of large biobank datasets. Additionally, we discuss the potential use of neural networks to simulate genomic sequences and the vulnerability of such systems to adversarial attacks.
17 Francisco De La Vega Overcoming health disparities in precision medicine Poster only De La Vega Genetic ancestry correlates of the somatic mutational landscape from tumor profiling data of 100,000 cancer patients Francisco M. De La Vega, Brooke Rhead, Yannick Pouliot, Justin Guinney Tempus Labs Inc. The incidence and mortality of cancer vary widely across race and ethnicity. This is attributed to an interplay of socioeconomic factors, environmental exposures, and genetic background. Cancer genomic studies have underrepresented minorities and individuals of non-European descent, thus limiting a comprehensive understanding of disparities in the diagnosis, prognosis, and treatment of cancer among these populations. Furthermore, the social constructs of race and ethnicity are far from precise categories for understanding the biological underpinnings of such differences. In this study, we use a large real-world data (RWD) patient cohort to examine associations of genetic ancestry with somatic alterations in cancer driver genes. We inferred genetic ancestry from approximately 100,000 de-identified records from cancer patients of diverse histology who underwent tumor genomic profiling with the Tempus xT 648-gene next-generation sequencing (NGS) assay. We used 654 ancestry-informative markers selected to overlap the capture regions of the assay to infer global ancestry proportions at the continental level: Africa (AFR), America (AMR), Europe (EUR), East Asia (EAS), and South Asia (SAS). Logistic regression was used to directly test for associations between continental ancestry proportions and the presence of nonsynonymous somatic mutations in cancer genes. We identified 7 significant associations of non-European ancestries with small somatic mutations and 15 with copy number alterations (CNAs) (per 20% increase in each ancestry proportion; all p<0.0001).
Among others, we found associations of small somatic mutations in CTNNB1 with EAS ancestry (OR=1.44); of EGFR with EAS (OR=1.49) and AMR (OR=1.78) ancestries in lung cancer; and of ASXL1 with AMR ancestry in brain cancer (OR=2.48). Furthermore, we identified several associations between ancestry and CNAs, including MTAP with AMR (OR=1.45) and EGFR with SAS (OR=1.46) in lung cancer. Our results support the use of genetic ancestry inference on RWD to understand the impact of ancestry on cancer incidence, progression, and outcomes.
18 Charleston Chiang Overcoming health disparities in precision medicine Poster only Lo Transferability of polygenic scores for anthropometric traits to the Native Hawaiian population Ying-Chu Lo, Soyoung Jeon, Lynne R. Wilkens, Loic Le Marchand, Christopher A. Haiman, Charleston W.K. Chiang University of Southern California Polygenic scores (PS) are promising for stratifying individuals based on their genetic susceptibility to complex diseases or traits. However, while PS models trained on European-centric GWAS have been shown to transfer poorly to other ethnic minority populations, their transferability has not been evaluated for Native Hawaiians. Native Hawaiians are the second fastest-growing ethnic minority in the US, currently making up 0.5% of the US census. They are largely admixed, with ancestry components from both Europeans and East Asians, among others, which may alleviate some of the transferability issues observed in other ethnic populations. Using height and BMI as examples of highly polygenic traits, we evaluated the transferability of PS to Native Hawaiians. We trained genome-wide PS models based on the largest available GWAS for each trait from European-ancestry and East Asian-ancestry cohorts using the pruning-and-thresholding approach. We then evaluated each model in an out-of-sample cohort of Japanese, White, and Native Hawaiian individuals from the Multiethnic Cohort (MEC). For PS models trained with European-ancestry GWAS (the GIANT consortium and UK Biobank), the models performed better in out-of-sample MEC non-Latino white individuals (partial r2 = 0.25 and 0.057 for height and BMI, respectively) than in Native Hawaiians (partial r2 = 0.14 and 0.032) or Japanese (partial r2 = 0.13 and 0.021). Similarly, the PS model trained with an East Asian-ancestry GWAS (Biobank Japan) performed better in MEC Japanese individuals (partial r2 = 0.16 and 0.039 for height and BMI, respectively) than in Native Hawaiians (partial r2 = 0.036 and 0.012).
Our results thus confirm some loss of information when transferring PS models to Native Hawaiians for highly polygenic traits. Increasing the representation of Native Hawaiians in genetic studies, better modeling of the differences in ancestry or linkage disequilibrium patterns, and further incorporation of non-genetic risk factors will help improve risk predictions.
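The partial r2 metric used above measures the variance a polygenic score explains after covariates are regressed out, i.e. the reduction in residual sum of squares when the score is added to a covariate-only model. A minimal sketch on synthetic data (the single covariate and the effect sizes are illustrative assumptions, not the MEC analysis):

```python
import numpy as np

def partial_r2(y, score, covars):
    """Partial r2 of a polygenic score after adjusting for covariates:
    (SSE_covariates_only - SSE_covariates_plus_score) / SSE_covariates_only."""
    X0 = np.column_stack([np.ones(len(y)), covars])   # covariates only
    X1 = np.column_stack([X0, score])                 # covariates + PS
    sse0 = np.sum((y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]) ** 2)
    sse1 = np.sum((y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]) ** 2)
    return (sse0 - sse1) / sse0

rng = np.random.default_rng(1)
n = 2000
age = rng.normal(50, 10, n)                      # hypothetical covariate
ps = rng.normal(0, 1, n)                         # hypothetical polygenic score
height = 0.05 * age + 0.5 * ps + rng.normal(0, 1, n)  # synthetic phenotype
pr2 = partial_r2(height, ps, age.reshape(-1, 1))
```

By this simulation's construction the score explains about 20% of the covariate-adjusted phenotypic variance.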
19 Mark Penjueli Overcoming health disparities in precision medicine Poster only Penjueli Effects of Admixture and Selection in Hawaiians Mark Penjueli, Javier Blanco-Portillo, Jan Sokol, Obed Garcia, Feiyang Liu, Carmina Barberena Jonas, Keolu Fox, Marcus Feldman, Aashish Jha, Alexander Ioannidis New York University Selection, admixture, and founder effects have shaped the genomes of modern Hawaiians. Here, we use ancestry-specific approaches to interrogate the genomic variation of 3,944 modern Hawaiians and inhabitants of other Polynesian islands, uncovering related Polynesian populations and differential signals of selection. We also characterize Hawaiian admixture patterns, highlighting a very early Native American component in Hawaii, and reconstruct the origin of the Native Hawaiians and their population bottleneck history. Our results point the way to employing machine learning tools to characterize one of the most complex and diverse populations in the United States.
20 Jordan Rossen Overcoming health disparities in precision medicine Poster only Rossen Multi-population fine-mapping under the sum of single effects model Jordan Rossen, Huwenbo Shi, Masahiro Kanai, Zachary R. McCaw, Liming Liang, Omer Weissbrod, Alkes L. Price  Harvard School of Public Health Statistical fine-mapping aims to identify biologically causal genetic variants at disease-associated loci. The sum of single effects (SuSiE) model has provided a powerful and versatile approach for fine-mapping causal variants in a single population (Wang et al. 2020 J. R. Stat. Soc., B: Stat.) and accommodating functional priors (Weissbrod et al. 2020 Nat. Genet.). Incorporating data from multiple populations can greatly improve fine-mapping due to differences in linkage disequilibrium (LD) patterns across populations (Schaid et al. 2018 Nat. Rev. Genet.), strongly motivating multi-population methods.

We propose MultiSuSiE, an extension of SuSiE to multiple populations. SuSiE sums across multiple single-effect models (each involving a single causal variant), fitting and residualizing phenotypes for each single-effect model in turn. In MultiSuSiE, each single-effect model still assumes a single causal variant, but effect sizes are allowed to vary across populations via a multivariate normal prior informed by cross-population genetic correlations. MultiSuSiE likewise fits and residualizes phenotypes to estimate population-specific effect sizes of each single-effect model in turn. Like SuSiE, MultiSuSiE accepts either individual-level genotype/phenotype data or summary association statistics and reference LD (Zou et al. 2022 PLoS Genet.). We additionally adapt the PolyFun framework (Weissbrod et al. 2020 Nat. Genet.) to multiple populations, integrating genome-wide functional annotations to construct functional priors. 

We evaluated MultiSuSiE using simulations involving real genotypes from 100,000 British and 7,000 African individuals from UK Biobank. MultiSuSiE identified 25% more true causal variants with posterior causal probability >0.5 compared to SuSiE applied to 100,000 British individuals, while maintaining approximately correct calibration, analogous to SuSiE; MultiSuSiE attained similar improvements relative to existing methods for multi-population fine-mapping. We will present results from applying MultiSuSiE to 16 diseases and complex traits from the UK Biobank while incorporating functional priors.
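The fit-and-residualize scheme that MultiSuSiE inherits from SuSiE can be illustrated in a deliberately simplified, single-population form on individual-level data: each of L single-effect models assumes exactly one causal variant, and the phenotype is residualized for all other effects in turn. The priors, variable names, and data below are illustrative; the multivariate cross-population prior that defines MultiSuSiE is omitted.

```python
import numpy as np

def single_effect_regression(X, r, sigma2=1.0, prior_var=0.2):
    """Bayesian regression assuming exactly one causal variant: returns
    posterior inclusion probabilities (alpha) and posterior mean effects."""
    xtx = (X ** 2).sum(axis=0)
    betahat = X.T @ r / xtx                      # per-variant OLS estimates
    s2 = sigma2 / xtx                            # their sampling variances
    post_var = 1.0 / (1.0 / prior_var + 1.0 / s2)
    post_mean = post_var * betahat / s2
    # log Bayes factor of each variant against the null
    lbf = 0.5 * np.log(post_var / prior_var) + 0.5 * post_mean ** 2 / post_var
    alpha = np.exp(lbf - lbf.max())
    alpha /= alpha.sum()
    return alpha, alpha * post_mean

def susie(X, y, L=5, iters=30):
    """Sum of single effects: fit L single-effect models, residualizing
    the phenotype for each effect in turn."""
    b = np.zeros((L, X.shape[1]))
    alphas = np.zeros_like(b)
    for _ in range(iters):
        for l in range(L):
            r = y - X @ (b.sum(axis=0) - b[l])   # remove all other effects
            alphas[l], b[l] = single_effect_regression(X, r)
    return alphas

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = X[:, 3] - X[:, 17] + rng.normal(size=500)    # two simulated causal variants
alphas = susie(X, y)
top = {int(a.argmax()) for a in alphas}          # top variant per effect
```

MultiSuSiE replaces the scalar effect in each single-effect model with a population-indexed vector under a multivariate normal prior, but the outer residualization loop has the same shape.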

21 Yile Chen Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Accepted proceedings paper with oral presentation Chen, Jain Multi-Objective Prioritisation of Genes for High-Throughput Functional Assays Towards Improved Clinical Variant Classification Yile Chen, Shantanu Jain, Daniel Zeiberg, Lilia Iakoucheva, Sean D. Mooney, Predrag Radivojac, Vikas Pejaver

University of Washington, Northeastern University Providing accurate interpretation for variants is essential for clinical actionability. However, a majority of variants remain of uncertain significance. Multiplexed assays of variant effects (MAVEs) can help to provide functional evidence for variants of uncertain significance (VUS) at the scale of entire genes. Although the systematic prioritisation of genes has been of great interest from the clinical perspective, existing strategies have rarely emphasised this motivation. Here, we propose three objectives for quantifying the importance of genes, each satisfying a specific clinical goal: (1) Movability scores to prioritise genes with the most VUS moving to non-VUS categories, (2) Correction scores to prioritise genes with the most pathogenic and/or benign variants that could be reclassified, and (3) Uncertainty scores to prioritise genes with VUS for which variant pathogenicity predictors used in clinical classification exhibit the greatest uncertainty. We demonstrate that existing strategies for gene prioritisation are sub-optimal when considering these explicit clinical objectives. For instance, current knowledge-driven strategies, reflected by the genes that have already been assayed using MAVEs, performed only slightly better than a random selection of genes for most of these objectives. Furthermore, the selection of genes based on the number of publications associated with them or the number of variants found in them yielded score distributions similar to the assayed set of genes. We also propose a combined weighted score that optimises the above three objectives simultaneously and demonstrate that, despite a drop in performance compared to individual objective optimisation, this combined score still fares better than all existing strategies.
Finally, we demonstrate that ranking genes based on the combined score leads to clinically relevant prioritisation through functional and phenotypic enrichment analyses. Our work has broad implications for systematic efforts that iterate between predictor development, experimentation, and translation to the clinic.

22 M. Clara De Paolis Kaluza Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Accepted proceedings paper with oral presentation De Paolis Kaluza An Approach to Identifying and Quantifying Bias in Biomedical Data M. Clara De Paolis Kaluza, Shantanu Jain, Predrag Radivojac Northeastern University Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data are suspected, the assumption that the labeled data are representative of the population must be relaxed, and methods that exploit typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data are biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective.
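A much-simplified, one-dimensional illustration of the core idea — shared mixture components represented at different proportions in labeled versus unlabeled data, with bias quantified as a distance between the implied distributions — might look like the following. The paper's joint multi-sample EM is replaced here by fitting a Gaussian mixture to the unlabeled data only; components, proportions, and the distance measure are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

def comp(mu, n):
    """Draw n samples from one shared Gaussian component."""
    return rng.normal(mu, 1.0, size=(n, 1))

# two shared components; the labeled sample over-represents component 0 (the bias)
unlabeled = np.vstack([comp(-2, 500), comp(2, 500)])   # 50/50 in the population
labeled = np.vstack([comp(-2, 450), comp(2, 50)])      # 90/10 in the labeled set

gmm = GaussianMixture(2, random_state=0).fit(unlabeled)
# average component responsibilities = estimated mixing proportions in each sample
w_unlabeled = gmm.predict_proba(unlabeled).mean(axis=0)
w_labeled = gmm.predict_proba(labeled).mean(axis=0)
# total variation distance between the two implied mixing distributions
bias = 0.5 * np.abs(w_unlabeled - w_labeled).sum()
```

With the 50/50 vs. 90/10 proportions above, the estimated bias comes out near 0.4, recovering the planted shift in component proportions.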
23 Cigdem Ak Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Cigdem Ak EpiConfig: Multimodal topic modeling for single-cell multiomics Cigdem Ak, Nicole Szczepanski, Aaron Doe, Alex Chitsazan, Andrew Nishida, Yahong Wen, Hisham Mohammed, Galip Gurkan Yardımcı Oregon Health & Science University Recently emerging single-cell (sc) multiomics assays can measure transcriptomic, genomic, epigenomic, and proteomic modalities of a single cell jointly, including the scNMT-seq and s3-GCC assays, technologies that were developed by researchers at OHSU. Sc-multiomics assays offer an important opportunity to characterize cell populations more deeply and accurately than single-omic single-cell technologies, since they provide a more complete picture of the cell state based on multiple modalities. Additionally, joint profiling has the potential to unravel regulatory links between measured modalities, such as links between the epigenome and the transcriptome at single-cell resolution. However, while a multitude of computational methods to analyze single-cell data have already been developed, there is a dearth of interpretable methods for analyzing sc-multiomics datasets. Recently published algorithms to analyze sc-multiomics datasets include matrix factorization and deep learning methodologies such as MOFA+ and scVI, which make it difficult or impossible to interpret the biological features that differentiate cell types and states. We have developed EpiConfig, a topic modeling method for unsupervised clustering and characterization of single cells; this model can accurately cluster cells based on cell types/states and can yield biological insights by interpreting transcriptomic and epigenetic states. We benchmarked EpiConfig on publicly available SHARE-seq and 10x Multiome (scRNA+ATAC) datasets applied to heterogeneous human and mouse cell populations.
EpiConfig outperforms single-omic single-cell clustering methods such as Seurat and cisTopic, and allows for interpretation of the transcriptomic and epigenomic features that define cell types/states. Currently, we are applying EpiConfig to various novel cancer sc-multiomics datasets and anticipate that interpretable modeling of sc-multiomics data from heterogeneous cancer cell populations will enable identification of novel cell states and configurations that can be utilized for early cancer studies.
24 Karissa Dieseldorff Jones Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Dieseldorff Jones MicroArray-based Methylation2Activity: Advancing biological knowledge from DNA methylation profiles of patient tumors Karissa M. Dieseldorff Jones, Waise Quarni, Daniel Putnam, Shivendra Singh, Qiong Wu, Jun Yang, and Xiang Chen

St. Jude Children's Research Hospital DNA methylation (DNAm) is a relatively stable regulatory epigenetic mechanism. Although accurate DNAm classifiers have been built for early detection of cancer and subtype classification, its transcriptional regulatory roles remain unclear. To address this challenge, we developed a deep-learning framework, MethylationToActivity (M2A), that accurately predicts individual promoter activity from WGBS methylomes. As array-based methylomes are the most common epigenetic data type from patients' tumor samples, we redesigned M2A's raw features and model topology for array methylomes, which yields accurate prediction of promoter activity (R2 = 0.74), approaching its WGBS counterpart (R2 = 0.79). Transfer learning improves the prediction accuracy to 0.77. In a primary rhabdomyosarcoma cohort, M2A prioritized candidate genes with strong subtype-specific alternative promoter usage (APU) and identified distinct isoforms of a subgroup of APU genes, including NAV2, expressed in the two major subtypes (aRMS and eRMS). Molecular experiments revealed that PAX3-FOXO1 directly binds to the aRMS-specific promoter and is required for its active transcription. Genetic deletion of PAX3-FOXO1 leads to NAV2 APU. Overexpression of the aRMS-specific NAV2 isoform significantly promoted cell proliferation, supporting its oncogenic function. These data indicate that PAX3-FOXO1 is critical for APU in aRMS. We conclude that the new M2A provides valuable insight into the downstream functional interpretation of differential DNAm patterns.
25 Juan Inda-Díaz Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Inda-Diaz Antibiotic susceptibility prediction using transformers Juan S. Inda-Díaz

Anna Johnning

Anders Sjöberg

Anna Lokrantz

Mats Jirstrand

Magnus Hessel

Lisa Helldal

Lennart Svensson

Erik Kristiansson
Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg

Antimicrobial resistance is a rapidly growing challenge for the healthcare sector, where multi-resistant pathogens have compromised our ability to treat bacterial infections. Correct antibiotic treatment depends on the susceptibility of the bacteria, which is typically estimated using cultivation-based methods that are slow and costly. If the diagnostic information is incomplete, physicians may be required to resort to empirical treatment, which has a considerable chance of failing or being unnecessarily broad. There is, consequently, a growing need for novel diagnostic solutions that enable the correct antibiotic treatment to be administered as fast as possible.

We present a deep learning model based on transformers that predicts antibiotic susceptibility from partial diagnostic information and patient data. We train the model to predict the susceptibility of over 400,000 Escherichia coli isolates to sixteen antibiotics based on blood samples collected in 30 European countries.

The model can predict susceptibility with a major error rate below 5% on average for quinolones and cephalosporins, and below 15% for penicillins and aminoglycosides. Furthermore, the model predicts resistance to cephalosporins, quinolones, and aminoglycosides with an average very major error rate between 10% and 20%, and 24% for penicillins. We implemented an uncertainty control algorithm based on inductive conformal prediction that allowed us to fix the empirical error rate for both susceptible and resistant isolates to a pre-specified level. The uncertainty control produced prediction region sizes that correlated positively with the chosen confidence level.
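Inductive (split) conformal prediction, the uncertainty-control technique named above, calibrates a threshold on held-out nonconformity scores so that prediction sets cover the true label at a chosen rate, whatever the underlying model. A generic sketch for a binary classifier (the simulated probabilities stand in for any base model's outputs, not the paper's transformer):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction for classification: for each test sample,
    return the set of labels that contains the truth with marginal
    probability >= 1 - alpha, regardless of the base model."""
    # nonconformity score = 1 - predicted probability of the true class
    scores = 1 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    q = np.sort(scores)[k]                               # calibrated threshold
    return [set(np.flatnonzero(1 - p <= q)) for p in test_probs]

rng = np.random.default_rng(4)
def simulate(n):
    """Toy classifier whose true class always gets probability >= 0.5."""
    y = rng.integers(0, 2, n)
    p1 = np.clip(0.5 + (y - 0.5) * rng.uniform(0, 1, n), 0.01, 0.99)
    return np.column_stack([1 - p1, p1]), y

cal_probs, cal_y = simulate(1000)
test_probs, test_y = simulate(1000)
sets = conformal_sets(cal_probs, cal_y, test_probs)
coverage = np.mean([y in s for s, y in zip(sets, test_y)])
```

Raising the confidence level (lowering alpha) loosens the threshold and enlarges the prediction sets, which is the size-confidence trade-off the abstract reports.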

We conclude that the diagnostics chain of infectious diseases can be improved by the implementation of artificial intelligence methods for the prediction of the susceptibility of bacterial isolates to antibiotics. This has the potential to reduce the morbidity and mortality of patients and to reduce the large economic costs associated with antibiotic-resistant bacteria.
26 Seung Mi Lee Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Jung A machine learning-based prediction model for adverse outcomes in women with preeclampsia Young Mi Jung, Seohyun Choi, Ji Hye Bae, Jeesun Lee, Sun Min Kim, Byung Jae Kim,

Dong Wook Kwak, Chan-Wook Park, Joong Shin Park, Jong Kwan Jun, Sungyoung Lee,  Seung Mi Lee

Seoul National University College of Medicine Background: Preeclampsia is a major obstetrical disease that can increase maternal and fetal morbidity and mortality. Clinical guidelines have suggested several criteria for differentiating ‘severe’ from ‘non-severe’ preeclampsia to determine the need for indicated preterm delivery. However, these traditional criteria can be misleading because ‘non-severe’ preeclampsia based on these criteria may rapidly progress to ‘severe’ disease in clinical practice. In the current study, we developed a machine learning-based prediction model to accurately predict the need for early preterm birth (<34 weeks).

Methods: This prospective cohort study included singleton pregnant women who received antenatal care at Seoul National University Hospital. The initial dataset consisted of 315 patients with 47 clinical variables at the time of admission. The primary outcome was delivery before 34 weeks of gestation. The prediction model was constructed using all applicable predictors and an ensemble approach, which aggregates eight machine learning (ML) models. The ensemble prediction model was evaluated using both augmented external validation and independent external validation.
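An ensemble that aggregates several base classifiers can be sketched with scikit-learn's soft-voting wrapper, which averages predicted probabilities across models. The base models, synthetic cohort, and feature counts below are placeholders, not the study's eight-model ensemble or its clinical data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in: 315 patients, 47 admission variables, imbalanced outcome
X, y = make_classification(n_samples=315, n_features=47, weights=[0.8], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

ensemble = VotingClassifier(
    [("lr", LogisticRegression(max_iter=1000)),
     ("rf", RandomForestClassifier(random_state=0)),
     ("ab", AdaBoostClassifier(random_state=0))],
    voting="soft")   # average class probabilities across the base models
ensemble.fit(Xtr, ytr)
auc = roc_auc_score(yte, ensemble.predict_proba(Xte)[:, 1])
```

Discrimination is then summarized by the AUROC on held-out data, mirroring how the abstract reports model performance.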

Results: Using ensemble ML modelling, we developed a model predicting delivery before 34 weeks of gestation from clinical variables at admission. The prediction model showed good discrimination (area under the ROC curve (AUROC) = 0.86 (95% CI 0.75-0.98)) with a sensitivity of 81% and a specificity of 83%, whereas the traditional criteria (severe vs non-severe PE) showed a sensitivity of 84% and a specificity of 48% (p = 0.72 for sensitivity and p < 0.0001 for specificity). Among the clinical variables included in the prediction model, alkaline phosphatase, systolic blood pressure, and fibrinogen were the top three influential variables. External validation using Boramae Medical Center data also showed good discrimination (AUROC 0.78, 95% CI 0.71-0.84).

Conclusions: The combined use of maternal factors and common antenatal laboratory data could effectively predict adverse pregnancy outcomes using machine learning algorithms.
27 Peter Lindholm Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Lindholm An echo from the past; building a Doppler repository for big data in diving research Peter Lindholm, S. Lesley Blogg, Arian Azarang, David Le, Rachel Lance, Frauke Tillmans, Kaighley Brett, Fethi Bouak, Emanuel Dugrenout, Bernard Gardette, Koshiar Medson, Richard Moon, and Virginie Papadopoulou  University of California San Diego Background: Decompression sickness (DCS) remains a major concern in diving, high-altitude flight, and extravehicular activity in space. Venous gas emboli (VGE) detected with ultrasound post-dive are often used as a marker of decompression stress. Ultrasound (Doppler and echocardiography) is suited for regular monitoring and could help elucidate inter- and intra-subject variability in DCS susceptibility. Analyzing these recordings is cumbersome, pointing to a need for an automated process enabling “in-suit” monitoring.

Materials and methods: The development of machine learning algorithms requires well-curated datasets for training, testing and validation. The field of diving medicine has amassed Doppler recordings from past studies that could be congregated and annotated. In line with other ‘big data’ projects, we are collecting de-identified Doppler data for publication in an accessible repository.

Results: Data amounting to boxes of analog tapes and, to date, 6,000 digitized files, comprising around 31,000 potential clips with their associated metadata (describing the dive profile, gas, diver demographics, etc.), are available. Currently, the repository contains 1,177 clips from 20 different profiles and 42 individuals, with dive depths from 80 ft (24 m) to 250 ft (76 m) and breathing gases including air and heliox. Doppler Kisman-Masurel grades of the dives within the repository span the full range of the grading system, from 0 to Grade IV. Our team has already leveraged subsets of these data by automating heart rate estimation using autocorrelation, classifying Doppler VGE grades using deep learning with performance on par with humans, and developing a graphical user interface that leverages automatic speech recognition to accelerate the cutting of digitized tapes into individual recordings.
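Heart rate estimation by autocorrelation, one of the automations mentioned above, amounts to finding the dominant periodicity within a physiologically plausible lag range. A minimal sketch on a synthetic signal (the sinusoidal "beat", sampling rate, and BPM bounds are stand-ins for a real Doppler audio envelope):

```python
import numpy as np

def estimate_hr(signal, fs, lo_bpm=40, hi_bpm=200):
    """Estimate heart rate (bpm) by locating the dominant autocorrelation
    peak within the lag range corresponding to plausible heart rates."""
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags >= 0
    lo = int(fs * 60 / hi_bpm)                          # shortest plausible period
    hi = int(fs * 60 / lo_bpm)                          # longest plausible period
    lag = lo + np.argmax(ac[lo:hi])                     # period in samples
    return 60 * fs / lag

fs = 250                                  # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
beat = np.sin(2 * np.pi * 1.2 * t)        # synthetic 72-bpm periodicity
rng = np.random.default_rng(2)
hr = estimate_hr(beat + 0.3 * rng.normal(size=t.size), fs)
```

Restricting the search to the 40-200 bpm lag window keeps the zero-lag peak and long-period drift from dominating the estimate.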

Summary: This collaboration will include contributions from COMEX, Duke University, DAN, DRDC Canada, Karolinska Institute and QinetiQ UK; we welcome additional collaborations to strengthen the value of this tool.
28 Davide Placido Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Muse Seasonally adjusted health data improve ML performance in diagnosis classification Victorine P. Muse*, Davide Placido*, Amalie Dahl Haue, and Søren Brunak University of Copenhagen Seasonal changes in laboratory data, attributable to yearly weather and dietary variation, are a well-known and well-studied phenomenon. This study investigates the potential benefits of applying a newly developed seasonal laboratory data adjustment method to a large Danish cohort of ~575 thousand patients. We developed and trained 4 basic machine learning models to classify 35 cardiovascular diagnoses, with the only input features being 23 laboratory tests and patients' sex. The machine learning models trained were AdaBoost, Decision Tree, Neural Net, and Random Forest. Model performance gains were assessed before and after the seasonal adjustment method was applied, using AUROC and AUPRC metrics. Feature contributions were quantified using SHAP values. Classification improved for most of the 35 ICD-10 circulatory disease chapter codes assessed in this study. In summary, this study stresses the clinical value of adjusting for seasonality when conducting EHR-based studies.
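One simple form of seasonal adjustment — centering each calendar month of a lab test on the overall mean — can be sketched as follows. The authors' newly developed method is likely more sophisticated; this only illustrates the idea of removing month-of-year structure before model training.

```python
import numpy as np

def seasonally_adjust(values, months):
    """Center each calendar month of a lab test on the overall mean,
    removing month-of-year structure (a simple means-based adjustment)."""
    values = np.asarray(values, dtype=float)
    adjusted = values.copy()
    for m in np.unique(months):
        mask = months == m
        adjusted[mask] += values.mean() - values[mask].mean()
    return adjusted

# synthetic lab test with a planted sinusoidal seasonal component
rng = np.random.default_rng(6)
months = np.tile(np.arange(12), 100)
raw = 5 + 2 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 0.5, months.size)
adjusted = seasonally_adjust(raw, months)
```

After adjustment every month has the same mean, so a downstream classifier cannot pick up the seasonal signal as a spurious feature.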
29 Vikas Pejaver Precision Medicine: Using computation and artificial intelligence to improve healthcare and public health Poster only Pejaver Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations for PP3/BP4 criteria Vikas Pejaver, Alicia B. Byrne, Bing-Jian Feng, Kymberleigh A. Pagel, Sean D. Mooney, Rachel Karchin, Anne O’Donnell-Luria, Steven M. Harrison, Sean V. Tavtigian, Marc S. Greenblatt, Leslie G. Biesecker, Predrag Radivojac, Steven E. Brenner, ClinGen Sequence Variant Interpretation Working Group Icahn School of Medicine at Mount Sinai Recommendations from the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) for interpreting sequence variants specify the use of computational predictors as Supporting level of evidence for pathogenicity or benignity using criteria PP3 and BP4, respectively. The 2015 ACMG/AMP recommendations assume that score intervals defined by tool developers are already calibrated for clinical use and require the consensus of multiple predictors. However, these score intervals and, thus, the PP3 and BP4 criteria lack quantitative support. Previously, we described a probabilistic framework that quantified the strengths of evidence (Supporting, Moderate, Strong, Very Strong) within ACMG/AMP recommendations. We have extended this framework to computational predictors and introduce a new standard that maps a tool’s scores to these evidence strengths with respect to the PP3 and BP4 criteria. Our approach is based on estimating the local positive predictive value (precision) and can calibrate any computational tool or other continuous-scale evidence on any variant type. We estimate thresholds (score intervals) corresponding to each strength of evidence for pathogenicity and benignity for thirteen missense variant interpretation tools, using carefully assembled independent data sets. 
Most tools achieved Supporting evidence level for both pathogenic and benign classification using newly established thresholds. Multiple tools reached score thresholds justifying Moderate and several reached Strong evidence levels. One tool reached Very Strong evidence level for benign classification on some variants. Taken together, our results suggest a more prominent role for computational predictors in the clinical classification of variants. Based on these findings, we provide recommendations for evidence-based revisions of the PP3 and BP4 ACMG/AMP criteria using individual tools and future assessment of computational methods for clinical interpretation.  
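The local positive predictive value idea underlying this calibration — the posterior probability of pathogenicity estimated from labeled reference variants with scores near the query score, weighted by a prior — can be sketched as follows. The window width, prior, and score distributions are illustrative assumptions, not the paper's calibration procedure or thresholds.

```python
import numpy as np

def local_ppv(scores_path, scores_ben, s, window=0.05, prior=0.1):
    """Prior-weighted local positive predictive value at score s: the
    posterior probability of pathogenicity estimated from labeled
    reference variants whose scores fall within +/- window of s."""
    p = np.mean(np.abs(scores_path - s) <= window)   # local density, pathogenic set
    b = np.mean(np.abs(scores_ben - s) <= window)    # local density, benign set
    return prior * p / (prior * p + (1 - prior) * b + 1e-12)

# synthetic predictor: pathogenic variants score high, benign score low
rng = np.random.default_rng(5)
path_scores = np.clip(rng.normal(0.8, 0.1, 2000), 0, 1)
ben_scores = np.clip(rng.normal(0.2, 0.1, 2000), 0, 1)
ppv_high = local_ppv(path_scores, ben_scores, 0.85)
ppv_low = local_ppv(path_scores, ben_scores, 0.15)
```

Score intervals where the local posterior clears a given posterior-odds bound can then be mapped to evidence strengths (Supporting, Moderate, Strong, Very Strong).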
30 Anton Kodochygov Workshop: Accessing clinical-grade genomic classification data through the ClinGen Data Platform Poster only Kodochygov The ClinGen Allele Registry: Improvements in Content, Stability and Scalability Anton Kodochygov, Andrew R. Jackson, Kevin Riehle, Aleksandar Milosavljevic Molecular and Human Genetics Department / Baylor College of Medicine The ClinGen Allele Registry is a genomic variant web service that enables effective exchange of information from different genomic variant sources such as gnomAD, dbSNP, ClinVar, etc. by providing canonical identifiers for variants and linking back to their original sources. Users are able to register variants directly with the Allele Registry (free, open registration), individually or via large batch payloads (VCF or HGVS formats). Alignments from standard genomic builds, transcript sequence alignments, and shiftable representations are uniquely represented on disk. The Registry currently stores over 2.5 billion globally unique canonical variant identifiers (CA IDs). To ensure high throughput at large scales, the database needs to be scalable and have good asymptotic efficiency.

As the Registry gains public adoption and grows in size, it requires continuous stability and scalability improvements. The previous 32-bit identifier had a 4.29B-variant limit, so the Registry has recently converted to 64-bit identifiers. Scalability is achieved via a design that takes advantage of a paging architecture and parallel processing. Page data are stored on disk rather than fully in memory to make the solution feasible. For maximum efficiency, storage is highly compressed, with some types of variants, such as SNVs, taking up only a few bytes in memory. Processing tasks are divided among pages, and each page can be processed independently in a highly parallel fashion. To reduce turnaround time and improve asymptotic efficiency, pages are stored in sorted order and caching is leveraged for recently accessed pages. The Registry has been optimized to accept input that is pre-sorted by genomic coordinate, which decreases the number of page loads, increases the utilization of the cache, and minimizes disk I/O, resulting in O(n) asymptotic performance.
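The benefit of coordinate-sorted input can be illustrated with a toy paging scheme: because both the variants and the page ranges are sorted, a single forward pass assigns every variant to its page, so each page is loaded at most once and the whole batch is processed in O(n). The function name and page layout here are illustrative, not the Registry's on-disk format.

```python
def assign_pages(sorted_positions, page_starts):
    """Group variants (pre-sorted by genomic coordinate) with pages that
    cover sorted, disjoint coordinate ranges. One forward pass, never
    rewinding, touches each page once: O(n) over the input."""
    batches = {i: [] for i in range(len(page_starts))}
    page = 0
    for pos in sorted_positions:
        # advance (never rewind) to the page whose range contains pos
        while page + 1 < len(page_starts) and pos >= page_starts[page + 1]:
            page += 1
        batches[page].append(pos)
    return batches

# pages covering [0, 100), [100, 200), [200, ...)
batches = assign_pages([5, 50, 120, 250], [0, 100, 200])
```

Unsorted input would instead bounce between pages, forcing repeated page loads and cache evictions; sorting converts that into sequential I/O.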

31 Mark Mandell Workshop: Accessing clinical-grade genomic classification data through the ClinGen Data Platform Poster only Mandell ClinGen’s Variant Curation Interface (VCI) Mark Mandell, Christine Preston, Matt Wright, Karen Dalton, Gloria Cheung, Bryan Wulf, Marina DiStefano, Steven Harrison, Natasha Strande, Justyne Ross, Hannah Dziadzio, Clarissa Klein, Rachel Shapira, Ingrid Keseler, Deborah Ritter, Neethu Shah, Kevin Riehle, Aleksandar Milosavljevic, Sharon Plon, Teri Klein Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA  The NIH-funded Clinical Genome Resource Consortium Variant Curation Interface (ClinGen VCI) is a global, open-source variant classification platform for supporting the application of evidence criteria and classification of variants based on the ACMG/AMP sequence variant classification guidelines. To facilitate evidence-based improvements in human variant classification, the VCI is publicly available to the worldwide genomics community. The VCI is among a suite of tools developed by ClinGen, and supports an FDA-recognized human variant curation process of ClinGen Variant Curation Expert Panels (VCEPs). The variant curation workflow is intended to support dissemination of variant curations into two public repositories: the ClinGen Evidence Repository (for approved ClinGen VCEPs) and ClinVar.

ClinGen is expanding to involve more curators and teams of curators working together (affiliations) as part of an increasing scale of activities that will increase genetic variant curations across a greater number of genes. The VCI platform is built on a serverless architecture and is designed to support future enhancements and enable scalable curation workflows. Features that support curation accuracy, scaling, and transparency include an audit trail, a variant prioritization feature, and integration with ClinGen’s Criteria Specification Registry (CSpec). These developments support the ClinGen mission of curating the continually-expanding space of the clinical human genome.

32 Oleksandr Savytskyi Workshop: High-Performance Computing Meets High-Performance Medicine Poster only Savytskyi The New Role of Connective Peptide 1 in Human Tyrosyl-tRNA Synthetase and Its Mutant Forms Associated with Charcot-Marie-Tooth Neuropathy, Studied by In Silico Methods Oleksandr V. Savytskyi, Alexander I. Kornelyuk, Thomas R. Caulfield 1) Drug Discovery Laboratory, Department of Neuroscience, Mayo Clinic, Jacksonville, Florida, USA. 2) Protein Engineering and Bioinformatics Department, Institute of Molecular Biology and Genetics, National Academy of Sciences of Ukraine, Kyiv, Ukraine Aminoacyl-tRNA synthetases are key enzymes of protein biosynthesis that are also implicated in other cellular processes. Two heterozygous missense mutations (G41R, E196K) and one de novo deletion (153-156delVKQV) in Homo sapiens TyrRS were identified in different families of patients with Charcot-Marie-Tooth disorder type C (DI-CMTC), a group of heterogeneous inherited disorders characterized by degeneration of peripheral nerve fibers, loss of muscle tissue, and loss of touch sensation. The G41R and 153-156delVKQV mutations could block L-tyrosine binding (decreasing the rate of catalytic activity >100-fold) (Froelich, 2011). The common mechanism is still unknown (Niehues, 2015).

The structures of all three DI-CMTC-related mutants of HsTyrRS and of their complexes with substrates, cognate tRNA(Tyr), and translation elongation factor eEF1A2 have not been determined experimentally and were therefore constructed in silico using Modeller 9 and PIPER software. To study the conformational dynamics of the DI-CMTC-related mutants of HsTyrRS, 100-ns MD simulations were performed in the GROMACS package using the grid environment and services of the MolDynGrid virtual laboratory. The Distributed Analyzer Script (DAS) was used to automate the analytical tools.

Melting of the H9 helix (T141-A148) and subsequent partial melting of the H11 helix were observed in the MD simulation of the 153-156delVKQV mutant. A novel β-sheet formation was revealed in the K147-E157 region of the G41R and 153-156delVKQV TyrRS mutants. In both cases the β-sheets are quite stable (over the 20-100 ns time interval). Conformational transitions were found to take place within the CP1 connective peptide (L125-G163 region), an important structural element for tRNA(Tyr) recognition. Hydrogen bonds formed between the K147-E157 region and tRNA(Tyr) were as follows (lifetimes): E151:C75 – 38.71%, Q155:A76 – 12.90%, K147:G72 – 12.90%, K147:G71 – 9.68%.

Our results support the idea that defects of the intermolecular interfaces in the complexes of mutant forms of HsTyrRS with cognate tRNA(Tyr) and/or eEF1A2 may be a common molecular mechanism of Charcot-Marie-Tooth disorder.
33 Derek Archer General Poster only Archer Leveraging longitudinal diffusion MRI data to quantify differences in white matter microstructural decline in normal and abnormal aging Derek B. Archer, Niranjana Shashikumar, Varuna Jasodanand, Elizabeth E. Moore, Kimberly R. Pechman, Murat Bilgel, Lori L. Beason-Held, Yang An, Andrea Shafer, Shannon L. Risacher, Kurt G. Schilling, Bennett A. Landman, Angela L. Jefferson, Andrew J. Saykin, Susan M. Resnick, Timothy J. Hohman Vanderbilt University Medical Center Several prior studies have used diffusion MRI to investigate differences between normal and abnormal aging; however, many of these studies used conventional diffusion MRI measures and single-site data. The goal of this study is to leverage multi-site harmonized diffusion MRI data in conjunction with an advanced post-processing technique [i.e., free-water (FW) correction] to quantify aging-related tract-specific changes in white matter microstructure. We found that normal and abnormal agers had several interactions on white matter decline, with pronounced effects in tracts such as the inferior frontal occipital fasciculus, transcallosal inferior frontal gyrus pars opercularis, fornix, and transcallosal angular gyrus. We also conducted bootstrapped analysis of age models, which only considered typical covariates, and additional bootstrapped analysis using models which also covaried for AD diagnosis and conversion along the AD continuum. Differences in marginal R2 between these models were calculated, and we found that the FW metric was highly vulnerable to AD diagnosis and conversion. Within the FW metric, we found that the limbic tracts were most vulnerable. The results of this study suggest that white matter has high sensitivity to changes in AD and that FW correction should be incorporated into studies of AD.
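The model-comparison idea above — bootstrapping the gain in variance explained when AD covariates are added to an age model — can be illustrated with a deliberately simplified sketch. This toy uses plain OLS R² on synthetic data rather than the study's harmonized longitudinal data and marginal R² from mixed models; all variable names and effect sizes are invented for illustration.

```python
import numpy as np

def r2(X, y):
    # Ordinary least-squares R^2 via the normal equations.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

def bootstrap_r2_gain(X_base, X_full, y, n_boot=200, seed=0):
    """Bootstrap the gain in R^2 when extra covariates are added."""
    rng = np.random.default_rng(seed)
    n = len(y)
    gains = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        gains.append(r2(X_full[idx], y[idx]) - r2(X_base[idx], y[idx]))
    return np.percentile(gains, [2.5, 50, 97.5])

# Synthetic example: age alone vs. age plus a 'diagnosis' covariate
# influencing a free-water-like metric.
rng = np.random.default_rng(1)
n = 300
age = rng.normal(70, 8, n)
dx = rng.integers(0, 2, n).astype(float)
fw = 0.02 * age + 0.5 * dx + rng.normal(0, 0.3, n)
X_base = np.column_stack([np.ones(n), age])
X_full = np.column_stack([X_base, dx])
lo, med, hi = bootstrap_r2_gain(X_base, X_full, fw)
```

A large positive gain interval for a metric, as here, would mark it as "vulnerable" to the added diagnosis covariates in the sense used above.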
34 Johanna Blee General Poster only Blee In silico optimization of antibiotic dosing to efficiently treat persistent biofilm infections Sabine Hauert, Thomas Gorochowski University of Bristol Most clinical infections are caused by biofilms. As we approach an antimicrobial resistance crisis, smarter and more effective treatment strategies are crucial. Persister cells in biofilms are more tolerant to stress and allow biofilms to survive antibiotic treatments. Here, we explore how periodic dosing can reduce the total antibiotic dose required to eradicate persistent sub-populations and more effectively treat a biofilm infection. Because the dynamics and mechanisms of biofilm growth and persister cell formation are complex and diverse, we developed individual-based in silico models to streamline the process of finding optimal treatment strategies and to determine key parameters in this large parameter space. We simulated a broad range of persistence switching dynamics and corresponding optimised periodic dosing regimens. We found that optimised dosing reduced the total dose required to effectively treat the biofilms by 66.6 ± 0.5%. This demonstrates how antibiotic dosing regimens tailored to the biofilm dynamics could be used to reduce the antibiotic load and deliver more efficient and precise treatments. We also discovered that the biofilms’ architecture and response to antibiotics depend on the persister dynamics. Biofilms with the slowest switching to persister cells and fastest switching back to susceptible cells always required the lowest dose of antibiotics, whereas those with the fastest switching to persister cells and slowest switching back always required the highest dose. The optimal treatment and recovery time between treatments depended on the persistence dynamics. We also found that, despite these differences, a single optimal periodic dosing regimen was still effective against all persistence dynamics and reduced the antibiotic dose by 56 ± 0.1%. This suggests that even when persistence and its evolution are not fully parametrised, we can still develop more efficient dosing regimens.
As persistence becomes better understood and quantified, these models can then be used to develop tailored optimal strategies.
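The qualitative trend reported above — slow switching to persisters plus fast switching back needs the least antibiotic — can be reproduced with a minimal two-state mean-field model, far simpler than the individual-based models used in the study. All rates, kill/growth constants, and the eradication threshold below are invented for illustration.

```python
def treat(dose_on, dose_off, a, b, t_end=200.0, dt=0.01):
    """Susceptible/persister model under periodic dosing (Euler steps).
    a: switch rate S -> P, b: switch rate P -> S (assumed per-hour units).
    Antibiotic kills only susceptible cells during the on-phase."""
    S, P = 1.0, 0.0
    total_dose, t = 0.0, 0.0
    period = dose_on + dose_off
    while t < t_end and S + P > 1e-6:
        dosing = (t % period) < dose_on
        kill = 2.0 if dosing else 0.0        # killing during dosing
        growth = 0.0 if dosing else 0.5      # regrowth between doses
        dS = (growth - kill - a) * S + b * P
        dP = a * S - b * P
        S, P = max(S + dS * dt, 0.0), max(P + dP * dt, 0.0)
        if dosing:
            total_dose += dt
        t += dt
    eradicated = S + P <= 1e-6
    return eradicated, total_dose

# Slow switching to persisters / fast switching back vs. the reverse.
ok_fast, dose_fast = treat(2.0, 2.0, a=0.01, b=1.0)
ok_slow, dose_slow = treat(2.0, 2.0, a=0.1, b=0.01)
```

In this toy, the first parameterization is eradicated with far less cumulative dosing time, while the second shelters cells in the persister state through each dose, mirroring the trend described above.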
35 Raquel Bromberg General Poster only Bromberg CryoEM single particle reconstruction with a complex-valued particle stack Raquel Bromberg, Yirui Guo, Dominika Borek, Zbyszek Otwinowski Ligo Analytics Single particle reconstruction (SPR) in cryoEM is an image processing task with an elaborate hierarchy that starts with a large number of very noisy multi-frame images. Efficient representation of the intermediary image structures is critical for keeping the calculations manageable. One such intermediary structure is called a particle stack and contains cut-out images of particles in square boxes of predefined size. Neither the contrast transfer function (CTF) nor its Fourier transform, the point spread function (PSF), is considered at this step. Historically, the particle stack was intended for large particles and for a tighter PSF, which is characteristic of lower resolution data. The field now analyzes smaller particles at higher resolution, and these conditions result in a broader PSF that requires larger padding and slower calculations to integrate information for each particle. Consequently, the approach to handling structures such as the particle stack should be reexamined to optimize data processing.

Here we propose to use as a source image for the particle stack a complex-valued image, in which CTF correction is implicitly applied as the real component of the image. The final CTF correction that we later refine and apply has a very narrow PSF, so cutting out particles from micrographs that were approximately corrected for CTF does not require extended buffering, i.e. the boxes during the analysis only have to be large enough to encompass the particle. The Fourier transform creates a complex-valued image considered in real space, as opposed to standard SPR data processing, where complex numbers appear only in Fourier space. This extension of the micrograph concept provides multiple advantages because the particle box size can be small while calculations crucial for high resolution reconstruction, such as Ewald sphere correction, aberration refinement, and particle-specific defocus refinement, can be performed on the small box data.
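The core manipulation — correcting a micrograph for the CTF in Fourier space and keeping the inverse transform as a complex-valued image from which small boxes are cut — can be sketched as follows. The CTF expression and box coordinates here are toy placeholders, not the authors' actual correction or data model; a simple phase flip stands in for the approximate correction.

```python
import numpy as np

rng = np.random.default_rng(0)
micrograph = rng.normal(size=(64, 64))

# Toy sign-oscillating CTF over spatial frequency (a stand-in for the
# real defocus-dependent CTF).
fy = np.fft.fftfreq(64)[:, None]
fx = np.fft.fftfreq(64)[None, :]
s2 = fx**2 + fy**2
ctf = np.sin(300.0 * s2 + 0.3)

# Approximate (phase-flip) correction applied to the whole micrograph;
# the inverse FFT of the corrected spectrum is kept as a complex image.
spectrum = np.fft.fft2(micrograph)
corrected = np.fft.ifft2(spectrum * np.sign(ctf))

# Particles can now be boxed tightly from the complex-valued image,
# without the padding a broad PSF would otherwise force.
box = corrected[10:26, 10:26]
```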
36 Yirui Guo General Poster only Bromberg Singular value decomposition (SVD) of particle movements for motion analysis in cryoEM movies Raquel Bromberg, Yirui Guo, Thomas Stanley, Dominika Borek, Zbyszek Otwinowski Ligo Analytics Singular value decomposition (SVD) is an efficient method for finding patterns in data. The motions observed in cryoEM movies can be decomposed with SVD because the alignment of multiframes in movies collected in cryoEM SPR provides natural vectorization. The components obtained from such an SVD indicate how many types of motion are present in a particular experiment. We developed and implemented an efficient SVD decomposition to map the motion features in space and time. SVD can be used as a guided restraint, allowing larger motions at the start and dampening them at the later stages. Another use is to filter data for excessive and unusual types of motion, e.g. ice layers collapsing, to allow for automatic data selection for subsequent steps of structure solution. Finally, SVD provides an unbiased, comprehensive, and dataset-specific estimate of the magnitude and character of the largest initial motions driven by ice expansion and bulging. We observed that these highly detrimental initial motions depend on sample features, e.g. particle density, buffer components, and imaging conditions. This approach significantly simplifies motion analysis and makes the interpretation more objective. Here we present the results of this analysis for selected cryoEM SPR reconstructions.
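The vectorization idea above — stacking per-particle, per-frame shifts into a matrix and decomposing it — can be sketched on synthetic data. The "ice bulging" time course and noise levels below are invented; the point is only that a single dominant motion shows up as one large singular component with its time course in the right singular vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_frames = 50, 20

# Synthetic per-particle x-shift trajectories: one dominant shared
# motion, largest in the earliest frames, plus small noise.
t = np.arange(n_frames)
global_motion = np.exp(-t / 5.0)              # early drift that decays
weights = rng.normal(1.0, 0.2, n_particles)   # per-particle amplitude
shifts = (np.outer(weights, global_motion)
          + 0.05 * rng.normal(size=(n_particles, n_frames)))

# SVD of the particle-by-frame shift matrix: left vectors map each
# motion over particles (space), right vectors give its time course.
U, S, Vt = np.linalg.svd(shifts, full_matrices=False)
explained = S**2 / np.sum(S**2)
```

Here `explained[0]` dominates (one motion type), and the first right singular vector peaks at frame 0, recovering the large initial motion.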
37 Richard Chen General Poster only Chen Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images Richard J. Chen, Ming Y. Lu, Wei-Hung Weng, Tiffany Y. Chen, Drew F.K. Williamson, Trevor Manz, Maha Shady, Faisal Mahmood Richard Survival outcome prediction is a challenging weakly-supervised and ordinal regression task in computational pathology that involves modeling complex interactions within the tumor microenvironment in gigapixel whole slide images (WSIs). Despite recent progress in formulating WSIs as bags for multiple instance learning (MIL), representation learning of entire WSIs remains an open and challenging problem, especially in overcoming: 1) the computational complexity of feature aggregation in large bags, and 2) the data heterogeneity gap in incorporating biological priors such as genomic measurements. In this work, we present a Multimodal Co-Attention Transformer (MCAT) framework that learns an interpretable, dense co-attention mapping between WSIs and genomic features formulated in an embedding space. Inspired by approaches in Visual Question Answering (VQA) that can attribute how word embeddings attend to salient objects in an image when answering a question, MCAT learns how histology patches attend to genes when predicting patient survival. In addition to visualizing multimodal interactions, our co-attention transformation also reduces the space complexity of WSI bags, which enables the adaptation of Transformer layers as a general encoder backbone in MIL. We apply our proposed method on five different cancer datasets (4,730 WSIs, 67 million patches). Our experimental results demonstrate that the proposed method consistently achieves superior performance compared to the state-of-the-art methods.
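The central operation described above — genomic embeddings acting as queries that attend over a huge bag of WSI patch embeddings, shrinking the bag to a handful of gene-guided summaries — can be sketched with plain scaled dot-product co-attention in NumPy. Dimensions, the random projections, and the six "gene groups" are illustrative stand-ins, not the MCAT architecture itself.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32
patches = rng.normal(size=(5000, d))   # bag of WSI patch embeddings
genes = rng.normal(size=(6, d))        # genomic embeddings (toy groups)

# Co-attention: genomic embeddings are queries over the patch bag,
# reducing 5000 instances to 6 gene-guided summaries.
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
Q, K, V = genes @ Wq, patches @ Wk, patches @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))   # (6, 5000) attention map
summary = attn @ V                     # (6, d) gene-guided WSI features
```

The (6, 5000) attention map is exactly the kind of object that can be visualized to show which patches each genomic embedding attends to, and the reduced summary is what makes downstream Transformer layers tractable.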
38 Xingyao Chen General Poster only Chen SPE3D: A Distributed and Scalable 2D/3D Digital Pathology Platform Xingyao Chen, Edwin Heredia, Shiva Patre, Jerry S.H. Lee, Naim Matasci, David B. Agus The Lawrence J. Ellison Institute for Transformative Medicine Recent advancements in whole slide digital scanning and an increased focus on personalized precision medicine have paved the way for research in digital pathology, especially in the application of machine learning (ML) to infer pathological diagnoses. While interest in ML applications in digital pathology is strong, conducting meaningful research in this field demands significant allocation of computational resources. Therefore, there is an urgent need for a scalable digital pathology platform that not only provides tools for Whole Slide Image (WSI) processing but can also efficiently allocate computing resources and upload results to a centralized database. To meet this need, we present a state-of-the-art, robust, cloud-based digital pathology platform that allows end-to-end processing of pathology images to both train models and run inference at scale. The platform is designed to ingest 2D and 3D tissue samples, process them to create any necessary datasets, and then use the datasets to train and evaluate a collection of models, from classical ML models to deep neural networks. The platform runs entirely on a cloud-based infrastructure comprising CPU and GPU nodes, high-performance machine storage, and distributed object storage. The platform can scale the number of nodes required to perform each operation based on workload by making extensive use of containerization and workflow-based approaches. Furthermore, the platform supports modern machine learning toolkits, allowing reusability of individual components as well as rapid deployment, prototyping, and validation of novel algorithms.
Finally, we benchmarked the platform on several 2D and 3D digital pathology datasets to demonstrate its computational capacity as well as the quality of its output datasets.
39 Chel Hun Choi General Poster only Choi Comparison of RNA-Seq and Microarray in Transcriptome Profiling for clinical outcome prediction Chel Hun Choi, Yoo-Young Lee, Cheong-Rae Roh, Sung-Jong Lee Samsung Medical Center Introduction: Gene expression profiling, whether by RNA-seq or microarray, is widely applied in cancer research to identify biomarkers for clinical endpoint prediction. We aimed to systematically investigate the potential of RNA-seq-based classifiers in predicting clinical endpoints, in comparison to microarrays, using TCGA datasets of small cell lung cancer, colorectal cancer, renal cancer, breast cancer, endometrial cancer, and ovarian cancer.

Method: The correlation coefficient R(RNA-seq) was measured as the correlation between gene expression measured by RNA-seq and protein levels measured by reverse-phase protein array (RPPA), and R(microarray) was calculated as the correlation between gene expression measured by microarray and RPPA.

Results: Most genes showed similar correlation coefficients between RNA-seq and microarray; however, 16 genes showed significant differences between the two methods. Among these 16 genes, BAX was identified three times, in colorectal, renal, and ovarian cancer. PIK3CA, identified in renal and breast cancer, showed a higher correlation on microarray, while the other genes showed better results with RNA-seq. In addition, after selecting the top 100 survival-related genes (by univariate Cox regression) from the RNA-seq data, the c-index of a survival prediction model (random survival forest) was calculated. Comparing platforms, microarray was better in COAD, KIRC, and LUSC, while RNA-seq was better in ovarian and endometrial cancer.

Conclusion: The correlation between mRNA levels and protein levels (measured by RPPA) is good, and the performance of RNA-seq and microarray is similar. The survival prediction models, however, showed mixed results across platforms.
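The c-index used above to compare survival models across platforms is simple to state concretely. This is a plain implementation of Harrell's concordance index on a tiny invented example, not the random survival forest pipeline itself.

```python
def c_index(times, events, risk):
    """Harrell's concordance index: the fraction of usable pairs in
    which the higher-risk subject fails earlier. A pair (i, j) is
    usable if subject i has an observed event before time j; risk ties
    count as 0.5."""
    concordant, usable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable

# Tiny worked example: predicted risk perfectly ordered with event time
# (one censored subject), so concordance is perfect.
times = [2.0, 5.0, 8.0, 11.0]
events = [1, 1, 0, 1]
risk = [0.9, 0.6, 0.4, 0.1]
ci = c_index(times, events, risk)
```

A c-index of 0.5 corresponds to random ordering; per-cancer c-index values computed this way are what would be compared between the RNA-seq and microarray models.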

40 Sinead Cullina General Poster only Cullina Admixture mapping for phenome-wide discovery in a diverse health system biobank Sinead Cullina, Ruhollah Shemirani, Bryce Rowan, Noura Abul-Husn, Samira Asgari, Eimear E. Kenny Icahn School of Medicine at Mount Sinai The genomes of admixed individuals reflect recent ancestry from two or more continents; examples include African American (AA), Hispanic Latino (HL), and South African populations. Despite many common heritable diseases being enriched in admixed populations, most genomic discovery approaches are tailored to more homogeneous populations. Admixture mapping utilizes differences in disease prevalence and allele frequency between the ancestral populations of admixed individuals for association testing. We have developed a generalizable pipeline and best practices for admixture mapping in a diverse biobank in New York City (NYC).

Individual-level ancestry proportions were calculated using genotype array data for 53,900 individuals in the BioMe Biobank at Mount Sinai, NYC. Participants were assigned to AA or HL cohorts based on self-reported race/ethnicity and patterns of recent admixture. Genotypes were imputed using the TOPMed reference panel and phased, and RFMIX2 was used to perform two-way (African (AFR) and European (EUR) ancestry) or three-way (AFR, EUR, and Native American (NAT) ancestry) local ancestry (LA) inference for the AA and HL groups, respectively. Admixture mapping was performed using generalized linear models in which LA haplotypes are tested for association with ~1,000 phecodes from electronic health records.

To assess the quality of LA calls, we compare R2 correlation between the sum of LA haplotypes per individual and their global admixture proportions. We observe strong correlation in HL (R2: AFR 1, EUR 0.96, NAT 0.96) and AA (R2: AFR 1, EUR 1). In AA, of 776 phecodes tested 43 reach suggestive significance (P <= 2.07e-05) including replication of known (e.g. ACKR1, APOL1) and novel admixture mapping associations. We will present results from the association and fine-mapping of LA haplotypes in HL and AA.

Large efforts to increase diversity in genomic research means that more admixed individuals will be recruited to biobanks, motivating development of pipelines like this for well-calibrated admixture mapping.
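The per-locus test described above — a generalized linear model relating a binary phecode to the number of local-ancestry haplotype copies — can be sketched with a small self-contained logistic regression fit by Newton's method. The cohort, dosage distribution, and effect size below are synthetic; the real pipeline covers ~1,000 phecodes and includes additional covariates.

```python
import numpy as np

def logistic_assoc(dosage, y, n_iter=25):
    """Score a local-ancestry dosage (0/1/2 haplotype copies) against a
    binary phecode via logistic regression (Newton's method / IRLS).
    Returns the dosage coefficient and its Wald z-score."""
    X = np.column_stack([np.ones_like(dosage, dtype=float), dosage])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None])          # Fisher information
        beta += np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.linalg.inv(H)[1, 1])
    return beta[1], beta[1] / se

# Synthetic cohort: more copies of one ancestral haplotype at a locus
# raise case probability (a true admixture-mapping signal).
rng = np.random.default_rng(0)
n = 4000
dosage = rng.binomial(2, 0.5, n)
logit = -2.0 + 0.5 * dosage
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
beta1, z = logistic_assoc(dosage, y)
```

The Wald z-score (or its p-value) per phecode is the quantity compared against the suggestive threshold quoted above.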
41 Guanlan Dong General Poster only Dong Somatic genomic alterations in single neurons from brains with chronic traumatic encephalopathy (CTE) Guanlan Dong, Chanthia Ma, Michael B. Miller, August Yue Huang, Ann C. McKee, Eunjung Alice Lee, Christopher A. Walsh Harvard University Chronic traumatic encephalopathy (CTE) is a neurodegenerative disease associated with repetitive head trauma. While CTE shares certain pathological features with Alzheimer’s disease (AD), the genetic, molecular, and cellular mechanisms behind the development of CTE are not well understood. The advent of single-cell sequencing technologies allows for the analysis of the contribution of somatic mutations to disease pathogenesis. Previous studies of single-cell whole genome sequencing (scWGS) on aging and neurodegenerative brains showed that somatic single-nucleotide variants (sSNVs) increase both with aging and in disease, but present with distinct patterns of mutational signatures, suggesting that genetic, environmental, or disease states might influence this accumulation.

In this study, we applied scWGS using Primary Template-directed Amplification (PTA) to neurons from the prefrontal cortex of CTE brains. We performed sSNV calling using three bioinformatics pipelines, LiRA, SCAN-SNV, and SCAN2, and integrated the results to improve genome-wide burden estimation and statistical power for downstream analyses. We compared the rates of sSNV accumulation in CTE and control neurons and found a significant increase of hundreds of sSNVs in CTE brains compared with age-matched controls. To interpret the biological impact of sSNVs, we used a non-negative least-squares method to decompose the sSNV spectra into known mutational signatures and observed significantly more contributions from signatures associated with aging and oxidative damage in CTE than in controls. In addition, we used a non-negative matrix factorization method to discover de novo signatures in CTE. Since amplification artifacts can complicate mutational calling by introducing single-stranded lesions, we also attempted to apply orthogonal duplex sequencing methods to distinguish single-stranded lesions from double-stranded mutations. Given the pathological overlap between CTE and other neurodegenerative diseases, these results suggest potential contributing factors that are specific to CTE and provide unique insights into the mechanisms of neurodegeneration. 
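The non-negative least-squares decomposition of an sSNV spectrum into known signatures can be sketched compactly. The tiny 4-channel "signatures" and exposures below are invented, and a simple multiplicative-update solver stands in for a library NNLS routine (e.g. scipy.optimize.nnls); real spectra use 96 trinucleotide channels and COSMIC signatures.

```python
import numpy as np

def nnls(A, b, n_iter=5000):
    """Non-negative least squares via multiplicative updates.
    A and b must be non-negative (true for signatures and spectra)."""
    x = np.ones(A.shape[1])
    AtA, Atb = A.T @ A, A.T @ b
    for _ in range(n_iter):
        x *= Atb / (AtA @ x + 1e-12)
    return x

# Toy signatures over 4 mutation channels (columns sum to 1), and an
# observed spectrum built from known exposures 0.7 / 0.3 / 0.0.
sigs = np.array([[0.7, 0.1, 0.25],
                 [0.1, 0.6, 0.25],
                 [0.1, 0.2, 0.25],
                 [0.1, 0.1, 0.25]])
true_expo = np.array([0.7, 0.3, 0.0])
spectrum = sigs @ true_expo
expo = nnls(sigs, spectrum)
```

The fitted exposures recover the contributions of each signature; comparing such exposures between CTE and control neurons is the kind of contrast described above (e.g. more aging- and oxidative-damage-associated signature in CTE).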
42 Sabri Eyuboglu General Poster only Eyuboglu DataFrames for Complex Data Types in Biomedicine Sabri Eyuboglu, Karan Goel, Arjun Desai, James Zou, Chris Ré Stanford University Machine learning models in biomedicine commonly operate over complex data types like images, graphs, and text documents. The behavior of these models can be quantified by asking statistical questions over validation data. For example, a practitioner might ask: what is the subgroup accuracy of a chest X-ray diagnostic model on the subpopulation of patients with chest drains? However, answering these questions is challenging, in part because existing data wrangling tools (e.g. Pandas DataFrames) do not lend themselves to complex data types. To address these challenges, we develop a DataFrame abstraction that provides intelligent, “soft” operations over complex data types and a simple GUI framework for validating the results of these operations.
43 Tierra Farris General Poster only Farris Simplifying Submission of Links and Excerpts into the Linked Data Hub via the New Bridge API Tierra Farris, Andrew R. Jackson, Kevin Riehle, Neethu Shah, Jessie Arce, Alejandro Zuniga, Bosko Jevtic, Ramin Zahedi, Sharon E. Plon, Aleksandar Milosavljevic Molecular and Human Genetics Department, Baylor College of Medicine, Houston, TX The Linked Data Hub (LDH) is a framework that allows for linking of myriad types of data about a subject. The ClinGen LDH and the Common Fund Data Ecosystem (CFDE) LDH utilize this framework (1) to support variant curation efforts by aggregating excerpts of pertinent variant and regulatory element data from a variety of external sources, along with provenance information and links back to the original data source, and (2) to provide convenient access to aggregated variant and regulatory element information to the broader research community. A long-standing goal has been to lower the barrier for external data sources to link their data via the LDH for consumption by diverse applications. The critical task is to empower external data owners to submit links to their APIs and relevant excerpts of their data into the LDH, thus spreading awareness of the data available from their site about specific variants. One obstacle to high-volume contributions to the LDH by third parties has been the synchronous nature of the direct LDH APIs. To address this, we created the Bridge API, a service that allows users to submit JSON, TSV, or raw text data via a simple, authorized, HTTP PUT request to a topic-specific API endpoint. Accepted documents are published into one of our Pulsar message queues, where they are processed and ingested into the LDH out-of-band.
The Bridge API was recently used to submit more than 31 million documents representing 446 genes, 26,803 regulatory elements, 7,336,070 variants, and 23,867,110 evidence documents as part of an ongoing effort to increase the gene coverage of ClinGen genes targeted for variant curation and genes of interest for genetic burden testing by NIH CFDE studies.
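From a submitter's perspective, the interaction described above is a single authorized HTTP PUT of a JSON document. The endpoint URL, token, and payload field names below are hypothetical placeholders (the abstract does not specify the LDH schema or auth scheme); the sketch only shows the shape of the request.

```python
import json
import urllib.request

# Hypothetical endpoint and token; the real topic-specific Bridge API
# URL and authorization scheme are not given in the abstract.
endpoint = "https://example.org/bridge/api/variant-links"
token = "YOUR-API-TOKEN"

# Illustrative payload: a link plus an excerpt about one variant.
payload = {
    "entId": "NC_000007.14:g.140753336A>T",
    "links": [{"url": "https://example.org/source/record/1"}],
}

req = urllib.request.Request(
    endpoint,
    data=json.dumps(payload).encode("utf-8"),
    method="PUT",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    },
)
# urllib.request.urlopen(req) would send it. Because ingestion is
# queued (via Pulsar) and happens out-of-band, a 2xx response means
# 'accepted for processing', not 'already ingested'.
```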
44 Lindsay Fernandez-Rhodes General Poster only Fernandez-Rhodes Socioepigenomic Pathways to Elevated Body Mass Index in African American Adults Lindsay Fernández-Rhodes, LáShauntá Glover, Adam G. Lilly, Brooke S. Staley, Yujie Wang, Mariaelisa Graff, Anne E. Justice, Kari E. North on behalf of ARIC Investigators Penn State University Although epigenetic research has provided novel insights into obesity etiology, few studies have examined the biologic pathways through which socioeconomic circumstances may become embodied.  We sought to conduct epigenome-wide and structural equation analyses to assess whether cytosine-phosphate-guanine (CpG)  methylation may mediate the association between socioeconomic status (SES) and body mass index (BMI) in a sample of African American adults. 

Among participants with epigenetic data in the Atherosclerosis Risk in Communities Study, we created a composite SES score based on education, income, and occupation/employment from visit 1 data (1987-1989). BMI and DNA methylation β-values were derived from visit 2 or 3 data (1990-1995). We conducted two epigenome-wide association analyses using linear regression to identify CpG methylation sites associated with both SES and BMI (n=2593-2630), stratified by sex and adjusted for age, age^2, center, smoking (past, current, never), alcohol use (current, never/former), 10 genetic principal components, and time between visits (SES only). We then utilized a structural equation model to test whether identified sites mediated the association between SES and BMI. Analyses will be replicated in the Jackson Heart Study (n=1,183, 2000-2004).

A CpG site at SOCS3 (cg18181703) was associated both with the composite SES score (measured 2-8 years earlier; β=9x10^-3; p-value=2x10^-10) and with concurrent BMI (β=-7x10^-3; p-value=1x10^-9). The structural equation model revealed that 20% of the effect (β=-0.60; p-value<0.001) of SES on BMI was mediated via this CpG site (β=-0.15; p-value<0.001). The indirect effect operating through cg18181703 at SOCS3 differed significantly between women and men (p-value for difference=0.002), and was stronger in women than in men (women: β=-0.19, p-value<0.001; men: β=-0.03, p-value=0.3).

Based on these preliminary findings, we conclude that epigenetic changes may partially explain the persistent epidemiologic association between SES and obesity, but future research should consider how this process may differ between women and men.
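The mediation logic above — an exposure→mediator path (a), a mediator→outcome path (b), an indirect effect a·b, and a proportion mediated — can be illustrated with a product-of-coefficients sketch on synthetic data. This is a simplification of a structural equation model: effect sizes and noise are invented, and real analyses would add the covariates listed in the methods.

```python
import numpy as np

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 2000
ses = rng.normal(size=n)
# Mediator (CpG methylation) partly driven by SES...
meth = 0.3 * ses + rng.normal(scale=0.5, size=n)
# ...and outcome (BMI) driven by a direct and a mediated path.
bmi = -0.4 * ses - 0.5 * meth + rng.normal(scale=1.0, size=n)

X1 = np.column_stack([np.ones(n), ses])
a = ols(X1, meth)[1]                    # SES -> methylation
X2 = np.column_stack([np.ones(n), ses, meth])
coefs = ols(X2, bmi)
direct, b = coefs[1], coefs[2]          # SES -> BMI, methylation -> BMI
indirect = a * b                        # mediated (indirect) effect
prop_mediated = indirect / (indirect + direct)
```

Here the simulated ground truth implies an indirect effect of -0.15 and roughly 27% mediation; the study's reported 20% mediation through cg18181703 is the analogous quantity.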

45 Obed Garcia General Poster only Garcia Developing an HLA curation framework in ClinGen for complex diseases Obed A. Garcia, John D. Reveille, Ingrid Kessler, Michelle Whirl-Carillo, Clarissa Klein, Rachel Shapira, Matt W. Wright, S. Louis Bridges, Jr., Marcelo Fernandez-Viña, Steven Mack, Teri Klein on behalf of the ClinGen HLA Expert Panel Obed The human leukocyte antigen (HLA) region is a region on chromosome 6 that encodes immune-response proteins. The region’s deep evolutionary history has selected for maintaining it as diverse as possible, to interact with unknown pathogens, which has made it one of the most complex and polymorphic regions in the human genome. Natural selection for strong immunity, particularly in the HLA system, has also contributed to the rise of autoimmune disorders across human populations. Here we propose a baseline framework for expanding the current ClinGen Consortium curations from their current focus on monogenic disorders to include the complex HLA region and its many associations with autoimmune disorders such as spondylarthritis and rheumatoid arthritis. In conjunction with the Rheumatologic Autoimmune Disease CDWG, our HLA expert panel has developed a point-based framework for HLA allele curation that can ingest the complex HLA nomenclature and focuses on incorporating measures of typing, determining the burden of association, and assigning a level of evidence for association. We categorize evidence based on 4 levels of association and incorporate variables such as effect size, sample size, and allele frequency. This framework will be instrumental in the development of a new curation tool for the ClinGen consortium.
This model is not only applicable to HLA disease associations but can also be adapted to other complex polygenic curations in the future, thereby advancing efforts in precision medicine by creating a resource for curating the relationship between genomics and human health.
46 Jianye Ge General Poster only Ge A software tool to detect macrohaplotypes from long-read sequences for DNA mixture deconvolution Jianye Ge, Xuewen Wang, Muyi Liu University of North Texas Health Science Center Deconvoluting mixture samples is one of the most challenging problems confronting DNA forensics, as well as many medical applications (e.g., detecting multiple cancer subtypes). Efforts have been made to provide solutions regarding mixture interpretation in both marker design and computational analysis. Recently, a novel DNA marker, the macrohaplotype, was designed based on long-read sequencing technologies to improve the performance of mixture deconvolution. A macrohaplotype combines Short Tandem Repeats (STRs), Single Nucleotide Polymorphisms (SNPs), and Insertion-Deletions (Indels) in a nearby region to offer extremely high numbers of haplotypes, and hence very high discrimination power per marker, and can substantially improve mixture interpretation capabilities.

In this study, a software tool was developed to detect predefined macrohaplotypes from long-read sequences. The tool first detects the variants (i.e., STRs, SNPs, Indels) in each individual read separately, concatenates the variants from the same read, and then generates from each read a macrohaplotype that spans the whole predefined region. The tool takes an aligned BAM file, a reference genome, a list of predefined macrohaplotype regions, and a list of predefined STR configurations as input, and it outputs the detected macrohaplotypes, in both sequence format and variant format, with summary statistics. The tool is able to detect variants from multiple contributors, whereas current variant callers usually assume the sequence profile is generated from a single-source sample.
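The per-read construction step — concatenating a read's variant calls, in coordinate order, into one macrohaplotype — can be sketched as below. The read and variant representations are simplified stand-ins for illustration, not the tool's actual BAM-based data model.

```python
def macrohaplotype(read_variants, region):
    """read_variants: {position: allele} calls from ONE long read that
    fully spans `region` (start, end). Returns an ordered haplotype key
    built by concatenating the calls in coordinate order."""
    start, end = region
    if min(read_variants) < start or max(read_variants) > end:
        raise ValueError("read calls fall outside the target region")
    return "|".join(f"{pos}:{allele}"
                    for pos, allele in sorted(read_variants.items()))

# Two reads from different contributors yield distinct macrohaplotypes;
# keeping STR, SNP, and indel calls phased on one read is what gives
# the marker its discrimination power for mixture deconvolution.
region = (1000, 2000)
read1 = {1005: "A", 1200: "AGAT x 12", 1900: "del2"}   # SNP, STR, indel
read2 = {1005: "G", 1200: "AGAT x 10", 1900: "ref"}
h1 = macrohaplotype(read1, region)
h2 = macrohaplotype(read2, region)
```

In a real mixture, counting reads per distinct macrohaplotype string is what separates contributors.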

This tool was first designed for forensic mixture interpretation, but can also be used for other applications, such as multiploidy species analysis, detecting multiple cancer subtypes within the same tissues, etc. This study is supported in part by funding 15PNIJ-21-GG-04159-RESS, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice.
47 Wenbin Guo General Poster only Guo Epigenetic age acceleration is associated with cell compositional changes and cancer risk scores in healthy breast tissue  Wenbin Guo, Mary E. Sehl, Colin Farrell, Matteo Pellegrini, Steve Horvath, Patricia A. Ganz UCLA Introduction: Breast tissue age, measured with DNA methylation-based estimates, is accelerated in tumor and normal adjacent tissues in breast cancer. However, it is unclear whether the degree of epigenetic age acceleration is associated with breast cancer risks in normal breast tissue. No prior work has examined the alterations in gene expression and cell compositions accompanying epigenetic aging in the breast. In this study, we aimed to fill these gaps by jointly analyzing the breast epigenome and transcriptome data.

Methods: We performed DNA methylation and gene expression profiling in breast tissue from 163 healthy women from the Susan G. Komen Tissue Bank. The DNA methylation data were used to estimate the epigenetic age and calculate the age acceleration for seven epigenetic clocks. Gene expression data were deconvolved to estimate cell type abundances using the GTEx breast single-nucleus RNA-seq reference. To dissect the differentially expressed genes intrinsically associated with age acceleration, we performed regression analysis to test the association between gene expression and age acceleration for each gene, while controlling for the confounding cell type abundances.

Results and conclusion: We evaluated the prediction performance of the seven epigenetic clocks, found the Horvath clock to be the most accurate, and used it for the downstream analysis. The association test shows that age acceleration significantly correlates with breast cancer risk (Tyrer-Cuzick 10-year score; Pearson correlation 0.235, p-value < 0.0025). Meanwhile, exhausted T cell abundance positively correlates with age acceleration (p-value 0.014), indicating reduced immune surveillance during age acceleration. Differential gene expression analysis shows no differentially expressed genes associated with age acceleration after controlling for cell composition effects, suggesting that cell compositional changes explain age acceleration. In summary, our findings show that age acceleration is associated with cancer risk, potentially representing a mechanistic link between accelerated breast aging and carcinogenesis.
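Epigenetic age acceleration is commonly defined as the residual from regressing clock-predicted age on chronological age, which is presumably the quantity analyzed above. This is a generic sketch of that definition on synthetic data, not the study's exact pipeline or the Horvath clock itself.

```python
import numpy as np

def age_acceleration(epigenetic_age, chronological_age):
    """Age acceleration as the residual from regressing epigenetic age
    on chronological age (the standard residual-based definition)."""
    X = np.column_stack([np.ones_like(chronological_age),
                         chronological_age])
    beta = np.linalg.lstsq(X, epigenetic_age, rcond=None)[0]
    return epigenetic_age - X @ beta

# Synthetic cohort of 163 donors: clock output tracks chronological
# age with some individual scatter (all values invented).
rng = np.random.default_rng(0)
chron = rng.uniform(30, 70, 163)
epi = 5.0 + 0.9 * chron + rng.normal(scale=3.0, size=163)
accel = age_acceleration(epi, chron)
```

By construction the residuals are mean-zero and uncorrelated with chronological age, so downstream correlations (e.g. with a Tyrer-Cuzick score or cell-type abundances) are not confounded by age itself.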
48 Yu Hamba General Poster only Hamba Tissue-specific expression of lincRNAs is based on the 3D genome Yu Hamba, Takashi Kamatani, Fuyuki Miya, Keith A. Boroevich, Tatsuhiko Tsunoda RIKEN Center for Integrative Medical Sciences Laboratory for Medical Science Mathematics Long intergenic noncoding RNAs (lincRNAs) are genes with independent transcription units that yield transcripts without translation. Accumulating evidence indicates that they show more cell- and tissue-specific expression patterns than protein-coding genes (PCGs), suggesting that they could be potential next-generation biomarkers and drug targets for human diseases. However, while lincRNAs are subject to active promoter regulation and post-transcriptional processing like PCGs, the molecular basis for the specificity of their expression patterns remains unclear.

Here, using lincRNA expression data in human tissues and coordinates of topologically associating domains (TADs), we show that lincRNA loci are significantly enriched in TADs and that lincRNAs in TADs have higher tissue specificity than those outside TADs. Gene pairs in the same TAD showed higher expression correlations than those in different TADs or at TAD boundaries. Based on these findings, we propose an analytical framework to identify specific molecular profiles in two ways: estimation of cis-regulation and of trans-regulation of lincRNAs. Applying it to hypertrophic cardiomyopathy (HCM)-derived expression data, we found ectopic expression of keratin as a consequence of cis-regulation, and derepression of myocyte differentiation-related genes by LINC00881 and FOXO3 as a consequence of trans-regulation.

The results highlight that lincRNA loci localize to TADs and that this localization is associated with the tissue-specific expression of lincRNAs. Our results will help initiate an understanding of the function and regulation of lincRNAs in relation to their genomic structure, and of their possible applications in medicine.
49 Jiyeon Han General Poster only Han DRSPRING – A composite model of GCN-based deep learning modules to predict the drug synergy effect Jiyeon Han, Min Ji Kang, Sanghyuk Lee Ewha Womans University Drug combination has long been explored as a promising way of developing new therapeutic methods. Biological networks that encompass protein-protein, drug-target, and gene-gene interactions are often the data analyzed to infer novel drug combinations. Diverse methods, including state-of-the-art deep learning methods, have been used to mine these complex networks. High-throughput pharmacogenomic data are valuable assets for drug development. For example, the LINCS project generated transcriptome profiling data for >1,000 cell lines perturbed by >1,000 drugs, which have been extensively used for drug repurposing. However, it is difficult to use these data to infer drug combinations because the number of matching drug and cell line pairs is too small for machine learning methods. Here, we propose a composite model to predict the drug synergy effect utilizing the graph convolutional network (GCN) method of deep learning. Our model DRSPRING (DRug Synergy PRediction by INtegrated GCN) consists of two component modules, PDIGEC and PDSS. PDIGEC predicts the gene expression change induced by a given drug, where the drug chemical structure and the gene-gene network are represented as GCNs of atoms and genes, respectively. The LINCS data were prefiltered to provide reliable training data to the GCN model, and we achieved an average correlation coefficient of 0.55 between predicted and experimental expression profiles over the LINCS drugs. The PDSS module predicts the drug synergy score for a drug pair in an identical GCN framework, but with the drug-target information and the drug-induced expression change (predicted by the PDIGEC module) added as node features of the gene-gene network.
Drug synergy scores from the DrugComb database were used for training the PDSS module, and we achieved a correlation coefficient of 0.78 between predicted and experimental synergy scores. Independent validation experiments are in progress for high-scoring drug pairs.
50 William Hardison General Poster only Hardison PharmCAT v2.0 Analysis of GeT-RM Samples William Hardison, Binglan Li, Katrin Sangkuhl, Michelle Whirl-Carrillo and Teri E. Klein Wesleyan University PharmCAT [1] (Pharmacogenomics Clinical Annotation Tool) is a bioinformatics tool used to extract CPIC [2] (Clinical Pharmacogenetics Implementation Consortium) guideline variants from a genetic dataset (represented as a VCF file), interpret the variant alleles, and generate a report with genotype-based drug prescribing recommendations that can be used to inform treatment decisions. We compared the PharmCAT v2.0 PGx allele calls to the GeT-RM [3] (Genetic Testing Reference Material Coordination Program) consensus calls available on the CDC (Centers for Disease Control and Prevention) website [4]. We obtained and performed quality control on 30x whole-genome sequencing (WGS) data for 132 GeT-RM samples that were also part of the 1000 Genomes Project cohort [5]. The VCF files for these GeT-RM samples were first run through the PharmCAT VCF Preprocessor to normalize and standardize the genotypes. The preprocessed VCF files were then fed to PharmCAT using a multi-sample analytic framework, which analyzed all 132 samples using multiple parameters. The PharmCAT JSON outputs for each sample were then converted into TSV files using an accessory R script available in the PharmCAT tutorial GitHub repository [6]. We then compared the PharmCAT PGx allele calls to the GeT-RM consensus calls. VCF files were manually checked when differences between the PharmCAT and GeT-RM calls were observed, to verify PharmCAT calls against the allele definitions defined at PharmGKB [7]. We found that PharmCAT had 100% concordant calls for CYP2B6, CYP2C19, CYP2C9, CYP3A5, CYP4F2, SLCO1B1, TPMT, UGT1A1, and VKORC1. For DPYD, 87 of the 88 samples (98.9%) had concordant calls; the one sample previously reported as *1/*1 by GeT-RM was found to carry a decreased function allele based on the WGS data.
Excluding samples with structural variations (outside the scope of PharmCAT), PharmCAT reached a 100% concordance rate with GeT-RM for CYP2D6.
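The concordance comparison between PharmCAT and GeT-RM calls reduces to matching diplotype strings per sample and gene. A minimal sketch, using hypothetical sample IDs and calls (not the actual PharmCAT or GeT-RM output formats):

```python
# Hypothetical diplotype call tables keyed by (sample, gene); the sample IDs
# and calls are illustrative, not actual PharmCAT or GeT-RM records
pharmcat_calls = {
    ("NA12878", "CYP2C19"): "*1/*2",
    ("NA12878", "DPYD"):    "*1/*2A",
    ("NA18526", "CYP2C19"): "*1/*35",
    ("NA18526", "DPYD"):    "*1/*1",
}
getrm_consensus = {
    ("NA12878", "CYP2C19"): "*1/*2",
    ("NA12878", "DPYD"):    "*1/*1",   # discordant: WGS reveals a *2A allele
    ("NA18526", "CYP2C19"): "*1/*35",
    ("NA18526", "DPYD"):    "*1/*1",
}

def concordance_by_gene(calls_a, calls_b):
    """Per-gene fraction of samples on which the two call sets agree."""
    per_gene = {}
    for (sample, gene), call in calls_a.items():
        match = call == calls_b.get((sample, gene))
        hits, total = per_gene.get(gene, (0, 0))
        per_gene[gene] = (hits + match, total + 1)
    return {g: hits / total for g, (hits, total) in per_gene.items()}
```

Discordant (sample, gene) pairs found this way would then be checked manually in the VCF, as described above.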
51 Soyoung Jeon General Poster only Jeon Evaluating genomic polygenic risk scores for childhood acute lymphoblastic leukemia in Latino children So Young Jeon, Ying Chu Lo, Libby M. Morimoto, Catherine Metayer, Xiaomei Ma, Joseph L. Wiemels, Adam J. de Smith, Charleston W.K. Chiang Keck School of Medicine, USC Polygenic risk scores (PRS) can improve risk stratification among patients, particularly for European ancestry cohorts. However, the utility of these models has not been evaluated for acute lymphoblastic leukemia (ALL), the most common form of childhood cancer in the US. Previous PRS models for ALL were based only on known GWAS loci, even though genomic PRS models have been shown to improve model efficacy for a number of complex diseases. The transferability of PRS models for ALL has also not been studied. Latino children have the highest risk of ALL but may benefit less from PRS-based risk stratification if PRS models transfer poorly to populations of non-European ancestries. We thus set out to train and evaluate genomic PRS models using two methods, pruning-and-thresholding and LDpred2, based on either a European-ancestry-only GWAS or a multi-ethnic GWAS that we previously performed. We found that the best PRS model trained in European Americans performed similarly in held-out European American (EA) and Latino (LAT) samples (PseudoR2 = 0.030 in EA vs. 0.028 in LAT). The performance of the PRS model was also similar across LAT individuals with greater or lesser inferred European ancestry (PseudoR2 = 0.009, 0.014, 0.006, in tertiles of LAT by decreasing European ancestry). The best PRS model can be further improved for LAT children if we utilize LAT sample LD in training (PseudoR2 = 0.017), or if the multi-ethnic GWAS is used (PseudoR2 = 0.13). Our results thus suggest that larger and more inclusive GWAS will further improve PRS performance.
Moreover, the comparable performance between populations may suggest a more oligogenic architecture for ALL, in which some large-effect loci may be shared between populations. As such, future PRS models that move away from the assumption of infinitely many causal loci may further improve PRS for ALL.
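A PRS of the kind evaluated here is, at its core, a weighted sum of risk-allele dosages. A minimal sketch with simulated genotypes and effect sizes; the pruning-and-thresholding step is reduced to a simple effect-size cutoff for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_snp = 1000, 500

# Hypothetical GWAS effect sizes and genotype dosages (0/1/2 risk alleles)
beta = rng.normal(0, 0.05, n_snp)
freq = rng.uniform(0.05, 0.5, n_snp)
dosage = rng.binomial(2, freq, size=(n_ind, n_snp))

# A PRS is a weighted sum of risk-allele dosages; pruning-and-thresholding
# restricts the sum to (near-)independent SNPs passing a p-value cutoff,
# reduced here to a simple effect-size threshold for illustration
keep = np.abs(beta) > 0.05
prs = dosage[:, keep] @ beta[keep]
prs_std = (prs - prs.mean()) / prs.std()   # standardized score per individual
```

LDpred2 instead re-weights all SNPs using LD information, which is why the choice of LD reference panel (EA vs. LAT) matters in the abstract's comparison.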

52 Meihong Jin General Poster only Jin Integration of multi-omics and immunogenicity data enhances the classification of cancer phenotypes Meihong Jin, Kwoneel Kim Department of Biomedical and Pharmaceutical Sciences, Kyung Hee University, Seoul, Korea There is an urgent need for an integrated data model addressing both cancer multi-omics and tumor immune microenvironment features. Here, we constructed an integrated data model including multi-omics and immunogenicity data from the large cancer cohort databases TCGA, PCAWG, and TCIA. We curated the data by applying a uniform normalization strategy and validated that the normalization was effective in clustering normal and cancer groups. Furthermore, we predicted the cancer phenotypes of metastasis and MSI status using the integrated data model. The prediction model constructed from multi-omics data showed better classification performance than models based on single-omics data. Moreover, the model based on both multi-omics and immunogenicity data had the best prediction performance consistently across different cancer types. The ranks of the features most important for prediction differed across cancer types, indicating that the explanatory power of each feature depends on the context of the relevant cancer type. These results imply that cancer phenotypes can be predicted more accurately when multi-omics and tumor immune microenvironment data are considered together.
53 Doyun Kwon General Poster only Kwon Non-invasive prediction of massive transfusion during surgery using intraoperative hemodynamic monitoring data Doyun Kwon, Young Mi Jung, Taekyoung Kim, Kwangsoo Kim, Garam Lee, Dokyoon Kim, Hyung-Chul Lee, Seung-Bo Lee, Seung Mi Lee Seoul National University College of Medicine Introduction: If massive bleeding occurs during surgery, failure to receive a prompt blood transfusion leads to serious postoperative complications. For the timely preparation of blood products, predicting the possibility of massive transfusion (MT) is essential to decrease morbidity and mortality. We recently reported that intraoperative vital parameters from invasive monitoring (intra-arterial blood pressure monitoring) can be used for real-time prediction of MT. However, most surgeries are performed without intra-arterial blood pressure monitoring, and massive bleeding can occur even in these surgeries. Therefore, the purpose of this study was to develop a model for predicting MT 10 minutes in advance using non-invasive bio-signal waveforms that change in real time.

Methods: In this retrospective study, we developed a deep learning-based algorithm (DLA) to predict intraoperative MT 10 minutes in advance. MT was defined as the transfusion of 3 or more units of red blood cells within an hour. The datasets consisted of 13,835 patients who underwent surgery at Seoul National University Hospital (SNUH), used for model development and internal validation, and 528 patients who underwent surgery at Boramae Medical Center (BMC), used for external validation. We used features extracted from plethysmography (collected at 500 Hz), non-invasive blood pressure (NIBP), and hematocrit measured during surgery to construct the DLA.

Results: Among the 13,835 patients at SNUH and 528 patients at BMC, 232 (1.7%) and 7 (1.3%), respectively, received MT during surgery. The area under the receiver operating characteristic curve (AUROC) of the DLA for predicting intraoperative MT 10 minutes in advance was 0.94 (95% CI, 0.910-0.970) in internal validation and 0.93 (95% CI, 0.883-0.972) in external validation.

Conclusions: The DLA can successfully predict intraoperative MT using non-invasive bio-signal waveforms.
54 Kord Kober General Poster only Le An Evaluation of a Meta-Dimensional Approach to Combine Multimodal Omics and Phenotypic Data to Identify Subgroups of Patients Associated with Morning Cancer-Related Fatigue Severity Caroline Le, Christine Miaskowski, Ritu Roy, Adam Olshen, Kord M. Kober University of California San Francisco Cancer-related fatigue (CRF) is the most common symptom associated with cancer and its treatments. CRF has a negative impact on patients’ ability to tolerate treatments and on their quality of life (QOL). A major knowledge gap for CRF is the lack of understanding of its underlying mechanisms; increased knowledge of these mechanisms could identify potential therapeutic targets. CRF is a complex phenotype that results from environmental (e.g., demographic, clinical, social) and molecular factors. By integrating data from multiple sources, we gain power to identify and better interpret omics-phenotype relationships relative to an analysis that uses only a single source of omics data. The similarity network fusion (SNF) method uses networks of samples across different data types as a basis for integration. The fused network can identify groups of patients that have shared and complementary information across the data types. The purpose of this study was to create a fused network using multimodal data types and (1) evaluate this network for joint data profiles (JDPs) of patients, (2) evaluate the JDPs for an association with morning CRF, and (3) evaluate the patient characteristics associated with JDPs.

The analysis evaluated transcriptomic, epigenomic, and phenotypic (i.e., age, income, performance, comorbidity burden) data for 299 oncology patients receiving chemotherapy (CTX). Patients with similar joint data profiles in the fused similarity network were identified using spectral clustering (k = {2, 3, 4, 5}). Across the analyses, the phenotype network had the largest contribution to the unified network. JDPs differed in mean levels of morning fatigue for k ≥ 3 (k=3, F=17.52, p-value=6.57x10^-8; k=4, F=16.71, p-value=4.95x10^-10; k=5, F=13.86, p-value=2.42x10^-10). This approach demonstrated the potential to identify clinically informative groups of patients through the fusion of different data types. These profiles are useful for identifying high-risk patients and novel therapeutic targets.
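The analysis pipeline, fusing per-datatype similarity networks, spectral clustering, and an ANOVA on morning fatigue across the resulting profiles, can be sketched on toy data. Note that the fusion step below is a simple average of affinity matrices, a stand-in for the iterative network diffusion that SNF actually performs:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(2)
n = 90
groups = np.repeat([0, 1, 2], 30)   # latent patient groups

def affinity(data, sigma=1.0):
    """RBF patient-by-patient similarity matrix for one data type."""
    d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2 * data.shape[1]))

# Three toy data types (e.g., transcriptomic, epigenomic, phenotypic)
# sharing the same latent groups
views = [rng.normal(groups[:, None] * 2.0, 1.0, (n, 20)) for _ in range(3)]

# Naive fusion: average the per-view affinities (true SNF iteratively
# diffuses each network through the others; the mean is a simple stand-in)
fused = sum(affinity(v) for v in views) / 3

# Joint data profiles via spectral clustering on the fused network
jdp = SpectralClustering(n_clusters=3, affinity="precomputed",
                         random_state=0).fit_predict(fused)

# ANOVA: does a phenotype (e.g., morning fatigue) differ across profiles?
fatigue = groups * 1.5 + rng.normal(0, 1, n)
f_stat, p = stats.f_oneway(*(fatigue[jdp == k] for k in range(3)))
```

In the real analysis, k would be varied (2-5 here) and the F-test repeated per k, as in the results above.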
55 Inkyung Jung General Poster only Lee Characterization of altered molecular mechanisms in Parkinson’s disease through cell type-resolved multi-omics analyses Andrew J. Lee, Changyoun Kim, Eliezer Masliah, Inkyung Jung Department of Biological Sciences / Korea Advanced Institute of Science and Technology  Parkinson’s disease (PD) is a progressive neurodegenerative disorder. However, cell type-dependent transcriptional regulatory programs responsible for PD pathogenesis remain elusive. Here, we establish transcriptomic and epigenomic landscapes of the substantia nigra (SN) by profiling 113,207 nuclei obtained from healthy controls and PD patients. Our multi-omic data integration provides cell type annotation of 128,724 cis-regulatory elements (cREs), and uncovers cell type-specific cRE dysregulations with a strong transcriptional influence on genes implicated in PD. The establishment of high-resolution three-dimensional chromatin contact maps identifies 656 target genes of dysregulated cREs and genetic risk loci, uncovering both novel candidates and known PD risk genes. Notably, these candidate genes exhibit modular gene expression patterns with unique molecular signatures in distinct cell types, highlighting altered molecular mechanisms in dopaminergic neurons and glial cells including oligodendrocytes and microglia. Together, our single-cell transcriptome and epigenome uncover cell type-specific disruption in transcriptional regulations related to Parkinson’s diseases.
56 SUNG HWAN LEE General Poster only LEE Comprehensive profiling of multi-OMIC datasets uncovers cancer-specific molecular subtypes with clinical relevance in pancreatic cancer. Sung Hwan LEE CHA University School of Medicine Introduction: Pancreatic cancer is lethal, with a dismal prognosis and therapeutic resistance. Previous molecular subtypes derived from the genome or transcriptome have not shown clinical relevance for precision strategies, i.e., precise patient classification followed by selection of optimal therapeutic options. This study aims to uncover consensus molecular subtypes from cancer-specific multi-omics data that reveal clinically relevant therapeutic opportunities.

Methods: We performed comprehensive analyses using the dataset from the cancer dependency map (DepMap) project, including cancer-specific molecular characterization with multi-omics data, genome-wide loss-of-function screening using the CRISPR-Cas9 system, and cancer drug sensitivity. The subtype-specific molecular signatures were validated in independent translational cohorts (TCGA-PAAD; n=150, ICGC-PACA-AU; n=461, ICGC-PACA-CA; n=317).

Results: Integrative profiling of multi-omics molecular layers (mutational signature, copy number alteration, transcriptome, microRNA, chromatin profile, proteome, and metabolome) from pancreatic cancer cell lines (n=59) from the Cancer Cell Line Encyclopedia (CCLE) revealed a total of three cancer-specific molecular subtypes showing distinct tumor biology across all omics layers as well as clinical relevance with unique molecular dependencies. Major molecular features of each subtype were reproducible in the validation cohorts. Subtype-specific molecular biomarkers, including mutational signatures and metabolites, were identified. Finally, target drugs with subtype-specific genetic dependency were analyzed to provide a precision strategy according to each subtype's distinct molecular characterization.

Conclusions: Integrative profiling of multi-omics molecular layers revealed precision strategies based on cancer-specific molecular subtypes of pancreatic cancer in terms of tumor classification and discriminative therapeutic opportunities. Prospective translational studies accompanied by clinical trials based on cancer-specific molecular subtypes are needed to establish a precision strategy for managing pancreatic cancer.

57 Steven Brenner General Poster only Lin Is expression or splicing more important in an RNA-seq study? Yu-Jen (Jennifer) Lin, Zhiqiang Hu, Steven E. Brenner University of California, Berkeley RNA-seq has been widely used in biological and medical research because of its capability to quantify transcriptome changes. Researchers usually use their impressions and experience to choose whether to analyze changes in gene expression or alternative splicing levels. However, often the answer is not obvious, and the different pipelines and analytical approaches mean there is no systematic way to compare the importance of gene expression changes with splicing alterations.

We have developed statistical strategies to address the relative roles of gene expression and splicing changes in the transcriptome and to provide insights for further analyses. We first quantify the impact of treatment on gene expression and on splicing by computing the between-group variation between the control and treated groups. To put the effects of gene expression and splicing changes on the same scale for comparison, we calculate the proportion of the between-group variation relative to the total variation for gene expression and splicing changes, respectively. Finally, we use a t-test to assess whether the proportions of between-group variation for gene expression and splicing changes differ significantly, providing insight into their relative roles in the transcriptome. As an example, we categorized 25 RNA-seq datasets from single-gene transcription- and splicing-factor knockdown or overexpression studies. We have then applied our method to knockdown or overexpression studies of genes with previously unknown roles in the transcriptome, yielding new insights into their function. Such evaluation of the relative importance of gene expression and splicing should be broadly considered in analyzing transcriptomic datasets.
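The proposed statistic, the proportion of total variation attributable to the control-vs-treated split, computed per gene for expression and for splicing and then compared by a t-test, can be sketched as follows on simulated data (effect sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_genes = 200

def between_group_proportion(ctrl, trt):
    """Per-gene fraction of total variation explained by the
    control-vs-treated split (rows = genes, columns = replicates)."""
    both = np.concatenate([ctrl, trt], axis=1)
    grand = both.mean(axis=1, keepdims=True)
    between = (ctrl.shape[1] * (ctrl.mean(1, keepdims=True) - grand) ** 2
               + trt.shape[1] * (trt.mean(1, keepdims=True) - grand) ** 2)
    total = ((both - grand) ** 2).sum(axis=1, keepdims=True)
    return (between / total).ravel()

# Toy knockdown study: a strong expression shift, a weak splicing (PSI) shift
expr_ctrl = rng.normal(0.0, 1, (n_genes, 5))
expr_trt  = rng.normal(1.5, 1, (n_genes, 5))
psi_ctrl  = rng.normal(0.0, 1, (n_genes, 5))
psi_trt   = rng.normal(0.3, 1, (n_genes, 5))

prop_expr = between_group_proportion(expr_ctrl, expr_trt)
prop_psi  = between_group_proportion(psi_ctrl, psi_trt)

# Because both are proportions of variance, they sit on the same scale
t_stat, pval = stats.ttest_ind(prop_expr, prop_psi)
```

A small p-value with a larger mean proportion for expression would indicate, for this toy study, that expression changes dominate splicing changes.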

58 Chao Chun Liu General Poster only Liu Supervised machine learning reveals high efficacy of mobile elements to discriminate Salmonella outbreaks Chao Chun Liu, William Hsiao Simon Fraser University Since its introduction in the 21st century, genomic epidemiology has revolutionized modern-day public health surveillance and responses to infectious disease outbreaks. The high-resolution typing of pathogens empowered by genomics, in conjunction with the contextualization of diseases by epidemiology, has contributed to great successes in identifying outbreak origins and linking related cases. Existing methods to infer clonal strains in surveillance settings primarily involve the comparison of genetic variation in a stable segment of pathogen genomes known as the core genome. As a consequence, non-conserved genetic elements are often neglected and contribute no information to the genetic distance between two query genomes. To demonstrate the analytical value of non-conserved elements for linking cases that share a common cause of infection, our study characterized a comprehensive set of genetic features predictive of outbreak origins by training multivariate regularized regression models on 24 historical foodborne outbreaks caused by Salmonella enterica. In total, 5,037 genetic features of high predictive value were identified, consisting of short deletions, large insertions, nucleotide substitutions, and carriage of extrachromosomal elements. The outbreak-predictive features included a wide range of non-conserved genetic elements found to be unique to specific outbreaks, such as plasmids, CRISPR spacers, and phage genomes. We rationalized that many of these non-conserved elements have high predictive value due to the strong environmental influence on the transmission dynamics and evolution of these features in bacterial populations.
Sequence comparison of the predictive elements identified in our study can complement current analytical practices for cluster detection and outbreak tracing, improving the concordance between genomic inferences and epidemiology.
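The core modeling step, multivariate regularized regression selecting outbreak-predictive presence/absence features, can be sketched with L1-regularized logistic regression on toy data; the planted marker features stand in for outbreak-specific plasmids, spacers, or phage genomes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_isolates, n_feat = 120, 300

# Presence/absence of accessory genetic features per isolate; three toy
# outbreaks as labels, each with five planted outbreak-specific markers
X = rng.integers(0, 2, (n_isolates, n_feat)).astype(float)
y = np.repeat([0, 1, 2], 40)
for k in range(3):
    X[y == k, 10 * k:10 * k + 5] = 1.0
    X[y != k, 10 * k:10 * k + 5] = 0.0

# L1 regularization drives the weights of non-predictive features to zero,
# leaving a sparse set of outbreak-discriminating features
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5,
                         max_iter=5000).fit(X, y)
n_predictive = int(np.count_nonzero(np.abs(clf.coef_).max(axis=0) > 1e-6))
```

The surviving nonzero-weight features are the analogue of the 5,037 predictive features reported in the abstract.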
59 Nicole Martinez-Martin General Poster only Martinez-Martin Researcher Perspectives on Bias and Fairness for Mental Health Applications of Digital Phenotyping Nicole Martinez-Martin Digital phenotyping refers to approaches that collect biometric and personal data in situ from digital devices, such as smartphones, wearables, or social media, and then conduct analyses to generate moment-by-moment quantification of a person’s mental state or predict future behavior. For example, data on finger taps or voice features could be tracked via a person’s smartphone and then analyzed to measure behavior, physiological states, or cognitive functioning. Digital phenotyping projects increasingly include multimodal data, such as electronic health records (EHRs), facial recognition technology, ambient sensors, biological scans, or genomic information. We conducted qualitative interviews with 20 researchers who are working on digital phenotyping applications for mental and behavioral health. We analyzed the interviews for themes relevant to ethical issues in digital phenotyping. As with other medical tools that utilize AI, bias and fairness were recognized as significant areas of concern for digital phenotyping. Participants discussed the need to increase the representation of participants from minoritized groups in the training data, as well as technological solutions to address bias in data sets and algorithms. While these are important steps, several participants noted that bias needs to be addressed at all stages of the development process. Addressing fairness and bias in digital phenotyping was seen as complicated by the different ways in which social inequities shape the development and uses of digital phenotyping tools.
60 E. Sila Ozdemir General Poster only Ozdemir A computationally guided approach to improve soluble expression of VHH binders E. Sila Ozdemir, Jessica Tolley, Eli Wagnell, Florian Goncalves, Michelle Gomes, Corey Dambacher, Viktoriya Dubrovskaya, Bruce Branchaud, Srivathsan V. Ranganathan Oregon Health & Science University Nanobodies are variable heavy chain fragments derived from camelid antibodies. They have recently shown promise as high-affinity reagents for targeting biomolecules. Nanobodies offer higher stability compared to antibodies, and their smaller size (~15-20 kDa) allows better targeting of molecules localized inside the cell and in crowded environments, like tissues and protein aggregates. We set up a computational pipeline for measuring surface hydrophobicity by calculating the Spatial Aggregation Propensity (SAP) of the nanobody binders. SAP maps for these binders highlight hydrophobic hotspots at single amino acid resolution, which were subsequently used to guide mutagenesis of the binders for soluble expression. Some of the clones with introduced mutations showed a remarkable increase in yield in 500 ml cultures (up to a 40x increase in expression). With the increased soluble expression of the binders, we were able to characterize their binding to KRas (WT and G12D) by soluble ELISA and biolayer interferometry, toward evaluating their potential as affinity reagents for imaging applications, diagnostics, and therapeutics.
61 Jisoo Park General Poster only Park The landscape of neoantigen in Korean NSCLC patients Jisoo Park, Kwoneel Kim Department of Biology, Kyung Hee University Lung cancer is one of the major worldwide health problems, accounting for 18% of global cancer-related deaths (Sung, Ferlay et al. 2021). About 80 to 85 percent of lung cancers are non-small cell lung cancer (NSCLC). Lung adenocarcinoma (LUAD) is the most common type of lung cancer seen in both non-smokers and smokers. Therefore, understanding the genetic and epigenetic interactions between genes is important for understanding lung cancer. We conducted research on NSCLC and the immune landscape. In this analysis, we profiled putative regulators in Korean NSCLC patients. Our co-researcher team divided the samples into subtypes using the Non-negative Matrix Factorization (NMF) clustering method on global proteome, phosphoproteome, and acetylproteome data. We hypothesized that neoantigens are favorably associated with survival in NSCLC patients.

We analyzed neoantigen counts and antigen processing and presentation machinery (APM) together, expecting the results to differ for each ‘multi-omics subtype’. We classified neoantigens into four subtypes. We then performed a regression analysis between neoantigen subtypes and overall survival time and found a strong negative correlation with ‘validated novel neoantigen’, one of the subtypes. We also found some validated novel neoantigens that are commonly expressed across patients, and these validated novel neoantigens act as a favorable factor for patient survival. Moreover, we performed a survival analysis of validated novel neoantigens and APM together and found that combining these two factors has high explanatory power. Lastly, examining the patterns of the multi-omics subtypes with respect to validated novel neoantigens and APM, we found that the immune-system patterns matched the survival patterns of patients in each multi-omics subtype.
62 Predrag Radivojac General Poster only Ramola The field of protein function prediction as viewed by different domain scientists Rashika Ramola, Iddo Friedberg, Predrag Radivojac Northeastern University Experimental biologists, biocurators, and computational biologists all play a role in characterizing a protein's function. The discovery of protein function in the laboratory by experimental scientists is the foundation of our knowledge about proteins. Experimental findings are compiled in knowledgebases by biocurators to provide standardized, readily accessible, and computationally amenable information. Computational biologists train their methods using these data to predict protein function and guide subsequent experiments. To understand the state of affairs in this ecosystem, centered here around protein function prediction, we surveyed scientists from these three constituent communities. We show that the three communities have common but also idiosyncratic perspectives on the field. Most strikingly, experimentalists rarely use state-of-the-art prediction software, but when presented with predictions, report many to be surprising and useful. Ontologies appear to be highly valued by biocurators, less so by experimentalists and computational biologists, yet controlled vocabularies bridge the communities and simplify the prediction task. Additionally, many software tools are not readily accessible and the predictions presented to the users can be broad and uninformative. We conclude that to meet both the social and technical challenges in the field, a more productive and meaningful interaction between members of the core communities is necessary.  
63 Neethu Shah General Poster only Shah ClinGen Resources for Generation And Dissemination of ACMG Criteria Specifications for Variant Classification by Variant Curation Expert Panels Neethu Shah, Andrew R. Jackson, Arturo Alejandro Zuniga, Tierra Farris, Danielle Azzariti, Marina DiStefano, Steven. M. Harrison, Mark E. Mandell, Christine G. Preston, Jennifer L. Goldstein, Matt W. Wright, Teri E. Klein, Sharon E. Plon, Aleksandar Milosavljevic Baylor College of Medicine The ClinGen Sequence Variant Interpretation (SVI) Working Group provides general recommendations for using ACMG/AMP criteria codes to enhance consistency in usage and transparency of variant classification rationales. In addition to following these general recommendations, ClinGen Variant Curation Expert Panels (VCEPs) define and apply their own gene-disease specific recommendations for each of these ACMG evidence codes. We here present multiple tools that facilitate generation and dissemination of these specifications in both human- and machine-readable formats.

ClinGen Criteria Specification (CSpec) Editor is a centralized system to create, manage, version, and disseminate criteria specifications, which are created by the VCEPs and approved by the SVI VCEP Review Committee. The CSpec Registry is a public repository of all the SVI-approved specifications and provides a user-friendly web interface for browsing, filtering, and searching criteria specifications by gene and disease. The CSpec Registry, as of November 2022, contains 29 approved criteria specifications, defined by 21 VCEPs for more than 50 genes. The CSpec data messaging service publishes SVI-approved specifications to the ClinGen Data Exchange, thereby notifying other ClinGen curation tools, like the Variant Curation Interface (VCI), of the approval and release of these specifications. This enables CSpec knowledge integration into VCI curation and other ClinGen workflows, and provides curators direct access to up-to-date specifications as they classify variants. The CSpec Registry REST-API service provides programmatic access to the structured content in JSON and JSON-LD formats so that other resources, including those external to ClinGen, can integrate ClinGen-approved evidence code specifications into their variant classification and genome interpretation processes. By allowing both programmatic and web user interfaces for developing and releasing criteria specifications, these tools elevate the transparency and consistency of variant curation processes across ClinGen while empowering research and clinical communities globally.
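Programmatic consumers of the REST-API would filter the returned JSON for approved specifications. A sketch on a hypothetical payload; the field names below are illustrative, not the actual CSpec Registry schema:

```python
import json

# Hypothetical excerpt of a criteria-specification payload; the field names
# are illustrative, not the actual CSpec Registry JSON/JSON-LD schema
payload = json.loads("""
[
  {"vcep": "CDH1 VCEP", "gene": "CDH1", "status": "SVI approved",
   "rules": [{"code": "PVS1", "strength": "Very Strong"}]},
  {"vcep": "Myeloid Malignancy VCEP", "gene": "RUNX1", "status": "in review",
   "rules": [{"code": "PM1", "strength": "Moderate"}]}
]
""")

def approved_specs_for(specs, gene):
    """Return the SVI-approved criteria specifications covering a gene."""
    return [s for s in specs
            if s["status"] == "SVI approved" and s["gene"] == gene]
```

An external variant-classification pipeline could apply such a filter before mapping each specification's gene-specific ACMG/AMP evidence codes onto its own rules engine.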
64 SeokBo Shim General Poster only Shim The study of the immune microenvironment of Korean NSCLC patients Kwoneel Kim Department of Biology, Kyung Hee University Lung cancer is one of the major worldwide health problems, accounting for 18% of global cancer-related deaths (Sung, Ferlay et al. 2021). About 80 to 85 percent of lung cancers are non-small cell lung cancer (NSCLC). Lung adenocarcinoma (LUAD) is the most common type of lung cancer seen in both non-smokers and smokers. Therefore, understanding the genetic and epigenetic interactions between genes is important for understanding lung cancer. We conducted research on NSCLC and the immune landscape.

We hypothesized that patients with high immune cell activity would have better survival than those with low immune cell activity, and that this difference would be driven by major genes involved in immune activity.

In this analysis, we profiled putative regulators in Korean NSCLC patients. Our co-researcher team divided the samples into subtypes using the Non-negative Matrix Factorization (NMF) clustering method on global proteome, phosphoproteome, and acetylproteome data. We attempted to account for these sample subtypes by clustering patient-specific immune composition values across genes and samples. Hot-tumor-enriched (HTE) and cold-tumor-enriched (CTE) clusters were obtained through clustering on immune composition values, and survival analysis was performed between these clusters. The survival analysis confirmed that the group with activated immune cells was associated with longer survival. Lastly, we performed pathway analysis in the immune clusters at the multi-omics level. We found that immune-related pathways were enriched in HTE tumors, whereas cell cycle-related pathways were enriched in CTE tumors at the multi-omics level.
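The hot/cold split above can be illustrated with a toy two-means clustering of a single per-sample immune-activity score. This is a minimal sketch with invented scores; the study's actual clustering used full immune composition profiles, not a one-dimensional score.

```python
# Toy 1-D k-means (k=2) splitting samples into "cold" (0) and "hot" (1) groups.
# Scores are invented for illustration.

def two_means_1d(scores, iters=100):
    """Return (low_centroid, high_centroid, labels) for a 2-cluster 1-D k-means."""
    lo, hi = min(scores), max(scores)
    for _ in range(iters):
        # Assign each sample to its nearest centroid.
        labels = [1 if abs(s - hi) < abs(s - lo) else 0 for s in scores]
        group0 = [s for s, l in zip(scores, labels) if l == 0]
        group1 = [s for s, l in zip(scores, labels) if l == 1]
        new_lo = sum(group0) / len(group0) if group0 else lo
        new_hi = sum(group1) / len(group1) if group1 else hi
        if (new_lo, new_hi) == (lo, hi):  # converged
            break
        lo, hi = new_lo, new_hi
    return lo, hi, labels

immune_scores = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]  # per-sample immune activity
lo, hi, labels = two_means_1d(immune_scores)
# labels: 0 = cold-tumor-enriched (CTE), 1 = hot-tumor-enriched (HTE)
```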
65 Onur Mutlu General Poster only Singh RUBICON: A Framework for Designing Efficient Deep Learning-Based Genomic Basecallers Gagandeep Singh, Mohammed Alser, Alireza Khodamoradi, Kristof Denolf, Can Firtina, Meryem Banu Cavlak, Henk Corporaal, Onur Mutlu ETH Zurich Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome. Nanopore sequencing generates noisy electrical signals that need to be converted into a standard string of DNA nucleotide bases using a computational step called basecalling. The accuracy and speed of basecalling have critical implications for all later steps in genome analysis. Many researchers adopt complex deep learning-based models to perform basecalling without considering the compute demands of such models, which leads to slow, inefficient, and memory-hungry basecallers. Therefore, there is a need to reduce the computation and memory cost of basecalling while maintaining accuracy. However, developing a very fast basecaller that can provide high accuracy requires a deep understanding of genome sequencing, machine learning, and hardware design. Our goal is to develop a comprehensive framework for creating deep learning-based basecallers that provide high efficiency and performance. We introduce RUBICON, a framework to develop hardware-optimized basecallers. RUBICON consists of two novel machine-learning techniques that are specifically designed for basecalling. First, we introduce the first quantization-aware basecalling neural architecture search (QABAS) framework, which specializes the basecalling neural network architecture for a given hardware acceleration platform while jointly exploring and finding the best bit-width precision for each neural network layer. Second, we develop SkipClip, the first technique to remove the skip connections present in modern basecallers, greatly reducing resource and storage requirements without any loss in basecalling accuracy.
We demonstrate the benefits of RUBICON by developing RUBICALL, the first hardware-optimized basecaller that performs fast and accurate basecalling. Compared to the fastest state-of-the-art basecaller, RUBICALL provides a 3.19x speedup with 2.97% higher accuracy. Compared to the most accurate state-of-the-art basecaller, RUBICALL provides a 35.42x speedup at the expense of 2.55% lower accuracy. We show that RUBICON helps researchers develop hardware-optimized basecallers that are superior to expert-designed models.
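The per-layer bit-width choice that a search like QABAS explores can be illustrated with a toy symmetric uniform quantizer. This is a generic sketch, not RUBICON's implementation; the weight values are invented.

```python
# Toy symmetric uniform quantization: fewer bits -> fewer representable levels
# -> larger rounding error, which is the accuracy/efficiency trade-off a
# quantization-aware search navigates per layer.

def quantize(weights, bits):
    """Quantize a list of floats to signed `bits`-bit codes; return
    (dequantized values, scale)."""
    levels = 2 ** (bits - 1) - 1                     # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels or 1.0
    codes = [round(w / scale) for w in weights]      # integer codes
    return [c * scale for c in codes], scale         # dequantized values

weights = [0.5, -1.0, 0.25, 0.75]
deq8, _ = quantize(weights, 8)   # fine-grained: small reconstruction error
deq2, _ = quantize(weights, 2)   # coarse: only codes {-1, 0, 1}
```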
66 Pankhuri Singhal General Poster only Singhal DETECT: Feature extraction method for disease trajectory modeling Pankhuri Singhal, Lindsay Guare, Colleen Morse, Marta Byrska-Bishop, Marie A. Guerraty, Dokyoon Kim, Marylyn D. Ritchie, and Anurag Verma University of Pennsylvania Modeling with longitudinal electronic health record (EHR) data is challenging given the high dimensionality, redundancy, and noise captured in EHRs. To improve precision medicine strategies and identify predictors of disease risk in advance, evaluating meaningful patient disease trajectories is essential. In this study, we develop the algorithm DiseasE Trajectory fEature extraCTion (DETECT) for clinical feature extraction and trajectory generation in high-throughput temporal EHR data. This algorithm can 1) simulate longitudinal individual-level EHR data according to user-specified parameters of scale, complexity, and noise, and 2) use a convergent relative risk statistical framework to test whether intermediate diagnosis codes occurring between specified index code(s) and outcome code(s) are predictive conditions of the outcome. We benchmarked our method on simulated data and used DETECT to generate real-world disease trajectories for cardiometabolic and neurological outcomes in a cohort of 145,575 individuals diagnosed with hypertension in the Penn Medicine EHR. A study is currently underway to train a recurrent neural network (RNN) model on the generated patient disease trajectories, along with lab, medication, procedural, vital, and demographic data, to predict subsequent clinical events in patient paths.
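At the core of testing whether an intermediate code is predictive is a standard relative-risk computation, sketched below with invented counts; DETECT's actual convergent relative risk framework is more involved than this.

```python
# Relative risk of an outcome among patients who received an intermediate
# diagnosis code versus those who did not. Counts are hypothetical.

def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """RR = risk in exposed / risk in unexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed

# e.g. 40/200 patients with the intermediate code develop the outcome,
# versus 20/400 without it: RR = 0.20 / 0.05 = 4.0
rr = relative_risk(40, 200, 20, 400)
```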
67 Subin Cho General Poster only Subin Single cell-level analysis of responses to cancer immunotherapy using a syngeneic murine model of lung cancer Subin Cho, Jaesang Kim, Sanghyuk Lee EWHA Womans University Immunotherapy is currently the most promising treatment modality for various types of cancers. It has been demonstrated that targeting immune checkpoints with PD-1 or CTLA-4 blockade can lead to long-term survival or complete remission from cancers. However, immunotherapy is limited in efficacy, as only a minority of patients benefit from such therapies. Clearly, closer scrutiny of responses from various cells found in tumor tissues during immunotherapy would greatly enhance our understanding of the interactions within the tumor microenvironment and possibly point to novel targets. Here, we report a longitudinal study using a Lewis lung carcinoma syngeneic mouse model and anti-PD-1 antibody treatment. Specifically, we performed single-cell RNAseq using tumor tissues isolated at multiple time points and attempted to reconstruct potentially significant cellular interactions in response to the immunotherapy. Interestingly, natural killer cells in particular were activated by anti-PD-1 antibody treatment and expressed elevated levels of Xcl-1, which has been implicated in recruiting conventional dendritic cells (cDCs). We present results from gene expression analysis suggesting enhanced antigen presentation by cDCs, and strategies for demonstrating the functionalities of altered cellular responses to immunotherapy.
68 David Wishart General Poster only Wishart MiMeDB: The Human Microbial Metabolome Database David S. Wishart, Eponine Oler, Harrison Peters, AnChi Guo, Sagan Girod, Scott Han, Sukanta Saha, Vicki W. Lui, Marcia LeVatte, Vasuk Gautam, Rima Kaddurah-Daouk, Naama Karu University of Alberta The Human Microbial Metabolome DataBase, or MiMeDB, is a comprehensive, multi-omic microbiome resource that connects: 1) microbes to microbial genomes; 2) microbial genomes to microbial metabolites; 3) microbial metabolites to the human exposome; and 4) all of these "omes" to human health. MiMeDB was established to consolidate the growing body of data connecting the human microbiome and the chemicals it produces to both health and disease. MiMeDB contains detailed taxonomic, microbiological, and body-site location data on most known human microbes (bacteria and fungi). These microbial data are linked to extensive genomic and proteomic sequence data that is closely coupled to colourful interactive chromosomal maps. The database also houses detailed information about all the known metabolites generated by these microbes; their structural, chemical, and spectral properties; the reactions and enzymes responsible for these metabolites; and the primary exposome sources (food, drug, cosmetic, pollutant, etc.) that ultimately lead to the observed microbial metabolites in humans. Additional, extensively referenced data about the known or presumptive health effects, measured biosample concentrations, and human protein targets for these compounds is provided. All of this information is housed in a richly annotated, highly interactive, visually pleasing database that has been designed to be easy to search, browse, and navigate. Currently, MiMeDB contains data on 626 health effects or bioactivities, 1,904 microbes, 3,112 references, 22,054 reactions, 24,254 metabolites or exposure chemicals, 648,861 MS and NMR spectra, 6.4 million genes, and 7.6 billion DNA bases.
We believe that MiMeDB represents the kind of integrated, multi-omic or systems biology database that is needed to enable comprehensive multi-omic integration.
69 Lixing Yang General Poster only Yang Mutational signatures of complex genomic rearrangements in human cancer Lisui Bao, Xiaoming Zhong, Yang Yang, Lixing Yang University of Chicago Complex genomic rearrangements (CGRs) are common in cancer and are known to form via two aberrant cellular structures: micronuclei and chromatin bridges. However, which mechanism is more relevant to CGR formation in cancer, and whether there are other undiscovered mechanisms, remains unknown. Here we developed a computational algorithm, 'Starfish', to analyze 2,014 CGRs from 2,428 whole-genome-sequenced tumors and discovered six CGR signatures based on their copy number and breakpoint patterns. Through extensive benchmarking, we show that our CGR signatures are highly accurate and biologically meaningful. Three signatures can be attributed to known biological processes: micronuclei- and chromatin-bridge-induced chromothripsis and circular extrachromosomal DNA. More than half of the CGRs belong to the remaining three signatures, which have not been reported previously. A unique signature, which we named "hourglass chromothripsis", with localized breakpoints and a small amount of DNA loss, is abundant in prostate cancer. We find that SPOP is associated with hourglass chromothripsis and may play an important role in maintaining genome integrity.
70 Shuang Yang General Poster only Yang Rate of adverse cardiovascular events in breast cancer patients receiving chemotherapy and targeted therapy Shuang Yang, Dongyu Zhang, Jiang Bian, Dejana Braithwaite, Yi Guo University of Florida Breast cancer is the most common cancer and the second leading cause of cancer death in women in the United States. Prior studies have shown that use of adjuvant therapy significantly increases the breast cancer cure rate and reduces recurrences. However, adjuvant therapy, especially chemotherapy and targeted therapy, may cause cardiotoxicity, one of the most important adverse reactions of adjuvant therapy. Patients with a higher burden of comorbidities may have a higher cardiotoxicity risk, since they have poorer prognosis compared to their healthier counterparts. The current study aimed to assess how the risk of cardiotoxicity in breast cancer patients on chemotherapy and targeted therapy varied by burden of comorbidities. In electronic health records (EHRs) in the OneFlorida+ Clinical Research Network, we identified 2,198 breast cancer patients who received chemotherapy and targeted therapy. Cardiovascular events were identified from the first date of chemo/targeted therapy to 3 months after the last date of chemo/targeted therapy using ICD codes. We used 12 months of data before the first chemo/targeted therapy date to calculate a Charlson comorbidity index (CCI) score for all patients. We calculated cardiovascular event rates by CCI and built multivariable logistic models to examine the association of multiple factors with cardiotoxicity, including CCI, age, race-ethnicity, body mass index (BMI), outpatient visits, tumor grade, laterality, stage, size, number of positive nodules at tumor diagnosis, and census tract-level poverty and residence. Overall, 9.5% (N=208) of the patients had a cardiotoxicity event after chemo/targeted therapy. The cardiotoxicity rates by CCI were 9.1% (CCI=0), 9.4% (CCI=1), and 10.5% (CCI=1+).
In the regression analysis, patients with a higher CCI were more likely to experience cardiotoxicity events, although the associations were not statistically significant (for CCI = 1 vs. 0: odds ratio [OR] = 1.20, 95% confidence interval [CI] = 0.85-1.71; for CCI = 1+ vs. 0: OR = 1.34, 95% CI = 0.87-2.08).
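An odds ratio and Wald 95% CI of the form reported above can be computed from a 2x2 table as sketched below. The counts are hypothetical, not the OneFlorida+ data, and this univariate calculation omits the covariate adjustment of the study's multivariable models.

```python
import math

# Odds ratio with Wald 95% confidence interval from a 2x2 table.
# a, b = events / non-events in the comparison group;
# c, d = events / non-events in the reference group. Counts are hypothetical.

def odds_ratio_ci(a, b, c, d):
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)        # SE of log(OR)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi

# e.g. 30 events among 300 patients vs. 80 events among 1,000 patients
or_, lo, hi = odds_ratio_ci(30, 270, 80, 920)
```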
71 Jing Zhu General Poster only Zhu Visualization and analysis of cancer genomics data using UCSC Xena Jing Zhu, Mary Goldman, Brian Craft, David Haussler UCSC Genomics Institute, UC Santa Cruz UCSC Xena is a web-based visual integration and exploration tool for multi-omic data and associated clinical and phenotypic annotations. Researchers can easily view and explore public data, their own private data, or both using the Xena Browser. Private data are kept on the researcher's computer and are never uploaded to our public servers. We support Mac, Windows, and Linux.

Questions Xena can help you answer:

* Is overexpression of this gene associated with lower/higher survival?

* What genes are differentially expressed between these two groups of samples?

* What is the relationship between mutation, copy number, expression, etc. for this gene?

Xena showcases seminal cancer genomics datasets from TCGA, the Pan-Cancer Atlas, GDC, PCAWG, ICGC, and more: a total of more than 1,500 datasets across 50 cancer types. We support virtually any type of functional genomics data: SNPs, INDELs, copy number variation, gene expression, ATAC-seq, DNA methylation, exon-, transcript-, miRNA-, and lncRNA-expression, and structural variants. We also support clinical data such as phenotype information, subtype classifications, and biomarkers. All of our data is available for download via Python or R APIs, or using our URL links.

Our signature Visual Spreadsheet view shows multiple data types side-by-side enabling discovery of correlations across and within genes and genomic regions. We also have dynamic Kaplan-Meier survival analysis, powerful filtering and subgrouping, differential gene expression analysis, charts, statistical analyses, genomic signatures, and the ability to generate URLs to live views. We link out to the UCSC Genome Browser as well as MuPIT/CRAVAT and TumorMap.
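
The Kaplan-Meier survival view is built on the standard product-limit estimator, sketched below with invented data. This is not Xena's implementation, and for simplicity the sketch assumes no tied event times (ties would need grouping).

```python
# Kaplan-Meier product-limit estimate. times: follow-up times;
# events: 1 = event observed, 0 = censored. Data are invented.

def kaplan_meier(times, events):
    """Return (time, survival probability) pairs at each observed event time."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv, curve = 1.0, []
    for i in order:
        if events[i]:
            surv *= (at_risk - 1) / at_risk   # multiply by conditional survival
            curve.append((times[i], surv))
        at_risk -= 1                          # censored subjects leave the risk set
    return curve

curve = kaplan_meier([5, 8, 12, 20], [1, 0, 1, 1])
```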

New features include:

* Genome-wide differential gene expression analysis

* Select samples directly from the screen for filtering and creating subgroups

* Violin plots on any numerical data

* Loading of Microsoft Excel files

Our development site has interactive visualization of single-cell data.