Poster Number / Zoom Breakout Room Number Affiliation / Institution Session/Workshop Area Abstract Type Abstract Title List all authors (first name first with names separated by commas and proper capitalization) in the order they appear on the abstract. Please do NOT list affiliations or addresses. Abstract (300 words or less) Poster DOI or URL
1 Lokender Kumar Colorado School of Mines General Poster only Toll-Like Receptor Signaling in Pocillopora damicornis Lokender Kumar, Matthew Lyn-Goin, Monsurat Olaosebikan, Liza Roger, Hannah Reich, Hollie Putnam, Nastassja Lewinski, Rohit Singh, Noah Daniels, Lenore Cowen, Judith Klein-Seetharaman Toll-like receptors (TLRs) play a crucial role in innate immunity. These membrane glycoprotein receptors recognize and respond to a variety of microbial components such as lipopeptides, peptidoglycans, lipopolysaccharide, flagellin, DNA, and RNA from viruses or bacteria. Viruses and bacteria are the most abundant biological entities in seawater and live in close association with corals. The chemical crosstalk between corals and microbes plays an important role in coral growth and development. In the present study, we used our remote homology detection pipeline [1] to find homologs of human TLR proteins in Pocillopora damicornis. Our results showed the presence of unique TLR homologs in P. damicornis. Further, leucine-rich sequence analysis and structural analysis of the ligand active site provide preliminary evidence for the presence of TLR homologs in P. damicornis. Our results offer insight into the presence of a comprehensive innate immune system in corals. This research may help in understanding the coral immune system and the evolution of innate immunity in humans.

[1] Monsurat Olaosebikan, Matthew Lynn-Goin, Lokender Kumar, Nathanael Brenner, Nastassja Lewinski, Rohit Singh, Lenore Cowen, Noah Daniels, and Judith Klein-Seetharaman (2020). Pipeline for Discovery of Remote Homologues in Non-Model Organisms. Pacific Symposium on Biocomputing, January 2021, Abstract.

https://f1000research.com/posters/9-1411
3 Sayantan Kumar Institute for Informatics at Washington University in St. Louis, School of Medicine General Poster only Simplified Form of Recurrent Neural Networks for Predicting Alzheimer Disease Progression  Sayantan Kumar, Aditi Gupta, Inez Oh,  Suzanne Schindler, Albert M. Lai, Philip R.O. Payne

Alzheimer disease (AD) is the most common cause of dementia and results in progressive memory loss, cognitive impairment, and general disability. Analyzing the disease course of AD is critical for implementing personalized diagnostic and therapeutic strategies to manage disease progression. Longitudinal data for AD patients can be extracted from electronic health records (EHR), consisting of varying numbers of visits and non-uniform time intervals between visits.  These aspects of EHR-derived data can result in the failure of traditional time series and machine learning methods when trying to predict AD progression. As an alternative, we explored the use of a simplified form of recurrent neural networks (RNN), MinimalRNN, to predict the progression of AD markers and disease stage up to 6 years into the future by leveraging temporal and clinical patterns extracted from the EHR.  Outpatient EHR data from June 1st 2013 through May 31st 2018 were extracted from the EHR of Barnes-Jewish Hospital, a large tertiary-referral academic medical center in St. Louis, MO. Longitudinal data from 1193 patients encompassing 3676 visits with one or more neurobehavioral status exam scores were selected for further analyses. To impute missing data, we explored two techniques: a ‘linear filling’ strategy which performs a linear interpolation between the previous and next observed timepoints using available data during preprocessing, and an integrative ‘model filling’ strategy which utilizes the predictions from the MinimalRNN model itself to fill in missing data during training and testing. Our analyses demonstrated that with simplified update equations and fewer parameters, MinimalRNN has ~1.6X faster training time compared to baseline long short-term memory networks, while performing similarly (Multiclass Area Under Receiver Operating Characteristics for clinical diagnosis = 0.7474 vs 0.704 respectively). 
Future analyses will include model hyperparameter optimization, incorporating more clinical features, and interpreting the importance of each clinical feature and model component.  https://wustl.box.com/s/uyoad7w8u1z0mifcb8blzkj5f8315ygi
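The 'linear filling' strategy described in the abstract above can be illustrated with a minimal sketch. It assumes visits are sorted by time and that missing scores are stored as NaN; the function name `linear_fill`, the example visit times, and the example scores are hypothetical and are not taken from the study. Gaps before the first or after the last observed visit are left unfilled here, which is one possible choice rather than the authors' documented behavior.

```python
import math

def linear_fill(times, values):
    """Fill interior gaps (NaN) by linear interpolation between the
    previous and next observed visits. Leading/trailing gaps stay NaN.
    Assumes `times` is strictly increasing (hypothetical helper)."""
    filled = list(values)
    # Indices of visits where a score was actually recorded.
    observed = [i for i, v in enumerate(values) if not math.isnan(v)]
    # Walk over each pair of consecutive observed visits.
    for left, right in zip(observed, observed[1:]):
        t0, t1 = times[left], times[right]
        v0, v1 = values[left], values[right]
        for i in range(left + 1, right):
            weight = (times[i] - t0) / (t1 - t0)
            filled[i] = v0 + weight * (v1 - v0)
    return filled

# Hypothetical example: visit times in months and one exam score per visit.
visit_months = [0, 6, 18, 24]
scores = [30.0, math.nan, 24.0, math.nan]
filled_scores = linear_fill(visit_months, scores)
```

Because the interpolation weight uses the actual time gap rather than the visit index, non-uniform intervals between visits are handled directly.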
5 Chitaranjan Mahapatra UCSF General Poster only Hydroxychloroquine: A SARS-CoV-2 infection inhibitor causes Cardiac toxicity. A simulation Study Chitaranjan Mahapatra, Ashish Pradhan Objectives: The outbreak of coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2/2019-nCoV) poses a serious threat to global public health and local economies. With the heightened interest in the potential use of chloroquine and hydroxychloroquine for the treatment of patients with SARS-CoV-2 (COVID-19 or novel coronavirus), it may be prudent to reflect on the risks of therapy, particularly their cardiac toxicity. The purpose of this study was to investigate the propensity of hydroxychloroquine (HCQ) to cause bradycardia.

Methods: The sinoatrial node cell is described as an equivalent electrical circuit with a number of variable conductances representing voltage-gated Na+ channels, voltage-gated Ca2+ channels, voltage-gated potassium channels, calcium-dependent potassium channels, and funny current channels. An HCQ drug model for the funny current is simulated after mining data from experimental studies. The biophysically altered funny current is integrated into the single SA node electrophysiological model.

Results: The resting membrane potential (RMP) is set at −80 mV. A current pulse of 2 nA for 10 ms is injected to evoke the action potential (AP). The steady-state value of the activation parameter of the funny current (If) is shifted in the negative direction after applying 1 µM HCQ. The action potential timing is altered when the biophysically modified funny current is incorporated. The results show that the modified funny current plays an important role in reducing the frequency of spontaneous action potentials at the SA node.

Conclusions: The model successfully reproduces both the ionic currents and the action potential observed in intracellular recordings from individual SAN cells. The effects of hydroxychloroquine are simulated with respect to the funny current and the action potential. As hydroxychloroquine reduces the frequency of spontaneous action potential firing, its use as a potential drug against COVID-19 should be avoided.

https://f1000research.com/slides/9-1427
7 Eugene Matveev Institute for Information Transmission Problems General Poster only Prediction of protein regions susceptible to proteolytic processing Eugene Matveev, Vyacheslav Safronov, Marat Kazanov Regulatory proteolysis, or proteolytic processing, the main topic of this study, is the activation or deactivation of a target protein via hydrolysis of one or a few of its peptide bonds by a specialized enzyme, a protease. In contrast to proteases involved in food digestion, regulatory proteases usually possess high specificity, which allows them to selectively bind their substrates and modulate their function. Proteases recognize their cleavage sites primarily based on the specific amino acid context around the cleaved peptide bond. Stretches of amino acid sequence that satisfy the specificity of a regulatory protease rarely appear at random; when they do occur, especially if they are also structurally accessible, they represent native cleavage sites in most cases. Thus, predicting cleavage sites of regulatory proteases from the known specificity of the protease and the known 3D structure of the substrate can be a powerful approach for identifying protease substrates. In this study, we developed a computational model that predicts the structural susceptibility of proteins to proteolytic processing.

https://f1000research.com/posters/9-1421
9 Aarthy Murali Alagappa University General Poster only Interpretations on the interaction between protein tyrosine Phosphatase and E7 oncoproteins of high and low risk HPV – A computational perception Aarthy Murali, Sanjeev Kumar Singh Human papillomavirus (HPV) is the most widespread sexually transmitted infection around the globe, and it is reported that 75% of sexually active women are affected by HPV infection. Other factors that predispose to HPV infection include smoking, sexual activity at an early age, and environmental factors. HPV infection is mainly associated with cellular hyperproliferation, which includes the most prevalent cancer in women. More than 200 types of HPV have been recognized on the basis of DNA sequence and classified as high-risk or low-risk genotypes. The high-risk types are associated with cervical, vulvar, vaginal, and anal cancers through their interaction with tumor suppressor proteins, whereas the low-risk types are associated with genital warts. The HPVs that cause cervical cancer in women possess double-stranded DNA genomes of approximately 8 kb that encode eight genes, among which E6 and E7 possess transforming properties, namely transmembrane signalling, regulation of the cell cycle, transformation of established cell lines, immortalization of primary cell lines, and regulation of chromosomal stability. The E6 oncoprotein interacts with p53 and E7 interacts with pRb, which is degraded, and cervical cancer develops in turn. Our study focuses on identifying the interaction of PTPN14 with E7 from the high-risk type 16 and the low-risk type 11, which helps in understanding the degradation of protein activity.  https://doi.org/10.7490/f1000research.1118432.1
11 Umesh Panwar Department of Bioinformatics, Alagappa University General Poster only Identification of potent inhibitors for blocking the interaction between HIV-1 IN-LEDGF/p75 using virtual screening and docking studies Umesh Panwar, Sanjeev Kumar Singh Integrase (IN) is an essential enzyme that integrates reverse-transcribed viral cDNA into the host genome. For a successful infection, cellular cofactor proteins, specifically Lens Epithelium-Derived Growth Factor (LEDGF/p75), are required during the replication cycle. The IN-LEDGF/p75 protein-protein interaction has become a promising target in drug discovery. The present study was therefore designed to identify novel compounds using in silico approaches. The results suggest that the compounds NCI 744471 and BindingDB_27849 have enhanced binding affinity and may inhibit the interaction between IN and LEDGF/p75. Overall, the obtained leads could serve as therapeutic agents to prevent and treat HIV-1 infection. https://doi.org/10.7490/f1000research.1118429.1
13 Rahagir Salekeen Biotechnology and Genetic Engineering Discipline, Life Science School, Khulna University General Poster only in-silico Evaluation of Dietary Fish Fatty Acids as Potential Atheroprotective Therapeutics Rahagir Salekeen, Sadia Noor Mou, Kazi Mohammed Didarul Islam Fish lipid supplements have long been used for their cardioprotective roles, but the molecular mechanisms involved in their activities have not been elucidated. Here we considered 32 dietary fish lipids and, through standard virtual screening methods, evaluated their potential for addressing atherosclerosis and cardiovascular diseases through multiple pathway interactions. Metabolic network simulations suggest that the studied lipids have a pathway relation index of 0.44±0.02 to key atherosclerosis-associated pathways. In silico ADMET simulations reveal that all the candidate lipids have optimum bioavailability scores (0.56±0.13), and 26 compounds had no predicted toxicity outcomes (LD50 = 2500±660 mg/kg). Molecular docking, dynamics, and mechanics reveal nervonic acid and docosapentaenoic acid to be the best modulators of four different pathway-regulating enzymes: cyclooxygenase 2, thromboxane A2 synthase, angiotensin converting enzyme I, and L-type Ca2+ channel protein. The findings of this study provide a molecular and structural foundation for the atheroprotective potential of dietary fish lipids and identify the two best candidates for use as single-compound regulators of atherosclerotic pathways. https://f1000research.com/posters/9-1403
15 Naila Shinwari Kyoto University General Poster only scMontage: Fast and Robust Gene Expression Similarity Search for Massive Single-cell Data Naila Shinwari, Tomoya Mori, Wataru Fujibuchi Single-cell RNA-seq (scRNA-seq) analysis is widely used to characterize cell types or detect the heterogeneity of cell states at much higher resolutions than ever before. With the emergence of large-scale data like Human Cell Landscape or Mouse Cell Atlas, there is a growing need for a data mining tool that can swiftly search the data of interest. Here, we introduce scMontage (http://cellblast.stemcellinformatics.org), a gene expression similarity search server dedicated to scRNA-seq data and capable of rapidly comparing a query with thousands of samples within a few seconds. The scMontage search is based on Spearman’s rank correlation coefficient, and its robustness is ensured by introducing Fisher’s Z-transformation and Z-test. Furthermore, search results are linked to the human cell database SHOGoiN (http://shogoin.stemcellinformatics.org), which enables users to quickly access additional cell-type specific information. Although there already exist tools for gene similarity searches, the fast search, easy to use server and no need to normalize the data are distinctive attributes of scMontage. scMontage is available not only as a web server, but also as a stand-alone application for the user’s own data, and thus enhances the reliability and throughput of the cell analysis to help users gain new insights into their research. https://f1000research.com/slides/9-1405
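The statistical core named in the scMontage abstract (Spearman's rank correlation followed by Fisher's Z-transformation and a Z-test) can be sketched generically as follows. This is not the scMontage implementation, whose internals are not described here. The function names, the standard error approximation 1/sqrt(n - 3), the clamping of the correlation to keep the transform finite, and the example expression values are all assumptions made for illustration. The sketch also assumes that neither profile is constant and that more than three genes are compared.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def fisher_z_test(rho, n):
    """Z statistic and two-sided p-value for H0: rho = 0 (assumed SE)."""
    rho = max(min(rho, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(rho) * math.sqrt(n - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical expression values for six genes in a query and a reference cell.
query = [5.0, 0.0, 3.2, 8.1, 1.0, 2.2]
reference = [4.1, 0.3, 2.9, 9.0, 0.8, 2.0]
rho = spearman(query, reference)
z_stat, p_value = fisher_z_test(rho, len(query))
```

Because the correlation is computed on ranks, the comparison depends only on the ordering of expression values, which is consistent with the abstract's point that input data need not be normalized.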
17 Yao Yan University of Washington General Poster only A continuous benchmarking challenge for COVID-19 outcome prediction Yao Yan, Thomas Schaffter, Timothy Bergquist, Thomas Yu, Vikas Pejaver, Noah Hammarlund, Justin Prosser, Justin Guinney, Sean Mooney The COVID-19 pandemic spread across the globe in early 2020 and resulted in soaring case counts, strained hospital capacity, and increased deaths. Machine learning approaches applied to COVID-19 patient data have already shown value in early screening and outcome prediction. However, due to the sensitive nature of EHR data, patient data sharing is restricted under HIPAA to protect patient privacy, raising the threshold for researchers to access and use patient data to aid in the pandemic response. To overcome this problem, we implemented a model-to-data continuous benchmarking challenge, the COVID-19 EHR DREAM Challenge, to enable the community to validate clinically relevant prediction models on a private EHR dataset from the University of Washington without ever accessing the data. The COVID-19 EHR DREAM Challenge is a continuous benchmarking community challenge in which the training and evaluation datasets for COVID-19 patients are updated every 2-4 weeks. As of late November, the combined dataset included ~90,000 patients tested for COVID-19 since February 2020. Among all patients who received a COVID-19 test, ~3,500 tested positive. These data were then used to help the community address the following two questions: Q1) For a patient who has received a COVID-19 test, how likely is he/she to test positive? Q2) For a patient who has tested positive during an outpatient visit, how likely is he/she to be hospitalized within 21 days of the COVID-19 test? Datasets for Q1 cover 25 weeks with 6 versions, and datasets for Q2 cover 13 weeks with 3 versions. In total, 482 participants registered for the challenge and 90 teams made at least one valid submission.
The AUROC and AUPRC of the best-performing model so far are 0.827 and 0.303 for Q1 (dataset version “09-18-20”), and 0.982 and 0.897 for Q2 (dataset version “08-19-2020”).

https://f1000research.com/posters/9-1425
19 Hong Zheng Stanford University General Poster only Multi-cohort analysis of host immune response identifies conserved protective and detrimental modules associated with severity irrespective of virus Hong Zheng, Aditya M Rao, Denis Dermadi, Jiaying Toh, Lara Murphy Jones, Michele Donato, Yiran Liu, Yapeng Su, Cheng Dai, Sergey Kornilov, Minas Karagiannis, Theodoros Marantos, Yehudit Hasin-Brumshtein, Yudong He, Evangelos J Giamarellos-Bourboulis, Jim Heath, Purvesh Khatri The SARS-CoV-2 pandemic, the fourth pandemic of the decade, has underscored gaps in global pandemic preparedness and the need for generalizable tests to avoid overwhelming healthcare systems worldwide, irrespective of the virus. We integrated 4,780 blood transcriptome profiles from patients infected with one of 16 viruses across 34 independent cohorts from 18 countries, and scRNA-seq profiles of 702,970 immune cells from 289 samples across three independent cohorts. We found a myeloid cell-dominated conserved host response associated with severity. It showed increased hematopoiesis, myelopoiesis, and myeloid-derived suppressor cells with increased severity. We identified four gene modules that delineate distinct trajectories associated with mild and severe outcomes, and showed that the interferon response was decoupled from the protective host response during severe viral infection. These modules distinguished non-severe from severe viral infection with clinically useful accuracy. Our findings provide insights into immune response dynamics during viral infection and identify factors that may influence patient outcomes. https://f1000research.com/posters/9-1410
21 Jill Ashey University of Rhode Island Workshop: Bioinformatics of Corals Poster only Transcriptomic responses of Pacific and Caribbean corals to sediment stress Jill Ashey, Hailey McKelvie, John Freeman, Polina Shpilker, Lenore Cowen, Francois Seneca, and Hollie Putnam

As human impacts on marine systems increase, wave action from intensifying storms and runoff from coastal development increase the sediment suspended in the water column and deposited on reefs. Cilia and a mucus layer on the surface of corals can effectively remove deposited sediments, harmful bacteria, and toxins; however, these removal processes are energy-intensive and divert energy away from key biological processes, such as growth and reproduction. Additionally, suspended sediment in the water column prevents light from reaching the coral’s photosynthetic endosymbionts, impeding their ability to provide energy to the coral. Despite the deleterious effects that sediment suspension and deposition have on corals, few studies have examined the response of multiple species of corals in the context of sediment stress. To address this knowledge gap, this study assessed the transcriptomic response of two Caribbean coral species in the Florida Keys (Acropora cervicornis and Montastraea cavernosa) and two Pacific coral species in Hawaii (Pocillopora damicornis and Porites lobata) to varying levels of sediment stress for days to one month. FL sediment was dispensed weekly to mimic natural sediment movement caused by frequent storms, while HI corals were exposed daily to simulate runoff from development. RNA-seq analysis revealed that for all species from both sites, treatment samples clustered with one another, separate from the controls, indicating a concerted response to sedimentation. The greatest number of differentially expressed genes was found in comparisons of treatments to controls, not between treatment levels. Gene ontology enrichment analysis found enriched GO terms related to heme binding, peroxidase activity, and oxidation-reduction process, and there was greater differential expression and enrichment of functions in flat plating corals than in branching corals.
These findings help to elucidate the genetic basis underlying coral resilience and susceptibility to sedimentation, which can guide decisions related to reef management.

https://figshare.com/articles/poster/Transcriptomic_responses_of_Pacific_and_Caribbean_corals_to_sediment_stress/13345193
23 Dimitri Perrin Queensland University of Technology Workshop: Bioinformatics of Corals Poster only Optimal CRISPR guide RNA design for gene editing in corals Jacob Bradford, Phillip A. Cleves, Amanda I. Tinoco, Line K. Bay, Dimitri Perrin Growing amounts of genomic and transcriptomic data from multiple reef building coral species have been used to generate hypotheses about the roles of particular genes and molecular pathways in coral health and resilience.

However, it has been difficult to rigorously test these hypotheses, and in particular to generate and analyse the appropriate mutants, due to a lack of genetic tools.

CRISPR-based gene editing can address this challenge, but the design of guide RNAs is not trivial. Efficient guides are crucially important here, given that zygotes are typically available only seasonally when the coral of interest spawns.

We have developed and optimised a gRNA design method that can be used to systematically identify all suitable guides in the genome of the coral, Acropora millepora. As a proof of concept, guides generated from our pipeline were recently used to achieve a knock-out efficiency over 90% in a gene encoding the Heat Shock Transcription Factor 1 (HSF1). These mutant animals were subsequently used to demonstrate HSF1’s role in thermal tolerance in coral. Finally, we discuss how these guide design methods can be rolled out to all coral genomes and elaborate on the significance of these for the field more broadly.
http://biomedicaldatascience.com/posters/PSB-2021.pdf
25 John Freeman Department of Computer Science / Tufts University Workshop: Bioinformatics of Corals Poster only  MEDFORD- RNASeq: A human and machine readable markup language to facilitate FAIR coral metadata John Freeman, Jill Ashey, Polina Shpilker, Hailey McKelvie, Hollie M. Putnam, Jane Greenberg,  Lenore J. Cowen, Alva Couch,  Noah Daniels Corals are home to vast biodiversity and play a critical role in protecting shores from degradation. Research into RNA expression can be used to examine coral responses to stressors. The coral research community has long been committed to sharing and open data formats, and both individual researchers and large funding agencies have invested heavily in making data available. However, the standard formats recommended for structuring these metadata so that they are FAIR (Findable, Accessible, Interoperable, and Reusable), such as RDF or CoRIS, are not designed to be easily readable or writable by humans, as opposed to machines. This is a particular barrier when the humans generating the metadata are not programmers.



We introduce a preliminary design for MEDFORD, a markup language file format that is designed to be simultaneously human and machine writable and readable. Our first use case is coral holobiont transcriptomics data (e.g., RNASeq, one of the most powerful and common types of omics experiments for exploring the genetic basis of factors that lead to coral resilience or vulnerability to environmental stressors), where we build the complexity needed to manage spatial-temporal holobiont expression metadata into MEDFORD from the start. A coral researcher who is untrained in programming and is not a database expert will be able to directly produce and interpret MEDFORD files, and we are currently writing the back-end parser that will automatically translate MEDFORD files into existing file format standards for deposit in databases and repositories. MEDFORD will enable transcriptomic data to be findable, accessible, and interoperable; the back-end infrastructure we are currently building will make MEDFORD compatible with other existing databases and systems such as RDF, ultimately supporting the reusability “R” in FAIR as well.

https://f1000research.com/posters/9-1407
27 Liza M. Roger Virginia Commonwealth University Workshop: Bioinformatics of Corals Poster only Coral microbioreactors for model validations Liza M. Roger, Hannah G. Reich, Shuaifeng Li, Lokender Kumar, Isaiah Cuadras, Jinkyu Yang, Hollie M. Putnam, Nastassja Lewinski Computational methods and model predictions in biology challenge our current understanding of biological systems. Understanding intricate exchanges between coral cells and their symbiotic dinoflagellate algae and the respective needs of different cells requires higher resolution. Microfluidic chips, or microbioreactors, enable us to test and validate model predictions related to protein interactions and gene expression applied to the response of corals to environmental change, toxicity (e.g. microplastics), disease and healing. The design presented here offers multiple independent input/output channels to control growth media or treatment parameters (flow, temperature, composition), light (for photosynthesis) and live-imaging capabilities. Shear stress and velocity profiles were modelled for both single chamber and network configurations. Coral cells and polyps will be exposed to varying temperature, light and nanoparticle concentrations. To realize the full potential of these methods and validate response predictions, we need to bridge the gap between biocomputing outputs and process knowledge and observation. The added advantage of these designs is scalability and line assembly to simulate networks and connections between coral cells, coral cells and symbiotic algae, or polyps for better fidelity to the coral holobiont. This new microfluidic device offers a robust and adaptable design to investigate mechanisms of coral bleaching, toxicity and disease.
https://f1000research.com/posters/9-1396
29 Federica Scucchia University of Haifa Workshop: Bioinformatics of Corals Poster only Response of Stylophora pistillata primary polyps to decreasing seawater pH Federica Scucchia, Assaf Malik, Paul Zaslansky, Hollie M. Putnam, Tali Mass With coral reefs declining globally, the resilience of these ecosystems hinges on successful coral recruitment. However, knowledge of the acclimatory and/or adaptive potential in response to environmental challenges such as ocean acidification (OA) in the earliest life stages is limited. In this study, we cultured Stylophora pistillata larvae and primary polyps under reduced pH conditions and integrated the response of the endosymbiotic algae and the coral host across biological levels, from cellular to organismal. By combining transcriptomic analysis with a suite of physiological and morphological measurements, we found that while the survival and settlement of coral larvae were reduced under acidified conditions, the surviving recruits adjusted to OA. This was achieved through extensive physiological and transcriptional changes, and through the transition to a less-skeleton/more-tissue phenotype. We found that decreased pH conditions stimulate photosynthesis, endosymbiont growth, and gene expression linked to photosynthate translocation. Our holistic study shows that there is acclimatory potential to OA in newly settled corals, which are essential for the maintenance of coral reefs. https://drive.google.com/file/d/1Q-wHeBxNigLb2Wndk5Pg7OY1UI4Le3dX/view?usp=sharing
31 Francois Seneca Centre Scientifique de Monaco Workshop: Bioinformatics of Corals Poster only Gene expression kinetics of Exaiptasia pallida innate immune response to Vibrio parahaemolyticus infection Francois Seneca, David Davtian, Laurent Boyer, Dorota Czerucka Climate change alters the physical conditions of the ocean. These environmental changes have the greatest influence on micro-organisms, including the marine bacteria at the base of all ecosystems, and therefore on the food chains that feed us. With warmer sea surface temperatures and coastal pollution, the risk of marine bacteria becoming pathogenic to humans also increases. To better study these relationships, we are developing a marine host/pathogen model system using the sea anemone Exaiptasia pallida (Ep) as the host and the Vibrio parahaemolyticus (Vp) clinical strain O3:K6 as a marine bacterium pathogenic to humans. Ep’s genome is rich in innate immunity gene families that have evolved over ~600 Myr and show significant homology with their human counterparts, making the anemone a valuable animal model for studying innate immunity. Vp occurs naturally within the bacterial community of cnidarians, but is also the leading cause of food-borne gastroenteritis worldwide via the consumption of contaminated seafood. Because the same Vp strain can be used in either seawater or human cell culture conditions, it is a powerful infectious pathogen model. https://f1000research.com/posters/9-1432
33 Samuel Sledzieski Massachusetts Institute of Technology Workshop: Bioinformatics of Corals Poster only D-SCRIPT: Structure-Aware Deep Learning for Protein Interaction Prediction Samuel Sledzieski, Rohit Singh, Lenore Cowen, Bonnie Berger Protein-protein interaction (PPI) networks have proven to be a valuable tool in systems biology to facilitate the discovery and understanding of protein function. Unfortunately, experimental PPI data remains sparse in most model organisms and even more so in other species. Existing methods for computational prediction of PPIs seek to address this limitation, and while they perform well when sufficient within-species training data is available, they generalize poorly to new species or often require specific types and sizes of training data that may not be available in the species of interest. We therefore present D-SCRIPT, a deep learning method for predicting a physical interaction between two proteins given just their sequences. Compared to existing methods, D-SCRIPT generalizes better to new species and is robust to limitations in training data size.

Our approach encodes the intuition that for two proteins to physically interact, a subset of amino acids from each protein should be in contact with the other. The intermediate stages of D-SCRIPT directly implement this intuition; the penultimate stage in D-SCRIPT is a rough estimate of the inter-protein contact map of the protein dimer. This structurally-motivated design enables interpretability of our model and, since structure is more conserved evolutionarily than sequence, improves generalizability across species. We show that a D-SCRIPT model trained on human PPIs enables significantly improved interaction prediction in other model species compared to the state-of-the-art approach. Evaluating the same D-SCRIPT model on protein complexes with known 3-D structure, we find that the inter-protein contact map output by D-SCRIPT has significant overlap with the ground truth. Our work suggests that recent advances in deep learning language modeling of protein structure can be leveraged for protein interaction prediction from sequence.

D-SCRIPT is available at http://dscript.csail.mit.edu.
https://f1000research.com/posters/9-1422
35 Sagnik Banerjee Iowa State University Workshop: Making Tools that People Will Use: User-Centered Design in Computational Biology Research Poster only Enhancing eukaryotic gene structure annotation via changepoint analysis of short-read coverage data Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z. Sen, Roger P. Wise, Carson Andorf Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of expression data. The presence of transposable elements and sequence repeats in eukaryotic genomes adds to this complexity, as do overlapping genes and genes that produce numerous transcripts. Currently available software tools annotate genomes by relying on full-length cDNA or on a database of splice junctions to predict genes. We present FINDER1, which automates downloading of expression data from NCBI, optimized read alignment, transcript assembly, and gene prediction. FINDER1 is optimized to conduct read mapping with different settings to capture all biologically relevant alignments, with special attention to micro-exons (exon length less than 51 nucleotides). FINDER1 further reports transcripts and recognizes genes that are expressed under specific conditions. FINDER1 integrates prediction results from BRAKER2 with assemblies constructed from expression data to approach the goal of exhaustive genome annotation. On the entire set of Arabidopsis thaliana genes, FINDER1 achieves a transcript F1 score of 0.5, exceeding that of BRAKER2 by 0.23. Even in specific groups, such as transcripts with micro-exons and overlapping transcripts, FINDER records superior performance. The pipeline scores genes as high confidence or low confidence based on the available evidence. Finally, FINDER predicts non-coding genes using different approaches and includes them in the final list of annotations. https://github.com/sagnikbanerjee15/PSB-2021/blob/main/FINDER1_36x48_v1.pdf
37 Alejandro Mejia School of Microbiology/Universidad de Antioquia Workshop: Translational Bioinformatics Poster only Frequency and spectrum of mutations in the BRCA1, BRCA2, PALB2, P53, PTEN, CHEK2, CDH1 genes in women with breast cancer from three cities of Colombia Alejandro Mejía García, Luz María Gonzales, William Hernán Arias Pérez, Yina Tatiana Zambrano, Shalom Gómez Pugarin, Johanna Tejada Moreno, Roberto Jaramillo, Yorlany Rodas, Edgar Navarro, Andrés Ossa, Mauricio Borrero, Gonzalo Alberto Angel, Alicia M. Cock Rada, Gabriel Bedoya, Michael Dean, Sabina Rinaldi, Gloria Inés Sánchez
Germline mutations in the BRCA1 and BRCA2 genes confer a lifetime risk of 40-80% of developing breast cancer, while mutations in TP53, PTEN, CDH1, PALB2, and CHEK2 confer a moderate to high lifetime risk of this disease. It is important to detect these mutations in order to provide genetic counseling and specific treatment. The aim of this study is to determine the frequency and spectrum of mutations in 7 genes in unselected women with breast cancer living in three cities of Colombia. A total of 135 unselected patients with breast cancer, aged 25-77, were recruited in 6 health centers in Medellin, Cali and Barranquilla. DNA was extracted from blood samples by salting out; then the exons and 20 nucleotides at the intron-exon boundaries of the BRCA1, BRCA2, PALB2, TP53, PTEN, CHEK2 and CDH1 genes were sequenced by next-generation sequencing on the Ion Torrent platform. Raw signal data were analyzed using Torrent Suite™. The pipeline included quality control, read alignment to the human genome reference hg19 (with TMAP), assessment of mapping quality, coverage analysis, and variant calling using the Torrent Variant Caller 5.0-7 (SNVs and INDELs) and GATK (for SNVs). The variants were annotated with the Ion Reporter software and classified according to the following databases: ClinVar, Leiden Open Variation Database and wInterVar. Six pathogenic mutations (frequency of 4.4%) were found in these patients: BRCA1: c.5186C>A, c.178C>T and c.213-12A>G; BRCA2: c.7007+1G>A and c.631+3A>G; and TP53: c.586C>T. This is the first study in Colombia to evaluate genes other than BRCA1 and BRCA2 in unselected cases. Three mutations were found in splice sites, so it is important to include these sites in the sequencing. https://f1000research.com/posters/9-1357
39 Raquel Aoki Simon Fraser University Computational Challenges and Artificial Intelligence in Precision Medicine Accepted proceedings paper with poster presentation ParKCa: Causal Inference with Partially Known Causes Raquel Aoki, Martin Ester Methods for causal inference from observational data are an alternative for scenarios where collecting counterfactual data or running a randomized experiment is not possible. Adopting a stacking approach, our proposed method ParKCa combines the results of several causal inference methods to learn new causes in applications with some known causes and many potential causes. We validate ParKCa on two genome-wide association study datasets, one real-world and one simulated. Our results show that ParKCa can infer more causes than existing methods. https://doi.org/10.7490/f1000research.1118393.1
2 Dov Greenbaum Interdisciplinary Center Herzliya Achieving Trustworthy Biomedical Data Accepted proceedings paper with poster presentation Making Compassionate Use More Useful: Using real-world data, real-world evidence and digital twins to supplement or supplant randomized controlled trials. Dov Greenbaum The coronavirus pandemic has placed renewed focus on expanded access (EA) programs to provide compassionate use exceptions to the waves of patients seeking medical care in treating the novel disease. While commendable, justifiable, and compassionate, EA programs are not designed to collect the vital clinical data that can later be used in the New Drug Application process before the U.S. Food and Drug Administration (FDA). In particular, they lack the rigor of properly crafted randomized controlled trials (RCTs), which ensure that each patient is closely monitored for side effects and other potential dangers associated with the drug, that the data are documented, stable and traceable, and that the patient population is well defined with the defined target condition. Overall, while RCTs are deemed to be among the most reliable methodologies within evidence-based medicine, they are morally problematic in EA programs. Nevertheless, actionable data ought to be collected from EA patients. To this end, we look to the growing incorporation of real-world data and real-world evidence as increasingly useful substitutes for data collected via RCTs, including the ethical, legal and social implications thereof. Finally, we suggest the use of digital twins as an additional method to derive causal inferences from real-world trials involving expanded access patients. https://bit.ly/PSB-RWD-2021
4 Silvia Canelón University of Pennsylvania Advanced Methods for Big Data Analytics in Women's Health Accepted proceedings paper with poster presentation Not All C-sections Are the Same: Investigating Emergency vs. Elective C-section Deliveries as an Adverse Pregnancy Outcome Silvia P. Canelón, Mary Regina Boland Electronic Health Records (EHR) contain detailed information about a patient’s medical history and can be helpful in understanding clinical outcomes among populations generally underrepresented in research, including pregnant individuals. A cesarean delivery is a clinical outcome often considered in studies as an adverse pregnancy outcome, when in reality there are circumstances in which a cesarean delivery is considered the safest or best choice given the patient’s medical history, situation, and comfort. Rather than consider all cesarean deliveries to be negative outcomes, it is important to examine other risk factors that may contribute to a cesarean delivery being an adverse event. Looking at emergency admissions can be a useful way to ascertain whether or not a cesarean delivery is part of an adverse event. This study utilizes EHR data from Penn Medicine to assess patient characteristics and pregnancy-related conditions as risk factors for an emergency admission at the time of delivery. After adjusting for pregnancy number and cesarean number for each patient, preterm birth increased risk of an emergency admission, and patients younger than 25, or identifying as Black/African American, Asian, or Other/Mixed, had an increased risk. Later pregnancies and repeat cesareans decreased the risk of an emergency delivery, and White, Hispanic, and Native Hawaiian/Pacific Islander patients were at decreased risk. The same risk factors and trends were found among cesarean deliveries, except that Asian patients did not have an increased risk, and Native Hawaiian/Pacific Islander patients did not have a reduced risk in this group. 
https://silvia.rbind.io/publication/csections-emergency-admissions/PSB2021_poster.pdf
6 Deendayal Dinakarpandian Stanford University Biocomputing and AI for infectious disease modelling and therapeutics Accepted proceedings paper with poster presentation Semantic Changepoint Detection for Finding Potentially Novel Research Publications Bhavish Dinakar, Mayla R. Boguslav, Carsten Goerg, Deendayal Dinakarpandian How has the focus of research papers on a given disease changed over time? Identifying the papers at the cusps of change can help highlight the emergence of a new topic or a change in the direction of research. We present a generally applicable unsupervised approach to this question based on semantic changepoints within a given collection of research papers. We illustrate the approach by a range of examples based on a nascent corpus of literature on COVID-19 as well as subsets of papers from PubMed on the World Health Organization list of neglected tropical diseases. The software is freely available at: https://github.com/pdddinakar/SemanticChangepointDetection.

https://github.com/pdddinakar/SemanticChangepointDetection/blob/master/Final_PSB2021_Poster.pdf
8 Gerard Goh Goh's BioComputing Biocomputing and AI for infectious disease modelling and therapeutics Accepted proceedings paper with poster presentation Feasibility of the vaccine development for SARS-CoV-2 and other viruses using the shell disorder analysis Gerard Kian-Meng Goh, A. Keith Dunker, James A. Foster, Vladimir N. Uversky Several related viral shell disorder (disorder of shell proteins of viruses) models were built using a disorder predictor via AI. The parent model detected the presence of high levels of disorder at the outer shell in viruses for which vaccines are not available. Another model found correlations between inner shell disorder and viral virulence. A third model was able to positively correlate the levels of respiratory transmission of coronaviruses (CoVs). These models are linked together by the fact that they have uncovered two novel immune evading strategies employed by the various viruses. The first involves the use of a highly disordered “shape-shifting” outer shell to prevent antibodies from binding tightly to the virus, thus leading to vaccine failure. The second usually involves a more disordered inner shell that provides for more efficient binding in the rapid replication of viral particles before any host immune response. This “Trojan horse” immune evasion often backfires on the virus when the viral load becomes too great at a vital organ, which leads to death of the host. Just as such virulence requires an excessive viral load at a vital organ, a minimal viral load in the saliva/mucus is necessary for respiratory transmission to be feasible. As for SARS-CoV-2, no high levels of disorder can be detected at the outer shell membrane (M) protein, but some evidence of correlation between virulence and inner shell (nucleocapsid, N) disorder has been observed. 
This suggests not only that the development of a vaccine for SARS-CoV-2, unlike for HIV, HSV and HCV, is feasible, but also that an attenuated vaccine strain can either be found in nature or generated by genetically modifying N.

https://gohsbiocomputing.wixsite.com/website
10 Anish Mudide Phillips Exeter Academy Biocomputing and AI for infectious disease modelling and therapeutics Accepted proceedings paper with poster presentation SARS-CoV-2 Drug Discovery based on Intrinsically Disordered Regions Anish Mudide, Gil Alterovitz Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a close relative of SARS-CoV-1, causes coronavirus disease 2019 (COVID-19), which, at the time of writing, has spread to over 65.8 million people worldwide. We aim to discover drugs capable of inhibiting SARS-CoV-2 through interaction modeling and statistical methods. Currently, many drug discovery approaches follow the typical protein structure-function paradigm, designing drugs to bind to fixed three-dimensional structures. However, in recent years such approaches have failed to address drug resistance and limit the set of possible drug targets and candidates. For these reasons we instead focus on targeting protein regions that lack a stable structure, known as intrinsically disordered regions (IDRs). Such regions are essential to numerous biological pathways that contribute to the virulence of various viruses. In this work, we discover eleven new SARS-CoV-2 drug candidates targeting IDRs and provide further evidence for the involvement of IDRs in viral processes such as enzymatic peptide cleavage while demonstrating the efficacy of our unique docking approach. https://f1000research.com/posters/9-1402
12 Mason Chen Stanford OHS Biocomputing and AI for infectious disease modelling and therapeutics Poster only Using Phase-Specific Nonparametric Density Regression Technique to Model and Compare the COVID-19 Infection Rate between South Korea and Italy Patrick, Charles This project uses nonparametric density and regression to construct a technique that can accurately and consistently model different countries’ cumulative growth curve through phase divisions. Previous outbreaks (SARS, MERS, and the 1918 flu pandemic) and existing models (SIR and logistic/exponential) were initially consulted to help model the growth, but the unique replication and circumstances of COVID-19 are unlike any other. Additionally, different countries have different approaches to the pandemic, and using one prediction line for the whole curve will not model the growth patterns accurately. This paper utilizes the first and second nonparametric densities to divide up the graph into separate phases and then model each phase using regression. Although each phase already provides a general picture of the different stages of the COVID-19 pandemic, South Korea’s and Italy’s graphs were further studied and compared to uncover other underlying patterns. The importance of factors such as strictness and timing of government regulations, an abundance of healthcare resources, the presence of a local outbreak, testing availability, and a working contact tracing system are all reflected in the slopes and durations of each country’s models. This tool can be further applied across other nations that have reached farther phases in the outbreak to predict the duration and slopes for countries that are still trying to control the outbreak. https://f1000research.com/slides/9-1363
16 Michael Mayhew Inflammatix, Inc. Computational Challenges and Artificial Intelligence in Precision Medicine Accepted proceedings paper with poster presentation Optimization of Genomic Classifiers for Clinical Deployment: Evaluation of Bayesian Optimization to Select Predictive Models of Acute Infection and In-Hospital Mortality Michael B. Mayhew, Elizabeth Tran, Kirindi Choi, Uros Midic, Roland Luethy, Nandita Damaraju, Ljubomir Buturovic Acute infection, if not rapidly and accurately detected, can lead to sepsis, organ failure and even death. Current detection of acute infection as well as assessment of a patient's severity of illness are imperfect. Characterization of a patient's immune response by quantifying expression levels of specific genes from blood represents a potentially more timely and precise means of accomplishing both tasks. Machine learning methods provide a platform to leverage this 'host response' for development of deployment-ready classification models. Prioritization of promising classifiers is dependent, in part, on hyperparameter optimization (HO), for which a number of approaches including grid search, random sampling and Bayesian optimization have been shown to be effective. We compare HO approaches for the development of diagnostic classifiers of acute infection and in-hospital mortality from gene expression of 29 diagnostic markers. We take a deployment-centered approach to our comprehensive analysis, accounting for heterogeneity in our multi-study patient cohort with our choices of dataset partitioning and hyperparameter optimization objective as well as assessing selected classifiers in external (as well as internal) validation. We find that classifiers selected by Bayesian optimization for in-hospital mortality can outperform those selected by grid search or random sampling. 
However, in contrast to previous research: 1) Bayesian optimization is not more efficient in selecting classifiers in all instances compared to grid search or random sampling-based methods and 2) we note marginal gains in classifier performance in only specific circumstances when using a common variant of Bayesian optimization (i.e. automatic relevance determination). Our analysis highlights the need for further practical, deployment-centered benchmarking of HO approaches in the healthcare context. https://f1000research.com/posters/9-1406
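As a rough illustration of two of the HO strategies compared above, the following sketch runs grid search and random sampling under the same evaluation budget. The `validation_score` function is a hypothetical stand-in for a cross-validated classifier metric, and the hyperparameter names and ranges are assumptions chosen for the example; none of this is taken from the authors' pipeline, and Bayesian optimization is not shown.

```python
import itertools
import random


def validation_score(log_c, log_gamma):
    # Hypothetical stand-in for a cross-validated classifier metric.
    # By construction it peaks at log_c = 1, log_gamma = -2.
    return -((log_c - 1) ** 2 + (log_gamma + 2) ** 2)


def grid_search(log_c_values, log_gamma_values):
    # Evaluate every combination on a fixed grid and keep the best one.
    candidates = itertools.product(log_c_values, log_gamma_values)
    return max(candidates, key=lambda p: validation_score(*p))


def random_search(n_trials, seed=0):
    # Sample the same ranges uniformly at random, using the same budget.
    rng = random.Random(seed)
    candidates = [(rng.uniform(-1, 2), rng.uniform(-3, 0)) for _ in range(n_trials)]
    return max(candidates, key=lambda p: validation_score(*p))


grid_best = grid_search([-1, 0, 1, 2], [-3, -2, -1, 0])  # 16 evaluations
random_best = random_search(n_trials=16)                  # 16 evaluations
```

Which strategy wins depends on how well the grid happens to align with the optimum, which is one reason equal-budget comparisons of this kind are informative.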
18 Constantina Bakolitsa University of California Berkeley Computational Challenges and Artificial Intelligence in Precision Medicine Poster only Findings from the Critical Assessment of Genome Interpretation, a community experiment to evaluate phenotype prediction  Constantina Bakolitsa, Gaia Andreoletti, Roger A Hoskins, Predrag Radivojac, John Moult, Steven E Brenner, CAGI Participants  Interpretation of genomic variation plays an essential role in the analysis of cancer and  monogenic disease, and increasingly in complex trait disease, with applications ranging from basic research to clinical decisions. Yet the field lacks a clear consensus on the appropriate level of confidence to place in variant “impact” and interpretation methods. The Critical Assessment of Genome Interpretation (CAGI, \'kā-jē\) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. CAGI participants are provided genetic variants and make blind predictions of resulting phenotype. Independent assessors evaluate the predictions by comparing with experimental and clinical data.

CAGI has completed five editions with the goals of establishing the state of the art in genome interpretation and of encouraging new methodological developments. Challenges have been predominantly based on human data, mirroring problems in clinical practice and basic research. CAGI has focused on interpreting nonsynonymous variants, splicing variants, structural variation, whole exomes and whole genomes, with phenotypes ranging from molecular and cellular measurements to organismal phenotypes in inherited disease and cancer. Results from previous CAGI experiments have been described in two special issues of Human Mutation.

Each edition of CAGI has revealed new aspects of the methods, with significant progress emerging in several areas. Independent assessment has found that top missense prediction methods are highly statistically significant, but individual variant accuracy is limited. Missense methods tend to correlate better with each other than with experiment. Bespoke approaches often enhance performance. Interpretation of non-coding variants shows promise but is not at the level of missense. In examples using clinical data, predictors identified causal variants overlooked in the initial clinical analysis. CAGI results suggest that running multiple uncalibrated methods and considering their consensus may result in undue confidence, so we advise against this.

Detailed information about CAGI may be found at https://genomeinterpretation.org.

https://f1000research.com/posters/9-1429
20 Yash Pershad Stanford University, Department of Bioengineering Computational Challenges and Artificial Intelligence in Precision Medicine Poster only Genetic information and molecular pathways recapitulate psychiatric drug prescriptions and diagnoses Yash Pershad, Margaret Guo, Russ Altman One in five Americans experiences mental illness, and a majority of psychiatric prescriptions either fail to improve symptoms or cause major side effects. Recent evidence implicates variants disrupting transcription in the pathophysiology of psychiatric diseases. Accordingly, gene expression data may illuminate disease-specific and patient-specific changes in transcription. Here, we classify psychiatric disease from patient gene expression, rank psychiatric drugs by disease from drug gene targets and disease gene sets, and evaluate rankings with clinical prescribing patterns from the UK Biobank (UKBB). From Gene Expression Omnibus, we extract expression data from 145 cases of schizophrenia, 82 cases of bipolar disorder, 190 cases of major depressive disorder, and 307 shared controls. We generate probabilistic pathway scores for each patient. Using Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway scores as features to predict psychiatric disease diagnosis with a random forest model improves accuracy from 56%, obtained with batch-corrected gene expression, to 78%. After curating gene targets for 167 psychiatric drugs and using genes from distinguishing KEGG pathways as disease gene sets, we assessed “overlap” between drug gene targets and disease gene sets to rank drugs by disease. Using protein-protein interaction networks and node2vec, we build a pipeline to rank treatments for psychiatric diseases that achieves a 3.4-fold improvement over a background model. We then validated these drug rankings with clinical prescribing trends in the UKBB. 
By counting co-occurrences of psychiatric diagnoses and psychiatric drug prescriptions, we were able to compare UKBB frequency ranks to predicted rankings. Our rankings correlated with the frequency rankings by Spearman correlation (ρ = 0.67). Thus, we demonstrate that gene-expression-derived pathway features can diagnose psychiatric diseases and that molecular insights derived from this classification task can inform treatment prioritization for psychiatric diseases. Moreover, gene-based rankings recapture clinical population-level prescribing patterns in the UKBB. https://f1000research.com/posters/9-1404
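The rank comparison described above can be sketched with a small, dependency-free Spearman correlation: each list is converted to ranks (ties receive their average rank) and the Pearson correlation of the ranks is returned. The function names are illustrative assumptions, and this is not the authors' code.

```python
from math import sqrt


def average_ranks(values):
    # Rank values from 1..n, giving tied values the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    # Spearman's rho is the Pearson correlation computed on the ranks.
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In this setting, `x` would hold a predicted score per drug and `y` the corresponding diagnosis and prescription co-occurrence count; both inputs are hypothetical here.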
22 Jan Sokol Stanford University Computational Challenges and Artificial Intelligence in Precision Medicine Poster only Conditional random fields improve local ancestry model predictions Jan Sokol, Matthew Aguirre, Guhan Venkataraman, Alexander Ioannidis Local ancestry inference (LAI) identifies the ancestry of each segment of an individual genome and has several uses in medical and population genetics. Many tools have been developed for LAI, including ones using random forests and hidden Markov models. Here, we work to improve an existing model based on the SegNet neural network architecture, which was originally designed for image processing. This tool has the advantage that a trained model can be applied to different input datasets (e.g. multiple cohorts with varied array designs) without the need for re-training, because of SegNet’s positional invariance. However, this model often fails to predict genomic ancestries in regions in which haplotypes are not associated with ancestry. To correct this, we add a conditional random field (CRF) to post-process the SegNet’s predictions using sequential information, which the SegNet cannot exploit. The CRF is trained on the SegNet’s predictions for admixed samples simulated from 372 real chromosome 20 haplotypes. Further, to reduce computational complexity, we collapsed windows of varying sizes into a single position by replacing variant-level probabilities with the mean predicted probability for each ancestry over all positions in each window. We found that the CRF boosts the classification accuracy of genetic variants on haplotypes made by 10 generations of simulated admixture from 85.6% to 95.1%, and that pooling the input dataset using windows of up to 500 positions did not decrease accuracy. Training the CRF took under two minutes. Using these data, we validate that our tool for predicting genomic ancestries can be applied to different input datasets without re-training. https://f1000research.com/posters/9-1417
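The window-pooling step described above can be illustrated with a short sketch that replaces each block of per-position ancestry probabilities with the per-ancestry mean. The function name and the list-of-lists input layout are assumptions made for the example, not the authors' implementation.

```python
def pool_windows(position_probs, window):
    """Collapse consecutive positions into windows of mean probabilities.

    position_probs: list with one entry per variant position; each entry is a
    list of predicted probabilities, one per ancestry (assumed layout).
    window: number of positions per window; the last window may be shorter.
    """
    pooled = []
    for start in range(0, len(position_probs), window):
        chunk = position_probs[start:start + window]
        n_ancestries = len(chunk[0])
        # Average each ancestry's probability over the positions in the window.
        pooled.append(
            [sum(p[a] for p in chunk) / len(chunk) for a in range(n_ancestries)]
        )
    return pooled
```

Pooling with a window of 500 shortens the sequence passed to the CRF by a factor of up to 500, which is where the reduction in computational cost comes from.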
24 Arda Durmaz Case Western Reserve University Pattern Recognition in Biomedical Data for Discovery Accepted proceedings paper with poster presentation Frequent Subgraph Mining of Functional Interaction Patterns Across Multiple Cancers Arda Durmaz, Tim A. D. Henderson, Gurkan Bebek Molecular mechanisms characterizing cancer development and progression are complex and process through thousands of interacting elements in the cell. Understanding the underlying structure of interactions requires the integration of cellular networks with extensive combinations of dysregulation patterns. Recent pan-cancer studies focused on identifying common dysregulation patterns in a confined set of pathways or targeting a manually curated set of genes. However, the complex nature of the disease presents a challenge for finding pathways that would constitute a basis for tumor progression and requires evaluation of subnetworks with functional interactions. Uncovering these relationships is critical for translational medicine and the identification of future therapeutics. We present a frequent subgraph mining algorithm to find functional dysregulation patterns across the cancer spectrum. We mined frequent subgraphs coupled with biased random walks utilizing genomic alterations, gene expression profiles, and protein-protein interaction networks. In this unsupervised approach, we have recovered expert-curated pathways previously reported for explaining the underlying biology of cancer progression in multiple cancer types. Furthermore, we have clustered the genes identified in the frequent subgraphs into highly connected networks using a greedy approach and evaluated biological significance through pathway enrichment analysis. Gene clusters further elaborated on the inherent heterogeneity of cancer samples by both suggesting specific mechanisms for cancer type and common dysregulation patterns across different cancer types. 
Survival analysis of sample level clusters also revealed significant differences among cancer types (p < 0.001). These results could extend the current understanding of disease etiology by identifying biologically relevant interactions. https://f1000research.com/posters/9-1435
26 Li-Wei Chang University of Pittsburgh Medical Center Pattern Recognition in Biomedical Data for Discovery Poster only RNAhybrid-M: An enhanced algorithm for predicting microRNA-RNA binding sites Li-Wei Chang, Ching-Jung Lu, Julide T. Celebi Mounting evidence supports a role for dysregulation of long non-coding RNAs (lncRNA) in the development of human diseases. A recently discovered function of lncRNAs is to act as “microRNA decoys”. These microRNA (miR) decoys sequester miR and upregulate genes that would otherwise be suppressed by miR. As new miRs and lncRNAs are discovered, a fast and efficient algorithm for miR and RNA interaction site prediction is an important tool to study non-coding gene regulatory networks. RNAhybrid is a previously developed dynamic programming algorithm for miR-RNA binding site prediction. RNAhybrid uses free energy parameters to characterize low energy hybrid structures between two RNA molecules. The limitation of RNAhybrid is that it identifies only the single best miR binding site, the one with the lowest minimum free energy (MFE). To find the second-best binding site, one needs to mask the previously identified site and run RNAhybrid again, iteratively. Such masking is inefficient and prevents identification of overlapping binding sites. Here we modified the original RNAhybrid algorithm to overcome these limitations. We introduced a new dynamic programming traceback matrix to record the hybridization structure with the lowest MFE ending at each base position in the query RNA. A single survey of this new matrix allowed for identification of all miR binding sites with MFE lower than a given cutoff. The search time of the modified algorithm, termed RNAhybrid-M (“multiple-hit RNAhybrid”), is improved from the scale of n squared to the scale of n, because it only searches the query RNA once. RNAhybrid-M also returns more complete results than the original algorithm because it reports all potential miR binding sites, including those that overlap each other. 
In conclusion, RNAhybrid-M is an improved version of RNAhybrid that allows for fast, efficient, and complete search of miR hybridization sites in RNA sequences.  https://www.dropbox.com/s/0gl1ngxgr1t6x0q/PSB2021_poster.pdf?dl=0
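The idea of recording the best energy ending at each query position and then scanning that record once can be sketched as follows. This is a deliberately simplified toy: the pair energies, mismatch and gap penalties are invented for illustration and are not RNAhybrid's thermodynamic parameters, loops and bulges are reduced to flat gap costs, and the function names are hypothetical.

```python
# Toy energies (assumptions for illustration only; more negative = more stable).
PAIR_ENERGY = {
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "C"): -3.0, ("C", "G"): -3.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}
MISMATCH = 1.0
GAP = 2.0


def best_energy_per_end(mirna, target):
    """Lowest toy energy of any local hybrid ending at each target position."""
    probe = mirna[::-1]  # the miR binds antiparallel to the target
    prev = [0.0] * (len(target) + 1)
    best_end = [0.0] * len(target)
    for base in probe:
        curr = [0.0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            pair = PAIR_ENERGY.get((base, target[j - 1]), MISMATCH)
            # Local recurrence: restart (0), extend a pair, or open a gap.
            curr[j] = min(0.0, prev[j - 1] + pair, prev[j] + GAP, curr[j - 1] + GAP)
            best_end[j - 1] = min(best_end[j - 1], curr[j])
        prev = curr
    return best_end


def all_sites(mirna, target, cutoff):
    """One pass over the per-position record reports every hit at or below cutoff."""
    energies = best_energy_per_end(mirna, target)
    return [(end, e) for end, e in enumerate(energies) if e <= cutoff]
```

Because every end position keeps its own best energy, no masking and re-running is needed, and hits whose footprints overlap are reported side by side.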
28 Enrico Glaab University of Luxembourg Pattern Recognition in Biomedical Data for Discovery Poster only Network dysregulation analysis in complex diseases using mathematical programming Nikos Vlassis, Enrico Glaab Complex diseases such as neurodegenerative or cancer disorders are characterized by deregulations in multiple genes and proteins. Previous research has shown that neighboring genes in a molecular network tend to undergo coordinated expression changes. We describe an approach that allows identifying such jointly deregulated genes from input expression data and a graph encoding pairwise functional associations between genes or proteins (such as gene regulatory interactions or protein-protein interactions). We cast this as a feature selection problem in penalized two-class (cases vs. controls) classification, and we propose a novel Pairwise Elastic Net penalty that favors the selection of discriminative genes according to their connectedness in the interaction graph. Experiments on microarray gene expression data for Parkinson’s disease demonstrate marked improvements in feature grouping over competitive methods. http://enricogl.bplaced.net/poster_genepen_2020_psb.pdf
30 Mahmoud Ahmed Gyeongsang National University General Poster only Integrating binding and expression data to predict transcription factors combined function Mahmoud Ahmed, Deok Ryong Kim Transcription factor binding to the regulatory region of a gene induces or represses its expression. Transcription factors share their binding sites with other factors, co-factors, and/or DNA-binding proteins. These proteins form complexes which bind to the DNA as single units. The binding of two factors to a shared site does not always lead to a functional interaction. We propose a method to predict the combined functions of two factors using comparable binding and expression data (target). We based this method on binding and expression target analysis (BETA), which we re-implemented in R and extended for this purpose. target ranks the factor's targets by importance and predicts the dominant type of interaction between two transcription factors. We applied the method to simulated and real datasets of transcription factor-binding sites and gene expression under perturbation of factors. We found that Yin Yang 1 transcription factor (YY1) and YY2 have antagonistic and independent regulatory targets in HeLa cells, but they may cooperate on a few shared targets. We developed an R package and a web application to integrate binding (ChIP-seq) and expression (microarrays or RNA-seq) data to determine the cooperative or competitive combined function of two transcription factors. https://doi.org/10.6084/m9.figshare.13323413.v1
32 Timothy Bergquist Sage Bionetworks General Poster only A framework for studying machine learning methods in healthcare: The First EHR DREAM Challenge Timothy Bergquist, Thomas Schaffter, Yao Yan, Thomas Yu, Vikas Pejaver, Noah Hammarlund, Justin Prosser, Sean Mooney, Justin Guinney Implementation of machine learning-based methods in healthcare is of high interest and has the potential to have a major impact on patient care. However, real-world accuracy and outcomes from the application of these methods remain largely unknown, and performance on different subpopulations of patients also remains unclear. In order to address these important questions, we hosted a community challenge to evaluate different methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as it is quantitative and clinically unambiguous. In order to overcome patient privacy concerns, we employed a Model-to-Data approach, allowing citizen scientists and researchers to train and evaluate machine learning models on private health data without access to that data. In total, we had 345 registered participants, coalescing into 25 independent teams, spread over 3 continents and 10 countries. The top performing team achieved an area under the receiver operating characteristic curve of 0.947 (95% CI 0.942, 0.951) and an area under the precision-recall curve of 0.487 (95% CI 0.458, 0.499) on all patients over a one-year observation period in a large health system. In a follow-up phase of the challenge, we evaluated the generalizability of models across different patient populations, revealing that models differ in accuracy on subpopulations, such as race, even when they are trained on the same data and have similar accuracy on the population. 
This is the broadest community challenge focused on the evaluation of state-of-the-art machine learning methods in healthcare performed to date and shows the importance of prospective evaluation and collaborative development of individual models. https://f1000research.com/posters/9-1426
34 Pei-Ju Chin U.S. Food and Drug Administration General Poster only A non-redundant, reference virus database (RVDB) to facilitate high-throughput sequencing (HTS) bioinformatics analysis for detection of known and novel viruses Pei-Ju Chin, Arifa S. Khan Introduction: HTS bioinformatics detection of distantly related viruses is challenging because the publicly available databases do not include all viral sequences, and the viral sequences they do contain are obscured by a large amount of cellular content. Therefore, we undertook efforts to create a non-redundant, reference viral database (RVDB) that would include all viral-related sequences, with a reduced cellular sequence content. This was done in consultation with the Advanced Virus Detection Technologies Interest Group, which includes scientists from industry, regulatory and other government agencies (including NCBI), technology providers, and academia.

Methods: The details for developing RVDB have been published [Goodacre et al., mSphere, 2018]. The overall strategy included: a) development of a keyword screen for semantic selection of all viral-related sequences from GenBank, regardless of size; b) verification of viral identity using various bioinformatics tools (BLAST and HMMER); and c) clustering at 98% sequence identity using CD-HIT-EST to remove redundancy.
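
Step c) can be illustrated with a toy sketch of greedy clustering at a 98% identity threshold. This is not the RVDB pipeline and not CD-HIT-EST's algorithm: the `identity` function below is a crude stand-in built on Python's `difflib`, and all function names are assumptions made for illustration. It only shows the general idea of keeping one representative per group of near-identical sequences.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Crude identity estimate: characters in matching blocks divided by the
    length of the shorter sequence. A stand-in for alignment-based identity."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float = 0.98) -> Dict[str, List[str]]:
    """Process sequences from longest to shortest. Each sequence joins the
    first representative it matches at or above the threshold; otherwise it
    becomes the representative of a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


if __name__ == "__main__":
    ref = "ACGT" * 25                   # toy 100-base sequence
    near = ref[:50] + "T" + ref[51:]    # one substitution, 99% identical
    far = "TTTTGGGG" * 10               # unrelated toy sequence
    print(greedy_cluster({"ref": ref, "near": near, "far": far}))
```

In CD-HIT-EST itself the identity threshold is set with the `-c` option (for example `-c 0.98`); the exact settings used for RVDB are described in the cited publication.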

Results: RVDB contains complete viral genomes as well as partial viral sequences. The unclustered (U) and clustered (C) versions of RVDB are publicly available at the University of Delaware RVDB Site (https://rvdb.dbi.udel.edu/). The nucleotide RVDBs were converted to protein databases by Marc Eloit and Thomas Bigot (available at http://rvdb-prot.pasteur.fr/ and at the University of Delaware RVDB Site). The updated RVDB v20.0 contains 3,180,577 sequences in U-RVDB and 789,728 sequences in C-RVDB. The Python scripts to construct RVDB can be found on GitHub (https://github.com/ArifaKhanLab/RVDB).

Conclusions: RVDB is expected to enhance HTS bioinformatics for detection of known and distantly-related viruses due to the presence of all viral-related sequences including sub-genomic viral fragments as well as endogenous retroviruses and retrotransposons.

https://doi.org/10.7490/f1000research.1118368.1
36 William Hairfield University of Washington General Poster only Perspectives and Concerns Regarding "Ground Truth" and "Gold Standard" as Applied to Artificial Intelligence and Machine Learning in the Health Sciences. William Hairfield This paper reviews and explores recurring perspectives and concerns regarding "ground truth" (GT) and "gold standard" (GS) in applications of "artificial intelligence" (AI) and "machine learning" (ML) in medicine, biology, and the health sciences. The first reference to the "randomized clinical trial" (RCT) as a "gold standard" appeared in a December 1982 paper in The New England Journal of Medicine (NEJM) by Alvan Feinstein and Ralph Horwitz. As described by Jones and Podolsky: "… not as a gold standard that all research must strive to attain, but as an elusive ideal in many circumstances. Their article was actually a brief in support of the rigorous conduct of other clinical epidemiological research designs…" [Jones, D. 2015]. Owing to encouragement and funding from NIH, the RCT became identified as the GS for any kind of diagnostic or therapeutic trial in medicine. Originally, the term "ground truth" was used in fields such as map making and meteorology to refer to information provided by direct observation collected "on the ground". The most critical and pessimistic end of the assessment spectrum regarding the use of GS and GT in AI and ML in medicine is captured by the following position: …“This scheme, like all other schemes for … artificial intelligence applications in medicine … relies on speculative technology, does not in its current form take into account all possible sources of noise … including loss of "meaning" when transcribing artifacts from actual patient encounters to machine readable formats, unreliability, uncertainty, human behaviors, and human-to-machine "knowledge" interpretation mismatching and probably will not work as expected.” [see text details; Landauer, R. 1990]. However, future elucidation, discourse, and research concerning GS and GT will advance us toward the enlightening, optimistic end of the spectrum. https://drive.google.com/file/d/1wPiBJ37Aj28xSJwxYUPQGsYWJSzQWgqT/view?usp=sharing
38 Zhiqiang Hu University of California, Berkeley General Poster only Biological discovery and consumer genetics activate latent privacy risk in omics data Zhiqiang Hu, Steven E. Brenner Recent research studies and forensics have underscored the ability to re-identify a person using genomically identified relatives and quasi-identifiers, such as sex, birthdate, and zip code. The privacy risk from individuals’ genomes has therefore become a practical concern. However, the risks arising from other omics data are not widely appreciated.

Though studies have shown summary omics data, such as gene expression values, can be matched to the corresponding genomic data in small populations, they are still generally treated as safe to share (e.g., in the GTEx project, gene expression values are unrestrictedly public, while sequence data have controlled access), because most research genomic data are under controlled access and the chance of having both genomic data and omics data of a person publicly available is low. The linkage of genomic data and omics data is achieved by inferring genotypes from gene expression values using public expression quantitative trait loci, and then matching inferred genotypes to a library of genomes.
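
The two-step linkage described above (infer genotypes from expression, then match against a genome library) can be sketched as follows. This is a deliberately simplified illustration, not the authors' method: it uses hard expression cut-offs per eQTL, whereas a real attack would likely use a probabilistic model, and every name, threshold, and data value here is hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical eQTL record: (gene, snp, low_cut, high_cut). Expression below
# low_cut is read as 0 copies of the expression-increasing allele, between the
# cut-offs as 1 copy, and above high_cut as 2 copies.
Eqtl = Tuple[str, str, float, float]


def infer_genotypes(expression: Dict[str, float], eqtls: List[Eqtl]) -> Dict[str, int]:
    """Predict an allele count (0, 1 or 2) per SNP from gene expression."""
    predicted: Dict[str, int] = {}
    for gene, snp, low_cut, high_cut in eqtls:
        if gene not in expression:
            continue  # gene not measured in this sample
        value = expression[gene]
        predicted[snp] = 0 if value < low_cut else (1 if value < high_cut else 2)
    return predicted


def match_to_library(predicted: Dict[str, int],
                     library: Dict[str, Dict[str, int]]) -> Tuple[str, float]:
    """Return the library genome with the highest fraction of SNPs whose
    allele count equals the predicted one, together with that fraction."""
    def concordance(genome: Dict[str, int]) -> float:
        shared = [snp for snp in predicted if snp in genome]
        if not shared:
            return 0.0
        return sum(predicted[s] == genome[s] for s in shared) / len(shared)

    best = max(library, key=lambda pid: concordance(library[pid]))
    return best, concordance(library[best])


if __name__ == "__main__":
    eqtls = [("GENE_A", "rs1", 1.0, 2.0), ("GENE_B", "rs2", 1.0, 2.0),
             ("GENE_C", "rs3", 1.0, 2.0)]
    sample = {"GENE_A": 0.5, "GENE_B": 1.5, "GENE_C": 2.5}
    library = {"p1": {"rs1": 0, "rs2": 1, "rs3": 2},
               "p2": {"rs1": 2, "rs2": 1, "rs3": 0}}
    print(match_to_library(infer_genotypes(sample, eqtls), library))
```

With only a few eQTLs many genomes would tie; the discriminating power comes from combining a large number of such weakly informative SNPs.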

In this study, we systematically assessed the potential of linking omics data to genomic data. We found that gene expression or DNase hypersensitive site data, on average, can be matched to unique genomes from the world population. The theoretical concern for re-identification becomes practically relevant, since consumer genealogy databases now serve as a genome library to match against. Our results indicated that by uploading predicted genotypes to consumer genetics databases, a person could be re-identified purely from their omics summary data. The ability to link sets of quasi-identifiers can reveal a research participant’s identity and protected health information.

Critically, risks from omics data increase over time, activated by new techniques, new knowledge, and new databases. The need to preserve individuals’ genomic privacy for their lifetime and beyond (for descendants and relatives) poses unique challenges to the effective sharing of high-throughput molecular data.

https://compbio.berkeley.edu/people/huz/gp/HuZ_PSB_2020_Genome_Privacy_Poster.pdf