Command Line to PipeLine: Cross-Biobank Analyses with Nextflow

About the Workshop

The rapid expansion of biobank data marks a significant milestone in human genetics research, offering unparalleled opportunities to study genetic predisposition to complex diseases. Platforms and tools exist for running complex, multimodal analyses on these datasets, but educational workshops that teach researchers to use them remain an unmet need. Fragmented data and incompatible tools create a technical maze that impedes research and slows collaboration. In this workshop, we will equip participants with the skills to fully exploit biobank resources, bridging the gap between the abundance of available data and the capacity for research innovation. To that end, we will introduce participants to workflow languages such as Nextflow, which help overcome the technical hurdles of cross-biobank analyses. Nextflow offers a platform-agnostic framework that lets you analyze data across diverse computing environments (local servers, high-performance computing clusters, and the cloud) using scripts and code you are already familiar with.

Why adopt workflow languages?

  • Abstraction: Nextflow hides the complexities of different platforms, letting you focus on the science and interpretation. Write code in your preferred language and Nextflow handles the rest.
  • Scalability and Reproducibility: Easily deploy your workflows without modification, ensuring consistent results across computing environments.
  • Containerization: Leverage Docker and Singularity to ensure your analyses run smoothly and avoid conflicts across different systems.
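The abstraction idea in the list above can be illustrated with a toy Python sketch. This is not Nextflow syntax or its API; in Nextflow the execution target is chosen through configuration rather than code. The `Step` class, the `render` function, the `run_gwas.sh` script, and the `example/gwas-tools:1.0` image are all hypothetical names invented for illustration. The point is only that the analysis command is written once and the surrounding wrapper changes with the chosen environment.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One analysis step, described independently of where it runs."""
    name: str       # label for the step (hypothetical)
    command: str    # the analysis command itself (hypothetical script)
    container: str  # image that pins the software environment (hypothetical)


def render(step: Step, profile: str) -> str:
    """Wrap the same step command for a given execution profile."""
    if profile == "local":
        # Run the command directly on the current machine.
        return step.command
    if profile == "docker":
        # Run the command inside the step's container image.
        return f"docker run --rm {step.container} {step.command}"
    if profile == "slurm":
        # Submit the command as a batch job to a Slurm cluster.
        return f"sbatch --job-name={step.name} --wrap={shlex.quote(step.command)}"
    raise ValueError(f"unknown profile: {profile}")


# Illustrative step; none of these names come from the workshop materials.
gwas = Step(
    name="gwas",
    command="run_gwas.sh cohort.bed",
    container="example/gwas-tools:1.0",
)

if __name__ == "__main__":
    for profile in ("local", "docker", "slurm"):
        print(f"{profile:>6}: {render(gwas, profile)}")
```

The step definition never changes; only the profile passed to `render` does. A workflow manager applies the same separation at a much larger scale, adding scheduling, file staging, and resumption on top.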

The Location is the Icing on the Cake: Beautiful Hawaii!

Moonless Starry Sky Over Mauna Kea Observatory

Sunny South Kona Beach

Kīlauea's summit caldera, Hawaiʻi Volcanoes National Park

Learning Objectives

By the end of this workshop, participants will have the knowledge and skills to develop and deploy scalable and reproducible genomic workflows, navigate the complexities of cloud-based platforms, and conduct meaningful cross-biobank analyses to advance their research projects. The specific objectives are to:

  • Learn the foundational principles of genomic workflows using existing tools and resources.
  • Explore the limitations and challenges inherent in working across different computing infrastructures.
  • Understand the advantages of workflow languages such as Nextflow for bioinformatics.
  • Develop the skills to run platform-agnostic workflows and integrate diverse data types seamlessly across biobanks.

Workshop Outline

Workshop Format: This interactive workshop combines presentations, demonstrations, hands-on tutorials, and discussions with our bioinformatics experts.

  • Informative Presentations: Gain a solid foundation in biobank data analysis and Nextflow functionalities.
  • Demonstrations: See real-world examples of configuring and running large-scale pipelines.
  • Hands-on Tutorials: Develop your skills by building your own Nextflow workflows under expert guidance.
  • Interactive Exercises: Practice your newfound skills individually and collaboratively, with on-demand support from our team.
  • Discussions: Take part in ongoing conversations about best practices for cross-biobank analyses, with an emphasis on reproducibility, scalability, and security/privacy.

Introduction to Cloud-based Platforms and Pipeline Managers

  • Discussion on the expansion of biobanks and the necessity for cloud-agnostic workflows for advancing genomic studies.
  • The role of cloud-based platforms and institutional biobanks in advancing genomic studies.
  • Introduction to pipeline managers like Nextflow, focusing on their role in enabling cross-platform computing, portability, and reproducibility.

Module 1: Genomic Pipelines for Biobanks: Development and Deployment

  • A detailed guide on how to deploy analysis pipelines across different computing infrastructures, including high-performance computing and cloud-based platforms (DNAnexus for UK Biobank (UKB) analysis, and the All of Us Researcher Workbench).
  • A demonstration of how to use and interpret the genome-wide association study (GWAS) and polygenic score (PGS) pipelines built by our team.

Module 2: Developing Your Own Workflows

  • Introduction to cloud-agnostic workflow languages with a focus on demystifying Nextflow pipeline management concepts.
  • Hands-on Tutorial: A gentle introduction for intermediate command line users to start their own workflow development.
  • Individual Exercise: Turning your local pipeline into a deployable Nextflow workflow.

Module 3: Overcoming Limitations of Working Across Biobanks & Cloud Platforms

  • Resources for overcoming common hurdles, including a compilation of materials on our GitHub repository and strategies for interdisciplinary collaboration.
  • Group Exercise: Deploying a workflow across cloud environments and coding collaboratively with Google Cloud Shell.
  • Discussion: A conversation with our team about challenges and best practices for unifying and scaling your pipelines.

Workshop Organizers

Anurag Verma, PhD, University of Pennsylvania. Anurag is an Assistant Professor in the Department of Medicine at the University of Pennsylvania and serves as Associate Director of Clinical Informatics and Genomics for the Penn Medicine BioBank (PMBB). His research uses big data techniques to study the genetic basis of complex diseases, with a main focus on the genetic architecture of multimorbidity, the phenotypic architecture of common genetic risk, polygenic risk scores, and phenome-wide association studies that identify the complex phenotypic and genomic interactions leading to complex disease. At PMBB, Anurag leads a team called CodeWorks that develops scalable workflows and harnesses both in-house and cloud computing resources to advance genetic research. His team works to expand the boundaries of how data informatics can be applied, keeping pace with the rapidly changing landscape of large-scale biobanks.

Lindsay Guare, University of Pennsylvania. Lindsay is a second-year PhD student in the Genomics and Computational Biology Program at UPenn with a focus on Biomedical Informatics. She has been involved in many large-scale genetic association study collaborations, and her own research will focus on leveraging innovative computational data science approaches to explore clinical and genetic heterogeneity in endometriosis. Her interdisciplinary background includes computer science, which contributes to her leadership in CodeWorks.

Katie Cardone, University of Pennsylvania. Katie is a Research Specialist in the Department of Genetics at the University of Pennsylvania and a graduate student in the University of Pennsylvania’s Master of Biomedical Informatics Program. In her role, Katie executes a wide range of bioinformatic analyses, including genome-wide association studies, phenome-wide association studies, exome-wide rare variant association studies, and polygenic score analyses, on large biobanks such as the Penn Medicine BioBank, the eMERGE network, and the All of Us Research Program. She also develops Nextflow pipelines for polygenic score tools.

Christopher Carson, MS, University of Pennsylvania. Chris is a Bioinformatician at the University of Pennsylvania Institute for Biomedical Informatics. His role in the Verma lab spans workflow pipeline development, genetic analysis requests for the Penn Medicine BioBank (PMBB), and bioinformatics software for analyzing large-scale genomic and phenomic datasets. He has experience conducting genome-wide, phenome-wide, and exome-wide association studies on the large-scale PMBB datasets using SAIGE.

Zachary Rodriguez, PhD, University of Pennsylvania. Zach is a Bioinformatician at the University of Pennsylvania’s Perelman School of Medicine. His research uses big data techniques to study the genetic basis of complex diseases, with a main focus on the genetic architecture of multimorbidity, the phenotypic architecture of common genetic risk, polygenic risk scores, and phenome-wide association studies that identify the complex phenotypic and genomic interactions leading to complex disease. He has informatics expertise in machine learning, natural language processing, and pipeline development, with extensive experience analyzing large-scale genomic data, electronic health records (EHR), and biobank datasets, including the Penn Medicine BioBank.