PSB - Abstract

Democratizing data science through data science training

John Darrell Van Horn¹, Lily Fierro², Jeana Kamdar¹, Jonathan Gordon², Crystal Stewart¹, Avnish Bhattrai¹, Sumiko Abe¹, Xiaoxiao Lei¹, Caroline O'Driscoll¹, Aakanchha Sinha², Priyambada Jain², Gully Burns², Kristina Lerman², José Luis Ambite²

¹USC Mark and Mary Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California
²Information Sciences Institute, University of Southern California
Email: jvanhorn@usc.edu, jeana.kamdar@ini.usc.edu, crystal.stewart@ini.usc.edu, avnish.bhattrai@ini.usc.edu, Sumiko.abe@ini.usc.edu, xiaolei@ini.usc.edu, caroline.odriscoll@ini.usc.edu, ambite@isi.edu, lfierro@isi.edu, jgordon@isi.edu, burns@isi.edu, lerman@isi.edu, priyambj@isi.edu

Pacific Symposium on Biocomputing 23:292-303 (2018)

© 2018 World Scientific
Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution (CC BY) 4.0 License.

Abstract

The biomedical sciences have experienced an explosion of data that promises to overwhelm many current practitioners. Without easy access to data science training resources, biomedical researchers may find themselves unable to wrangle their own datasets. In 2014, to address the challenges posed by such a data onslaught, the National Institutes of Health (NIH) launched the Big Data to Knowledge (BD2K) initiative. To this end, the BD2K Training Coordinating Center (TCC; bigdatau.org) was funded to facilitate both in-person and online learning, and to open up the concepts of data science to the widest possible audience. Here, we describe the activities of the BD2K TCC and its focus on the construction of the Educational Resource Discovery Index (ERuDIte), which identifies, collects, describes, and organizes online data science materials from BD2K awardees, open online courses, and videos from scientific lectures and tutorials. ERuDIte now indexes over 9,500 resources. Given the richness of online training materials and the constant evolution of biomedical data science, computational methods applying information retrieval, natural language processing, and machine learning techniques are required: in effect, using data science to inform training in data science. In so doing, the TCC seeks to democratize novel insights and discoveries brought forth via large-scale data science training.
