PSB 2003 Tutorial
We have entered an era in genomics and genetics in which the generation of information is far
outpacing our ability to understand its implications for the prevention, diagnosis, and treatment
of common multifactorial human diseases. The availability of vast quantities of genetic information
has presented geneticists and computational biologists with several important computational and
statistical challenges. The first of these challenges is the variable selection problem. This
problem stems from the growing realization that interactions among multiple genetic and
environmental factors are likely to be more important than any one factor for predicting
risk of common multifactorial diseases. Given that interactions play an important role in
disease etiology, we need to consider combinations of genetic variations or gene
expression variables, rather than one variable at a time, in our genetic analyses.
The problem is that, when the number of variables is large, the number of combinations
that could be evaluated grows combinatorially and quickly becomes too large to search
exhaustively.
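To give a rough sense of the scale of the variable selection problem, the following minimal sketch counts how many distinct combinations of a given size can be drawn from a set of variables. The figure of 1,000 variables and the function name count_combinations are illustrative assumptions, not values or methods taken from the tutorial.

```python
from math import comb


def count_combinations(n_variables: int, order: int) -> int:
    """Number of distinct sets of `order` variables chosen from `n_variables`."""
    return comb(n_variables, order)


if __name__ == "__main__":
    n_variables = 1000  # hypothetical number of SNPs or expression variables
    for order in range(1, 6):
        total = count_combinations(n_variables, order)
        print(f"order {order}: {total:,} combinations")
```

Under this assumption there are already 499,500 pairs and 166,167,000 triples to evaluate, which is why exhaustive evaluation becomes impractical as the number of variables or the order of interaction increases.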
The second challenge is the statistical modeling problem. That is, what is the most appropriate way to model the relationship between combinations of genetic variations or gene expression variables and clinical endpoints? Logistic regression is a parametric statistical approach for relating one or more independent or explanatory variables to a dependent or outcome variable (e.g., disease status) that follows a binomial distribution. However, the number of possible interaction terms grows exponentially as each additional main effect is included in the logistic regression model. Thus, logistic regression is limited in its ability to model interactions among many variables. Alternative approaches such as multifactor dimensionality reduction, cellular automata, and symbolic discriminant analysis are flexible, nonparametric, and genetic model-free, but they are more computationally demanding. As with the variable selection problem, the number of possible model forms is effectively unbounded.
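The exponential growth in interaction terms can be made concrete with a short sketch. It enumerates every possible interaction term (order two and higher) among a set of main effects and checks the count against the closed form 2^n - n - 1. The function names and the variable labels are hypothetical and serve only as an illustration; this is not a fitted logistic regression model.

```python
from itertools import combinations


def interaction_terms(main_effects):
    """List every interaction term (order 2 and above) among the main effects."""
    terms = []
    for order in range(2, len(main_effects) + 1):
        terms.extend(combinations(main_effects, order))
    return terms


def count_interaction_terms(n_main_effects: int) -> int:
    """Closed form: all subsets, minus the empty set and the n single effects."""
    return 2 ** n_main_effects - n_main_effects - 1


if __name__ == "__main__":
    # Hypothetical main effects, e.g. three SNPs.
    for term in interaction_terms(["SNP1", "SNP2", "SNP3"]):
        print(" x ".join(term))
    for n in (3, 5, 10, 20):
        print(f"{n} main effects: {count_interaction_terms(n):,} interaction terms")
```

With only 10 main effects a fully specified model would already have 1,013 candidate interaction terms, far more than most studies have the sample size to estimate.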
The goals of this tutorial are to 1) review the importance of gene-gene and gene-environment interactions for understanding the etiology of common multifactorial diseases, 2) review the variable selection and statistical modeling problems for the detection and characterization of gene-gene and gene-environment interactions, and 3) review new computational approaches for dealing with these challenges.
Jason H. Moore is Assistant Professor of Molecular Physiology and Biophysics in the Program in Human Genetics at Vanderbilt University Medical School. He is Director of the Program in Human Genetics Bioinformatics Core and is Co-Founder and Co-Director of the Vanderbilt Multi-Processor Integrated Research Engine (VAMPIRE), a Beowulf computer cluster. Dr. Moore's research focuses on the development and application of new statistical and computational methods for identifying and characterizing gene-gene interactions in common, complex multifactorial human diseases. Recent work has included development of the multifactor dimensionality reduction (MDR) method and a cellular automata (CA) method for identifying high-order interactions among single-nucleotide polymorphisms (SNPs). Additional methodological work has included development of symbolic discriminant analysis (SDA) and dynamics-based pattern recognition (DBPR) for microarray data analysis and spectrogram alignment using genetic algorithms (SAGA) for processing proteomics data from MALDI mass spectrometry. Applied work includes genetic studies of sporadic breast cancer, lung cancer, gastrointestinal cancer, hypertension, cardiovascular disease, depression, autoimmune disease, diabetes, Alzheimer disease, and pharmacological endpoints in addition to the establishment of a population-based resource in Ghana, Africa for genetic studies of arterial thrombosis. Dr. Moore holds a Ph.D. and M.S. in Human Genetics and an M.A. in Applied Statistics from the University of Michigan. He has been a recipient of the James V. Neel Young Investigator Award from the International Genetic Epidemiology Society for his work on microarray data analysis methods and has held the position of President of the Middle Tennessee Chapter of the American Statistical Association. He also serves on the scientific advisory board of GenoMed Inc.
For the last three years, Dr. Moore has given tutorials on biostatistics, quantitative genetics, and microarray data analysis for the Genetic Analysis of Complex Human Diseases workshop at Duke University and has taught courses on biostatistics, human genetics, and bioinformatics at Vanderbilt University. Recent invitations for lectures have come from the International Genetic Epidemiology Society, American Society of Human Genetics, Yale University, National Institutes of Health, European Conference on Machine Learning, World Congress on Psychiatric Genetics, University of Pennsylvania, University of Minnesota, Pacific Symposium on Biocomputing, Genetic and Evolutionary Algorithm Conference, Joint Statistical Meetings, Parallel Problem Solving from Nature conference, University of Alabama, University of Tennessee, Duke University, GlaxoSmithKline, and California State University.