Data Analysis Tools for DNA Microarray data

Sorin Draghici,
BioDiscovery Inc. and Computer Science Department, Wayne State University

You can view the actual talk slides in PDF format:
Slide set 1
Slide set 2
Slide set 3
Slide set 4
Slide set 5
Slide set 6
Slide set 7
Slide set 8
Slide set 9
Slide set 10
Slide set 11

The state of the art in a number of biological disciplines is characterized by a wealth of advanced tools and techniques able to produce raw data of biological significance at unprecedented speed and level of detail. Most researchers agree that the challenge of the near future is to analyze, interpret and understand all the data being produced. In essence, the challenge faced by biologists is to use this large-scale data to discover and understand fundamental biological phenomena. At the same time, the challenge faced by computer scientists is to develop new algorithms and techniques to support such discoveries. DNA microarrays are a typical example of such a tool, able to provide a very large amount of data in a very short time. Typically, a microarray will contain a few thousand or tens of thousands of genes hybridized in a single experiment. However, the immense potential of this technology can only be realized if many such experiments are done and expression levels are compared between species, between healthy and ill individuals, or at different time points for the same individual or population of individuals. Such large-scale experiments will produce huge amounts of data, and the use of suitable analysis tools becomes a crucial issue.

This tutorial will present the main problems that need to be addressed when processing data coming from DNA microarray experiments. The talk will illustrate the typical information flow in the design of such an experiment, together with the types of data that need to be stored at each stage. Important (and often overlooked) issues such as data cleansing, data pre-processing, normalization and quality assessment will be discussed in detail. Techniques such as hierarchical clustering, k-means clustering, principal component analysis (PCA), scatterplots, histograms, etc., will also be presented. The discussion will also include more subtle issues such as the reliability of the conclusions drawn and the statistical confidence obtained by using replicated genes and experiments.
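As a small illustration of the kind of pre-processing discussed above (this sketch is not part of the tutorial materials; the gene names and intensity values are synthetic), raw microarray intensities are commonly log2-transformed and then normalized per array, for example by median-centering:

```python
import math

def log2_transform(intensities):
    """Log2-transform raw intensities; expression ratios are
    usually analyzed on a log scale so that up- and down-regulation
    by the same factor are symmetric around zero."""
    return [math.log2(max(v, 1.0)) for v in intensities]

def median_center(values):
    """Normalize one array by subtracting its median log-intensity,
    a simple way to correct for array-wide intensity differences."""
    s = sorted(values)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [v - median for v in values]

# Synthetic raw intensities for four hypothetical genes on one array
raw = {"geneA": 1200.0, "geneB": 300.0, "geneC": 4800.0, "geneD": 75.0}
logged = log2_transform(raw.values())
normalized = median_center(logged)
```

After median-centering, a value of +1 means the gene's intensity is twice the array median, and -1 means half of it, which makes arrays from different hybridizations directly comparable.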

Sorin Draghici is project manager for advanced technologies at BioDiscovery Inc., where he worked on the development of GeneSight, BioDiscovery's tool for data mining and data analysis for DNA microarrays. He is also an Assistant Professor in the Computer Science Department and a faculty member of the Institute for Scientific Computing, Wayne State University. His background is in machine learning and neural networks. During the past three years, his interests have centered on bioinformatics problems, including HIV virtual phenotyping and the development of tools for genomics and proteomics.
