Spectral Clustering Strategies for Heterogeneous Disease Data

Grace T. Huang1, Kathryn I. Cunningham2, Panayiotis V. Benos3, Chakra S. Chennubhotla4

1Department of Computational and Systems Biology; 2Joint CMU-Pitt PhD Program in Computational Biology; 3Clinical and Translational Science Institute, University of Pittsburgh; 4Department of Computer Science

Pacific Symposium on Biocomputing 18:212-223 (2013)


Clustering of gene expression data simplifies subsequent data analyses and forms the basis of numerous approaches for biomarker identification, prediction of clinical outcome, and personalized therapeutic strategies. The most popular clustering methods, such as K-means and hierarchical clustering, are intuitive and easy to use, but they require arbitrary choices for their parameters (the number of clusters for K-means, and a threshold at which to cut the tree for hierarchical clustering). Human disease gene expression data are generally more difficult to cluster efficiently because background (genotype) heterogeneity, differences in disease stage and progression, and disease subtyping all make the datasets more heterogeneous. Spectral clustering has recently been introduced in many fields as a promising alternative to standard clustering methods. The idea is that pairwise comparisons can help reveal global features through eigen techniques. In this paper, we developed a new recursive K-means spectral clustering method (ReKS) for disease gene expression data. We benchmarked ReKS on three large-scale cancer datasets and compared it to different clustering methods with respect to execution time, background models, and external biological knowledge. We found ReKS to be superior to the hierarchical methods and as good as K-means, but much faster than these methods and without requiring a priori knowledge of K. Overall, ReKS offers an attractive alternative for efficient clustering of human disease data.
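
The general idea of recursive spectral partitioning can be illustrated with a minimal sketch. This is not the ReKS algorithm itself, whose details are given in the full text and not in this abstract. The sketch assumes a Gaussian affinity on pairwise Euclidean distances, and it replaces the K-means step with a simpler sign split on the second eigenvector of the normalized graph Laplacian. The function names, the parameters `sigma`, `max_fiedler_value`, and `min_size`, and the stopping rule are all hypothetical choices made for illustration.

```python
import numpy as np

def fiedler_split(X, sigma=1.0):
    """Split the rows of X in two using the second Laplacian eigenvector."""
    # Pairwise squared distances turned into Gaussian similarities (assumed kernel).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; index 1 is the second smallest.
    eigvals, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    return eigvals[1], fiedler > 0

def recursive_spectral(X, sigma=1.0, max_fiedler_value=0.5, min_size=4):
    """Recursively bipartition X; no target number of clusters is supplied."""
    labels = np.zeros(len(X), dtype=int)
    next_label = 1
    stack = [np.arange(len(X))]
    while stack:
        idx = stack.pop()
        if len(idx) < min_size:
            continue
        value, mask = fiedler_split(X[idx], sigma)
        # Hypothetical stopping rule: a large second eigenvalue means
        # the group has no clear cut, so it is left as one cluster.
        if value > max_fiedler_value or mask.all() or not mask.any():
            continue
        labels[idx[mask]] = next_label
        next_label += 1
        stack.append(idx[mask])
        stack.append(idx[~mask])
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    group_a = rng.normal(0.0, 0.1, size=(10, 2))
    group_b = rng.normal(0.0, 0.1, size=(10, 2)) + np.array([3.0, 0.0])
    print(recursive_spectral(np.vstack([group_a, group_b])))
```

In this sketch the number of clusters emerges from the stopping rule rather than being fixed in advance, which mirrors the abstract's point about not needing a priori knowledge of K. The threshold values are arbitrary and would need tuning on real expression data.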
