Identification of regulatory binding sites using minimum spanning trees

Olman V, Xu D, Xu Y

Protein Informatics Group, Life Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6480, USA.

Pac Symp Biocomput. 2003;:327-38.


Abstract

Recognition of protein-binding sites from the upstream regions of genes is a highly important and unsolved problem. In this paper, we present a new approach for studying this challenging issue. We formulate the binding-site recognition problem as a cluster identification problem, i.e., to identify clusters in a data set that exhibit significantly different features (e.g., density) than the overall background of the data set. We have developed a general framework for solving such a cluster identification problem. The foundation of the framework is a rigorous relationship between data clusters and subtrees of a minimum spanning tree (MST) representation of a data set. We have proposed a formal and general definition of clusters, and have demonstrated that a cluster is always represented as a connected component of the MST, and further it corresponds to a substring of a linear representation of the MST. Hence a cluster identification problem is reduced to a problem of finding substrings with certain features, for which algorithms have been developed. We have applied this MST-based cluster identification algorithm to a number of binding site identification problems. The results are highly encouraging.


[Full-Text PDF] [PSB Home Page]