Identification of regulatory binding sites using minimum spanning treesOlman V, Xu D, Xu Y |
|
AbstractRecognition of protein-binding sites from the upstream regions of genes is a highly important and unsolved problem. In this paper, we present a new approach for studying this challenging issue. We formulate the binding-site recognition problem as a cluster identification problem, i.e., to identify clusters in a data set that exhibit significantly different features (e.g., density) than the overall background of the data set. We have developed a general framework for solving such a cluster identification problem. The foundation of the framework is a rigorous relationship between data clusters and subtrees of a minimum spanning tree (MST) representation of a data set. We have proposed a formal and general definition of clusters, and have demonstrated that a cluster is always represented as a connected component of the MST, and further it corresponds to a substring of a linear representation of the MST. Hence a cluster identification problem is reduced to a problem of finding substrings with certain features, for which algorithms have been developed. We have applied this MST-based cluster identification algorithm to a number of binding site identification problems. The results are highly encouraging. |
|
[Full-Text PDF] [PSB Home Page] |
|