Inferring Optimal Species Trees Under Gene Duplication and Loss


M.S. Bayzid1, Siavash Mirarab2, Tandy Warnow3



Department of Computer Science, The University of Texas at Austin
Email: tandy@cs.utexas.edu

Pacific Symposium on Biocomputing 18:250-261(2013)


Abstract

Species tree estimation from multiple markers is complicated by the fact that gene trees can di?er from each other (and from the true species tree) due to several biological processes, one of which is gene duplication and loss. Local search heuristics for two NP-hard optimization problems - min- imize gene duplications (MGD) and minimize gene duplications and losses (MGDL) - are popular techniques for estimating species trees in the presence of gene duplication and loss. In this paper, we present an alternative approach to solving MGD and MGDL from rooted gene trees. First, we characterize each tree in terms of its “subtree-bipartitions” (a concept we introduce). Then we show that the MGD species tree is de?ned by a maximum weight clique in a vertex-weighted graph that can be computed from the subtree-bipartitions of the input gene trees, and the MGDL species tree is de?ned by a minimum weight clique in a similarly constructed graph. We also show that these optimal cliques can be found in polynomial time in the number of vertices of the graph using a dynamic programming algorithm (similar to that of Hallett and Lagergren1 ), because of the special structure of the graphs. Finally, we show that a constrained version of these problems, where the subtree-bipartitions of the species tree are drawn from the subtree-bipartitions of the input gene trees, can be solved in time that is polynomial in the number of gene trees and taxa. We have implemented our dynamic programming algorithm in a publicly available software tool, available at http://www.cs.utexas.edu/users/phylo/software/dynadup/.


[Full-Text PDF] [PSB Home Page]