Transcription

Downloaded from genome.cshlp.org on June 17, 2013 - Published by Cold Spring Harbor Laboratory Press

Method

Unified modeling of gene duplication, loss, and coalescence using a locus tree

Matthew D. Rasmussen¹ and Manolis Kellis¹

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA; Broad Institute, Cambridge, Massachusetts 02139, USA

Gene phylogenies provide a rich source of information about the way evolution shapes genomes, populations, and phenotypes. In addition to substitutions, evolutionary events such as gene duplication and loss (as well as horizontal transfer) play a major role in gene evolution, and many phylogenetic models have been developed in order to reconstruct and study these events. However, these models typically make the simplifying assumption that population-related effects such as incomplete lineage sorting (ILS) are negligible. While this assumption may have been reasonable in some settings, it has become increasingly problematic as increased genome sequencing has led to denser phylogenies, where effects such as ILS are more prominent. To address this challenge, we present a new probabilistic model, DLCoal, that defines gene duplication and loss in a population setting, such that coalescence and ILS can be directly addressed. Interestingly, this model implies that in addition to the usual gene tree and species tree, there exists a third tree, the locus tree, which will likely have many applications. Using this model, we develop the first general reconciliation method that accurately infers gene duplications and losses in the presence of ILS, and we show its improved inference of orthologs, paralogs, duplications, and losses for a variety of clades, including flies, fungi, and primates. Also, our simulations show that gene duplications increase the frequency of ILS, further illustrating the importance of a joint model.
Going forward, we believe that this unified model can offer insights to questions in both phylogenetics and population genetics.

[Supplemental material is available for this article.]

Understanding the way new gene functions arise in genomes is a fundamental and long-studied question in evolutionary biology. Gene duplication, in particular, has been recognized as a powerful way of generating new functions through neofunctionalization and subfunctionalization (Ohno 1970; Lynch and Conery 2000), and gene losses can dramatically shape gene families (Niimura and Nei 2007). “Phylogenomics” (Eisen 1998) is the use of phylogenetics to systematically reconstruct the ancestry of thousands of gene families across many related genomes, and in recent years it has been pursued in a variety of ways (Zmasek and Eddy 2002; Li et al. 2006; Huerta-Cepas et al. 2007; Wapinski et al. 2007; Butler et al. 2009; Datta et al. 2009; Vilella et al. 2009; Mi et al. 2010).

The key idea in many of these approaches is that gene duplications and losses lead to incongruence (topological differences) between two important kinds of phylogenetic trees, the gene tree and the species tree (Goodman et al. 1979; Page 1994). The gene tree describes how a set of gene sequences has diverged from one another, while the species tree describes how a set of species has speciated. The gene tree can be thought of as evolving “inside” the species tree (Fig. 1), and this nesting can be reconstructed by reconciliation methods, in which the task is to infer the events responsible for the observed incongruence between two such trees (Goodman et al. 1979). Building on this idea, many models have been developed that use phylogenetic incongruence to infer the number, age, and location of gene duplication and loss events across several genomes (Page 1994; Arvestad et al. 2004; Durand et al. 2006; Rasmussen and Kellis 2011).

¹Corresponding authors. E-mail [email protected]; [email protected]. Article published online before print.
Article, supplemental material, and publication date are at 1. Freely available online through the Genome Research Open Access option.

While these models (which we refer to as dup-loss models) have been successful in many situations, there still remain several important challenges in accurately inferring these events (Li et al. 2006; Hahn 2007; Huerta-Cepas et al. 2007; Rasmussen and Kellis 2007). These challenges stem from the fact that incongruence can occur due to phenomena other than duplications and losses, and therefore one must use caution when interpreting incongruence. Several of the more recent approaches have dealt with this complication by expanding their models to incorporate other important phenomena. For example, in prokaryotes, horizontal gene transfer (HGT) is a major cause of incongruence, and developing models that incorporate HGT is an active area of research (Doyon et al. 2010; David and Alm 2011; Tofigh et al. 2011). Another source of incongruence is due to uncertainty in the reconstruction of the gene tree, and methods that account for this have shown dramatic improvements (Durand et al. 2006; Åkerborg et al. 2009; Rasmussen and Kellis 2011).

However, despite such efforts, dup-loss models have yet to capture an important and potentially prominent effect called incomplete lineage sorting (ILS) or deep coalescence (Fig. 1D; Wakeley 2009). When a population of individuals undergoes several speciations in a relatively brief period of time, there can exist polymorphisms maintained throughout that time that eventually fix differently in descendant lineages. This effect alone is enough to cause a gene tree to be incongruent with its species tree, and it occurs most frequently in branches of the species tree that represent small time spans (few generations) or large population sizes (Pollard et al. 2006; Hobolth et al. 2007).
While ILS can be inferred using coalescent models (Pamilo and Nei 1988; Rosenberg 2002; Rannala and Yang 2003; Degnan and Rosenberg 2009), these models have been developed for very different purposes, such as estimating population sizes, divergence times, or migration rates (Hey and Machado 2003; Rannala and Yang 2003; Liu and Pearl 2007). Typically, these analyses only require a subset of genes from the genome; therefore, one can choose genes that happen to be one-to-one orthologous and effectively avoid considering complications due to gene duplications and losses. In studies in which duplications are considered, they have been modeled in specific ways, such as a single duplication or a single species, and the primary focus has been to model other phenomena such as gene conversion (Innan 2003; Thornton 2007; Zhang and Rosenberg 2007; Innan 2009).

Genome Research 22:755–765 © 2012 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/12; www.genome.org

Figure 1. Different views of gene trees and species trees. (A) In the dup-loss model, a congruent gene tree and species tree indicates that all genes are orthologs. (B) Incongruence indicates the presence of gene duplications (stars) and gene losses (red “X”). (C) An example of the Wright-Fisher (WF) process and the coalescence of three lineages within the population. (D) A multispecies coalescent is a combination of WF processes for each branch of the species tree. In this model, no duplications or losses are allowed, but a gene tree can be incongruent due to a phenomenon known as incomplete lineage sorting (ILS). (E) In the dup-loss model, the same gene tree in panel D can be explained using one gene duplication and at least three gene losses. ILS cannot be modeled in the dup-loss model.

Currently, dup-loss models have only dealt with the influence of ILS in limited ways. Either ILS is assumed to be negligible and is ignored, or several post-processing steps are performed in order to mitigate its impact. For example, several reconciliation methods (Huerta-Cepas et al. 2007; Vilella et al. 2009) augment the usual strict interpretation of incongruence in order to identify extreme forms of incongruence that are unlikely to be due to duplication and loss, for example, when a duplication is followed by losses in each descendant lineage (Fig. 1E).
Notice that such a gene tree can easily be explained without duplications, if instead it is explained with ILS in a pure coalescent model (Fig. 1D). Another strategy has been to collapse short branches within the species tree where ILS is thought to occur frequently, and perform reconciliation to a species tree that is not fully resolved (Vernot et al. 2008). While these strategies work in specific cases of ILS, they are not general. In particular, as more genomes are sequenced, they will add new branches to the species tree, further breaking up long branches into smaller ones and increasing the frequency of ILS throughout the species tree.

Here, we present the first general probabilistic model for joint modeling of gene duplications, losses, and incomplete lineage sorting (ILS) across multiple species. Our model, DLCoal (Duplication, Loss, and Coalescence), provides a framework for studying all three phenomena and how they interact with one another. Using our model, we find that duplications can actually increase the probability of ILS and that what different researchers refer to as “gene trees” in the dup-loss and coalescent fields are actually different objects, which we distinguish by introducing a third tree called the locus tree. Using the model, we have developed a new reconciliation algorithm, DLCoalRecon, which addresses a pressing need for inferring duplications and losses despite the presence of ILS. We show its improved accuracy over a standard reconciliation method on both real and simulated data sets. A program implementing this algorithm is freely available for download.

The model

In this work, we present a probabilistic model for gene family evolution that includes gene duplications, losses, and coalescence. We define our model by building on features of existing dup-loss and multispecies coalescent models.

Duplication-loss models

In a dup-loss model (Fig.
1A,B), gene duplications and losses are thought to be the main cause of incongruence (Goodman et al. 1979; Page 1994). Therefore, gene-tree species-tree congruence strongly implies that all genes within the gene family are orthologous and that the gene has always been present as a single copy throughout the history of the species (Fig. 1A). The internal nodes of such a gene tree are called speciation nodes (white circles) since they represent sequence divergence due to speciation. A duplication event copies a gene to a new locus in the genome, where it begins to diverge. This is represented by additional internal nodes called duplication nodes (stars), which can be located anywhere along the length of a species tree branch. In contrast, the gene loss event (red “X”) deletes a gene from the genome. Notice, these events can occur multiple times, allowing the gene tree to possibly differ greatly from the species tree (Fig. 1B). A pair of genes are called orthologous if their most recent common ancestor (MRCA) is a speciation node, and they are called paralogous if their MRCA is a duplication node.

Coalescent models

In applications of the coalescent model, incomplete lineage sorting (ILS) is thought to be the main source of incongruence. This model can be derived from lower-level population models, such as the Wright-Fisher or Moran model (Wakeley 2009). The Wright-Fisher (WF) model contains several assumptions, including a fixed population size N, nonoverlapping generations, random mating, and neutrality. It also assumes no recombination, which is reasonable for the mitochondrial chromosome as well as any small region within autosomes, such as a single gene. In any case, we refer to the WF process as operating on “chromosomes” and for diploid species, the population has 2N chromosomes. When tracing the ancestry of k chromosomes backward in time, the WF model defines the number of generations t until one pair finds a common ancestor, or coalesces (Fig. 1C).
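The backward-in-time WF process described above can be illustrated with a minimal simulation sketch. This is not the authors' implementation; the function name `wf_generations_to_first_coalescence` and the parameter values are hypothetical, and the sketch assumes the idealized WF conditions listed above (fixed size, nonoverlapping generations, random mating, neutrality, no recombination). Each of the k lineages picks a parent chromosome uniformly at random among the 2N chromosomes of the previous generation, and the process stops at the first generation in which at least two lineages share a parent.

```python
import random


def wf_generations_to_first_coalescence(k, n_diploid, rng):
    """Count generations back until at least two of k lineages share a parent.

    Illustrative sketch only: k sampled chromosomes in a diploid
    Wright-Fisher population of n_diploid individuals (2N chromosomes).
    """
    n_chrom = 2 * n_diploid  # a diploid population of N individuals has 2N chromosomes
    generations = 0
    while True:
        generations += 1
        # Each lineage independently picks its parent chromosome uniformly at random.
        parents = [rng.randrange(n_chrom) for _ in range(k)]
        # Fewer distinct parents than lineages means some pair has coalesced.
        if len(set(parents)) < k:
            return generations


# Hypothetical example values: k = 3 lineages, N = 50 diploid individuals.
rng = random.Random(1)
times = [wf_generations_to_first_coalescence(3, 50, rng) for _ in range(2000)]
mean_time = sum(times) / len(times)
print(f"mean generations to first coalescence: {mean_time:.1f}")
```

Under these assumptions the waiting time t is geometrically distributed, because every generation has the same probability that some pair of lineages shares a parent. Note that this sketch stops at the first generation with any shared parent, so it does not distinguish a single pairwise merger from the rarer case of several lineages merging at once.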
Given a large population size, this process can be approximated with the coalescent (Kingman 1982), which assumes that t follows the exponential distribution with rate parameter $\binom{k}{2} / (2N)$. The process is repeated until all lineages coalesce into a single common ancestor, and the tree generated by this process is called a coalescent tree. Alternatively, the process can be terminated at some predetermined time possibly before all lineages fully coalesce, which has been referred to as a censored coalescent (Rannala and Yang 2003).

In the multispecies coalescent (Fig. 1D), each branch of the species tree is viewed as containing a WF process (Tajima 1983; Pamilo and Nei 1988; Rosenberg 2002; Rannala and Yang 2003; Degnan and Rosenberg 2009). This means that a gene tree is really a “traceback” of the ancestral lineages through this combined structure. Again, the coalescent can be used to approximate how a gene tree’s topology and branch lengths should be distributed. The multispecies coalescent process is initialized with a family of extant genes present in the leaves of the species tree. Within each species branch, gene lineages present at the bottom of the branch are coalesced according to the censored coalescent. By visiting the species branches bottom-up, the process generates a gene tree connecting all gene lineages up to the root of the species tree, where a final (uncensored) coalescent process joins the remaining gene lineages.

Note that if a species branch has a large population size or a short time span, it is possible that two or more gene lineages may not coalesce at their first opportunity, a phenomenon called incomplete lineage sorting (ILS). Therefore, with ILS, a gene tree can be incongruent