Health Informatics & Image Analysis Research Group

Department of Computer Science & Engineering
Indian Institute of Technology Kharagpur, West Bengal, India

Software and tools on Phylogenetics




Supertree Estimation

Problem 1 - Unweighted supertree estimation

Input:

A set of M (> 2) phylogenetic trees depicting evolutionary relationships for a set of N taxa. Individual trees cover overlapping taxa subsets from the taxa set. These trees depict conflicting evolutionary relations among these taxa subsets.

Output:

A supertree displaying the evolutionary relationships for all N taxa, such that the consensus relations among individual taxa subsets are preferably reflected in the supertree. Here, the branch lengths (weights) of the input trees are not used for supertree estimation. So, the problem is termed as unweighted supertree estimation.

Existing studies:

Existing studies either use matrix or graph based representation of input tree topologies, or decompose the trees into fixed or variable sized subtrees. So far, decomposition of input trees upto a minimum of triplets (subtrees of 3 taxa) have been proposed in literature. Such a method incurs cubic time and space complexity.

Our implementations (two different methods):

We have implemented a supertree method COSPEDTREE which uses evolutionary relations among individual couplets (taxa pair) to generate the supertree. The method uses cubic time complexity, and most importantly, quadratic space complexity, thus efficient for use in large scale biological datasets.

Link to main page of COSPEDTREE

We have extended the above method to generate COSPEDTREE-II , which also uses couplet relations to derive the final supertree. However, the output supertree is much better resolved compared to the earlier version. The running time of the new approach is also considerably lower.

Link to main page of COSPEDTREE-II

Problem 2 - Weighted Suprtree estimation

Input:

Here, the branch lengths (weights) of the input trees are present (input trees are weighted) and they are used for supertree estimation.

Output:

Output supertree quests for satisfying the following two objectives:

  • The supertree should be topologically closer to the input trees (median tree criteria).
  • Branch length information of the supertree should also closely match to the input trees.

Existing studies:

  • Either simultaneously use topology and branch length information for supertree construction. However, such supertrees are found to be quite inaccurate, in terms of both topological and branch length similarities to the input trees.
  • Or, they first form a super distance matrix (SDM) from input distance matrices (distance or branch length information of input trees), and use this SDM to estimate the tree using a distance matrix based phylogenetic construction method (such as neighbor-joining (NJ) algorithm).

Our implementation:

We have extended our previous unweighted supertree algorithm COSPEDTREE to incorporate branch length information. It is a two stage approach, described as follows.

  • 1st stage determines the supertree topology. Here, input tree topologies and branch length information are used to infer the topology.
  • Using the derived tree topology, a novel least square based nonlinear programming is applied, to assign the branch length values. We refer to this method as COSPEDTREEBL.

Link to main page of COSPEDTREEBL

Estimation of species trees from incongruent gene trees

Problem Overview

Gene trees: Phylogenetic trees created by sampling a gene from a group of N taxa, and subsequently employing phylogenetic reconstruction methods.

Input: A set of M (> 1) gene trees covering a set of N taxa. These gene trees have conflicting topologies and branch lengths, due to independent site specific evolution of individual gene sequences. Even the consensus gene tree topology may not be the true species tree topology.

Reason for discordance: Mainly for one of the following three evolutionary processes: 1) Horizontal gene transfer (HGT), 2) Gene duplication / loss, and 3) incomplete lineage sorting (ILS) or deep coalescence (DC).

Objective: To infer a species tree from this set of incongruent gene trees.

Approach 1 - Supertree based species tree estimation

We have proposed a species tree estimation method COSPEDSPEC, which models ILS as the sole reason between gene tree / species tree incongruence.

  • It is a two stage method for species tree estimation.
  • First stage uses COSPEDTREE to produce a supertree (non-binary).
  • Second stage refines the supertree to form a binary species tree. We have proposed a novel couplet based measure, termed as the excess gene count , to employ an agglomerative clustering approach for species tree generation.
COSPEDSPEC has cubic time complexity and quadratic space complexity, which makes it useful in large biological datasets.

Link to main page of COSPEDSpec

Approach 2 - Species tree estimation using couplet based internode and excess gene count

We have proposed another species tree estimation method IDXL, which models ILS as the sole reason between gene tree / species tree incongruence.

  • It modifies an existing summary based species tree estimation method NJst (Liu et. al. 2011)
  • Species tree is inferred using couplet based internode distance (ID) (originally proposed in Liu et. al. 2011), and our proposed excess gene leaf count (XL) measures.
IDXL exhibits improved performance compared to the NJst approach, especially for biological datasets (in terms of higher bootstrap support values on the recommended clades).

Link to main page of IDXL

Approach 3 - Species tree estimation using couplet based accumulated coalescence rank and excess gene count

We have proposed two more species tree estimation methods named AcRNJ and AcRNJXL, which models ILS as the sole reason between gene tree / species tree incongruence.

  • It modifies an existing summary based species tree estimation method STAR (Liu et. al. 2009)
  • STAR proposes lowest common ancestor (LCA) coalescence rank for individual couplets, for species tree inference.
  • We propose accumulated coalescence rank (AcR) among individual couplets, which is proved to discriminate individual couplets better, and also infer the order of molecular evolution better.
  • Species tree is inferred using one of the following techniques:
    • Either the AcR measure is used as the sole distance measure for individual couplets. This method is termed as AcRNJ.
    • Or, the product of relative AcR and the relative XL measures is used to select the couplet for agglomeration (much like IDXL approach). This method is termed as AcRNJXL.
Both AcRNJ and AcRNJXL exhibit improved performance compared to STAR, especially for the biological datasets.

Link to main page of AcRNJXL