How to draw phylogenetic trees

Phylogenetic reconstruction is an attempt to discern the ancestral relationships among a set of sequences. It involves constructing a tree in which the nodes mark the points where evolutionary paths separate, and the lengths of the branches estimate how distantly related the sequences on those branches are.

Genes taken from different species may not have the same phylogenetic history as the species themselves (although they do, obviously, have the same evolutionary history).

Note on terminology:

  • topology: branching order of species, independent of branch length;
  • OTUs (operational taxonomic units): whatever groups of organisms etc. are under consideration;
  • outgroup: OTU included in study for explicit purpose of finding the root of the tree;
  • homoplasies: convergences of a particular character at a site.

Only shared and derived (synapomorphic) characters can be used to clearly establish a phylogenetic relationship.

There are several methods of constructing phylogenetic trees - the most common are:

  • distance methods
  • parsimony methods
  • maximum likelihood methods

All these methods can only provide estimates of what a phylogenetic tree might look like for a given set of data. Most good methods also provide an indication of how much variation there is in these estimates.
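One common way to gauge that variation is bootstrapping, which is mentioned under distance methods below. The following is only a minimal sketch of the resampling step, assuming the alignment is held as a dictionary of equal-length strings; the function name `bootstrap_alignment` and the toy sequences are illustrative and do not come from any of the programs discussed here. In practice each replicate would be passed to a tree-building method and the resulting trees compared.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: alignment columns resampled with replacement."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices at random, allowing repeats.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

# Toy alignment (hypothetical data, for illustration only).
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
replicate = bootstrap_alignment(alignment, random.Random(0))
print(replicate)
```

Every sequence in a replicate keeps its original length, and every column in the replicate is a column of the original alignment.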

  • Distance methods:
    Preferred for work with immunological data, frequency data, or data obtained by imprecise methods. Very rapid, and easily permit statistical tests, e.g. bootstrapping. They derive some measure of similarity or difference between the input sequences.
    • UPGMA
      Cluster algorithm. Links the least different pairs of sequences sequentially (so that once a pair is formed, it is treated as a single entity). (Invalid) assumptions made: 1. The rate of change is equal among all sequences. 2. Branch lengths correlate with the expected phenotypic distance between sequences, which corresponds to a proportional measure of time.
    • NJ (neighbour joining)
      Corrects several assumptions made in the UPGMA method. Yields an unrooted tree.
    • Fitch and Margoliash
      Does not try to find pairs of least different sequences, but tries to find trees that fulfil an optimum criterion. Yields an unrooted tree.
  • Parsimony methods:
    Popular for reconstructing ancestral relationships.
    • Maximum parsimony
      Evaluates all possible trees and infers the number of evolutionary events implied by each topology. The preferred tree is the one that requires the minimum number of evolutionary changes to explain the observed data. Problems: the most parsimonious tree may not be unique; it is difficult to make valid statistical statements if there are many steps in a tree; branches with particularly rapid rates of change tend to attract one another, especially when the sequences are short.
  • Maximum likelihood:
    Very slow. Preferred when homoplasies (convergences of a particular character at a site) are expected to be concentrated in a few sites only, whose identities are known in advance.
    The method works by estimating, for every nucleotide position in a sequence, the probability of having a particular nucleotide at that site, based on whether or not its ancestors had it (and on the transition/transversion ratio). These probabilities are summed over the whole sequence, for both branches of a bifurcating tree. The product of the two probabilities gives the likelihood of the tree up to this point. With more sequences, the estimation is done recursively at every branch point. Because each site is assumed to evolve independently, the likelihood of the phylogeny can be estimated at every site.
    This process can only be done in a reasonable amount of time with four sequences. If there are more than four sequences, basic trees can be made for sets of four sequences; extra sequences are then added to the tree and the maximum likelihood is re-estimated. The order in which the sequences are added, and the initial sequences chosen to start the process, critically influence the resulting tree. To prevent any bias, the whole process is repeated multiple times with the sequences in random order. A majority-rule consensus tree is then chosen as the final tree.
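The distance-method steps described above (derive a pairwise distance, then repeatedly link the least different pair) can be sketched in a few lines. This is only an illustrative sketch, not the code used by PHYLIP or ClustalW: the function names `p_distance` and `upgma`, the use of the uncorrected proportion of differing sites as the distance, and the toy distances are all assumptions made for the example.

```python
from itertools import combinations

def p_distance(s1, s2):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def upgma(names, dist):
    """Cluster OTUs by UPGMA and return the tree as a Newick string.

    names: list of OTU labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    # Each cluster is keyed by its Newick text and stores (text, size, node height).
    clusters = {n: (n, 1, 0.0) for n in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Link the least different pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        d_ab = d[frozenset((a, b))]
        (ta, na, ha), (tb, nb, hb) = clusters.pop(a), clusters.pop(b)
        h = d_ab / 2  # equal-rate assumption: the new node sits halfway
        new = f"({ta}:{h - ha:g},{tb}:{h - hb:g})"
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of its two members' distances.
        for c in clusters:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[new] = (new, na + nb, h)
    (text, _, _), = clusters.values()
    return text + ";"

# Toy distances (hypothetical): A and B are close, C is distant from both.
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
}
print(upgma(["A", "B", "C"], dist))  # (C:3,(A:1,B:1):2);
```

Because the new node is always placed at half the distance between the two clusters being merged, every tip ends up the same distance from the root, which is exactly the equal-rate assumption listed under UPGMA.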

A more detailed analysis of some of the algorithms is available here.
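As a small illustration of the parsimony criterion described above, the number of changes a given topology requires can be counted with Fitch's algorithm. This is a minimal sketch, assuming a rooted binary tree written as nested 2-tuples of leaf names and an alignment held as a dictionary of strings; the function names and the toy data are illustrative only, and real programs also search over topologies rather than scoring two by hand.

```python
def fitch_score(tree, states):
    """Minimum number of changes at one site on a rooted binary tree.

    tree:   nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states: dict mapping leaf name -> character observed at this site
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is the observed character
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed here
            return common
        changes += 1                       # children disagree: one change is implied
        return left | right

    visit(tree)
    return changes

def parsimony_length(tree, alignment):
    """Total number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )

# Toy alignment (hypothetical data): two sites, four sequences.
alignment = {"A": "GA", "B": "GA", "C": "TA", "D": "TC"}
print(parsimony_length((("A", "B"), ("C", "D")), alignment))  # 2
print(parsimony_length((("A", "C"), ("B", "D")), alignment))  # 3
```

Under the parsimony criterion the first topology, which needs two changes, would be preferred over the second, which needs three.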

To create a phylogenetic tree, you must first have an alignment. This can be created using ClustalW (see previous tutorial). ClustalW can also create a tree file for you (if you choose 'nj', 'phylip', or 'dist' from the "Tree type" pull-down menu). However, you have more control over the tree if you use ClustalW only to create the alignment (do not choose a tree type in this case, because then the alignment itself will not be presented). Copy the alignment (including the title, so that the PHYLIP programs recognise the alignment format as ClustalW) and paste it into the text-entry box provided for alignments in one of the following programs in the PHYLIP suite. Sample FASTA file of Ebola glycoproteins.

[ phylip ] worked example

[ phylodendron ] worked example | help

The resulting tree files from all of these programs are in New Hampshire (nh, also known as Newick) format, and can be copied and pasted into the text-entry box in Phylodendron.
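A tree in this format is a single line of nested parentheses, with an optional branch length after each colon, for example `((A:0.1,B:0.2):0.3,C:0.4);`. The sketch below shows one way such a string could be read; it assumes a well-formed tree ending in a semicolon, does not handle quoted labels or comments, and the function names are illustrative rather than part of PHYLIP or Phylodendron.

```python
import re

DELIMITERS = {"(", ")", ",", ":", ";"}

def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    tokens = re.findall(r"[(),:;]|[^(),:;\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                        # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                    # consume ","
                children.append(node())
            pos += 1                        # consume ")"
        name = ""
        if tokens[pos] not in DELIMITERS:   # optional node label
            name = tokens[pos]
            pos += 1
        length = None
        if tokens[pos] == ":":              # optional branch length
            length = float(tokens[pos + 1])
            pos += 2
        return (name, length, children)

    return node()

def leaf_names(tree):
    """List the leaf labels of a parsed tree, left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]

tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
print(leaf_names(tree))  # ['A', 'B', 'C']
```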

[ treeillustrator ]

Another useful program, which you will have to download to your own computer, is TreeIllustrator.

Warnings:
  • The resulting tree from any of these phylogenetic programs is always an estimate. It is usually a good idea to try many different programs on your data set.
  • You can also try removing various sequences to see what difference they make to the topology of the tree. A radical change could indicate that a particular sequence is causing an error (perhaps because it has a different rate of change).
  • Remember that long branches tend to attract each other. A good way to deal with long-branched sequences is to add them to an existing tree.
  • Always try to include at least one outgroup. This will help to root the tree, and will also let you check that the outgroups really are 'out'. It is usually a good idea to have more than one outgroup, spaced evenly throughout the tree.
  • The more intermediate sequences are included, the more the 'correctness' of the branching topology should increase, because the intermediate sequences provide the 'intermediate' states from which the descendants arise.
