Phylogenetic reconstruction is an attempt to discern the ancestral
relationships among a set of sequences. It involves the construction
of a tree in which the nodes mark the points where evolutionary paths
separate, and the lengths of the branches estimate how distantly
related the sequences represented by those branches are.
Genes taken from different species may not have the same phylogenetic
history as the species themselves (although they do, obviously, share
the evolutionary history of those species).
Note on terminology:
- topology: the branching order of the tree, independent of branch
length;
- OTUs (operational taxonomic units): whatever groups of organisms,
sequences, etc. are under consideration;
- outgroup: an OTU included in the study for the explicit purpose of
finding the root of the tree;
- homoplasies: convergences of a particular character at a site.
Only shared and derived (synapomorphic) characters can be used
to clearly establish a phylogenetic relationship.
There are several methods of constructing phylogenetic trees -
the most common are:
- distance methods
- parsimony methods
- maximum likelihood methods
All these methods can only provide estimates of what a phylogenetic
tree might look like for a given set of data. Most good methods also
provide an indication of how much variation there is in these estimates.
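One common way to gauge that variation is bootstrapping: the alignment
columns are resampled with replacement, a tree is built from each
resampled alignment, and the trees are compared. The sketch below shows
only the resampling step. The function name and the toy alignment are
illustrative assumptions, not part of any particular program.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

# Toy alignment (made up for illustration).
alignment = {"A": "ACGTACGT", "B": "ACGTTCGT", "C": "ACGAACGA"}
rng = random.Random(0)  # fixed seed so the replicate is reproducible
replicate = bootstrap_alignment(alignment, rng)
print(replicate)
```

Each replicate would then be passed to the tree-building method of
choice, and the fraction of replicates that recover a given grouping is
reported as its bootstrap support.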
- Distance methods:
Preferred for work with immunological data, frequency data, or data
whose measurement is somewhat imprecise. Very rapid, and easily
permit statistical tests, e.g. bootstrapping. They derive some measure
of similarity or difference between the input sequences and build the
tree from those pairwise values.
- UPGMA (unweighted pair group method with arithmetic mean)
Clustering algorithm. Links the least different pairs of sequences
sequentially (so that once a pair is formed, it is treated as a single
entity). (Invalid) assumptions made: 1. The rate of change is equal
among all sequences. 2. Branch lengths correlate with the expected
phenotypic distance between sequences, which corresponds to a
proportional measure of time.
- NJ (neighbour joining)
Does not assume that the rate of change is equal among all sequences,
which corrects the main invalid assumption of the UPGMA method. Yields
an unrooted tree.
- Fitch and Margoliash
Does not try to find pairs of least different sequences, but tries to
find trees that fulfil an optimum criterion. Yields an unrooted tree.
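As a rough illustration of the distance approach, the sketch below
computes a simple pairwise distance (the proportion of differing sites,
an assumption made here for brevity; real programs usually apply a
correction for multiple substitutions) and then clusters the sequences
UPGMA-style. Branch lengths are omitted, and the sequences are made up.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs):
    """Cluster sequences by UPGMA; return the topology as nested tuples."""
    sizes = {name: 1 for name in seqs}  # cluster label -> number of members
    dist = {(a, b): p_distance(seqs[a], seqs[b])
            for a, b in combinations(seqs, 2)}

    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    while len(sizes) > 1:
        a, b = min(dist, key=dist.get)  # least different pair
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)                 # the pair becomes a single entity
        # Keep distances that do not involve the merged clusters.
        new_dist = {k: v for k, v in dist.items()
                    if k[0] not in (a, b) and k[1] not in (a, b)}
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            new_dist[(merged, other)] = (
                size_a * d(a, other) + size_b * d(b, other)
            ) / (size_a + size_b)
        dist = new_dist
        sizes[merged] = size_a + size_b
    return next(iter(sizes))

# Toy sequences (made up for illustration).
seqs = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGTTCGGAA",
    "D": "TCGATCGGTA",
}
tree = upgma(seqs)
print(tree)
```

Because every merge averages distances as if all lineages changed at
the same rate, the result inherits the equal-rate assumption listed
under UPGMA.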
- Parsimony methods:
Popular for reconstructing ancestral relationships.
- Maximum parsimony
Evaluates all possible trees. Infers the number of evolutionary
events implied by a particular topology. The preferred tree is
then the one that requires the minimum number of evolutionary
changes to explain the observed data. Problems: the most
parsimonious tree may not be unique; it is difficult to make valid
statistical statements if there are many steps in a tree;
branches with particularly rapid rates of change tend to
attract one another (long-branch attraction), especially when
the sequences are short.
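Counting the changes implied by a topology can be sketched with Fitch's
algorithm, one standard way of scoring a fixed tree (the text above
does not say which counting scheme a given program uses, so this is an
assumption). Trees are written as nested pairs of leaf names, and the
four-taxon alignment is made up for illustration.

```python
def fitch(tree, site):
    """Return (possible states, minimum changes) for one alignment column.

    tree is a leaf name or a pair (left, right); site maps leaf -> character.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_states, left_changes = fitch(tree[0], site)
    right_states, right_changes = fitch(tree[1], site)
    shared = left_states & right_states
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_changes + right_changes
    # Otherwise one change is required somewhere below this node.
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_length(tree, alignment):
    """Total number of changes the topology requires over all columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Toy alignment (made up for illustration).
alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTAT"}

# The three possible topologies for four taxa.
topologies = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = {t: parsimony_length(t, alignment) for t in topologies}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

With only four taxa all topologies can be scored exhaustively; the
number of possible trees grows very quickly with more sequences, which
is why larger data sets are usually searched heuristically.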
- Maximum likelihood:
Very slow. Preferred when homoplasies (convergences of a
particular character at a site) are expected to be concentrated
in a few sites only, whose identities are known in advance.
The method works by estimating, for every nucleotide position
in the alignment, the probability of observing a particular
nucleotide at that site given the nucleotide its ancestor had
(and the transition/transversion ratio). At a branch point of a
bifurcating tree, these probabilities are summed over the possible
ancestral nucleotides for each of the two descending branches, and
the product of the two sums gives the likelihood of the tree up to
this point. With more sequences, the estimation is done recursively
at every branch point. Because each site is assumed to evolve
independently, the likelihood can be computed site by site, and the
per-site likelihoods are multiplied together to give the likelihood
of the whole phylogeny.
This process can only be done in a reasonable amount of time
with four sequences. If there are more than four sequences,
basic trees can be made for sets of four sequences, and then
extra sequences are added to the tree and the maximum likelihood
is re-estimated. The order in which the sequences are added and
the initial sequences chosen to start the process critically
influence the resulting tree. To prevent any bias, the whole
process is done multiple times with random choices for the order
of the sequences. A majority-rule consensus tree is then chosen
as the final tree.
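The recursive likelihood calculation can be sketched as follows. This
is a minimal illustration under stated assumptions: it uses the
Jukes-Cantor substitution model (which has no separate
transition/transversion ratio), fixed branch lengths of 0.1 rather than
optimised ones, and a made-up alignment. The tree encoding and the
helper names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69(i, j, t):
    """Jukes-Cantor probability that base i is base j after branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node, site):
    """For each base, the probability of the tips below node given that base."""
    if isinstance(node, str):  # leaf: the observed base has probability 1
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:      # internal node: (child, branch length) pairs
        below = partial(child, site)
        for b in BASES:
            # Sum over the child's possible bases, multiply across branches.
            result[b] *= sum(jc69(b, c, t) * below[c] for c in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods (sites treated as independent)."""
    length = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        cond = partial(tree, site)
        total += math.log(sum(0.25 * cond[b] for b in BASES))
    return total

def cherry(x, y, t=0.1):
    """Join two subtrees under a new node with equal branch lengths."""
    return ((x, t), (y, t))

# Toy alignment (made up for illustration).
alignment = {"A": "ACGTAC", "B": "ACGTAC", "C": "ACTTGC", "D": "ACTTGC"}
tree_ab_cd = cherry(cherry("A", "B"), cherry("C", "D"))
tree_ac_bd = cherry(cherry("A", "C"), cherry("B", "D"))
print(log_likelihood(tree_ab_cd, alignment))
print(log_likelihood(tree_ac_bd, alignment))
```

A real program would also optimise the branch lengths and model
parameters for each candidate topology, which is a large part of why
the method is slow.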
A more detailed analysis of some of the algorithms is available here.
To create a phylogenetic tree, you must first have an alignment.
This can be created using ClustalW (see previous tutorial). ClustalW
can also create a tree file for you (if you choose 'nj', 'phylip', or
'dist' from the "Tree type" pull-down menu). However, you have
more control over the tree if you simply choose to create an
alignment in ClustalW (do not choose a tree type in this case,
because then the alignment itself will not be presented). Copy
the alignment (including the title, so that the PHYLIP programs
recognise the alignment format as ClustalW), and paste it into
the text-entry box provided for alignments in one of the following
programs in the PHYLIP suite of programs. Sample FASTA file of
Ebola glycoproteins.
The resulting tree files from all of these programs are in nh
(New Hampshire, also known as Newick) format, which can then be cut
and pasted into the text-entry box in Phylodendron. Another useful
program, which you will have to download to your personal computer,
is called TreeIllustrator.
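For orientation, a tree file in New Hampshire (Newick) format is a
single line of nested parentheses ending in a semicolon, with optional
branch lengths after colons. The sketch below writes a nested-tuple
topology in that form and pulls the leaf names back out of a tree
string. The helper names and the example trees are illustrative
assumptions, and the regular expression handles only simple unquoted
labels.

```python
import re

def to_newick(tree):
    """Write a nested-tuple topology (leaves are strings) as Newick text."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"

def leaf_names(newick):
    """Extract leaf labels from a Newick string, ignoring branch lengths."""
    # A leaf label follows "(" or "," and runs up to the next delimiter.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

text = to_newick((("A", "B"), ("C", "D"))) + ";"
print(text)
print(leaf_names("((A:0.1,B:0.2):0.3,C:0.4);"))
```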
Warnings:
The resulting tree from any of these phylogenetic programs
is always an estimate. It is usually a good idea to try many
different programs on your data set. You can also try to
remove various sequences and see what difference they
make to the topology of the tree. If there is a radical change,
this could indicate that a particular sequence is causing an
error (perhaps because it has a different rate of change).
Remember also that long branches tend to attract each other.
A good way to deal with these is to add them to an existing
tree. Always try to include at least one outgroup. This will
help to root the tree, and also to check that the outgroups
really are 'out'. It is usually a good idea to have more than
one outgroup, spaced evenly throughout the tree. The more
intermediate sequences are included, the more the 'correctness'
of the branching topology should increase, because the intermediate
sequences provide the 'intermediate' states from which the
descendants arise.
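Checking whether removing a sequence changes the topology can be
sketched by comparing the groupings (splits) two trees imply. In the
example below, the full tree and the tree "re-estimated" without
sequence C are both made up for illustration, and the nested-tuple
encoding and helper names are assumptions. In practice the second tree
would come from rerunning the phylogenetic program on the reduced data
set.

```python
def leaves(tree):
    """Set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def prune(tree, name):
    """Remove one leaf, collapsing any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == name else tree
    kept = [p for p in (prune(child, name) for child in tree) if p is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def splits(tree):
    """Non-trivial bipartitions of the leaves implied by the tree."""
    everything = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaves(node)
        other = everything - below
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found

# Hypothetical trees: one from all sequences, one estimated without C.
full_tree = ((("A", "B"), "C"), ("D", "E"))
tree_without_c = (("A", "D"), ("B", "E"))

expected = prune(full_tree, "C")
changed = splits(expected) ^ splits(tree_without_c)
print(expected)
print(len(changed))
```

An empty difference would mean that removing the sequence left the
branching order of the remaining sequences unchanged; a non-empty one
flags the kind of radical change described above.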