Evolution
A support site for modules SC0060-3 Evolution
and SCM009-M Molecular Evolution
Page last updated 14/02/02
Sequence phylogenetics
[for SC0060-3 and SCM009-M]
Phylogenetics is the study of evolutionary history and relatedness. Molecular phylogenetics
is the study of evolutionary history and relatedness via DNA base or protein amino acid
sequences, or occasionally via protein 3D conformation. Before the 1960s, there was no
sequencing, and study was only possible at the phenotypic level. In the 1960s and 1970s,
there was no DNA sequencing, but amino acid sequences could be studied. In the 1980s and
1990s, DNA sequencing capability grew rapidly, and DNA sequences were increasingly studied.
In 2000, the landmark sequence of the human genome (ca 3,300 million bp) and several others
were obtained. In the 2000s, there is the emergent science of bioinformatics, which is the
study of patterns in, and relationships between, genomic sequences using IT.
Phylogenetic trees are used to diagrammatically represent phylogenetic relationships.
Phylogenetics is a 'cladistic approach'. Cladistics is a traditional taxonomic term meaning
to categorise on the basis of evolutionary relationships (as compared to phenetics, which is
categorising on the basis of observed similarities and differences between organisms). A
clade is a group of species sharing a common ancestor not shared by any species outside
the clade. For example, comparison of DNA sequences shows the classical taxonomic
class Reptilia is not a clade, since the common ancestor of the reptile orders is also an
ancestor of birds, ie, the class Aves, which Reptilia excludes. However, Aves and order
Crocodilia together form the clade Archosauria since they share a common ancestor not
shared by any species outside the clade.
The basis
of molecular phylogenetics is comparison of equivalent extant sequences that
diverged
from a
common ancestor within the time period for which significant comparison
is possible.
Sequences accumulate changes over time through mutation (note that mutations are only
inherited, and therefore of evolutionary relevance, when they occur in germ cells). The
types of mutations are as follows.
Substitutions are the commonest mutation, but all mutations are rare. Mutations occur in an
individual, and are then either fixed (ie, after a time are present in every individual in
the population) or lost (ie, after a time are eliminated from the population). Fixation
probability depends on which of the following applies to the mutation.
Mutations can be in coding regions of genes, and so could alter protein amino acid sequence;
or in non-coding regions of genes such as 5' and 3' flanking and untranslated regions
and introns (where they could affect gene expression); or in non-genic regions.
Substitutions in coding regions can be one of the following.
Substitution is usually very gradual, occurring over millions of years, so cannot be
directly observed. One can only compare extant sequences and infer phylogeny (note that
PCR of ancient DNA is a very limited tool, with best results from only ca 40,000 ya, and
most sequences being of just a few mitochondrial bases).
Substitution rate = K / 2T, where K is substitutions per site between two compared
sequences and T is time since divergence of the two compared sequences (ie,
since they last shared a common ancestor).
It is 2T rather than T because, since the divergence event, each of the two lineages has
been accumulating substitutions independently for time T. T is often estimated from the
fossil record, but this can be unreliable or even impossible.
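The rate formula above can be sketched in a few lines of Python. This is only an illustration: the function name and the example value of K are assumptions made here, not figures from the module.

```python
def substitution_rate(k, t_years):
    """Substitutions per site per year, using rate = K / 2T.

    k       -- substitutions per site between the two compared sequences
    t_years -- time since the two sequences diverged, in years
    """
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t_years)


# Hypothetical example: K = 0.8 substitutions per site, T = 80 million years.
rate = substitution_rate(0.8, 80e6)
print(f"{rate * 1e9:.2f} substitutions per site per 1000 million years")
```

Multiplying by 1e9 simply converts the per-year rate into the "per 1000 million years" units used in the tables on this page.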
RATES OF NUCLEOTIDE SUBSTITUTION PER SITE PER 1000 MILLION YEARS BETWEEN
VARIOUS HUMAN AND RODENT PROTEIN-CODING GENES WITH DIVERGENCE
SET AT 80 MILLION YEARS BASED ON FOSSIL EVIDENCE.
| gene | number of codons | non-synonymous rate | synonymous rate |
| HISTONES | | | |
| histone 3 | 135 | 0.00 | 6.38 |
| histone 4 | 101 | 0.00 | 6.12 |
| ACTINS | | | |
| actin alpha | 376 | 0.01 | 3.68 |
| actin beta | 349 | 0.03 | 3.13 |
| SIGNALS | | | |
| somatostatin-28 | 28 | 0.00 | 3.97 |
| insulin | 51 | 0.13 | 4.02 |
| thyrotropin | 118 | 0.33 | 4.66 |
| erythropoietin | 191 | 0.72 | 4.34 |
| insulin C peptide | 35 | 0.91 | 6.77 |
| parathyroid hormone | 90 | 0.94 | 4.18 |
| luteinizing hormone | 141 | 1.02 | 3.29 |
| growth hormone | 189 | 1.23 | 4.95 |
| interleukin I | 265 | 1.42 | 4.60 |
| relaxin | 54 | 2.51 | 7.49 |
| GLOBINS | | | |
| alpha-globin | 141 | 0.55 | 5.14 |
| myoglobin | 153 | 0.56 | 4.44 |
| beta-globin | 144 | 0.80 | 3.05 |
| APOLIPOPROTEINS | | | |
| E | 283 | 0.98 | 4.04 |
| A-I | 243 | 1.57 | 4.47 |
| A-IV | 371 | 1.58 | 4.15 |
| IMMUNOGLOBULINS | | | |
| Ig-VH | 100 | 1.07 | 5.66 |
| Ig-gamma1 | 321 | 1.46 | 5.11 |
| Ig-kappa | 106 | 1.87 | 5.90 |
| INTERFERONS | | | |
| alpha-1 | 166 | 1.41 | 3.53 |
| beta-1 | 159 | 2.21 | 5.88 |
| gamma | 136 | 2.79 | 8.59 |
| ENZYMES | | | |
| aldolase A | 363 | 0.07 | 3.59 |
| creatine kinase M | 380 | 0.15 | 3.08 |
| GAPDH | 331 | 0.20 | 2.84 |
| lactate dehydrogenase A | 331 | 0.20 | 5.03 |
| mean | | 0.85 | 4.61 |
| SD | | 0.73 | 1.44 |
Substitution rate will depend on mutational input rate and fixation probability. The former
varies too little to account for the differences in observed rates above (eg, there is only
a two-fold difference between the minimum and maximum mutation rates observed across
mammalian genomes, this variation probably being due to differences in GC richness). The
latter can account for the differences in observed rates through differences in functional
constraint, ie, how likely a substitution is to alter protein function.
For each protein, synonymous rate > non-synonymous rate (5.4-fold greater on average).
Non-synonymous substitutions are very likely to be deleterious and thus not fixed in the
population, whilst synonymous substitutions should be neutral and so are much more often
fixed in the population. Whilst most non-synonymous substitutions are deleterious, a few
are neutral, and very rarely they enhance fitness; these last might be crucial to evolution.
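As a quick arithmetic check, the 5.4-fold figure can be reproduced from the mean rates in the table above. The helper name below is an assumption made for illustration, and the handling of a zero non-synonymous rate (as for the histones) is a choice made here rather than anything stated in the module.

```python
def fold_difference(synonymous_rate, non_synonymous_rate):
    """Return how many times greater the synonymous rate is.

    A non-synonymous rate of zero (eg, the histones) has no finite ratio,
    so infinity is returned in that case.
    """
    if non_synonymous_rate == 0:
        return float("inf")
    return synonymous_rate / non_synonymous_rate


# Mean rates copied from the table above.
print(f"mean: {fold_difference(4.61, 0.85):.1f}-fold")
# Individual genes from the same table vary widely around that mean.
print(f"insulin: {fold_difference(4.02, 0.13):.1f}-fold")
print(f"relaxin: {fold_difference(7.49, 2.51):.1f}-fold")
```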
The non-synonymous rate is very low for histones and actins, low for enzymes and some
hormones like insulin, moderate for globins and some hormones, and high for apolipoproteins,
immunoglobulins, interferons and some hormones (ie, hormones vary greatly). A stronger
functional constraint leads to a slower non-synonymous rate (and vice versa). For
example, histones intimately and tightly bind DNA in the eukaryotic nucleus, and so must be
compact and alkaline, with almost every amino acid residue involved in an interaction
with DNA or another histone.
Histones are thus highly intolerant of amino acid changes (ie, a tight functional
constraint), and consequently their non-synonymous rate is nearly zero. By contrast,
apolipoproteins are the protein components of protein-lipid complexes in vertebrate blood,
and exchange of almost any hydrophobic amino acid residue in the lipid binding domain for
another hydrophobic residue does not affect protein function. The lipid binding domain
of apolipoproteins is thus tolerant of such substitutions, and consequently their
non-synonymous rate is relatively high.
Within a single protein, different regions can have different substitution rates depending
on functional constraint.
For example, insulin has A and B domains with tight functional constraint
giving a low non-synonymous rate.
Proinsulin has a third domain C, cleaved from A and B to give active insulin, with a
non-synonymous rate seven times greater than A and B, since C is not in the active hormone
and so has a less tight functional constraint.
However, the non-synonymous rate for C is still quite modest because of the functional
constraint that C is required to fold A and B into the active conformation. Another
instance of rate variation within a protein is the hypervariable regions of immunoglobulins,
where non-synonymous rate > synonymous rate. This is the only instance of this yet found,
and arises because amino acid changes are positively selected for in this antigen binding
region of immunoglobulins.
Synonymous
rate also varies, but less so than non-synonymous rate. However, variation
of synonymous rate
is still more than expected from random chance. One
explanation is that selection operates, especially in
highly
expressed genes, for
translational efficiency by favouring the synonymous codon with the most
abundant
tRNA (this is known as 'codon usage bias').
MEAN RATES OF NUCLEOTIDE SUBSTITUTION PER SITE PER 1000 MILLION YEARS
IN DIFFERENT PARTS OF GENES AND IN PSEUDOGENES SUMMARISING DATA
FROM A WIDE RANGE OF DIFFERENT STUDIES
| gene region | rate | region type |
| 5' flanking regions | 2.05 | non-coding |
| 5' untranslated regions | 1.95 | non-coding (mRNA) |
| non-degenerate sites | 0.70 | coding |
| two-fold degenerate sites | 2.05 | coding |
| four-fold degenerate sites | 3.35 | coding |
| introns | 3.20 | non-coding |
| 3' untranslated regions | 2.15 | non-coding (mRNA) |
| 3' flanking regions | 3.10 | non-coding |
| pseudogenes | 3.60 | |
Substitution rates vary within the different parts of a single gene. Non-degenerate
sites (ie, where all three possible substitutions are non-synonymous) have the lowest rate.
Two-fold degenerate sites (ie, where one of the three possible substitutions is
synonymous), 5' flanking regions and both 5' and 3' untranslated regions have the next
highest rates. Four-fold degenerate sites (ie, where all three possible substitutions are
synonymous), introns and 3' flanking regions have higher rates still.
Pseudogenes have the highest rate.
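The definitions of non-degenerate, two-fold and four-fold degenerate sites can be made concrete with a short sketch. It assumes the standard genetic code and classifies a codon position by counting how many of its three possible substitutions are synonymous, following the definitions used on this page. The function names are chosen here for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons in the order TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_changes(codon, position):
    """Count how many of the three possible substitutions at this codon
    position (0, 1 or 2) leave the encoded amino acid unchanged."""
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original:
            count += 1
    return count


def degeneracy(codon, position):
    """Label a site using the definitions in the text above."""
    labels = {
        0: "non-degenerate",
        1: "two-fold degenerate",
        3: "four-fold degenerate",
    }
    # Two synonymous changes (eg, the third position of ATT) fits none of
    # the three categories in the text, so it is reported as "other".
    return labels.get(synonymous_changes(codon, position), "other")


for pos in range(3):
    print("GAT position", pos + 1, "is", degeneracy("GAT", pos))
```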
Again, the key influence here is functional constraint. Pseudogenes are not expressed
and so have no functional constraint, giving the highest rate. Non-coding regions
of genes (ie, flanking, untranslated and introns) have some functional constraint
due to their containing signals that control gene expression and processing,
hence giving intermediate rates.
Non-degenerate sites, where every substitution is non-synonymous, have
the fullest functional constraint and so the lowest rate,
with partially degenerate sites, where only some substitutions are non-synonymous,
subject to lesser functional constraint and so giving a higher rate.
The number
of substitutions observed
is generally less than the actual number of substitutions due
to the following.
These concealed substitutions become increasingly significant as time since divergence (T) increases.
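To illustrate how concealed substitutions grow with divergence, here is a minimal sketch using the Jukes-Cantor correction. This is one standard correction and is not named on this page, so treat it as an assumption brought in for illustration. It assumes all substitutions are equally likely and estimates the actual substitutions per site from the observed proportion of differing sites.

```python
import math


def jukes_cantor(p):
    """Estimate actual substitutions per site from the observed proportion p
    of sites that differ, assuming all base changes are equally likely."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be at least 0 and below 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The gap between observed and estimated values widens as sequences diverge.
for observed in (0.05, 0.20, 0.50):
    estimated = jukes_cantor(observed)
    print(f"observed {observed:.2f}  estimated {estimated:.3f}")
```

The limit of 0.75 reflects the fact that two unrelated random sequences are still expected to match at about a quarter of sites.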
PERCENTAGE OF BASE CHANGES FOR NUCLEOTIDE SUBSTITUTION IN THIRTEEN
MAMMALIAN PSEUDOGENES (THE SECOND ROW OF EACH PAIR, MARKED 'excl. CG',
EXCLUDES ALL CG DINUCLEOTIDES).
| | to A | to T | to C | to G | row totals |
| from A | | 4.7 | 5.0 | 9.4 | 19.1 |
| from A (excl. CG) | | 5.3 | 5.6 | 10.3 | 21.2 |
| from T | 4.4 | | 8.2 | 3.3 | 15.9 |
| from T (excl. CG) | 4.8 | | 9.2 | 3.6 | 17.6 |
| from C | 6.5 | 21.0 | | 4.2 | 31.7 |
| from C (excl. CG) | 7.1 | 18.2 | | 4.2 | 29.5 |
| from G | 20.7 | 7.2 | 5.3 | | 33.2 |
| from G (excl. CG) | 18.6 | 7.7 | 5.5 | | 31.8 |
| column totals | 31.6 | 32.9 | 18.5 | 16.9 | |
| column totals (excl. CG) | 30.5 | 31.2 | 20.3 | 18.1 | |
Pseudogenes show non-random substitution (ie, the row and column totals in the table are
not all 25%). Pseudogenes are duplicated copies of functional genes that have themselves
become devoid of function, and thus have substitution without any functional constraint.
Transitions (between two purines, A to/from G, or two pyrimidines, C to/from T) are half as
likely again as transversions (between a purine and a pyrimidine or vice versa: A to/from C,
A to/from T, G to/from C, G to/from T), whereas random chance would make transversions
twice as likely as transitions, since there are eight possible transversions but only four
possible transitions.
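The "half as likely again" figure can be checked against the pseudogene table above. The sketch below copies the first row of each pair (ie, including CG dinucleotides) and assumes the blank cell in each row is the diagonal (no change). The dictionary layout and names are choices made here for illustration.

```python
# Percentage of each base change, keyed by (from, to), copied from the
# pseudogene table above (figures including CG dinucleotides).
CHANGES = {
    ("A", "T"): 4.7, ("A", "C"): 5.0, ("A", "G"): 9.4,
    ("T", "A"): 4.4, ("T", "C"): 8.2, ("T", "G"): 3.3,
    ("C", "A"): 6.5, ("C", "T"): 21.0, ("C", "G"): 4.2,
    ("G", "A"): 20.7, ("G", "T"): 7.2, ("G", "C"): 5.3,
}
PURINES = {"A", "G"}


def is_transition(old, new):
    """A transition stays within purines (A, G) or within pyrimidines (C, T)."""
    return (old in PURINES) == (new in PURINES)


transitions = sum(v for (a, b), v in CHANGES.items() if is_transition(a, b))
transversions = sum(v for (a, b), v in CHANGES.items() if not is_transition(a, b))

print(f"transitions   {transitions:.1f}%")
print(f"transversions {transversions:.1f}%")
print(f"ratio         {transitions / transversions:.2f}")
```

The two totals do not sum to exactly 100 because the table entries are rounded.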
Row (from) totals = ca 33% for G and C, and ca 17% for A and T. Column (to) totals =
ca 17% for G and C, and ca 33% for A and T. Hence, there is an overall tendency for G and
C to change to A and T, and pseudogenes and non-genic regions are indeed AT rich.
Such patterns reflect mutational frequencies, eg, C to T is relatively easy at 5'CG3'
(C methylates and then methyl-C spontaneously deaminates to T), so C to T frequency is
high. However, a C to T change must also give G to A in the complementary strand, so G to
A is also high, and such symmetry is repeated throughout the table.
Various mathematical corrections can be applied to correct for factors like 'concealed'
substitutions and 'inherent bias' of substitution, and thus estimate 'actual'
substitutions, eg, one of the simplest is the two-parameter model, in which transitions
are more likely than transversions.
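As a sketch of such a correction, the following computes a distance under Kimura's two-parameter model, which is assumed here to be the two-parameter model meant above. It takes two pre-aligned sequences of equal length containing only A, C, G and T (no gaps), which is a simplifying assumption. The function name is hypothetical.

```python
import math

TRANSITION_PAIRS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def kimura_two_parameter(seq1, seq2):
    """Estimated substitutions per site, K = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)],
    where P and Q are the observed proportions of sites showing a transition
    and a transversion respectively."""
    if not seq1 or len(seq1) != len(seq2):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq1)
    pairs = list(zip(seq1, seq2))
    p = sum(1 for pair in pairs if pair in TRANSITION_PAIRS) / n
    q = sum(1 for a, b in pairs if a != b and (a, b) not in TRANSITION_PAIRS) / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy sequences invented for illustration: two transitions in ten sites.
distance = kimura_two_parameter("ACGTACGTAC", "GCGTACGTAT")
print(f"observed 0.20, estimated {distance:.3f}")
```

For very divergent sequences the term inside the logarithm can reach zero or below, in which case the model cannot give an estimate and Python raises a ValueError.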
Phylogenetic analyses assume rate constancy, ie, that substitution rates in equivalent
sequences are the same for all lineages. Otherwise reliable conclusions cannot be drawn,
eg, divergences could seem later or earlier than actuality if rates are respectively slower
or faster than in the other lineages analysed.
Linus Pauling's original "molecular clock" concept was that substitution rates are a global
constant, at least for equivalent sequences. This was, even in the 1960s, a controversial
idea since the fossil record and macroevolutionary ideas
suggest an erratic rate of structural and functional evolution.
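A test for rate constancy is the relative rate test, which compares two species A and
B using a third species C known to have diverged before A and B diverged from each other.
If rates are constant, A and B should be equally distant from C, so KAC-KBC should be
zero. Note that T is not required.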
DIFFERENCES
IN THE NUMBER OF SYNONYMOUS SUBSTITUTIONS PER
100 SITES
(KAC-KBC) BETWEEN MICE (SPECIES A) AND RATS
(SPECIES B) WITH HUMANS
AS REFERENCE (SPECIES C).
| gene | number of sites | KAC-KBC | error |
| apolipoprotein E | 201 | 1.8 | 5.3 |
| actin-alpha | 249 | -0.9 | 4.8 |
| actin-beta | 233 | 5.0 | 4.6 |
| thy-1 antigen | 116 | -5.5 | 6.9 |
| lactate dehydrogenase A | 219 | 0.1 | 8.2 |
| glycoprotein hormone, alpha-subunit | 58 | 13.4 | 18.5 |
| insulin-like growth factor II | 130 | -3.9 | 2.8 |
| atrial natriuretic factor II | 107 | 12.3 | 8.3 |
| growth hormone | 124 | 1.7 | 7.7 |
| thyroglobulin beta | 90 | -15.3 | 12.9 |
| proopiomelanocortin | 154 | 8.8 | 6.5 |
| aldolase A | 184 | -5.8 | 5.3 |
| creatine kinase M | 251 | -3.6 | 4.3 |
| metallothionein II | 35 | 8.8 | 10.2 |
| mean | | 0.4 | 1.5 |
| total | 2187 | | |
DIFFERENCES
IN THE NUMBER OF SYNONYMOUS SUBSTITUTIONS PER
100 SITES
(KAC-KBC) BETWEEN OLD WORLD MONKEY LINEAGE
(A) AND THE HUMAN LINEAGE (B).
| sequence | number of sites | KAC-KBC | error | reference species (C) |
| eta-globin pseudogene | 2000 | 2.1 | 0.7 | owl monkey |
| SYNONYMOUS SITES | | | | |
| beta-globin | 71 | 2.8 | 5.6 | lemur |
| apolipoprotein A-I | 158 | -5.3 | 4.8 | rodentia |
| erythropoietin | 145 | 5.1 | 5.9 | rodentia |
| alpha1-antitrypsin | 140 | 6.7 | 6.8 | rodentia |
| insulin | 84 | -7.5 | 7.2 | dog |
| INTRONS | | | | |
| delta-globin | 601 | 3.4 | 1.4 | lemur |
| UNTRANSLATED AND FLANKING REGIONS | | | | |
| beta-globin | 179 | 1.2 | 1.7 | lemur |
| delta-globin | 172 | 6.1 | 3.2 | lemur |
| mean | | 2.3 | 0.6 | |
| total | 3550 | | | |
So, with A = mice, B = rats and C = humans, KAC-KBC = 0 within error (mean 0.4, error 1.5),
showing rate constancy holds.
And, with A = monkeys, B = humans and C = various, KAC-KBC > 0 by more than the error
(mean 2.3, error 0.6), showing the monkey lineage evolved faster than the human lineage
and rate constancy does not hold.
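The interpretation of the two tables can be sketched as a simple check of whether KAC-KBC differs from zero by more than its error. The threshold of two error units is a common convention assumed here, not something specified in the module, and the function name is hypothetical.

```python
def rate_constancy_holds(difference, error, threshold=2.0):
    """Return True if KAC - KBC is indistinguishable from zero.

    difference -- KAC - KBC (substitutions per 100 sites)
    error      -- the error quoted for that difference
    threshold  -- how many error units count as a real difference
                  (2.0 is an assumed convention, not from the source)
    """
    return abs(difference) <= threshold * error


# Mean rows copied from the two tables above.
print("mice vs rats:     ", rate_constancy_holds(0.4, 1.5))
print("monkeys vs humans:", rate_constancy_holds(2.3, 0.6))
```

With these figures the first comparison is consistent with rate constancy and the second is not, which matches the conclusions drawn in the text.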
Thus substitution
rates for equivalent sections of equivalent genes can vary between
lineages.
Proposed causes for this include the following.
Thus, it
is generally found that rate constancy is only local between groups of closely
related species,
and there is no global molecular clock.
Compared sequences must also be correctly aligned for analysis. This can be difficult
where substitution is extensive and where indels are present in sequences. A simple, widely
used alignment method is the 'dot matrix', where a grid is constructed with one sequence
as the columns and the other sequence as the rows, with a dot in each grid square
where the column and row nucleotides are the same. The correct alignment appears
as a diagonal line of dots, substitutions as gaps in this line, and indels as
disjoints (dog-legs) in this line. Mathematical and/or IT-based methods are also available
for alignment.
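A dot matrix is easy to sketch in code. The example below is a minimal version with no filtering of chance matches, using two short toy sequences invented for illustration. The function name is an assumption made here.

```python
def dot_matrix(column_seq, row_seq):
    """Return the grid as a list of strings, one per row nucleotide, with
    '*' where the row and column nucleotides match and '.' elsewhere."""
    return [
        "".join("*" if row_base == col_base else "." for col_base in column_seq)
        for row_base in row_seq
    ]


columns = "GATTACA"
rows = "GATCACA"  # differs from the column sequence by one substitution

print("  " + columns)
for base, line in zip(rows, dot_matrix(columns, rows)):
    print(base + " " + line)
```

In the printed grid the main diagonal is unbroken except at the fourth position, where the substitution leaves a gap. The scattered off-diagonal marks are chance matches, which is why practical dot matrix tools usually compare short windows rather than single bases.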
Evolutionary history and kinship are commonly expressed via phylogenetic trees. Phylogenetic
trees diagrammatically express the evolutionary history and relatedness of
sequences (or genes or species).
External nodes ('OTUs', operational taxonomic units) are the extant modern sequences
(or genes or species). Internal nodes are the ancestral sequences (or genes or
species). The branches are the evolutionary relationships and the branch
lengths the evolutionary distances. Phylogenetic trees can be either rooted (where a
unique evolutionary path is defined) or unrooted
(where a path is not defined, and the tree only shows relatedness).
Phylogenetic trees can also be either scaled (where branch lengths show evolutionary
distance, sometimes called 'additive') or unscaled (where branch lengths carry no such
meaning).
Ideally, the sequence tree is the gene tree, which in turn is the true species tree. This
is often assumed, but the period between the mutation event and fixation (ie, while there
is polymorphism at that site) means there will be some error. This is usually small
compared to the scale of the tree. Closer time intervals between divergences will increase
the chance of error, both within the sequence tree and in inference of the
true species tree. Comparison of more sequences of longer length will decrease error
probability.
Various tree construction methods are available. Some use sophisticated mathematics,
often now employing IT.
The simplest are distance methods based on observed differences. More
complicated are the character-state methods based on analysing the potential pathways
that could have led to the observed sequences. Many of the tree construction methods
require one or more outgroups to root the tree. An outgroup is an OTU known to have
diverged before all the others in the tree. An outgroup must not be too distant, or
comparison is not feasible.
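As an illustration of a distance method, the sketch below uses UPGMA (average-linkage clustering), one of the simplest such methods. It is chosen here as an example and is not named on this page. It assumes pre-aligned sequences of equal length and rate constancy across lineages. The four toy sequences are invented purely for illustration and are not real data.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (an observed difference)."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def average_distance(members1, members2, sequences):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        p_distance(sequences[a], sequences[b])
        for a in members1
        for b in members2
    )
    return total / (len(members1) * len(members2))


def upgma(sequences):
    """Build a tree as nested tuples by repeatedly joining the two closest
    clusters. `sequences` maps an OTU name to its aligned sequence."""
    # Each cluster is a (subtree, list of member names) pair.
    clusters = [(name, [name]) for name in sequences]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(clusters[i][1], clusters[j][1], sequences)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy, made-up sequences; R plays the role of a more distant outgroup.
toy = {
    "H": "AAAAAAAAAA",
    "C": "AAAAAAAAAT",
    "G": "AAAAAAATTC",
    "R": "AACCCCATGG",
}
print(upgma(toy))
```

The nested tuple groups H with C first, then adds G, then R. Because UPGMA assumes equal rates in all lineages, it can give a misleading tree where the relative rate test shows that rate constancy does not hold.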
The human-chimp-gorilla (H-C-G) relationship is a good example of molecular phylogenetics
applied to a difficult and controversial evolutionary issue. This was studied using globin
gene sequences, with rhesus monkey as the outgroup, and distance methods for tree
construction. C seems closest to H, but the H-C common ancestor is very close to the
(HC)-G common ancestor, so the chance of error is highly significant. Nevertheless, this
tree has now been confirmed using further sequence data and character-state tree
construction methods.
Note that the orangutan is confirmed as outgroup to H-C-G, as its Asian geography suggests.
Another
good example of molecular phylogenetics was the use of the very slowly changing
ribosomal
RNA gene sequences to study very ancient evolutionary events such as the
divergence
of the three
domains of life and the endosymbiotic origins of the plastids. See
the preceding section.
Another good example of molecular phylogenetics is the use of the very rapidly changing
viral coat protein sequences of certain viruses like HIV to study patterns of progression
and infection. The extremely high rate of change required to evade the immune system makes
a time scale of a few years possible for such studies. One can construct phylogenetic trees
of numerous sub-types of HIV from patient groups, or even within a single patient.
DISCLAIMER
The content, learning and assessment of these modules, as detailed herein, may
be subject to alteration without notice, should circumstances necessitate.
COPYRIGHT
Page created and maintained by Dr Andrew J White, Department of Biological Sciences, Staffordshire
University, College Road,
Stoke-on-Trent ST4 2DE, United Kingdom. Tel +44 1782 294613, email a.j.white@staffs.ac.uk