Evolution
A support site for modules SC0060-3 Evolution and SCM009-M Molecular Evolution

Page last updated 14/02/02

Sequence phylogenetics
[for SC0060-3 and SCM009-M]

Phylogenetics is the study of evolutionary history and relatedness. Molecular phylogenetics is the study of
evolutionary history and relatedness via DNA base or protein amino acid sequences, or occasionally via
protein 3D conformation.
Before the 1960s, there was no sequencing, and study was only possible at the phenotypic level.
In the 1960s and for most of the 1970s, there was no DNA sequencing, but amino acid sequences could
be studied. In the 1980s and 1990s, DNA sequencing capability grew rapidly, and DNA sequences were
increasingly studied. In 2000, the landmark sequence of the human genome (ca 3,300 million bp) and
several others were obtained. In the 2000s, there is the emergent science of bioinformatics, which is the
study of patterns in and relationships between genomic sequences using IT.

Phylogenetic trees are used to represent phylogenetic relationships diagrammatically. Phylogenetics is a
'cladistic approach'. Cladistics is a traditional taxonomic term meaning to categorise on the basis of
evolutionary relationships (as compared to phenetics, which is categorising on the basis of observed
similarities and differences between organisms).
A clade is a group of species sharing a common ancestor not shared by any other species
outside the clade. For example, comparison of DNA sequences shows that the classical
taxonomic class Reptilia is not a clade, since the reptile orders share a common
ancestor with birds (ie, the class Aves), which Reptilia excludes. However, Aves and the order Crocodilia
together form the clade Archosauria, since they share a common ancestor not shared by any other species
outside the clade.

The basis of molecular phylogenetics is comparison of equivalent extant sequences that diverged from a
common ancestor within the time period for which significant comparison is possible. Sequences accumulate
changes over time through mutation (note that mutations are only inherited, and therefore only of
evolutionary relevance, when they occur in germ cells).
The types of mutations are as follows.

Substitutions are the commonest mutation, but all mutations are rare. Mutations occur in an individual,
and are then either fixed (ie, after a time are present in every individual in the population) or lost
(ie, after a time are eliminated from the population). Fixation probability depends on which of the
following applies to the mutation.

Mutations can be in coding regions of genes, and so could alter protein amino acid sequence; or in
non-coding regions of genes such as 5’ and 3’ flanking and untranslated regions and introns, where they
could affect gene expression; or in non-genic regions. Substitutions in coding regions can be one
of the following.

Substitution is usually very gradual, taking place over millions of years, so it cannot be directly
observed. One can only compare extant sequences and infer phylogeny (note that PCR of ancient DNA is a
very limited tool, with best results of only ca 40,000 ya, and most sequences of just a few
mitochondrial bases).
Substitution rate = K / 2T, where K is substitutions per site between two compared sequences and
T is time since divergence of the two compared sequences (ie, since they last shared a common ancestor).
It's 2T rather than 1T because both lineages have accumulated substitutions independently for time T
since the divergence event, so the two sequences are separated by 2T of evolution. T is often estimated
from the fossil record, but this can be unreliable or even impossible.
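As a rough illustration of the formula above, the following Python sketch computes a substitution rate from K and T. The value K = 0.74 is a made-up example, not a measured figure; T = 80 million years is the human-rodent divergence time used in these notes.

```python
def substitution_rate(k, t_years):
    """Substitutions per site per year, given K (substitutions per site
    between two sequences) and T (years since they diverged)."""
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    # Divide by 2T because both lineages have evolved for time T.
    return k / (2 * t_years)


# Hypothetical K = 0.74 substitutions per site; T = 80 million years.
rate = substitution_rate(0.74, 80e6)
print(f"{rate * 1e9:.2f} substitutions per site per 1000 million years")
```

Multiplying by 10^9 expresses the rate in the per-1000-million-year units used in the tables of these notes.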

RATES OF NUCLEOTIDE SUBSTITUTION PER SITE PER 1000 MILLION YEARS BETWEEN
VARIOUS HUMAN AND RODENT PROTEIN-CODING GENES, WITH DIVERGENCE SET AT
80 MILLION YEARS BASED ON FOSSIL EVIDENCE.

gene                      number of codons   non-synonymous rate   synonymous rate

HISTONES
histone 3                 135                0.00                  6.38
histone 4                 101                0.00                  6.12

ACTINS
actin alpha               376                0.01                  3.68
actin beta                349                0.03                  3.13

SIGNALS
somatostatin-28           28                 0.00                  3.97
insulin                   51                 0.13                  4.02
thyrotropin               118                0.33                  4.66
erythropoietin            191                0.72                  4.34
insulin C peptide         35                 0.91                  6.77
parathyroid hormone       90                 0.94                  4.18
luteinizing hormone       141                1.02                  3.29
growth hormone            189                1.23                  4.95
interleukin I             265                1.42                  4.60
relaxin                   54                 2.51                  7.49

GLOBINS
alpha-globin              141                0.55                  5.14
myoglobin                 153                0.56                  4.44
beta-globin               144                0.80                  3.05

APOLIPOPROTEINS
E                         283                0.98                  4.04
A-I                       243                1.57                  4.47
A-IV                      371                1.58                  4.15

IMMUNOGLOBULINS
Ig-VH                     100                1.07                  5.66
Ig-gamma1                 321                1.46                  5.11
Ig-kappa                  106                1.87                  5.90

INTERFERONS
alpha-1                   166                1.41                  3.53
beta-1                    159                2.21                  5.88
gamma                     136                2.79                  8.59

ENZYMES
aldolase A                363                0.07                  3.59
creatine kinase M         380                0.15                  3.08
GAPDH                     331                0.20                  2.84
lactate dehydrogenase A   331                0.20                  5.03

mean                                         0.85                  4.61
SD                                           0.73                  1.44

Substitution rate will depend on mutational input rate and fixation probability. The former varies too little
to account for the differences in observed rates above (eg, there is only a two-fold difference between
the minimum and maximum mutation rates observed across mammalian genomes, this variation probably being
due to differences in GC richness). The latter can account for the differences in observed rates through
differences in functional constraint, ie, how likely a substitution is to alter protein function.

For each protein, synonymous rate > non-synonymous rate (5.4-fold greater on average). Non-synonymous
substitutions are very likely to be deleterious and thus not fixed in the population, whilst synonymous
substitutions should be neutral and so are much more often fixed in the population.
Whilst most non-synonymous substitutions are deleterious, a few are neutral, and very rarely
they enhance fitness, and these last might be crucial to evolution.
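The 5.4-fold figure can be checked from the table means. The short Python sketch below does this, and also shows the fold difference for a few individual genes copied from the table; the choice of genes and the function name are purely illustrative.

```python
# (non-synonymous rate, synonymous rate) per site per 1000 million years,
# copied from a few rows of the human-rodent table in these notes.
rates = {
    "histone 3": (0.00, 6.38),
    "insulin": (0.13, 4.02),
    "alpha-globin": (0.55, 5.14),
    "interferon gamma": (2.79, 8.59),
}
TABLE_MEANS = (0.85, 4.61)  # the "mean" row of the same table


def fold_difference(non_syn, syn):
    """How many times greater the synonymous rate is than the non-synonymous rate."""
    return float("inf") if non_syn == 0 else syn / non_syn


for gene, (non_syn, syn) in rates.items():
    print(f"{gene}: {fold_difference(non_syn, syn):.1f}-fold")

mean_fold = fold_difference(*TABLE_MEANS)
print(f"ratio of table means: {mean_fold:.1f}-fold")
```

Note that this is the ratio of the two means, which is not the same thing as the mean of the per-gene ratios (the latter is undefined when a non-synonymous rate is zero).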

The non-synonymous rate is very low for histones and actins, low for enzymes and some hormones like insulin,
moderate for globins and some hormones, and high for apolipoproteins, immunoglobulins, interferons and some
hormones (ie, hormones vary greatly). A stronger functional constraint leads to a slower non-synonymous rate
(and vice versa).
For example, histones intimately and tightly bind DNA in the eukaryotic nucleus, and so must be
compact and alkaline, with almost every amino acid residue involved in an interaction
with DNA or another histone.
Histones are thus highly intolerant of amino acid changes (ie, a tight functional constraint), and
consequently their non-synonymous rate is nearly zero.
By contrast, apolipoproteins are the protein components of the protein-lipid complexes in vertebrate blood,
and exchange of almost any hydrophobic amino acid residue in the lipid binding domain for one of the other
hydrophobic amino acid residues does not affect protein function. The lipid binding
domain of apolipoproteins is thus tolerant of such substitutions, and consequently their
non-synonymous rate is relatively high.

Within a single protein, different regions can have different substitution rates depending on functional constraint.
For example, insulin has A and B domains with tight functional constraint, giving a low non-synonymous rate.
Proinsulin has a third domain C, cleaved from A and B to give active insulin, with a non-synonymous rate seven
times greater than A and B, since C is not in the active hormone and so has a less tight functional constraint.
However, the non-synonymous rate for C is still quite modest because of the functional constraint that C is
required to fold A and B into the active conformation.
Another instance of rate variation within a protein is the hypervariable regions of
immunoglobulins, where non-synonymous rate > synonymous rate. This is one of the few known instances
of this. It arises because amino acid changes are positively selected for in this antigen binding region
of immunoglobulins.

Synonymous rate also varies, but less so than non-synonymous rate. However, variation of synonymous rate
is still more than expected from random chance.
One explanation is that selection operates, especially in
highly expressed genes, for translational efficiency by favouring the synonymous codon with the most
abundant tRNA (this is known as 'codon usage bias').

MEAN RATES OF NUCLEOTIDE SUBSTITUTION PER SITE PER 1000 MILLION YEARS
IN DIFFERENT PARTS OF GENES AND IN PSEUDOGENES, SUMMARISING DATA
FROM A WIDE RANGE OF DIFFERENT STUDIES

gene region                 rate   region type
5' flanking regions         2.05   non-coding
5' untranslated regions     1.95   non-coding (mRNA)
non-degenerate sites        0.70   coding
two-fold degenerate sites   2.05   coding
four-fold degenerate sites  3.35   coding
introns                     3.20   non-coding
3' untranslated regions     2.15   non-coding (mRNA)
3' flanking regions         3.10   non-coding

pseudogenes                 3.60

Substitution rates vary within the different parts of a single gene. Non-degenerate sites (ie, where all
three possible substitutions are non-synonymous) have the lowest rate.
Two-fold degenerate sites (ie, where one of the three possible substitutions is synonymous),
5' flanking regions and both 5' and 3' untranslated regions have the next highest rates.
Four-fold degenerate sites (ie, where all three possible substitutions are synonymous),
introns and 3' flanking regions have higher rates still.
Pseudogenes have the highest rate.
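The definitions of non-degenerate, two-fold and four-fold degenerate sites can be made concrete with a short Python sketch. It assumes the standard genetic code (written here with DNA bases), and the function name is hypothetical; it simply counts how many of the three possible substitutions at a codon position leave the amino acid unchanged.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_changes(codon, position):
    """Count the substitutions (0 to 3) at this codon position (0, 1 or 2)
    that do not change the encoded amino acid."""
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original:
            count += 1
    return count


# 0 = non-degenerate, 1 = two-fold degenerate, 3 = four-fold degenerate.
# (A count of 2 occurs only at the third position of the isoleucine codons.)
for codon, position in [("GGT", 2), ("GAT", 2), ("ATG", 2), ("GGT", 1)]:
    print(codon, position, synonymous_changes(codon, position))
```

In this sketch the third position of GGT (glycine) is four-fold degenerate, the third position of GAT (aspartate) is two-fold degenerate, and the second position of GGT is non-degenerate.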

Again, the key influence here is functional constraint. Pseudogenes are not expressed and so have
no functional constraint, giving the highest rate. Non-coding regions of genes (ie, flanking,
untranslated and introns) have some functional constraint due to their containing signals that
control gene expression and processing, hence giving intermediate rates.
Non-degenerate sites, where every substitution is non-synonymous, have the fullest functional constraint
and so the lowest rate, whilst two-fold degenerate sites, where one of the three possible substitutions
is synonymous, are subject to lesser functional constraint and so give a higher rate.

The number of substitutions observed is generally less than the actual number of substitutions due
to the following.

These concealed substitutions become increasingly significant as time since divergence (T) increases.

PERCENTAGE OF BASE CHANGES FOR NUCLEOTIDE SUBSTITUTION IN THIRTEEN
MAMMALIAN PSEUDOGENES (ROWS MARKED 'excl. CG' ARE FIGURES
EXCLUDING ALL CG DINUCLEOTIDES).

                 to A    to T    to C    to G    row totals
from A           -       4.7     5.0     9.4     19.1
  (excl. CG)     -       5.3     5.6     10.3    21.2
from T           4.4     -       8.2     3.3     15.9
  (excl. CG)     4.8     -       9.2     3.6     17.6
from C           6.5     21.0    -       4.2     31.7
  (excl. CG)     7.1     18.2    -       4.2     29.5
from G           20.7    7.2     5.3     -       33.2
  (excl. CG)     18.6    7.7     5.5     -       31.8
column totals    31.6    32.9    18.5    16.9
  (excl. CG)     30.5    31.2    20.3    18.1

Pseudogenes show non-random substitution (ie, the row and column totals in the table are not each 25%,
as they would be if all changes were equally likely).
Pseudogenes are duplicated copies of functional genes that have themselves become
devoid of function, and thus have substitution without any functional constraint.

Transitions (between two purines, A to/from G, or two pyrimidines, C to/from T) are half as likely again as
transversions (between a purine and a pyrimidine or vice versa: A to/from C, A to/from T, G to/from C,
G to/from T), but random chance would give transversions to be twice as likely as transitions.
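The "half as likely again" figure can be checked directly from the pseudogene table (using the figures that include CG dinucleotides). The Python sketch below sums the transition and transversion percentages; the variable names are illustrative only.

```python
# Percentage of base changes from the pseudogene table: (from, to) -> %.
changes = {
    ("A", "T"): 4.7, ("A", "C"): 5.0, ("A", "G"): 9.4,
    ("T", "A"): 4.4, ("T", "C"): 8.2, ("T", "G"): 3.3,
    ("C", "A"): 6.5, ("C", "T"): 21.0, ("C", "G"): 4.2,
    ("G", "A"): 20.7, ("G", "T"): 7.2, ("G", "C"): 5.3,
}
PURINES = {"A", "G"}


def is_transition(old, new):
    """True when both bases are purines or both are pyrimidines."""
    return (old in PURINES) == (new in PURINES)


transitions = sum(pct for (old, new), pct in changes.items() if is_transition(old, new))
transversions = sum(pct for (old, new), pct in changes.items() if not is_transition(old, new))
ratio = transitions / transversions

print(f"transitions {transitions:.1f}%, transversions {transversions:.1f}%")
print(f"observed ratio {ratio:.2f}; random expectation 4/8 = 0.50")
```

Only 4 of the 12 possible base changes are transitions, which is why random chance alone would give a ratio of 0.5 rather than the observed value of roughly 1.5.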

Row (from) totals = ca.33% for G and C, and ca.17% for A and T. Column (to) totals = ca.17% for G and C,
and ca.33% for A and T. Hence, there is an overall tendency for G and C to change to A and T, and
pseudogenes and non-genic regions are indeed AT rich.

Such patterns reflect mutational frequencies, eg, C to T is relatively easy at 5'CG3' (C methylates and then
methyl-C spontaneously deaminates to T), so the C to T frequency is high. However, a C to T change
must also give G to A in the complementary strand, so G to A is also high, and such symmetry is repeated
throughout the table.

Various mathematical corrections can be applied to allow for factors like 'concealed' substitutions
and 'inherent bias' of substitution, and thus estimate the actual number of substitutions, eg, one of
the simplest is the two-parameter model, in which transitions are more likely than transversions.
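As an illustration of such a correction, the Python sketch below uses the standard form of Kimura's two-parameter distance, K = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of sites differing by a transition and by a transversion. It is assumed here that this is the two-parameter model meant above. The sequences are made up, and the sketch assumes two pre-aligned sequences of equal length containing only A, C, G and T (no gaps).

```python
import math

PURINES = {"A", "G"}


def kimura_two_parameter(seq1, seq2):
    """Estimated substitutions per site between two aligned sequences,
    corrected for concealed substitutions and transition bias."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    p = transitions / len(seq1)
    q = transversions / len(seq1)
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Made-up 20-base example: two transitions (A->G) and one transversion (A->C).
ancestral_like = "A" * 20
derived_like = "GGC" + "A" * 17
observed = 3 / 20
corrected = kimura_two_parameter(ancestral_like, derived_like)
print(f"observed {observed:.3f}, corrected {corrected:.3f} substitutions per site")
```

The corrected estimate is always at least as large as the observed proportion of differences, and the gap widens as sequences become more divergent, which matches the point that concealed substitutions matter more as T increases.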

Phylogenetic analyses assume rate constancy, ie, that substitution rates in equivalent sequences are
the same for all lineages. Otherwise reliable conclusions cannot be drawn, eg, divergences could
seem later or earlier than in actuality if rates in a lineage are respectively slower or faster than
in the others analysed.
The original "molecular clock" concept of Emile Zuckerkandl and Linus Pauling was that substitution
rates are a global constant, at least for equivalent sequences.
This was, even in the 1960s, a controversial idea, since the fossil record and macroevolutionary
ideas suggest an erratic rate of structural and functional evolution.

A test for rate constancy is the relative rate test that compares two species A and B using
a third species C known to have diverged before A and B. Note that T is not required.

DIFFERENCES IN THE NUMBER OF SYNONYMOUS SUBSTITUTIONS PER 100 SITES
(KAC-KBC) BETWEEN MICE (SPECIES A) AND RATS (SPECIES B) WITH HUMANS
AS REFERENCE (SPECIES C).

gene                                  number of sites   KAC-KBC   error
apolipoprotein E                      201               1.8       5.3
actin-alpha                           249               -0.9      4.8
actin-beta                            233               5.0       4.6
thy-1 antigen                         116               -5.5      6.9
lactate dehydrogenase A               219               0.1       8.2
glycoprotein hormone, alpha-subunit   58                13.4      18.5
insulin-like growth factor II         130               -3.9      2.8
atrial natriuretic factor II          107               12.3      8.3
growth hormone                        124               1.7       7.7
thyroglobulin beta                    90                -15.3     12.9
proopiomelanocortin                   154               8.8       6.5
aldolase A                            184               -5.8      5.3
creatine kinase M                     251               -3.6      4.3
metallothionein II                    35                8.8       10.2
mean                                                    0.4       1.5
total                                 2187

DIFFERENCES IN THE NUMBER OF SYNONYMOUS SUBSTITUTIONS PER 100 SITES
(KAC-KBC) BETWEEN OLD WORLD MONKEY LINEAGE (A) AND THE HUMAN LINEAGE (B).

sequence                number of sites   KAC-KBC   error   reference species (C)
eta-globin pseudogene   2000              2.1       0.7     owl monkey

SYNONYMOUS SITES
beta-globin             71                2.8       5.6     lemur
apolipoprotein A-I      158               -5.3      4.8     rodentia
erythropoietin          145               5.1       5.9     rodentia
alpha1-antitrypsin      140               6.7       6.8     rodentia
insulin                 84                -7.5      7.2     dog

INTRONS
delta-globin            601               3.4       1.4     lemur

UNTRANSLATED AND FLANKING REGIONS
beta-globin             179               1.2       1.7     lemur
delta-globin            172               6.1       3.2     lemur

mean                                      2.3       0.6
total                   3550

So, with A = mice, B = rats and C = humans, KAC-KBC = 0 within error (mean 0.4, error 1.5), showing
rate constancy holds.

And, with A = monkeys, B = humans and C = various, KAC-KBC > 0 by more than the error (mean 2.3,
error 0.6), showing the monkey lineage evolved faster than the human lineage and rate constancy does
not hold.
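The logic of the relative rate test can be sketched in a few lines of Python. If O is the common ancestor of A and B, then KAC = KOA + KOC and KBC = KOB + KOC, so KAC-KBC = KOA-KOB: the shared path to C cancels, which is why T is not needed. The sketch below assumes the 'error' column is a standard error and uses a threshold of two standard errors; both are assumptions made for illustration, not something stated in the tables.

```python
def rate_constancy_holds(k_difference, error, threshold=2.0):
    """True when KAC-KBC is indistinguishable from zero, ie, when it lies
    within `threshold` standard errors of zero (assumed criterion)."""
    return abs(k_difference) <= threshold * error


# Mean KAC-KBC and error from the two tables in these notes.
comparisons = {
    "mouse vs rat (reference: human)": (0.4, 1.5),
    "Old World monkey vs human (reference: various)": (2.3, 0.6),
}

for label, (difference, error) in comparisons.items():
    verdict = "holds" if rate_constancy_holds(difference, error) else "does not hold"
    print(f"{label}: KAC-KBC = {difference} +/- {error}, rate constancy {verdict}")
```

With this criterion the mouse-rat comparison is consistent with equal rates, whilst the monkey-human difference is nearly four standard errors from zero.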

Thus substitution rates for equivalent sections of equivalent genes can vary between lineages.
Proposed causes for this include the following.

Thus, it is generally found that rate constancy is only local between groups of closely related species,
and there is no global molecular clock.

Compared sequences must also be correctly aligned for analysis. This can be difficult where substitution
is extensive and where indels are present in sequences. A simple, widely used alignment method is the
'dot matrix', where a grid is constructed with one sequence as the columns and the other sequence as
the rows, with a dot in each grid square where the column and row nucleotides are the same. The correct
alignment appears as a diagonal line of dots, substitutions as gaps in this line, and indels as disjoints
(dog-legs) in this line. Mathematical and/or IT-based methods are also available for alignment.
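The dot matrix described above is simple enough to sketch in Python. The two short sequences are made up for illustration; matching squares are drawn as '*' and non-matching squares as '.'.

```python
def dot_matrix(row_seq, col_seq):
    """Return one string per row of the grid: '*' where the row and column
    nucleotides match, '.' where they differ."""
    return [
        "".join("*" if row_base == col_base else "." for col_base in col_seq)
        for row_base in row_seq
    ]


# Made-up sequences: the second has lost the 'GG' present in the first.
seq_rows = "ACTGGTCA"
seq_cols = "ACTTCA"

print("  " + seq_cols)
for base, line in zip(seq_rows, dot_matrix(seq_rows, seq_cols)):
    print(base + " " + line)
```

In the printed grid the main diagonal runs for the first three bases and then resumes two rows lower, which is the dog-leg produced by the two-base indel. The isolated off-diagonal stars are chance matches, which is why practical dot-matrix tools usually compare short windows of bases rather than single bases.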

Evolutionary history and kinship are commonly expressed via phylogenetic trees. Phylogenetic trees
diagrammatically express the evolutionary history and relatedness of sequences (or genes or species).
External nodes ('OTUs', operational taxonomic units) are the extant modern sequences (or genes or
species). Internal nodes are the ancestral sequences (or genes or species). The branches are the
evolutionary relationships and branch lengths the evolutionary distances. Phylogenetic trees can either
be rooted (where a unique evolutionary path is defined) or unrooted (where a path is not defined, and
the tree only shows relatedness).
Phylogenetic trees can either be scaled (where branch lengths show evolutionary distance, sometimes
called 'additive') or unscaled (where branch lengths carry no such information).

Ideally, the sequence tree is the gene tree, which in turn is the true species tree. This is often assumed,
but the period between the mutation event and fixation (ie, while there is polymorphism at that site)
means there will be some error. This is usually small compared to the scale of the tree. Closer time
intervals between divergences will increase the chance of error, both within the sequence tree and in
inference of the true species tree. Comparison of more sequences of longer length will decrease error
probability.

Various tree construction methods are available. Some use sophisticated mathematics, often now employing IT.
The simplest are distance methods based on observed differences.
More complicated are the character-state methods based on analysing the potential pathways that
could have led to the observed sequences. Many of the tree construction methods require
one or more outgroups to root the tree. An outgroup is an OTU known to have diverged
before all the others in the tree. An outgroup must not be too distant, or comparison is not feasible.
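As one example of a distance method, the Python sketch below implements UPGMA (average-linkage clustering), which repeatedly joins the two closest clusters. UPGMA is chosen here only because it is short; it is not necessarily the method used in the studies described in these notes, and it assumes rate constancy across lineages. The distance values are invented for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a tree (as nested tuples) from a symmetric distance matrix
    by repeatedly joining the two closest clusters."""
    # cluster id -> (subtree, number of OTUs in the cluster)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # distance between every pair of current clusters
    d = {frozenset((i, j)): dist[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)                 # closest pair of clusters
        a, b = sorted(pair)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        del d[pair]
        for c in clusters:
            # new distance is the size-weighted average of the old distances
            d_ac = d.pop(frozenset((a, c)))
            d_bc = d.pop(frozenset((b, c)))
            d[frozenset((next_id, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Invented pairwise distances (substitutions per 100 sites) for four OTUs.
otus = ["H", "C", "G", "O"]
distances = [
    [0.0, 1.5, 1.8, 3.0],
    [1.5, 0.0, 1.9, 3.1],
    [1.8, 1.9, 0.0, 3.2],
    [3.0, 3.1, 3.2, 0.0],
]
tree = upgma(otus, distances)
print(tree)
```

Because UPGMA assumes a constant rate, it places the root itself and does not need an outgroup; methods that do not make that assumption generally do need one, as described above.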

The human-chimp-gorilla (H-C-G) relationship is a good example of molecular phylogenetics applied to a
difficult and controversial evolutionary issue. This was studied using globin gene sequences, with rhesus
monkey as the outgroup, and distance methods for tree construction. C seems closest to H, but the H-C
common ancestor is very close to the (HC)-G common ancestor, so the chance of error is highly
significant. Nevertheless, this tree has now been confirmed using further sequence data and
character-state tree construction methods. Note that the orangutan is confirmed as outgroup to H-C-G,
as its Asian geography suggests.

Another good example of molecular phylogenetics was the use of the very slowly changing ribosomal
RNA gene sequences to study very ancient evolutionary events such as the
divergence of the three
domains of life and the endosymbiotic origins of the plastids.
See the preceding section.

Another good example of molecular phylogenetics is the use of the very rapidly changing viral coat protein
sequences of certain viruses like HIV to study patterns of progression and infection.
The extremely high rate of change required to evade the immune system makes a time scale of a
few years possible for such studies. One can construct phylogenetic trees of numerous
sub-types of HIV from patient groups, or even within a single patient.


DISCLAIMER
The content, learning and assessment of these modules, as detailed herein, may be subject to alteration without notice, should circumstances necessitate.
COPYRIGHT
Page created and maintained by Dr Andrew J White, Department of Biological Sciences, Staffordshire University, College Road,
Stoke-on-Trent ST4 2DE, United Kingdom. Tel +44 1782 294613, email
a.j.white@staffs.ac.uk