Evolution
A support site for modules SC0060-3 Evolution
and SCM009-M Molecular Evolution
Page last updated 23/02/02
Gene phylogenetics [for SC0060-3 and SCM009-M]
Proteins are the primary gene product and the functional basis of life. Evolution can be
studied via analysis of protein sequences as well as of DNA sequences. Amino
acid sequences have a lower information density
than DNA sequences, since the former
do not show synonymous and non-coding substitutions. However,
amino acid substitutions
can show functional constraints more explicitly. For example, one can
observe
amino acids that are crucial to protein function and/or structure as invariant
(ie, the
same residue in the
same place in this protein in all lineages) or conservative (ie,
the residue in the same place in this protein
in all lineages can alter, but only to an
amino acid of similar size or chemistry).
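The distinction between invariant and conservative positions can be sketched in code. The following is a minimal illustration only: the residue groupings in `GROUPS` are one possible assumption about "similar size or chemistry" (other schemes exist), and the three aligned sequences are made up for the example.

```python
# Hypothetical physico-chemical groupings of amino acids (an assumption;
# different textbooks group the residues differently).
GROUPS = {
    "aliphatic": set("AVLIM"),
    "aromatic": set("FWY"),
    "polar": set("STNQ"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "special": set("CGP"),
}


def classify_column(residues):
    """Classify one alignment column as invariant, conservative or variable."""
    distinct = set(residues)
    if len(distinct) == 1:
        return "invariant"  # same residue in every lineage
    for members in GROUPS.values():
        if distinct <= members:
            return "conservative"  # residues differ but share a group
    return "variable"


def classify_alignment(sequences):
    """Classify every column of a gap-free alignment of equal-length sequences."""
    return [classify_column(column) for column in zip(*sequences)]


# Made-up aligned sequences from three lineages.
alignment = ["KDLVC", "KELIC", "KDLAS"]
print(classify_alignment(alignment))
```

In this toy alignment the first and third columns are invariant, the second and fourth are conservative, and the last is variable under the assumed groupings.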
3D structure
of proteins (as determined by X-Ray Crystallography) can also be used to
study evolution.
This is done by comparison of two 3D structures via superposition, where
one structure is translated and rotated relative to the other until the sum of
squared distances between equivalent atoms is minimised, and then
the difference is quantified as the root mean square (rms) of the distances
between equivalent atoms. Where the sequences align to statistical significance,
the 3D structures can be expected to superpose to statistical significance as well.
This is a useful fact, since if the 3D structure of a protein is unknown, but its
sequence aligns significantly to the sequence of a protein of known 3D structure, then
the unknown 3D structure can be inferred to be similar to that of the known one.
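The rms measure described above can be sketched as follows. This is a simplified illustration, not a full superposition method: it only handles the translation step (moving both structures to a common centroid) and assumes the optimal rotation has already been applied or is not needed. Finding the optimal rotation requires a further step (for instance the Kabsch algorithm), which is omitted here. The coordinates are made up.

```python
import math


def centre(coords):
    """Translate a list of (x, y, z) atoms so that their centroid is at the origin."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    return [tuple(p[i] - centroid[i] for i in range(3)) for p in coords]


def rmsd(coords_a, coords_b):
    """Root mean square distance between equivalent atoms of two structures."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty, equal-length lists of equivalent atoms")
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the second structure is the first one shifted in space.
known = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in known]

print(rmsd(known, shifted))                  # large before translation
print(rmsd(centre(known), centre(shifted)))  # close to zero after translation
```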
Significant superposition of protein 3D structures is often possible where significant
sequence alignment can no longer be detected, ie, a protein's 3D structure changes less
over time than its sequence. This is not surprising given that the 3D structure generates the
function. Thus, comparisons using 3D structure should be more robust, and remain
possible over a longer time period, than those using sequence.
To a first approximation, a single gene is transcribed and translated into a single
protein, and so any consideration of proteins
amounts to a consideration of the genes themselves.
Genes seem to have increased in complexity and number through DNA duplication events,
ie, where sequence segments are copied and the copy then inserted elsewhere. DNA
duplications can occur within a gene ("internal duplications"), or can copy parts of
genes to elsewhere ("exon shuffling"), complete genes ("gene duplication"), several genes,
parts of chromosomes, whole chromosomes (the latter giving "aneuploidy"), or even the
whole genome (giving "polyploidy").
Internal
duplications are often of one or more exons. Exons often approximately correspond
to functional
protein domains, so that internal exon duplication can increase the number of
domains,
ie, 'gene elongation'.
This can enhance protein complexity / sophistication, for instance
by providing more binding sites.
Take ovomucoid for example. This is an inhibitor of trypsin in avian egg whites, and is one
polypeptide with three domains. Each of the three domains binds one trypsin and is coded
by two exons. The overall DNA sequence for each domain is similar and aligns with the
others, but domains I and II are more similar to each other than either is to
domain III. Thus, the modern ovomucoid gene seems to have derived from
an ancestral gene for a primitive single domain protein by two internal duplications of the
two exons (with III produced first, and then I and II).
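The reasoning used here, that the most similar pair of domains arose from the most recent duplication, can be sketched as a pairwise comparison. The sequences below are short made-up stand-ins, not real domain sequences, and the inference assumes that all domains have accumulated changes at roughly the same rate.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def most_recent_duplication(domains):
    """Return the most similar pair of domains and all pairwise identities."""
    identities = {
        (a, b): percent_identity(domains[a], domains[b])
        for a, b in combinations(sorted(domains), 2)
    }
    return max(identities, key=identities.get), identities


# Made-up aligned DNA segments standing in for the three domains.
domains = {
    "I": "ACGTACGTAC",
    "II": "ACGTACGTAT",
    "III": "ACGAACTTGT",
}

pair, identities = most_recent_duplication(domains)
print(pair)        # the pair inferred to come from the latest duplication
print(identities)  # all pairwise percentage identities
```

With these toy sequences, domains I and II form the most similar pair, which matches the order of events proposed in the text (III diverging first, then I and II).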
The immunoglobulins are another example. Each immunoglobulin contains twelve domains,
with four in each of the two H chains and two in each of the two L chains. These domains
all have a very similar 3D structure with a seven-stranded anti-parallel beta-barrel.
Thus, it seems almost certain that immunoglobulins
arose via internal duplications of this domain.
Exon shuffling is where one or more exons are copied and inserted elsewhere. This probably
occurs via recombination at introns, and this could be a possible evolutionary role for
introns, ie, to facilitate the production of altered or new genes with novel arrangements
of exons. Since exons often approximately correspond to functional protein domains, exon
shuffling might provide a means of, as it were, 'reusing' domains in another evolving
protein where that or a similar function is needed, ie, it avoids 'having to
reinvent the wheel'. Many instances of domain duplications from exon shuffling are now
recognised.
For example, lactate dehydrogenase, alcohol dehydrogenase, phosphoglycerate kinase,
pyruvate kinase, phosphorylase, flavodoxin, and dozens of other enzymes have all been
found to contain a particular domain of doubly wound parallel beta-sheet of six strands
connected by alpha-helices, which is known as the 'Rossmann fold'. For
another example, the glycolytic enzyme pyruvate kinase has three domains,
one of which superposes on the whole of the fellow glycolytic enzyme triose phosphate
isomerase, one of which superposes on the immunoglobulin domain, and one of which
superposes on the Rossmann fold.
It has even
been suggested that large complex modern proteins may be derived from
duplications
(eg, via exon shuffling) of a limited number of smaller, simpler primordial
proteins.
As few as eight
basic overall arrangements of secondary structure (ie,
alpha-helices and beta-sheets) have been
recognised, and even these can in principle
be derived from simple folding motifs of a polypeptide.
It could even perhaps be
possible to construct phylogenetic trees for proteins based on 3D arrangements.
The extra
copies of genes produced by gene duplications may retain their original
function
to increase
synthesis of the gene product (called 'invariant repeats' or 'dose
repetition'), or gain a new function through
sequence changes (called 'variant
repeats'), or become non-functional through being incapacitated by
sequence change (called 'pseudogenes'). For
example, tRNA, rRNA and histone genes often have a
number of invariant repeats. These
numerous identical genes act to increase production of these key
molecules. The
number of tRNA and rRNA genes tends to increase with genome size, eg,
fewer than 100 tRNA genes and 7 rRNA operons in E. coli, but ca. 1,300 tRNA and
ca. 300 rRNA genes in humans.
Note that all these genes are the same or very similar due to the tight
functional
constraint.
For another
example, thrombin cleaves fibrinogen as part of the blood clotting mechanism,
and trypsin
cleaves dietary proteins during digestion. These two enzymes are variant
repeats in that they have
similar sequence and 3D structure but slightly different functions. For
yet another example, lactalbumin
is a subunit of the enzyme for synthesis of lactose sugar and lysozyme
is a monomeric enzyme that
cleaves the bacterial cell wall polysaccharide. These
are variant repeats of similar sequence and 3D
structure, despite one being a subunit
of another enzyme and the other being an enzyme in its own right.
'Gene families' can be defined as groups of genes for proteins that all have over
50% amino acid sequence identity. The genes in a family are often in relatively close
proximity and have similar functions.
Such families likely arose by gene duplications. Examples are
the alpha-globin family and the beta-globin family.
'Gene superfamilies' can be defined as groups of genes for proteins where most of the
proteins have over 50% amino acid sequence identity, but some of the proteins have less
than 50% amino acid sequence identity.
The genes in a superfamily usually have greater diversity
of location and function than is typically the
case for gene families. Nevertheless,
superfamilies also likely arose by gene duplications, although some of
the duplication events may be quite ancient. An example is the globin superfamily, which
includes the alpha-globin family, the beta-globin family, and the single protein myoglobin.
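The 50% threshold in these two definitions can be expressed as a simple rule. The sketch below is only an illustration: the gene labels and pairwise percentages are made up, the similarity measure is assumed to be the percentage of identical aligned residues, and "most" is interpreted as "more than half of all pairs", which is an assumption rather than part of the definition.

```python
def classify_group(pairwise, threshold=50.0):
    """Label a set of genes from their pairwise percentage similarities.

    'family'       : every pair is above the threshold
    'superfamily'  : more than half of the pairs are above it (assumed reading of 'most')
    'unclassified' : anything else
    """
    values = list(pairwise.values())
    above = sum(value > threshold for value in values)
    if above == len(values):
        return "family"
    if above > len(values) / 2:
        return "superfamily"
    return "unclassified"


# Made-up pairwise percentages for hypothetical genes g1..g4.
close_relatives = {("g1", "g2"): 82.0, ("g1", "g3"): 64.0, ("g2", "g3"): 58.0}
with_distant_member = dict(close_relatives)
with_distant_member.update(
    {("g1", "g4"): 41.0, ("g2", "g4"): 38.0, ("g3", "g4"): 55.0}
)

print(classify_group(close_relatives))      # all pairs above 50%
print(classify_group(with_distant_member))  # some pairs below 50%
```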
Gene families can either be 'lowly repetitive' with just a few genes, like isozymes and
the colour photopigment proteins for example, or 'highly repetitive' with many genes,
like the tRNAs and rRNAs for example.
Isozymes are forms of an enzyme that catalyse the same reaction, but have different
kinetic and/or regulatory properties, and different locations and/or temporal
distributions. For example, lactate dehydrogenase
(LDH) is a tetramer. There are two kinds of LDH subunit
in vertebrates, H and M. This gives five
possible LDH isoforms, H4, H3M, H2M2,
HM3, and M4. These LDH isoforms have different kinetic
properties and corresponding
physiological roles, with H4 found in cardiac muscle and M4 in skeletal
muscle. Duplication of the LDH gene early in vertebrate evolution gave the H
and M subunits,
and thereby
increased the versatility of this enzyme.
There are three colour photopigment proteins (CPP) in humans,
apes and Old World monkeys.
These are red, green and blue, and this gives trichromatic vision. The red
and green CPP genes show 96% sequence identity to each other and are both
X-linked genes, whilst the blue CPP gene shows 43% sequence identity to the red and
green CPP genes, and is an autosomal gene.
New World monkeys have only one
X-linked red/green CPP gene, plus the autosomal blue CPP gene, thereby
giving dichromatic vision. Thus, the separate red and green CPP genes seem to have been
produced by a gene duplication
event after the divergence of Old World monkeys from New World monkeys.
However, female squirrel monkeys,
which are New World monkeys, can be heterozygous
at the red/green locus, thereby giving trichromatic vision,
ie, there are both
red and green CPP genes but as two different alleles at the same locus ('allozymes').
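Returning to the LDH example above, the count of five tetrameric isoforms from two kinds of subunit can be checked by enumerating the unordered combinations of four subunits. This is a small sketch; the function name `tetramer_isoforms` and the naming format are chosen here for illustration only.

```python
from collections import Counter
from itertools import combinations_with_replacement


def tetramer_isoforms(subunits=("H", "M"), size=4):
    """List every subunit composition of an oligomer, ignoring subunit order."""
    isoforms = []
    for combo in combinations_with_replacement(subunits, size):
        counts = Counter(combo)
        # Build names such as H3M or H2M2; a count of 1 is left implicit.
        name = "".join(
            f"{s}{counts[s] if counts[s] > 1 else ''}"
            for s in subunits
            if counts[s]
        )
        isoforms.append(name)
    return isoforms


print(tetramer_isoforms())
# ['H4', 'H3M', 'H2M2', 'HM3', 'M4']
```

With two subunit types and four positions, the number of compositions is 4 + 1 = 5, in agreement with the five isoforms listed in the text.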
The globin
superfamily comprises the alpha-globin family on chromosome 16 and the
beta-globin
family on
chromosome 11. Both of these families contain several genes that, in
various combinations between the two,
give the variants of the tetrameric red blood cell protein
haemoglobin. The globin superfamily also includes
myoglobin on chromosome 22, which
is a monomeric muscle protein from a single gene. Myoglobin
appears to have diverged from the haemoglobin globins 600-800 million years ago (ie, just
prior to or during the 'Vendian period'
as multicellular animals first evolved), and the alpha and beta globins appear
to have diverged from each other
450-500 million years ago (ie, during the Ordovician period, as vertebrates
first evolved). Agnatha, the jawless fish, have myoglobin
and a haemoglobin that has just one kind of globin, so the divergence
of this lineage must predate the alpha-beta globin divergence. Haemoglobin
variants are crucial during
embryonic and foetal development in vertebrates. These
variants are formed by expression of different
genes in the alpha and beta globin families
at different times. Each variant has biochemical properties that
match the physiological oxygen carriage requirements of the developmental stage in which
it is expressed, eg, foetal haemoglobin is mostly 2x alpha-globin + 2x gamma-globin
(the latter being a member of the beta-globin family), which has a higher oxygen affinity
than the 2x alpha-globin + 2x beta-globin form
of haemoglobin that predominates from about three weeks post-partum.
The globin
families include several
pseudogenes, ie, non-functional 'unprocessed' duplicate copies
of a functional gene. These have multiple
defects such as frameshifts, premature stop
codons and obliterated regulatory and splicing sites.
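Two of the defects named above, premature stop codons and frameshifts, can be illustrated by translating a short coding sequence with the standard genetic code. The sequences are made-up toy examples, not real globin sequences, and the helper names are chosen here for illustration.

```python
# Standard genetic code, built from the conventional TCAG ordering of bases.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def has_premature_stop(dna):
    """True if a stop codon occurs anywhere before the final codon."""
    return "*" in translate(dna)[:-1]


functional = "ATGGCTAAATGGTAA"   # made-up gene: Met-Ala-Lys-Trp-stop
nonsense = "ATGGCTAAATGATAA"     # TGG -> TGA creates an early stop codon
frameshift = "ATGCTAAATGGTAA"    # one base deleted after the start codon

print(translate(functional), has_premature_stop(functional))
print(translate(nonsense), has_premature_stop(nonsense))
print(translate(frameshift))     # every codon after the deletion is altered
```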
COPYRIGHT
Page created and maintained by Dr Andrew J White, Department of Biological Sciences, Staffordshire
University, College Road,
Stoke-on-Trent ST4 2DE, United Kingdom. Tel +44 1782 294613, email a.j.white@staffs.ac.uk