Evolution
A support site for modules SC0060-3 Evolution and SCM009-M Molecular Evolution

Page last updated 23/02/02


Gene phylogenetics
[for SC0060-3 and SCM009-M]

Proteins are the primary gene product and functional basis of life. Evolution can be studied via analysis of
protein as well as via DNA sequence analysis.
Amino acid sequences have a lower information density
than DNA sequences, since the
former do not show synonymous and non-coding substitutions. However,
amino acid
substitutions can show functional constraints more explicitly. For example, one can observe
amino acids that are crucial to protein function and/or structure as invariant
(ie, the same residue in the
same place in this protein in all lineages) or conservative
(ie, the residue in the same place in this protein
in all lineages can alter, but only to
an amino acid of similar size or chemistry).
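The distinction between invariant and conservative positions can be sketched in code. The following is a minimal illustration only: the grouping of amino acids by chemistry (`GROUPS`) is one assumed scheme among several possible ones, the function names are hypothetical, and the three short aligned sequences are made up for the example rather than taken from a real protein. It assumes an ungapped alignment of equal-length sequences.

```python
# Assumed grouping of amino acids (one-letter codes) by broad chemistry.
# Other groupings are equally defensible; this one is for illustration only.
GROUPS = [
    set("AVLIMC"),  # small or aliphatic, hydrophobic
    set("FWY"),     # aromatic
    set("STNQ"),    # polar, uncharged
    set("KRH"),     # basic
    set("DE"),      # acidic
    set("G"),       # glycine
    set("P"),       # proline
]


def classify_column(residues):
    """Classify one alignment column as invariant, conservative or variable."""
    distinct = set(residues)
    if len(distinct) == 1:
        return "invariant"      # same residue in every lineage
    if any(distinct <= group for group in GROUPS):
        return "conservative"   # residues differ but share one chemical group
    return "variable"


def classify_alignment(sequences):
    """Classify every column of an ungapped alignment of equal-length sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [classify_column(column) for column in zip(*sequences)]


# Made-up aligned sequences from three hypothetical lineages.
aligned = ["MKDLS", "MRELT", "MKDIA"]
for position, label in enumerate(classify_alignment(aligned), start=1):
    print(position, label)
```

Here the first column is invariant, the next three are conservative, and the last is variable under the assumed grouping.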

The 3D structure of proteins (as determined by X-ray crystallography) can also be used to study evolution. This is done by comparison of two 3D structures via superposition, where one structure is translated and rotated relative to the other until the sum of squared distances between equivalent atoms is minimised, and then the remaining difference is quantified as the root mean square (rms) of the distances between equivalent atoms. Where the sequences align to statistical significance, the 3D structures can almost always be expected to superpose to statistical significance as well. This is a useful fact, since if the 3D structure of a protein is unknown, but its sequence aligns significantly to the sequence of a protein of known 3D structure, then the unknown 3D structure can be inferred to be similar to that of the known.
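The superposition procedure can be illustrated with a deliberately simplified sketch. To keep the arithmetic short it works in two dimensions, where the best rotation angle has a closed form; real protein superposition is done in 3D with methods such as the Kabsch algorithm. The function names and the coordinates are assumptions made for this example, and the "atoms" are arbitrary points rather than real structural data.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def superpose_2d(mobile, target):
    """Translate and rotate `mobile` onto `target` (equivalent atoms in the
    same order) so that the sum of squared distances is minimised.
    Returns the fitted coordinates and the rms distance."""
    if len(mobile) != len(target) or not mobile:
        raise ValueError("need two non-empty point lists of equal length")
    mcx, mcy = centroid(mobile)
    tcx, tcy = centroid(target)
    # Translation: move both centroids to the origin.
    m = [(x - mcx, y - mcy) for x, y in mobile]
    t = [(x - tcx, y - tcy) for x, y in target]
    # Rotation: in 2D the least-squares angle has a closed form.
    dot = sum(mx * tx + my * ty for (mx, my), (tx, ty) in zip(m, t))
    cross = sum(mx * ty - my * tx for (mx, my), (tx, ty) in zip(m, t))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    fitted = [(c * mx - s * my + tcx, s * mx + c * my + tcy) for mx, my in m]
    squared = sum((fx - tx) ** 2 + (fy - ty) ** 2
                  for (fx, fy), (tx, ty) in zip(fitted, target))
    return fitted, math.sqrt(squared / len(target))


# Arbitrary example: `mobile` is `target` rotated by 40 degrees and shifted,
# so after superposition the rms distance should be close to zero.
target = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
angle = math.radians(40)
mobile = [(x * math.cos(angle) - y * math.sin(angle) + 3.0,
           x * math.sin(angle) + y * math.cos(angle) - 2.0) for x, y in target]
fitted, rmsd = superpose_2d(mobile, target)
print(f"rms distance after superposition: {rmsd:.6f}")
```

Two structures that differ in shape, rather than only in position and orientation, would leave a non-zero rms distance after fitting, and it is this residual that is used as the measure of structural difference.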

Significant superposition of protein 3D structures is often possible where significant sequence alignment
can no longer be detected, ie, a protein's 3D structure changes
less over time than its sequence. This is
not surprising given that the 3D structure generates
the function. Thus, comparisons using 3D structure
should be of greater robustness
and possible over a longer time period than those using sequence.

To a first approximation, a single gene is transcribed and translated into a single protein, and so any consideration of proteins amounts to a consideration of the genes themselves.

Genes seem to have increased in complexity and number through DNA duplication events, ie, where
sequence segments are copied and the copy then inserted elsewhere.
DNA duplications can be within
a gene ("internal duplications"), parts of genes copied to
elsewhere ("exon shuffling"), complete genes
copied ("gene duplication"), several genes,
parts of chromosomes, whole chromosomes (the latter
giving "aneuploidy"), or even the whole
genome (giving "polyploidy").

Internal duplications are often of one or more exons. Exons often approximately correspond to functional
protein domains, so that internal exon duplication can increase the number of
domains, ie, 'gene elongation'.
This can enhance protein complexity / sophistication, for
instance by providing more binding sites.

Take ovomucoid for example. This is an inhibitor of trypsin in avian egg whites, and is one polypeptide with three domains. Each of the three domains binds one trypsin molecule and is coded by two exons. The overall DNA sequence for each domain is similar and aligns with the others, but domains I and II are more similar to each other than either is to domain III. Thus, the modern ovomucoid gene seems to have derived from an ancestral gene for a primitive single domain protein by two internal duplications of the two exons (with III produced first, and then I and II).
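The reasoning in this example, that the most similar pair of domains comes from the most recent duplication, can be sketched as a pairwise comparison. This is an illustration only: the three short sequences are invented placeholders and not real domain sequences, the function names are hypothetical, and the inference assumes that all domains have accumulated changes at roughly the same rate.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of aligned positions (ignoring gaps, '-') that are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def most_similar_pair(domains):
    """Return the pair of domain names with the highest percent identity."""
    return max(
        combinations(sorted(domains), 2),
        key=lambda pair: percent_identity(domains[pair[0]], domains[pair[1]]),
    )


# Invented placeholder sequences for three aligned domains.
domains = {
    "I": "ACGTACGTAC",
    "II": "ACGTACGTAT",
    "III": "ACGAACCTGT",
}

for first, second in combinations(sorted(domains), 2):
    identity = percent_identity(domains[first], domains[second])
    print(f"{first} vs {second}: {identity:.0f}% identical")
print("most recent duplication (inferred):", most_similar_pair(domains))
```

With these placeholder sequences, I and II are the closest pair, which corresponds to a duplication order in which the split giving III came first and the split giving I and II came later.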
Also take the immunoglobulins for example. Each immunoglobulin contains twelve domains, with four in
each of the two H chains and two in each of the two L chains. These domains
all have a very similar 3D
structure with a seven stranded anti-parallel beta-barrel. Thus, it
seems almost certain that immunoglobulins
arose via internal duplications of this domain.

Exon shuffling is where one or more exons are copied and inserted elsewhere. This probably occurs via recombination at introns, and this could be a possible evolutionary role for introns, ie, to facilitate the production of altered or new genes with novel arrangements of exons. Since exons often approximately correspond to functional protein domains, exon shuffling might provide a means of, as it were, 'reusing' domains in another evolving protein where that or a similar function is needed, ie, it avoids 'having to reinvent the wheel'. Many instances of domain duplications from exon shuffling are now recognised.

For example, lactate dehydrogenase, alcohol dehydrogenase, phosphoglycerate kinase, pyruvate kinase, phosphorylase, flavodoxin, and dozens of other enzymes have all been found to contain a particular domain of doubly wound parallel beta-sheet of six strands connected by alpha-helices, which is known as the 'Rossmann fold'.
For another example, the glycolytic enzyme pyruvate kinase has three domains, one of which superposes on the whole of the fellow glycolytic enzyme triose phosphate isomerase, one of which superposes on the immunoglobulin domain, and one of which superposes on the Rossmann fold.

It has even been suggested that large complex modern proteins may be derived from duplications
(eg, via exon shuffling) of a limited number of smaller, simpler primordial
proteins. As few as eight
basic overall arrangements of secondary structure
(ie, alpha-helices and beta-sheets) have been
recognised, and even these can in
principle be derived from simple folding motifs of a polypeptide.
It could even perhaps be possible to construct phylogenetic trees for proteins based on 3D arrangements.

The extra copies of genes produced by gene duplications may retain their original function to increase
synthesis of the gene product (called 'invariant repeats' or
'dose repetition'), or gain a new function through
sequence changes (called
'variant repeats'), or become non-functional through being incapacitated by
sequence change (called 'pseudogenes').
For example, tRNA, rRNA and histone genes often have a
number of invariant repeats.
These numerous identical genes act to increase production of these key
molecules.
The number of tRNA and rRNA genes tends to increase with genome size, eg, ca. 300 tRNA and 7 rRNA genes in E. coli, but ca. 1,300 tRNA and ca. 300 rRNA genes in humans.
Note that all these genes are the same or very similar due to the tight functional constraint.
For another example, thrombin cleaves fibrinogen as part of the blood clotting mechanism, and trypsin cleaves dietary proteins during digestion. These two enzymes are variant repeats in that they have similar sequence and 3D structure but slightly different functions.
For yet another example, alpha-lactalbumin is a subunit of the enzyme for synthesis of lactose sugar and lysozyme is a monomeric enzyme that cleaves the bacterial cell wall polysaccharide. These are variant repeats of similar sequence and 3D structure, despite one being a subunit of another enzyme and the other being an enzyme in its own right.

'Gene families' can be defined as groups of genes for proteins that all have over 50% amino acid sequence identity. The genes in a family are often in relatively close proximity and have similar function. Such families likely arose by gene duplications. Examples are the alpha-globin family and the beta-globin family.
'Gene superfamilies' can be defined as groups of genes for proteins where most of the proteins have over 50% amino acid sequence identity, but some of the proteins have less than 50%. The genes in a superfamily usually have greater diversity of location and function than is typically the case for gene families. Nevertheless, superfamilies also likely arose by gene duplications, although some of the duplication events may be quite ancient. An example is the globin superfamily, which includes the alpha-globin family, the beta-globin family, and the single protein myoglobin.
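The 50% criterion for a gene family can be written as a simple check over every pair of proteins in a group, using percentage identity between aligned sequences as the measure of similarity. This is a sketch under stated assumptions: the protein names (`a1`, `a2`, `b1`, `b2`) and the identity values are invented for illustration and are not measurements, and the function name is hypothetical.

```python
from itertools import combinations


def is_family(members, identity, threshold=50.0):
    """True if every pair of members exceeds the identity threshold.

    `identity` maps frozenset({name1, name2}) to percent sequence identity.
    """
    return all(
        identity[frozenset(pair)] > threshold
        for pair in combinations(members, 2)
    )


# Invented percent identities between four hypothetical proteins.
identity = {
    frozenset({"a1", "a2"}): 85.0,
    frozenset({"b1", "b2"}): 75.0,
    frozenset({"a1", "b1"}): 44.0,
    frozenset({"a1", "b2"}): 42.0,
    frozenset({"a2", "b1"}): 43.0,
    frozenset({"a2", "b2"}): 41.0,
}

print(is_family(["a1", "a2"], identity))              # within one group
print(is_family(["b1", "b2"], identity))              # within the other group
print(is_family(["a1", "a2", "b1", "b2"], identity))  # both groups together
```

In this made-up case each pair of close relatives passes the family test, while the combined group fails it because the cross-group identities fall below 50%. Such a combined group is the kind of grouping that would be treated as a superfamily.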

Gene families can either be 'lowly repetitive' with just a few genes, like isozymes and the colour pigment proteins for example, or 'highly repetitive' with many genes, like the tRNAs and rRNAs for example.

Isozymes are variant forms of an enzyme that catalyse the same reaction, but with different kinetic and/or regulatory properties, and with different locations and/or temporal distributions. For example, lactate dehydrogenase (LDH) is a tetramer. There are two kinds of LDH subunit in vertebrates, H and M. This gives five possible LDH isoforms: H4, H3M, H2M2, HM3, and M4. These LDH isoforms have different kinetic properties and corresponding physiological roles, with H4 found in cardiac muscle and M4 in skeletal muscle. Duplication of the LDH gene early in vertebrate evolution gave the H and M subunits, and thereby increased the versatility of this enzyme.
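The count of five LDH isoforms follows from choosing four subunits from two kinds where order does not matter. A short sketch of that enumeration is given below; the function name is hypothetical and the code makes no claim about which isoforms actually occur in any given tissue.

```python
from collections import Counter
from itertools import combinations_with_replacement


def tetramer_isoforms(subunits=("H", "M"), size=4):
    """List the distinct subunit compositions of an oligomer of `size`
    subunits built from the given subunit kinds (order is ignored)."""
    isoforms = []
    for combo in combinations_with_replacement(subunits, size):
        counts = Counter(combo)
        # Write e.g. H3M rather than H3M1, matching the usual notation.
        name = "".join(
            f"{kind}{counts[kind] if counts[kind] > 1 else ''}"
            for kind in subunits
            if counts[kind]
        )
        isoforms.append(name)
    return isoforms


print(tetramer_isoforms())  # ['H4', 'H3M', 'H2M2', 'HM3', 'M4']
```

With only one kind of subunit there would be a single possible tetramer, so the duplication that produced a second subunit gene is what makes the other four combinations possible.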
There are three colour photopigment proteins (CPP) in humans, apes and Old World monkeys. These are red, green and blue, and this gives trichromatic vision. The red and green CPP genes show 96% sequence identity to each other and are both X-linked genes, whilst the blue CPP gene shows 43% sequence identity to the red and green CPP genes, and is an autosomal gene. New World monkeys have only one X-linked red/green CPP gene, plus the autosomal blue CPP gene, thereby giving dichromatic vision. Thus, the red and green CPP genes seem to have been produced by a gene duplication event after the divergence of Old World monkeys from New World monkeys. However, female squirrel monkeys, which are New World monkeys, can be heterozygous at the red/green locus, thereby giving trichromatic vision, ie, there are both red and green CPP genes but as two different alleles at the same locus ('allozymes').

The globin superfamily comprises the alpha-globin family on chromosome 16 and the beta-globin family on chromosome 11. Both of these families contain several genes that, in various combinations between the two, give the variants of the tetrameric red blood cell protein haemoglobin. The globin superfamily also includes myoglobin on chromosome 22, which is a monomeric muscle protein from a single gene. Myoglobin appears to have diverged from the haemoglobin globins 600-800 million years ago (ie, just prior to or during the 'Vendian period' as multicellular animals first evolved), and the alpha and beta globins to have diverged from each other 450-500 million years ago (ie, during the Ordovician period, as vertebrates first evolved). Agnatha, the jawless fish, have myoglobin and a haemoglobin that has just one kind of globin, and so the divergence of these species must predate the alpha-beta globin divergence.
Haemoglobin variants are crucial during
embryonic and foetal development in vertebrates.
These variants are formed by expression of different
genes in the alpha and beta globin
families at different times. Each variant has biochemical properties that
match the
physiological oxygen carriage requirements during the developmental stage during which it is
expressed, eg, foetal haemoglobin is mostly 2x alpha-globin + 2x gamma-globin (the latter being a member
of the beta-globin family), which has a higher oxygen affinity than the 2x alpha-globin + 2x beta-globin form
of haemoglobin that predominates from about three weeks post-partum.
The globin families include several
pseudogenes, ie, non-functional 'unprocessed' duplicate
copies of a functional gene. These have multiple
defects such as frameshifts, premature
stop codons and obliterated regulatory and splicing sites.
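Two of the defects listed for pseudogenes, frameshifts and premature stop codons, can be looked for with a very crude scan of a coding sequence. The sketch below is an assumption-laden illustration: the function name is hypothetical, the short sequences are made up, the standard stop codons TAA, TAG and TGA are assumed, and a length that is not a multiple of three is treated only as a hint of a frameshift. It does not attempt to detect damaged regulatory or splicing sites.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def coding_defects(cds):
    """Return a list of simple defects found in a coding DNA sequence
    that is expected to run from the start codon to the final stop codon."""
    cds = cds.upper()
    defects = []
    if len(cds) % 3 != 0:
        defects.append("length not a multiple of three (possible frameshift)")
    # Split into complete codons, ignoring any leftover bases at the end.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Any stop codon before the last codon would truncate the protein.
    for index, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            defects.append(f"premature stop codon at codon {index}")
            break
    return defects


# Made-up sequences: an intact reading frame and two damaged copies.
intact = "ATGGCTGAAAAATAA"
with_stop = "ATGGCTTAAAAATAA"    # third codon changed to TAA
with_deletion = "ATGGCGAAAAATAA"  # one base deleted from the second codon

for label, sequence in [("intact", intact),
                        ("with_stop", with_stop),
                        ("with_deletion", with_deletion)]:
    print(label, coding_defects(sequence))
```

A real analysis would compare the suspected pseudogene against its functional relative in an alignment, since a frameshift can be followed by a compensating change that restores a length divisible by three.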


DISCLAIMER
The content, learning and assessment of these modules, as detailed herein, may be subject to alteration without notice, should circumstances necessitate.
COPYRIGHT
Page created and maintained by Dr Andrew J White, Department of Biological Sciences, Staffordshire University, College Road,
Stoke-on-Trent ST4 2DE, United Kingdom. Tel +44 1782 294613, email
a.j.white@staffs.ac.uk