14.6 Genomics Determines and Analyzes the DNA Sequences of Entire Genomes
14.21 Physical maps are often used to order cloned DNA
fragments. A part of a physical map of a set of overlapping YAC
(yeast artificial chromosome) clones from one end of the human Y
number of base pairs, kilobases, or megabases (Figure
14.21). A common type of physical map is one that connects
isolated pieces of genomic DNA that have been cloned in
bacteria or yeast. Physical maps generally have higher resolution and are more accurate than genetic maps. A physical
map is analogous to a neighborhood map that shows the
location of every house along a street, whereas a genetic map
is analogous to a highway map that shows the locations of
major towns and cities.
One of the techniques for creating physical maps is
restriction mapping, which determines the position of
restriction sites on DNA. When a piece of DNA is cut with a
restriction enzyme and the fragments are separated by gel
electrophoresis, the number of restriction sites in the DNA
and the distances between them can be determined by the
number and positions of bands on the gel, but this information does not tell us the order or the precise location of the
restriction sites. To map restriction sites, a sample of the
DNA is cut with one restriction enzyme, and another sample
is cut with a different restriction enzyme. A third sample is
cut with both restriction enzymes together (a double digest).
The DNA fragments produced by these restriction digests are
then separated by gel electrophoresis, and their sizes are
compared. Overlap in size of fragments produced by the
digests can be used to position the restriction sites on the
original DNA molecule.
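The comparison of single and double digests can be sketched in Python. This is a minimal illustration, not a mapping tool: the 10,000-bp linear molecule, the site positions for "enzyme A" and "enzyme B", and the helper name `digest` are all invented for the example.

```python
def digest(length, sites):
    """Return the sorted fragment sizes produced by cutting a linear
    molecule of the given length at every position in `sites`."""
    cuts = [0] + sorted(set(sites)) + [length]
    return sorted(right - left for left, right in zip(cuts, cuts[1:]))


# Hypothetical linear DNA molecule and restriction-site positions (bp).
LENGTH = 10_000
sites_a = [2_000]          # assumed sites for enzyme A
sites_b = [5_000, 9_000]   # assumed sites for enzyme B

single_a = digest(LENGTH, sites_a)            # enzyme A alone
single_b = digest(LENGTH, sites_b)            # enzyme B alone
double = digest(LENGTH, sites_a + sites_b)    # both enzymes together

print("A alone:", single_a)
print("B alone:", single_b)
print("A + B:  ", double)
```

In this made-up case, the 8,000-bp fragment from enzyme A is absent from the double digest and is replaced by fragments of 1,000, 3,000, and 4,000 bp, so both enzyme B sites must lie inside it. The 2,000-bp fragment survives the double digest, so it contains no enzyme B site.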
Both genetic and physical maps provide information about the relative positions and distances between genes, molecular markers,
and chromosome segments. Genetic maps are based on rates of
recombination and are measured in percent recombination, or
centimorgans. Physical maps are based on physical distances and
are measured in base pairs.
Sequencing an Entire Genome
The ultimate goal of structural genomics is to determine the
ordered nucleotide sequences of entire genomes of organisms. Earlier in this chapter, we considered the use of the
dideoxy sequencing method to sequence small fragments of
DNA. The main obstacle to sequencing a whole genome is
the immense size of most genomes. Bacterial genomes are
usually at least several million base pairs long; many eukaryotic genomes are billions of base pairs long and are distributed among dozens of chromosomes. Furthermore, for
technical reasons, sequencing cannot begin at one end of a
chromosome and continue straight through to the other
end; only small fragments of DNA—usually from 500 to 700
nucleotides—can be sequenced at one time. Therefore,
determining the sequence for an entire genome requires that
the DNA be broken into thousands or millions of smaller
fragments that can then be sequenced. The difficulty lies in
putting these short sequences back together in the correct
order. Two different approaches have been used to assemble
the short sequenced fragments into a complete genome:
map-based sequencing and whole-genome shotgun
sequencing. We will consider these two approaches in the
context of the Human Genome Project.
The Human Genome Project
By 1980, methods for mapping and sequencing DNA fragments had been sufficiently developed that geneticists began
seriously proposing that the entire human genome could be
sequenced. An international collaboration was planned to
undertake the Human Genome Project (Figure 14.22); initial estimates suggested that 15 years and $3 billion would be
required to accomplish the task.
Molecular Genetic Analysis, Biotechnology, and Genomics
The Human Genome Project officially got underway in
October 1990. Initial efforts focused on developing new and
automated methods for cloning and sequencing DNA and
on generating detailed physical and genetic maps of the
human genome. The methods described earlier for mapping,
sequencing, and assembling DNA fragments were pivotal in
these early stages of the project. By 1993, large-scale physical
maps were completed for all 23 pairs of human chromosomes. At the same time, automated sequencing techniques
(Figure 14.23) had been developed that made large-scale
The initial effort to sequence the genome was a public
project consisting of the international collaboration of 20
research groups and hundreds of individual researchers who
formed the International Human Genome Sequencing
Consortium. This group used a map-based strategy for
sequencing the human genome.
14.22 A rough draft of the complete sequence of the
human genome was completed in 2000. Craig Venter (left),
President of Celera Genomics, and Francis Collins (right), Director of
the National Human Genome Research Institute, NIH, announced this
landmark achievement at a press conference in Washington on June
26, 2000. [Alex Wong/Newsmakers/Getty Images.]
Map-based sequencing In map-based sequencing, short
sequenced fragments are assembled into a whole-genome
sequence by first creating detailed genetic and physical maps
of the genome, which provide known locations of genetic
markers (restriction sites, other genes, or known DNA
sequences) at regularly spaced intervals along each chromosome. These markers are later used to help align the short
sequenced fragments into their correct order.
After the genetic and physical maps are available, chromosomes or large pieces of chromosomes are separated by
pulsed-field gel electrophoresis (PFGE) or by flow cytometry
in which chromosomes are sorted optically by size. Each
chromosome (or sometimes the entire genome) is then cut
up by partial digestion with restriction enzymes (Figure
14.24). Partial digestion means that the restriction enzymes
are allowed to act for only a limited time so that not all
restriction sites in every DNA molecule are cut. Thus, partial
digestion produces a set of large overlapping DNA fragments,
which are then cloned with the use of cosmids, yeast artificial
chromosomes (YACs), or bacterial artificial chromosomes
(BACs).
Next, these large-insert clones are put together in their
correct order on the chromosome (see Figure 14.24). This
assembly can be done in several ways. One method relies on
the presence of a high-density map of genetic markers. A
complementary DNA probe is made for each genetic marker,
and a library of the large-insert clones is screened with the
probe, which will hybridize to any colony containing a clone
with the marker. The library is then screened for neighboring markers. Because the clones are much larger than the
markers used as probes, some clones will have more than one
marker. For example, clone A might have markers M1 and
M2, clone B markers M2, M3, and M4, and clone C markers
M4 and M5. Such a result would indicate that these clones
contain areas of overlap, as shown here:
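The overlap among clones A, B, and C in this example can be worked out with a short Python sketch. The clone and marker names come from the text; the helper names `shared_markers` and `order_clones` are hypothetical, and the ordering step assumes that the markers are numbered in their map order, which a real genetic map would have to supply.

```python
from itertools import combinations

# Marker content of each large-insert clone, as in the example above.
clones = {
    "A": ["M1", "M2"],
    "B": ["M2", "M3", "M4"],
    "C": ["M4", "M5"],
}


def shared_markers(clones):
    """Map each pair of clones to the markers they have in common."""
    overlaps = {}
    for (name1, markers1), (name2, markers2) in combinations(clones.items(), 2):
        common = set(markers1) & set(markers2)
        if common:
            overlaps[(name1, name2)] = common
    return overlaps


def order_clones(clones):
    """Order clones by their lowest-numbered marker.

    Assumes marker names (M1, M2, ...) reflect their order on the map."""
    return sorted(clones, key=lambda name: min(int(m[1:]) for m in clones[name]))


overlaps = shared_markers(clones)
contig_order = order_clones(clones)
print(overlaps)       # which clone pairs share a marker
print(contig_order)   # left-to-right order of clones in the contig
```

Clones A and B share marker M2, and clones B and C share marker M4, so the three clones tile in the order A, B, C.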
14.23 Automated sequencers and powerful computers
allowed the human genome sequence to be completed in
just 13 years. [Sam Ogden/Photo Researchers.]
A set of two or more overlapping DNA fragments that
form a contiguous stretch of DNA is called a contig. This
approach was used in 1993 to create a contig consisting of
196 overlapping YAC clones (see Figure 14.21) of the human
Y chromosome.
14.24 Map-based approaches to whole-genome sequencing rely on detailed genetic and
physical maps to align sequenced fragments.
1 Partial digestion of DNA results in overlapping fragments that are then cloned in bacteria.
2 These large-insert clones are analyzed for markers or overlapping
3 …which allows the large-insert clones to be assembled into a contig, a continuous stretch of DNA.
4 A subset of overlapping clones that cover the entire chromosome are selected and fractured. These pieces are then cloned.
5 Each of these small-insert clones is sequenced, and overlap in sequences is used to assemble them in the correct order.
6 The final sequence is assembled by putting together the sequences of the large clones and filling in any gaps.
The order of clones can also be determined without the
use of preexisting genetic maps. For example, each clone can
be cut with a series of restriction enzymes and the resulting
fragments then separated by gel electrophoresis. This
method generates a unique set of restriction fragments,
called a fingerprint, for each clone. The restriction patterns
for the clones are stored in a database. A computer program
is then used to examine the restriction patterns of all the
clones and look for areas of overlap. The overlap is then used
to arrange the clones in order, as shown here:
Other genetic markers can be used to help position contigs
along the chromosome.
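The fingerprint comparison described above can be sketched as follows. The fragment sizes are invented, the function names are hypothetical, and fragments are matched by exact size for simplicity; a real program would have to allow for measurement error in fragment sizes read from a gel.

```python
# Hypothetical restriction fingerprints: fragment sizes (bp) for each clone.
fingerprints = {
    "clone1": {1200, 3400, 5100, 800},
    "clone2": {5100, 800, 2600, 4300},
    "clone3": {4300, 2600, 900, 7000},
}


def shared_fraction(fp1, fp2):
    """Fraction of the smaller fingerprint's fragments found in the other."""
    return len(fp1 & fp2) / min(len(fp1), len(fp2))


def overlapping_pairs(fingerprints, threshold=0.5):
    """Return clone pairs whose fingerprints share at least `threshold`
    of their fragments (the threshold value is an assumption)."""
    names = sorted(fingerprints)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if shared_fraction(fingerprints[first], fingerprints[second]) >= threshold:
                pairs.append((first, second))
    return pairs


pairs = overlapping_pairs(fingerprints)
print(pairs)
```

Here clone1 and clone2 share two fragments, as do clone2 and clone3, whereas clone1 and clone3 share none, which suggests the order clone1, clone2, clone3.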
When the large-insert clones have been assembled into
the correct order on the chromosome, a subset of overlapping clones that efficiently cover the entire chromosome can
be chosen for sequencing. Each of the selected large-insert
clones is fractured into smaller overlapping fragments,
which are themselves cloned (see Figure 14.24). These
smaller clones (called small-insert clones) are then
sequenced. The sequences of the small-insert clones are
examined for overlap, which allows them to be correctly
assembled to give the sequence of the larger insert clones.
Enough overlapping small-insert clones are usually
sequenced to ensure that the entire genome is sequenced several times. Finally, the whole genome is assembled by putting
together the sequences of all overlapping contigs (see Figure
14.24). Often, gaps in the genome map still exist and must be
filled in by using other methods.
The Human Genome Sequencing Consortium used a
map-based approach to sequencing the human genome;
many copies of the human genome were cut up into fragments of about 150,000 bp each, which were inserted into
bacterial artificial chromosomes. Restriction fingerprints
were used to assemble the BAC clones into contigs, which
were positioned on the chromosomes with the use of genetic
markers and probes. The individual BAC clones were
sheared into smaller overlapping fragments and sequenced,
and the whole genome was assembled by putting together
the sequence of the BAC clones.
In 1998, Craig Venter announced that he would lead a
company called Celera Genomics in a private effort to
sequence the human genome. He proposed using a shotgun
sequencing approach, which he suggested would be quicker
than the map-based approach employed by the Human
Genome Sequencing Consortium.
✔ Concept Check 7
A contig is
a. a set of molecular markers used in genetic mapping.
b. a set of overlapping fragments that form a continuous stretch
of DNA.
c. a set of fragments generated by a restriction enzyme.
d. a small DNA fragment used in sequencing.
14.25 Whole-genome shotgun sequencing utilizes
sequence overlap to align sequenced fragments.
1 Genomic DNA is cut into numerous small fragments and cloned in bacteria.
2 Each fragment
TTACC AC GGGGA
3 Overlap in sequence is used to order the
TTACC AC GGGGA
GGGGA CGA TCCT
TCCT GCG AGAC
AGAC GTG TCAA
4 … and the entire genomic sequence is assembled by powerful computer programs.
TTACC ACGGGGACGA TCCT GCG AGAC GTG TCAA
Whole-genome shotgun sequencing In whole-genome
shotgun sequencing (Figure 14.25), small-insert clones are
prepared directly from genomic DNA and sequenced. Powerful computer programs then assemble the entire genome
by examining overlap among the small-insert clones. One
advantage of shotgun sequencing is that the small-insert
clones can be placed into plasmids, which are simple and
easy to manipulate. The requirement for overlap means that
most of the genome will be sequenced multiple (often from
10 to 15) times. Shotgun sequencing can be carried out in a
highly automated way, with few decisions to be made by the
researcher, because the computer assembles the final draft of
Shotgun sequencing was initially used for assembling
small genomes such as those of bacteria. When Venter proposed the use of this approach for sequencing the human
genome, it was not at all clear that the approach could successfully assemble a complex genome consisting of billions
of base pairs such as the human genome.
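The idea of assembling by sequence overlap can be illustrated with a toy greedy assembler in Python. This is only a sketch under simplifying assumptions: error-free reads, all from the same strand, and no repeated sequences. It is not the method used by any actual genome project. The four reads are the fragments shown in Figure 14.25 with the spaces removed; the function names and the minimum overlap of 3 bases are assumptions.

```python
def overlap_len(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`
    (0 if shorter than `min_overlap`)."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(reads):
            for j, right in enumerate(reads):
                if i != j:
                    size = overlap_len(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left; the remaining reads stay separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Fragments from Figure 14.25, deliberately given out of order.
reads = ["TCCTGCGAGAC", "TTACCACGGGGA", "AGACGTGTCAA", "GGGGACGATCCT"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these four reads, the sketch recovers the single assembled sequence shown at the bottom of the figure. Real genomes contain repeats and sequencing errors, which is why assembling a genome of billions of base pairs was far from guaranteed to work.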
For several years, the public effort by the Human
Genome Sequencing Consortium, using a map-based
approach, and the private Celera effort, using shotgun
sequencing, moved forward simultaneously. In the summer
of 2000, both public and private sequencing projects
announced the completion of a rough draft that included
most of the sequence of the human genome, 5 years ahead of
schedule. Analysis of this sequence was published 6 months
later. The human genome sequence was declared completed
in the spring of 2003, although some gaps still remain. For
most chromosomes, the finished sequence is 99.999% accurate, with less than one base-pair error per 100,000 bp, which
is 10 times as accurate as the initial goal.
The availability of the complete sequence of the human
genome is proving to be of enormous benefit. It has made it
easier to identify and isolate genes that contribute to many
human diseases and to create probes that can be used in
genetic testing, diagnosis, and drug development. The
sequence is also providing important information about
many basic cellular processes. Comparisons of the human
genome with those of other organisms are adding to our
understanding of evolution and the history of life.
The Human Genome Project was an effort to sequence the entire
human genome. The project began in 1990, and two competing teams,
an international consortium of publicly supported investigators
and a private company, each finished a rough draft of the genome
sequence in 2000. The entire sequence was completed in 2003.
Sequencing a genome requires breaking it up into small overlapping fragments whose DNA sequences can be determined in a
sequencing reaction. In map-based sequencing, sequenced fragments are ordered into the final genome sequence with the use of
genetic and physical maps. In whole-genome shotgun sequencing,
the genome is assembled by comparing overlap in the sequences
of small fragments.
✔ Concept Check 8
The Human Genome Sequencing Consortium used which approach
in sequencing the human genome?
a. Whole-genome shotgun sequencing
b. Map-based sequencing
c. A combination of whole-genome shotgun sequencing and
Since the completion of the sequencing of the human
genome, much of the effort of sequencers has focused on
mapping differences among people in their genomic
Imagine that you are riding the elevator with a random
stranger. How much of your genome do you have in common with this person? Studies of variation in the human
genome indicate that you and the stranger will be identical
at about 99.9% of your DNA sequences. This difference is
very small in relative terms but, because the human genome
is so large (3.2 billion base pairs), you and the stranger will
actually be different at more than 3 million base pairs of
your genomic DNA. These differences are what makes each
of us unique, and they greatly affect our physical features,
our health, and possibly even our intelligence and
A site in the genome where individual members of a
species differ in a single base pair is called a single-nucleotide polymorphism (SNP, pronounced “snip”).
Arising through mutation, SNPs are inherited as allelic
variants (just as are alleles that produce phenotypic differences, such as blood types), although SNPs do not usually
produce a phenotypic difference. Single-nucleotide polymorphisms are numerous and are present throughout
genomes. In a comparison of the same chromosome from
two different people, a SNP can be found approximately
every 1000 bp.
Most SNPs present within a population arose once
from a single mutation that occurred on a particular chromosome and subsequently spread through the population.
Thus, each SNP is initially associated with other SNPs (as
well as other types of genetic variants or alleles) that were
present on the particular chromosome on which the mutation arose. The specific set of SNPs and other genetic variants observed on a single chromosome or part of a
chromosome is called a haplotype (Figure 14.26). SNPs
within a haplotype are physically linked and therefore tend
to be inherited together. New haplotypes can arise through
mutation or crossing over, which breaks up the particular
set of SNPs in a haplotype.
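The relation between SNPs and haplotypes can be made concrete with a small Python sketch that uses the four chromosome copies (1a to 1d) from Figure 14.26. The three sequence segments of each copy are simply joined, which ignores the unshown DNA between them; the helper name `find_snp_columns` is hypothetical.

```python
# Sequence segments surrounding three SNPs on four copies of a chromosome,
# taken from Figure 14.26.
chromosomes = {
    "1a": ["AACACGCCA", "TTCGGGGTC", "AGTCGACCG"],
    "1b": ["AACACGCCA", "TTCGAGGTC", "AGTCAACCG"],
    "1c": ["AACATGCCA", "TTCGGGGTC", "AGTCAACCG"],
    "1d": ["AACACGCCA", "TTCGGGGTC", "AGTCAACCG"],
}


def find_snp_columns(sequences):
    """Return the positions at which the aligned sequences are not all
    identical, that is, the SNP positions."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


# Join the segments of each copy into one aligned string.
joined = {name: "".join(segments) for name, segments in chromosomes.items()}
snp_positions = find_snp_columns(joined.values())

# A haplotype is the set of alleles a chromosome carries at the SNP positions.
haplotypes = {
    name: "".join(sequence[i] for i in snp_positions)
    for name, sequence in joined.items()
}
print(snp_positions)
print(haplotypes)
```

The four copies differ at only three positions, and the alleles at those positions give the haplotypes C G G, C A A, T G A, and C G A listed in the figure.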
Because of their variability and widespread occurrence
throughout the genome, SNPs are valuable as markers in
linkage studies. When a SNP is physically close to a disease-causing locus, it will tend to be inherited along with the disease-causing allele. People with the disease will tend to have
different SNPs from those of healthy people. A comparison
of SNP haplotypes in people with a disease and in healthy
people can reveal the presence of genes that affect the disease; because the disease gene and the SNP are closely linked,
the location of the disease-causing gene can be determined
from the location of associated SNPs. This approach is the
same as that used in gene mapping with RFLPs, but there are
many more SNPs than RFLPs, providing a dense set of
variable markers covering the entire genome that can be used
more effectively in mapping.
14.26 A haplotype is a specific set of single-nucleotide
polymorphisms (SNPs) and other genetic variants observed
on a single chromosome or part of a chromosome.
Chromosomes 1a, 1b, 1c, and 1d represent different copies of a
chromosome that might be found in a population.
The chromosomes are identical for most of the DNA sequences.
Variation in a single base constitutes each SNP.
1a AACACGCCA. . .TTCGGGGTC. . .AGTCGACCG. . .
1b AACACGCCA. . .TTCGAGGTC. . .AGTCAACCG. . .
1c AACATGCCA. . .TTCGGGGTC. . .AGTCAACCG. . .
1d AACACGCCA. . .TTCGGGGTC. . .AGTCAACCG. . .
Each haplotype is made up of a particular set of alleles at each SNP.
a C G G
b C A A
c T G A
d C G A
As expected, the availability of SNPs has greatly facilitated the search for genes that cause human diseases. In one
of the most successful applications of SNPs for finding disease associations, a consortium of 50 researchers genotyped
each of 17,000 people in the United Kingdom for 500,000
SNPs in 2007. They detected strong associations between 24
genes and chromosome segments and the incidence of seven
common diseases, including coronary artery disease, Crohn
disease, rheumatoid arthritis, bipolar disorder, hypertension,
and two types of diabetes. The importance of this study is its
demonstration that genomewide association studies utilizing SNPs can successfully locate genes that contribute to
complex diseases caused by multiple genetic and environmental factors.
Complete genome sequences have now been determined for
more than 1000 organisms, with many additional projects
underway. These studies are producing tremendous quantities of sequence data. Cataloging, storing, retrieving, and
analyzing this huge data set are major challenges of modern
genetics. Bioinformatics is an emerging field consisting of
molecular biology and computer science that centers on
developing databases, computer-search algorithms, gene-prediction software, and other analytical tools that are used
to make sense of DNA-, RNA-, and protein-sequence data.
Bioinformatics develops and applies these tools to “mine the
data,” extracting the useful information from sequencing
projects. The development and use of algorithms and
computer software for analyzing DNA-, RNA-, and protein-sequence data have helped to make molecular biology a
more quantitative field. Sequence data in publicly available
databases, freely searchable with an Internet connection,
enable scientists and students throughout the world to access
this tremendous resource.
Genomic projects are collecting databases of nucleotides that vary
among individual organisms (single-nucleotide polymorphisms,
SNPs). Bioinformatics is an interdisciplinary field that combines
molecular biology and computer science. It develops databases of
DNA, RNA, and protein sequences and tools for analyzing those
sequences.
14.7 Functional Genomics
Function of Genes by
A genomic sequence is, by itself, of limited use. Merely knowing the sequence would be like having a huge set of encyclopedias without being able to read: you could recognize the
different letters but the text would be meaningless.
Functional genomics characterizes what the sequences do:
their function. The goals of functional genomics include the
identification of all the RNA molecules transcribed from a
genome, called the transcriptome of that genome, and all
the proteins encoded by the genome, called the proteome.
Functional genomics exploits both bioinformatics and laboratory-based experimental approaches in its search to define
the function of DNA sequences.
Predicting Function from Sequence
The nucleotide sequence of a gene can be used to predict the
amino acid sequence of the protein that it encodes. The protein can then be synthesized or isolated and its properties
studied to determine its function. However, this biochemical
approach to understanding gene function is both time consuming and expensive. A major goal of functional genomics
has been to develop computational methods that allow gene
function to be identified from DNA sequence alone, bypassing the laborious process of isolating and characterizing
One computational method (often the first employed)
for determining gene function is to conduct a homology
search, which relies on comparisons of DNA and protein
sequences from the same organism and from different
organisms. Genes that are evolutionarily related are said to
be homologous. Databases containing genes and proteins
found in a wide array of organisms are available for
homology searches. Powerful computer programs, such as
BLAST, have been developed for scanning these databases
to look for particular sequences. Suppose a geneticist
sequences a genome and locates a gene that encodes a protein of unknown function. A homology search conducted
on databases containing the DNA or protein sequences of
other organisms may identify one or more homologous
sequences. If a function is known for a protein encoded by
one of these sequences, that function may provide information about the function of the newly discovered
The function of an unknown gene can sometimes be determined
by finding genes with similar sequence whose function is known.
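The principle behind a homology search can be sketched with a toy comparison in Python. This is not how BLAST works internally; it merely scores ungapped, equal-length sequences by percent identity and reports the best match. The query, the three database sequences, and their annotations are all invented for illustration.

```python
def percent_identity(seq1, seq2):
    """Percentage of positions at which two equal-length, ungapped
    sequences carry the same residue."""
    if len(seq1) != len(seq2):
        raise ValueError("this simple comparison needs equal-length sequences")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


# Hypothetical database: annotation -> protein fragment (made-up sequences).
database = {
    "kinase-like protein": "MKTLLVAGGS",
    "transport protein": "MSDNQWERTY",
    "unknown protein": "GHHHHHHHHG",
}

# Hypothetical protein of unknown function predicted from a new gene.
query = "MKTLIVAGGA"


def best_hit(query, database):
    """Return the (annotation, sequence) entry most similar to the query."""
    return max(database.items(), key=lambda item: percent_identity(query, item[1]))


annotation, sequence = best_hit(query, database)
print(annotation, percent_identity(query, sequence))
```

In this made-up case the query is most similar to the entry annotated as a kinase-like protein, which would suggest (but not prove) a related function. Real search programs also handle gaps, sequences of different lengths, and the statistical significance of a match.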
Gene Expression and Microarrays
Many important clues about gene function come from
knowing when and where the genes are expressed. The development of microarrays has allowed the expression of thousands of genes to be monitored simultaneously.
Microarrays rely on nucleic acid hybridization, in
which a known DNA fragment is used as a probe to find
complementary sequences (Figure 14.27). Numerous
known DNA fragments are fixed to a solid support in an
orderly pattern or array, usually as a series of dots. These
DNA fragments (the probes) usually correspond to known genes.
After the microarray has been constructed, mRNA,
DNA, or cDNA isolated from experimental cells is labeled
with fluorescent nucleotides and applied to the array. Any of
the DNA or RNA molecules that are complementary to
probes on the array will hybridize with them and emit fluorescence, which can be detected by an automated scanner. An
array containing tens of thousands of probes can be applied
to a glass slide or silicon wafer just a few square centimeters
14.27 Microarrays are used to simultaneously detect the expression of many genes.
[After D. Lockhart and E. Winzeler, Nature 405:827, 2000.]
1 A microarray consists of DNA probes fixed to a solid support, such as a nylon membrane or glass slide.
2 Each spot has a different DNA probe.
3 RNA is extracted
4 …and reverse transcription in the presence of a labeled nucleotide produces cDNA molecules with a fluorescent tag.
cDNA (single stranded)
5 The tagged cDNA will pair with any
6 After hybridization, the color of the dot indicates the relative amount of mRNA in the samples.
7 A microarray can be
thousands of different
Used with cDNA, microarrays can provide information
about the expression of thousands of genes, enabling scientists to study which genes are active in particular tissues.
They can also be used to investigate how gene expression
changes in the course of biological processes such as development or disease progression. In one study, researchers
used microarrays to examine the expression patterns of
25,000 genes from primary tumors of 78 young women who
had breast cancer (Figure 14.28). Messenger RNA from
cancer cells and noncancer cells was converted into cDNA
and labeled with red fluorescent nucleotides and with green
fluorescent nucleotides, respectively. The labeled cDNAs
were mixed and hybridized to a DNA chip, which contains
DNA probes from different genes. Hybridization of the red
(cancer) and green (noncancer) cDNAs is proportional to
the relative amounts of mRNA in the samples. The fluorescence of each spot is assessed with microscopic scanning and
appears as a single color. Red indicates the overexpression of
a gene in the cancer cells relative to that in the noncancer
cells (more red-labeled cDNA hybridizes), whereas green
indicates the underexpression of a gene in the cancer cells
relative to that in the noncancer cells (more green-labeled
cDNA hybridizes). Yellow indicates equal expression in both
types of cells (equal hybridization of red- and green-labeled
cDNAs), and no color indicates no expression in either type
of cell.
In 34 of the 78 patients, the cancer later spread to other
sites; the other 44 patients remained free of breast cancer for
5 years after their initial diagnoses. The researchers identified
a subset of 70 genes whose expression patterns in the initial
tumors accurately predicted whether the cancer would later
spread (see Figure 14.28). This degree of prediction was
much higher than that of traditional predictive measures,
which are based on the size and histology of the tumor. These
results, though preliminary and confined to a small sample
of cancer patients, suggest that gene-expression data
obtained from microarrays can be a powerful tool in determining the nature of cancer treatment.
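The way spot colors are read in the two-color experiment described above can be sketched in Python. The intensity values, the background cutoff, and the twofold threshold are assumptions chosen for illustration, and `classify_spot` is a hypothetical helper; actual microarray analysis involves normalization and statistics that are not shown here.

```python
import math


def classify_spot(red, green, background=50.0, fold=2.0):
    """Classify one spot from its red (cancer cDNA) and green (noncancer
    cDNA) fluorescence intensities.

    `background` and `fold` are assumed cutoffs, not standard values."""
    if red < background and green < background:
        return "none"      # no expression detected in either sample
    log_ratio = math.log2((red + 1.0) / (green + 1.0))
    if log_ratio >= math.log2(fold):
        return "red"       # overexpressed in cancer cells
    if log_ratio <= -math.log2(fold):
        return "green"     # underexpressed in cancer cells
    return "yellow"        # similar expression in both samples


# Hypothetical (red, green) intensities for four spots on the array.
spots = {
    "gene1": (4000.0, 500.0),
    "gene2": (500.0, 4000.0),
    "gene3": (1000.0, 1100.0),
    "gene4": (10.0, 20.0),
}
colors = {gene: classify_spot(red, green) for gene, (red, green) in spots.items()}
print(colors)
```

Using the logarithm of the red-to-green ratio treats a twofold increase and a twofold decrease symmetrically, which is one common reason for working with log ratios.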
Microarrays, consisting of DNA probes attached to a solid support,
can be used to determine which RNA and DNA sequences are present in a mixture of nucleic acids. They are capable of determining
which RNA molecules are being synthesized and can thus be used
to examine changes in gene expression.
14.8 Comparative Genomics
Studies How Genomes
Genome-sequencing projects provide detailed information
about gene content and organization in different species and
even in different members of the same species, allowing
inferences about how genes function and genomes evolve.
They also provide important information about evolutionary relationships among organisms and about factors that
influence the speed and direction of evolution. Comparative
genomics is the field of genomics that compares similarities
and differences in gene content, function, and organization
among genomes of different organisms.
Hundreds of bacterial genomes have now been sequenced.
Most prokaryotic genomes consist of a single circular chromosome, but there are exceptions, such as Vibrio cholerae,