24.1 Positional Cloning: An Introduction to Genomics




[Figure 24.1 diagram: HindIII sites (H) and the extent of the probe in two individuals. First individual: HindIII digestion yields 4-kb and 2-kb fragments that hybridize to the probe. Second individual (missing site): HindIII digestion yields a single 6-kb fragment. Each digest is electrophoresed, blotted, and probed.]

Figure 24.1 Detecting a RFLP. Two individuals are polymorphic with respect to a HindIII restriction site (red). The first individual contains the site, so cutting the DNA with HindIII yields two fragments, 2 and 4 kb long, that can hybridize with the probe, whose extent is shown at top. The second individual lacks this site, so cutting that DNA with HindIII yields only one fragment, 6 kb long, which can hybridize with the probe. The results from electrophoresis of these fragments, followed by blotting, hybridization to the radioactive probe, and autoradiography, are shown at right. The fragments at either end, represented by dashed lines, do not show up because they cannot hybridize to the probe.
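The logic of Figure 24.1 can be sketched in a few lines of Python. The coordinates below (a 10-kb stretch of DNA with HindIII sites at 2, 6, and 8 kb and a probe spanning 5 to 7.5 kb) and the function name are assumptions chosen only to reproduce the 4-, 2-, and 6-kb fragments of the figure; they do not come from the source.

```python
def detected_fragments(length_kb, sites_kb, probe_kb):
    """Return the lengths of the restriction fragments that overlap the probe."""
    # Fragment boundaries are the two ends of the DNA plus every cut site.
    bounds = [0] + sorted(sites_kb) + [length_kb]
    probe_start, probe_end = probe_kb
    hits = []
    for left, right in zip(bounds, bounds[1:]):
        # A fragment shows up on the blot only if it overlaps the probe.
        if left < probe_end and right > probe_start:
            hits.append(right - left)
    return hits

# Hypothetical map: the site at 6 kb is the polymorphic one.
first = detected_fragments(10, [2, 6, 8], (5, 7.5))   # site present
second = detected_fragments(10, [2, 8], (5, 7.5))     # site missing
print(first, second)  # [4, 2] [6]
```

The end fragments (0 to 2 kb and 8 to 10 kb in this toy map) are produced by the digest but never reported, because they do not overlap the probe.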



searching for ORFs is very laborious. Several more efficient methods are available, including a procedure invented by Alan Buckler called exon amplification or exon trapping. Figure 24.2 shows how an exon trap works. We begin with a plasmid vector such as pSPL1, which Buckler designed for this purpose. This vector contains a chimeric gene under the control of the SV40 early promoter. The gene was derived from the rabbit β-globin gene by removing its second intron and substituting a foreign intron from the human immunodeficiency virus (HIV), with its own 5′- and 3′-splice sites. We insert human genomic DNA fragments into a restriction site within the intron of this plasmid, then place the recombinant vector into monkey cells (COS-7 cells) that can transcribe the gene from the SV40 promoter. Now if any of the genomic DNA fragments we put into the intron is a complete exon, with its own 5′- and 3′-splice sites, that exon will become part of the processed transcript in the COS cells. We purify the RNA made by the COS cells, reverse transcribe it to make cDNA, then subject this cDNA to amplification by PCR, using primers that are specific for the regions surrounding the insert. Thus, any new exon inserted between the primer-binding sites will be amplified. Finally, we clone the PCR products, which should represent only exons. Any other piece of DNA inserted into the intron will not have splicing signals; thus, after being transcribed, it will be spliced out along with the surrounding intron and will be lost.



CpG Islands Another gene-finding technique takes advantage of the fact that the control regions of active human

genes tend to be associated with unmethylated CpG sequences, whereas the CpGs in inactive regions are almost

always methylated. Moreover, many methylated CpG sites

have been lost over evolutionary time because of the following phenomenon, known as CpG suppression: Methyldeoxycytidine (methylC) in a methylCpG site can be

deaminated spontaneously to methylU, which is the same

as T. Thus, once a methylC is deaminated, it becomes a T.

If this change is not immediately recognized and repaired,

the T will take an A partner in the next round of DNA replication, and the mutation will be permanent. By contrast,

in an ordinary, unmethylated CpG sequence, deamination yields a U, which is subject to immediate recognition

and removal by a uracil-N-glycosylase (Chapter 20) and

replacement by an ordinary C. So unmethylated CpG

sequences have been retained in the genome.

Furthermore, the restriction enzyme HpaII cuts at the

sequence CCGG, but only if the second C is unmethylated. In other words, it will cut active genes that have

unmethylated CpGs within CCGG sites, but it will leave

inactive sequences (with methylated CCGGs) alone. Thus,

geneticists can scan large regions of DNA for “islands” of

sites that could be cut with HpaII in a “sea” of other DNA

sequences that could not be cut. Such a site is called a

CpG island, or an HTF island because it yields HpaII tiny

fragments.
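The methylation sensitivity of HpaII can be illustrated with a small sketch. It assumes a linear DNA given as a string, a set of 0-based positions of methylated cytosines, and cleavage between the two Cs of CCGG; the function name and the toy sequence are hypothetical.

```python
import re

def hpaii_fragments(dna, methylated=frozenset()):
    """Return fragment lengths after a simulated HpaII digest of linear DNA.

    methylated: 0-based positions of methylated cytosines (an assumed encoding).
    """
    cuts = []
    # The lookahead finds every CCGG, including overlapping occurrences.
    for match in re.finditer(r"(?=CCGG)", dna.upper()):
        second_c = match.start() + 1
        # HpaII cuts only if the second C of CCGG is unmethylated.
        if second_c not in methylated:
            cuts.append(second_c)  # cut between the two Cs
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]

toy = "AAAACCGGTTTTTTCCGGAA"
print(hpaii_fragments(toy))                   # [5, 10, 5]
print(hpaii_fragments(toy, methylated={15}))  # [5, 15]
```

A region rich in unmethylated CCGG sites is cut into many small pieces, which is the behavior that gives HTF islands their name.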






Chapter 24 / Introduction to Genomics: DNA Sequencing on a Genomic Scale



[Figure 24.2 diagram: the pSPL1 vector, with the SV40 promoter (P), the β-globin sequences, and the HIV tat intron carrying 5′- and 3′-splice sites (ss) and a cloning site, followed by four steps: 1. Insert exon. 2. Transcribe and splice in COS cells. 3. Reverse transcribe and PCR amplify. 4. Clone.]

Figure 24.2 Exon trapping. Begin with a cloning vector, such as pSPL1, shown here in slightly simplified form. This vector has an SV40 promoter (P), which drives expression of a hybrid gene containing the rabbit β-globin gene (orange), interrupted by part of the HIV tat gene, which includes two exon fragments (blue) surrounding an intron (yellow). The exon–intron borders contain 5′- and 3′-splice sites (ss). The tat intron contains a cloning site, into which random DNA fragments can be inserted. In step 1, an exon (red) has been inserted, flanked by parts of its own introns, and its own 5′- and 3′-splice sites. In step 2, insert this construct into COS cells, where it can be transcribed and then the transcript can be spliced. Note that the foreign exon (red) has been retained in the spliced transcript, because it had its own splice sites. Finally (steps 3 and 4), subject the transcripts to reverse transcription and PCR amplification, with primers indicated by the arrows. This gives many copies of a DNA fragment containing the foreign exon, which can now be cloned and examined. Note that a non-exon will not have splice sites and will therefore be spliced out of the transcript along with the intron. It will not survive to be amplified in step 3, so one does not waste time studying it.

SUMMARY Positional cloning begins with mapping studies (Chapter 1) to pin down the location of the gene of interest to a reasonably small region of DNA. Mapping depends on a set of landmarks to which the position of a gene can be related. Sometimes such landmarks are genes, but more often they are RFLPs: sites at which the lengths of restriction fragments generated by a given restriction enzyme vary from one individual to another. Several methods are available for identifying the genes in a large region of unsequenced DNA. One of these is the exon trap, which uses a special vector to help clone exons only. Another is to use methylation-sensitive restriction enzymes to search for CpG islands: DNA regions containing unmethylated CpG sequences.



Identifying the Gene Mutated in a Human Disease

Let us conclude this section with a classic example of

positional cloning: pinpointing the gene for Huntington

disease.

Huntington disease (HD) is a progressive nerve disorder. It begins almost imperceptibly with small tics and

clumsiness. Over a period of years, these symptoms intensify and are accompanied by emotional disturbances.

Nancy Wexler, an HD researcher, describes the advanced

disease as follows: “The entire body is encompassed by

adventitious movements. The trunk is writhing and the

face is twisting. The full-fledged Huntington patient is

very dramatic to look at.” Finally, after 10–20 years, the

patient dies.

Huntington disease is controlled by a single dominant

gene. Therefore, a child of an HD patient has a 50:50

chance of being affected. People who have the disease could

avoid passing it on by not having children, except that the

first symptoms usually do not appear until after the childbearing years.

Because they did not know the nature of the product of

the HD gene (HD), geneticists could not look for the gene

directly. The next best approach was to look for a gene or

other marker that is tightly linked to HD. Michael Conneally and his colleagues spent more than a decade trying

to find such a linked gene, but with no success.

In their attempt to find a genetic marker linked to HD,

Wexler, Conneally, and James Gusella turned next to

RFLPs. They were fortunate to have a very large family to

study. Living around Lake Maracaibo in Venezuela is a

family whose members have suffered from HD since the

early nineteenth century. The first member of the family to

be so afflicted was a woman whose father, presumably a

European, carried the defective gene. So the pedigree of this

family can be traced through seven generations, and the

number of individuals is unusually large: It is not uncommon for a family to have 15–18 children.

Gusella and colleagues knew they might have to test

hundreds of probes to detect a RFLP linked to HD, but

they were amazingly lucky. Among the first dozen probes

they tried, they found one (called G8) that detected a RFLP

that is very tightly linked to HD in the Venezuelan family.

Figure 24.3 shows the locations of HindIII sites in the stretch of DNA that hybridizes to the probe. We can see seven sites in all, but only five of these are found in all family members. The other two, marked with asterisks and numbered 1 and 2, may or may not be present. These latter two sites are therefore polymorphic, or variable.

[Figure 24.3 diagram: map of the HindIII sites (H) across the extent of the G8 probe, with the polymorphic sites marked H*(1) and H*(2), beside the HindIII fragments detected for each haplotype. A: 17.5, 3.7, 1.2, 2.3, and 8.4 kb. B: 17.5, 4.9, 2.3, and 8.4 kb. C: 15.0, 3.7, 1.2, 2.3, and 8.4 kb. D: 15.0, 4.9, 2.3, and 8.4 kb.]

Figure 24.3 The RFLP associated with the Huntington disease gene. The HindIII sites in the region that hybridizes to the G8 probe are shown. The families studied show polymorphisms in two of these sites, marked with an asterisk and numbered 1 (blue) and 2 (red). Presence of site 1 results in a 15-kb fragment plus a 2.5-kb fragment that is not detected because it lies outside the region that hybridizes to the G8 probe. Absence of this site results in a 17.5-kb fragment. Presence of site 2 results in two fragments of 3.7 and 1.2 kb. Absence of this site results in a 4.9-kb fragment. Four haplotypes (A–D) result from the four combinations of presence or absence of these two sites. These are listed at right, beside a list of polymorphic HindIII sites and a diagram of the HindIII restriction fragments detected by the G8 probe for each haplotype. For example, haplotype A lacks site 1 but has site 2. As a result, HindIII fragments of 17.5, 3.7, and 1.2 kb are produced. The 2.3- and 8.4-kb fragments are also detected by the probe, but we ignore them because they are common to all four haplotypes.

Let us see how the presence or absence of these two

restriction sites gives rise to a RFLP. If site 1 is absent, a

single fragment 17.5 kb long will be produced. However, if

site 1 is present, the 17.5-kb fragment will be cut into two

pieces having lengths of 15 kb and 2.5 kb, respectively.

Only the 15-kb band will show up on the autoradiograph

because the 2.5-kb fragment lies outside the region that

hybridizes to the G8 probe. If site 2 is absent, a 4.9-kb fragment will be produced. On the other hand, if site 2 is present, the 4.9-kb fragment will be subdivided into a 3.7-kb

fragment and a 1.2-kb fragment.
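The rules just described translate directly into a short sketch. The fragment sizes and the haplotype labels A–D are those of Figure 24.3; the function and variable names are hypothetical, and the invariant 2.3- and 8.4-kb bands are left out, as in the figure legend.

```python
def bands(site1_present, site2_present):
    """Variable HindIII fragments (kb) detected by G8 on one chromosome."""
    out = {15.0} if site1_present else {17.5}
    out |= {3.7, 1.2} if site2_present else {4.9}
    return out

# (site 1 present?, site 2 present?) for each haplotype of Figure 24.3
HAPLOTYPES = {
    "A": (False, True),
    "B": (False, False),
    "C": (True, True),
    "D": (True, False),
}

def genotype_pattern(hap1, hap2):
    """Bands seen on a Southern blot for a person carrying two haplotypes."""
    return bands(*HAPLOTYPES[hap1]) | bands(*HAPLOTYPES[hap2])

print(sorted(genotype_pattern("A", "A")))  # [1.2, 3.7, 17.5]
print(genotype_pattern("A", "D") == genotype_pattern("B", "C"))  # True
```

The second line of output shows that an AD individual and a BC individual give the same five variable bands, so a blot alone cannot tell these two genotypes apart.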

There are four possible haplotypes (clusters of alleles on a single chromosome) with respect to these two polymorphic HindIII sites, and they have been labeled A–D:

Haplotype   Site 1    Site 2    Fragments Observed (kb)
A           Absent    Present   17.5; 3.7; 1.2
B           Absent    Absent    17.5; 4.9
C           Present   Present   15.0; 3.7; 1.2
D           Present   Absent    15.0; 4.9

The term haplotype is a contraction of haploid genotype, which emphasizes that each member of the family will inherit two haplotypes, one from each parent. For example, an individual might inherit the A haplotype from one parent and the D haplotype from the other. This person would have the AD genotype. Sometimes different genotypes (pairs of haplotypes) can be indistinguishable. For example, a person with the AD genotype will have the same RFLP pattern as one with the BC genotype because all five fragments will be present in both cases. However, the true genotype can be deduced by examining the parents’ genotypes. Figure 24.4 shows autoradiographs of Southern blots of two families, using the radioactive G8 probe. The 17.5- and 15-kb fragments migrate very close together, so they are difficult to distinguish when both are present, as in the AC genotype; nevertheless, the AA genotype with only the 17.5-kb fragment is relatively easy to distinguish from the CC genotype with only the 15-kb fragment. The B haplotype in the first family is obvious because of the presence of the 4.9-kb fragment.

[Figure 24.4 autoradiographs: Southern blots for two families. Lane genotypes shown above the blots: AC AA CC AC CC CC AC AA AA CC AC AC and AC AC BC BC BC AA BC. Bands are labeled 17.5, 15.0, 8.4, 4.9, 3.7, 2.3, and 1.2 kb; the 17.5- and 15.0-kb bands are the alleles of HindIII site 1, and the 4.9-, 3.7-, and 1.2-kb bands are the alleles of HindIII site 2.]

Figure 24.4 Southern blots of HindIII fragments from members of two families, hybridized to the G8 probe. The bands in the autoradiographs represent DNA fragments whose sizes are listed at right. The genotypes of all the children and three of the parents are shown at top. The fourth parent was deceased, so his genotype could not be determined. (Source: Gusella, J.F., N.S. Wexler, P.M. Conneally, S.L. Naylor, M.A. Anderson, R.E. Tanzi, et al., A polymorphic DNA marker genetically linked to Huntington’s disease. Nature 306:236. Copyright © 1983 Macmillan Magazines Limited.)

Which haplotype is associated with the disease in the Venezuelan family? Figure 24.5 demonstrates that it is C. Nearly all individuals with this haplotype have the disease. Those who do not have the disease yet will almost certainly develop it later. Equally telling is the fact that no individual lacking the C haplotype has the disease. Thus, this is a very accurate way of predicting whether a member of this family is carrying the Huntington disease gene. A similar study of an American family showed that, in this family, the A haplotype was linked with the disease. Therefore, each family varies in the haplotype associated with the disease, but within a family, the linkage between the RFLP site and HD is so close that recombination between these sites is very rare. Thus we see that a RFLP can be used as a genetic marker for mapping, just as if it were a gene.

[Figure 24.5 diagram: seven-generation pedigree (generations I–VII) with the genotype (AA, AB, AC, BB, BC, CC, or CD) written beneath each tested individual.]

Figure 24.5 Pedigree of the large Venezuelan family with Huntington disease. Family members with confirmed disease are represented by purple symbols. Notice that most of the individuals with the C haplotype already have the disease, and that no sufferers of the disease lack the C haplotype. Thus, the C haplotype is strongly associated with the disease, and the corresponding RFLP is tightly linked to the Huntington disease gene.

Finding linkage between HD and the DNA region that hybridizes to the G8 probe also allowed Gusella and colleagues to locate HD to chromosome 4. They did this by making mouse–human hybrid cell lines, each containing only a few human chromosomes. They then prepared DNA from each of these lines and hybridized it to the radioactive G8 probe. Only the cell lines having chromosome 4 hybridized; the presence or absence of all other chromosomes did not matter. Therefore, human chromosome 4 carries HD.
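The reasoning behind the hybrid-cell experiment is a concordance test: the right chromosome is the one whose presence matches the hybridization result in every cell line. A minimal sketch follows. The four cell lines and their chromosome contents are invented for illustration and are not the panel Gusella and colleagues used.

```python
def concordant_chromosomes(cell_lines):
    """Return chromosomes whose presence matches probe hybridization in every line.

    cell_lines: list of (set of human chromosomes present, hybridized?) pairs.
    """
    candidates = set().union(*(chroms for chroms, _ in cell_lines))
    result = []
    for chrom in sorted(candidates):
        # Concordant: present in every positive line, absent from every negative one.
        if all((chrom in chroms) == hybridized for chroms, hybridized in cell_lines):
            result.append(chrom)
    return result

# Hypothetical panel of mouse-human hybrid lines.
hybrid_lines = [
    ({4, 7, 11}, True),
    ({1, 4}, True),
    ({7, 11, 21}, False),
    ({1, 21}, False),
]
print(concordant_chromosomes(hybrid_lines))  # [4]
```

Each additional cell line with a different mix of chromosomes eliminates more candidates, which is why a panel of lines, rather than a single one, is needed.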

At this point, the HD mapping team’s luck ran out. One

long detour arose from a mapping study that indicated the

gene lay far out at the end of chromosome 4. This made the

search much more difficult because the tip of the chromosome is a genetic wasteland, full of repetitive sequences,

and apparently devoid of genes. Finally, after wandering

for years in what he called a genetic “junkyard,” Gusella



and his group turned their attention to a more promising

region. Some mapping work suggested that HD resided,

not at the tip of the chromosome, but in a 2.2-Mb region

several megabases removed from the tip. Unless you know

the DNA sequence, over 2 Mb is a tremendous amount of

DNA to sift through to find a gene, so Gusella decided to

focus on a 500-kb region that was highly conserved among

about one-third of HD patients, who seemed to have a

common ancestor.

On average, a 500-kb region of the human genome contains about five genes. To find them, Gusella and colleagues






used an exon-trapping strategy and identified a handful of

exon clones. They then used these exons to probe a cDNA

library to identify the DNA copies of mRNAs transcribed

from the target region. One of the clones, called IT15, for

“interesting transcript number 15,” hybridized to cDNAs

that identified a large (10,366 nt) transcript that codes for

a large (3144 amino acid) protein called huntingtin. The

presumed protein product did not resemble any known

proteins, so that did not provide any evidence that this is

indeed HD. However, the gene had an intriguing repeat of

23 copies of the triplet CAG (one copy is actually CAA),

encoding a stretch of 23 glutamines.

Is this really HD? A comparison by Gusella’s team of the gene in affected and unaffected individuals in 75 HD families demonstrated that it is. In all unaffected individuals,

the number of CAG repeats ranged from 11 to 34, and

98% of these unaffected people had 24 or fewer CAG repeats. In all affected individuals, the number of CAG repeats had expanded to at least 42, up to a high of about

100. Thus, we can predict whether an individual will be

affected by the disease by looking at the number of CAG

repeats in this gene.
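Counting the repeats in a sequence is simple to automate. The sketch below finds the longest uninterrupted run of CAG triplets and compares it with the two ranges reported for the 75 families in the text (11 to 34 in unaffected individuals, at least 42 in affected individuals). The function names and the test sequence are hypothetical, and repeat counts that fall between the two reported ranges are deliberately left unclassified.

```python
import re

def longest_cag_run(dna):
    """Return the number of CAG triplets in the longest uninterrupted run."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)

def classify(repeats):
    """Compare a repeat count with the ranges reported in the family study."""
    if repeats >= 42:
        return "affected range"
    if repeats <= 34:
        return "unaffected range"
    return "between the reported ranges"

toy_allele = "GG" + "CAG" * 23 + "TT"   # hypothetical sequence with 23 repeats
n = longest_cag_run(toy_allele)
print(n, classify(n))  # 23 unaffected range
```

A real assay would also have to deal with interruptions such as the single CAA copy mentioned above, which this simple pattern treats as the end of a run.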

Furthermore, the severity, or age of onset, of the disease correlates at least roughly with the number of CAG repeats. People with a number of repeats at the low end of

the affected range (now known to be 36–40) generally survive well into adulthood before symptoms appear, whereas

people with a number of repeats at the high end of the

range tend to show symptoms in childhood. In one extreme

example, an individual with the highest number of repeats

detected (about 100) started showing disease symptoms at

the extraordinarily early age of 2.

Finally, two people were affected, even though their parents were not. In both cases, the affected individuals had

expanded CAG repeats, whereas their parents did not. New

mutations (expanded CAG repeats), although a rare occurrence in HD, apparently caused both these cases of disease.

Another way of demonstrating that this gene is really

HD would be to deliberately mutate it and show that the

mutation has neurological effects. Obviously, one cannot

perform such an experiment in humans, but it would be

feasible in mice, if the gene corresponding to HD is known.

Fortunately, HD is conserved in many species, including

the mouse, where the gene is known as Hdh. In 1995, a

team of geneticists led by Michael Hayden created knockout mice (Chapter 5) with a targeted disruption in exon 5

of Hdh. Mice that are homozygous for this mutation die in

utero. Heterozygotes are viable, but they show loss of neurons with corresponding lowering of intelligence. This reinforces the notion that Hdh, and therefore HD, plays an

important role in the brain—exactly what we would expect

of the gene that causes HD.

How can we put this new knowledge to work? One

obvious way is to perform accurate genetic screening to

detect people who will be affected by the disease. In fact, by






counting the CAG repeats, we may even be able to predict

the age of onset of the disease. However, that kind of information is a mixed blessing, as it can be psychologically

devastating. What we really need, of course, is a cure, but

that may be a long way off.

The Advantage of Genomic Data The positional cloning

study we have just examined took years, and much of that

time was spent sequencing DNA in the suspected regions

and trying to determine which gene in the sequence was the

most likely culprit. With the human genome now finished,

that job has become much easier. Just how much easier is

indicated by Neal Copeland, a mouse geneticist who has

been doing positional cloning in mice for years. He says, “It

took us 15 years to get 10 possible cancer genes before we

had the sequence. And it took us a few months to get 130

genes once we had the sequence.” He was talking about the

mouse sequence, of course, but the same principle applies

to humans, and mouse positional-cloning studies very often identify genes that cause similar problems in humans.

So one of the biggest anticipated payoffs of genomics research will be the acceleration of discovery of disease genes

in humans. You should not conclude from this discussion

that positional cloning is obsolete. It will be important as

long as we are curious about finding genes responsible for

traits in any organism. Sequenced genomes simply make

positional cloning much easier.

SUMMARY Using RFLPs, geneticists mapped the



Huntington disease gene (HD) to a region near the

end of chromosome 4. Then they used an exon trap

to identify the gene itself. The mutation that causes

the disease is an expansion of a CAG repeat from the

normal range of 11–34 copies, to the abnormal range

of at least 38 copies. The extra CAG repeats cause

extra glutamines to be inserted into huntingtin.



24.2 Techniques in Genomic Sequencing

The first genome to be sequenced, as you might expect, was a very simple one: the small DNA genome of an E. coli phage called φX174. Frederick Sanger, the inventor of the dideoxy chain termination method of DNA sequencing, obtained the sequence of this 5375-nt genome in 1977.

What kind of information can we glean from this sequence? First, we can locate exactly the coding regions for

all the genes. This tells us the spatial relationships among

genes and the distances between them to the exact nucleotide. How do we recognize a coding region? It contains an

ORF that is long enough to code for one of the phage






proteins. Furthermore, the ORF must start with an ATG

(or occasionally a GTG) triplet, corresponding to an AUG

(or GUG) translation initiation codon, and end with the

DNA equivalent of a stop codon (UAG, UAA, or UGA). In

other words, an ORF in a bacterium or phage is the same

as a gene’s coding region.

The base sequence of the phage DNA also tells us the

amino acid sequences of all the phage proteins. All we have

to do is use the genetic code to translate the DNA base sequence of each open reading frame into the corresponding

amino acid sequence. This may sound like a laborious process, but a personal computer can do it in a split second.
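To show how little work this is for a computer, here is a minimal sketch of an ORF finder that scans the three reading frames of one strand for a start codon (ATG or GTG) followed in frame by a stop codon, then translates what lies between. The function names, the `min_codons` parameter, and the test sequence are assumptions for illustration; a real gene finder would also scan the complementary strand and apply a much longer length cutoff.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, written for the nontemplate DNA strand ("*" = stop).
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
STARTS = {"ATG", "GTG"}
STOPS = {"TAA", "TAG", "TGA"}

def translate(dna):
    """Translate DNA codon by codon, ignoring any incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def find_orfs(dna, min_codons=2):
    """Return (start, end, protein) for each ORF in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    # An initiating GTG is still read as methionine.
                    protein = "M" + translate(dna[start + 3:i])
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs

print(find_orfs("CCATGAGTAAATAAGG"))  # [(2, 14, 'MSK')]
```

The coordinates are 0-based, and the end position includes the stop codon.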

Sanger’s analysis of the open reading frames of the φX174 DNA revealed something unexpected and fascinating: Some of the phage genes overlap. Figure 24.6a shows that the coding region for gene B lies within gene A and the coding region for gene E lies within gene D. Furthermore, genes D and J overlap by 1 bp. How can two genes occupy the same space and code for different proteins? The answer is that the two genes are translated in different reading frames (Figure 24.6b). Because entirely different sets of codons will be encountered in these two frames, the two protein products will also be quite different.
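The effect of shifting the reading frame can be checked on the 21-base stretch shown in Figure 24.6b, where the ends of genes D and E and the start of gene J meet. The sketch below translates that stretch in all three frames; the sequence is taken from the figure as extracted, and the helper names are hypothetical.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, written for the nontemplate DNA strand ("*" = stop).
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate DNA codon by codon, ignoring any incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

junction = "GAAGGAGTGATGTAATGTCTA"   # end of gene D region, Figure 24.6b
frames = {shift: translate(junction[shift:]) for shift in range(3)}
for shift, peptide in frames.items():
    print(shift, peptide)
# 0 EGVM*CL   (gene D frame: Glu Gly Val Met Stop)
# 1 KE*CNV    (gene E frame: Lys Glu Stop)
# 2 RSDVMS    (gene J frame: the Met Ser start appears at the end)
```

The same 21 bases therefore carry the last codons of gene D, the last codons of gene E, and the first codons of gene J, each readable only in its own frame.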

This was certainly an interesting finding, and it raised

the question of how common this phenomenon would be.

So far, major overlaps seem to be confined almost exclusively to viruses, which is not surprising because these simple infectious agents have small genomes in which the

premium is on efficient use of the genetic material. Moreover, viruses have prodigious power to replicate, so enormous numbers of generations have passed during which

evolution has honed the viral genomes.

[Figure 24.6 diagram. (a) Map of the φX174 genes A, B, C, D, E, J, F, G, and H, with B drawn within A and E within D. (b) Sequence excerpts from the nontemplate strand: bases 1–6, ATGAGT (gene D: Met Ser, amino acids 1–2); bases 175–184, GTTTATGGTA (gene D: Val Tyr Gly, amino acids 59–61; gene E: Met Val, amino acids 1–2); bases 445–465, GAAGGAGTGATGTAATGTCTA (gene D: Glu Gly Val Met Stop, amino acids 149–152; gene E: Lys Glu Stop, amino acids 89–90; gene J: Met Ser, amino acids 1–2).]

Figure 24.6 The genetic map of phage φX174. (a) Each letter stands for a phage gene. (b) Overlapping reading frames of φX174. Gene D (pink) begins with the base numbered 1 in this diagram and continues through base number 459. This corresponds to amino acids 1–152 plus the stop codon TAA. Dots represent bases or amino acids not shown. Only the nontemplate strand is shown. Gene E (blue) begins at base number 179 and continues through base number 454, corresponding to amino acids 1–90 plus the stop codon TGA. This gene uses the reading frame one base to the right, relative to the reading frame of gene D. Gene J (gray) begins at the base number 459 and uses the reading frame one base to the left, relative to gene D.

With the advent of automated sequencing, geneticists have added much larger genomes to the list of total known sequences. In 1988, D.J. McGeoch and colleagues published the sequence of an important human virus (herpes simplex virus I) with a relatively large genome: 152,260 bp. In 1995, Craig Venter and Hamilton Smith and colleagues determined the entire base sequences of the genomes of two bacteria: Haemophilus influenzae and Mycoplasma genitalium. The H. influenzae (strain Rd) genome contains 1,830,137 bp and it was the first genome from a free-living organism to be completely sequenced. The M. genitalium genome, at only 580,000 bp, is the smallest of any known free-living organism and contains only about 470 genes.

In April 1996, the leaders of an international consortium of laboratories announced another milestone: The 12-million-bp genome of baker’s yeast (Saccharomyces cerevisiae) had been sequenced. This was the first eukaryotic genome to be entirely sequenced. Later in 1996, the first genome of an organism (Methanococcus jannaschii) from the third domain of life, the archaea, was sequenced. Then, in 1997, the long-awaited sequence of the 4.6-million-bp E. coli genome was reported. This is only about one-third the size of the yeast genome, but the importance of E. coli as a genetic tool made this a milestone as well.

In 1998, the sequence of the first animal genome, from the roundworm Caenorhabditis elegans, was reported. The first plant genome (from the mustard family member Arabidopsis thaliana) was completed in 2000. C. elegans and A. thaliana are both model organisms chosen for study because of their small genome size, short generation time, and their ease of manipulation in genetic experiments. C. elegans has the additional advantages of having fewer than 1000 cells, and being transparent, so the development of each of its cells can be tracked visually. Two other famous model organisms are the fruit fly Drosophila melanogaster and the house mouse Mus musculus. The sequences of the genomes of these two organisms were reported in 2000 and 2002, respectively. Also in 2000, the eagerly awaited rough draft of the human genome sequence was announced. By 2001, this “working draft” of the human genome was published.

In 2002, several important genomes were reported,

in at least draft form. These included the genomes of the

single-celled parasite Plasmodium falciparum, which

causes  malaria, and the mosquito Anopheles gambiae,

which is the major carrier of the parasite. Together, these

genomes promise to help in designing better ways of combating the terrible scourge of malaria. The year 2002 also

saw the publication of draft sequences of the genomes of

two common varieties of rice (Oryza sativa). This is the

first cereal plant genome to be sequenced, and it has enormous potential significance for human nutrition. Much of

the world’s population relies on cereals, and rice in particular, for the bulk of their food.

The genomic sequences of two more vertebrates also

appeared in 2002: The tiger pufferfish (Fugu rubripes), and

the house mouse (Mus musculus). Comparison of these sequences to that of the human genome has already shed

light on vertebrate evolution. Additional help on this evolutionary investigation has come from the sequence of the

genome of the sea squirt, Ciona intestinalis. The adult of

this species is a sessile marine organism that attaches itself

to rocks and pier pilings. It bears scant resemblance to a

vertebrate, but its larval form resembles a tadpole, complete with a dorsal column made of cartilage that bears

some resemblance to a spine. Thus, the sea squirt is a chordate, in the same phylum with the vertebrates. Comparison

of the genome of this organism with those of vertebrates

and invertebrates, such as nematodes and fruit flies, will

give us additional insight into vertebrate evolution.

Most molecular evolution studies depend on comparisons of base sequences of parts of genomes from different

organisms. The guiding principle is that there is a relationship between the divergence of the genomic sequences between any two organisms and the evolutionary distance

between those two organisms. Thus, the genomes of organisms that diverged relatively recently, such as the mouse and

human, should be more similar than the genomes of organisms that diverged longer ago, such as the sea squirt and

human. In general, this is certainly true, but genomic studies

on these and other organisms have revealed some unexpected features. For example, the rate of evolution of the

human genome is not constant throughout. Instead, there

are regions of relatively rapid change interspersed with regions that have changed relatively slowly over time. It will

be fascinating to discover the reasons for these differences.
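The guiding principle can be made concrete with the simplest measure of divergence, the fraction of aligned positions at which two sequences differ. The sketch below assumes the sequences are already aligned and of equal length, with "-" marking gaps; the three short sequences are invented for illustration and are not real genomic data.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

# Hypothetical aligned fragments from a reference and two other species.
reference = "ATGGCCCTGA"
close_relative = "ATGGCTCTGA"
distant_relative = "ATGACTTTGC"

print(p_distance(reference, close_relative))    # 0.1
print(p_distance(reference, distant_relative))  # 0.4
```

Computing this value separately for successive windows along a genome is one simple way to see that some regions have changed faster than others.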

Another lesson from the genomes sequenced so far is

that the size of an organism’s genome tends to correlate

with the organism’s complexity. (On the other hand, we

discovered in Chapter 2 when we discussed the C-value

paradox that there are many exceptions to this general

rule.) In accord with the rule, prokaryotic genomes tend



to be much smaller than eukaryotic ones. However, it is

interesting that there is some overlap. For example, the

smallest eukaryotic genome sequenced to date is that of the

obligate intracellular parasite of humans and other mammals,

Encephalitozoon cuniculi. This organism has a genome

comprising only about 2.9 Mb, and has only 1997 ORFs

that could potentially code for proteins. (Of course, a parasitic lifestyle enables an organism to survive with fewer

genes because it can rely on its host for many of its needs.)

By contrast, the largest bacterial genome, as of 2008, is that

of the social bacterium Sorangium cellulosum. It has a

genome composed of about 13 Mb, which is even larger

than the genome of budding yeast.

On April 14, 2003, the International Human Genome

Sequencing Consortium announced that it had produced a

“finished” human genome sequence—two years ahead of

schedule. That is, it had done 99% of the sequencing that

was possible with 2003 technology, the sequence was subject to an error rate of only one in 100,000, and all sequences

were in the proper order. This was a significant improvement over the rough draft announced two years earlier. Several hundred gaps remained to be filled, but they were

mostly very challenging repetitive regions and centromeres.

As of December 6, 2010, more than 1440 complete

genomes had been sequenced, of which 1372 were from

microbes, according to the NCBI website (www.ncbi.nlm.nih.gov/genome).
Table 24.1 presents a time line of some of

the most important achievements in genome sequencing. In

the following sections we will discuss the lessons we have

learned from these sequences.

SUMMARY The base sequences of viruses and organisms ranging from phages to bacteria to animals

and plants have been obtained. A rough draft and

finished version of the human genome have also

been obtained. Comparison of the genomes of

closely related and more distantly related organisms

can shed light on the evolution of these species.



The Human Genome Project

In 1990, American geneticists embarked on an ambitious

quest: to map and ultimately sequence the entire human

genome. This effort, which quickly became an international program, was somewhat controversial at first, partly

because of the enormous effort and cost of carrying it

through to its ultimate goal: knowing the entire base sequence of every one of the human chromosomes. The reason for the high cost, of course, is that the human genome

is huge—more than 3 billion bp. To get an idea of the magnitude of this task, consider that if all 3 billion bases were

written down, it would take about 500,000 pages of the

journal Nature to contain all the information. If you could



stand the boredom, it would take you about 60 years,
working 8 h/day, every day, at 5 bases a second, to read it
all. Assuming a 1990 cost of about a dollar a base, the
project would consume more than $3 billion, vastly more
than we are used to devoting to a single biological project.
In the end, more efficient sequencing methods allowed the
project to be completed much sooner and at a lower cost
than originally estimated.

Table 24.1 Milestones in Genomic Sequencing

Genome (Importance)                                                         Size (bp)       Year
Phage φX174 (first genome)                                                  5375            1977
Phage λ (large-DNA phage)                                                   48,513          1983
Herpes simplex virus I (large-DNA eukaryotic virus)                         152,260         1988
Haemophilus influenzae (bacterium, first organism)                          1,830,000       1995
Mycoplasma genitalium (smallest bacterial genome)                           580,000         1995
Saccharomyces cerevisiae (yeast, first eukaryote)                           12,068,000      1996
Methanococcus jannaschii (first archaeon)                                   1,660,000       1996
Escherichia coli (best studied bacterium)                                   4,639,221       1997
Caenorhabditis elegans (first animal, roundworm)                            97,000,000      1998
Human chromosome 22 (first human chromosome)                                53,000,000      1999
Arabidopsis thaliana (first plant, mustard family)                          120,000,000     2000
Drosophila melanogaster (a favorite genetic model)                          180,000,000     2000
Human (working draft of the “holy grail” of genomics)                       3,200,000,000   2001
Plasmodium falciparum (the malaria parasite)                                23,000,000      2002
Anopheles gambiae (the major mosquito malaria carrier)                      278,000,000     2002
Fugu rubripes (tiger pufferfish)                                            365,000,000     2002
Mus musculus (house mouse)                                                  2,500,000,000   2002
Ciona intestinalis (sea squirt, a primitive chordate)                       117,000,000     2002
Canis lupus familiaris (dog, working draft)                                 ~2,400,000,000  2003
Gallus gallus (chicken, first farm animal)                                  1,050,000,000   2004
Human (finished sequence)                                                   3,200,000,000   2004
Oryza sativa (rice, first cereal grain)                                     489,000,000     2005
Pan troglodytes (chimpanzee, our closest relative, working draft)           ~3,000,000,000  2005
Three trypanosomatids (Trypanosoma cruzi, T. brucei, and Leishmania         25–55,000,000   2005
  major, parasites that cause severe human illness)
Populus trichocarpa (black cottonwood, first tree)                          ~485,000,000    2006
First individual humans (two Caucasians, one African, and one Han Chinese)  3,200,000,000   2007 and 2008
Homo neanderthalensis (our closest evolutionary relative, working draft)    ~3,000,000,000  2010

The original plan for the Human Genome Project was
systematic and conservative: First, geneticists would prepare genetic and physical maps of the genome. These would
contain the markers, or signposts, that would allow DNA
sequences to be pieced together in the proper order. The
bulk of the sequencing would be done only after the mapping was complete and clones representing all points on the
map were in hand, systematically stored in freezers around
the world. The original target date for completion of the
sequence was 2005.
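The reading-time and cost estimates quoted above for the human genome follow from simple arithmetic. The short Python sketch below redoes that arithmetic with the figures given in the text (3 billion bases, 5 bases a second, 8 h/day, about a dollar a base). The variable names are illustrative only.

```python
GENOME_BASES = 3_000_000_000   # "more than 3 billion bp"
BASES_PER_SECOND = 5           # reading speed assumed in the text
HOURS_PER_DAY = 8              # working day assumed in the text
COST_PER_BASE_USD = 1.0        # approximate 1990 cost per base

seconds = GENOME_BASES / BASES_PER_SECOND
days = seconds / 3600 / HOURS_PER_DAY
years = days / 365

total_cost_usd = GENOME_BASES * COST_PER_BASE_USD

print(f"Reading time: about {years:.0f} years")
print(f"Cost: about ${total_cost_usd / 1e9:.0f} billion")
```

The calculation gives a little under 60 years of reading and about $3 billion, in line with the rounded figures in the text.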

Then, in May of 1998, Craig Venter, who had established a private, for-profit company, Celera, to sequence the

human genome (and other genomes), shocked the genomics community by announcing that Celera would complete

a rough draft of the human genome by the end of 2000.

That timetable was astonishing enough, but the method by

which he proposed to do the sequencing was even more

arresting. Instead of relying on a map, with the ordered

clones used to build it, Venter proposed a shotgun sequencing approach in which the whole human genome would be

chopped up and cloned, then the clones would be sequenced at random, and finally the sequences would be

pieced together using powerful computer programs that

find overlapping sequences. It was not long before Francis






Collins, director of the publicly financed Human Genome

Project, rose to Venter’s challenge and promised that he

and his colleagues would also produce a rough draft by

the end of 2000, and a polished final draft by 2003, using

the map-then-sequence strategy.
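The core idea of shotgun sequencing, piecing random reads together by finding overlapping sequences, can be sketched in a few lines of Python. This is a toy greedy procedure written for illustration, not the software Celera actually used; the function names, the minimum overlap of 3 bases, and the 20-base "genome" are all invented. Real assemblers must also cope with sequencing errors, reads from both strands, and repeated sequences.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                n = overlap(a, b, min_len)
                if n > best_len:
                    best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical 20-base "genome" and four overlapping 8-base reads,
# listed in scrambled order as they would come off a sequencer.
genome = "AGCTTAGGCATCCGATTACG"
reads = ["CATCCGAT", "AGCTTAGG", "CGATTACG", "TAGGCATC"]

print(greedy_assemble(reads))
```

In this small example the reads overlap by four bases and contain no repeats, so the greedy merge recovers the original sequence as a single contig.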

The upshot of this race was a tie of sorts. Venter and

Collins appeared with President Clinton and other dignitaries at a ceremony in the East Room of the White House on

June 26, 2000, to announce the completion of a rough draft

of the human genome. We will examine the two approaches

to sequencing large genomes: mapping, then sequencing

(clone by clone); and shotgun sequencing. But first, let us

examine the cloning vectors that have been developed for

massive projects like the Human Genome Project.



Vectors for Large-Scale Genome Projects

No matter which sequencing strategy is used, one must first

clone fragments of the genome in appropriate vectors, and

large fragments are particularly valuable. We will describe

two of the most popular here: yeast artificial chromosomes

and bacterial artificial chromosomes. The early mapping

work relied on yeast artificial chromosomes, so we will

begin with those.

Yeast Artificial Chromosomes The main problem with

the cloning tools described in Chapter 4 is that they do not

hold enough DNA for large-scale physical mapping of the

human genome. Even the cosmids accommodate DNA inserts up to only about 50 kb, which is too small for efficient

mapping of regions spanning more than a million bases.

Vectors called yeast artificial chromosomes, or YACs,

were very useful in mapping the human genome because

they could accommodate hundreds of kilobases each. YACs containing a megabase or more are

known as “megaYACs.” A YAC contains a left and right

yeast chromosomal telomere (Chapter 21), which are both

necessary to protect the chromosome’s ends, and a yeast

centromere, which is necessary for segregation of sister

chromatids to opposite poles of the dividing yeast cell. The

centromere is placed adjacent to the left telomere, and a

huge piece of human (or any other) DNA can be placed in

between the centromere and the right telomere, as shown

in Figure 24.7. The large DNA inserts are prepared by

slightly digesting long pieces of human DNA with a restriction enzyme. The YACs, with their huge DNA inserts, can

then be introduced into yeast cells, where they will replicate just as if they were normal yeast chromosomes.

Using YACs, geneticists made great strides in the mapping phase of the Human Genome Project. They produced

a genetic map of the whole genome that provided an average resolution of 0.7 centimorgan. A centimorgan (cM) is

the distance that yields a 1% recombination frequency

between two markers and corresponds to an average of

about 1 Mb in humans. These researchers also produced
relatively high-resolution physical maps of two of the
smallest chromosomes, 21 and Y. These maps were especially useful in that they represented long stretches of overlapping DNA segments cloned in YACs. Thus, in the days
before the human genome was sequenced, if you were interested in a disease gene that mapped to one of these chromosomes, you had a much simplified task. You needed only to
discover two markers flanking the gene of interest, look on the
map to find which YAC or YACs contained these markers,
obtain the YACs, and begin your final search for the gene.

Figure 24.7 Cloning in yeast artificial chromosomes. We
begin with two tiny pieces of DNA from the two ends of a yeast
chromosome. One of these, the left arm, contains the left telomere
(yellow, labeled L) plus the centromere (red, labeled C). The right arm
contains the right telomere (yellow, labeled R). These two arms are
ligated to a large piece of foreign DNA (blue), several hundred
kilobases of human DNA, for example, to form the YAC, which can
replicate in yeast cells along with the real chromosomes.
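The definition of the centimorgan given in the text lends itself to a short calculation: count the recombinant offspring, convert the recombination frequency to a map distance, and use the rough human average of 1 Mb per cM to estimate a physical distance. In the Python sketch below the function names and offspring counts are invented for illustration, and the linear conversion is only a fair approximation for closely linked markers.

```python
MB_PER_CM_HUMAN = 1.0  # rough genome-wide average quoted in the text


def map_distance_cm(recombinants: int, total_offspring: int) -> float:
    """Map distance in centimorgans: 1 cM = 1% recombination frequency."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return 100.0 * recombinants / total_offspring


def approx_physical_distance_mb(distance_cm: float) -> float:
    """Very rough physical distance, using the human average of ~1 Mb per cM."""
    return distance_cm * MB_PER_CM_HUMAN


# Hypothetical cross: 7 recombinants among 1000 offspring
distance = map_distance_cm(7, 1000)
print(f"{distance:.1f} cM, roughly {approx_physical_distance_mb(distance):.1f} Mb")
```

With these made-up counts the two markers lie 0.7 cM apart, the same figure as the average resolution of the genetic map described above.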

Bacterial Artificial Chromosomes Despite all the success

they made possible in human genome mapping, YACs suffer from several serious drawbacks: They are inefficient

(not many clones are obtained per microgram of DNA);

they are hard to isolate from yeast cells; they are unstable;

and they tend to contain scrambled inserts that are really

composites of DNA fragments from more than one site.

Bacterial artificial chromosomes (BACs) solve all of these

problems and were therefore the vector of choice for

much of the sequencing phase of the Human Genome

Project.

BACs are based on a well-known natural plasmid that

inhabits E. coli cells: the F plasmid. This plasmid allows

conjugation between bacterial cells. In some conjugation

events, the F plasmid itself is transferred from a donor F+
cell to a recipient F− cell, converting the latter to an F+
cell. In other events, a small piece of host DNA is transferred as an insert in the F plasmid (which is called an F′

plasmid if it has an insert of foreign DNA). And in still

other events, the F plasmid inserts into the host chromosome and mobilizes the whole chromosome to pass from

the donor cell to the recipient cell. Thus, because the E. coli

chromosome contains over 4 million bp, the F plasmid can

obviously accommodate a large insert of DNA. In

practice, BACs usually have inserts less than 300,000 bp

(average about 150,000 bp), and these plasmids are stable
in vivo and in vitro. Unlike the linear YACs, which tend to
break under shearing forces, the circular, supercoiled BACs
resist breakage.

Figure 24.8 Map of the BAC vector, pBAC108L (6.9 kb). Key features include the cloning sites HindIII and BamHI, at top; the chloramphenicol resistance
gene (CmR), used as a selection tool; the origin of replication (oriS); and the genes governing partition of plasmids to daughter cells (ParA and ParB).
Other labels on the map: NotI (two sites), SalI (two sites), and repE.

Figure 24.8 shows the map of one of the first BACs,

which was developed by Melvin Simon and colleagues in

1992. It has an origin of replication and a cloning site with two
restriction sites (for HindIII and BamHI) into which large
DNA fragments may be inserted. It also has genes (the Par
genes) that govern plasmid partition to the daughter cells
and keep the plasmid copy number at about two per cell,
which contributes to the stability of the plasmid. Finally, it has a
chloramphenicol-resistance gene to enable selection of cells
that have the plasmid.

SUMMARY Two high-capacity vectors have been



used extensively in the Human Genome Project.

Much of the mapping work was done with yeast

artificial chromosomes (YACs), which can accept

inserts of a million or more base pairs. Most of the

sequencing work was performed with bacterial artificial chromosomes (BACs), which can accept up to

about 300,000 bp. The BACs are more stable and

easier to work with than the YACs.



The Clone-by-Clone Strategy

This strategy has inherent appeal because it is so systematic. First, the whole genome is mapped by finding markers

regularly spaced along each chromosome. A by-product of

the mapping is a collection of clones corresponding to the

markers. Because we already know the order of these

clones, we can sequence each one and put that sequence in

its proper place in the whole genome. Thus, this method is

commonly called the clone-by-clone sequencing strategy.

Aside from their usefulness in cloning, genetic and physical

maps have another important benefit: They give us signposts to use when searching for the genes responsible for

diseases. In the next section, we will consider some of the

most powerful methods used in mapping large genomes in

preparation for sequencing. As you read this section, bear

in mind that these techniques are designed to map markers

that are not genes but simply stretches of DNA that vary

from one individual to another. We have already seen one

example of such markers: restriction fragment length polymorphisms (RFLPs).

Variable Number of Tandem Repeats The greater the degree of polymorphism of a RFLP, the more useful it will be.

If only 1 person in 100 has one form of the RFLP (the 6-kb






fragment in Figure 24.1, for example), and the other 99

have the other form (the 4-kb and 2-kb fragments), one

must screen many individuals before finding the one rare

variant. This makes mapping very tedious. However, some

RFLPs, called variable number tandem repeats, or VNTRs,

are more useful. These derive from minisatellites (Chapter 5),

stretches of DNA that contain a short core sequence repeated over and over in tandem (head to tail). Because the

number of repeats of the core sequence in a VNTR is likely

to be different from one individual to another, VNTRs are

highly polymorphic, and therefore relatively easy to map.

However, VNTRs have a disadvantage as genetic markers:

They tend to bunch together at the ends of chromosomes,

leaving the interiors of the chromosomes relatively devoid

of markers.

Sequence-Tagged Sites Another kind of anonymous

marker, which is very useful to genome mappers, is the

sequence-tagged site (STS). STSs are short sequences,

about 60–1000 bp long, that can be detected by PCR.

Figure 24.9 illustrates how to use PCR to detect an STS.

One must first know enough about the DNA sequence in

the region being mapped to design short primers that will
hybridize a few hundred base pairs apart and cause amplification of a predictable length of DNA in between. One can
then apply PCR with these two primers to any unknown
DNA; if the proper size amplified DNA fragment appears,
then the unknown DNA has the STS of interest. Notice
that hybridization of the primers to the unknown DNA is
not enough; they must hybridize a specific number of base
pairs apart to give the right size PCR fragment. This provides a check on the specificity of hybridization. One great
advantage of STSs as a mapping tool is that no DNA must
be cloned and examined and kept in someone’s freezer.
Instead, the sequences of the primers used to generate an
STS are published and then anyone in the world can order
those same primers and find the same STS in an experiment that takes just a few hours. Another big advantage is
that it takes much less DNA to perform PCR than to do a
Southern blot.

Figure 24.9 Sequence-tagged sites. We start with a large cloned
piece of DNA, extending indefinitely in either direction. The sequences
of small areas of this DNA are known, so one can design primers that
will hybridize to these regions and allow PCR to produce double-stranded fragments of predictable lengths. In this example, two PCR
primers (red) spaced 250 bp apart have been used. Several cycles of
PCR generate many copies of a double-stranded PCR product that is
precisely 250 bp long. Electrophoresis of this product allows one to
measure its size exactly and confirm that it is the correct one.
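The logic of STS detection described in the text (both primers must hybridize, and they must do so the right distance apart) can be mimicked with a simple "in silico PCR" check. The Python sketch below is illustrative only: the primer sequences, the template, and the function names are invented, and it considers only exact primer matches on one strand of the template.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def sts_product_length(template: str, forward_primer: str, reverse_primer: str):
    """Predicted PCR product length, or None if a primer site is missing.

    The forward primer matches the top strand directly. The reverse primer
    matches the bottom strand, so on the top strand its site appears as the
    reverse complement, downstream of the forward primer site.
    """
    template = template.upper()
    start = template.find(forward_primer.upper())
    if start == -1:
        return None
    site = reverse_complement(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return end + len(site) - start


def has_sts(template: str, forward_primer: str, reverse_primer: str,
            expected_length: int) -> bool:
    """True only if both primers bind and give the expected product size."""
    return sts_product_length(template, forward_primer, reverse_primer) == expected_length


# Hypothetical primers and template (not a real STS)
forward = "GATTACAGGCTA"
reverse = "TTCGGACCATGA"
template = "CCGT" + forward + "ACGT" * 5 + reverse_complement(reverse) + "GGA"

print(sts_product_length(template, forward, reverse))
print(has_sts(template, forward, reverse, expected_length=44))
print(has_sts("ACGT" * 20, forward, reverse, expected_length=44))
```

In this example the two 12-base primers flank 20 bases, so the predicted product is 44 bp. A template that lacks either primer site, or that gives a product of the wrong size, is scored as not containing the STS.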

Microsatellites STSs are very useful in physical mapping

or locating specific sequences in the genome. But they are

worthless as markers in traditional genetic mapping unless

they are polymorphic. Only then can we use them to determine genetic linkage. Fortunately, geneticists have discovered a class of STSs called microsatellites that are highly

polymorphic. Microsatellites are similar to minisatellites in

that they consist of a core sequence repeated over and over

many times in a row. However, whereas the core sequence

in typical minisatellites is a dozen or more base pairs long,

the core in microsatellites is much smaller—usually only

2–4 bp long. In 1992, Jean Weissenbach and his colleagues

produced a linkage map of the entire human genome based

on 814 microsatellites containing a C–A dinucleotide repeat. They isolated cloned DNAs containing these microsatellites and used their sequences to design PCR primers

that flank the repeats at each locus. A given pair of primers

yielded a PCR product whose size depended on the number

of C–A repeats in a given individual’s microsatellite at that

locus. Happily, the number of repeats varied quite a bit

from one individual to another. Besides the fact that microsatellites are highly polymorphic, they are also widespread

and relatively uniformly distributed in the human genome.

Thus, they are ideal as markers for both linkage and physical mapping.
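Because the primers flank the repeats, the size of the PCR product at a microsatellite locus depends directly on the number of repeats. The Python sketch below makes that relationship explicit for a C–A dinucleotide repeat. The 40 bp of flanking sequence (primers included) and the repeat counts are hypothetical values chosen for illustration.

```python
REPEAT_UNIT = "CA"  # the C-A dinucleotide repeat described in the text


def pcr_product_size(flank_bp: int, n_repeats: int, unit: str = REPEAT_UNIT) -> int:
    """Product length = flanking sequence (including primers) + repeat tract."""
    return flank_bp + n_repeats * len(unit)


def repeats_from_product(product_bp: int, flank_bp: int, unit: str = REPEAT_UNIT) -> int:
    """Infer the repeat count of an allele from its measured product size."""
    repeat_bp = product_bp - flank_bp
    if repeat_bp < 0 or repeat_bp % len(unit) != 0:
        raise ValueError("product size is inconsistent with this locus")
    return repeat_bp // len(unit)


# Hypothetical locus with 40 bp of flanking sequence in the PCR product
FLANK_BP = 40
for n in (19, 31):
    print(n, "repeats ->", pcr_product_size(FLANK_BP, n), "bp")

print(repeats_from_product(102, FLANK_BP), "repeats in a 102-bp product")
```

Two individuals with 19 and 31 repeats at this hypothetical locus would give products of 78 and 102 bp, which are easily distinguished by electrophoresis.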

Genetic (linkage) mapping with microsatellites is done

by the same technique outlined in Chapter 1 for traditional

genetic markers in fruit flies. Instead of determining the

recombination frequency between, say, wing shape and eye

color, geneticists would determine the recombination frequency between two microsatellites. For example, suppose
that a man’s DNA yields a microsatellite at

one locus that is 78 bp long and a microsatellite at a nearby

locus that is 42 bp long. His wife has a microsatellite at the

first locus that is 102 bp long and a microsatellite at

the second locus that is 36 bp long. Within limits, the more

their children show nonparental combinations of these two


