
25.1 Functional Genomics: Gene Expression on a Genomic Scale






[Figure 25.2 diagram; image not reproduced. Light passes through a mask onto a coated glass substrate; chemical coupling then adds a blocked G (first cycle) or a blocked A (second cycle) to the unmasked spots.]



Figure 25.2 Growing oligonucleotides on a glass substrate.

The glass is coated with a reactive group that is blocked with a

photosensitive agent (red). This blocking agent can be removed with

light, but parts of the plate are masked (blue) so the light cannot get

through. In the first cycle, four of the six spots pictured are masked,

so the light reaches only two unmasked spots and removes the

blocking agent. Then a blocked guanosine nucleotide is chemically

coupled to the unblocked spots. In the second cycle, three spots are

masked, and the other three are therefore exposed to the light. This



removes the blocking agent from three spots, including the first one,

which already has a G attached. Thus, after a blocked adenosine

nucleotide is chemically coupled to the three unblocked spots, the

first spot has a G–A dinucleotide, the third and sixth spots have an

A mononucleotide, the fourth has a G mononucleotide, and the

second and fifth spots, which were masked in both cycles, have no

nucleotides attached yet. As the cycle is repeated over and over with

different masking patterns and different nucleotides, unique

oligonucleotides are built up in each spot.
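The masking scheme in Figure 25.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the manufacturer's procedure: the function names `run_cycles` and `all_kmer_schedule` are hypothetical, spots are numbered from 0, and each cycle is modeled simply as one base plus the set of spots exposed to light. The second function shows one possible schedule of 4 × n cycles that builds every n-mer, one per spot.

```python
from itertools import product

BASES = "ACGT"


def run_cycles(n_spots, cycles):
    """Apply (base, exposed_spots) cycles; only exposed spots gain the base."""
    spots = [""] * n_spots
    for base, exposed in cycles:
        for i in exposed:
            spots[i] += base
    return spots


# The two cycles described in the caption (spots numbered 0-5 here):
# cycle 1 couples G to spots 1 and 4; cycle 2 couples A to spots 1, 3, and 6.
caption_result = run_cycles(6, [("G", {0, 3}), ("A", {0, 2, 5})])


def all_kmer_schedule(n):
    """Return all n-mers and a 4*n-cycle schedule that builds one per spot."""
    targets = ["".join(p) for p in product(BASES, repeat=n)]
    cycles = [
        (base, {i for i, t in enumerate(targets) if t[pos] == base})
        for pos in range(n)       # one round of cycles per position
        for base in BASES         # four cycles per round, one per base
    ]
    return targets, cycles


targets, cycles = all_kmer_schedule(3)
built = run_cycles(len(targets), cycles)
```

With the caption's two cycles, `caption_result` holds `GA` in the first spot, `A` in the third and sixth, `G` in the fourth, and nothing in the second and fifth. For n = 3, the schedule has 12 cycles and `built` matches `targets`, which is the sense in which 4 × n cycles suffice for all 4^n sequences.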



subset of spots, illuminated the others to remove the blocking groups, and attached another nucleotide. On the spots that were unmasked in both steps, dinucleotides were formed. By repeating this process, they could build up different oligonucleotides on each spot.

The resulting chip is known as a DNA microchip or oligonucleotide array, although these terms and “DNA microarray” are often used interchangeably. In fact, the generic term “microarray” can be used to refer to any kind of DNA or oligonucleotide microarray. The technology is so miniaturized that about 300,000 oligonucleotides can be built on a chip only 1.28 × 1.28 cm (about ½ in. square). And the process is so efficient that a set of $4^n$ different oligonucleotides can be built in only $4 \times n$ cycles. So if our goal is to generate all the possible 9-mers ($4^9$, or about 250,000 different oligonucleotides), we can do it in only $4 \times 9 = 36$ cycles.

How long must an oligonucleotide be to uniquely identify one human gene product in a mixture of all the others? Knowing the sequence of the human genome helps us answer this question with great accuracy. However, even without that information, we can do a calculation to give us a minimum estimate. A given sequence of $n$ bases will occur in a DNA about every $4^n$ bases. In other words, a DNA sequence needs to be $n$ bases long to occur about once in a DNA $4^n$ bases long. Thus, we need to solve the following equation for $n$ to find the minimum size of an oligonucleotide we would expect to find only once in the whole human genome, which may be as much as $3.5 \times 10^9$ bases long:

$$4^n = 3.5 \times 10^9$$

The answer is that if $n = 16$, $4^n > 3.5 \times 10^9$. So our oligonucleotides need to be at least 16 bases long, and that would require $4 \times 16 = 64$ cycles to build them all on an oligonucleotide array. Again, however, this is a minimum estimate, so it would be a good idea to start with longer oligonucleotides to be reasonably sure that they occur only once in the human genome and therefore uniquely identify human genes.

Even before the publication of the sequence of the first human chromosome, scientists at Affymetrix, Inc. were already producing microchips containing 25-mers designed to recognize single genes. They based their design on the sequence that was available, including the many ESTs already in the database. To enhance the reliability of their chips, they included multiple oligonucleotides designed to hybridize to single transcripts, so the results obtained with each of these oligonucleotides could be checked against one another.

The oligonucleotides on a microchip or the cDNAs on a microarray can be hybridized to labeled RNA isolated from cells (or to corresponding cDNAs) to see which genes in the cell were being transcribed. For example, consider a study by Patrick Brown and colleagues in which they used the DNA microarray technique to examine the effect of serum on the RNAs made by a human cell. They isolated RNA from cells grown in the presence and absence of serum, then reverse transcribed the two RNA samples in the presence of nucleotides tagged with fluorescent dyes, so the cDNA products would be labeled with the fluorescent tags. They used a green-fluorescing nucleotide to label the cDNA from serum-deprived cells, and a red-fluorescing nucleotide to label the cDNA from serum-stimulated human cells. Then they mixed the cDNAs, hybridized them to DNA microarrays containing unlabeled cDNAs corresponding to 8613 different human genes, and detected the resulting






Chapter 25 / Genomics II: Functional Genomics, Proteomics, and Bioinformatics



fluorescence. Figure 25.3 shows the same region of the microarray from triplicate hybridizations. The red spots correspond to genes that are turned on by serum, and the green spots represent genes that are active in serum-deprived cells. The yellow spots result from hybridization of both probes to the same spot (the green and red fluorescence together produce a yellow color). Thus, the yellow spots correspond to genes that are active in both the presence and absence of serum.

Figure 25.3 Using a DNA chip. Brown and colleagues made cDNAs from RNAs from serum-starved and serum-stimulated human cells. They labeled the cDNAs corresponding to RNAs from serum-starved cells with a green fluorescent nucleotide; they labeled the cDNAs corresponding to RNAs from serum-stimulated cells with a red fluorescent nucleotide. Then they hybridized these fluorescent cDNAs together to DNA chips containing cDNAs corresponding to over 8600 human genes. The figure shows the same part of the DNA chip from three different hybridizations. The red spots (e.g., spots 2 and 4) correspond to genes that are more active in the presence of serum. The green spots (e.g., spot 3) correspond to genes that are more active in the absence of serum. The yellow spots (e.g., spot 1) correspond to genes that are roughly equally active in the presence or absence of serum. (Source: Iyer, V.R., M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C. Lee, et al., The transcriptional program in the response of human fibroblasts to serum. Science 283 (1 Jan 1999) f. 1, p. 83. Copyright © AAAS.)

Microarrays allow one to examine changes in gene expression in systems much more complex than the one we have just described. For example, our knowledge of the complete yeast genome sequence has enabled molecular biologists to use DNA chips to analyze the expression of every yeast gene at once, under a variety of conditions.

In another example, Kevin White and colleagues used DNA chips in 2002 to follow the expression of 4028 Drosophila genes during 66 distinct periods throughout the fly’s life cycle. Figure 25.4a shows the 66 developmental stages at which RNAs were collected for gene expression analysis. Notice that almost half (30) of these time points were in the embryonic phase of development, in which gene expression changes most rapidly. In fact, early in the embryonic phase, when gene expression is most dynamic, RNAs were collected every half-hour. This analysis yielded several conclusions:

• A large number of genes (3219) experienced a substantial change in expression (four-fold or more) during the fly’s life cycle. Figure 25.4b shows all of these developmentally regulated genes, ordered by time of onset of the first increase in expression. That is, the topmost genes in the figure were stimulated earliest in the life cycle, and the bottommost genes were stimulated last.

• More than 88% of the developmentally regulated genes are active during the first 20 h of development, which is before the end of the embryonic phase (see Figure 25.4c).

• RNAs from about 33% of the developmentally regulated genes are already present at the very earliest time point (Figure 25.4c). These represent maternal genes, or maternal effect genes, those that are expressed during oogenesis in the mother. Thus, the maturing oocyte either transcribes these genes or receives their transcripts from surrounding nurse cells so the mRNAs are already present in the egg and are available for translation as soon as fertilization occurs.

• As illustrated in Figure 25.4d, expression of some genes is maintained throughout the life cycle, whereas expression of others peaks and declines. In particular, as further illustrated in Figure 25.4e, genes that reach peak expression during early embryonic life tend to peak again in early pupal development, whereas genes that peak in the late embryonic phase tend to achieve another peak in late pupal development. A related phenomenon, not illustrated here, is that genes that peak in larval development tend to reach another peak of expression during adult life.

• Genes encoding components of a given supramolecular complex tended to be coexpressed. Thus, the genes encoding the ribosomal proteins tended to be regulated coordinately, as did the genes encoding the proteins in the mitochondrion.

• Genes encoding proteins with related functions tended to be coexpressed, even if the proteins did not form complexes. Thus, genes encoding transcription factors, or cell cycle regulators, tended to be expressed together.

• Coexpression of some genes was tissue-specific. For example, one cluster of 23 coregulated genes included eight genes that were already known to be expressed in muscle cells. Upon further examination, the control regions of 15 of the genes in this cluster had pairs of binding sites for the transcription factor dMEF2, which is known to activate genes in differentiating muscle cells. Seven of the genes in the cluster had unknown function, and six of these had dMEF2-binding sites and were expressed in differentiating muscle. Thus, this analysis allowed White and colleagues to assign a function in muscle differentiation to these six unknown genes. This is important because it is very difficult to determine the function of genes based solely on their sequences. The additional clues about timing and location of expression are a tremendous help. Indeed, they allowed White and colleagues to assign functions to 53% of the genes they analyzed.






[Figure 25.4, panels (a)–(c); image not reproduced. Recoverable labels: (a) RNA collections over the embryo (0–24 h), larva, pupa, and adult (0–40 days) stages, with fertilization, blastoderm, gastrulation, hatching, metamorphosis, and eclosion marked; (b) gene clusters labeled Muscle I, Muscle II, Eye, Testis, and Ovary, with a fold-induction color scale running from <0.25 to >4; (c) fraction of genes used versus developmental time (hours in the inset, days in the main plot), with maternal genes marked.]



Figure 25.4 Patterns of expression of Drosophila genes during

development. (a) Outline of RNA collection periods. White and

colleagues collected RNAs from whole animals at the indicated times

during development (E, embryonic; L, larval; P, pupal; A, the first

40 days of the adult phase). The embryonic period is expanded to

show all of the overlapping collection periods. They purified poly(A)+

RNA by oligo(dT)-cellulose chromatography and made fluorescent

cDNAs by reverse transcribing the poly(A)+ RNAs in the presence of

a fluorescent nucleotide. Then they hybridized the fluorescent cDNA

from a given time point to a microarray and measured the extent of

hybridization. They normalized all such hybridization values against

the extent of hybridization of a reference standard cDNA prepared

from a mixture of RNAs from all phases of the life cycle. (b) Gene






expression profiles. The profiles of 3219 genes whose expression

levels changed by more than four-fold during the fly life cycle are

arranged in order of the onset of the first increase in abundance of

transcript. The developmental phase is indicated at top, with the same

abbreviations and color coding as in (a). The expression level is indicated

by color, as indicated at bottom: blue stands for low expression and

yellow stands for high expression. (c) Graphic representation of the

cumulative fraction of genes that have shown a strong increase in

expression. Note that a large fraction (about 33%) of genes are

already represented by a large amount of RNA at the earliest time

point. These are labeled maternal genes. The inset is an expansion of

the first 20 h of the embryonic phase, which also shows the large

proportion of transcripts already present in the first hour of development.
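The ordering in panel (b) and the cumulative curve in panel (c) can be sketched as follows. This is only an illustration under stated assumptions: the function names and toy profiles are hypothetical, each profile is taken to be a list of positive, reference-normalized hybridization ratios, and "onset" is defined here as the first time point at least four-fold above the gene's minimum, which is not necessarily the criterion White and colleagues applied.

```python
def is_regulated(ratios, fold=4.0):
    """True if the normalized signal varies by at least `fold` overall."""
    return max(ratios) / min(ratios) >= fold


def onset_index(ratios, fold=4.0):
    """Index of the first time point at least `fold` above the minimum."""
    floor = min(ratios)
    for i, ratio in enumerate(ratios):
        if ratio >= fold * floor:
            return i
    return None


def cumulative_fraction(profiles, n_times, fold=4.0):
    """Fraction of regulated genes whose onset falls at or before each time."""
    onsets = [
        onset_index(ratios, fold)
        for ratios in profiles.values()
        if is_regulated(ratios, fold)
    ]
    if not onsets:
        return [0.0] * n_times
    return [sum(o <= t for o in onsets) / len(onsets) for t in range(n_times)]


# Hypothetical toy profiles (ratio to the reference sample at four time points).
toy_profiles = {
    "maternal": [8.0, 4.0, 1.0, 1.0],  # already abundant at the first point
    "early": [1.0, 5.0, 6.0, 6.0],
    "late": [1.0, 1.0, 1.0, 4.0],
    "flat": [1.0, 1.5, 1.0, 1.0],      # never changes four-fold; excluded
}
curve = cumulative_fraction(toy_profiles, n_times=4)
```

Sorting the regulated genes by `onset_index` would give an ordering analogous to panel (b), and a gene like the toy "maternal" profile contributes to the curve from the very first time point, as the maternal genes do in panel (c).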






[Figure 25.4, panels (d) and (e); image not reproduced. (d) Fold change (–16 to 16) across the E, L, P, and A phases for CG5958 (induced and maintained), Amalgam (early embryo/early pupa), CG1733 (transiently induced), and CG17814 (late embryo/late pupa). (e) Percent of embryonic genes with a second peak in each developmental interval (L1–L3, P1–P3, A1–A3), for genes whose first embryonic peak is early (0–3 h) or late (9–19 h).]

Figure 25.4 Continued (d) Expression patterns of four selected genes. At upper left, gene CG5958 shows an induction in early embryonic phase to a high level that is largely maintained throughout the life cycle. At upper right, the Amalgam gene shows an induction in the early embryonic phase, a decrease in the larval phase, and a reinduction at the boundary between the larval and pupal stages. At lower left, gene CG1733 shows a distinct peak of expression at the larval–pupal boundary. At lower right, gene CG17814 shows one burst of induction that begins in the late embryonic phase and lasts through the larval phase, and a reinduction in the late pupal phase. (e) Reinduction patterns. The percent of genes expressed either early (blue) or late (red) in the embryonic phase that show a reinduction at the given times later in development. Note that the genes expressed in early embryogenesis tend to be reinduced in the early pupal stage (P1, bracket over blue bar), whereas the genes expressed in late embryogenesis tend to be reinduced in the late pupal stage (P3, bracket over red bar). (Source: Adapted from Arbeitman et al., Science 297, 2002. Fig. 1, p. 2271. © 2002 by the AAAS.)

SUMMARY Functional genomics is the study of the expression of large numbers of genes. One branch of this study is transcriptomics, which is the study of transcriptomes: all the transcripts an organism makes at any given time. One approach to transcriptomics is to create DNA microarrays or DNA microchips, holding thousands of cDNAs or oligonucleotides, then to hybridize labeled RNAs (or corresponding cDNAs) from cells to these arrays or chips. The intensity of hybridization to each spot reveals the extent of expression of the corresponding gene. With a microarray one can canvass the expression patterns (both temporal and spatial) of many genes at once. The clustering of expression of genes in time and space suggests that the products of these genes collaborate in some process. This can give clues about the functions of genes of unknown function if the unknown gene is expressed together with one or more well-studied genes.

Serial Analysis of Gene Expression In 1995, Victor Velculescu, working with Kenneth Kinzler and colleagues, developed a novel method of analyzing the range of genes expressed in a given cell. They called this method serial analysis of gene expression (SAGE). The underlying strategy of SAGE is to synthesize short cDNAs, or tags, from all the mRNAs in a cell, and then link these tags together in clones that can be sequenced to learn the nature of the tags, and therefore the nature of the genes expressed in the cell, and the extent of expression of each gene.

Figure 25.5 shows how Velculescu and colleagues carried out this strategy. First, they used a biotinylated

oligo(dT) primer to prime reverse transcription of the

mRNAs present in human pancreatic tissue, yielding double-stranded cDNAs. The goal was to reduce the size of the

cDNAs to short tags that could be ligated together and sequenced readily. Because of the shortness of the tags (9 bp

in the example in Figure 25.5), it is important to confine

them to a small region of the cDNAs to increase the chance

that they will uniquely identify one cDNA. To begin the

shortening process, Velculescu and colleagues cleaved

the biotinylated cDNAs with an anchoring enzyme (AE) to

chop off a short 3′-terminal fragment. They chose as their

anchoring enzyme NlaIII, which recognizes 4-base restriction sites and therefore yields fragments averaging 250 bp

long. They bound these biotinylated 3′-fragments to streptavidin beads, which bind biotin.

Next, they divided the bead-bound cDNA fragments into

two pools and ligated one pool to a linker (Y) and the other

pool to a second linker (Z). Both linkers contained the recognition site for a type IIS restriction endonuclease (the tagging

enzyme [TE]) that cuts 20 bp downstream of this recognition

site. The result of cleavage of the cDNA fragments with the

tagging enzyme FokI was a set of short fragments, each containing the linker (Y or Z) followed by the 4-bp anchoring

enzyme site, followed by 9 bp from the cDNA. That 9-bp

piece of cDNA is the tag. If the tagging enzyme leaves overhangs, these can be filled in to yield blunt ends.
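The tag-generation steps above amount to finding the anchoring-enzyme site closest to the 3′-end of each cDNA and keeping the 9 bp that follow it. A minimal sketch, assuming the cDNA is given as its sense strand written 5′→3′; the function name `sage_tag` and the toy sequence are hypothetical, and the enzymology (linkers, beads, blunting) is not modeled:

```python
ANCHOR_SITE = "CATG"  # NlaIII recognition site (the anchoring enzyme)
TAG_LENGTH = 9        # tag length in the example of Figure 25.5


def sage_tag(cdna, site=ANCHOR_SITE, tag_length=TAG_LENGTH):
    """Return the tag next to the 3'-most anchoring site, or None."""
    pos = cdna.rfind(site)          # the bead-bound fragment starts here
    if pos == -1:
        return None                 # no anchoring site: no tag
    start = pos + len(site)
    tag = cdna[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Hypothetical toy cDNA with two CATG sites and a short poly(A) tail.
toy_cdna = "GGCATGTTTACCATGGAGCACACCTTGAAAAAAA"
toy_tag = sage_tag(toy_cdna)
```

Only the site nearest the poly(A) end matters because the oligo(dT)-primed, biotinylated 3′-fragment is the one retained on the beads. A 4-base site is expected about once every 4^4 = 256 bp in random sequence, in line with the roughly 250-bp average fragment size quoted above.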






[Figure 25.5 diagram; image not reproduced. Panel steps: (a) Synthesize double-stranded cDNAs using a biotinylated oligo(dT) primer. (b) Cleave with anchoring enzyme (AE); bind 3′-terminal fragments to streptavidin beads. (c) Divide in half; ligate to linkers (Y and Z). (d) Cleave with tagging enzyme (TE), and blunt the ends. (e) Ligate and amplify by PCR with primers Y and Z. (f) Cleave with anchoring enzyme; isolate ditags; join together and clone.]



Figure 25.5 Serial analysis of gene expression (SAGE). (a) Double-stranded cDNAs are formed from cellular mRNAs, using biotinylated

oligo(dT) to prime first-strand cDNA synthesis. Orange balls represent

biotin. (b) Biotinylated cDNAs are cleaved with an anchoring enzyme

(AE, NlaIII in this case), and the biotinylated 3′-end fragments are

bound to streptavidin beads (blue). (c) The bead-bound fragments

are divided into two pools; the fragments in one pool are ligated to

linker Y (blue) and the fragments in the other pool are ligated to linker

Z (pink). (d) The fragments are cleaved with the tagging enzyme (TE),

and ends are filled in if necessary to create blunt ends. In this case,

the tagging enzyme is FokI, which leaves 9-bp tags attached to the

linkers. The tag attached to linker Y is represented by the arbitrary

sequence CATCATCAT and its complement highlighted in yellow, and

the tag attached to linker Z is represented by the arbitrary sequence

GAGGAGGAG and its complement (light purple highlight).



(e) Tag-containing fragments are blunt-end-ligated together and

amplified by PCR with primers that hybridize to primer Y and primer

Z regions in each linker. Only fragments ligated with tags joined tail to

tail (ditags) will be amplified by PCR. (f) The amplified ditag-containing

fragments are cleaved with the anchoring enzyme to yield ditags with

sticky ends. The ditags are ligated together to form concatemers,

which are cloned. Part of a concatemer of ditags is shown, with the

4-base recognition sites for the anchoring enzyme shown in green.

Note that these 4-base sites set off each ditag so it can be recognized

easily. The clones are then sequenced to discover which tags are

represented, and in what quantity. This tells which genes are

expressed, and how actively. (Source: Adapted from Velculescu, V.E.,

L. Zhang, B. Vogelstein, and K.W. Kinzler, Serial analysis of gene expression.

Science 270:484, 1995.)






Velculescu and colleagues’ next task was to ligate the

tags together, along with defined DNA so they could tell

where one tag left off and another began. To do this, they

blunt-end-ligated the tagged fragments together to form

fragments with two tags abutting each other in the middle

(forming a ditag) and linkers on each end. The linkers contain sites that are complementary to a pair of primers that

can be used to amplify the whole fragment by PCR. After

the PCR amplification, Velculescu and colleagues cleaved

the products with the anchoring enzyme, ligated these

restriction fragments together, and cloned the products.

Now the ditags can be easily identified because each one is

flanked by the 4-bp anchoring enzyme recognition sites.

And, of course, half of each ditag belongs to one tag, and

half to the other. Clones with at least 10 tags (some had

more than 50) can be identified by PCR analysis and

sequenced. If enough clones are sequenced, we can get an

idea of the range of genes expressed, and tags that show up

repeatedly indicate genes that are very actively expressed.
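Reading tags back out of a sequenced concatemer can be sketched as below. This is a simplified illustration, not the authors' software: the function names are hypothetical, every ditag is assumed to be exactly 18 bp (two 9-bp tags) bounded by CATG sites, and the second half of each ditag is reverse-complemented on the assumption that the two tags were joined tail to tail.

```python
import re
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII site that sets off each ditag
TAG_LENGTH = 9
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_from_concatemer(concatemer, tag_length=TAG_LENGTH):
    """Split a concatemer into ditags, then each ditag into two tags."""
    # A ditag is 2 * tag_length bases between two anchoring sites; the
    # lookahead lets one site close a ditag and open the next.
    pattern = re.compile(
        f"{ANCHOR_SITE}([ACGT]{{{2 * tag_length}}})(?={ANCHOR_SITE})"
    )
    tags = []
    for ditag in pattern.findall(concatemer):
        tags.append(ditag[:tag_length])
        tags.append(reverse_complement(ditag[tag_length:]))
    return tags


def count_tags(concatemers):
    """Tally tag occurrences over all sequenced clones."""
    counts = Counter()
    for concatemer in concatemers:
        counts.update(tags_from_concatemer(concatemer))
    return counts


# Toy clone built from the two abundant pancreatic tags quoted in the text.
tag_a, tag_b = "GAGCACACC", "TTCTGTGTG"
toy_clone = (
    ANCHOR_SITE + tag_a + reverse_complement(tag_b)
    + ANCHOR_SITE + tag_a + reverse_complement(tag_a)
    + ANCHOR_SITE
)
toy_counts = count_tags([toy_clone])
```

The resulting counts stand in for expression levels: a tag seen many times across clones points to an abundantly expressed gene, provided enough clones are sequenced.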

Velculescu and colleagues’ examination of expression

in the human pancreas by SAGE had predictable, and

therefore encouraging, results. The most common tags

(GAGCACACC and TTCTGTGTG) corresponded to the

genes for procarboxypeptidase A1 and pancreatic trypsinogen 2, respectively. These are two abundantly expressed

pancreatic proenzymes, which, after cleavage to the mature

enzyme forms, digest proteins in the small intestine. Many

other familiar pancreatic genes were identified among the

plentiful tags, but many of the tags did not match any gene

sequences in the database, so their identities were unknown. As the database expands to include all human

genes, all tags should at least be correlated to genes, even if

the functions of some of those genes remain obscure.

SUMMARY SAGE allows us to determine which



genes are expressed in a given tissue and the extent

of that expression. Short tags, characteristic of particular genes, are generated from cDNAs and ligated

together between linkers. The ligated tags are then

sequenced to determine which genes are expressed

and how abundantly.



Cap Analysis of Gene Expression (CAGE) SAGE is a useful method for global analysis of gene expression, but it focuses on the 3′-ends of transcripts. Sometimes it is necessary to identify the 5′-ends of transcripts, for example, if one is interested in identifying promoters on a genomic scale. In that case, a related method known as cap analysis of gene expression (CAGE, Figure 25.6) is available.

[Figure 25.6 diagram; image not reproduced. Panel steps, starting from capped, polyadenylated mRNA: (a) reverse transcription, giving full-length and non-full-length cDNAs; (b) biotinylation; (c) RNase I; (d) magnetic bead capture; (e) base hydrolysis; (f) biotin-linker ligation (linker carrying an MmeI site); (g) second-strand synthesis; (h) MmeI digestion, followed by magnetic bead capture and ligation to linker 2 (XbaI site), then XmaJI and XbaI digestion to release the 20-nt tag.]

Figure 25.6 Use of CAGE to produce 20-nt tags representing the 5′-ends of mRNAs. The procedure is described in the text. After the tags are produced as shown here, they can be ligated together via their identical sticky ends to form concatemers, cloned, and sequenced.

The CAGE procedure starts with reverse transcription (RT), as SAGE does, but with two important differences that ensure production of full-length cDNAs that copy the mRNA all the way to the 5′-end. First, the RT reaction includes a disaccharide known as trehalose. This substance stabilizes reverse transcriptase at high temperature, so the RT reaction can be run at 60°C. This elevated temperature weakens mRNA secondary structure that otherwise would stop the RT reaction before it reached the 5′-end of the mRNA. Second, a cap trapper method is used: The caps of the mRNAs in the mRNA–cDNA hybrids are tagged with biotin. As we will see, this allows hybrids with full-length cDNAs to be purified away from hybrids containing less-than-full-length cDNAs.

Figure 25.6 shows how the tagging works. First, the RT priming is done, not with oligo(dT) alone, but with oligo(dT) preceded by a stretch of random nucleotides that do not hybridize with the poly(A) tail. The importance of this feature will become apparent shortly. After first-strand cDNA synthesis, both ends of the mRNA are tagged with biotin by reacting the RNA–DNA hybrid with a biotin-containing reagent that attaches to diols. There are only two diols (adjacent hydroxyl groups) in a capped mRNA: the free 2′- and 3′-hydroxyl groups in the cap and in the 3′-terminal nucleotide.

One would like to tag just the cap, but the 3′-terminal nucleotide is unavoidably tagged in the same step. But that problem is resolved in the next step, in which the hybrids are treated with RNase I. The RNase degrades any single-stranded RNA that is not hybridized to the cDNA. Thus, it not only removes the biotin tag from any hybrids that contain incomplete cDNAs, it also removes the biotin tag from the 3′-hydroxyl group at the end of every mRNA’s poly(A) tail, which cannot hybridize to the random tail at the beginning of the primer. After the RNase treatment, the only remaining biotin-tagged hybrids are those containing full-length cDNAs, and these are collected using magnetic beads coated with the biotin-binding protein streptavidin. After the hybrids are purified, their mRNA parts, including the biotin-tagged caps, are destroyed by base hydrolysis, leaving just the single-stranded cDNAs.

Next, the full-length, single-stranded cDNAs are ligated to biotin-tagged linkers that contain a recognition site for the tagging enzyme MmeI, which dictates cleavage 20 and 18 nt away. Thus, after second-strand cDNA synthesis, the tagged cDNAs can be cut with MmeI to yield 20-nt tags that can be purified via their biotin parts, and ligated to a second linker (linker 2) via their 2-nt overhangs. Linker 1 also contains a recognition site for XmaJI and linker 2 contains a recognition site for XbaI, so the tags can be cut with those two enzymes, ligated together into concatemers, cloned, and sequenced as in the SAGE procedure.

The 20-nt tags would be expected to be found every $4^{20}$, or about $1.1 \times 10^{12}$, base-pairs. Thus, since the human genome contains only about $3 \times 10^9$ bp, most of the 20-nt tags should identify a unique sequence in even the large human genome, which can be found by consulting the known human genome sequence. This sequence should begin with the transcription start site, so the promoter should be in the immediate neighborhood. When Piero Carninci and colleagues performed this kind of CAGE analysis on mouse mRNAs from whole brain and three distinct brain regions, they found many CAGE tags that mapped close to previously mapped start sites, but many more that did not. This could help identify a number of new promoters and alternative start sites.
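The uniqueness argument used here for 20-nt CAGE tags, and earlier for array oligonucleotides, is simple enough to check numerically. The sketch below assumes a random genome with equal base frequencies, which real genomes are not, so the results are minimum estimates; the function names are hypothetical.

```python
def min_unique_length(genome_size):
    """Smallest n for which 4**n exceeds genome_size."""
    n = 1
    while 4 ** n <= genome_size:
        n += 1
    return n


def expected_occurrences(n, genome_size):
    """Expected chance matches to one specific n-base sequence."""
    return genome_size / 4 ** n


# Array oligonucleotides: genome taken as 3.5e9 bases, as in the text.
array_oligo_length = min_unique_length(3.5e9)
array_cycles = 4 * array_oligo_length

# CAGE tags: 20 nt against a genome of about 3e9 bp.
cage_chance_hits = expected_occurrences(20, 3e9)
```

Under these assumptions the minimum unique length comes out at 16 bases (64 synthesis cycles), and a 20-nt tag is expected to match by chance far less than once per genome, which is why most CAGE tags can be placed at a single genomic position.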

SUMMARY Cap analysis of gene expression (CAGE)



gives the same information as SAGE about which

genes are expressed, and how abundantly, in a given

tissue. Because it focuses on the 5′-ends of mRNAs,

it also allows the identification of transcription start

sites and, therefore, helps locate promoters.



Whole Chromosome Transcriptional Mapping Transcriptomics studies have become sophisticated enough that they

can map transcripts with great accuracy to sites in whole

chromosomes. This kind of study, called transcriptional

mapping, is shedding light on a paradox mentioned earlier

in this chapter: The number of protein-encoding genes in

humans is scarcely larger than the number of such genes in

a lowly roundworm! How can we reconcile that fact with

the vastly greater complexity of human beings? One emerging answer is that transcripts of protein-encoding genes

make up only a small fraction of the whole human transcriptome. And the closer we look at this problem, the more

complex the human transcriptome becomes.

If we consider only exons in protein-coding genes, we

would predict that only 1–2% of the whole human genome

would be expressed in RNAs found in the cytoplasm of

cells. However, as early as 2002, Thomas Gingeras and colleagues, using microarrays to study expression of human

chromosomes 21 and 22, discovered that polyadenylated

RNAs in the cytoplasm of human cells covered about an

order of magnitude more of those two chromosomes than

could be accounted for by protein-encoding exons. This

excess of unexpected transcripts has been dubbed transcripts of unknown function, or TUFs. All of the transcribed regions (exons and TUFs alike) detected by such

arrays are called transcribed fragments, or transfrags.

Furthermore, approximately two-thirds of the transcripts in human cells and hamster cells have been reported
to be nonpolyadenylated [poly(A)−]. These poly(A)− transcripts therefore represent another chunk of the human
genome, whose extent is unknown, but apparently large.

Taken together, these findings suggest that protein-encoding

exons make up only a small fraction of the total genomic

sequences represented by cytoplasmic RNAs.

To investigate this intriguing conclusion further, Gingeras

and colleagues used high-density oligonucleotide arrays

with 25-mers spaced on average only 5 bp apart, thus providing an average of a 20-bp overlap. Why use such a high

density? For one thing, it allows one to detect shorter

exons, and, for another, hybridizations to overlapping oligonucleotides give greater confidence that transcription in

that region really occurs. The oligonucleotides on the arrays
came from the sequences of ten human chromosomes (6, 7,
13, 14, 19, 20, 21, 22, X, and Y), representing 30% of the
total length of the human genome. To the arrays, Gingeras
and colleagues hybridized double-stranded cDNAs representing cytoplasmic poly(A)+ RNAs from eight different
human cell lines, or cytoplasmic and nuclear poly(A)+ and
poly(A)− RNAs from a single cell line (HepG2). In all cases,

transfrags that overlapped pseudogenes or repetitive DNA

regions were dropped from consideration.
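The spacing arithmetic behind this array design can be sketched in a few lines of Python. This is an illustration only: the function name `tile_probes` and the toy sequence are assumptions made for the example, not the procedure Gingeras and colleagues used to design their arrays.

```python
def tile_probes(sequence, probe_len=25, step=5):
    """Return (start, probe) pairs for probes of probe_len spaced step bp apart."""
    last_start = len(sequence) - probe_len
    return [(s, sequence[s:s + probe_len]) for s in range(0, last_start + 1, step)]


# Toy 60-bp sequence (hypothetical; the real arrays tiled whole chromosomes).
toy = "ACGT" * 15
probes = tile_probes(toy)

# Adjacent probes share probe_len - step bases: 25 - 5 = 20.
overlap = 25 - 5
first, second = probes[0][1], probes[1][1]
shared = first[5:] == second[:overlap]  # the 20 shared bases are identical
```

With 25-mers stepped every 5 bp, each interior base is covered by five different probes, which is why agreement among neighboring probes gives more confidence than a single positive signal.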

About 9% of more than 74 million probe pairs (both
strands) hybridized to cDNAs from poly(A)+ RNA, per

cell line. Applying a “1 of 8” rule, in which a probe pair

needs to hybridize to a cDNA from only one of the eight

cell lines, the percentage of positive probes rose to 16.5%.

This is the “1 of 8 map.” An average of 4.9% of the nucleotides in the 10 chromosomes were expressed as cytoplasmic RNA in each cell line. In the 1 of 8 map, this figure

rose to 10.1%. These findings suggest that about 10.1% of

the sequences in the 10 human chromosomes are expressed

as polyadenylated RNA in the cytoplasm in at least one



cell line. Furthermore, the difference between 4.9% and

10.1% indicates that considerable cell-line-specific transcription occurs.
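The difference between a per-cell-line figure and a "1 of 8" figure is the difference between an average and a union. The following minimal sketch uses made-up detection calls for three cell lines and ten probes (the data and function names are hypothetical) to show why the union is always at least as large as the per-line average.

```python
def per_line_fraction(calls):
    """Mean fraction of probes called positive in a single cell line."""
    total = sum(sum(line) for line in calls)
    return total / (len(calls) * len(calls[0]))


def one_of_n_fraction(calls):
    """Fraction of probes positive in at least one cell line (the "1 of n" rule)."""
    n_probes = len(calls[0])
    hits = sum(any(line[i] for line in calls) for i in range(n_probes))
    return hits / n_probes


# Hypothetical calls: 3 cell lines x 10 probes (1 = hybridization detected).
calls = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
]
average = per_line_fraction(calls)   # 4 positive calls out of 30
union = one_of_n_fraction(calls)     # probes 0, 1, and 2 out of 10
```

The larger the gap between the two numbers, the more of the detected transcription is specific to individual cell lines rather than shared by all of them.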

Figure 25.7 shows the proportions of each of the 10

chromosomes from which cytoplasmic polyadenylated

transcripts are made. Such transcripts from intergenic regions and introns are, by definition, unannotated. And

these regions make up the majority (57%) of the transcripts

from the 10 chromosomes as a whole (central pie chart).

The annotated transcripts overlap with one of three annotations: Known, which is a combination of two exon databases; mRNA, which contains the mRNAs from a third

database that do not overlap with the Known exons; and

EST, which contains all publicly available ESTs that do not

overlap with either the Known or mRNA databases.
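The category assignment described above amounts to checking each transfrag against the annotation sets in a fixed priority order. The sketch below is one way to express that logic; the interval coordinates, the `classify` function, and the rule that an unannotated transfrag inside a gene's bounds counts as intronic are assumptions made for illustration, not details taken from the study.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def classify(transfrag, known, mrna, est, genes):
    """Assign one category, checking annotation sets in priority order."""
    for label, intervals in (("Known", known), ("mRNA", mrna), ("EST", est)):
        if any(overlaps(transfrag, iv) for iv in intervals):
            return label
    # Unannotated: inside a gene's bounds means intronic, otherwise intergenic.
    if any(overlaps(transfrag, g) for g in genes):
        return "Intronic"
    return "Intergenic"


# Hypothetical coordinates on a toy chromosome.
genes = [(100, 1000)]
known = [(100, 200), (800, 1000)]
mrna = [(1500, 1600)]
est = [(2000, 2100)]
transfrags = [(150, 180), (400, 450), (1550, 1580), (2050, 2060), (3000, 3050)]
labels = [classify(t, known, mrna, est, genes) for t in transfrags]
```

Because the sets are checked in order, a transfrag is counted only once, which is what lets the wedges of each pie chart sum to 100%.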

Figure 25.7 Transcription maps of 10 human chromosomes.
The percentages of different categories of sequences found in
polyadenylated cytoplasmic transcripts in the 1 of 8 map are
represented by the wedges of each pie chart. Each of the chromosomes
represented by the small pie charts is identified in boldface, as is the
collective of all 10 chromosomes (large pie chart in the middle).
Sequence categories are given in the collective pie chart, and the
same color coding is used throughout. The unannotated sequences are
intergenic and intronic. The annotated sequences are designated
Known, mRNAs, and ESTs. For the combination of all 10 chromosomes,
the wedges are: Known, 26%; mRNA, 5%; EST, 12%; intronic, 26%;
intergenic, 31%. [Pie charts for the individual chromosomes (6, 7, 13,
14, 19, 20, 21, 22, X, and Y) are not reproduced here.]
(Source: Cheng, J., T.R. Gingeras, et al. 2005.
Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution.
Science 308:1149–54.)

What about poly(A)− transcripts? For this analysis,
Gingeras and colleagues focused on a single cell line,
HepG2. They looked for stable poly(A)+, poly(A)−, and
bimorphic transcripts in both the nucleus and cytoplasm of
these cells. (Bimorphic transcripts start out polyadenylated,
but then lose their poly(A) tail.) They found that fully
15.4% of nucleotides in the 10 chromosomes are represented
in one of these classes of transcripts (almost half of which
are poly(A)−). Thus, about 10 times as much of the genome
is represented in stable transcripts as we would expect
on the basis of exons alone. Of course, the majority of most
human genes is in introns, so this result may not sound
surprising at first. But if spliced-out introns have no function, we would expect them to be degraded rapidly and not
contribute so heavily to the cDNAs made from presumably
stable nuclear RNAs.

Another conclusion from this study is that about half of

the human transcriptome appears to be overlapping. There

are two kinds of overlaps: those on the same strand, and

those on opposite strands. Of course, transcripts that overlap on opposite strands represent sense/antisense pairs,

which should invoke an RNAi response. Thus, this may

represent a kind of gene expression control mechanism.
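The two kinds of overlap can be told apart with a simple interval comparison plus a strand check. The following sketch uses invented transcript coordinates and a hypothetical helper, `overlap_kinds`, to show the idea; it is not the analysis pipeline used in the study.

```python
from itertools import combinations


def overlap_kinds(transcripts):
    """Yield (name_a, name_b, kind) for every overlapping pair of transcripts."""
    for a, b in combinations(transcripts, 2):
        if a["start"] < b["end"] and b["start"] < a["end"]:
            same = a["strand"] == b["strand"]
            yield a["name"], b["name"], "same strand" if same else "sense/antisense"


# Hypothetical transcripts on one chromosome (half-open coordinates).
transcripts = [
    {"name": "t1", "start": 0, "end": 100, "strand": "+"},
    {"name": "t2", "start": 50, "end": 150, "strand": "+"},
    {"name": "t3", "start": 120, "end": 200, "strand": "-"},
    {"name": "t4", "start": 300, "end": 400, "strand": "-"},
]
pairs = list(overlap_kinds(transcripts))
```

Here t1 and t2 overlap on the same strand, t2 and t3 form a sense/antisense pair, and t4 overlaps nothing.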

Studies like this that show abundant cytoplasmic
poly(A)+ and poly(A)− transcripts of non-exon regions

may help to explain the differences between organisms.

Although the exons of humans and chimpanzees are extremely similar, the non-exon regions have diverged considerably more. And transcription of those regions may give

rise to some of the differences we see in the two species.



SUMMARY High-density whole chromosome transcriptional mapping studies have shown that the

majority of sequences in cytoplasmic polyadenylated RNAs derive from non-exon regions of 10 human chromosomes. Furthermore, almost half of the

transcription from these same 10 chromosomes is

nonpolyadenylated. Taken together, these results indicate that the great majority of stable nuclear and

cytoplasmic transcripts of these chromosomes

comes from regions outside the exons. This may

help to explain the great differences between species, such as humans and chimpanzees, whose exons

are almost identical.



Genomic Functional Profiling

The ultimate goal of genomic functional profiling is to determine the pattern of expression of all the genes in an organism at all stages of the organism’s life. That is a daunting task

even in the simplest of eukaryotes, but it is even more difficult in complex multicellular organisms. So far, the puzzle

for each organism is being put together piece by piece, with

each research group contributing its own piece. Let us consider some general techniques for attacking the problem.

Deletion Analysis Once all the genes in a genome have

been identified, one can investigate what happens when






each of them is removed. That kind of experiment is ethically

impossible in humans, of course, but it can be done in other

vertebrates as their genomes are completely sequenced—at

least in principle. Logistical problems may delay this kind of

analysis of a genome as large as that of a vertebrate, but the

yeast genome has already been profiled in this way.

In 2002, a large consortium of investigators led by Ronald

Davis reported that they had generated a set of yeast

mutants, in each of which one gene had been replaced with

an antibiotic resistance gene flanked by 20-mer sequences

that were different for each replaced gene. Thus, each gene

replacement has a “molecular barcode” so it can be

uniquely identified. In all, these investigators replaced over

96% of the annotated ORFs in Saccharomyces cerevisiae.

Next, they examined the mutants for ability to grow in a

mixed culture under six different conditions: high salt; sorbitol; galactose; pH 8; minimal medium; and the antifungal

agent nystatin. They also examined gene expression under

each of these conditions by hybridization of RNA to oligonucleotide microarrays.

To do this genomic functional profile, Davis and colleagues grew a mixed culture of all 5916 mutants under

each of the conditions and collected cells at various times

and tested for each barcode by hybridization to an oligonucleotide array containing sequences complementary to

the barcodes. If a gene is important for dealing with a given

condition, such as the presence of galactose, then mutants

lacking that gene should disappear rapidly from the mixture when that condition is imposed. In fact, the rate at

which the mutant disappears should correlate with the importance of the deleted gene in dealing with the condition.
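One simple way to turn "how fast a mutant disappears" into a number is to fit a straight line to the logarithm of its barcode signal over time. The sketch below does this with an ordinary least-squares slope. The strain names, signal values, sampling times, and the use of array signal as a proxy for abundance are all assumptions made for illustration, and this is not necessarily the scoring method Davis and colleagues used.

```python
import math


def depletion_rate(times, signals):
    """Least-squares slope of log2(signal) against time (negative = depleting)."""
    logs = [math.log2(s) for s in signals]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return sxy / sxx


# Hypothetical barcode signals (arbitrary units) sampled at four time points (h).
times = [0, 5, 10, 15]
pool = {
    "neutral": [1000, 1000, 1000, 1000],
    "mild": [1000, 500, 250, 125],         # halves every 5 h
    "severe": [1000, 250, 62.5, 15.625],   # quarters every 5 h
}
rates = {name: depletion_rate(times, sig) for name, sig in pool.items()}
ranked = sorted(rates, key=rates.get)  # most rapidly depleted first
```

Under these assumptions, the strains at the front of `ranked` are the ones whose deleted genes matter most under the condition being tested.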

When the investigators applied this kind of profiling to

yeast mutants responding to the presence of galactose, they

found several genes that were already known through years

of study to be involved in yeast metabolism of galactose.

But they also found 10 new genes that had previously not

been implicated in galactose metabolism. Wild-type yeast

and 11 of the mutants identified by the profiling as important in galactose metabolism were tested individually, and

the results are presented in Figure 25.8. As predicted, all 11

mutant strains grew more slowly in galactose than the

wild-type strain did. Their growth rates varied from 44%

to 91% of wild-type.



SUMMARY Genomic functional profiling can be

performed in several ways. In one kind of mutation

analysis, called deletion analysis, mutants are created by replacing genes one at a time with an antibiotic resistance gene flanked by oligomers that serve

as a barcode to identify each mutant. Then, a functional profile can be obtained by growing the whole

group of mutants together under various conditions

to see which mutants disappear most rapidly.






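[Figure 25.8: growth curves. Vertical axis: Growth (A600), 0 to 0.7. Horizontal axis: Time (h), 0 to 20.
Legend, in the order printed, with growth rate as a percentage of wild-type:
WT, 100%; gal4Δ, 51.3%; gal3Δ, 53.4%; gal1Δ, 49.0%; yml090wΔ, 44.2%; msn2Δ, 62.5%;
gal2Δ, 73.5%; yml077wΔ, 91.0%; ykl037wΔ, 60.1%; ftr1Δ, 85.6%; fet3Δ, 86.9%; gef1Δ, 65.4%.]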

Figure 25.8 Growth curves of various mutants discovered by

profiling to be deficient in response to galactose. Davis and

colleagues tested wild-type yeast cells and 11 deletion mutants

individually for growth in galactose-containing medium. All of the

mutants had been identified by profiling in a mixture of strains as

defective in growth with galactose. A600 (absorbance of 600-nm light)

is a measure of turbidity, which in turn is a measure of yeast growth.

(Source: Adapted from Giaever, G., A.M. Chu, L. Ni, C. Connelly, L. Riles,

S. Veronneau, et al., Functional profiling of the Saccharomyces cerevisiae genome.

Nature 418, 2002, p. 388, f. 2.)



RNAi Analysis “Knocking out” genes by mutagenesis is
laborious, and has so far been accomplished on a genome-wide scale only in yeast. But some more complex organisms are amenable to a simpler alternative: “knocking
down” genes by RNA interference (RNAi, Chapter 16).
The nematode worm Caenorhabditis elegans is particularly
susceptible to RNAi, which even affects the progeny of
treated worms; it is a self-fertilizing hermaphrodite, which
means that only one parent is required; it contains fewer
than 1000 cells; and its whole genome has been sequenced.
Thus, this organism is an obvious target for genomic functional profiling by RNAi analysis.

Birte Sönnichsen and colleagues have exploited this

technique to inactivate 19,075 of the worm’s genes, over

98% of the total, and observe the effects on early

embryogenesis—the first two cell divisions after fertilization. They injected 25-bp double-stranded RNAs into

worms and then followed the first two cell divisions in the

progeny of the injected worms by time-lapse microscopy.

They also checked for the viability of the embryos beyond

the two-cell stage and for gross phenotypic alterations

in the larval and adult stages.

In all, inactivation of 1668 genes by RNAi produced

detectable phenotypic defects. Of these 1668, inactivation

of 661 genes gave reproducible defects in the first two cell

divisions; the rest gave defects at later stages of development (Figure 25.9). (It is not surprising that inactivating

virtually all of the 661 genes that gave defects in early embryogenesis also produced embryonic lethality.)

One problem with RNAi is that it sometimes fails to inactivate genes (false-negatives), so negative results are difficult to interpret. As a check on their procedure, Sönnichsen

and colleagues evaluated the 65 genes that had previously

been shown by mutagenesis to affect the first cell division.

Of these genes, 62 (95%) had been detected by the RNAi

analysis. The three genes that had been missed the first time

were rechecked by RNAi analysis, and two were detected

the second time, increasing the success rate to 98%.
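The success rates quoted above follow directly from the counts in the text, as this short calculation shows (the variable names are chosen for the example only).

```python
known = 65        # genes previously shown by mutagenesis to affect the first division
first_pass = 62   # detected in the initial RNAi screen
rescued = 2       # of the 3 misses, detected when rechecked

rate_first = first_pass / known
rate_final = (first_pass + rescued) / known
false_negative_final = 1 - rate_final  # the one gene still missed

print(f"{rate_first:.0%}, {rate_final:.0%}")  # prints "95%, 98%"
```

Only one of the 65 benchmark genes remained undetected after retesting, a residual false-negative rate of about 1.5%.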

It is also true that mutations are detected only if they

give clear phenotypes, so the mutagenesis strategy also produces false-negatives. Thus, as another check on their procedure, the researchers compared their data to other RNAi

analyses that targeted early embryogenesis, and found that

they had detected 75% of the genes that others had found.
Accordingly, Sönnichsen and colleagues concluded conservatively that their RNAi analysis could detect 75–90% of
genes involved in early embryogenesis.

Next, the researchers grouped the 661 genes according
to their specific phenotypes. They found that inactivation
of about half (326) of the genes produced defects in embryogenesis per se, while the remainder (335) simply affected the general cell metabolism required to keep the
embryo alive long enough to divide twice. By careful annotation of the specific defects, the researchers were able to
group the former 326 genes into defects in 23 aspects of
embryogenesis, such as spindle assembly (9 genes) and sister chromatid separation (64 genes).

[Figure 25.9: two pie charts. (a) Wild-type (17,426), 89%; Mutant (1668), 9%; No dsRNA (469), 2%.
(b) Early embryo (661), 40%; Late embryo (605), 36%; Larva (268), 16%; Adult (134), 8%.]

Figure 25.9 Distribution of phenotypes from a genomic
functional profile of C. elegans using RNAi. (a) Initial screen.
Sönnichsen and colleagues targeted 19,075 genes with dsRNAs. Of
these, 17,426 (“wild-type,” blue) caused no change in phenotype in
the screens the authors used, and 1,668 (“Mutant,” red) showed an
alteration in phenotype. Four hundred sixty-nine genes (“No dsRNA,”
yellow) were not targeted in this experiment. (b) Distribution of
mutant phenotypes. Starting with the 1668 genes whose inactivation
yielded mutant phenotypes, Sönnichsen and colleagues sorted the
developmental stages at which defects were seen. For example, 661
of these (red) exhibited defects in the early embryo stage (first two
cell divisions). (Source: Adapted from Sönnichsen, et al., Full-genome RNAi
profiling of early embryogenesis in Caenorhabditis elegans. Nature. Vol. 434
(2005) f. 2, p. 465.)

SUMMARY Genomic functional analysis on complex organisms can be done by inactivating genes
via RNAi. An application of this approach targeting the
genes involved in early embryogenesis in C. elegans
has identified 661 important genes, 326 of which
are involved in embryogenesis per se.

Tissue-Specific Functional Profiling Another approach
to genomic functional profiling is to observe the tissue specificity of the genes that are inactivated by mutation or
other means. In one notable study, Lee Lim and colleagues
used two miRNAs to knock down expression of genes in
human (HeLa) cells in culture, and then looked at the profile
of genes whose expression was significantly reduced. Remarkably, miR-124, an miRNA expressed in brain, knocked
down expression of genes that are expressed at low levels
in brain, while miR-1, an miRNA expressed in muscle,
knocked down expression of genes that are expressed at
low levels in muscle. In other words, these two miRNAs
shifted the expression of genes in HeLa cells towards that
seen in the tissues in which the respective miRNAs are
prominent. This is exactly what we would expect if these
two miRNAs play a major role in turning down the expression of these same genes in vivo.

A further striking feature of this study is that the miRNAs
reduced the concentrations of the mRNAs in question, even
though, as we learned in Chapter 16, animal miRNAs
generally affect mRNA translation, not mRNA concentrations. Thus, Lim and colleagues introduced double-stranded
miRNAs into HeLa cells and then used microarrays to
measure the levels of mRNAs purified from the treated
cells. The result was clear reduction in the concentrations
of 100 or more mRNAs with each miRNA.

Here is how Lim and colleagues did their analysis, considering miR-124 first. They began by plotting the expression levels of 10,000 human genes in each of 46 tissues,
using data from a previous genome-wide survey. The histogram in Figure 25.10a contains the data for gene expression

[Figure 25.10: four panels. (a) and (b): histograms of Number of genes versus Cerebral cortex rank.
(c) and (d): P-value (Log10) for each of Tissues 1 to 46, with Brain tissues, Heart, and Skeletal muscle marked.]



Figure 25.10 Tissue-specific down-regulation by miRNAs.

(a) Ranking of expression of genes in cerebral cortex. The rankings of all

10,000 genes in each of 46 tissues are plotted as follows: The left-most

bar (rank 1) represents the genes that are expressed at a higher level in

cerebral cortex than in any other tissue; the next bar (rank 2) represents

genes that are expressed at a higher level in cerebral cortex than in any

other tissue except one, and the last bar (rank 46) represents the genes

that are expressed at a lower level in cerebral cortex than in any other

tissue. (b) Ranking of genes whose mRNA levels are significantly

decreased by miR-124. Note the skew toward genes that are poorly

expressed in cerebral cortex compared to the background in panel (a),

which gives a P-value of significance of about 10⁻¹². (c) Plot of the Log10
of P-values derived from plots like that in panel (b) for all 46 tissues. The
only tissues with significant P-values (<0.001) are brain tissues: 5, whole

brain; 6, amygdala; 7, caudate nucleus; 8, cerebellum; 9, cerebral cortex;

10, fetal brain; 11, hippocampus; 12, postcentral gyrus; and 13, thalamus.

(d) Similar to (c), except that the analysis was performed on cells to

which miR-1, instead of miR-124, had been added. (Source: Adapted

from Lim et al., Microarray analysis shows that some microRNAs downregulate

large numbers of target mRNAs. Nature. Vol. 433 (2005) f. 1, p. 770.)
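The ranking described in the caption, and the comparison of target-gene ranks against the background, can be sketched as follows. The expression values and rank lists are invented, and the one-sided Mann-Whitney U test with a normal approximation is an assumption chosen for illustration; it may not be the statistical test Lim and colleagues applied.

```python
import math


def tissue_rank(expression, tissue):
    """Rank of `tissue` for one gene: 1 = highest expression among all tissues."""
    level = expression[tissue]
    return 1 + sum(1 for v in expression.values() if v > level)


def rank_shift_p(target_ranks, background_ranks):
    """One-sided Mann-Whitney U test (normal approximation, no tie correction).

    A small P means target ranks tend to be larger (poorer expression in the
    tissue) than background ranks.
    """
    n1, n2 = len(target_ranks), len(background_ranks)
    u = sum((t > b) + 0.5 * (t == b) for t in target_ranks for b in background_ranks)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical gene measured in four tissues: cortex ranks third.
gene = {"cortex": 2.0, "liver": 5.0, "heart": 3.0, "muscle": 1.0}
cortex_rank = tissue_rank(gene, "cortex")

# Hypothetical ranks: down-regulated genes cluster near rank 46 (poorly expressed).
background = list(range(1, 47))
targets = [40, 42, 44, 45, 46]
p_skewed = rank_shift_p(targets, background)
p_null = rank_shift_p(background, background)
```

Repeating the test once per tissue and plotting the logarithm of each P-value would give a profile of the kind shown in panels (c) and (d).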


