Coarse-grained representation of protein flexibility. Foundations, successes, and shortcomings




OROZCO ET AL.



I. Introduction

Biological macromolecules, and proteins in particular, are large and flexible entities that perform their biological action when embedded in a solvent, either water or membrane phospholipids. Analysis of the current version of the Protein Data Bank (PDB; Berman et al., 2000; http://www.pdb.org) shows that proteins of known experimental structure typically contain between 500 and 7000 atoms, although some protein systems exceed 16,000 atoms (see Fig. 1). As experimental structure-resolution techniques advance, the size histogram in Fig. 1 is expected to shift to the right as large protein assemblies are incorporated into the database. However, the real problem in protein simulation originates from the need to introduce solvent in the calculation, which dramatically increases the number of particles in the system. For example, in our MoDEL (Molecular Dynamics Extended Library) database, which contains atomistic molecular dynamics (MD) simulations of representative PDB proteins in water (largely enriched in domain-sized proteins), typical simulation systems range from 10,000 to 50,000 atoms, but some systems have more than 150,000 atoms (see Fig. 1); that is, we are dealing with systems with up to half a million degrees of freedom. If we are interested in studying protein interactions, diffusion, or aggregation processes, simulated systems can easily reach many millions of degrees of freedom, making atomistic simulation very complex.

FIG. 1. Distribution of the number of protein atoms in the 2010 version of the Protein Data Bank (PDB), plotted as population versus number of atoms. The inset corresponds to the distribution of the number of atoms in solvated protein systems in our MoDEL database (http://mmb.pcb.ub.es/MoDEL).

As noted in the previous paragraph, size is a major limitation for the atomistic simulation of proteins, but the time problem is often even more dramatic than the size problem. Proteins are flexible and move continuously, and therefore biological function cannot be understood without considering protein dynamics. Unfortunately, proteins move as a result of atomic vibrations happening on the femtosecond timescale, while most biologically relevant protein motions happen in the millisecond to second range. Thus, in order to follow a biologically relevant protein motion with atomistic detail, its energy (and the associated forces) should be computed at least $10^{12}$ times. For a typical system of 50,000 atoms, the calculation of just the interatomic distances would require on the order of $10^{21}$ floating point operations, not far from the Avogadro number.
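The orders of magnitude quoted above can be checked with a short back-of-the-envelope sketch. It assumes a 1 fs integration step and a 1 ms motion (illustrative values chosen here, not taken from the text) and a naive all-pairs distance evaluation with no cutoff; the function name `md_cost_estimate` is hypothetical.

```python
def md_cost_estimate(n_atoms, motion_time_s, timestep_s):
    """Rough cost of following one motion with naive all-pairs distances.

    Returns (integration steps, atom pairs per step, total pair distances).
    """
    n_steps = motion_time_s / timestep_s
    pairs_per_step = n_atoms * (n_atoms - 1) // 2
    return n_steps, pairs_per_step, n_steps * pairs_per_step


# Assumed values: 50,000 atoms (from the text), a 1 ms motion, a 1 fs time step.
steps, pairs, total = md_cost_estimate(50_000, 1e-3, 1e-15)
print(f"{steps:.1e} steps x {pairs:.2e} pairs = {total:.2e} distance evaluations")
```

Under these assumptions the step count is of the order of $10^{12}$ and the number of distance evaluations is of the order of $10^{21}$. Production MD codes reduce the per-step cost with cutoffs and neighbor lists, so this is an upper-bound style estimate.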

Protein dynamics can be studied by different techniques, the most rigorous one being atomistic MD. In this approach, all atoms of the system are included at the same level of detail and their trajectories are determined by simple integration of Newton's (or closely related) equations of motion:

$$ m_i \vec{a}_i = -\frac{dE}{d\vec{r}_i} \tag{1} $$

$$ \vec{v}_i(t) = \vec{v}_i(t=0) + \int_{t=0}^{t=dt} \vec{a}_i(t)\,dt \tag{2} $$

$$ \vec{r}_i(t) = \vec{r}_i(t=0) + \int_{t=0}^{t=dt} \vec{v}_i(t)\,dt \tag{3} $$



where t stands for time, r for the position, v for the velocity, and a for the acceleration of atom i. The potential energy E is computed by using potential functions containing both bonded (stretching, bending, and torsion) and nonbonded interactions (van der Waals and electrostatic). Stretching and bending are represented by harmonic expressions, torsions by Fourier series, electrostatics by a Coulombic $r^{-1}$ term, and van der Waals interactions by a Lennard–Jones $r^{-12}$, $r^{-6}$ term. These functional terms have been carefully parametrized, using experimental data and high-level quantum mechanical calculations as reference. It is not our purpose to comment on these methods here, and we simply refer the reader to suitable reviews of both atomistic MD and atomistic force fields (McCammon et al., 1977; Brooks et al., 1988; Karplus and McCammon, 2002).
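As an illustration of how Eqs. (1)–(3) are integrated numerically, the following minimal sketch propagates a single particle in one dimension under a harmonic stretching term. The velocity Verlet scheme is used here because it is a common choice in MD codes; the text does not name a specific integrator. The force constant, mass, time step, and the 1/2 prefactor in the energy are arbitrary assumptions in reduced units.

```python
def harmonic_force(x, k=100.0, x0=1.0):
    """Force and energy of a harmonic stretching term E = 0.5*k*(x - x0)**2."""
    dx = x - x0
    return -k * dx, 0.5 * k * dx * dx


def velocity_verlet(x, v, mass, dt, n_steps, force=harmonic_force):
    """Integrate m*a = -dE/dx for one particle in one dimension.

    Returns a list of (position, velocity, total energy) tuples, one per step.
    """
    f, _ = force(x)
    trajectory = []
    for _ in range(n_steps):
        v += 0.5 * dt * f / mass   # first half-step velocity update (Eq. 2)
        x += dt * v                # position update (Eq. 3)
        f, e_pot = force(x)        # force from the potential gradient (Eq. 1)
        v += 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v, e_pot + 0.5 * mass * v * v))
    return trajectory


# Reduced units: bond stretched by 0.2 from equilibrium, initially at rest.
traj = velocity_verlet(x=1.2, v=0.0, mass=1.0, dt=0.001, n_steps=10_000)
energies = [e for _, _, e in traj]
print(f"total energy drift: {max(energies) - min(energies):.2e}")
```

With a time step that is small compared with the vibrational period, the total energy stays close to its initial value, which is the practical reason the fastest vibrations dictate the step size.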

Since its development in the seventies, MD has grown in popularity and is now used routinely by a large number of laboratories around the world. Several programs running highly optimized codes are available, often with free or almost free licensing schemes for (at least) academic groups. The improvement of MD codes and the development of more efficient and powerful computers have made MD simulations possible on the microsecond timescale for small proteins, while for larger systems (systems of more than one million atoms have been considered), "state-of-the-art" simulations are at least one order of magnitude shorter. Databases such as MoDEL (Rueda et al., 2007a; Meyer et al., 2010) or Dynameomics (Van der Kamp et al., 2010) compile and make available to the community the near-equilibrium (10–100 ns range) dynamics of proteins in water for nearly 2000 representative proteins (see Fig. 2), covering a good percentage of the unique-protein PDB space (see Fig. 3).

In summary, MD is now a mature and widely used technique that provides results of high quality. Unfortunately, despite its successes, we cannot ignore the existence of four fundamental problems that handicap its practical applicability: (i) the use of MD requires access to large computer resources and a notable degree of expertise in the setup of simulations; (ii) MD simulations are very costly and, even with the best computer resources, trajectory collection can take months (or even years); (iii) the timescale accessible to MD simulations is still far from that required to properly represent many biologically relevant transitions; and finally, (iv) data mining of hundreds of gigabytes of trajectories is complex, has a low signal-to-noise ratio, and again requires significant experience and special computer equipment.

FIG. 2. Total number of protein simulations in the 2010 version of MoDEL and distribution across structural classes (number of simulations versus protein classes).

FIG. 3. Coverage (measured from BLAST sequence comparisons using a limit e-value of $10^{-5}$) of the 2010 version of MoDEL on different structure and sequence databases. The vertical axis is MoDEL coverage (%); the categories shown are PDB structures, Uniprot KBSeq, Uniprot KBHuman, and Drug bank targets.

Development of tools that automate the setup of MD simulations, taking care of completing missing atoms in crystal structures, relaxing bad contacts, choosing suitable ionization states for the titratable groups of proteins, detecting the placement of structural waters and ions, defining topologies, creating the solvent environment, and performing the thermalization and equilibration, will surely open up MD to a broader community. Initiatives such as MDWeb go in this direction (http://mmb.pcb.ub.es/MDWeb; see Fig. 4). Similar initiatives centered on the data mining of trajectories (Camps et al., 2009; see http://mmb.pcb.ub.es/MoDEL and http://mmb.pcb.ub.es/FlexServ) will greatly help to make trajectory analysis accessible to nonexperts. Finally, many other initiatives, such as the Distributed European Infrastructure for Supercomputing Applications (DEISA; http://www.deisa.eu) or Scalalife (http://www.scalalife.eu), are now being developed to facilitate the access of MD users to high-performance computers. In parallel, software developers are making a tremendous effort to develop programs able to exploit parallel architectures (Phillips et al., 2005; Hess et al., 2008), and the porting of these codes to GPU architectures is an ongoing process (Harvey et al., 2009; Voelz et al., 2010). A major advance in the field will come from the use of MD-specific computers (http://www.deshaw.com/), which can increase by two to three orders of magnitude the size of the system or the length of the collected trajectory. However, even with all these spectacular technical improvements, MD will remain far too slow and complicated to provide the interactivity that experimental biologists often require. This is the main motivation for the development of approximate coarse-grained models, where, in order to increase computational efficiency, we accept a certain loss of accuracy together with a significant reduction in structural resolution. Such a loss of resolution might in fact be beneficial for deriving a more intuitive description of many biological processes (such as large conformational transitions or protein aggregation) occurring on the mesoscopic scale.



FIG. 4. General workflow of the MDWeb Server (http://mmb.pcb.ub.es/MDWeb) for automatic generation of MD trajectories. In the example shown, the workflow will submit an AMBER MD simulation using a PDB entry as input.

II. Coarse-Grained Potentials

The coarse graining of a protein always implies the compression of a series of atoms into a pseudo-particle and a simplification in the representation of the solvent, which is (i) neglected, (ii) simulated as a continuum, or (iii) represented by pseudo-particles that account for clusters of solvent molecules. In all cases, the simplification implies the need to recalibrate the potential function (the force field) or to use information-based potentials to describe intraprotein interactions. The most common level of coarse graining for proteins involves the representation of every residue by a single particle located at the Cα. Refinements of the model that have been explored by some authors consist of using additional particles to mimic the side chains or some backbone atoms.
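The one-bead-per-residue reduction described above can be sketched as a filter over PDB-format ATOM records that keeps only the Cα atoms. This is a minimal illustration that relies on the fixed-column PDB layout; it ignores alternate locations, multiple models, and HETATM records. The helper `make_atom_line` and the coordinates are hypothetical and exist only to make the example self-contained.

```python
def make_atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a fixed-column PDB ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def ca_trace(pdb_lines):
    """Return one (residue name, residue number, (x, y, z)) bead per C-alpha."""
    beads = []
    for line in pdb_lines:
        # Columns 13-16 hold the atom name; C-alpha is written as " CA ".
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resname = line[17:20].strip()
            resseq = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            beads.append((resname, resseq, xyz))
    return beads


# Synthetic records (made-up coordinates, in angstroms).
lines = [
    make_atom_line(1, " N  ", "ALA", "A", 1, 0.000, 0.000, 0.000),
    make_atom_line(2, " CA ", "ALA", "A", 1, 1.458, 0.000, 0.000),
    make_atom_line(3, " C  ", "ALA", "A", 1, 2.009, 1.420, 0.000),
    make_atom_line(4, " CA ", "GLY", "A", 2, 3.800, 1.200, 0.000),
]
beads = ca_trace(lines)
print(beads)
```

Restricting the filter to ATOM records avoids picking up calcium ions, which share the `CA` atom name but are stored as HETATM entries.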



A. Gō-Like Potentials

Originating from the early works of Gō and coworkers (Taketomi et al., 1975), these potentials are the basis of many of the currently used information-based potentials. Gō potentials are typically used in conjunction with a Cα coarse graining of the protein and consider that any two residues that are in contact in the three-dimensional structure of the protein have a favorable interaction, while if they are not in contact the interaction is null or unfavorable:

$$ E = \sum_{i,j} \delta_{i,j}\,\varepsilon_{ij} \tag{4} $$



where i and j stand for protein particles (typically residue Cα), $\delta_{i,j}$ is a Dirac function which takes the value −1 if the two residues are in contact and 0 or +1 otherwise, and $\varepsilon_{ij}$ is an energy constant that is equal for all pairs (uncolored Gō potential; $\varepsilon_{ij} = \varepsilon$) or different (colored Gō potentials). Despite their extreme simplicity, Gō potentials have been used quite successfully to study protein folding and have been crucial in the development of some of today's most accepted theories of folding (see review in Go, 1983). Very recently, these potentials have increased in complexity, adopting a formalism that resembles that of atomistic physical models, as in Onuchic's functional (Clementi et al., 2000):

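$$ V = V_{\text{bonded}} + V_{\text{angle}} + V_{\text{dihedral}} + V_{\text{nonbonded}}^{\text{native}} + V_{\text{nonbonded}}^{\text{nonnative}} \tag{5} $$

$$ V = \sum_{\text{bonds}} K_r (r - r_0)^2 + \sum_{\text{angles}} K_\theta (\theta - \theta_0)^2 + \sum_{\text{dihedrals}} \left\{ K_\phi^{(1)} \left[ 1 - \cos(\phi - \phi_0) \right] + K_\phi^{(3)} \left[ 1 - \cos 3(\phi - \phi_0) \right] \right\} + \sum_{\text{native}} \varepsilon \left[ 5 \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - 6 \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{10} \right] + \sum_{\text{nonnative}} \varepsilon \left( \frac{\sigma_0}{r_{ij}} \right)^{12} \tag{6} $$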



where, in the first three terms, r, θ, and φ are the bond length, angle, and dihedral angle, respectively. The corresponding subscripts "0" stand for values in the experimental structure. The fourth term corresponds to the Lennard–Jones-like (LJ) stabilization energy, represented as a 12–10 function that acts only on those contacts present in the native state. Here, $r_{ij}$ and $\sigma_{ij}$ identify the distance between atoms i and j in one snapshot and in the native state, respectively ($r_{ij} = |\vec{r}_{ij}|$ and $\sigma_{ij} = |\vec{r}_{ij}^{\,0}|$). The fifth term is an excluded volume function that energetically disfavors any close nonnative contact ($\sigma_0 = 4$ Å). In typical implementations of this model, native contacts are identified with a 5-Å heavy-atom cutoff, excluding sequential neighbors up to i, i + 3. Such a contact calculation preserves the number of atomic contacts per residue, which depends on the size of the amino acid. Nearest-neighbor energy terms are usually defined by $K_r = 100\varepsilon$, $K_\theta = 20\varepsilon$, $K_\phi^{(1)} = \varepsilon$, and $K_\phi^{(3)} = 0.5\varepsilon$, where ε sets the energy scale.

Onuchic's Gō-like potentials coupled, for example, to Langevin dynamics sampling algorithms (see below) are being extensively used to analyze experimental biophysical measurements of protein folding and unfolding (Clementi et al., 2000, 2003).
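The nonbonded part of Eq. (6) can be sketched for a bead model as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: contacts are detected directly on bead coordinates with a hypothetical 6 Å cutoff (the text describes a 5-Å heavy-atom cutoff on the all-atom structure), every non-contact pair beyond i, i + 3 is treated as nonnative, and the function names and the 8-bead geometry are made up.

```python
import math


def native_contacts(native, cutoff, min_separation=4):
    """Pairs (i, j) within `cutoff` in the native structure, skipping
    sequential neighbors up to i, i+3 (min_separation=4)."""
    n = len(native)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if math.dist(native[i], native[j]) <= cutoff]


def go_nonbonded(coords, native, contacts, eps=1.0, sigma0=4.0, min_separation=4):
    """Native (12-10) and nonnative (repulsive r^-12) terms of Eq. (6)."""
    contact_set = set(contacts)
    e_native = e_nonnative = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            r = math.dist(coords[i], coords[j])
            if (i, j) in contact_set:
                # sigma_ij is the distance of this pair in the native state.
                x = math.dist(native[i], native[j]) / r
                e_native += eps * (5.0 * x**12 - 6.0 * x**10)
            else:
                e_nonnative += eps * (sigma0 / r) ** 12
    return e_native, e_nonnative


# Synthetic two-strand arrangement of 8 beads (angstroms, made-up geometry).
native = [(3.8 * i, 0.0, 0.0) for i in range(4)] + \
         [(3.8 * (3 - i), 5.0, 0.0) for i in range(4)]
contacts = native_contacts(native, cutoff=6.0)
e_nat, e_non = go_nonbonded(native, native, contacts)
print(contacts, e_nat, e_non)
```

Evaluated at the native geometry, each native contact contributes exactly −ε because the 12–10 term has its minimum at $r_{ij} = \sigma_{ij}$, so the native term equals minus the number of contacts times ε.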



B. Harmonic Potentials

They can be understood as an evolution of Gō-like potentials, as they penalize deviations from the native inter-residue distances. These potentials have become very popular for the study of the "near-equilibrium" dynamics of proteins when implemented in sampling techniques derived from normal mode analysis (NMA). The basic assumption when using these potentials is that a protein behaves as an elastic network model (ENM; Tirion, 1996; Atilgan et al., 2001), where usually the Cα atoms act as network nodes connected by harmonic springs. Note that in a fully connected network the number of springs grows with the number of residues in the protein (N) as N(N − 1)/2, and accordingly, direct application of ENM will result in an artifactual over-rigidification of the protein as the protein size is increased. This problem can be corrected by using, for example, a distance-dependent cutoff that annihilates the interactions between remote residues, leading to an energy functional such as that developed in Eqs. (7)–(10), where the energy (E) required to distort a protein from its equilibrium conformation ($r_{ij}^0$), considered an energy minimum, is given by the pairwise Hookean potential (Tirion, 1996):

$$ E = \sum_{i \neq j} K_{ij} \left( r_{ij} - r_{ij}^0 \right)^2 \tag{7} $$

where $r_{ij}$ now stands for the distance between residues i and j (represented by the corresponding Cα) in the protein configuration, and $K_{ij}$ stands for the spring constant. The force constant of the spring restricting the motion of the ij residue pair is computed as:

$$ K_{ij} = \frac{1}{2} k \Gamma_{ij} \tag{8} $$

k being a phenomenological constant (in energy/distance² units) and Γ being a Kirchhoff topology matrix of inter-residue contacts, whose ijth element, for i ≠ j = 1, ..., N, is equal to 1 if residues i and j are within the cutoff distance $r_c$, or zero otherwise:

$$ \Gamma_{ij} = \begin{cases} 1 & \text{if } r_{ij} \le r_c \\ 0 & \text{if } r_{ij} > r_c \end{cases} \tag{9} $$

The diagonal elements (iith) are equal to the coordination number or residue connectivity, taken as:

$$ \Gamma_{ii} = \sum_{j \mid j \neq i}^{N} \Gamma_{ij} \tag{10} $$
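A minimal sketch of Eqs. (7)–(10) on bead coordinates is given below. It follows the sign convention stated in the text (off-diagonal elements equal to 1 for contacts); many ENM implementations instead store −1 off the diagonal. The function names, the cutoff, the value of k, and the four-bead chain are illustrative assumptions.

```python
import math


def kirchhoff_matrix(coords, r_c):
    """Topology matrix of Eqs. (9)-(10): 1 for pairs within r_c,
    coordination number on the diagonal."""
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= r_c:
                gamma[i][j] = gamma[j][i] = 1
    for i in range(n):
        gamma[i][i] = sum(gamma[i][j] for j in range(n) if j != i)
    return gamma


def enm_energy(coords, native, gamma, k=1.0):
    """Eq. (7) with K_ij = 0.5 * k * Gamma_ij (Eq. 8).

    The sum runs over all ordered pairs i != j, as written in Eq. (7),
    so every spring is counted twice.
    """
    n = len(coords)
    energy = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and gamma[i][j]:
                d = math.dist(coords[i], coords[j]) - math.dist(native[i], native[j])
                energy += 0.5 * k * gamma[i][j] * d * d
    return energy


# Four beads on a line, 3.8 angstroms apart (made-up geometry).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
gamma = kirchhoff_matrix(native, r_c=8.0)
moved = native[:3] + [(12.4, 0.0, 0.0)]   # last bead displaced by 1 angstrom
print(gamma)
print(enm_energy(native, native, gamma), enm_energy(moved, native, gamma))
```

The energy is zero at the reference structure by construction, and only springs attached to the displaced bead contribute once it moves, which makes the dependence on the cutoff easy to explore by changing `r_c`.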



Despite their simplicity, functionals such as those shown in Eqs. (7)–(10) are able to provide a quite accurate representation of the near-equilibrium dynamics of many proteins, but they are extremely dependent on the cutoff selected for remote interactions, whose optimal value can differ for each protein and is often difficult to predict a priori (Sen and Jernigan, 2006). This led to the derivation of new methods where the discrete Hamiltonian is replaced by continuous functions, typically dependent on an inverse power of the inter-residue distance ($r_{ij} = |\vec{r}_{ij}|$). Thus, Hinsen et al. (2000) derived a function for the spring strength by fitting to a local minimum from a single MD simulation. This procedure leads to a force constant definition with stronger couplings for neighbors along the backbone, and an inverse sixth power of the distance for the rest of the interactions. The distinction between short- and long-range terms was dependent on a short cutoff, and the formulation also included a protein-fitted scaling factor for the global energetics, which limits its general applicability. Kovacs et al. (2004) proposed a simpler inverse sixth-power function, which does not require any cutoff and has become very popular in current elastic network implementations:

$$ K_{ij} = C \left( \frac{r^0}{r_{ij}} \right)^6 \tag{11} $$

where the proportionality constant C (usually taken as 40 kcal/(mol Å²)) controls the global rigidity of protein contacts, and $r^0$ is normally taken as 3.8 Å, which is approximately the mean Cα–Cα distance between any pair of consecutive residues.
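The distance dependence of Eq. (11) is easy to tabulate. The sketch below uses the parameter values quoted in the text (C = 40 kcal/(mol Å²), $r^0$ = 3.8 Å); the function name and the sample distances are illustrative choices.

```python
def kovacs_force_constant(r_ij, c=40.0, r0=3.8):
    """Spring constant of Eq. (11), in kcal/(mol A^2) for r_ij in angstroms."""
    return c * (r0 / r_ij) ** 6


# Each doubling of the distance weakens the spring by a factor of 2**6 = 64.
for r in (3.8, 7.6, 15.2):
    print(f"r_ij = {r:5.1f} A  ->  K_ij = {kovacs_force_constant(r):.5f}")
```

Because the spring strength falls off so steeply, remote pairs contribute almost nothing, which is why this form can dispense with an explicit cutoff.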

Different authors have tried to improve harmonic potentials by, for example, defining rigid blocks (Tama et al., 2000), scaling covalent and noncovalent contacts differently (Kondrashov et al., 2006), adding short-range terms (Moritsugu and Smith, 2007), or distinguishing chain interactions by means of a bond cutoff (Jeong et al., 2006). Following these directions, we have recently (Orellana et al., 2010) developed a hybrid approach calibrated using a large database of atomistic MD trajectories of representative proteins (Rueda et al., 2007a; Meyer et al., 2010). The analysis of MD simulations showed that the topology of nearest-neighbor interactions, the basis of the secondary structure, is the main component in the large motions traced by ENM (Rueda et al., 2007a; Meyer et al., 2010). Accordingly, the method (named essential-dynamics elastic network model, ed-ENM; Orellana et al., 2010) treats the sequential and nonsequential ("Cartesian") contacts differently. For the first M sequential contacts, a fully connected matrix is used, while Cartesian contacts are treated using a continuous distance-dependent function with a calibrated size-dependent cutoff, which helps to remove artifactual long-range interactions. Therefore, the elements of the topology matrix are defined as:

$$ \Gamma_{ij} = \begin{cases} -1 & \text{if } S_{ij} \le M \\ 1 & \text{otherwise, if } r_{ij} \le r_c \\ 0 & \text{otherwise} \end{cases} \tag{12} $$

and the matrix Γ always has 2M + 1 nonzero diagonals defining the chained neighbor contacts. Accordingly, the force constants $K_{ij}$ depend not only on the Cartesian distance ($r_{ij}$) but also on the sequential distance ($S_{ij}$):

$$ K_{ij} = \begin{cases} C^{\text{seq}} / S_{ij}^{\,n_s} & \text{if } S_{ij} \le M \\ \left( C^{\text{cart}} / r_{ij} \right)^{n_c} & \text{otherwise, if } r_{ij} \le r_c \\ 0 & \text{otherwise} \end{cases} \tag{13} $$

where the values for all terms ($n_s = 2$ and $C^{\text{seq}} = 60$ kcal/(mol Å²); $n_c = 6$ and $C^{\text{cart}} = 6$ kcal/(mol Å²), in energy units) were obtained by fitting to apparent force constants and structural variance profiles obtained in a large number of atomistic MD simulations. A value of M = 3 was used for sequential interactions based on MD simulations, which were also instrumental in defining the cutoff radius ($r_c$), which is computed using an empirical logarithmic relationship with the size of the protein. This formalism guarantees sequential contacts whose strength decays quickly with the number of connecting bonds (see Fig. 5) and a continuous decay of the strength of Cartesian contacts up to a cutoff. Attempts to improve the formalism by adding "color" to the topological relations, that is, different spring constants for different physical interactions, or by adding differential weights to different secondary structure elements, did not yield clear improvements in the results (Orellana et al., 2010).

FIG. 5. Formulation of the ed-ENM model. The ed-ENM is a nearest-neighbor-based model, maintaining the secondary structure stereochemistry, where the three first-order constants acquire values close to a 100:10:1 ratio. The panel labels the springs from Cα(i) to Cα(i + 1), Cα(i + 2), and Cα(i + 3) as i, i + 1 (≈100 kcal/(mol Å²)), i, i + 2 (≈10 kcal/(mol Å²)), and i, i + 3 (≈1 kcal/(mol Å²)).

The hybrid ENM outlined above can work coupled to any sampling technique (see below) and provides quite accurate representations of the near-equilibrium dynamic properties of proteins at both the global (essential dynamics, global variance) and local (B-factor distribution) levels, representing a significant improvement with respect to simpler schemes (see Fig. 6).
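The piecewise definition of the ed-ENM force constants in Eq. (13) can be sketched as a single function. The default parameters are the numerical values quoted in the text, used as given. The cutoff `r_c` is passed in explicitly because the text only states that it follows an empirical logarithmic relationship with protein size; the 10 Å used in the demo is an arbitrary placeholder. A single chain is assumed, so the sequential distance is taken as |i − j|.

```python
def ed_enm_force_constant(i, j, r_ij, r_c, m=3,
                          c_seq=60.0, n_s=2, c_cart=6.0, n_c=6):
    """Force constant K_ij of Eq. (13) for residues i and j (i != j).

    r_ij is the Cartesian C-alpha distance in angstroms; r_c is the
    Cartesian cutoff, supplied by the caller (assumption, see lead-in).
    """
    s_ij = abs(i - j)                      # sequential distance, single chain
    if s_ij == 0:
        raise ValueError("K_ij is only defined for i != j")
    if s_ij <= m:
        return c_seq / s_ij ** n_s         # sequential (chained) contacts
    if r_ij <= r_c:
        return (c_cart / r_ij) ** n_c      # Cartesian contacts within cutoff
    return 0.0                             # remote pairs are disconnected


# Sequential neighbors ignore the Cartesian cutoff; remote pairs do not.
print(ed_enm_force_constant(0, 1, r_ij=3.8, r_c=10.0))    # i, i+1
print(ed_enm_force_constant(0, 10, r_ij=6.0, r_c=10.0))   # Cartesian contact
print(ed_enm_force_constant(0, 10, r_ij=12.0, r_c=10.0))  # beyond the cutoff
```

Keeping the sequential and Cartesian branches separate mirrors the structure of the topology matrix, where the first M off-diagonals are always populated regardless of the spatial cutoff.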



C. Flat Potentials



In recent years, the use of discontinuous flat potentials (also named stepwise potentials) has gained popularity due to their use in discrete MD sampling algorithms (Zhou and Karplus, 1999; Ding et al., 2005;


