Coarse-grained representation of protein flexibility. Foundations, successes, and shortcomings
OROZCO ET AL.
I. Introduction
Biological macromolecules, and in particular proteins, are large and
flexible entities that perform their biological action when embedded in
solvent, either water or membrane phospholipids. Analysis of the current
version of the Protein Data Bank (PDB; Berman et al., 2000;
http://www.pdb.org) illustrates that proteins of known experimental
structure typically range between 500 and 7000 atoms, but in some cases
protein systems reach more than 16,000 atoms (see Fig. 1). As experimental
techniques advance, the size histogram in Fig. 1 is expected to shift to
the right as large protein assemblies are incorporated into the
database. However, the real problem in protein simulation originates from
the need to introduce solvent in the calculation, which dramatically
increases the number of particles in the system. For example, in our
MoDEL (Molecular Dynamics Extended Library) database, which contains
atomistic molecular dynamics (MD) simulations of representative PDB
proteins in water (largely enriched in domain-sized proteins), typical
[Figure 1 plot: histogram of population versus number of atoms. Main panel: "Number atoms in proteins (PDB, 2010)"; inset: "MoDEL number of atoms distribution".]
FIG. 1. Distribution of protein atoms in the 2010 version of the Protein Data Bank
(PDB). The inset shows the distribution of atoms in solvated protein systems in our
MoDEL database (http://mmb.pcb.ub.es/MoDEL).
simulation systems range from 10,000 to 50,000 atoms, but some systems
have more than 150,000 atoms (see Fig. 1); that is, we are dealing with
systems with up to half a million degrees of freedom. If we are interested in
studying protein interactions, diffusion, or aggregation processes,
simulated systems can easily reach many millions of degrees of freedom,
making atomistic simulation very complex.
As noted in the previous paragraph, size is a major limitation for the
atomistic simulation of proteins, but the time problem is often even more
severe than the size problem. Proteins are flexible and move continuously,
and therefore biological function cannot be understood without
considering protein dynamics. Unfortunately, proteins move as a result of
atomic vibrations happening in the femtosecond timescale, while most
biologically relevant protein motions happen in the millisecond to second
range. Thus, in order to follow, with atomistic detail, a biologically relevant
protein motion, its energy (and associated forces) should be computed at
least $10^{12}$ times. For a typical system of 50,000 atoms, the calculation of just
the interatomic distances would require on the order of $10^{21}$ floating point
operations, not far from the Avogadro number.
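The orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes a femtosecond-scale integration step (the usual order of magnitude for an MD time step), a millisecond target motion, and it counts one evaluation per atom pair per step, which underestimates the real cost of a force calculation.

```python
# Back-of-the-envelope cost of following a millisecond motion atomistically.
# Assumed, illustrative values: 1 fs step, 1 ms target time, 50,000 atoms.
time_step_s = 1e-15
target_time_s = 1e-3
n_atoms = 50_000

n_steps = target_time_s / time_step_s      # energy/force evaluations needed
n_pairs = n_atoms * (n_atoms - 1) // 2     # interatomic distances per step
n_distance_evals = n_steps * n_pairs       # total distance evaluations

print(f"steps:                {n_steps:.1e}")
print(f"pairs per step:       {n_pairs:.2e}")
print(f"distance evaluations: {n_distance_evals:.2e}")
```

Under these assumptions the number of steps is of order $10^{12}$ and the number of distance evaluations of order $10^{21}$, in line with the estimates in the text.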
Protein dynamics can be studied by different techniques, the most
rigorous one being atomistic MD. In this approach, all atoms of the system
are included at the same level of detail and their trajectories are determined by simple integration of Newton’s (or closely related) equations of
motion:
$$ m_i \vec{a}_i = -\frac{dE}{d\vec{r}_i} \tag{1} $$

$$ \vec{v}_i(t) = \vec{v}_i(t=0) + \int_{t=0}^{t=dt} \vec{a}_i(t)\,dt \tag{2} $$

$$ \vec{r}_i(t) = \vec{r}_i(t=0) + \int_{t=0}^{t=dt} \vec{v}_i(t)\,dt \tag{3} $$
where t stands for time, r for the position, v for the velocity and a for the
acceleration of atom i. The potential energy E is computed by using
potential functions containing both bonded (stretching, bending, and
torsion) and nonbonded interactions (van der Waals and electrostatic).
Stretching and bending are represented by harmonic expressions, torsions
by Fourier series, electrostatics by a Coulombic $r^{-1}$ term, and van der Waals
interactions by a Lennard–Jones $r^{-12}$, $r^{-6}$ term. These functional terms have been
carefully parametrized, using experimental data and high-level quantum
mechanical calculations as reference. It is not our purpose to discuss
these methods here, and we simply refer the reader to suitable reviews of
both atomistic MD and atomistic force fields (McCammon et al., 1977;
Brooks et al., 1988; Karplus and McCammon, 2002).
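In practice, Eqs. (1)–(3) are integrated numerically in many small time steps. The following is a minimal sketch of such an integration, assuming the velocity Verlet scheme (the text does not name a specific integrator) and a toy one-dimensional harmonic potential in reduced units instead of a real force field. The function names are hypothetical.

```python
import numpy as np

def harmonic_force(r, k=1.0, r_eq=0.0):
    """Force F = -dE/dr for the toy potential E = 0.5 * k * (r - r_eq)**2."""
    return -k * (r - r_eq)

def velocity_verlet(r, v, mass, dt, n_steps, force_fn):
    """Propagate positions r and velocities v for n_steps of length dt."""
    f = force_fn(r)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity, a = F/m (Eq. 1)
        r = r + dt * v_half                # position update (Eq. 3)
        f = force_fn(r)                    # forces at the new positions
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update (Eq. 2)
    return r, v

# Toy usage: unit mass on a unit spring, released from r = 1 at rest.
r_start = np.array([1.0])
v_start = np.array([0.0])
r_end, v_end = velocity_verlet(r_start, v_start, mass=1.0, dt=0.01,
                               n_steps=1000, force_fn=harmonic_force)
print(r_end, v_end)
```

For this toy system the exact trajectory is a cosine, so the numerical result can be compared with the analytical position at the final time; in a real MD code the force function would evaluate the full bonded and nonbonded force field.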
Since its development in the seventies, MD has grown in popularity
and is now a technique used routinely by a large number of laboratories
around the world. Several programs running highly optimized codes are
available, often with a free or almost free license scheme for (at least)
academic groups. The improvement of MD codes and the development
of more efficient and powerful computers have made MD simulations
possible in the microsecond timescale for small proteins, while for the
larger systems (systems of more than a million atoms have been considered),
“state-of-the-art” simulations are at least one order of magnitude shorter.
Databases such as MoDEL (Rueda et al., 2007a; Meyer et al., 2010) or
Dynameomics (Van der Kamp et al., 2010) compile and make available to
the community the near-equilibrium (10–100 ns range) dynamics of proteins
in water for nearly 2000 representative proteins (see Fig. 2), covering
a good percentage of the unique-protein PDB space (see Fig. 3).
In summary, MD is now a mature and widely used technique that
provides results of high quality. Unfortunately, despite its successes, we
cannot ignore the existence of four fundamental problems that handicap
its practical applicability: (i) the use of MD requires access to large
computer resources and a notable degree of expertise in the setup of
simulations; (ii) MD simulations are very costly, and even with the best
computer resources, the human (wall-clock) time needed to collect a
trajectory can extend into months (or even years); (iii) the timescale
accessible to MD simulations is still far from that required to properly
represent many biologically relevant transitions; and finally, (iv) data
mining of hundreds of gigabytes of trajectories is complex, with a small
signal/noise ratio, and again requires significant experience and special
computer equipment.
Development of tools for the automation of the setup of MD simulations,
which take care of completing missing atoms in the crystal structure,
relaxing bad contacts, choosing suitable ionization states for the titratable
groups of proteins, detecting the placement of structural waters and ions,
[Figure 2 plot: bar chart of the number of simulations (axis 0 to 2000) per protein class: Total, Monomers, Oligomers, Mainly Alpha, Mainly Beta, Alpha/Beta, Irregular.]
FIG. 2. Total number of protein simulations in the 2010 version of MoDEL and
distribution across structural classes.
[Figure 3 plot: bar chart of MoDEL coverage (%) (axis 0 to 40) for PDB structures, Uniprot KBSeq, Uniprot KBHuman, and Drug bank targets.]
FIG. 3. Coverage (measured from BLAST sequence comparison using a limit
e-value of $10^{-5}$) of the 2010 version of MoDEL on different structure and sequence
databases.
defining topologies, creating the solvent environment, and performing
the thermalization and equilibration, will surely open up MD to a broader
community. Initiatives such as MDWeb go in this direction
(http://mmb.pcb.ub.es/MDWeb; see Fig. 4). Similar initiatives, but centered on the data
mining of trajectories (Camps et al., 2009; see http://mmb.pcb.ub.es/MoDEL
and http://mmb.pcb.ub.es/FlexServ), will be of great help in
making trajectory analysis accessible to nonexperts. Finally, many other initiatives,
such as the Distributed European Infrastructure for Supercomputing
Applications (DEISA; http://www.deisa.eu) or Scalalife (http://www.scalalife.eu),
are now being developed to facilitate the access of MD users to
high-performance computers. In parallel, software developers are making
a tremendous effort to develop programs able to use parallel architectures
(Phillips et al., 2005; Hess et al., 2008), and the porting of all these codes to GPU
architectures is an ongoing process (Harvey et al., 2009; Voelz et al., 2010).
A major advance in the field will come from the use of MD-specific computers
(http://www.deshaw.com/), which can increase by two to three orders of
magnitude the size of the system or the length of the collected trajectory.
However, even with all these spectacular technical improvements, MD will
remain a technique far too slow and complicated to provide the interactivity
that experimental biologists often require. This is the main motivation for
the development of approximate coarse-grained models, where, in order to
increase computational efficiency, we accept a certain loss of accuracy together
with a significant reduction in structural resolution. Such a loss of resolution might
in fact be beneficial for deriving a more intuitive description of many biological
processes (such as large conformational transitions or protein aggregation)
occurring at the mesoscopic scale.
II. Coarse-Grained Potentials
The coarse graining of a protein always implies the compression of a
group of atoms into a pseudo-particle and a simplification in the
representation of the solvent, which is (i) neglected, (ii) simulated as a continuum, or
(iii) represented by pseudo-particles that account for clusters of solvent
molecules. In all cases, the simplification implies the need to recalibrate
the potential function (the force field) or to use information-based
potentials to describe intraprotein interactions. The most common level of
coarse graining for proteins involves the representation of every residue
by a single particle located at the Cα. Refinements of the model that have
[Figure 4 flowchart (order of boxes in the original diagram not recoverable). Box labels: PDB Entry; get entry (PDB DB); Unified PDB Entry; Strip Hydrogens; Strip Bulk Waters; Strip Monovalent Ions; Contains unknown ligand? (Yes/No); Generate Ligand Parameters; add new ligand (Residue & Ligands Charges DB); Assign Residue Types; Add Sulfur Bridges; Add Missing Atoms/Add Charges (tLeap); Minimize (sander); Determine Protonation; “Titrated” PDB; Add Structural Waters (30 first ones) (CMIP); Neutralize (CMIP); Determine Box Size & Number of Waters; Add water box (tLeap); Parmtop & Coordinates; Solvent Minimization; Heat solvent to 300 K; Lower restraints on protein; Limit restraints to backbone; Lower restraints; Free MD without restraints. Restraint values shown: 50, 25, 25, 10, and 1 kcal/(mol Å²); stage lengths shown: 40, 20, 20, 20, and 100 ps.]
FIG. 4. General workflow of the MDWeb Server (http://mmb.pcb.ub.es/MDWeb) for automatic generation of MD
trajectories. In the example shown, the workflow will submit an AMBER MD simulation using a PDB entry as input.
been explored by some authors consist of using additional particles to
mimic the side chains or some backbone atoms.
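The one-bead-per-residue mapping described above can be sketched as follows. This is a minimal illustration, assuming a standard fixed-column PDB file as input and keeping only the ATOM records named CA; `ca_coordinates` and `atom_line` are hypothetical helper names, and the two-residue fragment is made-up data used only to exercise the parser.

```python
import numpy as np

def ca_coordinates(pdb_text):
    """Return an (N, 3) array with one C-alpha position per residue."""
    coords = []
    for line in pdb_text.splitlines():
        # PDB fixed columns: atom name in 13-16; x, y, z in 31-38, 39-46, 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append([float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])])
    return np.array(coords)

def atom_line(serial, name, res_name, res_seq, x, y, z):
    """Hypothetical helper: format a minimal fixed-column ATOM record."""
    return (f"ATOM  {serial:5d} {name:^4s} {res_name:3s} A{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

# Made-up two-residue fragment, for illustration only.
demo_pdb = "\n".join([
    atom_line(1, "N", "ALA", 1, 0.000, 0.000, 0.000),
    atom_line(2, "CA", "ALA", 1, 1.458, 0.000, 0.000),
    atom_line(3, "C", "ALA", 1, 2.009, 1.420, 0.000),
    atom_line(4, "N", "GLY", 2, 3.332, 1.536, 0.000),
    atom_line(5, "CA", "GLY", 2, 3.970, 2.846, 0.000),
])
ca = ca_coordinates(demo_pdb)
print(ca.shape)
```

Restricting the match to ATOM records avoids picking up calcium ions, which share the atom name CA but are normally stored as HETATM records.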
A. Gō-Like Potentials
Originating from the early works of Gō and coworkers (Taketomi et al.,
1975), these potentials are the basis of many of the currently used
information-based potentials. Gō potentials are typically used in conjunction
with a Cα coarse-graining of the protein and consider that any two residues
that are in contact in the three-dimensional structure of the protein have a
favorable interaction, while if they are not in contact the interaction is
null or unfavorable:
$$ E = \sum_{i,j} \delta_{i,j}\,\varepsilon_{ij} \tag{4} $$

where i and j stand for protein particles (typically residue Cα), $\delta_{i,j}$ is a Dirac
function which takes the value −1 if the two residues are in contact and 0 or +1
otherwise, and $\varepsilon_{ij}$ is an energy constant that is either equal for all pairs (uncolored Gō potential;
$\varepsilon_{ij} = \varepsilon$) or different (colored Gō potentials). Despite their extreme simplicity, Gō
potentials have been used quite successfully to study protein folding and have
been crucial in the development of some of today’s most accepted theories of
folding (see review in Gō, 1983). Very recently, these potentials have increased
in complexity, adopting a formalism that resembles that of atomistic physical
models, as in Onuchic’s functional (Clementi et al., 2000):
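$$ V = V_{\mathrm{bonded}} + V_{\mathrm{angle}} + V_{\mathrm{dihedral}} + V_{\mathrm{nonbonded}}^{\mathrm{native}} + V_{\mathrm{nonbonded}}^{\mathrm{nonnative}} \tag{5} $$

$$
\begin{aligned}
V ={}& \sum_{\mathrm{bonds}} K_r (r - r_0)^2 + \sum_{\mathrm{angles}} K_\theta (\theta - \theta_0)^2 \\
&+ \sum_{\mathrm{dihedrals}} \left\{ K_\phi^{(1)} \left[ 1 - \cos(\phi - \phi_0) \right] + K_\phi^{(3)} \left[ 1 - \cos 3(\phi - \phi_0) \right] \right\} \\
&+ \sum_{\mathrm{native}} \varepsilon \left[ 5 \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - 6 \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{10} \right] + \sum_{\mathrm{nonnative}} \varepsilon \left( \frac{\sigma_0}{r_{ij}} \right)^{12}
\end{aligned}
\tag{6}
$$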
where in the first three terms, r, θ, and φ are the bond length, angle, and
dihedral angle, respectively. The corresponding subscripts “0” stand for
values in the experimental structure. The fourth term corresponds to the
Lennard–Jones-like (LJ) stabilization energy represented as a 12-10
function that acts only on those contacts present in the native state.
Here, $r_{ij}$ and $\sigma_{ij}$ identify the distance between atoms i and j in one
snapshot and in the native state, respectively ($r_{ij} = |\vec{r}_{ij}|$ and $\sigma_{ij} = r_{ij}^0$). The
fifth term is an excluded-volume function that energetically disfavors any
close nonnative contact ($\sigma_0 = 4$ Å). In typical implementations of this
model, native contacts are identified with a 5-Å heavy-atom cutoff, excluding
sequential neighbors up to i, i + 3. Such a contact calculation preserves
the number of atomic contacts per residue, which depends on the size
of the amino acid. Nearest-neighbor energy terms are usually defined by
$K_r = 100\varepsilon$, $K_\theta = 20\varepsilon$, $K_\phi^{(1)} = \varepsilon$, and $K_\phi^{(3)} = 0.5\varepsilon$, where ε sets the energy scale.
Onuchic’s Gō-like potentials coupled, for example, to Langevin dynamics
sampling algorithms (see below) are being extensively used to analyze
experimental biophysical measurements of protein folding and unfolding
(Clementi et al., 2000, 2003).
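The two nonbonded terms of Eq. (6) can be sketched for a Cα-only model as shown below. This is an illustrative sketch rather than the implementation used in the cited works: native contacts are defined here by an assumed 8 Å Cα–Cα cutoff (the text quotes a 5-Å heavy-atom cutoff, which needs all-atom coordinates), and the function name and default values are hypothetical.

```python
import numpy as np

def go_nonbonded_energy(coords, native_coords, eps=1.0, sigma0=4.0,
                        contact_cutoff=8.0, min_seq_sep=4):
    """Native 12-10 attraction plus nonnative r^-12 repulsion (Eq. 6)."""
    n = len(coords)
    # Pairs with j - i >= min_seq_sep, i.e. excluding neighbors up to i, i+3.
    i, j = np.triu_indices(n, k=min_seq_sep)
    r = np.linalg.norm(coords[i] - coords[j], axis=1)
    sigma = np.linalg.norm(native_coords[i] - native_coords[j], axis=1)

    native = sigma <= contact_cutoff          # assumed C-alpha contact rule
    ratio = sigma[native] / r[native]
    e_native = eps * np.sum(5.0 * ratio**12 - 6.0 * ratio**10)
    e_nonnative = eps * np.sum((sigma0 / r[~native])**12)
    return e_native + e_nonnative
```

In the native conformation every native contact contributes exactly −ε, since the 12-10 term has its minimum at $r_{ij} = \sigma_{ij}$, so the energy of the native state is approximately −ε times the number of native contacts whenever nonnative pairs are far apart.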
B. Harmonic Potentials
They can be understood as an evolution of Gō-like potentials, as they
penalize deviations from native inter-residue distances. These potentials
have become very popular for the study of the “near-equilibrium” dynamics
of proteins when implemented in sampling techniques derived from normal
mode analysis (NMA). The basic assumption when using these potentials is
that a protein behaves as an elastic network model (ENM; Tirion, 1996;
Atilgan et al., 2001), where usually the Cα atoms act as network nodes
connected by harmonic springs. Note that in a fully connected network the
number of springs grows with the number of residues in the protein (N) as
N(N − 1)/2, and accordingly, direct application of ENM will result in an
artifactual over-rigidification of the protein as the protein size increases.
This problem can be corrected by using, for example, a distance-dependent
cutoff that removes the interactions between remote residues, leading to an
energy functional such as that developed in Eqs. (7)–(10), where the energy (E)
to distort a protein from its equilibrium conformation ($r_{ij}^0$), considered
an energy minimum, is given by the pairwise Hookean potential (Tirion, 1996):
$$ E = \sum_{i \neq j} K_{ij} \left( r_{ij} - r_{ij}^0 \right)^2 \tag{7} $$
where $r_{ij}$ now stands for the distance between residues i and j (represented
by the corresponding Cα in the protein configuration), and $K_{ij}$ stands for
the spring constant. The force constant of the spring restricting the motion
of the ij residue pair is computed as:

$$ K_{ij} = \frac{1}{2} k\,\Gamma_{ij} \tag{8} $$

k being a phenomenological constant (in energy/distance² units) and Γ
being a Kirchhoff topology matrix of inter-residue contacts, whose ijth
element, for i ≠ j = 1, …, N, is equal to 1 if residues i and j are within the
cutoff distance $r_c$, and zero otherwise:

$$ \Gamma_{ij} = \begin{cases} 1 & \text{if } r_{ij} \le r_c \\ 0 & \text{if } r_{ij} > r_c \end{cases} \tag{9} $$

The diagonal elements (iith) are equal to the coordination number or
residue connectivity, taken as:

$$ \Gamma_{ii} = \sum_{j \mid j \neq i}^{M} \Gamma_{ij} \tag{10} $$
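Eqs. (7)–(10) translate almost directly into array operations. The sketch below is a minimal illustration, assuming Cα coordinates stored as an (N, 3) NumPy array; the function names are hypothetical, and both the cutoff and the constant k must be supplied by the user because the text gives no universal values for them.

```python
import numpy as np

def pairwise_distances(coords):
    """All inter-residue distances r_ij as an (N, N) array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def kirchhoff_matrix(ref_coords, r_cut):
    """Topology matrix: Eq. (9) off the diagonal, Eq. (10) on the diagonal."""
    gamma = (pairwise_distances(ref_coords) <= r_cut).astype(float)
    np.fill_diagonal(gamma, 0.0)                  # drop the trivial i = j term
    np.fill_diagonal(gamma, gamma.sum(axis=1))    # coordination numbers
    return gamma

def enm_energy(coords, ref_coords, r_cut, k=1.0):
    """Hookean energy of Eq. (7) with K_ij = 0.5 * k * Gamma_ij (Eq. 8)."""
    k_ij = 0.5 * k * kirchhoff_matrix(ref_coords, r_cut)
    np.fill_diagonal(k_ij, 0.0)                   # the sum runs over i != j
    delta = pairwise_distances(coords) - pairwise_distances(ref_coords)
    return float(np.sum(k_ij * delta**2))
```

Because the sum in Eq. (7) runs over all ordered pairs with i ≠ j, each spring is counted twice, which together with the factor 1/2 in Eq. (8) gives an energy of k times the squared elongation per connected pair.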
Despite their simplicity, functionals such as those shown in Eqs. (7)–(10) are able
to provide a quite accurate representation of the near-equilibrium dynamics
of many proteins, but they are extremely dependent on the cutoff selected for
remote interactions, which can have a different optimal value for each protein
(often difficult to predict a priori; Sen and Jernigan, 2006). This led to
the derivation of new methods where the discrete Hamiltonian is replaced by
continuous functions, typically dependent on an inverse power of the
inter-residue distance ($r_{ij} = |\vec{r}_{ij}|$). Thus, Hinsen et al. (2000) derived a
function for the spring strength by fitting to a local minimum from a single
MD simulation. This procedure leads to a force constant definition with
stronger couplings for neighbors along the backbone, and an inverse sixth power of
distance for the rest of the interactions. The distinction between short- and
long-range terms was dependent on a short cutoff, and the formulation also
included a protein-fitted scaling factor for the global energetics, which limits
its general applicability. Kovacs et al. (2004) proposed a simpler sixth-power
dependence, which does not require any cutoff and has become very popular
in current elastic network implementations:
$$ K_{ij} = C \left( \frac{r^0}{r_{ij}} \right)^{6} \tag{11} $$
where the proportionality constant C (usually taken as 40 kcal/(mol Å²))
controls the global rigidity of protein contacts, and $r^0$ is normally taken as
3.8 Å, which is approximately the mean Cα–Cα distance between any pair
of consecutive residues.
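As an illustration of Eq. (11), the distance-dependent spring constants can be computed for all residue pairs at once. This is a minimal sketch using the values of C and $r^0$ quoted above as defaults; the function name is hypothetical and the input is assumed to be an (N, 3) array of Cα coordinates in Å.

```python
import numpy as np

def kovacs_force_constants(ref_coords, c=40.0, r0=3.8):
    """K_ij = C * (r0 / r_ij)**6 for every pair i != j (Eq. 11)."""
    diff = ref_coords[:, None, :] - ref_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    k = np.zeros_like(dist)
    off_diagonal = ~np.eye(len(ref_coords), dtype=bool)
    k[off_diagonal] = c * (r0 / dist[off_diagonal])**6   # no cutoff needed
    return k
```

With these defaults a pair separated by 3.8 Å gets the full 40 kcal/(mol Å²), while doubling the distance reduces the constant by a factor of 64, which is what makes an explicit cutoff unnecessary.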
Different authors have tried to improve harmonic potentials by, for
example, defining rigid blocks (Tama et al., 2000), by scaling differently
covalent and non-covalent contacts (Kondrashov et al., 2006), by adding
short-range terms (Moritsugu and Smith, 2007), or by defining
distinguished chain interactions by a bond-cutoff (Jeong et al., 2006). Following
these directions, we have recently (Orellana et al., 2010) developed a
hybrid approach calibrated using a large database of atomistic MD
trajectories of representative proteins (Rueda et al., 2007a; Meyer et al., 2010).
The analysis of MD simulations showed that the topology of
nearest-neighbor interactions, the basis of the secondary structure, is the main
component in the large motions traced by ENM (Rueda et al., 2007a;
Meyer et al., 2010). Accordingly, the method (named essential-dynamics
elastic network model, ed-ENM; Orellana et al., 2010) treats the sequential
and nonsequential (“Cartesian”) contacts differently. For the first M
sequential contacts, a fully connected matrix is used, while Cartesian
contacts are treated using a continuum distance-dependent function
with a calibrated size-dependent cutoff, which helps to remove artifactual
long-range interactions. Therefore, the elements of the topology matrix
are defined as:
$$ \Gamma_{ij} = \begin{cases} -1 & \text{if } S_{ij} \le M \\ 1 & \text{otherwise, if } r_{ij} \le r_c \\ 0 & \text{otherwise} \end{cases} \tag{12} $$

and the matrix Γ always has 2M + 1 nonzero-diagonal entries defining
neighbor chained contacts. Accordingly, the force constants $K_{ij}$ depend
not only on the Cartesian but also on the sequential distance:

$$ K_{ij} = \begin{cases} C^{\mathrm{seq}} / S_{ij}^{\,n_s} & \text{if } S_{ij} \le M \\ \left( C^{\mathrm{cart}} / r_{ij} \right)^{n_c} & \text{otherwise, if } r_{ij} \le r_c \\ 0 & \text{otherwise} \end{cases} \tag{13} $$

where values for all terms ($n_s = 2$ and $C^{\mathrm{seq}} = 60$ kcal/(mol Å²); $n_c = 6$ and
$C^{\mathrm{cart}} = 6$ kcal/(mol Å²), in energy units) were obtained by fitting to
apparent force constants and structural variance profiles obtained in a large
[Figure 5 schematic: four consecutive Cα atoms (Cα i, Cα i + 1, Cα i + 2, Cα i + 3) connected by springs. Labels: i, i + 1 (≈100 kcal/(mol Å²)); i, i + 2 (≈10 kcal/(mol Å²)); i, i + 3 (≈1 kcal/(mol Å²)).]
FIG. 5. Formulation of the ed-ENM model. The ed-ENM is a nearest-neighbor-based model, maintaining the secondary structure stereochemistry, where the three first-order constants acquire values close to a 100:10:1 ratio.
number of atomistic MD simulations. A value of M = 3 was used for
sequential interactions based on MD simulations, which were also
instrumental in defining the cutoff radius ($r_c$), which is computed using an
empirical logarithmic relationship with the size of the protein. This formalism
guarantees sequential contacts whose strength decays quickly with the number of
connecting bonds (see Fig. 5) and a continuous decay of the strength of
Cartesian contacts up to a cutoff. Attempts to improve the formalism by
adding “color” to the topological relations, that is, different spring
constants for different physical interactions, or by adding differential weights
to different secondary-structure elements, did not yield clear improvements in
the results (Orellana et al., 2010).
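The piecewise force constants of Eq. (13) can be sketched as follows. This is an illustration under stated assumptions, not the reference ed-ENM implementation: it assumes a single chain whose residues are consecutive in the coordinate array (so that the sequential distance is simply |i − j|), it reads the two branches of Eq. (13) as $C^{\mathrm{seq}}/S_{ij}^{n_s}$ and $(C^{\mathrm{cart}}/r_{ij})^{n_c}$, and it takes the cutoff radius as an input, since the logarithmic size relationship is not given in the text. The function name is hypothetical.

```python
import numpy as np

def ed_enm_force_constants(ref_coords, r_cut, m=3,
                           c_seq=60.0, n_seq=2, c_cart=6.0, n_cart=6):
    """Sequential and Cartesian spring constants following Eq. (13)."""
    n = len(ref_coords)
    idx = np.arange(n)
    seq_sep = np.abs(idx[:, None] - idx[None, :])          # S_ij
    diff = ref_coords[:, None, :] - ref_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                   # r_ij

    k = np.zeros((n, n))
    sequential = (seq_sep >= 1) & (seq_sep <= m)           # first M neighbors
    cartesian = (seq_sep > m) & (dist <= r_cut)            # remaining contacts
    k[sequential] = c_seq / seq_sep[sequential].astype(float)**n_seq
    k[cartesian] = (c_cart / dist[cartesian])**n_cart
    return k
```

Pairs beyond both the sequential window and the cutoff keep a zero force constant, so the resulting matrix combines a dense band around the diagonal with sparse, rapidly decaying off-band entries.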
The hybrid ENM outlined above can work coupled to any sampling technique (see below) and provides quite accurate representations of the near-equilibrium dynamic properties of proteins at both the global (essential
dynamics, global variance) and local levels (B-factor distribution), representing a significant improvement with respect to simpler schemes (see Fig. 6).
C. Flat Potentials
In recent years, discontinuous flat potentials (also named
stepwise potentials) have gained popularity due to their use in discrete MD
sampling algorithms (Zhou and Karplus, 1999; Ding et al., 2005;