5.2 Estimation of the triplet structure invariant via its first representation: the P1 and the P1̄ case
is to study the joint probability distribution
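$$P(E_{\mathbf{h}_1}, E_{\mathbf{h}_2}, E_{\mathbf{h}_3}) \equiv P(R_{\mathbf{h}_1}, R_{\mathbf{h}_2}, R_{\mathbf{h}_3}, \phi_{\mathbf{h}_1}, \phi_{\mathbf{h}_2}, \phi_{\mathbf{h}_3}). \tag{5.2}$$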
According to Section 4.1 we must first calculate the characteristic function
C and then, by Fourier inversion, recover the distribution (5.2). Because of
the importance of the triplet invariant, we report the necessary calculations in
Appendix 5.A. The resulting distribution is
$$P(R_1, R_2, R_3, \phi_1, \phi_2, \phi_3) = \frac{R_1 R_2 R_3}{\pi^3} \exp\left[-R_1^2 - R_2^2 - R_3^2 + C\cos(\phi_1 + \phi_2 + \phi_3)\right], \tag{5.3}$$
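where $R_1, R_2, R_3, \phi_1, \phi_2, \phi_3$ stand for $R_{\mathbf{h}_1}, R_{\mathbf{h}_2}, R_{\mathbf{h}_3}, \phi_{\mathbf{h}_1}, \phi_{\mathbf{h}_2}, \phi_{\mathbf{h}_3}$, respectively,
$$C = \frac{2 R_1 R_2 R_3}{\sqrt{N_{eq}}}, \tag{5.4}$$
$$N_{eq} = \sigma_2^3/\sigma_3^2, \tag{5.5}$$
and
$$\sigma_n = \sum_{j=1}^{N} Z_j^n.$$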
N is the number of atoms in the unit cell and Zj is the atomic number of the
jth atom. If all of the atoms are of the same species (and have similar thermal
displacement), then $N_{eq} \equiv N$ and
$$C = \frac{2 R_1 R_2 R_3}{\sqrt{N}}.$$
The simultaneous presence of heavy and light atoms in the unit cell makes
Neq < N (see Section 5.3).
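Definitions (5.4) and (5.5) are easy to evaluate directly. The following is a minimal Python sketch, not code from the book: the function names and the small atomic-number table are our own, and hydrogen atoms are included in the sums (an assumption; their contribution is small). It uses the JAMILAS composition quoted in Section 5.3, for which the text reports a value of Neq of 55.

```python
import math

# Atomic numbers for the elements used below (illustrative table, not exhaustive).
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16, "K": 19}

def sigma(composition, n):
    """sigma_n = sum over all atoms of Z_j**n; composition maps element -> count."""
    return sum(count * ATOMIC_NUMBER[el] ** n for el, count in composition.items())

def n_equivalent(composition):
    """N_eq = sigma_2**3 / sigma_3**2, equation (5.5)."""
    return sigma(composition, 2) ** 3 / sigma(composition, 3) ** 2

def cochran_c(r1, r2, r3, n_eq):
    """C = 2 R1 R2 R3 / sqrt(N_eq), equation (5.4)."""
    return 2.0 * r1 * r2 * r3 / math.sqrt(n_eq)

# Unit-cell content of JAMILAS as quoted in Section 5.3 (hydrogens included here).
jamilas = {"K": 4, "C": 64, "H": 68, "N": 8, "O": 20, "S": 4}
n_eq = n_equivalent(jamilas)
print(f"N_eq = {n_eq:.1f}")
print(f"C for R1 = R2 = R3 = 2: {cochran_c(2.0, 2.0, 2.0, n_eq):.2f}")
```

For an equal-atom composition the same function returns the number of atoms, in line with the statement that Neq ≡ N in that case.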
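From (5.3) the conditional distribution $P(\Phi \mid R_{\mathbf{h}_1}, R_{\mathbf{h}_2}, R_{\mathbf{h}_3})$ of the triplet phase
$\Phi = \phi_1 + \phi_2 + \phi_3$ may be obtained (abbreviated to $P(\Phi)$; Cochran, 1955):
$$P(\Phi) = [2\pi I_0(C)]^{-1} \exp(C\cos\Phi), \tag{5.6}$$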
which may also be written as
$$P(\Phi) = M(\Phi; 0, C),$$
where
$$M(\Phi; \theta, C) = [2\pi I_0(C)]^{-1} \exp[C\cos(\Phi - \theta)]$$
is the von Mises distribution for the variable $\Phi$, centred at $\theta$, with concentration parameter equal to C.
Equation (5.6) is plotted in Fig. 5.1, from which we observe:
(i) $I_0$ is the modified Bessel function of order 0 (see Appendix M.E). The factor
$[2\pi I_0(C)]^{-1}$ is a scaling factor, ensuring that $\int_0^{2\pi} P(\Phi)\,d\Phi = 1$.
(ii) Equation (5.6) has its maximum at $\Phi = 0$ (where $\cos\Phi = 1$). It may be
concluded that the expected value of $\Phi$ is always zero.
(iii) The sharpest curves are obtained in correspondence with the largest values of C. Thus the statistical indication $\Phi \approx 0$ is reliable only if C is
sufficiently large. This condition is satisfied if all three Rs are sufficiently
large and N is sufficiently small.
The probabilistic estimation of triplet and quartet invariants
Fig. 5.1 The Cochran distribution $P(\Phi)$ for a triplet phase invariant, for different values of the parameter C (curves for C = 0, 1, 2, 4, 6; $\Phi$ from −180° to 180°).
(iv) If at least one of the Rs is zero, then $P(\Phi) = (2\pi)^{-1}$: no phase indication
is obtained.
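Observations (i) to (iv) can be checked numerically. The sketch below is illustrative only: the function names are ours, the Bessel function is evaluated from its power series (adequate for the moderate values of C considered here), and the integral is approximated by a simple midpoint rule.

```python
import math

def bessel_i(order, x, terms=40):
    """Modified Bessel function I_order(x) from its power series (moderate x only)."""
    half = x / 2.0
    return sum(half ** (2 * k + order) / (math.factorial(k) * math.factorial(k + order))
               for k in range(terms))

def cochran_pdf(phi, c):
    """P(Phi) = exp(C cos Phi) / (2 pi I0(C)), equation (5.6); phi in radians."""
    return math.exp(c * math.cos(phi)) / (2.0 * math.pi * bessel_i(0, c))

def integrate_over_circle(f, n=720):
    """Midpoint rule over one full period [0, 2 pi)."""
    step = 2.0 * math.pi / n
    return sum(f((i + 0.5) * step) for i in range(n)) * step

for c in (0.0, 1.0, 2.0, 4.0, 6.0):  # the C values of Fig. 5.1
    total = integrate_over_circle(lambda phi: cochran_pdf(phi, c))
    print(f"C = {c:3.1f}  integral = {total:.6f}  P(0) = {cochran_pdf(0.0, c):.3f}")
```

With C = 0 (at least one vanishing R) the density is flat at 1/(2π), and the peak at Φ = 0 grows with C.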
The statement $\Phi \approx 0$ is a statistical expectation; it does not mean that $\Phi$
must be zero. To better understand this point, let us calculate the following
mean values:
$$\langle\cos\Phi\rangle = \int_0^{2\pi} \cos\Phi\, P(\Phi)\, d\Phi = D_1(C), \tag{5.7}$$
$$\langle\sin\Phi\rangle = \int_0^{2\pi} \sin\Phi\, P(\Phi)\, d\Phi = 0, \tag{5.8}$$
where $D_1(C) = I_1(C)/I_0(C)$ is the ratio of the modified Bessel functions
of order 1 and 0, respectively (see Fig. 5.2). According to (5.7),
$\langle\cos\Phi\rangle$ is always smaller than 1, and is close to 1 only if C is large.
However, as for any statistical indication, $\cos\Phi$ may actually be
negative for a given triplet, even if C is positive and large.
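To make the last point concrete, the following sketch (our own helper names; series Bessel functions and midpoint integration are implementation choices, not the book's) evaluates D1(C), checks (5.7) numerically, and computes the fraction of the Cochran distribution that lies where cos Φ is negative.

```python
import math

def bessel_i(order, x, terms=40):
    """Modified Bessel function I_order(x) from its power series (moderate x only)."""
    half = x / 2.0
    return sum(half ** (2 * k + order) / (math.factorial(k) * math.factorial(k + order))
               for k in range(terms))

def d1(c):
    """D1(C) = I1(C) / I0(C), the expected value of cos(Phi) in equation (5.7)."""
    return bessel_i(1, c) / bessel_i(0, c)

def mean_cos(c, n=720):
    """Numerical <cos Phi> under (5.6), midpoint rule over [0, 2 pi)."""
    step = 2.0 * math.pi / n
    norm = 2.0 * math.pi * bessel_i(0, c)
    return sum(math.cos(p) * math.exp(c * math.cos(p))
               for p in ((i + 0.5) * step for i in range(n))) * step / norm

def prob_negative_cosine(c, n=2000):
    """Probability that cos(Phi) < 0, i.e. the integral of (5.6) over (pi/2, 3 pi/2)."""
    step = math.pi / n
    norm = 2.0 * math.pi * bessel_i(0, c)
    return sum(math.exp(c * math.cos(math.pi / 2.0 + (i + 0.5) * step))
               for i in range(n)) * step / norm

for c in (0.5, 1.0, 2.0, 4.0, 6.0):
    print(f"C = {c:3.1f}  D1 = {d1(c):.3f}  numerical <cos> = {mean_cos(c):.3f}  "
          f"P(cos < 0) = {prob_negative_cosine(c):.4f}")
```

The probability of a negative cosine decreases with C but never reaches zero, which is why some triplets with large C still violate the expectation.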
Fig. 5.2 The functions $D_1(x)$ and $D_2(x)$, plotted for x between 0 and 10.
According to (5.8), the Cochran relationship is unable to fix the enantiomorph. Thus, if $\langle\cos\Phi\rangle = \cos q$, equation (5.8) says that $+q$ and $-q$ have
equal probability, and coherently gives $\langle\Phi\rangle = 0$ and $\langle\sin\Phi\rangle = 0$.
A last remark may be useful for readers not familiar with the von Mises
distribution. For circular variables, it plays a role similar to that played by the
Gaussian function for linear variables. In particular, the von Mises distribution
is marked by maximum likelihood and by maximum entropy characterization
(Mardia, 1972). While the normal distribution along a line has useful mathematical and statistical properties, this is not true for a normal distribution along
a circle (i.e. in the case of directional data). Indeed, in the theory of circular
variables, a normal distribution (as well as the most significant distributions
on a line, e.g. Cauchy, Poisson, etc.) is wrapped around the circumference of a
circle of unit radius, thus producing the so-called wrapped distribution.
Let us now consider the $P\bar{1}$ case (Cochran and Woolfson, 1955): $E_1, E_2, E_3$
are now real numbers, and according to Appendix 5.A the following joint
probability distribution is obtained:
$$P(E_1, E_2, E_3) = \frac{1}{(2\pi)^{3/2}} \exp\left[-\frac{1}{2}(E_1^2 + E_2^2 + E_3^2) + \frac{E_1 E_2 E_3}{\sqrt{N_{eq}}}\right].$$
In this case the phase problem reduces to a sign problem. The probability that
the sign of $E_1 E_2 E_3$ is plus is given (but for a scaling term) by
$$P_+ \approx \exp\left(+\frac{R_1 R_2 R_3}{\sqrt{N_{eq}}}\right),$$
and the probability that it is minus is given (but for a scaling term) by
$$P_- \approx \exp\left(-\frac{R_1 R_2 R_3}{\sqrt{N_{eq}}}\right).$$
Since it must be that $P_+ + P_- = 1$, the rescaled value of the positive sign
probability is
$$P^+ = (1 + P_-/P_+)^{-1} = \left[1 + \exp\left(-\frac{2 R_1 R_2 R_3}{\sqrt{N_{eq}}}\right)\right]^{-1}. \tag{5.9}$$
Since
$$(1 + e^{-2x})^{-1} = e^{x}/(e^{x} + e^{-x}) = \frac{1}{2} + \frac{1}{2}\tanh x,$$
from (5.9),
$$P^+ = \frac{1}{2} + \frac{1}{2}\tanh\left(\frac{R_1 R_2 R_3}{\sqrt{N_{eq}}}\right) \tag{5.10}$$
is obtained (see Fig. 5.3). As for the acentric case we notice:
(i) $P^+$ is always larger than 1/2, unless some of $R_1, R_2, R_3$ are vanishing.
(ii) The reliability of the sign indication is large only for large values of $R_1 R_2 R_3/\sqrt{N_{eq}}$.
(iii) The efficiency of (5.10) decays with the size of the structure.
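A short sketch of (5.9) and (5.10) follows. The function names and the sample magnitudes are hypothetical, chosen only to show that the two forms coincide and that the sign indication weakens as Neq grows.

```python
import math

def p_plus(r1, r2, r3, n_eq):
    """Equation (5.10): probability that the triplet sign is positive."""
    x = r1 * r2 * r3 / math.sqrt(n_eq)
    return 0.5 + 0.5 * math.tanh(x)

def p_plus_from_ratio(r1, r2, r3, n_eq):
    """Equation (5.9): the same probability written as 1 / (1 + P_minus / P_plus)."""
    x = r1 * r2 * r3 / math.sqrt(n_eq)
    return 1.0 / (1.0 + math.exp(-2.0 * x))

# Hypothetical magnitudes R1 = R2 = R3 = 2 in structures of increasing size.
for n_eq in (50, 200, 1000, 5000):
    print(f"N_eq = {n_eq:5d}  P+ (5.10) = {p_plus(2.0, 2.0, 2.0, n_eq):.4f}  "
          f"P+ (5.9) = {p_plus_from_ratio(2.0, 2.0, 2.0, n_eq):.4f}")
```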
Triplet estimation in space groups with symmetry higher than triclinic is
described briefly in Appendix 5.B.
Fig. 5.3 Centrosymmetric space groups. $P^+(X)$ is the probability that the triplet sign is positive, according to equation (5.10), and $X = R_1 R_2 R_3/\sqrt{N_{eq}}$ (plotted for X between 0 and 8). $P^+$ is always equal to or larger than 1/2.
5.3 About triplet invariant reliability
The relationships (5.6) and (5.10) have been obtained by making use of two
basic assumptions: the structure is composed of discrete atoms (atomicity postulate) and the electron density is everywhere real and positive (positivity
postulate). For X-rays, positivity and atomicity are implicit in the positivity
of the atomic scattering factor f . It is, however, worthwhile noticing that when
triplets are to be estimated for neutron diffraction data (see Chapter 11), the
positivity postulate may be violated and relations (5.6) and (5.10) are no longer
valid. In an analogous way dispersion effects could introduce complex scattering factors ($f_j = f'_j + i f''_j$): in this case also, the probabilistic theory for triplet
estimation should be reformulated (Hauptman, 1982a,b; Giacovazzo, 1983b;
see Chapter 15).
In this section we focus our attention only on X-ray data: we wish to enquire
about the range of structural complexity inside which equations (5.6) and
(5.10) may be usefully applied. Since $\langle R^2\rangle = 1$ by definition, the R values
do not change their order of magnitude, no matter how complex the structure is. Therefore, the only parameter in C which changes size with structural
complexity is $2/\sqrt{N_{eq}}$: this parameter influences the average efficiency of the
triplet relationships. In more detail:
1. For crystal structures where non-hydrogen atoms are nearly equal, Neq is
almost equal to the number of non-hydrogen atoms in the unit cell (this
is only valid for X-ray data). Therefore, hydrogen atoms could even be
omitted from the calculation of Neq .
2. N > Neq when heavy and light atoms coexist in the unit cell. The difference
becomes large with increasing values of the ratio
atomic number of heavy atom:atomic number of light atoms.
For example, JAMILAS [K4 C64 H68 N8 O20 S4 , space group P1] is a small
structure with N = 100 non-hydrogen atoms in the unit cell; the corresponding value of Neq is 55. This result indicates that crystal structures
with a large number of light atoms and a few heavy atoms are more easily
solvable by direct methods than structures of the same size but without
heavy atoms.
3. For unit cells with a large number of atoms, C is small for most of the
triplets; correspondingly, extremely broad probability distributions (5.6) are
expected. The consequence is that few triplet phases are really close to
zero, while the majority are dispersed over the interval (0, 2π). If the structure size
is small, a high percentage of triplet phases will be close to zero.

Table 5.1 Schwarz [C46 H70 O27 , P1]: statistical results on triplet estimates
(Cochran formula). nr is the number of triplets with Cochran parameter C > THR,
<|Φ|> is the corresponding average value of |Φ| (in degrees), and % is the percentage
of triplets with positive value of cos Φ

THR    nr      <|Φ|> (°)    %
0.4    5117    41           90
1.2    4572    40           91
2.0    1552    30           96
2.6    570     27           98
3.8    81      19           100

Table 5.1 shows some statistical calculations for the Schwarz [C46 H70 O27 ,
space group P1] structure, showing how Φ is distributed versus C. The table
entries may be interpreted as follows:
(i) There are 81 triplets for which C > 3.8; for these, the average value of
|Φ| is 19° (in this case the condition C > 3.8 selects triplets with phase
really close to zero), and cos Φ is always positive.
(ii) There are 570 triplets with C > 2.6; for these, the average value of |Φ|
is 27°.
Data in Table 5.1 may be usefully compared with data in Table 5.2, where we
show similar statistics for a small protein (1e8a; space group R3, 182 residues,
corresponding to 1472 non-hydrogen atoms in the asymmetric unit; data resolution 1.95 Å). Only 92 triplets reach a C value larger than 0.5, and the percentage
of triplets which deviate from the Cochran expectation Φ ≈ 0 is very high.
Table 5.2 1e8a. Statistical results on triplet estimates (Cochran formula). nr is the number of triplets with Cochran parameter C > THR, <|Φ|> is the corresponding average
value of |Φ| (in degrees), and % is the percentage of triplets with positive value of cos Φ

THR    nr        <|Φ|> (°)    %
0.1    300000    86           54
0.2    79494     84           55
0.3    7355      83           56
0.4    759       78           59
0.5    92        78           59

Apparently, the structural complexity does not allow selection of reliable triplet
invariants, with obvious consequences in the phasing steps.
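The contrast between Tables 5.1 and 5.2 follows from the factor 2/√Neq in (5.4). As a rough illustration (the structure sizes below are hypothetical and are not the Neq values of the two test structures), this sketch shows how C for a fixed strong triplet shrinks with size, and how large the product R1R2R3 would have to be to keep C at a useful level.

```python
import math

def cochran_c(r_product, n_eq):
    """C = 2 R1 R2 R3 / sqrt(N_eq), equation (5.4), given the product R1 R2 R3."""
    return 2.0 * r_product / math.sqrt(n_eq)

def required_product(c_target, n_eq):
    """Product R1 R2 R3 needed to reach a chosen value of C."""
    return c_target * math.sqrt(n_eq) / 2.0

# Hypothetical sizes; R1 = R2 = R3 = 2 (product 8) stands for a strong triplet.
for n_eq in (50, 500, 5000, 50000):
    print(f"N_eq = {n_eq:6d}  C for R1R2R3 = 8: {cochran_c(8.0, n_eq):5.2f}  "
          f"R1R2R3 needed for C = 2: {required_product(2.0, n_eq):6.1f}")
```

Because the R values keep the same order of magnitude whatever the size of the structure, products as large as those in the last column become increasingly rare, and few triplets pass any useful threshold on C.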
5.4 The estimation of triplet phases via their second representation
The Cochran formula (5.6) estimates triplet phases (5.1) by exploiting only the
information contained in three diffraction moduli; any Φ is expected to be close
to zero (modulo 2π), and there is no chance of recognizing bad triplets (i.e. triplet phases
close to ±π/2 or with negative cosine values). Recognizing them is of paramount importance
to the efficiency of the phasing process. We will see in Chapter 6 that the
occurrence of a relatively large number of bad triplets in the phasing process
can lead to its failure. Conversely, the probability of finding the correct set
of phases is enhanced if bad triplets are recognized; they should be excluded
from the structure solving process or actively used in a correct manner.
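The representation theory, described in Chapter 4, indicates how information
contained in all of the reciprocal space may be used to improve the Cochran
estimates. In accordance with Section 4.2, the second representation of Φ is a
collection of special quintets,
$$\{\Phi\}_2 = \{\Phi + \phi_{\mathbf{k}} - \phi_{\mathbf{k}}\}, \tag{5.11}$$
where k is a free vector in reciprocal space. The basis magnitudes of any
quintet in $\{\Phi\}_2$ are
$$R_{\mathbf{h}_1}, R_{\mathbf{h}_2}, R_{\mathbf{h}_3}, R_{\mathbf{k}}$$
and the cross-magnitudes are
$$R_{\mathbf{h}_1 \pm \mathbf{k}}, R_{\mathbf{h}_2 \pm \mathbf{k}}, R_{\mathbf{h}_3 \pm \mathbf{k}}.$$
The collection of the basis and cross-magnitudes of the various quintets
is $\{B\}_2$, and is called the second phasing shell of Φ:
$$\{B\}_2 = \{R_{\mathbf{h}_1}, R_{\mathbf{h}_2}, R_{\mathbf{h}_3}, R_{\mathbf{k}}, R_{\mathbf{h}_1 \pm \mathbf{k}}, R_{\mathbf{h}_2 \pm \mathbf{k}}, R_{\mathbf{h}_3 \pm \mathbf{k}}\}.$$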
These results suggest, for P1 and $P\bar{1}$, a study of the ten-variate probability
distribution
$$P(E_{\mathbf{h}_1}, E_{\mathbf{h}_2}, E_{\mathbf{h}_3}, E_{\mathbf{k}}, E_{\mathbf{h}_1+\mathbf{k}}, E_{\mathbf{h}_2+\mathbf{k}}, E_{\mathbf{h}_3+\mathbf{k}}, E_{\mathbf{h}_1-\mathbf{k}}, E_{\mathbf{h}_2-\mathbf{k}}, E_{\mathbf{h}_3-\mathbf{k}}), \tag{5.12}$$
from which the conclusive conditional distribution,
$$P(\Phi \mid 10\ \text{moduli}), \tag{5.13}$$
is obtained. Equations (5.12) and (5.13) may be calculated by means of the
techniques described in Chapter 4. Since k is a free vector, a formula can be
found which provides the conditional probability distribution of Φ given the
basis and cross-moduli of any quintet of the second representation. We will denote such a probability
$P_{10}(\Phi)$, in order to emphasize the fact that the formula explores the reciprocal
space by means of a ten-node figure. Three nodes (i.e. $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$) are fixed
while k varies; the remaining seven nodes sweep out reciprocal space.
Fig. 5.4 $P_{10}(\Phi)$ according to equation (5.14), for Φ between −180° and 180°. We choose G = 3 (continuous line) and G = −2 (dashed line).
The final probabilistic formula (Cascarano et al., 1984; Burla et al., 1989a)
is of a von Mises type, and may be written as
$$P_{10}(\Phi) = [2\pi I_0(G)]^{-1} \exp(G\cos\Phi), \tag{5.14}$$
where G is a concentration parameter which depends on hundreds or thousands
of magnitudes, and may be positive or negative. If G > 0, the expected value
of Φ is zero; if G < 0, the expected value of Φ is π. Unlike the Cochran relationship, $P_{10}(\Phi)$ is able to identify negative triplet cosines. Two distributions
(5.14), one corresponding to a positive and the other to a negative value of G,
are shown in Fig. 5.4: it is evident that, when G < 0, the value of Φ is probably
closer to π than to 0.
For centrosymmetric space groups the triplet sign may be estimated by
$$P^+ = \frac{1}{2} + \frac{1}{2}\tanh\left(\frac{G}{2}\right) \tag{5.15}$$
as a substitute for equation (5.10). Since G may also be negative, positive
and negative triplets may be identified. Correspondingly, Fig. 5.3 may be
generalized into Fig. 5.5, allowing values of $P^+$ smaller than 1/2.
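Equations (5.14) and (5.15) can be sketched as follows. G is taken here as a given number (its formal expression is in Appendix 5.C and is not reproduced), and the function names are ours. The values G = 3 and G = −2 are those of Fig. 5.4.

```python
import math

def bessel_i0(x, terms=40):
    """Modified Bessel function I0(x) from its power series; I0 is even in x."""
    half = x / 2.0
    return sum(half ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def p10_pdf(phi, g):
    """P10(Phi) = exp(G cos Phi) / (2 pi I0(G)), equation (5.14); phi in radians."""
    return math.exp(g * math.cos(phi)) / (2.0 * math.pi * bessel_i0(g))

def expected_phase(g):
    """Most probable triplet phase: 0 for G > 0, pi for G < 0, undefined for G = 0."""
    if g > 0:
        return 0.0
    if g < 0:
        return math.pi
    return None

def p_plus_p10(g):
    """Equation (5.15): probability of a positive triplet sign (centrosymmetric case)."""
    return 0.5 + 0.5 * math.tanh(g / 2.0)

for g in (3.0, -2.0):
    print(f"G = {g:+.1f}  P10(0) = {p10_pdf(0.0, g):.3f}  "
          f"P10(pi) = {p10_pdf(math.pi, g):.3f}  P+ = {p_plus_p10(g):.3f}")
```

A negative G moves the maximum of the distribution to Φ = π and pushes P+ below 1/2, which is the behaviour the Cochran formula cannot reproduce.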
For the interested reader, a formal expression of G, including symmetry
effects, is given in Appendix 5.C, where we also compare the efficiencies of
the Cochran and the P10 formulas. Because of its superiority, the P10 formula
has been fully integrated in the SIR suite of phasing programs starting from
SIR88 (Burla et al., 1989a).

Fig. 5.5 $P^+$ in accordance with equation (5.15), plotted for X between −6 and 6. $P^+$ is larger or smaller than 1/2, according to whether G is positive or negative.

5.5 Introduction to quartets
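Four phases are said to form the quartet invariant,
$$\Phi = \phi_{\mathbf{h}_1} + \phi_{\mathbf{h}_2} + \phi_{\mathbf{h}_3} + \phi_{\mathbf{h}_4},$$
if
$$\mathbf{h}_1 + \mathbf{h}_2 + \mathbf{h}_3 + \mathbf{h}_4 = 0.$$
Hauptman and Karle (1953) and Simerska (1956), independently, suggested that
Φ would be approximately zero for large values of $R_{\mathbf{h}_1} R_{\mathbf{h}_2} R_{\mathbf{h}_3} R_{\mathbf{h}_4}$.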
The use of quartets in direct procedures for phase solution was first introduced by Schenk (1973a,b, 1974), who, from semi-empirical observations
on the moduli Rh1 +h2 , Rh1 +h3 , Rh2 +h3 , derived useful conditions for improving
estimation of the relation Φ ≈ 0. Probabilistic theories for quartet estimation from the first phasing shell were, independently, described for P1 by
Hauptman (1975a,b) and by Giacovazzo (1976b,c). Theories for P1̄ were given
by Giacovazzo (1975a, 1976a), Green and Hauptman (1976), and Hauptman
and Green (1976). A general probabilistic theory of quartets valid in all space
groups was given by Giacovazzo (1976d).
Both Hauptman’s and Giacovazzo’s approaches use the first phasing shell,
Rh1 , Rh2 , Rh3 , Rh4 , Rh1 +h2 , Rh1 +h3 , Rh2 +h3 , to estimate quartets; mainly, they
differ because the second author has used the Gram–Charlier expansion of
the characteristic function (see Appendix 4.A). For brevity we will use the
following notation:
Ri = Rhi , φi = φhi for i = 1, . . . , 4,
R5 = Rh1 +h2 , R6 = Rh1 +h3 , R7 = Rh2 +h3 ,
φ5 = φh1 +h2 , φ6 = φh1 +h3 , φ7 = φh2 +h3 .
5.6 The estimation of quartet invariants in P1 and P1̄ via their first representation: Hauptman approach
Hauptman derived in P1 the following conditional distribution:
$$P(\Phi \mid R_1, \ldots, R_7) \approx \frac{1}{L} \exp(-4C\cos\Phi)\, I_0(R_5 Z_5)\, I_0(R_6 Z_6)\, I_0(R_7 Z_7), \tag{5.16}$$
where $I_0(x)$ is the modified Bessel function of order zero,
$$C = R_1 R_2 R_3 R_4/N, \tag{5.17}$$
$$Z_5 = \frac{2}{\sqrt{N}} (R_1^2 R_2^2 + R_3^2 R_4^2 + 2NC\cos\Phi)^{1/2}, \tag{5.18a}$$
$$Z_6 = \frac{2}{\sqrt{N}} (R_1^2 R_3^2 + R_2^2 R_4^2 + 2NC\cos\Phi)^{1/2}, \tag{5.18b}$$
$$Z_7 = \frac{2}{\sqrt{N}} (R_2^2 R_3^2 + R_1^2 R_4^2 + 2NC\cos\Phi)^{1/2}. \tag{5.18c}$$
As for the triplet invariants, distribution (5.16) depends on cos Φ; therefore
only cos Φ may be estimated, it being impossible to distinguish between +Φ
and −Φ (or, in other words, to distinguish between the two enantiomorphs).
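Since L, the scaling factor, has a rather complicated expression, one might
use numerical methods for calculating:
1. the scaling factor L, via the condition
$$\int_0^{\pi} P(\Phi)\, d\Phi = 1;$$
2. the mode $\Phi_m$ of $P(\Phi)$;
3. the mean value, given by
$$\langle\Phi\rangle = \int_0^{\pi} \Phi\, P(\Phi)\, d\Phi;$$
4. the variance, V, as given by
$$V = \int_0^{\pi} (\Phi - \langle\Phi\rangle)^2 P(\Phi)\, d\Phi;$$
5. the standard deviation $\sigma = \sqrt{V}$.

Fig. 5.6 (panel legend) R1 = 2.27, R2 = 3.01, R3 = 2.49, R4 = 2.16, R5 = 1.85, R6 = 2.84, R7 = 1.90; vertical axis P(Φ).
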
Estimation of |Φ|, via (5.16), depends on an intricate interrelationship among
all the seven magnitudes. However, some working rules can be stated:
1. $P(\Phi)$ is unimodal between 0 and π, and $\Phi_m$ can, in principle, lie anywhere
between 0 and π;
2. if the cross-magnitudes are large, Φ is expected to be close to zero;
3. if the cross-magnitudes are small, Φ is expected to be close to π;
4. if the cross-magnitudes are of medium size and N is sufficiently small, then
Φ is expected to be close to ±π/2;
5. the larger N, the larger the overall variance associated with quartet phase
estimation.
Figures 5.6 and 5.7 show (broken curves) the distribution (5.16) for some
values of the seven magnitudes when N = 47. In Fig. 5.6, where all the
cross-magnitudes are large, $\Phi_m = 0.0$, $\langle\Phi\rangle = 29°$, $\sigma = V^{1/2} = 21.9°$. In Fig.
5.7, where all the cross-magnitudes are small, $\Phi_m = 180°$, $\langle\Phi\rangle = 142°$, $\sigma = 32.7°$.
It is clear from the figures that cosines estimated near π will (on average)
be in poorer agreement with the true values than the cosines estimated near
0, because of the relatively larger value of the variance. Even poorer will
be the estimates of the cosines located in the middle range (usually called
enantiomorph sensitive quartets); no useful application has been found for
them.
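The numerical procedure listed above (scaling factor, mode, mean, variance) can be sketched in a few lines. This is an illustrative implementation under our own assumptions: a trapezoidal rule on a uniform grid over [0, π], a power-series evaluation of I0, and hypothetical function names. The input magnitudes are those indicated for Figs. 5.6 and 5.7 with N = 47; we have not verified that the output reproduces the quoted means and standard deviations to the stated precision.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function I0(x) from its power series (moderate x only)."""
    half = x / 2.0
    return sum(half ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def quartet_statistics(r, n_atoms, n_grid=1801):
    """Normalize (5.16) on [0, pi]; return (mode, mean, sigma) in degrees."""
    r1, r2, r3, r4, r5, r6, r7 = r
    c = r1 * r2 * r3 * r4 / n_atoms                       # equation (5.17)

    def z(a, b, cos_phi):                                 # equations (5.18a-c)
        arg = a * a + b * b + 2.0 * n_atoms * c * cos_phi
        return 2.0 / math.sqrt(n_atoms) * math.sqrt(max(arg, 0.0))

    def unnormalized(phi):                                # (5.16) without 1/L
        cp = math.cos(phi)
        return (math.exp(-4.0 * c * cp)
                * bessel_i0(r5 * z(r1 * r2, r3 * r4, cp))
                * bessel_i0(r6 * z(r1 * r3, r2 * r4, cp))
                * bessel_i0(r7 * z(r2 * r3, r1 * r4, cp)))

    step = math.pi / (n_grid - 1)
    phis = [i * step for i in range(n_grid)]
    weights = [0.5 * step if i in (0, n_grid - 1) else step for i in range(n_grid)]
    values = [unnormalized(p) for p in phis]
    scale = sum(w * v for w, v in zip(weights, values))   # the scaling factor L
    probs = [v / scale for v in values]
    mode = phis[max(range(n_grid), key=probs.__getitem__)]
    mean = sum(w * p * q for w, p, q in zip(weights, phis, probs))
    var = sum(w * (p - mean) ** 2 * q for w, p, q in zip(weights, phis, probs))
    return tuple(math.degrees(v) for v in (mode, mean, math.sqrt(var)))

fig_5_6 = (2.27, 3.01, 2.49, 2.16, 1.85, 2.84, 1.90)  # large cross-magnitudes
fig_5_7 = (2.31, 2.82, 1.88, 2.10, 0.36, 0.24, 0.10)  # small cross-magnitudes
for label, r in (("Fig. 5.6", fig_5_6), ("Fig. 5.7", fig_5_7)):
    mode, mean, sigma = quartet_statistics(r, n_atoms=47)
    print(f"{label}: mode = {mode:.1f}, mean = {mean:.1f}, sigma = {sigma:.1f} (degrees)")
```

The grid search for the mode is deliberately crude; it is enough to show the mode moving from 0 to π as the cross-magnitudes go from large to small.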
The three cross-magnitudes are not always in the set of measured reflections.
Then, some marginal joint probability distributions must be considered in order
Fig. 5.6 Distribution (5.16) (broken curve) and (5.22) (continuous curve) for the indicated |E| values in a structure with N = 47 atoms in the unit cell (Φ from 0 to π).
Fig. 5.7 Distribution (5.16) (broken curve) and (5.22) (continuous curve) for the indicated |E| values (R1 = 2.31, R2 = 2.82, R3 = 1.88, R4 = 2.10, R5 = 0.36, R6 = 0.24, R7 = 0.10) in a structure with N = 47 atoms in the unit cell (P(Φ) plotted for Φ from 0 to π).