Chapter 9. Using Monte Carlo Simulation Approaches to Study Statistical Power With Missing Data
Statistical Power Analysis with Missing Data
Figure 9.1
Example of a problem that can be solved exactly.
for their areas and we have some way to relate those areas, we can solve the
problem easily and exactly. The area of the square is ASquare = l², where l
is the length of its sides. The area of the circle is ACircle = π(l/2)². With both
of these expressions, we can say that the proportion of the square that is
occupied by the circle is π/4, or a little more than 78%. The problems we have
considered to this point are similar to this one.
In contrast, consider Figure 9.2. Solving the ratio of areas of these two
shapes is much more difficult because we have little a priori information
helping us relate one area to another and the area of the inner shape is
highly irregular and would be difficult to characterize with a formula.
A Monte Carlo approach to finding the ratio of the areas would be quite
straightforward. We could print out a copy of the shapes, pin it to the wall,
and then proceed to throw darts at the wall. We would count up the number
of darts that landed inside the lightning bolt and the number of darts
that landed in the square but outside the lightning bolt; any darts that
missed the square entirely either (a) would not be counted, or (b) would
be thrown at the wall again until all of the darts were somewhere in the
square.
The proportion of darts landing inside the lightning bolt would approx‑
imate the proportion of the square’s area occupied by the lightning bolt:
Shazam! This approach would also work with the circle in Figure 9.1. One
advantage of this approach is that it allows us to solve even rather difficult
problems quite simply. The downside, however, is that this can be a fairly
labor‑intensive approach. To increase the accuracy of our approximate
Figure 9.2
Example of a problem that can be solved only approximately.
solution, it is necessary to “throw more darts,” often 10,000 or more. As
ever, there is no free lunch. The researcher is trading off between a small
number of very difficult calculations versus a very large number of much
simpler calculations. With a computer, we would use a random number
generator to provide us with two uncorrelated variables (say x and y) that
would uniformly cover the area of the square, which we could represent
as all pairs of coordinates between 0 and 1 on the x‑axis and y‑axis.
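As an illustrative sketch (not part of the original text), the dart-throwing experiment for Figure 9.1 can be simulated in a few lines of Python; the chapter's own examples use Stata, so this is merely a computational analogue. Pairs of uniform coordinates play the role of darts, and the fraction landing inside the inscribed circle approximates π/4.

```python
import random

# Monte Carlo estimate of the ratio of the circle's area to the square's
# area in Figure 9.1, which is known exactly to be pi/4 (about 0.785).
random.seed(20478813)  # arbitrary seed so the result is replicable

darts = 100_000
inside = 0
for _ in range(darts):
    # a "dart": a point uniformly distributed over the unit square
    x, y = random.random(), random.random()
    # inside the circle inscribed in the square (center (0.5, 0.5), radius 0.5)?
    if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
        inside += 1

estimate = inside / darts  # approximates pi/4
```

With 100,000 "darts" the estimate is typically within a few thousandths of π/4; accuracy improves only with the square root of the number of darts, which is exactly the labor-intensiveness the text describes.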
Both structural equation modeling with incomplete data and statisti‑
cal power analysis are situations that are ideally suited for this kind of
approach. In contrast to the situations that have been presented to this
point in the volume, the Monte Carlo approach provides the opportunity
to consider more complex situations (typically under a more limited set
of conditions) as well as how a model is likely to perform in practice as
opposed to in principle. As such, Monte Carlo methods can provide a very
useful “brute force” complement to the broader approaches we have out‑
lined to this point. For other problems, they may represent the only cur‑
rently practical solution.
Researchers interested in learning about Monte Carlo methods will
find no shortage of sources to consult for advice and examples and an
article by Metropolis (1987) provides a very interesting history of the
method. Fan, Felsővályi, Sivo, and Keenan (2001), for example, provide a
readily accessible introduction to conducting basic Monte Carlo studies
using SAS. Within a structural equation modeling framework, Bandalos
(2006) provides a thoughtful introduction to Monte Carlo research that
covers the most important considerations such as design and the range
Table 9.1
Paxton and Colleagues' Steps in Planning a Monte Carlo Study

Step  Task
 1    Developing a research question derived from theory
 2    Creating a valid model
 3    Designing (selecting) experimental conditions
 4    Selecting values of population parameters
 5    Selecting an appropriate software package
 6    Conducting the simulations
 7    File storage
 8    Troubleshooting and verification
 9    Summarizing results
of outcomes that are most often considered. Paxton and colleagues (2001)
outline a number of very useful guidelines for getting started with Monte
Carlo research, and Skrondal (2000) provides additional considerations
for increasing the potential external validity of results generated from a
Monte Carlo study. As well, many empirical studies of different aspects of
structural equation modeling rely on Monte Carlo methods and provide
a good resource for researchers interested in implementing these methods
(e.g., Arbuckle, 1996; Enders & Bandalos, 2001; Fan, 2003; Gerbing &
Anderson, 1993; Hu & Bentler, 1999; L. K. Muthén & Muthén, 2002; Sivo &
Willson, 2000).
Paxton and colleagues (2001) provide some very specific guidelines for
planning and conducting a Monte Carlo study within the structural equa‑
tion modeling framework, identifying nine key steps, as listed in Table 9.1.
What follows in this section corresponds directly with the guidelines
offered by Paxton and colleagues, because their recommendations are so
generally applicable.
It is critical that every Monte Carlo study begin with a theoretically
informed research question. If there is little compelling theoretical rea‑
son to investigate a phenomenon, then the results are unlikely to have
much scientific value. This step also helps to guide the range of conditions
that are worth investigating. External validity is also a key consideration
in designing a Monte Carlo study. Ensuring that the model under inves‑
tigation has relevance to the kind of situations typically studied in the
structural equation modeling framework increases the potential value of
a Monte Carlo study. Using published research to guide the selection of
models is a great place to start in this regard.
Selection of the specific conditions that will be manipulated by the
researcher is also of critical importance. Typical choices to be considered
include variables such as the sample size, number of latent and manifest
indicators, method of estimation, type and extent of model misspecifi‑
cation, type and extent of missing data, and of course the distributions
of the raw data. Once the conditions themselves have been selected, it
is also necessary to select appropriate values of the population param‑
eters, such as factor loadings and structural coefficients. Given the factorial
nature of most experimental designs, the number of conditions grows
geometrically with the number of factors, even before considering the
desired number of replications
for each condition. Again, theory and the empirical literature can help to
guide the specific factors and their levels that should be considered in any
simulation study. This trade‑off between the number of conditions under
investigation and the number of replications considered was one of the
factors that led Skrondal (2000) to question the “conventional wisdom” on
designing Monte Carlo studies. His perspective is that researchers should
consider a wider range of conditions using fewer replications under each
condition in order to increase the potential external validity of any spe‑
cific study. Though most Monte Carlo studies consider fixed values (e.g.,
specific sample sizes) of study factors, there is generally no reason that
they could not be treated as random factors, where generalizability is the
primary concern.
Choice of an appropriate software package for conducting a simulation
study is also an important consideration. Each program presents its own
strengths and limitations, but the primary consideration should be ease
and accuracy with which the researcher can conduct the actual study.
Programs differ with regard to the ease with which they can accommodate
factors such as data generation, level of measurement, analysis
of multiple replications, and storage and presentation of results. As of the
time of this writing, for example, LISREL will allow analysis of replicated
data with missing data using full information maximum likelihood esti‑
mation but is extremely limited in the output it allows researchers to save.
AMOS provides an easy interface for Monte Carlo analyses through pro‑
gramming with Visual Basic. MPlus and EQS both provide quite flexible
opportunities to generate and analyze both normal and nonnormal data
internally. For any given application, and any given researcher, however,
it may be just as convenient to generate the data using one software pack‑
age, estimate models in another, and analyze the results in still another.
Ease of moving between different data formats also reduces potential bar‑
riers in this regard.
Next on the list of steps is the actual execution of the Monte Carlo study.
Paxton et al. (2001) point to additional considerations at this stage, such as
whether to include nonconverged replications in the results. (They recom‑
mend against it in order to keep the number of replications equal across
conditions unless rates of nonconvergence are of explicit interest, such as
in the study by Enders & Bandalos, 2001.) Others have varied the number
of replications across different sample size conditions in order to hold the
total number of observations in each condition constant. If improper solu‑
tions are to be excluded, then additional data sets must be generated to
allow for models that do not converge or that provide improper solutions.
All of these analyses can place considerable demands on both computa‑
tion time and storage resources, and plans need to be made in advance for
how data will be retained and archived. Fortunately, storage media have
become quite economical, greatly reducing the burden of this aspect of
conducting a Monte Carlo study.
The final two steps outlined by Paxton and colleagues (2001) are check‑
ing the results (we recommend doing this early and often) and summa‑
rizing the results. With regard to the former, we recommend routinely
obtaining descriptive and bivariate statistics for the data under each of the
study conditions in order to ensure that they have been appropriately gen‑
erated and read by the software packages. In our simulation studies we do
so if possible in each of the software packages used. Ensure that the cor‑
rect number of observations and data sets have been read; verify that the
model is correctly structured and estimated; leave nothing to chance. In
terms of summarizing results, Paxton et al. recommend using a combina‑
tion of descriptive, graphical, and inferential approaches, and their advice
is difficult to argue with. Because of the vast quantity of results generated
within the typical simulation study, we particularly recommend learn‑
ing more about compact and effective ways to communicate information
visually (e.g., Tufte, 2001), along with methods for exploratory data analy‑
sis (e.g., Tukey, 1977).
Point of Reflection
What are some important topics in your own area of research which might
lend themselves to a Monte Carlo study? Can you find examples in the litera‑
ture of situations where a simulation approach has been used? What are some
of the key outcomes and factors that might be important for such a study?
Simulating Raw Data Under a Population Model
In any computer simulation setting, it is important to recognize that (a)
numbers generated by a computer are not truly random, (b) in order
to be replicable, an initial “seed” value must be specified, and (c) over
extremely long sequences of values, it is possible for these sequences
to begin to repeat. In this section, we illustrate techniques
for generating data under a variety of different conditions of increasing
complexity. We move between generating normal and nonnormal data
and between the univariate and multivariate cases, because the latter are
extensions of the former.
Generating Normally Distributed Univariate Data
Underlying all Monte Carlo applications is the (continuous) uniform dis‑
tribution. The simplest example of a (discrete) uniform distribution is a
coin toss, the outcome of which may be either heads or tails, with equal
frequency for any “fair” coin. Another common discrete example would
be the roll of a single die. Each outcome, values from 1 through 6, occurs
with equal frequency. Uniform distributions may also be generated across
continuous distributions, in which case every interval of equivalent length
across the distribution is equally probable. In terms of the square back‑
ground in Figures 9.1 and 9.2 at the beginning of this chapter, we would
want to make sure that every point from 0 to x and from 0 to y would be
equally likely to be selected so that we cover the entire area completely
and evenly. Otherwise, the results of our Monte Carlo study might be inac‑
curate or incorrect. One commonly used formula for this purpose is the
congruential generator described in Fan et al. (2001), where the value
at a particular (the ith) step, Ri, over a range from 0 to m, is a function
of the value at the previous step, Ri−1, a multiplier, a, and an increment,
c: Ri = (a × Ri−1 + c) mod m. The advantage of a uniform distribution is that
all values along its full range are equally likely to occur. This ensures, in
turn, that all values calculated from a uniform generation can be expected
to occur with the probability specified by their probability distribution
function.
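To make the recurrence concrete, here is a small Python sketch of a congruential generator. The multiplier, increment, and modulus below (a = 1103515245, c = 12345, m = 2^31) are illustrative constants borrowed from a classic C-library generator, not values given in the text; dividing each Ri by m rescales the sequence to the unit interval.

```python
# A linear congruential generator: R_i = (a * R_{i-1} + c) mod m.
# The constants below are illustrative (glibc-style), not from the text.
A = 1103515245
C = 12345
M = 2 ** 31

def lcg(seed, n):
    """Return n pseudo-uniform values on [0, 1) starting from the given seed."""
    values = []
    r = seed
    for _ in range(n):
        r = (A * r + C) % M   # the congruential recurrence
        values.append(r / M)  # rescale from 0..m-1 to the unit interval
    return values

draws = lcg(seed=2047881, n=1000)
```

Note that rerunning with the same seed reproduces the sequence exactly, which is precisely why a seed makes a Monte Carlo study replicable.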
Every major statistical package has its own routines for generating
uniform variables. In SAS, for example, the ranuni function generates
values uniformly distributed between 0 and 1. In SPSS, the function is
rv.uniform, and in Stata it is uniform. With each package, the steps are the
same: (a) specify the desired number of observations, (b) specify a seed
value to initiate the sequence of numbers and make the results replicable,
and then (c) generate the total required number of values. Here is a simple
routine in Stata to generate 1000 values uniformly distributed between 0
and 1. Simple arithmetic can be used to generate uniform variables across
a different scale.
set obs 1000
set seed 2047881
generate x = uniform()
The choice of a seed is important only in that syntax run with the same
seed will produce identical results every time it is run, whereas syntax
run without a seed (or with a different seed) will produce different out‑
put. Although its importance depends on the type of random number gen‑
erator used in a specific software package, it is generally good practice to
select an odd‑numbered value for the seed. Once we have our uniformly
distributed variables, we can readily transform them or use them to cal‑
culate variables with other distributions. If a variable with a range from
1 to 100 is desired, we can calculate it as 99*x + 1 for example. A normally
distributed variable with a mean of 50 and standard deviation of 10 can be
calculated in Stata as 50 + 10*invnorm(x).
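The same three steps, and the two transformations just described, can be sketched in Python (the text's example uses Stata; here `statistics.NormalDist().inv_cdf` plays the role of Stata's invnorm):

```python
import random
from statistics import NormalDist

# Step (a): desired number of observations; step (b): a seed for replicability
random.seed(2047881)
n = 1_000

# Step (c): generate values uniformly distributed between 0 and 1
x = [random.random() for _ in range(n)]

# Simple arithmetic rescales the uniform values to a different range,
# here from 1 to 100 (the 99*x + 1 transformation from the text)
rescaled = [99 * xi + 1 for xi in x]

# The inverse normal CDF turns uniform values into normal ones;
# 50 + 10*invnorm(x) gives mean 50 and standard deviation 10
normal = [50 + 10 * NormalDist().inv_cdf(xi) for xi in x]
```

The sample mean and standard deviation of `normal` will be close to, but not exactly, 50 and 10; that sampling variability is itself part of what a Monte Carlo study exploits.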
Generating Nonnormally Distributed Univariate Data
Using linear transformations such as the one above, researchers can con‑
struct variables with any desired means and standard deviations, the
first and second moments of a distribution. Often, researchers may wish
to generate variables with nonnormal distributions, ones that also have
known skewness and kurtosis. One of the most commonly used ways to
accomplish this was outlined by Fleishman (1978). For desired values of
skewness and kurtosis, his iterative method allows solving a set of equa‑
tions for three values that can be used to transform a normally distributed
variable into one with the desired values. Although his method involves
quite a bit of algebra and numerous calculations (all best left to comput‑
ers), the approach itself is not difficult to grasp.
Fleishman (1978) set out to find an approximation, using a polynomial
transformation, of a normally distributed variable, X, into a nonnormally
distributed variable, Y, with given skewness and kurtosis. In particular,
Y = a + bX + cX² + dX³, and the constants a, b, c, and d are solved for
in terms of the moments of a normal distribution, which are known. Two
simplifying assumptions are used, specifically that the mean is 0 and the
standard deviation is 1. Linear transformations can be used after the fact
to create variables with the desired means and standard deviations. Setting
the mean equal to 0 implies that a + c = 0 (or, equivalently, that a = −c).
Setting the variance equal to 1 implies that b² + 6bd + 2c² + 15d² = 1. The
desired skewness (γ1) can be expressed in terms of the coefficients as
γ1 = 2c(b² + 24bd + 105d² + 2). Similarly, the kurtosis (γ2) can be expressed
as γ2 = 24[bd + c²(1 + b² + 28bd) + d²(12 + 48bd + 141c² + 225d²)]. Now
all that has to be done is to solve these four equations for the constants a,
b, c, and d. One fairly straightforward way to do this is to find a solution
using iterative techniques such as Newton’s method. Newton’s method
finds an approximate solution (within any desired level of accuracy) based
on an initial starting value, x0. The updated value at a given step is given
by x(n+1) = xn − f(xn)/f′(xn), where f(xn) is the value of the function at the
nth iteration and f′(xn) is the value of its first derivative at the nth
iteration. This process continues until |x(n+1) − xn| is less than the desired
level of accuracy (or until the maximum number of iterations is reached). In
this case, the process is conducted simultaneously for b, c, and d using
matrix algebra by taking the partial derivatives of the equations above
with respect to each unknown. A sample program that executes this pro‑
cess using Stata appears below. By default, it is set up to continue until
none of the parameters changes by more than 0.000001 or until 500 itera‑
tions, whichever comes first.
/* Solving for Fleishman’s Coefficients */
#delimit;
mat maxiter = (500);
mat iter = (0);
* Skewness and Kurtosis;
mat skewkurt = (1, 5);
mat skew = skewkurt[1..rowsof(skewkurt),1];
mat kurt = skewkurt[1..rowsof(skewkurt),2];
mat output = J(rowsof(skewkurt),3,0);
mat coef = (1 \ 0 \ 0);
mat f = J(3,1,1);
while (trace(iter) <= trace(maxiter) &
  max(abs(f[1,1]),abs(f[2,1]),abs(f[3,1])) > .000001) {;
  mat b = coef[1,1];
  mat c = coef[2,1];
  mat d = coef[3,1];
  * Matrix of Function (f);
  mat f = (b*b+6*b*d+2*c*c+15*d*d - 1 \
    2*c*(b*b+24*b*d+105*d*d+2) - skew[1,1] \
    24*(b*d+c*c*(1+b*b+28*b*d)+d*d*(12+48*b*d+141*c*c+225*d*d))
    - kurt[1,1]);
  * Matrix of Partial Derivatives (df);
  mat df = (2*b+6*d, 4*c, 6*b+30*d \
    4*c*(b+12*d), 2*(b*b+24*b*d+105*d*d+2), 4*c*(12*b+105*d) \
    24*(d+c*c*(2*b+28*d)+48*d*d*d),
    48*c*(1+b*b+28*b*d+141*d*d),
    24*(b+28*b*c*c+2*d*(12+48*b*d+141*c*c+225*d*d)+d*d*(48*b+450*d)));
  mat delta = inv(df)*f;
  mat coef = coef - delta;
  mat iter = iter + I(1);
};
mat list iter;
mat list coef;
mat list delta;
mat list f;
mat list df;
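Since readers may not have Stata at hand, the same Newton iteration can be sketched in dependency-free Python (an illustrative translation, not the authors' code). The residual and Jacobian functions transcribe the f and df matrices above; the 3 × 3 linear system at each step is solved here with Cramer's rule to avoid external libraries.

```python
def fleishman_system(b, c, d, skew, kurt):
    """Residuals of Fleishman's variance, skewness, and kurtosis equations."""
    f1 = b*b + 6*b*d + 2*c*c + 15*d*d - 1
    f2 = 2*c*(b*b + 24*b*d + 105*d*d + 2) - skew
    f3 = 24*(b*d + c*c*(1 + b*b + 28*b*d)
             + d*d*(12 + 48*b*d + 141*c*c + 225*d*d)) - kurt
    return [f1, f2, f3]

def jacobian(b, c, d):
    """Partial derivatives of the three equations with respect to b, c, d."""
    return [
        [2*b + 6*d, 4*c, 6*b + 30*d],
        [4*c*(b + 12*d), 2*(b*b + 24*b*d + 105*d*d + 2), 4*c*(12*b + 105*d)],
        [24*(d + c*c*(2*b + 28*d) + 48*d**3),
         48*c*(1 + b*b + 28*b*d + 141*d*d),
         24*(b + 28*b*c*c + 2*d*(12 + 48*b*d + 141*c*c + 225*d*d)
             + d*d*(48*b + 450*d))],
    ]

def solve3(m, v):
    """Solve a 3x3 linear system by Cramer's rule."""
    def det(a):
        return (a[0][0]*(a[1][1]*a[2][2] - a[1][2]*a[2][1])
                - a[0][1]*(a[1][0]*a[2][2] - a[1][2]*a[2][0])
                + a[0][2]*(a[1][0]*a[2][1] - a[1][1]*a[2][0]))
    dm = det(m)
    xs = []
    for j in range(3):
        mj = [row[:] for row in m]
        for i in range(3):
            mj[i][j] = v[i]
        xs.append(det(mj) / dm)
    return xs

def fleishman_coefficients(skew, kurt, tol=1e-6, maxiter=500):
    """Newton's method for Fleishman's b, c, d; a = -c fixes the mean at 0."""
    b, c, d = 1.0, 0.0, 0.0  # start from the normal case Y = X
    for _ in range(maxiter):
        f = fleishman_system(b, c, d, skew, kurt)
        if max(abs(x) for x in f) < tol:
            break
        db, dc, dd = solve3(jacobian(b, c, d), f)
        b, c, d = b - db, c - dc, d - dd
    return -c, b, c, d  # (a, b, c, d)

# The skewness = 1, kurtosis = 5 case used in the Stata program above
a, b, c, d = fleishman_coefficients(1.0, 5.0)
```

Once the coefficients are in hand, Y = a + bX + cX² + dX³ applied to standard normal draws yields a variable with (approximately) the requested skewness and kurtosis.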
Not all combinations of skewness and kurtosis can be generated using
this method. Specifically, whenever γ1² > 0.0629576 × γ2² + 0.0717247,
there is no solution. In these cases, approaches based on different methods
must be used (e.g., Burr, 1942; Headrick & Mugdadi, 2006; Ramberg &
Schmeiser, 1974).
Generating Normally Distributed Multivariate Data
It is a fairly straightforward matter to generate multivariate normally
distributed data with a desired covariance structure. Two methods,
Cholesky decomposition and the factor pattern matrix, are used most
commonly, both of which create linear combinations of independent (i.e.,
uncorrelated) normally distributed variables to construct new variables
with the desired covariance structure.
Cholesky decomposition of a square symmetric matrix, S, uses a form
of Gaussian elimination to find a lower triangular matrix, L, such that
S = LL′ . In this way, it is analogous to the square root function in scalar
algebra. Statistical packages with matrix routines such as SPSS, SAS, and
Stata all have Cholesky decomposition commands (called chol, root, and
cholesky, respectively). The same solution may also be obtained in struc‑
tural equation modeling software such as LISREL very simply. Consider
estimating the model in Figure 9.3.
We use the Cholesky‑decomposed matrix to take three uncorrelated
variables (Old 1, Old 2, and Old 3) and estimate three new correlated vari‑
ables (New 1, New 2, and New 3). New 1 is a linear function of Old 1; New
2 is a linear function of Old 1 and Old 2; New 3 is a linear function of Old
1, Old 2, and Old 3. In the LISREL model, we can estimate the lower trian‑
gular matrix L using the following matrices.
        | 1  0  0 |          | *  0  0 |              | 0  0  0 |
    Ψ = | 0  1  0 | ,   Λy = | *  *  0 | ,   and Θε = | 0  0  0 |
        | 0  0  1 |          | *  *  * |              | 0  0  0 |
Figure 9.3
Graphical representation of the Cholesky decomposition to generate variables with desired
covariance structure from uncorrelated variables.
This model can be estimated from our desired covariance matrix,

        | 1.0  0.4  0.5 |
    S = | 0.4  1.0  0.6 | ,
        | 0.5  0.6  1.0 |

in order to solve for L. In this case, L = Λy because LL′ = S.
da ni=3 no=1000
la
new1 new2 new3
cm
1
.4 1
.5 .6 1
mo ny=3 ne=3 ly=fu,fi be=fu,fi ps=sy,fi te=sy,fi
le
old1 old2 old3
va 1.0 ps(1,1) ps(2,2) ps(3,3)
fr ly(1,1) ly(2,2) ly(3,3)
fr ly(2,1) ly(3,1)
fr ly(3,2)
ou nd=4
In this case,

             | 1.0000  0       0      |
    L = Λy = | 0.4000  0.9165  0      | .
             | 0.5000  0.4364  0.7480 |

So beginning with
three uncorrelated normally distributed variables, it is possible to gener‑
ate three new normally distributed variables that will have the desired
covariance structure using the following equations (remember that col‑
umns cause rows):
New1 = 1.0000 × Old1 + 0.0000 × Old2 + 0.0000 × Old3
New2 = 0.4000 × Old1 + 0.9165 × Old2 + 0.0000 × Old3
New3 = 0.5000 × Old1 + 0.4364 × Old2 + 0.7480 × Old3.
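As an illustrative check (in Python rather than the Stata and LISREL used elsewhere in the chapter), we can compute the Cholesky factor of S directly, apply the three equations above to uncorrelated standard normal variables, and confirm that the generated data recover the desired covariance structure.

```python
import math
import random

# Desired covariance (here, correlation) matrix from the text
S = [[1.0, 0.4, 0.5],
     [0.4, 1.0, 0.6],
     [0.5, 0.6, 1.0]]

def cholesky(S):
    """Lower triangular L with L L' = S (Cholesky decomposition)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

L = cholesky(S)  # rows: (1, 0, 0), (0.4, 0.9165, 0), (0.5, 0.4364, 0.7480)

# Generate correlated variables as linear combinations of uncorrelated ones
random.seed(2047881)
n_obs = 100_000
data = []
for _ in range(n_obs):
    old = [random.gauss(0.0, 1.0) for _ in range(3)]
    new = [sum(L[i][k] * old[k] for k in range(3)) for i in range(3)]
    data.append(new)
```

With 100,000 observations, the sample covariances of the three new variables should fall within a few hundredths of the target values 0.4, 0.5, and 0.6.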
A second commonly used method for generating normally distributed
variables with a desired covariance structure uses the factor pattern matrix,
which represents another way of “solving” the linear combination of uncor‑
related variables that result in data with the desired covariance structure.
For a given covariance matrix, S, it is possible to find a matrix V (eigenvectors)
and values L (eigenvalues) such that (S − λI)v = 0 for each eigenvalue
λ and its corresponding eigenvector v. As with the Cholesky decomposition,
most commonly used statistical packages have routines for finding
eigenvectors and eigenvalues of square symmetric matrices. In Stata,
for example, matrix symeigen V L = S provides the desired values. The
eigenvalues and eigenvectors can be used to generate a matrix of weights,
A (the factor pattern matrix), which can generate data with a desired
covariance structure. In this case, A = V × Cholesky(diag(L)). Two
more lines of syntax are all that is required to generate this matrix in Stata.
matrix L = diag(L)
matrix A = V*cholesky(L)
Using the same input covariance matrix used with the Cholesky example,
we find that

        | 0.8427  −0.4250   0.3306 |
    A = | 0.9167  −0.1248  −0.3796 | .
        | 0.7965   0.1169   0.5932 |

Again, beginning with
three uncorrelated normally distributed variables, it is possible to gener‑
ate three new normally distributed variables with the desired covariance
structure using the following equations:
New1 = 0.8427 × Old1 − 0.4250 × Old2 + 0.3306 × Old3
New2 = 0.9167 × Old1 − 0.1248 × Old2 − 0.3796 × Old3
New3 = 0.5932 × Old1 + 0.7965 × Old2 + 0.1169 × Old3.
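For readers without matrix-language software, the factor pattern approach can also be sketched in dependency-free Python: the Jacobi rotation algorithm below finds the eigenvalues and eigenvectors of a symmetric matrix, and A = V × sqrt(diag(L)) then satisfies AA′ = S. This is an illustrative implementation, not Stata's; symeigen may order or sign the eigenvectors differently, so the individual entries of A can differ from those shown above while still reproducing S.

```python
import math

def jacobi_eigen(S, tol=1e-12, max_rotations=100):
    """Eigenvalues and eigenvectors of a symmetric matrix via Jacobi rotations."""
    n = len(S)
    A = [row[:] for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_rotations):
        # locate the largest off-diagonal element
        p, q, big = 0, 1, 0.0
        for i in range(n):
            for j in range(i + 1, n):
                if abs(A[i][j]) > big:
                    big, p, q = abs(A[i][j]), i, j
        if big < tol:
            break
        # rotation angle that zeroes A[p][q]
        theta = 0.5 * math.atan2(2.0 * A[p][q], A[p][p] - A[q][q])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):  # A <- R A (rows p and q)
            apk, aqk = A[p][k], A[q][k]
            A[p][k], A[q][k] = c * apk + s * aqk, -s * apk + c * aqk
        for k in range(n):  # A <- A R' (columns p and q)
            akp, akq = A[k][p], A[k][q]
            A[k][p], A[k][q] = c * akp + s * akq, -s * akp + c * akq
        for k in range(n):  # accumulate eigenvectors in the columns of V
            vkp, vkq = V[k][p], V[k][q]
            V[k][p], V[k][q] = c * vkp + s * vkq, -s * vkp + c * vkq
    return [A[i][i] for i in range(n)], V

S = [[1.0, 0.4, 0.5],
     [0.4, 1.0, 0.6],
     [0.5, 0.6, 1.0]]

eigvals, V = jacobi_eigen(S)
# Factor pattern matrix: scale each eigenvector by the root of its eigenvalue
A = [[V[i][j] * math.sqrt(max(eigvals[j], 0.0)) for j in range(3)]
     for i in range(3)]

# A A' reproduces the desired covariance matrix
reconstructed = [[sum(A[i][k] * A[j][k] for k in range(3)) for j in range(3)]
                 for i in range(3)]
```

Exactly as with the Cholesky factor, multiplying uncorrelated standard normal variables by A yields new variables whose population covariance matrix is S.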