9.7 Caution: MCMC and EM are Dangerous!
great deal of time to develop, code, and test algorithms for particular models.
With BUGS, the analyst only needs to specify the model in a language based
on the S statistical language (Chambers 2004). BUGS then ﬁgures out how
to set up a Gibbs sampler, and whether it can calculate the full conditional
distributions analytically or whether it needs the Metropolis algorithm. In the
latter case, it even automatically tunes the proposal distribution.
This means almost any Bayesian model can be ﬁt to any data set. There
is no requirement that the chosen model is sensible. If the data provide no
information about a parameter then the parameter’s prior and posterior law
will be nearly identical. The displays in WinBUGS are designed to help one
assess convergence, but they do not always help with the issue of whether or
not the model is appropriate.
For this reason, the BUGS manual (Spiegelhalter et al. 1995) bears the
warning, “Gibbs sampling is dangerous” on the ﬁrst page. The warning is
not so much about Gibbs sampling as it is about leaping into computation without thinking about whether or not the model is appropriate for the data and
problem at hand. To that extent, the warning is equally applicable to blindly
applying the EM algorithm. Although both procedures will provide an answer
to the question, “What are the parameters of this model?” they do not necessarily answer the question, “Is this a reasonable model?”
Fortunately, Bayesian statistics oﬀers an answer here as well. If we have a
full Bayesian model, that model makes a prediction about the data we might
see. If this model has a very low probability of generating a given data set, it
is an indication that the model may not be appropriate. We can also use this
idea to choose between two competing models, or search for the best possible
model. The next chapter explores model checking in some detail.
Exercises
9.1 (Stratiﬁed Sampling). Example 9.1 used a simple random sample of
100 students with the number of masters in the sample unknown in advance.
Suppose instead a stratiﬁed sample of 50 masters and 50 nonmasters was
used. How would the inference diﬀer, if it was known who the masters and
nonmasters are? How about if we do not know who is who, but we know there
are exactly 50 of each?
9.2 (Missing At Random). Classify the following situations as MCAR,
MAR, or neither:
1. A survey of high school seniors asks the school administrator to provide
grade point average and college entrance exam scores. College entrance
exam scores are missing for students who have not taken the test.
2. The same survey as in the first item, except that it additionally asks
whether or not the student has declared an intent to apply to college.
9 Learning in Models with Fixed Structure
3. To reduce the burden on the students ﬁlling out the survey, the background questions are divided into several sections, and each student is
assigned only some of the sections using a spiral pattern. Responses on
the unassigned section are missing.
4. Some students when asked their race decline to answer.
9.3 (Missing At Random and Item Responses). Item responses can be
missing for a variety of reasons. Classify the following situations concerning a
student’s missing response to a particular Task j as MCAR, MAR, or neither.
Hint: See Mislevy (2015) or Mislevy and Wu (1996).
1. John did not answer Task j because it was not on the test form he was
administered.
2. Diwakar did not answer Task j because there are linked harder and easier
test forms, intended for fourth and sixth grade students; Task j is an easy
item that only appears on the fourth grade form; and Diwakar is in sixth
grade, so he was administered the hard form.
3. Rodrigo took an adaptive test. He did well, the items he was administered
tended to be harder as he went along, and Task j was not selected to
administer because his responses suggested it was too easy to provide
much information about his proﬁciency.
4. Task j was near the end of the test, and Ting did not answer it because
she ran out of time.
5. Shahrukh looked at Task j and decided not to answer it because she
thought she would probably not do well in it.
6. Howard was instructed to examine four items and choose two of them to
answer. Task j was one of the four, and not one that Howard chose.
9.4 (Classical Test Theory). Consider the following simple model from
classical test theory. Let Ti be a student’s true score on a family of parallel tests
on which scores can range from 0–10. Ti characterizes Student i’s proﬁciency
in the domain, but cannot be observed directly. Instead, we observe noisy
scores on administrations of the parallel tests. Let Xij be Student i’s score on
Test j. Following classical test theory, let
$$X_{ij} = T_i + \epsilon_{ij} \qquad (9.28)$$
where $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$ and $T_i \sim N(\mu, \sigma_T^2)$. The classical test theory index of
reliability is
$$\rho = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_\epsilon^2} .$$
The following is BUGS code for estimating $\mu$, $\sigma_T^2$,
$\sigma_\epsilon^2$, and $\rho$ from the responses of nine students to five parallel test forms:
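model ctt {
  for (i in 1:N) {
    T[i] ~ dnorm(mu,tauT)          # true score of Student i
    for (j in 1:ntest) {
      x[i,j] ~ dnorm(T[i],taue);   # observed score of Student i on Test j
    }
  }
  # Priors (dnorm takes a mean and a precision)
  mu ~ dnorm(0,.01)
  tauT ~ dgamma(.5,1)
  taue ~ dgamma(.5,1)
  # Reliability; algebraically equal to varT / (varT + varE)
  rho <- taue / (taue + tauT)
  varT <- 1/tauT
  varE <- 1 / taue
}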
#inits
list(T = c(20,20,20,20,20,20,20,20,20))
list(T = c(-20,-20,-20,-20,-20,-20,-20,-20,-20))
#data
list(N=9,ntest=5,
x=structure(.Data=c(
2, 3, 2, 4, 3,
4, 6, 4, 3, 4,
7, 5, 7, 5, 4,
4, 5, 4, 6, 5,
6, 5, 6, 5, 7,
6, 7, 5, 3, 4,
4, 5, 7, 5, 8,
6, 3, 6, 3, 5,
4, 5, 8, 6, 9), .Dim=c(9,5)))
Note that in BUGS, the normal distribution is parameterized with the mean,
$\mu$, and the precision, $\tau = 1/\sigma^2$. The line varT <- 1/tauT produces draws for $\sigma_T^2$.
Run the problem with this setup, using the two lists of initial values for the
T s to start two chains. Monitor T, mu, varT, varE, and rho.
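Because the BUGS code works with precisions while the reliability index is defined in terms of variances, it can help to confirm that the two forms agree. The following Python fragment is a minimal sketch of that check; the function name and the numerical values are ours, chosen only for illustration.

```python
def reliability_two_ways(tau_T, tau_e):
    """Return rho computed from variances and from precisions.

    tau_T -- precision of the true scores (1 / sigma_T^2)
    tau_e -- precision of the errors (1 / sigma_epsilon^2)
    """
    var_T = 1.0 / tau_T                      # what varT <- 1/tauT monitors
    var_E = 1.0 / tau_e                      # what varE <- 1/taue monitors
    rho_from_variances = var_T / (var_T + var_E)
    rho_from_precisions = tau_e / (tau_e + tau_T)   # form used in the BUGS code
    return rho_from_variances, rho_from_precisions


# Illustrative (hypothetical) precisions, not estimates from the data.
rho_v, rho_p = reliability_two_ways(tau_T=0.5, tau_e=2.0)
print(rho_v, rho_p)
```

Multiplying the numerator and denominator of the variance form by tauT * taue gives the precision form, so the two numbers agree up to rounding.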
1. Run 500 MCMC cycles. There are overdispersed initial values for the T s.
Ask for Stats, history, density, and the BGR convergence diagnostics plot.
Does it look like burn-in cycles may be needed for this problem? Which
parameters seem to be more or less aﬀected by the overdispersed initial
values?
2. Run another 500 cycles, and calculate summary statistics and distributions for the parameters you monitored based on only the last 500 cycles
(hint: beg = 501 on the sample monitor dialog box). Regarding estimates
for individual students: What are the posterior means for the T s (i.e.,
true scores) of each of the students? What are their maximum likelihood
estimates (hint: BUGS does not tell you this—you need to do a little
arithmetic). In what directions do they diﬀer, and why?
3. Look at the summary statistics for T[1]. What is the meaning of the
number in each column?
9.5 (Classical Test Theory and the Eﬀects of Diﬀerent Priors). Consider the model and data from Exercise 9.4. Start with the original setup.
Run just one chain for each of the variations required. In each case, run 2000
cycles, and calculate statistics based on only the last 1000. Monitor mu, varT,
varE, T, and rho.
1. The original setup uses, as a prior distribution for mu, N(0, .01) (using
the BUGS convention with the precision as the second parameter). Run
the same problem, except with N (0, 1) as the prior for mu, and monitor
the results. Focusing on posterior means and standard deviations of the
parameters listed above, which ones change? How much and why?
2. Repeat the run, except with N (0, 10) as the prior for mu. Again focusing on
posterior means and standard deviations of the parameters listed above,
which ones change? How much and why?
3. The original setup uses, as a prior distribution for both tauT and taue,
Gamma(.5, 1). Run the same problem, except with Gamma(.05, .10) as the
prior for tauT, and monitor the results. Focusing on posterior means and
standard deviations of the parameters listed above, which ones change?
How much and why?
4. Repeat the run, except with Gamma(50, 100) as the prior for tauT. Again
focusing on posterior means and standard deviations of the parameters
listed above, which ones change? How much and why?
9.6 (Slow Mixing). A researcher runs a small pilot MCMC chain on a particular problem and gets a trace graph for one parameter like Fig. 9.9. The
researcher shows it to two colleagues. The ﬁrst says that the problem is the
proposal distribution and suggests that the researcher should try to find
new proposal distributions. The second says that if the researcher just runs
the chain for ten times longer than originally planned, then the trace plot
will look like “white noise” and the MCMC sample will be adequate. The
second colleague further suggests that a long weekend is coming up and the
lab computers will be idle anyway. Which advice should the researcher take?
Why?
9.7 (MAP and MCMC Mean). In Table 9.10, the MCMC mean frequently
appears to be closer to the MAP from the EM algorithm than it does to the
“true” value from the data generation. Why is this seen?
9.8 (Latent Class). The following response vectors come from a simple
latent class model with two classes and ten dichotomously scored items.
1111111010 1110111011 0000010000 1111001011 0010001000
0000101000 0000010000 0010100000 0000000000 0000001000
0011100111 0100000000 0000000000 0000000000 0010000000
1100000000 1111011111 0010100000 0010000010 0111111011
0010000000 0000000000 1011111111 0010000100 1111111011
1110110111 0000000000 0000110010 1111010100 0000000010
0100100000 0111100010 0000000000 0011000000 1111111010
0100100000 0000001100 1111111111 1100000000 0010010000
0000101000 0101011111 0000000000 1111111011 0000010000
0010000000 0000010000 1111110111 0010000100 0010000000
0010010010 1111011111 1110101110 1111011101 0000000000
1111111110 1100010010 0001000001 1111111101 0011111011
0000000100 0010000001 0010000000 0010100000 0111111111
0000000000 0001110000 0000010010 1111011111 0010010000
1110101001 0010110110 1110000000 0000100101 0010100010
0000000110 0000000000 0000100010 1111001011 0000001000
0010000000 0000010001 0010011010 1111111011 1010101100
0010000001 0011000000 1000100010 0000010000 1111100011
0110000100 1001000000 1010000000 0100010110 0010111101
1101111111 0010000000 1111101111 1000100100 1111101111
Use a Beta(1, 1) prior for the class membership probability $\lambda$ and, for all tasks,
use a Beta(1.6, .4) prior for the probability of success for masters, $\pi_{j1}$, and a
Beta(.4, 1.6) prior for the probability of success for nonmasters, $\pi_{j0}$. Estimate the parameters from the data using MCMC.
9.9 (Latent Class Prior). In Exercise 9.8, what would have happened if we
had used a Beta(1, 1) prior for πj1 and πj0 ?
Hint: Consider three MCMC chains started from the points $\pi_j =
(.2, .8)$, $\pi_j = (.5, .5)$, and $\pi_j = (.8, .2)$ for all j.
9.10 (Latent Class Parameter Recovery). The data for Exercise 9.8 are
from a simulation, and the parameters used in the simulation are: λ = 0.379,
plus the values in the following table.
Item         1     2     3     4     5     6     7     8     9     10
Nonmasters   0.22  0.17  0.31  0.07  0.25  0.19  0.15  0.18  0.22  0.11
Masters      0.82  0.84  0.85  0.72  0.81  0.76  0.80  0.70  0.84  0.76
Calculate a 95% credibility interval for each parameter (this can be done by
taking the 0.025 and 0.975 quantiles of the MCMC sample). How many of the
credibility intervals cover the data generation parameters? How many do we
expect will cover the data generation parameters?
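The quantile calculation mentioned in this exercise can be sketched in a few lines of Python. This is only an illustration: the function names are ours, the draws are assumed to be a plain list of post-burn-in MCMC values for a single parameter, and linear interpolation between order statistics is one of several reasonable quantile conventions.

```python
def quantile(sorted_draws, q):
    """Quantile of an already sorted sample, by linear interpolation."""
    pos = q * (len(sorted_draws) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    frac = pos - lo
    return sorted_draws[lo] * (1 - frac) + sorted_draws[hi] * frac


def credibility_interval(draws, level=0.95):
    """Equal-tailed interval: the (1-level)/2 and 1-(1-level)/2 quantiles."""
    ordered = sorted(draws)
    tail = (1 - level) / 2
    return quantile(ordered, tail), quantile(ordered, 1 - tail)


def covers(interval, generating_value):
    """True if the interval contains the value used to simulate the data."""
    lower, upper = interval
    return lower <= generating_value <= upper


# Stand-in for an MCMC sample: the integers 0..1000 in reverse order.
fake_draws = list(range(1000, -1, -1))
print(credibility_interval(fake_draws))
```

Applying covers to each monitored parameter and counting the results gives the coverage tally the exercise asks for.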
9.11 (EM vs. MCMC). In each of the following situations, tell whether it
is better to use the EM algorithm or MCMC to estimate parameters.
1. The posterior mean will be used in an online scoring engine. The posterior
variance will be examined brieﬂy as a model checking procedure, but will
not be used in scoring.
2. The test specs call for only using items whose p-plus (the marginal probability of success, $p_j^+ = P(X_j = 1)$) is greater than .1 and less than .9
with 90% credibility, that is, $P(0.1 \le p_j^+ \le 0.9) \ge 0.9$.
3. Only the posterior mean will be used in online scoring, but there is strong
suspicion that the posterior distribution of the difficulty parameter is
bimodal for several items.
9.12 (Improving Posterior Standard Deviation). Consider the calibration in Example 9.3. Which of the following activities are likely to reduce the
standard deviation of the posterior law for the proﬁciency model parameter
λ:
1. Increase the size of the MCMC sample.
2. Increase the length (number of tasks) of the test.
3. Increase the number of students in the calibration sample.
Which of the following activities are likely to reduce the standard deviation
of the posterior for an evidence model parameter such as π10 :
1. Increase the size of the MCMC sample.
2. Increase the length (number of tasks) of the test.
3. Increase the number of students in the calibration sample.
9.13 (LSAT model). The BUGS distribution package (Spiegelhalter et al.
n.d.) comes with a sample model called “LSAT” based on an analysis performed by Bock and Aitkin (1981) of responses on ﬁve items from 1000 students taking the Law School Admissions Test (LSAT). The data are analyzed
using the Rasch model, where if pij is the probability that Student i gets
item j correct, then
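$$\mathrm{logit}(p_{ij}) = \theta_i - \alpha_j \qquad (9.29)$$
where the proficiency variable $\theta_i$ has distribution $N(0, \tau)$ (Spiegelhalter et al.
n.d.). Note that this equation can be reparameterized as:
$$\mathrm{logit}(p_{ij}) = \beta\theta_i - \alpha_j \qquad (9.30)$$
where $\theta_i \sim N(0, 1)$ and $\beta = 1/\sqrt{\tau}$ (taking $\tau$ to be a precision, following the BUGS convention, so that $\beta$ is the standard deviation of the original $\theta_i$).
Run an MCMC sampler using both the original (Eq. 9.29) and reparameterized (Eq. 9.30) models. What differences are there in the resulting Markov
chains?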
10 Critiquing and Learning Model Structure
The previous chapter described how to fit a model to data. The parameter-learning methods described there assumed the structure of the model was
ﬁxed. However, often there is as much or more uncertainty about the structure of the model as there is about the values of the parameters. There are
basically two approaches to this problem. The ﬁrst is model checking, or as
it is sometimes called, model criticism. Fit indices and graphical displays can
help us explore where and how well the model ﬁts the data, and bring to light
problems with a model. The second is model search. There are a number of
ways to search the model space for one that is “best” in some sense.
While traditional methods of characterizing model ﬁt emphasized overall
goodness of ﬁt, we take a more utilitarian perspective. The statistician George
Box famously said “All models are false, but some are useful” (Box 1976). We
want a Bayes net that captures the key interrelationships between what students know, in terms of proﬁciency variables, and what they can do, in terms
of observables, without having to believe that the model expresses every pattern in the data. We do not, however, want unmodeled patterns that make our
inferences about students’ proﬁciencies misleading for the purpose at hand.
We are interested in ﬁt indices that highlight particular kinds of model misﬁt
which, from experience, we know can appear in assessment data and distort
our uses of the model.
We emphasize exploratory uses of model checking over statistical tests of
ﬁt, partly because Bayes nets are often applied with small to medium size data
sets, and partly because the techniques we describe fall out almost as a byproduct of Markov Chain Monte Carlo (MCMC) estimation, in ways that generate their own reference distributions. The reader interested in large-sample
distributions of prediction-based ﬁt indices is referred to Gilula and Haberman (1995), Gilula and Haberman (2001), and to Haberman et al. (2013)
for an application to item response theory (IRT) models. Although we do not
pursue large sample properties here, the chapter draws in places on their work
on prediction-based indices.
© Springer Science+Business Media New York 2015
R. G. Almond et al., Bayesian Networks in Educational Assessment,
Statistics for Social and Behavioral Sciences, DOI 10.1007/978-1-4939-2125-6_10
Section 10.1 introduces some ﬁt indices and describes a simple simulation
experiment for using them. Section 10.2 looks at the technique of posterior predictive model checking (PPMC), which goes well with MCMC estimation. Section 10.3 looks at some graphical methods for assessing model ﬁt. Section 10.4
addresses diﬀerential task functioning, where the issue is whether a task works
similarly across student groups. Section 10.5 then turns to model comparison.
Usually simpler models are preferable to more complex ones, but complex
ones will ﬁt better just because of the extra parameters. The DIC ﬁt measure
discussed in Sect. 10.5.1, which also goes well with MCMC, includes a penalty
for model complexity. In Sect. 10.5.2, prediction-based indices are deﬁned and
illustrated with the discrete-IRT testlet model. Looking ahead, Chap. 11 will
apply several of these techniques to the mixed-number subtraction example.
Given a measure of model ﬁt, one can search for a model that ﬁts the data
best. Section 10.6 looks at some literature on automatic model selection. There
are, however, some important limitations on learning model structure with
Bayes nets. In particular, there can be ways of reversing the direction of some
of the edges that have diﬀerent interpretations but do not change the implied
probability distribution. Section 10.7 discusses some of these equivalent models, and highlights pitfalls in attempting to learn “causality” from data.
10.1 Fit Indices Based on Prediction Accuracy
The fact that Bayes nets are probability models gives them a distinct advantage over more ad hoc mathematical models for managing uncertainty, such
as fuzzy logic (Zadeh 1965) and certainty factors (Shortliﬀe and Buchanan
1975). In particular, the probabilities can be regarded as predictions and standard statistical techniques can assess how accurate those predictions are. This
assessment yields information about how well the model ﬁts the data.
Cowell et al. (1993) describe several locations in a Bayesian network at
which ﬁt can be assessed:
Node Fit: How well the model predicts the distribution of a single variable
in the model. These can either be conditional predictions taking
the values of other variables into account or marginal predictions
ignoring the values of other variables.
Edge Fit: How well the relationship between a parent and child in the graph
is modeled.
Global Fit: How well all variables in the data ﬁt the graphical model.
These are not always easy to apply in educational testing because the
proﬁciency variables are latent, and therefore predictive patterns of certain
dependencies described in the model cannot be directly assessed. In particular,
parent–child relationships often involve at least one latent variable and hence,
they cannot be directly tested with only observed data. Thus, the node ﬁt
indices can only be applied to observable variables and the global ﬁt indices
must calculate how well the model predicts all the observables.
One way to get around this problem is to leave the data out for one
observable, and see how well the model predicts that observable based on the
remaining values. This is called leave one out prediction. (The idea extends
readily to leaving out a group of observables, then predicting some summary
statistic of them such as a subtest score.) Suppose that a collection of observable outcomes is available from N learners taking a particular form of an
assessment. Let Yij be the value of Observable j for Learner i. Assume that
Yij is coded as an integer and it can take on possible values 1, . . . , Kj .
Using the methods of Chap. 5, it is easy to calculate a predictive distribution for Yij given any other set of data. Let Yi,−j be the vector of
responses for Learner i on every observable except Observable j. Deﬁne
$p_{ijk} = P(Y_{ij} = k \mid Y_{i,-j})$. In all of the fit indices described below, the idea
is to characterize how well the model's prediction $p_{ijk}$, given all of Learner $i$'s other
responses, predicts the $Y_{ij}$ that was actually observed. Let $p_{ij*}$ denote the
value of $p_{ijk}$ for $k = Y_{ij}$, that is, the predicted probability of the event that
actually occurred.
Williamson (2000) noted that a number of measures of quality of prediction have historically been applied to evaluate weather forecasting. These can
be pressed into service fairly easily to evaluate node ﬁt in Bayesian networks.
Williamson (2000) (see also Williamson et al. 2000) evaluated a number of
these and found the most useful to be Weaver's Surprise Index, the Ranked
Probability Score, and Good’s Logarithmic Score. The ﬁrst two are more traditional indices, which remain useful to alert users to anomalies in collections
of predictions. Good’s Logarithmic Score, useful enough on its own, leads
us to more theoretically grounded techniques based on statistical theory and
information theory.
Weaver’s Surprise Index (Weaver 1948) attempts to distinguish between
a “rare” event and a “surprising” one. A rare event is one with a small probability. A surprising event is one with a small probability relative to the probability of other events. (The deﬁnition of events can make a diﬀerence: The
probability of a Royal Flush in clubs—the ace, king, queen, jack, and ten of
clubs—is the same as the probability of any other speciﬁed hand, so getting
this particular hand in poker is rare but no more surprising than any other
speciﬁed set of ﬁve cards in the deck when each is considered an event in the
comparison. It is rare and surprising with respect to events deﬁned by sets of
hands with the same poker value, such as one pair, two pairs, straight, etc.)
Weaver’s surprise index is deﬁned as the ratio of the expected value of
prediction probabilities to that of the actual event:
$$W_{ij} = \frac{E(p_{ijk})}{p_{ij*}} = \frac{\sum_{k=1}^{K_j} p_{ijk}^2}{p_{ij*}} \qquad (10.1)$$
The expectation here is over Yij using the predictive probabilities pijk . The
larger the value, the more surprising the result. Weaver suggests that values
of 3–5 are not large, values of 10 begin to be surprising and values above 1000
are deﬁnitely surprising.
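As an illustration of Eq. 10.1, the following Python fragment computes Weaver's surprise index for a single prediction. It is a minimal sketch: the function name and the two example probability vectors are ours, and outcomes are assumed to be coded 1, ..., K as in the text.

```python
def weaver_surprise(probs, observed):
    """Weaver's surprise index for one learner-by-observable prediction.

    probs    -- predictive probabilities p_ijk for k = 1..K (summing to 1)
    observed -- the observed value Y_ij, coded 1..K
    """
    expected_p = sum(p * p for p in probs)   # E(p_ijk) = sum_k p_ijk^2
    p_star = probs[observed - 1]             # probability of the observed outcome
    return expected_p / p_star               # undefined if p_star is zero


uniform = [0.25, 0.25, 0.25, 0.25]   # every outcome equally (un)likely
peaked = [0.97, 0.01, 0.01, 0.01]    # model is confident in outcome 1

print(weaver_surprise(uniform, 3))   # low-probability outcome, index of 1
print(weaver_surprise(peaked, 3))    # low-probability outcome, index above 90
```

In the first case the observed outcome is no less probable than any alternative, so the index is 1 regardless of how small that probability is. In the second case an outcome with probability .01 occurred when the model strongly expected something else, and the index is correspondingly large.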
Weaver’s surprise index assumes that one wrong prediction is as bad as
another. However, frequently the observables in an educational model represent ordered outcomes (e.g., a letter grade assigned to an essay). In those
situations, the measure of prediction quality should provide a greater penalty
for predictions that are far oﬀ than for near misses. The Ranked Probability
Score (Epstein 1969) takes this into account.
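$$S_{ij} = \frac{3}{2} - \frac{1}{2(K_j - 1)} \sum_{k=1}^{K_j - 1} \left[ \left( \sum_{n=1}^{k} p_{ijn} \right)^2 + \left( \sum_{n=k+1}^{K_j} p_{ijn} \right)^2 \right] - \frac{1}{K_j - 1} \sum_{k=1}^{K_j} |k - Y_{ij}| \, p_{ijk} . \qquad (10.2)$$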
This index ranges from 0.0 (poor prediction) to 1.0 (perfect prediction). It
assumes the states have an interval scale.
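The ranked probability score can be computed directly from its definition. The sketch below assumes the form of Eq. 10.2 given by Epstein (1969): 3/2, minus a term built from the squared cumulative probabilities on either side of each cut point, minus the expected distance between the predicted and observed categories, with both terms scaled by the number of categories. The function name and example vectors are ours.

```python
def ranked_probability_score(probs, observed):
    """Ranked probability score for one prediction of an ordered outcome.

    probs    -- predictive probabilities p_ijk for k = 1..K (summing to 1)
    observed -- the observed value Y_ij, coded 1..K
    """
    K = len(probs)
    if K < 2:
        raise ValueError("need at least two ordered categories")
    split_term = 0.0
    for k in range(1, K):                 # cut points k = 1..K-1
        below = sum(probs[:k])            # sum_{n=1}^{k} p_ijn
        above = sum(probs[k:])            # sum_{n=k+1}^{K} p_ijn
        split_term += below ** 2 + above ** 2
    distance_term = sum(abs(k - observed) * probs[k - 1]
                        for k in range(1, K + 1))
    return 1.5 - split_term / (2 * (K - 1)) - distance_term / (K - 1)


print(ranked_probability_score([0, 0, 0, 1], 4))   # certain and correct
print(ranked_probability_score([0, 1, 0, 0], 4))   # certain, off by two
print(ranked_probability_score([1, 0, 0, 0], 4))   # certain, off by three
```

The three calls show the ordering the index is designed to produce: a correct prediction scores highest, and a confident prediction two categories away scores better than one three categories away.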
Good’s Logarithmic Score extends the basic Logarithmic Scoring Rule.
Recalling that pij∗ is the posterior probability of the observed outcome, the
basic logarithmic score is
$$L_{ij} = -\log p_{ij*} . \qquad (10.3)$$
The lower the probability, the larger the value of the logarithmic score for
this observation. Twice the logarithmic score is called the deviance, and ﬁnding parameter values that minimize the total deviance over a sample gives
the maximum likelihood estimates for a model. We will return to the use of
deviance in model comparisons in Sect. 10.5.1.
The basic logarithmic score makes no distinction between “rare” and “surprising.” To take care of this eﬀect, Good (1952) subtracts the expected logarithmic score, over the values that might have been observed:
$$GL_{ij} = -\log p_{ij*} - \left( -\sum_{k=1}^{K_j} p_{ijk} \log p_{ijk} \right) . \qquad (10.4)$$
Values near zero indicate that the model accurately predicts the observation.
The expected logarithmic score is known in information theory as the entropy, or
the uncertainty about $Y_{ij}$ before it was observed, and is denoted by $\mathrm{Ent}(Y_{ij})$.
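The two logarithmic scores are simple to compute once the predictive probabilities are in hand. The following Python sketch uses natural logarithms (an assumption on our part; any base works if used consistently) and function names of our own choosing.

```python
import math


def log_score(probs, observed):
    """Basic logarithmic score: minus the log probability of what occurred."""
    return -math.log(probs[observed - 1])


def entropy(probs):
    """Expected logarithmic score under the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def good_log_score(probs, observed):
    """Good's score: the logarithmic score minus its expected value."""
    return log_score(probs, observed) - entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

print(log_score(uniform, 3), good_log_score(uniform, 3))
print(log_score(peaked, 3), good_log_score(peaked, 3))
```

Under the uniform prediction the basic score is large (every outcome is rare) but Good's score is zero (no outcome is surprising). Under the peaked prediction both are large when an unexpected outcome occurs. By construction, Good's score averages to zero when the outcomes really are generated from the predictive probabilities.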
Any of the preceding fit indices can be averaged across the sample
of examinees to give a measure of node fit for a particular observable. For
example, $S_j = \sum_{i=1}^{N} S_{ij}/N$ is the ranked probability score for Observable j.
The indices can also be averaged across nodes to get a measure of person fit,
e.g., $W_i = \sum_{j=1}^{J} W_{ij}/J$.
The question is how extreme the indices above need to be to indicate a problem. Williamson et al. (2000) posed a simple simulation experiment to answer
that question for any given model. When assessing model ﬁt, the null hypothesis is that the data ﬁt the model. This suggests a procedure for determining
the distribution of the test statistic under the null hypothesis.
1. Generate a sample of the same size as the real data from the posited
model.
2. Calculate the value of the ﬁt indices for the simulated data set, using the
posited model and conditional probabilities.
3. Repeat Steps 1 and 2 many times to generate the desired reference distribution of any of the ﬁt indices.
Step 2 produces fit index values for all of the simulee-by-task combinations in
the sample, e.g., $S_{ij}^*$. Reference distributions for task and person fit measures
are created by averaging over tasks or persons accordingly, i.e., $S_j$ is computed
by averaging $S_{ij}$. Williamson et al. (2000) note that the reference distribution
can be computed more cheaply using a simple bootstrap (Efron 1979), which
samples repeatedly from the observed data.
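The three-step simulation can be sketched as follows. To keep the example self-contained we use a two-class latent class model with four dichotomous tasks as the posited model, the average logarithmic score as the node fit index, and made-up parameter values. None of these choices come from Williamson et al. (2000); they only illustrate the mechanics.

```python
import math
import random


def simulate(n, lam, pi0, pi1, rng):
    """Step 1: draw n response vectors from the posited model."""
    data = []
    for _ in range(n):
        pis = pi1 if rng.random() < lam else pi0     # master or nonmaster
        data.append([1 if rng.random() < p else 0 for p in pis])
    return data


def loo_prob(resp, j, lam, pi0, pi1):
    """P(Y_ij = resp[j] | all other responses of this simulee)."""
    like1, like0 = lam, 1.0 - lam
    for m, y in enumerate(resp):
        if m == j:
            continue                                  # leave task j out
        like1 *= (pi1[m] if y else 1.0 - pi1[m])
        like0 *= (pi0[m] if y else 1.0 - pi0[m])
    post1 = like1 / (like1 + like0)                   # P(master | other tasks)
    p_right = post1 * pi1[j] + (1.0 - post1) * pi0[j]
    return p_right if resp[j] else 1.0 - p_right


def node_log_scores(data, lam, pi0, pi1):
    """Step 2: average logarithmic score for each task."""
    return [sum(-math.log(loo_prob(r, j, lam, pi0, pi1)) for r in data) / len(data)
            for j in range(len(pi0))]


def reference_distribution(n, lam, pi0, pi1, reps, seed=0):
    """Step 3: repeat Steps 1 and 2 to build the reference distribution."""
    rng = random.Random(seed)
    return [node_log_scores(simulate(n, lam, pi0, pi1, rng), lam, pi0, pi1)
            for _ in range(reps)]


# Hypothetical parameter values for the posited model.
LAM, PI0, PI1 = 0.4, [0.2, 0.2, 0.3, 0.1], [0.8, 0.8, 0.9, 0.7]
ref = reference_distribution(50, LAM, PI0, PI1, reps=200)
task1 = sorted(r[0] for r in ref)
print("approximate 95th percentile for Task 1:", task1[int(0.95 * len(task1))])
```

The value of the index computed from the real data for a task would then be compared with the corresponding column of ref. A value far out in the upper tail suggests that the task is predicted less well than the posited model says it should be.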
10.2 Posterior Predictive Checks
The Williamson et al. (2000) method ignores one potentially important source
of variability, the uncertainty in the predictions due to uncertainty about the
parameters (it matters less for problems with large samples). The method of
PPMC (Guttman 1967; Rubin 1984) does incorporate uncertainty in parameters. Furthermore, it works very naturally with Markov Chain Monte Carlo
estimation. Sinharay (2006) provides a good summary of this technique
applied to Bayesian network models. This section describes the approach and
gives a simple example.
Let p(y|ω) be the likelihood for data y given parameters ω. Let yrep be
a replicate set of data generated by the same process as y with the same
parameters ω. This is sometimes called a shadow data set. The posterior
predictive method suggests using the posterior predictive distribution for yrep
to create a reference distribution for a given fit statistic, similar to the way
the Williamson et al. (2000) method created an empirical reference distribution
in the previous section. The posterior predictive distribution is defined as:
$$p(y^{rep} \mid y) = \int p(y^{rep} \mid \omega) \, p(\omega \mid y) \, d\omega . \qquad (10.5)$$
Correspondingly, the posterior predictive distribution for a ﬁt index, e.g.,
Weaver’s surprise index for Task j, Wj , is obtained as
$$p(W_j(y^{rep}) \mid y) = \int p(W_j(y^{rep}) \mid \omega) \, p(\omega \mid y) \, d\omega . \qquad (10.6)$$
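In practice these integrals are approximated by Monte Carlo: for each posterior draw of the parameters, generate one replicate data set and compute the statistic on it. The Python sketch below illustrates the idea on a deliberately tiny problem of our own, not one from the text: a sequence of dichotomous responses modeled as independent with a common success probability and a Beta(1, 1) prior. The posterior draws are taken directly from the known Beta posterior here, where an MCMC sample would be used in a real application. The statistic (the number of switches between 0 and 1) is likewise our choice.

```python
import random


def n_switches(y):
    """Number of adjacent positions where the response changes."""
    return sum(1 for a, b in zip(y, y[1:]) if a != b)


def posterior_predictive_reference(y, stat, n_draws=2000, seed=0):
    """Monte Carlo approximation to the posterior predictive law of stat."""
    rng = random.Random(seed)
    n, s = len(y), sum(y)
    reference = []
    for _ in range(n_draws):
        # omega ~ p(omega | y); Beta(1 + s, 1 + n - s) under a Beta(1, 1) prior
        omega = rng.betavariate(1 + s, 1 + n - s)
        # y_rep ~ p(y_rep | omega), same size as the observed data
        y_rep = [1 if rng.random() < omega else 0 for _ in range(n)]
        reference.append(stat(y_rep))
    return reference


y = [1] * 10 + [0] * 10                 # ten successes followed by ten failures
ref = posterior_predictive_reference(y, n_switches)
observed = n_switches(y)
tail = sum(r <= observed for r in ref) / len(ref)
print(observed, tail)
```

The observed sequence switches only once, while replicate data sets generated under the independence model switch far more often, so the observed value falls in the extreme lower tail of the reference distribution. This flags the kind of unmodeled pattern that a posterior predictive check is meant to reveal.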