9.7 Caution: MCMC and EM are Dangerous!


great deal of time to develop, code, and test algorithms for particular models.

With BUGS, the analyst only needs to specify the model in a language based

on the S statistical language (Chambers 2004). BUGS then figures out how

to set up a Gibbs sampler, and whether it can calculate the full conditional

distributions analytically or whether it needs the Metropolis algorithm. In the

latter case, it even automatically tunes the proposal distribution.

This means almost any Bayesian model can be fit to any data set. There

is no requirement that the chosen model is sensible. If the data provide no

information about a parameter then the parameter’s prior and posterior law

will be nearly identical. The displays in WinBUGS are designed to help one

assess convergence, but they do not always help with the issue of whether or

not the model is appropriate.

For this reason, the BUGS manual (Spiegelhalter et al. 1995) bears the

warning, “Gibbs sampling is dangerous” on the first page. The warning is

not so much about Gibbs sampling as it is about leaping into computation without thinking about whether or not the model is appropriate for the data and

problem at hand. To that extent, the warning is equally applicable to blindly

applying the EM algorithm. Although both procedures will provide an answer

to the question, “What are the parameters of this model?” they do not necessarily answer the question, “Is this a reasonable model?”

Fortunately, Bayesian statistics offers an answer here as well. If we have a

full Bayesian model, that model makes a prediction about the data we might

see. If this model has a very low probability of generating a given data set, it

is an indication that the model may not be appropriate. We can also use this

idea to choose between two competing models, or search for the best possible

model. The next chapter explores model checking in some detail.


Exercises

9.1 (Stratified Sampling). Example 9.1 used a simple random sample of

100 students with the number of masters in the sample unknown in advance.

Suppose instead a stratified sample of 50 masters and 50 nonmasters was

used. How would the inference differ, if it was known who the masters and

nonmasters are? How about if we do not know who is who, but we know there

are exactly 50 of each?

9.2 (Missing At Random). Classify the following situations as MCAR,

MAR, or neither:

1. A survey of high school seniors asks the school administrator to provide

grade point average and college entrance exam scores. College entrance

exam scores are missing for students who have not taken the test.

2. The same survey as in the first item, except that it now additionally

asks whether or not the student has declared an intent to apply to college.


9 Learning in Models with Fixed Structure

3. To reduce the burden on the students filling out the survey, the background questions are divided into several sections, and each student is

assigned only some of the sections using a spiral pattern. Responses on

the unassigned section are missing.

4. Some students when asked their race decline to answer.

9.3 (Missing At Random and Item Responses). Item responses can be

missing for a variety of reasons. Classify the following situations concerning a

student’s missing response to a particular Task j as MCAR, MAR, or neither.

Hint: See (Mislevy 2015) or (Mislevy and Wu 1996).

1. John did not answer Task j because it was not on the test form he was administered.


2. Diwakar did not answer Task j because there are linked harder and easier

test forms, intended for fourth and sixth grade students; Task j is an easy

item that only appears on the fourth grade form; and Diwakar is in sixth

grade, so he was administered the hard form.

3. Rodrigo took an adaptive test. He did well, the items he was administered

tended to be harder as he went along, and Task j was not selected to

administer because his responses suggested it was too easy to provide

much information about his proficiency.

4. Task j was near the end of the test, and Ting did not answer it because

she ran out of time.

5. Shahrukh looked at Task j and decided not to answer it because she

thought she would probably not do well in it.

6. Howard was instructed to examine four items and choose two of them to

answer. Task j was one of the four, and not one that Howard chose.

9.4 (Classical Test Theory). Consider the following simple model from

classical test theory. Let Ti be a student’s true score on a family of parallel tests

on which scores can range from 0–10. Ti characterizes Student i’s proficiency

in the domain, but cannot be observed directly. Instead, we observe noisy

scores on administrations of the parallel tests. Let Xij be Student i’s score on

Test j. Following classical test theory, let

$$X_{ij} = T_i + \epsilon_{ij},$$

where $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$ and $T_i \sim N(\mu, \sigma_T^2)$. The classical test theory index of reliability is

$$\rho = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_\epsilon^2}.$$


The following is BUGS code for estimating $\mu$, $\sigma_T^2$, $\sigma_\epsilon^2$, and $\rho$ from the responses of nine students to five parallel test forms:

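model ctt {
  # True scores: one per student, drawn from the population distribution
  for (i in 1:N) {
    T[i] ~ dnorm(mu, tauT)
    # Observed scores on the parallel forms scatter around the true score
    for (j in 1:ntest) {
      x[i,j] ~ dnorm(T[i], taue)
    }
  }
  # Priors (dnorm takes a mean and a precision)
  mu ~ dnorm(0, .01)
  tauT ~ dgamma(.5, 1)
  taue ~ dgamma(.5, 1)
  # Reliability and variances expressed through the precisions
  rho <- taue / (taue + tauT)
  varT <- 1 / tauT
  varE <- 1 / taue
}

# Initial values for the T's, one list per chain
list(T = c(20,20,20,20,20,20,20,20,20))

list(T = c(-20,-20,-20,-20,-20,-20,-20,-20,-20))

9 ), .Dim=c(9,5)))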

Note that in BUGS, the normal distribution is parameterized with the mean, $\mu$, and precision, $\tau = 1/\sigma^2$. The line varT <- 1/tauT produces draws for $\sigma_T^2$. Run the problem with this setup, using the two lists above as initial values for the Ts in two chains. Monitor T, mu, varT, varE, and rho.
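The expression for rho in the code may look unfamiliar because it is written in terms of precisions rather than variances. A short derivation, writing $\tau_T$ for tauT and $\tau_\epsilon$ for taue (so that the true-score variance is $1/\tau_T$ and the error variance is $1/\tau_\epsilon$), shows that it is the usual reliability ratio:

```latex
\rho
  = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_\epsilon^2}
  = \frac{1/\tau_T}{1/\tau_T + 1/\tau_\epsilon}
  = \frac{\tau_\epsilon}{\tau_\epsilon + \tau_T}
```

The last step multiplies numerator and denominator by $\tau_T \tau_\epsilon$.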

1. Run 500 MCMC cycles. There are overdispersed initial values for the Ts. Ask for Stats, history, density, and the BGR convergence diagnostics plot. Does it look like burn-in cycles may be needed for this problem? Which parameters seem to be more or less affected by the overdispersed initial values?
2. Run another 500 cycles, and calculate summary statistics and distributions for the parameters you monitored based on only the last 500 cycles (hint: beg = 501 on the sample monitor dialog box). Regarding estimates for individual students: What are the posterior means for the Ts (i.e., true scores) of each of the students? What are their maximum likelihood estimates? (Hint: BUGS does not tell you this; you need to do a little arithmetic.) In what directions do they differ, and why?

3. Look at the summary statistics for T[1]. What is the meaning of the

number in each column?

9.5 (Classical Test Theory and the Effects of Different Priors). Consider the model and data from Exercise 9.4. Start with the original setup.

Run just one chain for each of the variations required. In each case, run 2000

cycles, and calculate statistics based on only the last 1000. Monitor mu, varT,

varE, T, and rho.

1. The original setup uses, as a prior distribution for mu, N(0, .01) (using the BUGS convention with the precision as the second parameter). Run the same problem, except with N(0, 1) as the prior for mu, and monitor the results. Focusing on posterior means and standard deviations of the parameters listed above, which ones change? How much and why?

2. Repeat the run, except with N(0, 10) as the prior for mu. Again focusing on

posterior means and standard deviations of the parameters listed above,

which ones change? How much and why?

3. The original setup uses, as a prior distribution for both tauT and taue,

Gamma(.5, 1). Run the same problem, except with Gamma(.05, .10) as the

prior for tauT, and monitor the results. Focusing on posterior means and

standard deviations of the parameters listed above, which ones change?

How much and why?

4. Repeat the run, except with Gamma(50, 100) as the prior for tau. Again

focusing on posterior means and standard deviations of the parameters

listed above, which ones change? How much and why?

9.6 (Slow Mixing). A researcher runs a small pilot MCMC chain on a particular problem and gets a trace graph for one parameter like Fig. 9.9. The

researcher shows it to two colleagues. The first says that the problem is the

proposal distribution and suggests that the researcher should try to find

new proposal distributions. The second says that if the researcher just runs

the chain for ten times longer than originally planned, then the trace plot

will look like “white noise” and the MCMC sample will be adequate. The

second colleague further suggests that a long weekend is coming up and the

lab computers will be idle anyway. Which advice should the researcher take?


9.7 (MAP and MCMC Mean). In Table 9.10, the MCMC mean frequently

appears to be closer to the MAP from the EM algorithm than it does to the

“true” value from the data generation. Why is this seen?



9.8 (Latent Class). The following response vectors come from a simple

latent class model with two classes and ten dichotomously scored items.

1111111010 1110111011 0000010000 1111001011 0010001000

0000101000 0000010000 0010100000 0000000000 0000001000

0011100111 0100000000 0000000000 0000000000 0010000000

1100000000 1111011111 0010100000 0010000010 0111111011

0010000000 0000000000 1011111111 0010000100 1111111011

1110110111 0000000000 0000110010 1111010100 0000000010

0100100000 0111100010 0000000000 0011000000 1111111010

0100100000 0000001100 1111111111 1100000000 0010010000

0000101000 0101011111 0000000000 1111111011 0000010000

0010000000 0000010000 1111110111 0010000100 0010000000

0010010010 1111011111 1110101110 1111011101 0000000000

1111111110 1100010010 0001000001 1111111101 0011111011

0000000100 0010000001 0010000000 0010100000 0111111111

0000000000 0001110000 0000010010 1111011111 0010010000

1110101001 0010110110 1110000000 0000100101 0010100010

0000000110 0000000000 0000100010 1111001011 0000001000

0010000000 0000010001 0010011010 1111111011 1010101100

0010000001 0011000000 1000100010 0000010000 1111100011

0110000100 1001000000 1010000000 0100010110 0010111101

1101111111 0010000000 1111101111 1000100100 1111101111

Use a Beta(1, 1) prior for the class membership probability λ and for all tasks use a Beta(1.6, .4) prior for the probability of success for masters, πj1, and a Beta(.4, 1.6) prior for the probability of success for nonmasters, πj0. Estimate the parameters from the data using MCMC.

9.9 (Latent Class Prior). In Exercise 9.8, what would have happened if we

had used a Beta(1, 1) prior for πj1 and πj0 ?

Hint: Consider three MCMC chains starting from the starting points: πj = (.2, .8), πj = (.5, .5), and πj = (.8, .2) for all j.

9.10 (Latent Class Parameter Recovery). The data for Exercise 9.8 are

from a simulation, and the parameters used in the simulation are: λ = 0.379,

plus the values in the following table.

Task         1     2     3     4     5     6     7     8     9     10
Nonmasters   0.22  0.17  0.31  0.07  0.25  0.19  0.15  0.18  0.22  0.11
Masters      0.82  0.84  0.85  0.72  0.81  0.76  0.8   0.7   0.84  0.76

Calculate a 95 % credibility interval for each parameter (this can be done by

taking the 0.025 and 0.975 quantiles of the MCMC sample). How many of the

credibility intervals cover the data generation parameters? How many do we

expect will cover the data generation parameters?
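The quantile recipe mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the draws are generated from a Beta distribution as a stand-in for real MCMC output (they are not results for this exercise), and the linear-interpolation rule is one of several common quantile conventions.

```python
import random

def quantile(sorted_draws, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_draws) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    frac = pos - lo
    return sorted_draws[lo] * (1 - frac) + sorted_draws[hi] * frac

def credibility_interval(draws, level=0.95):
    """Equal-tailed interval from a list of MCMC draws for one parameter."""
    s = sorted(draws)
    tail = (1 - level) / 2
    return quantile(s, tail), quantile(s, 1 - tail)

# Stand-in for monitored MCMC output (hypothetical draws, NOT real results).
rng = random.Random(1)
fake_draws = [rng.betavariate(40, 60) for _ in range(2000)]

lower, upper = credibility_interval(fake_draws)
# Coverage check against a data-generation value, here lambda = 0.379.
covers = lower <= 0.379 <= upper
print(lower, upper, covers)
```

Repeating the coverage check for every monitored parameter and counting the successes gives the tally the exercise asks for.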



9.11 (EM vs. MCMC). In each of the following situations, tell whether it

is better to use the EM algorithm or MCMC to estimate parameters.

1. The posterior mean will be used in an online scoring engine. The posterior

variance will be examined briefly as a model checking procedure, but will

not be used in scoring.

2. The test specs call for only using items whose p-plus (the marginal probability of success, $p^+_j = P(X_j = 1)$) is greater than .1 and less than .9 with 90 % credibility, that is, $P(0.1 \le p^+_j \le 0.9) \ge 0.9$.

3. Only the posterior mean will be used in online scoring, but there is strong

suspicion that the posterior distribution of the difficulty parameter for several items

is bimodal.

9.12 (Improving Posterior Standard Deviation). Consider the calibration in Example 9.3. Which of the following activities are likely to reduce the

standard deviation of the posterior law for the proficiency model parameter


1. Increase the size of the MCMC sample.

2. Increase the length (number of tasks) of the test.

3. Increase the number of students in the calibration sample.

Which of the following activities are likely to reduce the standard deviation

of the posterior for an evidence model parameter such as π10 :

1. Increase the size of the MCMC sample.

2. Increase the length (number of tasks) of the test.

3. Increase the number of students in the calibration sample.

9.13 (LSAT model). The BUGS distribution package (Spiegelhalter et al.

n.d.) comes with a sample model called “LSAT” based on an analysis performed by Bock and Aitkin (1981) of responses on five items from 1000 students taking the Law School Admissions Test (LSAT). The data are analyzed

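using the Rasch model, where if pij is the probability that Student i gets item j correct, then

$$\mathrm{logit}(p_{ij}) = \theta_i - \alpha_j$$

where the proficiency variable θi has distribution N(0, τ) (Spiegelhalter et al. n.d.). Note that this equation can be reparameterized as:

$$\mathrm{logit}(p_{ij}) = \beta\theta_i - \alpha_j$$

where θi ∼ N(0, 1) and β = 1/√τ (with τ a precision, following the BUGS convention, so that 1/√τ is the standard deviation of the original θi).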

Run an MCMC sampler using both the original (Eq. 9.29) and reparameterized (Eq. 9.29) models. What differences are there in the resulting Markov chains?



10 Critiquing and Learning Model Structure

The previous chapter described how to fit a model to data. The parameter-learning methods described there assumed the structure of the model was

fixed. However, often there is as much or more uncertainty about the structure of the model as there is about the values of the parameters. There are

basically two approaches to this problem. The first is model checking, or as

it is sometimes called, model criticism. Fit indices and graphical displays can

help us explore where and how well the model fits the data, and bring to light

problems with a model. The second is model search. There are a number of

ways to search the model space for one that is “best” in some sense.

While traditional methods of characterizing model fit emphasized overall

goodness of fit, we take a more utilitarian perspective. The statistician George

Box famously said “All models are wrong, but some are useful” (Box 1976). We

want a Bayes net that captures the key interrelationships between what students know, in terms of proficiency variables, and what they can do, in terms

of observables, without having to believe that the model expresses every pattern in the data. We do not, however, want unmodeled patterns that make our

inferences about students’ proficiencies misleading for the purpose at hand.

We are interested in fit indices that highlight particular kinds of model misfit

which, from experience, we know can appear in assessment data and distort

our uses of the model.

We emphasize exploratory uses of model checking over statistical tests of

fit, partly because Bayes nets are often applied with small to medium size data

sets, and partly because the techniques we describe fall out almost as a byproduct of Markov Chain Monte Carlo (MCMC) estimation, in ways that generate their own reference distributions. The reader interested in large-sample

distributions of prediction-based fit indices is referred to Gilula and Haberman (1995), Gilula and Haberman (2001), and to Haberman et al. (2013)

for an application to item response theory (IRT) models. Although we do not

pursue large sample properties here, the chapter draws in places on their work

on prediction-based indices.

© Springer Science+Business Media New York 2015

R. G. Almond et al., Bayesian Networks in Educational Assessment,

Statistics for Social and Behavioral Sciences, DOI 10.1007/978-1-4939-2125-6_10



Section 10.1 introduces some fit indices and describes a simple simulation

experiment for using them. Section 10.2 looks at the technique of posterior predictive model checking (PPMC), which goes well with MCMC estimation. Section 10.3 looks at some graphical methods for assessing model fit. Section 10.4

addresses differential task functioning, where the issue is whether a task works

similarly across student groups. Section 10.5 then turns to model comparison.

Usually simpler models are preferable to more complex ones, but complex

ones will fit better just because of the extra parameters. The DIC fit measure

discussed in Sect. 10.5.1, which also goes well with MCMC, includes a penalty

for model complexity. In Sect. 10.5.2, prediction-based indices are defined and

illustrated with the discrete-IRT testlet model. Looking ahead, Chap. 11 will

apply several of these techniques to the mixed-number subtraction example.

Given a measure of model fit, one can search for a model that fits the data

best. Section 10.6 looks at some literature on automatic model selection. There

are, however, some important limitations on learning model structure with

Bayes nets. In particular, there can be ways of reversing the direction of some

of the edges that have different interpretations but do not change the implied

probability distribution. Section 10.7 discusses some of these equivalent models, and highlights pitfalls in attempting to learn “causality” from data.

10.1 Fit Indices Based on Prediction Accuracy

The fact that Bayes nets are probability models gives them a distinct advantage over more ad hoc mathematical models for managing uncertainty, such

as fuzzy logic (Zadeh 1965) and certainty factors (Shortliffe and Buchanan

1975). In particular, the probabilities can be regarded as predictions and standard statistical techniques can assess how accurate those predictions are. This

assessment yields information about how well the model fits the data.

Cowell et al. (1993) describe several locations in a Bayesian network at

which fit can be assessed:

Node Fit: How well the model predicts the distribution of a single variable

in the model. These can either be conditional predictions taking

the values of other variables into account or marginal predictions

ignoring the values of other variables.

Edge Fit: How well the relationship between a parent and child in the graph

is modeled.

Global Fit: How well all variables in the data fit the graphical model.

These are not always easy to apply in educational testing because the

proficiency variables are latent, and therefore predictive patterns of certain

dependencies described in the model cannot be directly assessed. In particular,

parent–child relationships often involve at least one latent variable and hence,

they cannot be directly tested with only observed data. Thus, the node fit

indices can only be applied to observable variables and the global fit indices

must calculate how well the model predicts all the observables.



One way to get around this problem is to leave the data out for one

observable, and see how well the model predicts that observable based on the

remaining values. This is called leave-one-out prediction. (The idea extends

readily to leaving out a group of observables, then predicting some summary

statistic of them such as a subtest score.) Suppose that a collection of observable outcomes is available from N learners taking a particular form of an

assessment. Let Yij be the value of Observable j for Learner i. Assume that

Yij is coded as an integer and it can take on possible values 1, . . . , Kj .

Using the methods of Chap. 5, it is easy to calculate a predictive distribution for Yij given any other set of data. Let Yi,−j be the vector of responses for Learner i on every observable except Observable j. Define pijk = P(Yij = k | Yi,−j). In all of the fit indices described below, the idea is to characterize how well the model’s prediction pijk, given all of Learner i’s other responses, predicts the Yij that was actually observed. Let pij∗ denote the value of pijk for k = Yij, that is, the predicted probability of the event that actually occurred.
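As a minimal sketch of how such a leave-one-out prediction can be computed, consider a two-class (master/nonmaster) latent class model with dichotomous observables. The function name, the 0/1 coding of responses (rather than 1, . . . , Kj), and all parameter values are illustrative assumptions, not taken from the text.

```python
def leave_one_out_prob(responses, j, lam, pi_master, pi_nonmaster):
    """P(Y_j = 1 | all other responses) for one learner, two-class model.

    responses        -- list of 0/1 scores for one learner
    j                -- zero-based index of the observable that is left out
    lam              -- prior probability of the master class
    pi_master[t]     -- P(correct on task t | master)
    pi_nonmaster[t]  -- P(correct on task t | nonmaster)
    """
    # Joint probability of class and the OTHER responses (local independence).
    like_m, like_n = lam, 1.0 - lam
    for t, y in enumerate(responses):
        if t == j:
            continue  # leave observable j out of the evidence
        like_m *= pi_master[t] if y == 1 else 1.0 - pi_master[t]
        like_n *= pi_nonmaster[t] if y == 1 else 1.0 - pi_nonmaster[t]
    post_m = like_m / (like_m + like_n)  # P(master | Y_{i,-j})
    # The prediction mixes the class-specific success probabilities.
    return post_m * pi_master[j] + (1.0 - post_m) * pi_nonmaster[j]

# Hypothetical learner and parameters.
resp = [1, 1, 0]
p_correct = leave_one_out_prob(resp, 2, 0.4, [0.8, 0.8, 0.8], [0.2, 0.2, 0.2])
# Probability assigned to what actually happened (a wrong answer on task 3).
p_star = p_correct if resp[2] == 1 else 1.0 - p_correct
print(p_correct, p_star)
```

Here the two correct answers make mastery likely, so the model predicts success on the third task with probability of about 0.75; the observed failure therefore receives a predictive probability of about 0.25.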

Williamson (2000) noted that a number of measures of quality of prediction have historically been applied to evaluate weather forecasting. These can

be pressed into service fairly easily to evaluate node fit in Bayesian networks.

Williamson (2000) (see also Williamson et al. 2000) evaluated a number of

these and found the most useful to be Weaver’s Surprise Index, the Ranked

Probability Score, and Good’s Logarithmic Score. The first two are more traditional indices, which remain useful to alert users to anomalies in collections

of predictions. Good’s Logarithmic Score, useful enough on its own, leads

us to more theoretically grounded techniques based on statistical theory and

information theory.

Weaver’s Surprise Index (Weaver 1948) attempts to distinguish between

a “rare” event and a “surprising” one. A rare event is one with a small probability. A surprising event is one with a small probability relative to the probability of other events. (The definition of events can make a difference: The

probability of a Royal Flush in clubs—the ace, king, queen, jack, and ten of

clubs—is the same as the probability of any other specified hand, so getting

this particular hand in poker is rare but no more surprising than any other

specified set of five cards in the deck when each is considered an event in the

comparison. It is rare and surprising with respect to events defined by sets of

hands with the same poker value, such as one pair, two pairs, straight, etc.)

Weaver’s surprise index is defined as the ratio of the expected value of

prediction probabilities to that of the actual event:

$$W_{ij} = \frac{E(p_{ijk})}{p_{ij*}} = \frac{\sum_{k=1}^{K_j} p_{ijk}^2}{p_{ij*}} .$$
The expectation here is over Yij using the predictive probabilities pijk . The

larger the value, the more surprising the result. Weaver suggests that values



of 3–5 are not large, values of 10 begin to be surprising and values above 1000

are definitely surprising.
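The distinction between rare and surprising is easy to see numerically. The following sketch computes the index as the expected prediction probability (the sum of squared probabilities) divided by the probability of the observed outcome; the function name and the two example distributions are illustrative assumptions.

```python
def weaver_surprise(probs, observed):
    """Weaver's surprise index: E(p) / p_observed, with E(p) = sum of p_k**2.

    probs    -- predictive probabilities over the possible outcomes
    observed -- zero-based index of the outcome that actually occurred
    """
    expected_p = sum(p * p for p in probs)
    return expected_p / probs[observed]

# Rare but not surprising: 1000 equally likely outcomes.
uniform = [1.0 / 1000] * 1000
# Rare and surprising: one outcome was heavily favored, a different one occurred.
peaked = [0.999] + [0.001 / 999] * 999

print(weaver_surprise(uniform, 0))  # close to 1
print(weaver_surprise(peaked, 1))   # far above 1000
```

In both cases the observed outcome had a probability of roughly one in a thousand or less, but only the second is surprising relative to what the prediction led one to expect.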

Weaver’s surprise index assumes that one wrong prediction is as bad as

another. However, frequently the observables in an educational model represent ordered outcomes (e.g., a letter grade assigned to an essay). In those

situations, the measure of prediction quality should provide a greater penalty

for predictions that are far off than for near misses. The Ranked Probability

Score (Epstein 1969) takes this into account.

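$$S_{ij} = \frac{3}{2} - \frac{1}{2(K_j - 1)} \sum_{k=1}^{K_j - 1} \left[ \left( \sum_{n=1}^{k} p_{ijn} \right)^2 + \left( \sum_{n=k+1}^{K_j} p_{ijn} \right)^2 \right] - \frac{1}{K_j - 1} \sum_{k=1}^{K_j} |k - Y_{ij}|\, p_{ijk} .$$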

This index ranges from 0.0 (poor prediction) to 1.0 (perfect prediction). It

assumes the states have an interval scale.
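A small sketch of Epstein's (1969) score in its usual form may help check the stated range and the extra penalty for far misses. The function name and the example predictions are illustrative assumptions; categories are coded 1, . . . , K as in the text.

```python
def ranked_probability_score(probs, observed):
    """Epstein's ranked probability score for ordered categories 1..K.

    probs    -- predictive probabilities; probs[k-1] belongs to category k
    observed -- the category that occurred, coded 1..K
    """
    K = len(probs)
    cum_term = 0.0
    for k in range(1, K):            # k = 1, ..., K-1
        lower = sum(probs[:k])       # probability of categories 1..k
        upper = sum(probs[k:])       # probability of categories k+1..K
        cum_term += lower ** 2 + upper ** 2
    dist_term = sum(abs(k - observed) * p
                    for k, p in enumerate(probs, start=1))
    return 1.5 - cum_term / (2 * (K - 1)) - dist_term / (K - 1)

certain_2 = [0.0, 1.0, 0.0, 0.0]   # all probability on category 2
print(ranked_probability_score(certain_2, 2))  # exact hit
print(ranked_probability_score(certain_2, 3))  # near miss
print(ranked_probability_score(certain_2, 4))  # far miss
```

With the same confident prediction, the score drops from 1 for an exact hit to about 0.67 for a miss by one category and about 0.33 for a miss by two.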

Good’s Logarithmic Score extends the basic Logarithmic Scoring Rule.

Recalling that pij∗ is the posterior probability of the observed outcome, the

basic logarithmic score is


Lij = − log pij∗ .

The lower the probability, the larger the value of the logarithmic score for

this observation. Twice the logarithmic score is called the deviance, and finding parameter values that minimize the total deviance over a sample gives

the maximum likelihood estimates for a model. We will return to the use of

deviance in model comparisons in Sect. 10.5.1.

The basic logarithmic score makes no distinction between “rare” and “surprising.” To take care of this effect, Good (1952) subtracts the expected logarithmic score, over the values that might have been observed:


$$GL_{ij} = -\log p_{ij*} - \left( -\sum_{k=1}^{K_j} p_{ijk} \log p_{ijk} \right) .$$

Values near zero indicate that the model accurately predicts the observation.

The average logarithmic score is known in information theory as entropy, or

the uncertainty about Yij before it was observed, and is denoted by Ent(Yij ).
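The relation between the basic logarithmic score, the entropy, and Good's score can be sketched as follows. The function names and example distributions are illustrative assumptions; natural logarithms are used, and outcomes are indexed from zero.

```python
import math

def log_score(probs, observed):
    """Basic logarithmic score: -log of the probability of what occurred."""
    return -math.log(probs[observed])

def entropy(probs):
    """Expected logarithmic score under the predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def good_score(probs, observed):
    """Good's score: logarithmic score minus its expected value."""
    return log_score(probs, observed) - entropy(probs)

uniform = [1.0 / 1000] * 1000               # every outcome is rare
peaked = [0.999] + [0.001 / 999] * 999      # one outcome strongly expected

print(log_score(uniform, 0), good_score(uniform, 0))
print(log_score(peaked, 1), good_score(peaked, 1))
```

For the uniform prediction the logarithmic score is large (about 6.9) but Good's score is essentially zero, because every outcome was equally rare. For the peaked prediction both scores are large when an unexpected outcome occurs.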

Any of the preceding fit indices can be averaged across the sample of examinees to give a measure of node fit for a particular observable. For example, $S_j = \sum_{i=1}^{N} S_{ij}/N$ is the ranked probability score for Observable j. The indices can also be averaged across nodes to get a measure of person fit, e.g., $W_i = \sum_{j=1}^{J} W_{ij}/J$.
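In code, both summaries are simple averages over one dimension of a learner-by-observable table of index values. The sketch below assumes the table is stored as a list of rows, one per learner; the function names and the numbers are hypothetical.

```python
def node_fit(index_table, j):
    """Average an index over learners for observable j (a column mean)."""
    return sum(row[j] for row in index_table) / len(index_table)

def person_fit(index_table, i):
    """Average an index over observables for learner i (a row mean)."""
    row = index_table[i]
    return sum(row) / len(row)

# Hypothetical index values for 3 learners (rows) and 2 observables (columns).
table = [
    [1.0, 0.5],
    [0.5, 0.0],
    [0.0, 1.0],
]
print(node_fit(table, 0))    # average over learners for the first observable
print(person_fit(table, 0))  # average over observables for the first learner
```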



The question is how extreme the indices above need to be to indicate a problem. Williamson et al. (2000) posed a simple simulation experiment to answer

that question for any given model. When assessing model fit, the null hypothesis is that the data fit the model. This suggests a procedure for determining

the distribution of the test statistic under the null hypothesis.

1. Generate a sample of the same size as the real data from the posited model.


2. Calculate the value of the fit indices for the simulated data set, using the

posited model and conditional probabilities.

3. Repeat Steps 1 and 2 many times to generate the desired reference distribution of any of the fit indices.

Step 2 produces fit index values for all of the simulee-by-task combinations in the sample, e.g., Sij. Reference distributions for task and person fit measures are created by averaging over tasks or persons accordingly, i.e., Sj is computed by averaging Sij. Williamson et al. (2000) note that the reference distribution

can be computed more cheaply using a simple bootstrap (Efron 1979), which

samples repeatedly from the observed data.
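The three-step simulation can be sketched as follows for a small two-class latent class model, using the average logarithmic score of one task as the fit index. Everything specific here is an assumption made for illustration: the parameter values, the function names, the choice of index, and the numbers of simulees and replications.

```python
import math
import random

# Posited model (hypothetical values): class probability and task parameters.
LAM = 0.4
PI_M = [0.8, 0.8, 0.7, 0.9]   # P(correct | master) per task
PI_N = [0.2, 0.3, 0.2, 0.1]   # P(correct | nonmaster) per task

def simulate_learner(rng):
    """Draw a class, then draw 0/1 responses given the class."""
    pis = PI_M if rng.random() < LAM else PI_N
    return [1 if rng.random() < p else 0 for p in pis]

def loo_prob_correct(resp, j):
    """P(Y_j = 1 | other responses) under the posited model."""
    like_m, like_n = LAM, 1.0 - LAM
    for t, y in enumerate(resp):
        if t != j:
            like_m *= PI_M[t] if y else 1.0 - PI_M[t]
            like_n *= PI_N[t] if y else 1.0 - PI_N[t]
    post_m = like_m / (like_m + like_n)
    return post_m * PI_M[j] + (1.0 - post_m) * PI_N[j]

def task_log_score(data, j):
    """Average logarithmic score for task j over all learners in data."""
    total = 0.0
    for resp in data:
        p1 = loo_prob_correct(resp, j)
        p_star = p1 if resp[j] == 1 else 1.0 - p1
        total += -math.log(p_star)
    return total / len(data)

def reference_distribution(n_learners, j, n_reps, seed=0):
    rng = random.Random(seed)
    values = []
    for _ in range(n_reps):                                        # Step 3
        sim = [simulate_learner(rng) for _ in range(n_learners)]   # Step 1
        values.append(task_log_score(sim, j))                      # Step 2
    return sorted(values)

ref = reference_distribution(n_learners=100, j=0, n_reps=200)
print(ref[0], ref[len(ref) // 2], ref[-1])
```

The value of the same index computed from the real data can then be located within the sorted reference values; an observed value beyond the upper tail would suggest that the task fits worse than the posited model predicts.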

10.2 Posterior Predictive Checks

The Williamson et al. (2000) method ignores one potentially important source

of variability, the uncertainty in the predictions due to uncertainty about the

parameters (it matters less for problems with large samples). The method of

PPMC (Guttman 1967; Rubin 1984) does incorporate uncertainty in parameters. Furthermore, it works very naturally with Markov Chain Monte Carlo

estimation. Sinharay (2006) provides a good summary of this technique

applied to Bayesian network models. This section describes the approach and

gives a simple example.

Let p(y|ω) be the likelihood for data y given parameters ω. Let $y^{rep}$ be a replicate set of data generated by the same process as y with the same parameters ω. This is sometimes called a shadow data set. The posterior predictive method suggests using the posterior predictive distribution for $y^{rep}$ to create a reference distribution for a given fit statistic, similar to the way the Williamson et al. (2000) method created an empirical reference distribution in the previous section. The posterior predictive distribution is defined as:

$$p(y^{rep} \mid y) = \int p(y^{rep} \mid \omega)\, p(\omega \mid y)\, d\omega .$$


Correspondingly, the posterior predictive distribution for a fit index, e.g., Weaver’s surprise index for Task j, Wj, is obtained as

$$p(W_j(y^{rep}) \mid y) = \int p(W_j(y^{rep}) \mid \omega)\, p(\omega \mid y)\, d\omega .$$
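In practice the integral is approximated by simulation: for each posterior draw of ω, generate one replicate data set and compute the statistic. The sketch below assumes a deliberately simple model, chosen only for illustration, in which every learner's number-right score on n_items tasks is binomial with a common success probability ω and a Beta(1, 1) prior, so posterior draws come from a Beta distribution rather than from an MCMC run. The discrepancy statistic (the sample variance of the scores), the function names, and the data are likewise hypothetical.

```python
import random

def variance(xs):
    """Sample variance (divisor n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def ppmc_pvalue(scores, n_items, n_draws=2000, seed=0):
    """Posterior predictive p-value for the variance of number-right scores."""
    rng = random.Random(seed)
    n = len(scores)
    right = sum(scores)
    wrong = n * n_items - right
    observed_stat = variance(scores)
    exceed = 0
    for _ in range(n_draws):
        # Draw omega from its posterior, Beta(1 + right, 1 + wrong).
        omega = rng.betavariate(1 + right, 1 + wrong)
        # Generate a replicate data set of the same size given omega.
        y_rep = [sum(rng.random() < omega for _ in range(n_items))
                 for _ in range(n)]
        # Compare the replicated statistic with the observed one.
        if variance(y_rep) >= observed_stat:
            exceed += 1
    return exceed / n_draws

# Hypothetical scores on a 10-item test: a mix of very low and very high.
scores = [0, 1, 9, 10, 2, 8, 10, 0, 1, 9]
p_value = ppmc_pvalue(scores, n_items=10)
print(p_value)
```

A posterior predictive p-value near 0 or 1 indicates that the observed statistic is extreme relative to what the model can reproduce. Here the single-probability model cannot generate the spread seen in the scores, which points toward a model with more than one proficiency level. With MCMC output, the Beta draws would simply be replaced by the sampled parameter values.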

