9 Learning in Models with Fixed Structure

9.1 Data, Models, and Plate Notation

This section builds on the directed graphical notation discussed in Chap. 4, with the major elaboration being notation for structurally identical replications of models and variables. Plate notation is introduced below with some simple examples that show the correspondence between plate diagrams and probability models. A probability model and a plate diagram for a general educational measurement model are then presented.
We then discuss inference about structures like these—parameter estimation, as statisticians typically call it, or “learning,” as it is more often called
in the artificial intelligence community. A natural starting point is estimating
the structural parameters of the model, namely the population proficiency distributions and conditional distributions for observable variables, when examinees’ proficiency variables are known. This so-called complete data problem
is a straightforward application of Bayesian updating in the binomial and
multinomial models. It is also at the heart of the more realistic incomplete
data problem, where neither examinees' proficiencies nor structural parameters are known. Two approaches are discussed: the EM algorithm (Dempster
et al. 1977) and MCMC estimation (Gilks et al. 1996). Many good references for
these techniques are available, so the focus here is on the key ideas and their
application in a simple problem. The special problem of bringing new tasks
into an established assessment, a recurring activity in ongoing assessment
systems, is also addressed.
9.1.1 Plate Notation
Plate notation (Buntine 1996) extends the notation for directed graphs we
developed in Chap. 4. As before, there are nodes that represent parameters and
variables, and directed edges that represent dependency relationships among
them. The new idea is an efficient way to depict replications of variables and
structures by displaying a single representative on a "plate" that indicates
the number of replications.
As a first simple example, consider the outcomes Xj of four draws from a
Bernoulli distribution with known parameter θ. The joint probability distribution is ∏j p(xj | θ). The usual digraph is shown in the top panel of Fig. 9.1.
The bottom panel represents the same joint distribution using a plate for
the replicated observations. The node for θ lies outside the plate—it is not
replicated—and influences all observations in the same manner. That is, the
structure of the conditional probability distributions for all four Xj s given θ
is the same. The double edge on the θ node indicates that this parameter is
known, while the single edge on the Xj node indicates that their values are
not known.
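As a concrete check of this factorization, here is a minimal sketch (the function names are illustrative, not from the text) that computes the joint probability of the four draws as the product of identical Bernoulli factors, one per replication on the plate:

```python
def bernoulli_pmf(x, theta):
    """p(x | theta) for a single Bernoulli outcome, x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

def joint_prob(xs, theta):
    """Joint probability of the replicated draws: prod_j p(x_j | theta).
    The plate in Fig. 9.1 stands for exactly this product of identical factors."""
    prob = 1.0
    for x in xs:  # one identical factor per replication on the plate
        prob *= bernoulli_pmf(x, theta)
    return prob

# With theta known to be 0.5, every pattern of four draws has probability 1/16:
print(joint_prob([1, 0, 1, 1], 0.5))  # 0.0625
```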
Plates can be nested. Consider a hierarchical model extending the previous
example such that four responses are observed for each of N students. For each
student i, the response variables Xij are Bernoulli distributed with probability
θi.¹

¹ This is a variable, not a parameter, as it is student specific.

Fig. 9.1 Expanded and plate digraphs for four Bernoulli variables
Upper figure shows the full graph, lower figure shows the same structure with the plate notation. Reprinted with permission from ETS.

These success probabilities are not known, and prior belief about them is
expressed through a beta distribution with known parameters α and β. The
joint probability distribution is now
∏i [∏j p(xij | θi)] p(θi | α, β),

and the digraph using plate notation is as shown in Fig. 9.2. The replications of responses for a given examinee are conditionally independent and
all depend in the same way on the same Bernoulli probability θi, as implied
by ∏j p(xij | θi). Similarly, all the θi s depend in the same way on the same
higher level parameters α and β, as implied by ∏i p(θi | α, β).
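A forward-sampling sketch of this hierarchical structure, using only the standard library (the function and variable names are illustrative, not from the text):

```python
import random

def sample_beta_bernoulli(n_students, n_items, alpha, beta, rng=random):
    """Forward-sample the hierarchical Beta-Bernoulli model of Fig. 9.2:
    theta_i ~ Beta(alpha, beta), then x_ij ~ Bernoulli(theta_i)."""
    data = []
    for _ in range(n_students):                 # outer plate: students i
        theta_i = rng.betavariate(alpha, beta)  # draw from the prior law
        xs = [1 if rng.random() < theta_i else 0
              for _ in range(n_items)]          # inner plate: replications j
        data.append((theta_i, xs))
    return data

random.seed(7)
for theta_i, xs in sample_beta_bernoulli(3, 4, alpha=2.0, beta=2.0):
    print(round(theta_i, 2), xs)
```

Each student gets a single θi shared by that student's four responses, mirroring how the θ node sits on the student plate but outside the response plate.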
A final introductory example is the Rasch model for dichotomous items,
shown in Fig. 9.3. At the center of the digraph, where plates for the proficiency
variables θi for students and difficulty parameters βj for items overlap, is the
probability pij of a correct answer by student i to item j:
pij = P(Xij = 1 | θi, βj) = Ψ(θi − βj) ≡ exp(θi − βj) / [1 + exp(θi − βj)],
where Ψ(·) denotes the cumulative logistic distribution. This probability is
known if θi and βj are known, through the functional form of the Rasch
model; this functional relationship, rather than the stochastic relationship, is
indicated in the digraph by a double edge on the node. That is, the double edge
indicates that the value of a variable is known, or that it is known conditional
on the values of its parents. Logical functions such as an AND gate have this
property too, but not a noisy-AND, because its outcome is probabilistic.

Fig. 9.2 Plate digraph for hierarchical Beta-Bernoulli model
Reprinted with permission from ETS.
Examinee proficiency parameters are posited to follow a normal distribution with unknown mean μθ and variance σθ². Higher level
distributions for the examinee proficiency distribution are μθ ∼ N(μw, σw²)
and σθ ∼ Gamma(aθ, bθ), with the parameters of the higher level distributions known. Item parameters are also posited to follow a normal distribution,
with a mean fixed at zero to set the scale and an unknown variance σβ², and
σβ² ∼ Gamma(aβ, bβ).
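The Rasch link Ψ(θi − βj) translates directly into code; a minimal sketch (the function name is ours, not from the text):

```python
import math

def rasch_prob(theta, beta):
    """P(X = 1 | theta, beta) = Psi(theta - beta), the logistic c.d.f.
    evaluated at the difference between proficiency and difficulty."""
    z = theta - beta
    return math.exp(z) / (1.0 + math.exp(z))

# When proficiency equals difficulty, the success probability is exactly 1/2:
print(rasch_prob(0.0, 0.0))            # 0.5
# Higher proficiency (or an easier item) raises the probability:
print(round(rasch_prob(1.0, 0.0), 3))  # 0.731
```

Note the symmetry Ψ(z) + Ψ(−z) = 1, so swapping θ and β turns a success probability into the corresponding failure probability.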

9.1.2 A Bayesian Framework for a Generic Measurement Model
In educational and psychological measurement models, observable variables
are outcomes of the confrontation between a person and a situation, or
more specifically, a task. In particular, observable variables X are modeled
as independent given unobservable, or latent, person variables² θ and task
parameters β. In the Rasch model, for example, examinees have more or less
proficiency in the same amount with respect to all items, and items are more
or less difficult in the same way for all examinees. (Section 9.1.3 gives extensions where this is
no longer technically true, including differential item functioning (DIF) and
mixture models.)

² In the context of maximum likelihood estimation, these are called person parameters because they must be estimated, but this book follows the convention of calling person-specific values variables rather than parameters.

Fig. 9.3 Plate digraph for hierarchical Rasch model
Reprinted with permission from ETS.
The discrete Bayes nets that are useful in cognitive diagnosis exhibit this
character. It is useful, however, to cast estimation in these models in a more
general framework in order to connect it with the broader psychometric and
statistical literature. The same characterization applies to the models of item
response theory (IRT), latent class analysis, factor analysis, latent profile
analysis, and, at the level of parallel tests, parametric classical test theory.
Examples of familiar models that often have their own history, notation, and
terminology are shown in Table 9.1.³ All of these models differ only in the
nature of the observable variables and student-model variables—discrete vs.
continuous, for example, or univariate vs. multivariate—and the form of the
link model, or probability model for observables given person variables and
task parameters. We will write p(xj | θ, βj), and include a subscript i, as in
p(xij | θi, βj), if we need to focus on the responses of a particular examinee i.
In discrete Bayesian models in cognitive diagnosis, p(xj | θ, βj) takes the
form of a collection of categorical distributions: for a given value of the proficiency variable(s) θ, there is a categorical probability distribution over the
possible values of xj, the value of the observable variable. The (vector-valued)
parameter βj comprises the probabilities for each response category for each
given value of θ.

³ The table is not exhaustive; for example, classical test theory is readily applied to vectors, factor analysis can be carried out without assuming normality, and there are mixture models that include both discrete classes and classes with probabilities structured by IRT models (e.g., Yamamoto 1987).
A full Bayesian model also requires distributions (or laws) for θs and βs
(as they are unknown). Treating examinees as exchangeable means that before
seeing responses, we have no information about which examinees might be
more proficient than others, or whether there are groups of examinees more
similar to one another than to those in other groups. It is appropriate in
this case to model θs as independent and identically distributed, possibly
conditional on the parameter(s) λ of a higher level distribution or law. That
is, for all examinees i,
θi ∼ p (θ | λ) .
In the normal-prior Rasch model example in the previous section, λ = (μθ, σθ²)
(the mean and variance of the normal law). If θ is a categorical variable, λ
will be the category probabilities, or a smaller number of parameters that imply
the category probabilities in a more parsimonious way (e.g., the models discussed
in Chap. 8).
Similarly, if we have no prior beliefs to distinguish among item difficulties,
we can model βs as exchangeable given the parameter(s) of their law:
β j ∼ p (β | ξ) .
In the Rasch example, ξ = σβ². When βs are probabilities in categorical distributions, as they are in discrete Bayes nets, then p(β | ξ) is either a beta or
Dirichlet prior on the probabilities, or a function of a smaller number of variables
that imply the conditional probabilities. The DiBello–Samejima distributions
in the previous chapter are an example of the latter.
Both λ and ξ may be specified as known, or as unknown with higher level
distributions, or laws, p(λ) and p(ξ).
The full Bayesian probability model for educational measurement is thus

p(X, θ, β, λ, ξ) = ∏i ∏j p(xij | θi, βj) ∏i p(θi | λ) ∏j p(βj | ξ) p(λ) p(ξ).  (9.1)

The corresponding graph, using plate notation, is shown in Fig. 9.4.
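To show how the factorization in Eq. 9.1 maps onto computation, here is a sketch of its log density. The density functions are supplied by the caller; all names are illustrative, not from the text:

```python
import math

def log_joint(x, theta, beta, lam, xi,
              log_p_x, log_p_theta, log_p_beta, log_p_lam, log_p_xi):
    """Log of Eq. 9.1: the double product over students i and observables j
    becomes a double sum of log link terms, plus the log prior terms."""
    total = log_p_lam(lam) + log_p_xi(xi)
    for theta_i, x_i in zip(theta, x):
        total += log_p_theta(theta_i, lam)           # p(theta_i | lambda)
        for beta_j, x_ij in zip(beta, x_i):
            total += log_p_x(x_ij, theta_i, beta_j)  # p(x_ij | theta_i, beta_j)
    for beta_j in beta:
        total += log_p_beta(beta_j, xi)              # p(beta_j | xi)
    return total

# Trivial check: flat priors (log 1 = 0) and a link that assigns probability
# 1/2 to every response give 4 * log(0.5) for a 2-by-2 response matrix.
val = log_joint([[1, 0], [0, 1]], theta=[0.0, 0.0], beta=[0.0, 0.0],
                lam=None, xi=None,
                log_p_x=lambda x, t, b: math.log(0.5),
                log_p_theta=lambda t, lam: 0.0,
                log_p_beta=lambda b, xi: 0.0,
                log_p_lam=lambda lam: 0.0,
                log_p_xi=lambda xi: 0.0)
print(round(val, 6))  # -2.772589
```

Working on the log scale keeps the product of many small probabilities numerically stable, which matters once the plates hold realistic numbers of students and items.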

9.1.3 Extension to Covariates
We briefly note three common extensions of the basic measurement model that
involve covariates, or collateral information, about examinees, denoted Zi , or
about tasks, denoted Yj . (Note that the entries in a Q-matrix are collateral
information about tasks.) The first is the extension to conditional exchangeability about examinee or task parameters. The second concerns interactions
between examinee covariates and response probabilities, or DIF. The third
concerns interactions as well, but when examinee covariates are not fully
known. These are mixture models.



Table 9.1 Special cases of the generic measurement model
The table characterizes each model by the nature of its student-model variables (θ), its observable variables, and its task parameters (β): classical test theory (continuous true score; error variance), dichotomous IRT (unidimensional continuous θ; dichotomous responses; item difficulty and related item parameters), graded response IRT (ordered categorical responses; item step parameters), latent class analysis (discrete class memberships; categorical responses; conditional probabilities), factor analysis (multivariate continuous variables), latent profile analysis (discrete classes; means and covariances for each class), dichotomous IRT mixtures (continuous and discrete variables; item parameters for each class), and (discrete) Bayes nets (discrete variables; categorical responses; conditional probabilities).

Fig. 9.4 Graph for generic measurement model
Reprinted with permission from ETS.

Z refers to known information about examinees such as demographic group
or instructional background. When this information affects our beliefs about
examinees’ proficiencies, examinees are no longer exchangeable. Eighth grade
students are more likely to know how to subtract fractions than fourth grade
students, for example. We may however posit conditional exchangeability,
or exchangeability among examinees with the same covariates, by making
the distributions on θ conditional on covariates, namely p(θ|z, λ). Through
p(θ|z, λ), we model an expectation that examinees with the same value of Z
are more similar to one another than those with other values of Z, or that
eighth graders generally have higher proficiencies than fourth graders.
Similarly, we can incorporate covariates for tasks, through p(β|y, ξ). A task
model variable Y can indicate the particular task model used to generate a
task (or characteristics of the task defined by the task model variable), so
we can use information about the parameters of similar tasks to estimate the
parameters of a new task from the same family, or allow for the fact that
problems with more steps are usually harder than problems with fewer steps
(Mislevy et al. 1993; Glas and van der Linden 2003).
Figure 9.5 depicts the extension to covariates that influence belief about
item or student parameters, but do not affect responses directly. That is,
responses remain dependent on item parameters and student variables only.
If we knew an examinee’s θ and a task’s β, learning Z or Y would not change
our expectations about potential responses.
DIF occurs when learning Z or Y would change our expectations about
responses, beyond parameters for the student and task (Holland and Wainer
1993). One example occurred in a reading assessment that surveyed grades 8
and 12: A task that required filling in an employment application was relatively easier for 12th grade students than for 8th grade students. Presumably
many 12th grade students had personal experience with employment applications, a factor that would make the item easier for such a student than for one
who had not but was otherwise generally similar in reading proficiency. In
this case, p(xj | θ, βj, Grade = 8) ≠ p(xj | θ, βj, Grade = 12). In such cases,
task parameters would differ for at least some tasks, and would be subscripted with
respect to covariates as βj(z). Figure 9.6 shows how DIF appears in a graph.

Fig. 9.5 Graph with covariates and no DIF
Reprinted with permission from ETS.
DIF concerns interactions between task response probabilities and examinees’ background variables Z, when Z is known. Mixture models posit that
such interactions may exist, but examinees’ background variables are not
observed (e.g., Rost 1990). An example would be mixed number subtraction
items that vary in their difficulty depending on whether a student solves them
by breaking them into whole number and fraction subproblems or by converting everything to mixed numbers and then subtracting (Tatsuoka 1983).
Figure 9.7 depicts the digraph for a generic mixture model. It differs from the
DIF model only in that Z is marked as unobserved rather than observed. Note
that this requires specifying a prior (population) distribution for Z. Now the
higher level parameter λ for student variables has two components: λθ for the
law for θ and λZ representing the probability parameter in, say, a Bernoulli
law for Z if there are two classes, or a vector of probabilities for a categorical
law if there are more than two classes.

9.2 Techniques for Learning with Fixed Structure
How, then, does one learn the parameters β of conditional probability distributions and λ of examinee proficiency distributions? This section addresses
inference in terms of the full Bayesian probability model of Eq. 9.1, the basics
of which are outlined below in Sect. 9.2.1. Section 9.2.2 discusses the simpler problem in which students’ θs are observed as well as their xs—the
"complete data" problem, in the terminology introduced in Dempster et al.
(1977)'s description of their EM algorithm. Section 9.3 relates the complete
data problem to the "incomplete data" problem we face in Eq. 9.1. This chapter introduces two common approaches to solving it, namely the EM algorithm
(Sect. 9.4) and MCMC (Sect. 9.5). Although both algorithms are general
enough to work with any model covered by Eq. 9.1, the focus of this chapter
is on discrete Bayes nets measurement models.

Fig. 9.6 Graph with DIF
Reprinted with permission from ETS.
9.2.1 Bayesian Inference for the General Measurement Model

Fig. 9.7 Graph for mixture model
Reprinted with permission from ETS.

The full Bayesian model (Eq. 9.1) contains observable variables X, proficiency variables (sometimes called person parameters) θ, task parameters
β, and higher level parameters λ and ξ. Equation 9.1 represents knowledge
about the interrelationships among all the variables in the generic measurement model without covariates, before any observations are made. The only
observations that can be made are those for observable variables. We observe
realized values of X, say x∗. In Bayesian inference, the basis for inference about
person variables and task parameters and their distributions is the distribution
of these variables conditional on x∗.
By Bayes theorem, this posterior is easy enough to express. Aside from a normalizing constant K, it has exactly the same form as Eq. 9.1, just with the
expressions involving X evaluated at their observed values:
p(θ, β, λ, ξ | x∗) = K ∏i ∏j p(x∗ij | θi, βj) ∏i p(θi | λ) ∏j p(βj | ξ) p(λ) p(ξ),  (9.2)

where

K⁻¹ = p(x∗) = ∫ p(x∗, θ, β, λ, ξ) dθ dβ dλ dξ
    = ∫ ∏i ∏j p(x∗ij | θi, βj) ∏i p(θi | λ) ∏j p(βj | ξ) p(λ) p(ξ) dθ dβ dλ dξ.  (9.3)
Bayesian inference proceeds by determining important features of the posterior distribution, such as means, modes, and variances of variables, picturing
their marginal distributions, calculating credibility intervals, and displaying
plots of marginal or joint distributions. If the prior laws are chosen to be
conjugate to the likelihood, then simple procedures of posterior inference are
available for independent and identically distributed samples, as illustrated in
the following section for the complete data problem.
But Eq. 9.2 in its entirety does not always have a natural conjugate, even
when all of the factors taken by themselves do (especially if some of the
variables are missing; see York 1992). So, while the form of Eq. 9.2 is simple
enough to write, the difficulty of evaluating the normalizing constant (Eq. 9.3)
renders direct inference from the posterior all but impossible. Approximations
must be found.
9.2.2 Complete Data Tables
The population distribution of proficiency variables (the proficiency model)
and conditional probability of observable outcome variables given the proficiency variables (the evidence models and link models) in discrete Bayes nets
take the form of categorical distributions, the special case being Bernoulli
distributions when there are just two categories. Bayesian inference in these
models is particularly simple, since they admit posterior inference via conjugate prior distributions, and the forms of the priors, likelihoods, and posteriors have intuitive interpretations. This section describes Bayesian updating



for multinomial distributions, starting with the Bernoulli distribution
and generalizing to categorical distributions. An example of parameter estimation for a complete data version of a simple Bayes net follows. We will
start with a quick review of Bayesian inference for Bernoulli and categorical
distributions.
A Bernoulli random variable X has two possible outcomes, which we may
denote 0 and 1. Let π be the probability that X = 1, typically representing
a “success” or “occurrence” event, or with dichotomous test items a correct
response. It follows that 1 − π is the probability that X = 0, a “failure” or
“nonoccurrence,” or an incorrect item response. The probability for a single
observation is π^x (1 − π)^(1−x). The probability for r successes in n independent
replications is

p(r | n, π) ∝ π^r (1 − π)^(n−r).  (9.4)
Such counts of successes in a specified number n of independent Bernoulli
trials with a common parameter π are said to follow a binomial distribution.
Once n trials occur and r successes are observed, Eq. 9.4 is interpreted as a
likelihood function, which we denote as L(π | r, n). The maximum likelihood
estimate (MLE) of π is the value that maximizes the likelihood, or equivalently maximizes its log likelihood ℓ(π | r, n) = log[L(π | r, n)]. In the case
of Bernoulli trials, the MLE is π̂ = r/n, the sample proportion of successes,
a sufficient statistic for π. The squared standard error of the mean, or error
variance, is π(1 − π)/n.
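In code, the MLE and its plug-in error variance (with π̂ substituted for the unknown π) are one line each; a minimal sketch with an illustrative function name:

```python
def binomial_mle(r, n):
    """MLE pi_hat = r/n and the plug-in error variance
    pi_hat * (1 - pi_hat) / n, for r successes in n Bernoulli trials."""
    pi_hat = r / n
    return pi_hat, pi_hat * (1.0 - pi_hat) / n

# 7 successes in 10 trials:
pi_hat, err_var = binomial_mle(7, 10)
print(pi_hat)              # 0.7
print(round(err_var, 3))   # 0.021
```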
For Bayesian inference, the conjugate prior for the Bernoulli and binomial
distribution is the beta distribution (defined in Sect. 3.5.3). What that means
is this: If we can express our prior belief about π in terms of a beta law,
then the posterior for π that results from combining this prior through Bayes
theorem with a likelihood in the form of Eq. 9.4 is also a beta law. Bayesian
inference through conjugate distributions has the advantages of eliminating
concern about the normalizing constant (in the case of the beta distribution,
the normalizing constant is the beta function evaluated at the parameters that
can be looked up in a table), and relegating posterior analysis to features of
a well-known family of distributions (conjugate families).
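A sketch of the conjugate update just described (the helper names are ours): a Beta(a, b) prior combined with r successes in n binomial trials yields a Beta(a + r, b + n − r) posterior.

```python
def beta_binomial_update(a, b, r, n):
    """Conjugate update: Beta(a, b) prior plus binomial data (r successes
    in n trials) gives a Beta(a + r, b + n - r) posterior."""
    return a + r, b + (n - r)

def beta_mean(a, b):
    """Mean of a Beta(a, b) law."""
    return a / (a + b)

# A mildly informative Beta(2, 2) prior updated with 7 successes in 10 trials:
a_post, b_post = beta_binomial_update(2, 2, r=7, n=10)
print(a_post, b_post)                       # 9 5
print(round(beta_mean(a_post, b_post), 3))  # 0.643
```

The posterior mean 9/14 ≈ 0.643 falls between the prior mean 0.5 and the sample proportion 0.7, pulled toward the data as the number of trials grows.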
Leaving out the normalization constant, the p.d.f. for the beta distribution is

p(π | a, b) ∝ π^(a−1) (1 − π)^(b−1).  (9.5)
Comparing the form of the beta distribution in Eq. 9.5 with the form of the
binomial distribution in Eq. 9.4 suggests that the beta can be thought of as
what one learns about the unknown parameter π of a binomial distribution
from observing a − 1 successes and b − 1 failures or, in other words, from
a + b − 2 trials of which the proportions (a − 1)/(a + b − 2) are successes. This
interpretation is reinforced by the combination of a beta prior and a binomial