9.1 Data, Models, and Plate Notation
9 Learning in Models with Fixed Structure
directed graphical notation discussed in Chap. 4, with the major elaboration
being for structurally identical replications of models and variables. Plate
notation is introduced below with some simple examples that show the correspondence between plate diagrams and probability models. A probability
model and a plate diagram for a general educational measurement model are
then presented.
We then discuss inference about structures like these—parameter estimation, as statisticians typically call it, or “learning,” as it is more often called
in the artiﬁcial intelligence community. A natural starting point is estimating
the structural parameters of the model, namely the population proﬁciency distributions and conditional distributions for observable variables, when examinees’ proﬁciency variables are known. This so-called complete data problem
is a straightforward application of the Bayesian updating in the binomial and
multinomial models. It is also at the heart of the more realistic incomplete
data problem, where neither examinees’ proﬁciencies nor structural parameters are known. Two approaches are discussed: the EM algorithm (Dempster
et al. 1977) and MCMC estimation (Gilks et al. 1996). Many good references for
these techniques are available, so the focus here is on the key ideas and their
application in a simple problem. The special problem of bringing new tasks
into an established assessment, a recurring activity in ongoing assessment
systems, is also addressed.
9.1.1 Plate Notation
Plate notation (Buntine 1996) extends the notation for directed graphs we
developed in Chap. 4. As before, there are nodes that represent parameters and
variables, and directed edges that represent dependency relationships among
them. The new idea is an eﬃcient way to depict replications of variables and
structures by displaying a single representative on a “plate” that indicates
multiplicity.
As a first simple example, consider the outcomes Xj of four draws from a
Bernoulli distribution with known parameter θ. The joint probability distribution is ∏j p(xj | θ). The usual digraph is shown in the top panel of Fig. 9.1.
The bottom panel represents the same joint distribution using a plate for
the replicated observations. The node for θ lies outside the plate—it is not
replicated—and inﬂuences all observations in the same manner. That is, the
structure of the conditional probability distributions for all four Xj s given θ
is the same. The double edge on the θ node indicates that this parameter is
known, while the single edge on the Xj node indicates that their values are
not known.
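The replicated structure that the plate depicts can be sketched in a few lines of Python; the joint probability is just one Bernoulli factor per node on the plate. The value θ = 0.7 and the response pattern are illustrative, not from the text:

```python
# Joint probability of four Bernoulli draws with known theta:
# p(x | theta) = prod_j theta^x_j * (1 - theta)^(1 - x_j).
# The single loop mirrors the plate: one structure, replicated for each j.

def joint_bernoulli(xs, theta):
    p = 1.0
    for x in xs:  # one factor per replicated node on the plate
        p *= theta if x == 1 else (1.0 - theta)
    return p

print(joint_bernoulli([1, 0, 1, 1], 0.7))  # 0.7 * 0.3 * 0.7 * 0.7
```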
Plates can be nested. Consider a hierarchical model extending the previous
example such that four responses are observed for each of N students. For each
student i, the response variables Xij are Bernoulli distributed with probability
θi.¹ These success probabilities are not known, and prior belief about them is

¹ This is a variable, not a parameter, as it is student-specific.
Fig. 9.1 Expanded and plate digraphs for four Bernoulli variables
Upper ﬁgure shows the full graph, lower ﬁgure shows the same structure with the
plate notation. Reprinted with permission from ETS.
expressed through a beta distribution with known parameters α and β. The
joint probability distribution is now
∏i [ ∏j p(xij | θi) ] p(θi | α, β),
and the digraph using plate notation is as shown in Fig. 9.2. The replications of responses for a given examinee are conditionally independent and
all depend in the same way on the same Bernoulli probability θi, as implied
by ∏j p(xij | θi). Similarly, all the θi s depend in the same way on the same
higher level parameters α and β, as implied by ∏i p(θi | α, β).
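The hierarchical joint distribution above can be evaluated directly for small examples. The sketch below assumes illustrative values (N = 2 students, four responses each, a Beta(2, 2) prior); none of these numbers come from the text:

```python
import math

def beta_pdf(t, a, b):
    # Beta density, the shared prior p(theta_i | a, b)
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return t ** (a - 1) * (1 - t) ** (b - 1) / B

def joint_density(x, thetas, a, b):
    # prod_i [ prod_j p(x_ij | theta_i) ] * p(theta_i | a, b)
    dens = 1.0
    for i, theta in enumerate(thetas):
        for x_ij in x[i]:                     # inner (response) plate
            dens *= theta if x_ij == 1 else (1 - theta)
        dens *= beta_pdf(theta, a, b)         # outer (student) plate
    return dens

x = [[1, 1, 0, 1], [0, 0, 1, 0]]   # N = 2 students, 4 responses each
print(joint_density(x, [0.8, 0.3], a=2.0, b=2.0))
```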
A ﬁnal introductory example is the Rasch model for dichotomous items,
shown in Fig. 9.3. At the center of the digraph, where plates for the proﬁciency
variables θi for students and diﬃculty parameters βj for items overlap, is the
probability pij of a correct answer by student i to item j:
pij = P(Xij = 1 | θi, βj) = Ψ(θi − βj) ≡ exp(θi − βj) / [1 + exp(θi − βj)],
where Ψ (·) denotes the cumulative logistic distribution. This probability is
known if θi and βj are known, through the functional form of the Rasch
Fig. 9.2 Plate digraph for hierarchical Beta-Bernoulli model
Reprinted with permission from ETS.
model; this functional relationship rather than the stochastic relationship is
indicated in the digraph by a double edge on the node. That is, the double edge
indicates that the value of a variable is known or that it is known conditional
on the values of its parents. Logical functions such as an AND gate have this
property too, but not a noisy-AND because its outcome is probabilistic.
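The Rasch response probability is a deterministic function of θi and βj, which is exactly what the double edge on the pij node conveys. A minimal Python version of Ψ(θ − β):

```python
import math

def rasch_prob(theta, beta):
    # p_ij = Psi(theta_i - beta_j) = exp(d) / (1 + exp(d)), with d = theta - beta
    d = theta - beta
    return math.exp(d) / (1.0 + math.exp(d))

# When proficiency equals difficulty, the probability is exactly 0.5
print(rasch_prob(0.0, 0.0))   # 0.5
```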
Examinee proficiency parameters are posited to follow a normal distribution with unknown mean μθ and variance σθ². Higher level
distributions for the examinee proficiency distribution are μθ ∼ N(μw, σw²)
and σθ² ∼ Gamma(aθ, bθ), with the parameters of the higher level distributions known. Item parameters are also posited to follow a normal distribution,
with a mean fixed at zero to set the scale and an unknown variance σβ², and
σβ² ∼ Gamma(aβ, bβ).
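One way to read the hierarchical Rasch digraph is generatively: draw the higher level parameters, then proficiencies and difficulties, then responses. The sketch below follows that order using Python's standard library; all numeric hyperparameter values are assumptions for illustration, and the Gamma draws use the stdlib shape/scale parameterization (the text does not specify one):

```python
import math
import random

random.seed(0)

def psi(d):
    # Cumulative logistic: exp(d) / (1 + exp(d))
    return math.exp(d) / (1.0 + math.exp(d))

# Known top-level parameters (illustrative values, not from the text)
mu_w, sigma_w = 0.0, 1.0        # prior on mu_theta
a_theta, b_theta = 2.0, 2.0     # Gamma prior on sigma_theta^2 (shape, scale)
a_beta, b_beta = 2.0, 2.0       # Gamma prior on sigma_beta^2 (shape, scale)

# Draw higher level parameters
mu_theta = random.gauss(mu_w, sigma_w)
var_theta = random.gammavariate(a_theta, b_theta)
var_beta = random.gammavariate(a_beta, b_beta)

# Draw proficiencies and difficulties, then responses
N, J = 5, 4
thetas = [random.gauss(mu_theta, math.sqrt(var_theta)) for _ in range(N)]
betas = [random.gauss(0.0, math.sqrt(var_beta)) for _ in range(J)]  # mean fixed at 0
x = [[1 if random.random() < psi(t - b) else 0 for b in betas] for t in thetas]
print(x)
```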
9.1.2 A Bayesian Framework for a Generic Measurement Model
In educational and psychological measurement models, observable variables
are outcomes of the confrontation between a person and a situation, or
more speciﬁcally, a task. In particular, observable variables X are modeled
as independent given unobservable, or latent, person variables² θ and task

² In the context of maximum likelihood estimation, these are called person parameters because they must be estimated, but this book follows the convention
of calling person-specific values variables rather than parameters.

Fig. 9.3 Plate digraph for hierarchical Rasch model
Reprinted with permission from ETS.
parameters β. In the Rasch model, for example, an examinee is more or less
proficient by the same amount with respect to all items, and an item is more
or less difficult by the same amount for all examinees. (Section 9.1.3 gives extensions where this is
no longer technically true, including diﬀerential item functioning (DIF) and
mixture models.)
The discrete Bayes nets that are useful in cognitive diagnosis exhibit this
character. It is useful, however, to cast estimation in these models in a more
general framework in order to connect it with the broader psychometric and
statistical literature. The same characterization applies to the models of item
response theory (IRT), latent class analysis, factor analysis, latent proﬁle
analysis, and, at the level of parallel tests, parametric classical test theory.
Examples of familiar models that often have their own history, notation, and
terminology are shown in Table 9.1.³ All of these models differ only in the
nature of the observable variables and student-model variables—discrete vs.
continuous, for example, or univariate vs. multivariate—and the form of the
link model, or probability model for observables given person variables and
task parameters. We will write p xj | θ, βj , and include a subscript i as in
p xij | θi , β j if we need to focus on the responses of a particular examinee i.
In discrete Bayesian models in cognitive diagnosis, p xj | θ, β j takes the
form of a collection of categorical distributions: For a given value of the proﬁciency variable(s) θ, there is a categorical probability distribution over the
possible values of xj , the value of the observable variable. The (vector valued)
³ The table is not exhaustive; for example, classical test theory is readily applied
to vectors, factor analysis can be carried out without assuming normality, and
there are mixture models that include both discrete classes and classes with probabilities structured by IRT models (e.g., Yamamoto 1987).
parameter βj comprises the probabilities for each response category for each
given value of θ.
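In code, such a βj is simply a table: one categorical distribution per value of θ. The sketch below uses a hypothetical two-state proficiency and three response categories (the state names and probabilities are invented for illustration):

```python
# beta_j as a table of categorical distributions: one row per value of theta.
# Hypothetical two-state proficiency and three observable categories.
beta_j = {
    "novice": [0.6, 0.3, 0.1],   # p(x_j = 0, 1, 2 | theta = novice)
    "expert": [0.1, 0.3, 0.6],   # p(x_j = 0, 1, 2 | theta = expert)
}

def link_prob(x, theta, cpt):
    # p(x_j | theta, beta_j): look up the categorical distribution for theta
    return cpt[theta][x]

print(link_prob(2, "expert", beta_j))   # 0.6
```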
A full Bayesian model also requires distributions (or laws) for θs and βs
(as they are unknown). Treating examinees as exchangeable means that before
seeing responses, we have no information about which examinees might be
more proﬁcient than others, or whether there are groups of examinees more
similar to one another than to those in other groups. It is appropriate in
this case to model θs as independent and identically distributed, possibly
conditional on the parameter(s) λ of a higher level distribution or law. That
is, for all examinees i,
θi ∼ p (θ | λ) .
In the normal-prior Rasch model example in the previous section, λ = (μθ, σθ²)
(the mean and variance of the normal law). If θ is a categorical variable, λ
will be category probabilities or a smaller number of parameters that imply
category probabilities in a more parsimonious way (e.g., the models discussed
in Chap. 8).
Similarly, if we have no prior beliefs to distinguish among item diﬃculties,
we can model βs as exchangeable given the parameter(s) of their law:
β j ∼ p (β | ξ) .
In the Rasch example, ξ = σβ². When βs are probabilities in categorical distributions, as they are in discrete Bayes nets, then either p(β | ξ) is a beta or
Dirichlet prior on the probabilities or a function of a smaller number of variables
that imply the conditional probabilities. The DiBello–Samejima distributions
in the previous chapter are an example of the latter.
Both λ and ξ may be speciﬁed as known, or as unknown with higher level
distributions, or laws, p(λ) and p(ξ).
The full Bayesian probability model for educational measurement is thus

p(X, θ, β, λ, ξ) = ∏i ∏j p(xij | θi, βj) p(θi | λ) p(βj | ξ) p(λ) p(ξ).    (9.1)
The corresponding graph, using plate notation, is shown in Fig. 9.4.
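For a small discrete model with known λ and β, the likelihood core of Eq. 9.1 can be evaluated by direct multiplication over the two plates. All numbers below (two proficiency classes, two dichotomous observables) are hypothetical:

```python
# Likelihood core of Eq. 9.1 for a discrete model with known beta and lambda:
# prod_i p(theta_i | lam) * prod_j p(x_ij | theta_i, beta_j).

lam = {"low": 0.4, "high": 0.6}        # p(theta | lambda)
beta = [                               # p(X_j = 1 | theta) for each task j
    {"low": 0.2, "high": 0.8},
    {"low": 0.3, "high": 0.9},
]

def joint(x, thetas):
    p = 1.0
    for i, th in enumerate(thetas):    # student plate
        p *= lam[th]
        for j, cpt in enumerate(beta): # observable plate
            p1 = cpt[th]
            p *= p1 if x[i][j] == 1 else (1 - p1)
    return p

print(joint([[1, 1], [0, 1]], ["high", "low"]))
```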
9.1.3 Extension to Covariates
We brieﬂy note three common extensions of the basic measurement model that
involve covariates, or collateral information, about examinees, denoted Zi , or
about tasks, denoted Yj . (Note that the entries in a Q-matrix are collateral
information about tasks.) The ﬁrst is the extension to conditional exchangeability about examinee or task parameters. The second concerns interactions
between examinee covariates and response probabilities, or DIF. The third
concerns interactions as well, but when examinee covariates are not fully
known. These are mixture models.
Table 9.1 Special cases of the generic measurement model
(Columns: Model; Student variables (θ); Observables (X); Link function; Task parameters (β))

Parametric classical test theory; Unidimensional continuous (true score); Continuous observed scores; Normal; Error variance
Dichotomous IRT; Unidimensional continuous; Dichotomous responses; Bernoulli; Item difficulty, discrimination, etc.
Graded response IRT; Unidimensional continuous; Ordered categorical responses; Categorical; Item step difficulties
Multidimensional dichotomous IRT; Multivariate continuous; Dichotomous responses; Bernoulli; Item difficulty, discrimination, etc.
Latent class; Discrete (class memberships); Categorical responses (including dichotomous); Categorical (including Bernoulli); Conditional probabilities
Cognitive diagnosis; Discrete (attribute masteries); Categorical responses (including dichotomous); Categorical (including Bernoulli); Conditional probabilities (and parameters in functional forms)
Factor analysis; Multivariate continuous; Continuous scores; Normal; Factor loadings
Latent profile analysis; Discrete (class memberships); Continuous scores; Multivariate normal; Means and covariances for each class
Dichotomous IRT mixture; Unidimensional continuous (proficiencies) and discrete (class memberships); Dichotomous responses; Bernoulli; Item parameters for each class
(Discrete) Bayes nets; Discrete (proficiencies and class memberships); Categorical responses; Categorical (including Bernoulli); Conditional probabilities (and parameters in functional forms)
Fig. 9.4 Graph for generic measurement model
Reprinted with permission from ETS.
Z refers to known information about examinees such as demographic group
or instructional background. When this information aﬀects our beliefs about
examinees’ proﬁciencies, examinees are no longer exchangeable. Eighth grade
students are more likely to know how to subtract fractions than fourth grade
students, for example. We may however posit conditional exchangeability,
or exchangeability among examinees with the same covariates, by making
the distributions on θ conditional on covariates, namely p(θ|z, λ). Through
p(θ|z, λ), we model an expectation that examinees with the same value of Z
are more similar to one another than those with other values of Z, or that
eighth graders generally have higher proﬁciencies than fourth graders.
Similarly, we can incorporate covariates for tasks, through p(β|y, ξ). A task
model variable Y can indicate the particular task model used to generate a
task (or characteristics of the task deﬁned by the task model variable), so
we can use information about the parameters of similar tasks to estimate the
parameters of a new task from the same family, or allow for the fact that
problems with more steps are usually harder than problems with fewer steps
(Mislevy et al. 1993; Glas and van der Linden 2003).
Figure 9.5 depicts the extension to covariates that inﬂuence belief about
item or student parameters, but do not aﬀect responses directly. That is,
responses remain dependent on item parameters and student variables only.
If we knew an examinee’s θ and a task’s β, learning Z or Y would not change
our expectations about potential responses.
DIF occurs when learning Z or Y would change our expectations about
responses, beyond parameters for the student and task (Holland and Wainer
1993). One example occurred in a reading assessment that surveyed grades 8
and 12: A task that required ﬁlling in an employment application was relatively easier for 12th grade students than for 8th grade students. Presumably
Fig. 9.5 Graph with covariates and no DIF
Reprinted with permission from ETS.

many 12th grade students had personal experience with employment applications, a factor that would make the item easier for such a student than one
who had not, but was otherwise generally similar in reading proficiency. In
this case, p(xj | θ, βj, Grade = 8) ≠ p(xj | θ, βj, Grade = 12). In such cases,
task parameters would differ for at least some tasks, and be subscripted with
respect to covariates as βj(z). Figure 9.6 shows how DIF appears in a graph.
DIF concerns interactions between task response probabilities and examinees’ background variables Z, when Z is known. Mixture models posit that
such interactions may exist, but examinees’ background variables are not
observed (e.g., Rost 1990). An example would be mixed number subtraction
items that vary in their diﬃculty depending on whether a student solves them
by breaking them into whole number and fraction subproblems or by converting everything to mixed numbers and then subtracting (Tatsuoka 1983).
Figure 9.7 depicts the digraph for a generic mixture model. It diﬀers from the
DIF model only in that Z is marked as unobserved rather than observed. Note
that this requires specifying a prior (population) distribution for Z. Now the
higher level parameter λ for student variables has two components: λθ for the
law for θ and λZ representing the probability parameter in, say, a Bernoulli
law for Z if there are two classes, or a vector of probabilities for a categorical
law if there are more than two classes.
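When the class variable Z is latent, the response probability marginalizes over it: λZ weights the class-specific link probabilities. A minimal two-class sketch, with purely hypothetical class labels and probabilities:

```python
# Marginal response probability in a two-class mixture: the latent class Z
# is summed out with weights lambda_Z. All numbers are hypothetical.

lam_Z = {"method_A": 0.5, "method_B": 0.5}       # p(Z), Bernoulli law over classes
p_correct = {"method_A": 0.9, "method_B": 0.4}   # p(X = 1 | Z) for one item

def marginal_correct():
    # p(X = 1) = sum_z p(Z = z) * p(X = 1 | Z = z)
    return sum(lam_Z[z] * p_correct[z] for z in lam_Z)

print(marginal_correct())   # 0.5*0.9 + 0.5*0.4 = 0.65
```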
9.2 Techniques for Learning with Fixed Structure
How, then, does one learn the parameters β of conditional probability distributions and λ of examinee proﬁciency distributions? This section addresses
inference in terms of the full Bayesian probability model of Eq. 9.1, the basics
of which are outlined below in Sect. 9.2.1. Section 9.2.2 discusses the simpler problem in which students’ θs are observed as well as their xs—the
“complete data” problem, in the terminology introduced in Dempster et al.
Fig. 9.6 Graph with DIF
Reprinted with permission from ETS.
(1977)’s description of their EM algorithm. Section 9.3 relates the complete
data problem to the “incomplete data” problem we face in Eq. 9.1. This chapter introduces two common approaches to solve it, namely the EM algorithm
(Sect. 9.4) and MCMC (Sect. 9.5). Although both algorithms are general
enough to work with any model covered by Eq. 9.1, the focus of this chapter
is on discrete Bayes nets measurement models.
9.2.1 Bayesian Inference for the General Measurement Model
Fig. 9.7 Graph for mixture model
Reprinted with permission from ETS.
The full Bayesian model (Eq. 9.1) contains observable variables X, proﬁciency variables (sometimes called person parameters) θ, task parameters
β, and higher level parameters λ and ξ. Equation 9.1 represents knowledge
about the interrelationships among all the variables in the generic measurement model without covariates, before any observations are made. The only
observations that can be made are those for observable variables. We observe
realized values of X, say x*. In Bayesian inference, the basis for inference about
person variables and task parameters and their distributions is the distribution
of these variables conditional on x*.
By Bayes theorem, this posterior is easy enough to express. Aside from a normalizing constant K, it has exactly the same form as Eq. 9.1, just with the
expressions involving X evaluated at their observed values:
p(θ, β, λ, ξ | x*) = K ∏i ∏j p(x*ij | θi, βj) p(θi | λ) p(βj | ξ) p(λ) p(ξ),    (9.2)

where

K⁻¹ = p(x*) = ∫ ∏i ∏j p(x*ij | θi, βj) p(θi | λ) p(βj | ξ) p(λ) p(ξ) ∂θ ∂β ∂λ ∂ξ.    (9.3)
Bayesian inference proceeds by determining important features of the posterior distribution, such as means, modes, and variances of variables, picturing
their marginal distributions, calculating credibility intervals, and displaying
plots of marginal or joint distributions. If the prior laws are chosen to be
conjugate to the likelihood, then simple procedures of posterior inference are
available for independent and identically distributed samples, as illustrated in
the following section for the complete data problem.
But Eq. 9.2 in its entirety does not always have a natural conjugate, even
when all of the factors taken by themselves do (especially if some of the
variables are missing; see York 1992). So, while the form of Eq. 9.2 is simple
enough to write, the diﬃculty of evaluating the normalizing constant (Eq. 9.3)
renders direct inference from the posterior all but impossible. Approximations
must be found.
9.2.2 Complete Data Tables
The population distribution of proﬁciency variables (the proﬁciency model)
and conditional probability of observable outcome variables given the proﬁciency variables (the evidence models and link models) in discrete Bayes nets
take the form of categorical distributions, the special case being Bernoulli
distributions when there are just two categories. Bayesian inference in these
models is particularly simple, since they admit posterior inference via conjugate prior distributions, and the forms of the priors, likelihoods, and posteriors have intuitive interpretations. This section describes Bayesian updating
for the multinomial distributions, starting with the Bernoulli distributions
and generalizing to categorical distributions. An example of parameter estimation for a complete data version of a simple Bayes net follows. We will
start with a quick review of Bayesian inference for Bernoulli and categorical
distributions.
A Bernoulli random variable X has two possible outcomes, which we may
denote 0 and 1. Let π be the probability that X = 1, typically representing
a “success” or “occurrence” event, or with dichotomous test items a correct
response. It follows that 1 − π is the probability that X = 0, a “failure” or
“nonoccurrence,” or an incorrect item response. The probability for a single
observation is π^x (1 − π)^(1−x). The probability for r successes in n independent
replications is

p(r | n, π) ∝ π^r (1 − π)^(n−r).    (9.4)
Such counts of successes in a speciﬁed number n of independent Bernoulli
trials with a common parameter π are said to follow a binomial distribution.
Once n trials occur and r successes are observed, Eq. 9.4 is interpreted as a
likelihood function, which we denote as L (π | r, n). The maximum likelihood
estimate (MLE) of π is the value that maximizes the likelihood, or equivalently maximizes its log likelihood ℓ(π | r, n) = log[L(π | r, n)]. In the case
of Bernoulli trials, the MLE is π̂ = r/n, the sample proportion of successes,
a sufficient statistic for π. The squared standard error of the mean, or error
variance, is π(1 − π)/n.
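The MLE and its error variance take one line each; the counts r = 30 and n = 40 below are illustrative:

```python
def binomial_mle(r, n):
    # MLE pi_hat = r / n, with error variance pi_hat * (1 - pi_hat) / n
    pi_hat = r / n
    return pi_hat, pi_hat * (1 - pi_hat) / n

pi_hat, err_var = binomial_mle(30, 40)
print(pi_hat, err_var)   # 0.75 and 0.75 * 0.25 / 40
```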
For Bayesian inference, the conjugate prior for the Bernoulli and binomial
distribution is the beta distribution (deﬁned in Sect. 3.5.3). What that means
is this: If we can express our prior belief about π in terms of a beta law,
then the posterior for π that results from combining this prior through Bayes
theorem with a likelihood in the form of Eq. 9.4 is also a beta law. Bayesian
inference through conjugate distributions has the advantages of eliminating
concern about the normalizing constant (in the case of the beta distribution,
the normalizing constant is the beta function evaluated at the parameters that
can be looked up in a table), and relegating posterior analysis to features of
a well-known family of distributions (conjugate families).
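The conjugate update itself is just addition of counts: a Beta(a, b) prior combined with r successes in n trials yields a Beta(a + r, b + n − r) posterior. A sketch with illustrative numbers:

```python
def beta_binomial_update(a, b, r, n):
    # Beta(a, b) prior + r successes in n Bernoulli trials
    # -> Beta(a + r, b + n - r) posterior (conjugate update)
    return a + r, b + n - r

a_post, b_post = beta_binomial_update(2, 2, 7, 10)
post_mean = a_post / (a_post + b_post)   # posterior mean of pi
print(a_post, b_post, post_mean)         # 9, 5, 9/14
```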
Leaving out the normalization constant, the p.d.f. for the beta distribution is:

p(π | a, b) ∝ π^(a−1) (1 − π)^(b−1).    (9.5)
Comparing the form of the beta distribution in Eq. 9.5 with the form of the
binomial distribution in Eq. 9.4 suggests that the beta can be thought of as
what one learns about the unknown parameter π of a binomial distribution
from observing a − 1 successes and b − 1 failures or, in other words, from
a + b − 2 trials of which the proportions (a − 1)/(a + b − 2) are successes. This
interpretation is reinforced by the combination of a beta prior and a binomial