8.5 DiBello's Effective Theta Distributions
Fig. 8.9 Separable influences. If the conditional probability table for P(T | S1, ..., SK) has the separable influence property, then Graph (a) becomes Graph (b). Reprinted with permission from ETS.
each conﬁguration of skill variables into an eﬀective theta—a real number
representing the student’s propensity to be able to perform tasks of that
type. The assumption was that even though the test is multidimensional, the
proﬁciency combination associated with scoring well on any given observable
within a task represents a single direction within that multidimensional space.
Once we represent our beliefs about the student’s state as a real number θ
(a distance along that direction), we can press familiar IRT models into service to determine the probabilities for the distribution tables which drive the
Bayes nets. This reduces the number of parameters, and furthermore relates
the parameters to concepts like diﬃculty and discrimination that are already
familiar to psychometricians. The approach is akin to that of structured or
located latent class models (Almond et al. 2001; Formann 1985; von Davier
2008).
The DiBello eﬀective theta method proceeds in three steps:
1. For each input variable, designate a real number to serve as an eﬀective
theta for the contribution of that skill (Sect. 8.5.1).
2. The inputs each represent a separate dimension. Use a combination function to combine them into an eﬀective theta for the task (Sect. 8.5.2).
There are a number of combination functions that can be used, to model
for example compensatory, conjunctive, and inhibitor relationships.
3. Apply a link function to calculate probabilities for the dependent variable from the combined effective theta. DiBello proposed using Samejima's
graded response model as the link function, i.e., the DiBello–Samejima
model (Sect. 8.5.3).³ For representing relationships among proficiency
variables, for example, we recommend a different method, based on cut
points of the normal distribution (Sect. 8.5.4).
A key feature of this class of models is the combination function (Step 2). The choice of this function dictates the type of relationship (e.g., sum for compensatory, min for conjunctive). In typical modeling situations, the experts provide not only which variables are parents of a given variable but also what the type of relationship is. They also indicate the relative importance of each variable in the combination. Section 8.6 describes the technique we use to elicit the information from the experts. Note that the generalized diagnostic model (von Davier 2008) also uses combination and link functions, although it does not use the name "combination function."

³ Other models could be used to parameterize link functions at this point, such as the generalized partial credit model (Muraki 1992) and the model for nominal response categories (Bock 1972).
8.5.1 Mapping Parent Skills to θ Space
For the moment, we assume that each parent variable lies in its own separate
dimension. Each category for a parent variable represents an interval along this
dimension. It is common in IRT to assume that proﬁciency follows a standard
normal distribution. We will build eﬀective theta values from standard normal
distributions.
Suppose that one of the parent variables in the relationship is Skill 1.
Suppose further that it has three levels: Low, Medium, and High. Positing for
the moment that the skill labels reﬂect an equal-probabilities partitioning of
the distribution, we can break the area under the normal curve into three
segments (Fig. 8.10).
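Fig. 8.10 Midpoints of intervals on the normal distribution. The horizontal axis is the effective theta; m0, m1, and m2 mark the midpoints of the three intervals, and c1 and c2 mark the cut points between them. Reprinted from Yan et al. (2004) with permission from ETS.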
This is fairly easy to do in practice. Assume that the parent variable has $K$ ordered levels, indexed $0, \ldots, K-1$. Let $m_k$ be the effective theta value associated with Level $k$. Let $p_k = (2k + 1)/(2K)$ be the cumulative probability at the midpoint of the interval for Level $k$. Then $m_k = \Phi^{-1}(p_k)$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function. For two levels, for example, the values are −0.67 and +0.67. For three levels, the values are −0.97, 0, and +0.97.
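The midpoint rule above can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` for the inverse normal CDF; the function name `effective_thetas` is a hypothetical label chosen for illustration, not a name used by the authors.

```python
from statistics import NormalDist


def effective_thetas(num_levels: int) -> list[float]:
    """Return m_k = Phi^{-1}((2k + 1) / (2K)) for k = 0, ..., K - 1.

    `num_levels` plays the role of K; the result is ordered from the
    lowest level to the highest.
    """
    if num_levels < 1:
        raise ValueError("num_levels must be a positive integer")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        std_normal.inv_cdf((2 * k + 1) / (2 * num_levels))
        for k in range(num_levels)
    ]


if __name__ == "__main__":
    for levels in (2, 3):
        print(levels, [round(m, 2) for m in effective_thetas(levels)])
```

For two and three levels, the rounded magnitudes agree with the 0.67 and 0.97 quoted in the text.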
When an observable has just one proﬁciency parent, these are the eﬀective
theta values we use to model conditional probabilities for it. Section 8.5.3
shows how this is done. First, we discuss how to combine eﬀective thetas
across proﬁciencies when there are multiple parents.
8.5.2 Combining Input Skills
Each mapping from a parent variable to its eﬀective theta value is done independently. That is, each knowledge, skill, or ability represented by a parent variable is assumed to have a separate dimension. The task observation
(or dependent proﬁciency variable) also has its own dimension. As in regression, the individual parents’ eﬀective thetas are independent variables and the
response variable is a dependent variable. The idea is to make a projection
from the space of the parent dimensions to a point on the child dimension.
This projection is the eﬀective theta, synthesized accordingly across parents,
as it applies to this particular child. The value will then be used in the Samejima graded response IRT model to produce probabilities for the response
categories of the child.
The easiest way to do this is to use a combination function, g(·), that produces a single theta value from a list of others, and the easiest combination
function to understand, and therefore the one we will start with, is the sum.
When eﬀective thetas are summed, having more of one parent skill can compensate for having less of another. For this reason, we call the distribution we
build using this combination rule the compensatory distribution. If θ1 , . . . , θK
are the eﬀective thetas for the parent dimensions, then the eﬀective theta for
the child dimension is
$$\tilde{\theta} = g(\theta_1, \ldots, \theta_K) = \sum_{k=1}^{K} \frac{\alpha_k}{\sqrt{K}}\,\theta_k - \beta . \tag{8.8}$$
This model has one slope parameter, $\alpha_k$, for each parent variable and one intercept parameter, $\beta$. Following IRT terminology, we sometimes call these the discrimination and difficulty parameters; in fact, $\beta$ is given a negative sign so that it will be interpreted as a difficulty (higher values of $\beta$ mean that the problem is "harder": the probability of getting it right is lower). The factor $1/\sqrt{K}$ is a variance stabilization term. If we assume that the variance of each of the $\theta_k$'s is 1 (unit normal assumption) and that the $\theta_k$'s are uncorrelated, then the variance of $\tilde{\theta}$ will be $\sum_{k=1}^{K} \alpha_k^2 / K$. In other words, the variance of the effective theta will not grow with the number of parent variables. Table 8.1 gives a simple example for two choices of the $\alpha$s and $\beta$s.
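As an illustration of Eq. 8.8, the sketch below evaluates the compensatory combination for two three-level parents under the two parameter choices used in Table 8.1. It assumes the unrounded midpoint quantiles of Sect. 8.5.1 as the parent effective thetas; the names `compensatory` and `LEVEL_THETA` are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist

_z = NormalDist()
# Unrounded effective thetas for a three-level skill (Sect. 8.5.1).
LEVEL_THETA = {
    "High": _z.inv_cdf(5 / 6),
    "Medium": _z.inv_cdf(3 / 6),
    "Low": _z.inv_cdf(1 / 6),
}


def compensatory(thetas, alphas, beta):
    """Eq. 8.8: sum_k (alpha_k / sqrt(K)) * theta_k - beta."""
    if len(thetas) != len(alphas):
        raise ValueError("need exactly one slope per parent")
    num_parents = len(thetas)
    weighted = sum(a * t for a, t in zip(alphas, thetas))
    return weighted / sqrt(num_parents) - beta


if __name__ == "__main__":
    for skill1, skill2 in product(LEVEL_THETA, repeat=2):
        pair = (LEVEL_THETA[skill1], LEVEL_THETA[skill2])
        equal = compensatory(pair, alphas=(1.0, 1.0), beta=1.0)
        unequal = compensatory(pair, alphas=(1.0, 0.5), beta=0.0)
        print(f"{skill1:6s} {skill2:6s} {equal:+.2f} {unequal:+.2f}")
```

Working from the unrounded quantiles rather than from ±0.97 matters at the second decimal place, which is why the dictionary is built from `inv_cdf` instead of hard-coded values.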
Equation 8.8 does not include a subscript to indicate which probability
table we are deﬁning. Technically, there should be another subscript on all the
parameters in that equation to indicate the child variable. We have suppressed
that subscript to simplify the exposition. However, it is worth noting that
we are assigning a separate set of slope and diﬃculty parameters to each
observable outcome variable. Sometimes the values of the slope parameters are
constrained to be the same across multiple observable variables. In particular, the Rasch IRT model uses a single slope parameter for all items. A constrained model is possible, but if the slope parameter is estimated rather than fixed, the constrained model does not have the global parameter independence property.

Table 8.1 Effective thetas for a compensatory combination function

                                        Effective thetas
Skill 1   θ1      Skill 2   θ2      α1 = α2 = β1 = 1   α1 = 1, α2 = 0.5, β1 = 0
High      +0.97   High      +0.97   +0.37              +1.03
High      +0.97   Medium     0.00   −0.32              +0.68
High      +0.97   Low       −0.97   −1.00              +0.34
Medium     0.00   High      +0.97   −0.32              +0.34
Medium     0.00   Medium     0.00   −1.00               0.00
Medium     0.00   Low       −0.97   −1.68              −0.34
Low       −0.97   High      +0.97   −1.00              −0.34
Low       −0.97   Medium     0.00   −1.68              −0.68
Low       −0.97   Low       −0.97   −2.37              −1.03
The compensatory distribution produces a considerable savings in the
number of parameters we must elicit from the experts or learn from data.
The total number of parameters is $K + 1$, where $K$ is the number of parent variables. Contrast this with the hyper-Dirichlet model, which requires $2^K$ parameters when all variables are binary, and more if any variable is not binary.
We can change the combination function to model other relationships between the parent and child variables. The most common is to replace the sum with a minimization or maximization. Thus, the combination function
$$\tilde{\theta} = g(\theta_1, \ldots, \theta_K) = \min_{k=1}^{K} \alpha_k \theta_k - \beta , \tag{8.9}$$
produces a conjunctive distribution where all the skills are necessary to solve
the problem. Equation 8.9 posits that the examinee will behave with the
weakest of the skills as the eﬀective theta level. Similarly, the combination
function
$$\tilde{\theta} = g(\theta_1, \ldots, \theta_K) = \max_{k=1}^{K} \alpha_k \theta_k - \beta ,$$
produces a disjunctive distribution where each parent variable represents an alternative solution path. The examinee will behave with the strongest of the skills as the effective theta level. Table 8.2 gives examples for the conjunctive and disjunctive combination functions; for simplicity, $\alpha_1 = \alpha_2 = 1$ and $\beta = 0$.
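The conjunctive and disjunctive rules differ from the compensatory rule only in the aggregation step. A minimal sketch follows, assuming the three-level effective thetas of Sect. 8.5.1; the function names and the `LEVEL_THETA` dictionary are hypothetical, introduced only for this example.

```python
from itertools import product
from statistics import NormalDist

_z = NormalDist()
# Effective thetas for a three-level skill (midpoint quantiles, Sect. 8.5.1).
LEVEL_THETA = {
    "High": _z.inv_cdf(5 / 6),
    "Medium": _z.inv_cdf(3 / 6),
    "Low": _z.inv_cdf(1 / 6),
}


def conjunctive(thetas, alphas, beta=0.0):
    """Eq. 8.9: the weakest scaled skill determines the effective theta."""
    return min(a * t for a, t in zip(alphas, thetas)) - beta


def disjunctive(thetas, alphas, beta=0.0):
    """Max counterpart of Eq. 8.9: the strongest scaled skill determines it."""
    return max(a * t for a, t in zip(alphas, thetas)) - beta


if __name__ == "__main__":
    slopes = (1.0, 1.0)  # alpha_1 = alpha_2 = 1, beta = 0
    for skill1, skill2 in product(LEVEL_THETA, repeat=2):
        pair = (LEVEL_THETA[skill1], LEVEL_THETA[skill2])
        print(
            f"{skill1:6s} {skill2:6s} "
            f"{conjunctive(pair, slopes):+.2f} {disjunctive(pair, slopes):+.2f}"
        )
```

Because a single $\beta$ is subtracted, it makes no difference whether it is applied inside or outside the `min` or `max`.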
So far, all of the combination functions have been symmetric (or, more
properly, the asymmetry has been modeled by diﬀerent values for the slope
parameters). One interesting type of asymmetric combination function is the
inhibitor distribution (or more optimistically, the enabler distribution). Here,
we assume that a minimal level of the ﬁrst skill is required as a prerequisite,
but once the examinee has reached the minimal level in the first skill, the second skill takes over.

Table 8.2 Effective thetas for the conjunctive and disjunctive combination functions

                                        Effective thetas
Skill 1   θ1      Skill 2   θ2      Conjunctive   Disjunctive
High      +0.97   High      +0.97   +0.97         +0.97
High      +0.97   Medium     0.00    0.00         +0.97
High      +0.97   Low       −0.97   −0.97         +0.97
Medium     0.00   High      +0.97    0.00         +0.97
Medium     0.00   Medium     0.00    0.00          0.00
Medium     0.00   Low       −0.97   −0.97          0.00
Low       −0.97   High      +0.97   −0.97         +0.97
Low       −0.97   Medium     0.00   −0.97          0.00
Low       −0.97   Low       −0.97   −0.97         −0.97
Example 8.3 (Story Problem). A typical example is a “story problem”
in a math assessment. Usually the skills we wish to assess are related to
mathematics, i.e., the ability to translate the verbal representation of the
problem into the symbolic one, then solve the symbolic problem. However, in
order to have any hope of solving the problem the examinee needs a minimal
competency in the language of the test. This is an inhibitor relationship.
Assume that only two skills are necessary, and that the ﬁrst one is the
inhibitor with the prerequisite that this skill is at least at level r. Also, let
θkm be the eﬀective theta value associated with the mth level of Skill k. Then
we can write the inhibitor relationship as follows:
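$$\tilde{\theta} = g(\theta_1, \theta_2) = \begin{cases} \alpha_2 \theta_2 - \beta & \text{for } \theta_1 \ge \theta_{1r}, \\ \alpha_2 \theta_{2,0} - \beta & \text{otherwise.} \end{cases} \tag{8.10}$$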
The numerical example in Table 8.3 shows two skills, both with three levels, and $\theta_1$ acting as an inhibitor: if Skill 1 is below Medium, the combined effective theta is set to that of the lowest level of $\theta_2$. If Skill 1 is Medium or higher, the combined effective theta is that of $\theta_2$. For simplicity, $\alpha_2 = 1$ and $\beta = 0$. Contrast this with the conjunctive combination function in Table 8.2.
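The inhibitor rule of Eq. 8.10 can be sketched in a few lines. The code below is an illustration under stated assumptions: levels are compared by their position in an ordered tuple (equivalent to comparing $\theta_1$ with $\theta_{1r}$, since effective thetas increase with level), and the names `inhibitor`, `LEVELS`, and the `required` argument are hypothetical.

```python
from statistics import NormalDist

LEVELS = ("Low", "Medium", "High")  # ordered from lowest to highest
_z = NormalDist()
# THETA[m] is the effective theta of level index m (Sect. 8.5.1).
THETA = [
    _z.inv_cdf((2 * m + 1) / (2 * len(LEVELS))) for m in range(len(LEVELS))
]


def inhibitor(level1, level2, required="Medium", alpha2=1.0, beta=0.0):
    """Eq. 8.10 with Skill 1 as the inhibitor.

    If Skill 1 is at or above `required`, Skill 2 alone sets the
    effective theta; otherwise the lowest level of Skill 2 is used.
    """
    if LEVELS.index(level1) >= LEVELS.index(required):
        theta2 = THETA[LEVELS.index(level2)]
    else:
        theta2 = THETA[0]
    return alpha2 * theta2 - beta


if __name__ == "__main__":
    for level1 in reversed(LEVELS):
        for level2 in reversed(LEVELS):
            print(f"{level1:6s} {level2:6s} {inhibitor(level1, level2):+.2f}")
```

Unlike the conjunctive rule, the level of Skill 1 never enters the result directly; it only switches Skill 2's contribution on or off.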
The inhibitor distribution has an interesting evidentiary interpretation.
Return to the English “story problem” example (Example 8.3). If the examinee’s English proﬁciency is above the threshold, then the story problem can
provide evidence for the mathematical skills it is designed to measure. On the
other hand, if the English proﬁciency is below the threshold, the distribution
produces no evidence for the mathematical skills; the lack of a prerequisite
inhibits any evidence from ﬂowing from the observable to the proﬁciency (see
Exercise 8.10).
Table 8.3 Effective thetas for inhibitor combination functions

Skill 1   Skill 2   θ2      Effective theta
High      High      +0.97   +0.97
High      Medium     0.00    0.00
High      Low       −0.97   −0.97
Medium    High      +0.97   +0.97
Medium    Medium     0.00    0.00
Medium    Low       −0.97   −0.97
Low       High      +0.97   −0.97
Low       Medium     0.00   −0.97
Low       Low       −0.97   −0.97
It is possible to make more complex combination functions by mixing the
combination functions given above. For example, a mix of conjunctive and
disjunctive combination functions could model a task that admits two solution
paths, one using Skills 1 and 2 and the other using Skills 3 and 4. In fact,
mixing the inhibitor eﬀect with compensatory and conjunctive combination
functions is quite common.
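One way to picture the two-path example is as a disjunction of two conjunctions. The sketch below assumes unit slopes on all four skills and a single difficulty $\beta$; both the function name `two_path_theta` and these simplifications are assumptions made for illustration, not a parameterization given in the text.

```python
def two_path_theta(theta1, theta2, theta3, theta4, beta=0.0):
    """Effective theta for a task with two alternative solution paths.

    Path A needs both Skill 1 and Skill 2 (conjunctive, min);
    path B needs both Skill 3 and Skill 4 (conjunctive, min);
    the examinee uses whichever path is stronger (disjunctive, max).
    """
    path_a = min(theta1, theta2)
    path_b = min(theta3, theta4)
    return max(path_a, path_b) - beta


if __name__ == "__main__":
    # Strong on path A, weak on path B: path A carries the task.
    print(two_path_theta(0.97, 0.97, -0.97, -0.97))
    # One weak skill on each path: neither path is strong.
    print(two_path_theta(0.97, -0.97, -0.97, 0.97))
```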
Distributions can also be chained together by introducing dummy latent
variables as stand-ins for combinations of skills. In the example above, new
variables would be deﬁned for having the conjunction of Skills 1 and 2, say
Skill 1&2, and similarly deﬁning Skill 3&4. Then a disjunctive combination
would follow, of having Skill 1&2 or Skill 3&4. Even though there are more
variables, computation is simpliﬁed because the conditional probability matrices are smaller. On the other hand, there is a loss of precision with this
approach. An unrestricted conditional probability matrix can have diﬀerent
probabilities for each possible combination of parents, whereas the stage-wise
combinations collapse categories. Whether this is a good trade-oﬀ depends on
whether the collapsed parent combinations are substantively diﬀerent. There
also may be some other issues noted in Sect. 8.5.4.
8.5.3 Samejima’s Graded Response Model
Applying the combination function to the effective thetas for each of the parent variables produces one effective theta for each row of the conditional probability table (CPT). Completing the CPT requires turning that effective theta into a set of conditional probabilities, one for each possible state of the child variable. This is the role of the link function. This section explores a class of link functions based on the logistic function (commonly used in IRT models). Now that we have projected the examinee's skill state (i.e., the states of the parent variables) onto an effective theta, we will calculate probabilities with the graded response model of Samejima (1969). Section 8.5.4 describes an alternative link function based on the normal distribution.
The most common IRT models posit a single continuous skill variable, $\theta$, and binary response variables. The relationship between the observation and the skill variable is a logistic regression. Various parameterizations are found in the literature. The simplest is the one-parameter logistic (1PL) or Rasch model (same function, different conceptual underpinnings):
$$P(X = 1 \mid \theta) = \operatorname{logit}^{-1}\bigl(D(\theta - d)\bigr) = \frac{\exp(D(\theta - d))}{1 + \exp(D(\theta - d))}, \tag{8.11}$$
where $d$ is an item difficulty parameter and $X$ is a binary observed outcome, 1 for correct and 0 for incorrect. The constant $D = 1.7$ is sometimes used to scale the logistic function so that it looks like a normal CDF (Sect. 8.5.4). The 2PL additionally has a slope or discrimination parameter $a$ for each item, so that $P(X = 1 \mid \theta) = \operatorname{logit}^{-1}\bigl(Da(\theta - d)\bigr)$.
Samejima's graded response model extends this model to an observable $X$ that can take one of the ordered values $x_0 \prec \cdots \prec x_{M-1}$. It is usually developed from the 2PL, but we will build from the 1PL because the 2PL's discrimination parameter is redundant with the slope parameters in the effective theta combination function. For $m = 1, \ldots, M-1$, we first define cumulative conditional probabilities for the response categories:
$$P(X \ge x_m \mid \theta) = P^*_m(\theta) = \operatorname{logit}^{-1}\bigl(D(\theta - d_m)\bigr), \tag{8.12}$$
where $d_m$ is a category difficulty parameter. There is always one fewer cumulative probability curve than the number of possible outcomes, as $P(X \ge x_0) = 1$. The category response probabilities $P(X = x_m \mid \theta)$ can be calculated from the differences of the cumulative probabilities given by Eq. 8.12, with $P(X = x_0) = 1 - P(X \ge x_1)$. That is,
$$P(X = x_m \mid \theta) = P^*_m(\theta) - P^*_{m+1}(\theta), \tag{8.13}$$
with the conventions $P^*_0(\theta) = 1$ and $P^*_M(\theta) = 0$.
Figure 8.11 illustrates response category probabilities for a three-category
task, d1 = −1, and d2 = +1. For very low values of θ, the lowest level of
response is most likely. As θ increases, probabilities increase for higher-valued
responses in an orderly manner. Given the item parameters, a single value of
θ speciﬁes the full conditional distribution for all possible responses.
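The cumulative-then-difference construction can be sketched as follows. This is an illustration of Eqs. 8.12 and 8.13 as written, with the scaling constant `D` exposed as an argument; the function names are hypothetical. The values it produces depend on the scaling convention chosen for `D`, so they need not agree digit for digit with tables computed under a different convention.

```python
from math import exp


def inv_logit(x: float) -> float:
    """Logistic function, arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + exp(-x))
    z = exp(x)
    return z / (1.0 + z)


def graded_response(theta: float, d: list[float], D: float = 1.7) -> list[float]:
    """Return [P(X = x_0), ..., P(X = x_{M-1})] for effective theta `theta`.

    `d` holds the M - 1 increasing category difficulties d_1, ..., d_{M-1}.
    """
    if any(lower >= upper for lower, upper in zip(d, d[1:])):
        raise ValueError("category difficulties must be strictly increasing")
    # Cumulative curves P*_0 = 1, P*_1, ..., P*_{M-1}, P*_M = 0 (Eq. 8.12).
    cumulative = [1.0] + [inv_logit(D * (theta - dm)) for dm in d] + [0.0]
    # Adjacent differences give the category probabilities (Eq. 8.13).
    return [cumulative[m] - cumulative[m + 1] for m in range(len(d) + 1)]


if __name__ == "__main__":
    for theta in (-3.0, 0.0, 3.0):
        probs = graded_response(theta, d=[-1.0, 1.0])
        print(theta, [round(p, 3) for p in probs])
```

Padding the cumulative list with 1 and 0 at the ends lets the lowest and highest categories use the same difference formula as the interior ones.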
All that remains is to specify the values for $d_m$. As we already have a difficulty parameter, $\beta$, in the effective theta combination step, additional constraints are needed for the $d$s. One way to do this is to require $\sum_{m=1}^{M-1} d_m = 0$. For three categories, this means that $d_1 = -d_2$. Or we can set $d_1 = -1$ and $d_{M-1} = 1$, or $d_1 = 0$ and $d_{M-1} = 1$ (as we do in Chap. 15). Furthermore, as the categories of the output variable are ordered with respect to difficulty, the $d_m$'s should be increasing. Other than that we can specify any values we like for these parameters. When there are more than three categories, $d_2 < \cdots < d_{M-1}$ can be estimated, or set at equal spacing, or, when $M$ is large, determined as a function of some smaller number of parameters (e.g., Andrich 1985). Almond et al. (2001) took them to be equally spaced in the interval $[-1, 1]$. The intercept $\beta$ controls the overall difficulty. The $d$ parameters control how spread out the probabilities for the various child variable values are, with the slope parameter(s) $\alpha_k$ determining how much varying each of the parents changes the probabilities.

Fig. 8.11 Probabilities for Graded Response model. The horizontal axis is the effective theta (from −3 to 3) and the vertical axis is probability (from 0 to 1); the three curves are Pr(X = x0), Pr(X = x1), and Pr(X = x2). Reprinted from Almond et al. (2006a) with permission from ETS.
Figure 8.11 illustrates the idea for a simple conditional probability table for a child variable with a single parent, both of which have three levels. The combination function is $\alpha_1 \theta_1 - \beta$ (as there is only one parent, it does not matter if it is compensatory or conjunctive), and for simplicity, set $\alpha_1 = 1$ and $\beta = 0$; also let $d_1 = -1$, which forces $d_2 = 1$. Figure 8.11 shows the three graded response curves, and the points are the values of those curves evaluated at the effective thetas. Putting this in tabular form yields Table 8.4.
Example 8.4 (Compensatory Graded Response Distribution). As a more complete example, consider the CPT for an observable variable Credit, which represents three possible scoring levels for a short constructed response task: Full, Partial, or No credit. This task taps two measured skills, labeled Skill 1 and Skill 2, each of which has three possible levels: High, Medium, and Low. The domain experts have determined that the relationship between these skills is compensatory, that they are equally important, and that the task should be of average difficulty. The combination function is therefore the one given in Eq. 8.8, with $\alpha_1 = \alpha_2 = 1$ and $\beta = 0$. The effective thetas are given in Table 8.1. The graded response link functions are given in Eqs. 8.12 and 8.13. Table 8.5 shows the final conditional probability table.
Table 8.4 Conditional probability table for simple graded response model

Skill     Effective theta   Full    Partial   None
High      +0.967            0.656   0.278     0.066
Medium     0.000            0.269   0.462     0.269
Low       −0.967            0.066   0.278     0.656

Table 8.5 Compensatory combination function and graded response link function

Skill 1   Skill 2   Effective theta   Full    Partial   No
High      High      +1.03             0.678   0.262     0.060
High      Medium    +0.68             0.541   0.356     0.103
High      Low       +0.34             0.397   0.433     0.171
Medium    High      +0.34             0.397   0.433     0.171
Medium    Medium     0.00             0.269   0.462     0.269
Medium    Low       −0.34             0.171   0.433     0.397
Low       High      −0.34             0.171   0.433     0.397
Low       Medium    −0.68             0.103   0.356     0.541
Low       Low       −1.03             0.060   0.262     0.678

Even though we may use equally-spaced values of $d_m$ for elicitation from the experts, we may not want to preserve the equal spacing when we have a lot of data. Instead, we could estimate distinct $d_m$'s for each category, free to vary independently as long as they remain in increasing order. In this approach, the model will fit the marginal proportions in each category exactly.
Following the advice of Kadane (1980), we should try to elicit priors in terms of quantities the experts are used to observing. Here, the best parameters to elicit would be the marginal proportions of observed outcomes (Kadane et al. 1980 and Chaloner and Duncan 1983 describe applications of this idea for regression models and binomial proportion models, respectively). In a Bayes net, marginal proportions depend on both the conditional probability distributions described above and the population distribution for the parent variables (usually proficiency variables). If the population distribution for the parent variables has already been elicited, it should be possible to pick $\beta_j$'s and $d_{j,m}$'s to match the marginal proportions.
8.5.4 Normal Link Function
The IRT-based link functions shown in the preceding section work fairly well
when the child variable is an observable outcome and the parent variable is
a proficiency variable. This makes sense as IRT models were designed to model
the relationship between a latent ability and an item outcome. When the
child variable is a proﬁciency variable, a diﬀerent link function—one based
on the normal distribution—works better (Almond 2010a). This link function
is motivated by many users’ familiarity with basic regression with normally
distributed residuals.
Assume again that the population distribution on the eﬀective theta space
for the child variable is normally distributed, and that (before taking into
account information from the parent variables) it is equally divided between
the categories. We can establish cut points, $c_i$, between the intervals. These are shown on the solid curve in Fig. 8.12. (Actually, equal divisions are not necessary; Almond 2010a extends the method to use any desired marginal distribution for the child variable. This section will stick to the equal divisions, as they are easier to explain.)
Fig. 8.12 Output translation method. The horizontal axis is the effective theta, with cut points c1 and c2 marked. The solid curve represents the standard normal reference curve for the child variable; cut points are established relative to equal probability intervals on this curve. The dashed curve represents a displaced curve after taking parent variables into account. Probabilities for child variables are areas between cut points under the dashed curve. Reprinted with permission from ETS.
Next, we consider how the parent variables would shift our beliefs about
the child. We can think about this as a kind of a regression where the eﬀective
theta for the child variable is predicted from the eﬀective theta for the parent
variables, so that:
$$\tilde{\theta} = \sum_{k \in \text{parents}} \frac{\alpha_k}{\sqrt{K}}\,\theta_k - \beta, \tag{8.14}$$
where the $\theta_k$s on the right are the effective thetas of the parent variables and $\tilde{\theta}$ on the left is the combined effective theta of the child. As in the compensatory combination function, the factor of $1/\sqrt{K}$ stabilizes the variance of the linear predictor. Equation 8.14 gives the mean of the dashed curve in Fig. 8.12. Changing the value of $\beta$ shifts this curve to the right or the left, and changing the value of the $\alpha_k$'s changes the amount that the curve shifts with the change in the parent variables.
A given individual will not have an effective theta precisely at the mean of this curve, but rather somewhere around it. Let $\sigma$ be the residual standard deviation. As the effective thetas for the parent variables are scaled to have variance one, $\operatorname{Var}(\tilde{\theta}) = \sigma^2 + \sum_{k=1}^{K} \alpha_k^2 / K$. When $\sigma^2$ is small compared to the $\alpha_k$'s, then the distributions will be tightly clustered around the value predicted by Eq. 8.14; in other words, the parents will have more predictive power. When $\sigma^2$ is large, then there will be more uncertainty about the state of the child variable given the parents.
Continuing the analogy to multiple regression, we can deﬁne
$$R^2 = \frac{\sum_{k=1}^{K} \alpha_k^2 / K}{\sigma^2 + \sum_{k=1}^{K} \alpha_k^2 / K} .$$
We need to be careful in interpreting this R2 value though. In the typical
eﬀective-theta application, it is describing the squared correlation between
latent variables. Latent variable correlations are considerably higher than the
observed variable correlations that experts are used to seeing. An observed
variable correlation of 0.5 might correspond to a latent correlation of 0.8 or
higher (depending on the reliability of the measures of the latent variables).
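The relationship between the slopes, the residual standard deviation, and $R^2$ can be explored with a small helper. This is a sketch of the ratio defined above; the function names are hypothetical, and the inverse helper is simply that ratio solved for $\sigma$.

```python
from math import sqrt


def latent_r_squared(alphas, sigma):
    """R^2 = (sum alpha_k^2 / K) / (sigma^2 + sum alpha_k^2 / K)."""
    explained = sum(a * a for a in alphas) / len(alphas)
    return explained / (sigma ** 2 + explained)


def sigma_for_r_squared(alphas, r_squared):
    """Residual standard deviation that yields a target R^2 (0 < R^2 <= 1)."""
    if not 0.0 < r_squared <= 1.0:
        raise ValueError("r_squared must lie in (0, 1]")
    explained = sum(a * a for a in alphas) / len(alphas)
    return sqrt(explained * (1.0 - r_squared) / r_squared)


if __name__ == "__main__":
    # Two equally weighted parents with unit slopes.
    print(latent_r_squared([1.0, 1.0], sigma=0.5))
    print(sigma_for_r_squared([1.0, 1.0], r_squared=0.8))
```

The inverse form may be convenient when an expert finds it easier to state a latent correlation than a residual standard deviation.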
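The final probabilities come from the areas between the cut points under the prediction (dashed) curve. The probability for the lowest state is the area below $c_1$. The probability for the second state is the area between $c_1$ and $c_2$, and so on up to the highest state, which is the area above the highest cut point. Thus, if the child variable is $X$ with ordered states $x_1 \prec \cdots \prec x_M$, then:
$$P(X \ge x_m \mid \operatorname{pa}(X)) = \Phi\!\left(\frac{\tilde{\theta} - c_{m-1}}{\sigma}\right), \tag{8.15}$$
where $m > 1$. Naturally, $P(X \ge x_1) = 1$, so the individual probabilities can be calculated from the differences:
$$P(X = x_m \mid \operatorname{pa}(X)) = \Phi\!\left(\frac{\tilde{\theta} - c_{m-1}}{\sigma}\right) - \Phi\!\left(\frac{\tilde{\theta} - c_m}{\sigma}\right), \tag{8.16}$$
with the conventions $c_0 = -\infty$ and $c_M = +\infty$. The value of $\tilde{\theta}$ is given by Eq. 8.14.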
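The area-between-cut-points description can be turned into a short computation. The sketch below assumes equal-probability cut points on the standard normal reference curve and uses `statistics.NormalDist` from the Python standard library for the displaced curve; the function names are hypothetical.

```python
from statistics import NormalDist


def equal_probability_cuts(num_states: int) -> list[float]:
    """Cut points c_1 < ... < c_{M-1} splitting N(0, 1) into M equal areas."""
    reference = NormalDist()
    return [reference.inv_cdf(m / num_states) for m in range(1, num_states)]


def normal_link(theta_tilde: float, sigma: float, cuts: list[float]) -> list[float]:
    """Probabilities of the child's states, ordered from lowest to highest.

    Each probability is the area between adjacent cut points under a
    normal curve with mean `theta_tilde` and standard deviation `sigma`.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    displaced = NormalDist(mu=theta_tilde, sigma=sigma)
    # Area below each cut point, padded with 0 (below -inf) and 1 (below +inf).
    below = [0.0] + [displaced.cdf(c) for c in cuts] + [1.0]
    return [below[m + 1] - below[m] for m in range(len(cuts) + 1)]


if __name__ == "__main__":
    cuts = equal_probability_cuts(3)
    for theta_tilde in (-1.0, 0.0, 1.0):
        probs = normal_link(theta_tilde, sigma=0.5, cuts=cuts)
        print(theta_tilde, [round(p, 3) for p in probs])
```

With `theta_tilde = 0` and `sigma = 1` the displaced curve coincides with the reference curve, so every state receives the same probability; shifting the mean or shrinking `sigma` concentrates mass in fewer states.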
There is a strong similarity between Eqs. 8.12 and 8.15. First, the shapes
of the logistic curve and the normal CDF are very similar; the factor D = 1.7
commonly used in IRT models makes the two curves almost the same shape.
In particular, the normal link function described here is really the probit
function, and the graded response link function described in the previous
section is the logistic link function.
There are four diﬀerences between the two link functions. The ﬁrst is the
metaphor that motivates the curve: Eq. 8.12 is based on IRT while Eq. 8.15 is
based on factor analysis and regression. The second is the role of the parameter
σ, the residual standard deviation. In the normal model, this eﬀectively scales
the αk's and β, so that the equivalent values using the graded response link