7.3 EXAMPLE: ONE-WAY TREATMENT STRUCTURE WITH EQUAL SLOPES MODEL
Analysis of Messy Data, Volume III: Analysis of Covariance
TABLE 7.2
Analysis of Covariance Using All Five of the Possible Covariates
proc glm data=ex_7_1; classes treat;
model y=treat x1 x2 x3 x4 x5/solution;

Source       df   SS       MS      FValue  ProbF
Model         8   130.117  16.265  5.673   0.0002
Error        31    88.883   2.867
Corr Total   39   219.000

Source  df  SS (III)  MS      FValue  ProbF
treat    3   91.166   30.389  10.599  0.0001
x1       1   16.172   16.172   5.640  0.0239
x2       1   13.198   13.198   4.603  0.0399
x3       1    2.818    2.818   0.983  0.3292
x4       1    0.710    0.710   0.248  0.6222
x5       1   12.406   12.406   4.327  0.0459

Parameter  Estimate  StdErr  tValue  Probt
x1           0.337   0.142    2.375  0.0239
x2          –0.839   0.391   –2.145  0.0399
x3           0.027   0.027    0.991  0.3292
x4          –0.074   0.149   –0.498  0.6222
x5           0.162   0.078    2.080  0.0459
TABLE 7.3
Adjusted Means and Pairwise Comparisons
Based on the Model with Five Covariates
lsmeans treat/pdiff stderr;

Treat  LSM Num  LSMEAN  StdErr
1      1        17.720  0.542
2      2        18.312  0.555
3      3        20.280  0.560
4      4        21.688  0.551

LSMEANNumber  _1      _2      _3      _4
1                     0.4487  0.0028  0.0000
2             0.4487          0.0207  0.0002
3             0.0028  0.0207          0.0844
4             0.0000  0.0002  0.0844
the two variables are possibly not needed in the model. The estimate of the variance
is 2.867, which is based on 31 degrees of freedom. Table 7.3 contains the adjusted
or least squares means for the four treatments as well as pairwise comparisons of
the treatments. Using a Fisher’s protected LSD approach, the means of Treatments 1
© 2002 by CRC Press LLC
Variable Selection in the Analysis of Covariance Model
TABLE 7.4
PROC GLM Code to Fit the Analysis of Variance Model to the Response
Variable and Each of the Possible Covariates and Compute the Residuals
for Each
proc glm; classes treat;
model y x1 x2 x3 x4 x5=treat; * fit models 7.1 and 7.3;
output out=resids r=ry r1 r2 r3 r4 r5; * compute the residuals;
and 2 and of Treatments 3 and 4 are not significantly different while all other
comparisons have significance levels less than 0.05.
Since at least two of the possible covariates have slopes that are not significantly
different from zero, the model building process described in Section 7.2 is used to
carry out variable selection for determining the adequate set of covariates for the
model. The PROC GLM statement in Table 7.4 fits Model 7.1 to the response variable
y and Model 7.3 to each of the possible covariates x1, x2, x3, x4, and x5. The main
product of these analyses is the computation of the sets of residuals for each of the
variables. The output statement provides a file, called “resids,” that contains all of
the residuals, ry, r1, r2, r3, r4, and r5. The REG procedure in Tables 7.5 to 7.9 uses
the computed residuals and model selection procedures to select variables for the
analysis of covariance.
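For a one-way treatment structure with no other design effects, the residuals produced by the code in Table 7.4 are simply each observation minus its treatment mean. A minimal sketch of that step in Python rather than SAS, with made-up data:

```python
def treatment_residuals(values, treatments):
    """Residuals from a one-way ANOVA fit: each observation
    minus the mean of its treatment group."""
    groups = {}
    for v, t in zip(values, treatments):
        groups.setdefault(t, []).append(v)
    means = {t: sum(vs) / len(vs) for t, vs in groups.items()}
    return [v - means[t] for v, t in zip(values, treatments)]

# Hypothetical data: two treatments, three observations each
y     = [10.0, 12.0, 14.0, 20.0, 22.0, 24.0]
treat = [1, 1, 1, 2, 2, 2]
ry = treatment_residuals(y, treat)   # plays the role of "ry" above
```

The same function applied to each candidate covariate yields the analogues of r1 through r5.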
Five of the several available variable selection methods (Draper and
Smith, 1981; SAS, 1989; Ott, 1988) were used to demonstrate some of the
aspects of model building. The methods used were stepwise, forward, backward,
adjusted R2, and CP. There is no guarantee that these procedures will yield the same
model, and in most cases involving many possible covariates, the sets of selected
variables will not be identical. Tables 7.5 through 7.9 contain the results of the model
building processes.
The PROC REG code and results of using the stepwise method are in Table 7.5.
The stepwise variable selection method starts with no variables in the model and
includes variables in a stepwise manner. At each step after including a new variable,
a variable with the largest significance level is eliminated (when the significance
level is greater than a pre-set value). In this case variables r1, r5, and r2 were selected.
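At each step the entry decision is based on a partial F statistic comparing the current model with and without a candidate variable. A rough sketch of a single entry step in Python rather than SAS (the function and data names are invented for illustration):

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from least squares of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_step(Xcand, selected, y):
    """One entry step: return the index of the candidate column whose
    addition gives the largest partial F value, and that F value."""
    n = len(y)
    Xsel = np.hstack([np.ones((n, 1))] + [Xcand[:, [i]] for i in selected])
    best, best_F = None, -1.0
    for i in range(Xcand.shape[1]):
        if i in selected:
            continue
        Xfull = np.hstack([Xsel, Xcand[:, [i]]])
        sse_r, sse_f = sse(Xsel, y), sse(Xfull, y)
        F = (sse_r - sse_f) / (sse_f / (n - Xfull.shape[1]))
        if F > best_F:
            best, best_F = i, F
    return best, best_F
```

Repeating this step, and after each entry removing any already-entered variable whose significance level exceeds the stay threshold, reproduces the stepwise logic.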
TABLE 7.5
PROC REG Code to Use the Stepwise Variable Selection Procedure
and Results
proc reg data=resids;
stepwise: model ry=r1 r2 r3 r4 r5/selection = stepwise;

Step  Entered  Var In  PartialR**2  ModelR**2  Cp     FValue  ProbF
1     r1       1       0.161        0.161      8.506  7.301   0.010
2     r5       2       0.082        0.243      6.159  4.005   0.053
3     r2       3       0.094        0.337      3.155  5.124   0.030
TABLE 7.6
PROC REG Code to Use the Forward Variable Selection Procedure
and Results
proc reg data=resids;
forward: model ry = r1 r2 r3 r4 r5 /selection = forward;

Step  Entered  Var In  PartialR**2  ModelR**2  Cp     FValue  ProbF
1     r1       1       0.161        0.161      8.506  7.301   0.010
2     r5       2       0.082        0.243      6.159  4.005   0.053
3     r2       3       0.094        0.337      3.155  5.124   0.030
4     r3       4       0.017        0.354      4.272  0.902   0.349
TABLE 7.7
PROC REG Code to Use the Backward Variable Selection Procedure
and Results
proc reg data=resids;
backward: model ry = r1 r2 r3 r4 r5 /selection = backward;

Step  Removed  Var In  PartialR**2  ModelR**2  Cp     FValue  ProbF
1     r4       4       0.005        0.354      4.272  0.272   0.606
2     r3       3       0.017        0.337      3.155  0.902   0.349
TABLE 7.8
PROC REG Code to Use the adjrsq Variable
Selection Procedure and Results for Top Five
Combinations of Variables
proc reg data=resids;
adjrsq: model ry = r1 r2 r3 r4 r5 /selection = adjrsq;

Dependent  Var In  Adjrsq  RSquare  VarsInModel
ry         3       0.2822  0.3374   r1 r2 r5
ry         4       0.2802  0.3541   r1 r2 r3 r5
ry         5       0.2649  0.3592   r1 r2 r3 r4 r5
ry         4       0.2633  0.3389   r1 r2 r4 r5
ry         2       0.2022  0.2431   r1 r5
Thus, the analysis indicates that X1, X2, and X5 are needed as possible covariates
in the analysis of the response variable.
Table 7.6 contains the PROC REG code and results of using the forward method.
The forward variable selection process starts with no variables in the model and
includes the next most important variable at each step. The forward variable selection method selects X1, X2, X3, and X5, although the significance level for X3 in
TABLE 7.9
PROC REG Code to Use the CP Variable Selection
Procedure and Results for Top Five Combinations
of Variables
proc reg data=resids;
cp: model ry = r1 r2 r3 r4 r5 /selection = cp;

Dependent  Var In  Cp     RSquare  VarsInModel
ry         3       3.155  0.337    r1 r2 r5
ry         4       4.272  0.354    r1 r2 r3 r5
ry         4       5.078  0.339    r1 r2 r4 r5
ry         5       6.000  0.359    r1 r2 r3 r4 r5
ry         2       6.159  0.243    r1 r5
the final model is 0.349. This indicates that X3 is most likely not needed in the
model.
The backward variable selection PROC REG code and results are in Table 7.7.
The backward variable selection method starts with all covariates in the model and
eliminates the least important variable at each step (that variable with the largest
significance level). The backward method eliminated variables r3 and r4, indicating
that r1, r2, and r5 are remaining in the model.
Table 7.8 contains the PROC REG code to use the method “adjrsq” to select
variables for the model. The process is to fit models that include all possible
combinations of the variables and compute the adjusted R2 for each model. The
selected model consists of that set of variables with the largest adjusted R2. With
five variables, this process fits 2⁵ – 1 = 31 models. The results of the five sets of
variables with the largest adjusted R2 are included in Table 7.8. The set of variables
with the largest adjusted R2 consists of r1, r2, and r5.
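The all-subsets computation behind Table 7.8 can be sketched in Python rather than SAS (names and data invented; for q candidates this fits the 2^q – 1 non-empty subsets):

```python
from itertools import combinations
import numpy as np

def adj_r2(X, y):
    """Adjusted R-squared for least squares of y on X (X includes the intercept)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - sse / sst
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)

def best_subset_adjr2(Xcand, y):
    """Fit every non-empty subset of the candidate columns and return
    (adjusted R-squared, subset) for the best subset."""
    n, q = Xcand.shape
    ones = np.ones((n, 1))
    results = []
    for k in range(1, q + 1):
        for combo in combinations(range(q), k):
            X = np.hstack([ones, Xcand[:, list(combo)]])
            results.append((adj_r2(X, y), combo))
    return max(results)
```

As the text notes, this enumeration grows exponentially in the number of candidates.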
Finally, Table 7.9 contains the PROC REG code and results of using the CP
method of variable selection. As for the “adjrsq” method, the CP method fits models
with all possible combinations of variables and selects that model where CP
approaches “p,” the number of parameters in the model including the intercept. That
combination of variables with the CP value closest to “p” is r1, r2, r3, and r5 with
CP = 4.272. When fitting a model with all four of these variables, the significance
level corresponding to X3 is 0.3705, indicating that given the other variables are in
the model, variable X3 is not needed. Just as for the adjusted R2 method, the CP
method fits all possible combinations of the variables, which can become an unmanageable number when the number of possible covariates becomes large.
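Mallows' CP can be written directly from its definition, using the submodel residual sum of squares and the full-model mean squared error; CP near p suggests the submodel is adequate. A minimal sketch in Python (note that the full model itself has CP = p exactly, consistent with the value 6.000 shown for r1–r5 in Table 7.9, where n = 40 and p = 6):

```python
def mallows_cp(sse_p, mse_full, n, p):
    """Mallows' Cp: sse_p is the submodel residual SS, mse_full the
    full-model mean squared error, n the number of observations, and
    p the number of parameters in the submodel (intercept included)."""
    return sse_p / mse_full - (n - 2 * p)
```

For the full model, sse_p = mse_full × (n – p), so the expression collapses to p.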
Using the approach of not including any variables with large significance levels
in the model, all of the procedures indicate that the needed variables are X1, X2, and
X5. Remember that the degrees of freedom associated with the residual sum of
squares for any of the above models are larger than they should be, since the
regression code does not take into account the fact that the data being analyzed are
residuals. In this case, four degrees of freedom for residual were used to estimate
the means of the treatments for use in computing the residuals. Thus, all of the
TABLE 7.10
PROC GLM Code to Fit the Final Model with Three
Covariates, Sums of Squares, and Estimates
of the Slopes for Each of the Covariates
proc glm data=ex_7_1; classes treat;
model y=treat x1 x2 x5/solution;
lsmeans treat/pdiff stderr;

Source       df   SS       MS      FValue  ProbF
Model         6   127.097  21.183  7.606   0.0000
Error        33    91.903   2.785
Corr Total   39   219.000

Source  df  SS (III)  MS      FValue  ProbF
treat    3   91.799   30.600  10.988  0.0000
x1       1   17.678   17.678   6.348  0.0168
x2       1   13.081   13.081   4.697  0.0375
x5       1   17.419   17.419   6.255  0.0175

Parameter  Estimate  StdErr  tValue  Probt
x1           0.340   0.135    2.519  0.0168
x2          –0.833   0.384   –2.167  0.0375
x5           0.184   0.073    2.501  0.0175
degrees of freedom associated with a residual sum of squares are inflated by four.
This means that the computed t-statistics, adjusted R2 values, and CP values are not
exactly correct. The values of these statistics could be recomputed before decisions
are made concerning the variables to be included in the model, but the results from
the model building procedures without recomputation provide good approximations
and an adequate basis for making decisions.
Table 7.10 contains the PROC GLM code to fit the final model with X1, X2, and
X5 as covariates. The mean square error has a value of 2.785 as compared to the
mean square error for the model with all five covariates (see Table 7.2) which has
a value of 2.867. When covariates that are not needed are included in the model,
the degrees of freedom for error are reduced more than the error sum of squares is
reduced, thus increasing the value of the estimate of the variance. The significance
levels corresponding to the statistics for testing that the individual slopes of the
covariates are equal to zero are 0.0168, 0.0375, and 0.0175 for X1, X2, and X5,
respectively. The significance level corresponding to source Treat is 0.0000, indicating
the intercepts are not equal, or that the distances between the various parallel
hyperplanes are not zero. Table 7.11 contains the adjusted means, the predicted
values on the hyperplanes at the average values of X1, X2, and X5, which are 25.68,
4.05, and 36.20, respectively. Using a Fisher’s protected LSD method to make
pairwise comparisons of the treatment means indicates that Treatments 1 and 2 are
not significantly different, while all other comparisons have significance levels less
than 0.05. One additional comparison, 3 vs. 4, is significant for the model with three
covariates but not for the model with five covariates.
TABLE 7.11
Adjusted Means and Pairwise Comparisons
Using the Final Model with Three Covariates
lsmeans treat/pdiff stderr;

treat  LSM Num  LSMean  StdErr  Probt
1      1        17.710  0.532   0.0000
2      2        18.366  0.544   0.0000
3      3        20.190  0.535   0.0000
4      4        21.734  0.541   0.0000

LSM Num  _1      _2      _3      _4
1                0.3925  0.0025  0.0000
2        0.3925          0.0242  0.0001
3        0.0025  0.0242          0.0492
4        0.0000  0.0001  0.0492
TABLE 7.12
PROC GLM Code to Fit the Residual Model with Three
Covariates and to Provide Sums of Squares and Estimates
of the Slopes for Each of the Covariates
proc glm data=resids;
model ry=r1 r2 r5/solution;

Source           df   SS       MS      FValue  ProbF
Model             3   46.799   15.600  6.111   0.0018
Error            36   91.903    2.553
Corrected Total  39  138.702

Source  df  SS(III)  MS      FValue  ProbF
r1       1  17.678   17.678  6.925   0.0124
r2       1  13.081   13.081  5.124   0.0297
r5       1  17.419   17.419  6.823   0.0130

Parameter  Estimate  StdErr  tValue  Probt
Intercept    0.000   0.253    0.000  1.0000
r1           0.340   0.129    2.631  0.0124
r2          –0.833   0.368   –2.264  0.0297
r5           0.184   0.070    2.612  0.0130
For comparison purposes, the residuals of y were regressed on the residuals of
X1, X2, and X5 and the results are in Table 7.12. The error sum of squares is 91.903,
the same as in Table 7.10. The mean square error is 2.553 = 91.903/36 instead of
2.785 = 91.903/33 since the degrees of freedom for error from the residual model
are 36 instead of the 33 as in Table 7.10. The estimates of the slopes are identical
for both models (as shown by the theory), but the estimated standard errors from
the residual model are smaller than those from the final model. Again this is the
result of using 36 degrees of freedom for error rather than using 33 degrees of
freedom. The standard errors from Table 7.12 can be recomputed as
stderr(slope) = stderr(slope from residual model) × √(36/33),

e.g., 0.135 = 0.129 × √(36/33) for x1.
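A quick numeric check of this adjustment, using the slope standard errors from Tables 7.10 and 7.12 (Python sketch):

```python
import math

def adjust_stderr(se_residual_model, df_residual_model, df_final_model):
    """Scale a slope standard error from the residual regression up to
    the final-model value by the ratio of error degrees of freedom."""
    return se_residual_model * math.sqrt(df_residual_model / df_final_model)

se_x1 = adjust_stderr(0.129, 36, 33)   # ~0.135 for x1, as in Table 7.10
se_x5 = adjust_stderr(0.070, 36, 33)   # ~0.073 for x5
```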
Even though the variable selection procedure is not exact, the results are adequate
for selecting effective models for carrying out the analysis of covariance.
7.4 SOME THEORY
The analysis of covariance model can be expressed in general matrix notation as
y = Mµ + Xβ + ε    (7.7)
where y is an n × 1 vector of the dependent variable, M is the design matrix, µ is
the vector of parameters associated with the treatment and design structures (all
considered as fixed effects for this purpose), X is the matrix of possible covariates,
β is the vector of slopes corresponding to each of the covariates, and ε is the error
vector distributed N(0, σ2 In). The estimates of the slopes can be obtained by a
two-step process: the first step is to fit the Mµ part of the model and compute the
residuals, and the second step is to fit the Xβ part of the model to those residuals,
i.e., first fit
y = Mµ + ε    (7.8)
and compute the residuals as
r = (I − M M⁻)y

where M⁻ denotes a generalized inverse of M (Graybill, 1976).
A model for these residuals is a model that is free of the Mµ effects since the
model for r is
r = (I − M M⁻)Xβ + ε⁺

where ε⁺ ~ N(0, σ2(I − M M⁻)). Next the BLUE of β (assuming β is estimable) is
β̂ = [X′(I − M M⁻)′(I − M M⁻)X]⁻¹ X′(I − M M⁻)′(I − M M⁻)y
  = [X′(I − M M⁻)X]⁻¹ X′(I − M M⁻)y.    (7.9)
The estimate of β is a function of r = (I – M M–)y, the residuals of y from Model 7.8,
and is a function of (I − M M⁻)X, but each column of (I − M M⁻)X is a set of
residuals computed from fitting the model xk = Mµk + εk where xk is the kth column
of X. Thus, computing the residuals of y and of each candidate covariate from a
model with the design matrix of the treatment and design structures and then
performing a variable selection procedure using those residuals provides the appropriate estimates of the slopes. Since the covariance matrix of the residuals of y, r,
is not positive definite [it is of rank n − Rank(M)], the error degrees of freedom from
using a variable selection method on the residuals are inflated by Rank(M). The correct
degrees of freedom could be used in the final steps of the variable selection procedure
to compute the appropriate significance levels. The overall effect of the inflated error
degrees of freedom depends on the sample size and Rank(M). For example, if
n = 100, Rank(M) = 30, and q = 10 (the number of candidate covariates), there is not much
difference between t percentage points with 60 and 90 degrees of freedom. On the
other hand, if n = 50, Rank(M) = 30, and q = 10, there is a big difference between t
percentage points with 10 and 40 degrees of freedom.
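The equivalence behind Equation 7.9 — the slopes from the joint fit of Model 7.7 equal the slopes from regressing the residuals of y on the residuals of the covariates — is easy to verify numerically. A sketch with simulated data (NumPy; the design and parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n, t = 12, 3
M = np.kron(np.eye(t), np.ones((n // t, 1)))       # one-way design matrix
X = rng.standard_normal((n, 2))                    # two candidate covariates
y = M @ np.array([10.0, 12.0, 14.0]) + X @ np.array([1.5, -0.7]) \
    + 0.1 * rng.standard_normal(n)

# One-step fit of y = M*mu + X*beta
full, *_ = np.linalg.lstsq(np.hstack([M, X]), y, rcond=None)
slopes_direct = full[t:]

# Two-step fit: residuals from the M part, then residuals on residuals
P = M @ np.linalg.pinv(M)                          # projection onto C(M)
slopes_resid, *_ = np.linalg.lstsq(X - P @ X, y - P @ y, rcond=None)

assert np.allclose(slopes_direct, slopes_resid)
```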
7.5 WHEN SLOPES ARE POSSIBLY UNEQUAL
When slopes are unequal, the procedure in Section 7.1 may not determine the
appropriate covariates, particularly when some treatments have positive slopes and
others have negative slopes. To extend the procedure to handle unequal slopes for
each covariate, an independent variable needs to be constructed for each level of the
treatment (or levels of treatment combinations) which has the value of the covariate
corresponding to observations of that treatment and has the value zero for observations not belonging to that treatment. In effect, the following model needs to be
constructed
yij = αi + βi1Xij1 + βi2Xij2 + … + βikXijk + εij.
For “t” treatments and two covariates, construct the matrix model

y = [D, x11, x21, …, xt1, x12, x22, …, xt2]β + ε

with β′ = (α1, α2, …, αt, β11, β21, …, βt1, β12, β22, …, βt2),
where D denotes the part of the design matrix with ones and zeros and
x′is = (0, 0, …, 0, xi1s, xi2s, …, xins, 0, …, 0).
Next fit the models
y = Dα + ε

xis = Dαis + εis,   i = 1, 2, …, t,   s = 1, 2, …, k

and compute the residuals, denoted by
r and ris, i = 1, 2, …, t, s = 1, 2, …, k.
Finally, the variable selection procedure can be applied to the resulting sets of
residuals as in Section 7.1.
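The construction of the treatment-specific covariate columns is mechanical: each covariate expands into t columns, one per treatment, carrying the covariate value for that treatment's observations and zero elsewhere. A small sketch in Python with hypothetical data:

```python
def treatment_specific_columns(x, treatments, levels):
    """Expand one covariate into one column per treatment level: the
    column for level i holds the covariate value for observations in
    treatment i and zero elsewhere."""
    return [[xv if tr == lev else 0.0 for lev in levels]
            for xv, tr in zip(x, treatments)]

x     = [2.9, 7.3, 8.1, 2.8]
treat = ['A', 'A', 'B', 'B']
cols = treatment_specific_columns(x, treat, ['A', 'B'])
# cols == [[2.9, 0.0], [7.3, 0.0], [0.0, 8.1], [0.0, 2.8]]
```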
REFERENCES
Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, Second Edition, New York:
John Wiley & Sons.
Graybill, F. A. (1976). Theory and Application of the Linear Model, Pacific Grove, CA:
Wadsworth and Brooks/Cole.
Ott, Lyman (1988). An Introduction to Statistical Methods and Data Analysis, Boston:
PWS-Kent.
SAS Institute Inc. (1989). SAS/STAT® User’s Guide, Version 6, Fourth Edition, Volume 2,
Cary, NC.
Yang, S. S. (1989). Personal communication.
EXERCISES
EXERCISE 7.1: Carry out an analysis of covariance for the following data set by
determining the appropriate model and then making the needed treatment comparisons. Y is the response variable and X, Z, and W are the covariates. Use a regression
model building strategy.
EXERCISE 7.2: Use the data in Section 4.6 with the variable selection procedures
to select variables to be included in the model. The discussion in Section 4.6 indicates
there are some unequal slopes, so the method in Section 7.5 will need to be utilized.
EXERCISE 7.3: Use the data in Section 4.4 with the variable selection process to
determine if the models can be improved by including the square of height, the
square of weight, and the cross-product of height and weight in addition to height
and weight as possible covariates. Make the necessary treatment comparisons using
the final model.
Data for Exercise 7.1

TRT  X    Z    W    Y
A    2.9  4.9  2.2  11.9
A    7.3  4.2  3.2  17.5
A    4.5  4.2  1.9  21.5
A    4.0  4.8  9.2  18.1
A    2.8  4.6  6.6   9.5
A    6.2  4.3  5.5  16.8
A    5.5  4.3  1.0  14.0
A    3.1  5.0  3.7  16.3
A    3.0  4.7  0.4  13.4
A    3.8  4.3  7.7  15.6
A    5.9  4.7  2.7  20.8
A    2.1  4.7  2.4  13.3
A    3.5  4.7  6.1  13.9
A    6.9  4.7  7.1  15.7
A    4.5  4.7  9.4  16.1
B    8.1  4.6  8.3  11.4
B    2.8  4.1  5.9  13.2
B    6.2  4.9  5.5  16.5
B    3.3  4.8  0.9   6.9
B    4.1  4.8  0.6   8.9
B    5.9  4.9  7.3  12.1
B    5.1  4.1  7.6   8.4
B    8.1  4.1  3.2  14.1
B    8.8  4.6  9.9  12.9
B    7.0  4.1  9.4  10.4
B    5.7  4.8  5.5  12.2
B    2.0  4.7  9.4  15.0
B    5.7  4.7  4.3  10.3
B    5.8  4.5  8.0  12.8
B    3.9  4.1  7.4  12.5
C    5.2  4.2  5.6  13.3
C    7.4  4.7  5.4  15.0
C    5.6  4.2  9.3  15.4
C    5.1  4.5  3.8  12.3
C    2.4  4.3  5.4  13.6
C    4.2  4.9  6.9  18.4
C    8.6  4.3  6.3  13.0
C    6.2  4.6  0.2  12.6
C    6.9  4.0  9.1  17.9
C    7.8  4.2  1.4  18.1
C    6.2  4.1  6.8  16.7
C    3.0  4.6  0.5  21.1
C    2.4  4.5  5.6  15.4
C    8.6  4.3  4.5  13.2
C    6.0  4.7  4.3  14.5
D    3.7  4.2  8.3  19.6
D    2.9  4.5  1.0  20.5
D    4.6  4.8  3.4  12.8
D    2.0  4.2  1.0  23.5
D    7.4  4.1  6.6  17.9
D    5.3  4.1  9.7  11.4
D    4.0  4.4  1.4  21.6
D    5.7  4.5  8.8  24.6
D    7.5  4.7  4.6  17.0
D    7.2  4.0  7.7  18.4
D    2.2  4.7  4.0  16.3
D    6.5  4.7  8.1  15.0
D    7.3  4.9  4.9  16.9
D    8.8  4.2  8.1  12.7
D    8.4  4.9  3.3  18.4
8 Comparing Models for Several Treatments
8.1 INTRODUCTION
Once an adequate covariance model has been selected to describe the relationship
between the dependent variable and the covariates, it often is of interest to see if
the models differ from one treatment to the next or from treatment combination to
treatment combination. If one is concerned about the experiment-wise error rate in
an analysis involving many tests of hypotheses, this procedure can provide that
protection if it is used as a first step in comparing the treatments’ models. Suppose
the selected analysis of covariance model is
yij = αi + βi1x1ij + βi2x2ij + … + βiqxqij + εij    (8.1)

for i = 1, 2, …, t and j = 1, 2, …, ni. The equal model hypothesis is

H0: (α1, β11, β12, …, β1q)′ = (α2, β21, β22, …, β2q)′ = … = (αt, βt1, βt2, …, βtq)′ vs. Ha: (not H0).
This type of hypothesis can be tested by constructing a set of contrast statements in
either PROC GLM or PROC MIXED or the model comparison method can be used
to compute the value of the test statistic. The methodology described in this chapter
is an application of the model comparison method that can easily be used to test the
equality of models in many different settings. Schaff et al. (1988) and Hinds and
Milliken (1987) used the method to compare nonlinear models. Section 8.2 describes
the methodology to develop the statistics to test the equal model hypothesis for a
one-way treatment structure, and methodology for the two-way treatment structure
is discussed in Section 8.3. For two-way and higher order treatment structures, this
process generates Type II sums of squares (Milliken and Johnson, 1992). Three
examples are used to demonstrate the methods.
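The model comparison method reduces to the usual full-versus-reduced F statistic computed from the two residual sums of squares: the full model fits separate intercepts and slopes per treatment, and the reduced model imposes the equal-models hypothesis. A generic sketch in Python (the numbers in the example are invented):

```python
def model_comparison_F(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for the equal-models hypothesis: the drop in residual
    SS from reduced to full model, per degree of freedom, scaled by the
    full-model mean squared error."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    return numerator / (sse_full / df_full)

# Hypothetical values: reduced model SSE 100.0 on 36 df,
# full model SSE 60.0 on 30 df; compare F to an F(6, 30) distribution
F = model_comparison_F(100.0, 36, 60.0, 30)
```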