# 3 EXAMPLE: ONE-WAY TREATMENT STRUCTURE WITH EQUAL SLOPES MODEL

Analysis of Messy Data, Volume III: Analysis of Covariance

TABLE 7.2
Analysis of Covariance Using All Five of the Possible Covariates

proc glm data=ex_7_1; classes treat;
  model y=treat x1 x2 x3 x4 x5/solution;

Source       df   SS        MS       FValue   ProbF
Model         8   130.117   16.265   5.673    0.0002
Error        31    88.883    2.867
Corr Total   39   219.000

Source   df   SS (III)   MS       FValue   ProbF
treat     3   91.166     30.389   10.599   0.0001
x1        1   16.172     16.172    5.640   0.0239
x2        1   13.198     13.198    4.603   0.0399
x3        1    2.818      2.818    0.983   0.3292
x4        1    0.710      0.710    0.248   0.6222
x5        1   12.406     12.406    4.327   0.0459

Parameter   Estimate   StdErr   tValue   Probt
x1           0.337     0.142     2.375   0.0239
x2          –0.839     0.391    –2.145   0.0399
x3           0.027     0.027     0.991   0.3292
x4          –0.074     0.149    –0.498   0.6222
x5           0.162     0.078     2.080   0.0459

TABLE 7.3
Adjusted Means Based on the Model with Five Covariates

lsmeans treat/pdiff stderr;

Treat   LSM Num   LSMEAN   StdErr
1       1         17.720   0.542
2       2         18.312   0.555
3       3         20.280   0.560
4       4         21.688   0.551

Significance levels for pairwise comparisons:

LSM Num   _1       _2       _3       _4
1                  0.4487   0.0028   0.0000
2         0.4487            0.0207   0.0002
3         0.0028   0.0207            0.0844
4         0.0000   0.0002   0.0844

the two variables are possibly not needed in the model. The estimate of the variance is 2.867, which is based on 31 degrees of freedom. Table 7.3 contains the adjusted or least squares means for the four treatments as well as pairwise comparisons of the treatments. Using a Fisher’s protected LSD approach, the means of Treatments 1 and 2 and of Treatments 3 and 4 are not significantly different while all other comparisons have significance levels less than 0.05.

© 2002 by CRC Press LLC

Variable Selection in the Analysis of Covariance Model

TABLE 7.4
PROC GLM Code to Fit the Analysis of Variance Model to the Response Variable and Each of the Possible Covariates and Compute the Residuals for Each

proc glm; classes treat;
  model y x1 x2 x3 x4 x5=treat; * fit models 7.1 and 7.3;
  output out=resids r=ry r1 r2 r3 r4 r5; * compute the residuals;

Since at least two of the possible covariates have slopes that are not significantly

different from zero, the model building process described in Section 7.2 is used to

carry out variable selection for determining the adequate set of covariates for the

model. The PROC GLM statement in Table 7.4 fits Model 7.1 to the response variable

y and Model 7.3 to each of the possible covariates x1, x2, x3, x4, and x5. The main

product of these analyses is the computation of the sets of residuals for each of the

variables. The output statement provides a file, called “resids,” that contains all of

the residuals, ry, r1, r2, r3, r4, and r5. The REG procedure in Tables 7.5 to 7.9 uses

the computed residuals and model selection procedures to select variables for the

analysis of covariance.
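The residual computation in Table 7.4 amounts to subtracting each treatment's mean from every variable. A Python sketch of the same step (an illustration with made-up data; the book uses SAS):

```python
import numpy as np

def anova_residuals(v, groups):
    """Residuals from the one-way ANOVA fit v = treatment mean + error:
    subtract each group's mean from its own observations."""
    out = np.asarray(v, dtype=float).copy()
    for g in np.unique(groups):
        mask = groups == g
        out[mask] -= out[mask].mean()
    return out

# Hypothetical data: four treatments, ten observations each
rng = np.random.default_rng(0)
treat = np.repeat([1, 2, 3, 4], 10)
y = rng.normal(18.0, 2.0, size=40)
x1 = rng.normal(25.0, 3.0, size=40)

ry = anova_residuals(y, treat)   # plays the role of ry in the "resids" data set
r1 = anova_residuals(x1, treat)  # plays the role of r1
print(abs(ry[treat == 1].sum()) < 1e-9)  # residuals sum to zero within each group
```

The same function would be applied to each of the remaining candidate covariates to build the full set ry, r1, …, r5.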

Five of the several available variable selection methods (Draper and Smith, 1981; SAS, 1989; and Ott, 1988) were used to demonstrate some aspects of model building. The methods used were stepwise, forward, backward, adjusted R2, and CP. There is no guarantee that these procedures will yield the same model, and in most cases involving many possible covariates, the sets of selected variables will not be identical. Tables 7.5 through 7.9 contain the results of the model building processes.

The PROC REG code and results of using the stepwise method are in Table 7.5.

The stepwise variable selection method starts with no variables in the model and

includes variables in a stepwise manner. At each step after including a new variable,

a variable with the largest significance level is eliminated (when the significance

level is greater than a pre-set value). In this case variables r1, r5, and r2 were selected.

TABLE 7.5
PROC REG Code to Use the Stepwise Variable Selection Procedure and Results

proc reg data=resids;
  stepwise: model ry=r1 r2 r3 r4 r5/selection = stepwise;

Step   Entered   Var In   PartialR**2   ModelR**2   Cp      FValue   ProbF
1      r1        1        0.161         0.161       8.506   7.301    0.010
2      r5        2        0.082         0.243       6.159   4.005    0.053
3      r2        3        0.094         0.337       3.155   5.124    0.030


TABLE 7.6
PROC REG Code to Use the Forward Variable Selection Procedure and Results

proc reg data=resids;
  forward: model ry = r1 r2 r3 r4 r5 /selection = forward;

Step   Entered   Var In   PartialR**2   ModelR**2   Cp      FValue   ProbF
1      r1        1        0.161         0.161       8.506   7.301    0.010
2      r5        2        0.082         0.243       6.159   4.005    0.053
3      r2        3        0.094         0.337       3.155   5.124    0.030
4      r3        4        0.017         0.354       4.272   0.902    0.349

TABLE 7.7
PROC REG Code to Use the Backward Variable Selection Procedure and Results

proc reg data=resids;
  backward: model ry = r1 r2 r3 r4 r5 /selection = backward;

Step   Removed   Var In   PartialR**2   ModelR**2   Cp      FValue   ProbF
1      r4        4        0.005         0.354       4.272   0.272    0.606
2      r3        3        0.017         0.337       3.155   0.902    0.349

TABLE 7.8
PROC REG Code to Use the adjrsq Variable Selection Procedure and Results for Top Five Combinations of Variables

proc reg data=resids;
  adjrsq: model ry = r1 r2 r3 r4 r5 /selection = adjrsq;

Dependent   Var In   AdjRsq   RSquare   VarsInModel
ry          3        0.2822   0.3374    r1 r2 r5
ry          4        0.2802   0.3541    r1 r2 r3 r5
ry          5        0.2649   0.3592    r1 r2 r3 r4 r5
ry          4        0.2633   0.3389    r1 r2 r4 r5
ry          2        0.2022   0.2431    r1 r5

Thus, the analysis indicates that X1, X2, and X5 are needed as possible covariates

in the analysis of the response variable.

Table 7.6 contains the PROC REG code and results of using the forward method. The forward variable selection process starts with no variables in the model and includes the next most important variable at each step. The forward variable selection method selects X1, X2, X3, and X5, although the significance level for X3 in the final model is 0.349. This indicates that X3 is most likely not needed in the model.

TABLE 7.9
PROC REG Code to Use the CP Variable Selection Procedure and Results for Top Five Combinations of Variables

proc reg data=resids;
  cp: model ry = r1 r2 r3 r4 r5 /selection = cp;

Dependent   Var In   Cp      RSquare   VarsInModel
ry          3        3.155   0.337     r1 r2 r5
ry          4        4.272   0.354     r1 r2 r3 r5
ry          4        5.078   0.339     r1 r2 r4 r5
ry          5        6.000   0.359     r1 r2 r3 r4 r5
ry          2        6.159   0.243     r1 r5

The backward variable selection PROC REG code and results are in Table 7.7.

The backward variable selection method starts with all covariates in the model and

eliminates the least important variable at each step (that variable with the largest

significance level). The backward method eliminated variables r3 and r4, indicating

that r1, r2, and r5 are remaining in the model.

Table 7.8 contains the PROC REG code to use the method “adjrsq” to select

variables for the model. The process is to fit models that include all possible

combinations of the variables and compute the adjusted R2 for each model. The

selected model consists of that set of variables with the largest adjusted R2. With

five variables, this process fits 2^5 – 1 = 31 models. The results of the five sets of

variables with the largest adjusted R2 are included in Table 7.8. The set of variables

with the largest adjusted R2 consists of r1, r2, and r5.
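The all-subsets search behind Table 7.8 can be sketched generically. The following Python (a hedged illustration with simulated residuals, not the book's SAS) enumerates the 2^5 – 1 candidate subsets and ranks them by adjusted R2:

```python
import itertools
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R-squared of an OLS fit of y on X (intercept included)."""
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    sse = float(((y - Xc @ beta) ** 2).sum())
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (sse / (n - p - 1)) / (sst / (n - 1))

rng = np.random.default_rng(1)
R = rng.normal(size=(40, 5))   # simulated stand-ins for the residuals r1..r5
ry = 0.34 * R[:, 0] - 0.83 * R[:, 1] + 0.18 * R[:, 2] + rng.normal(scale=1.6, size=40)

# Every non-empty subset of the five candidate covariates
subsets = [s for k in range(1, 6) for s in itertools.combinations(range(5), k)]
print(len(subsets))            # 31 models, as in the text
ranked = sorted(subsets, key=lambda s: adjusted_r2(ry, R[:, list(s)]), reverse=True)
print("best subset (0-based columns):", ranked[0])
```

With real data the top-ranked subset corresponds to the first row of Table 7.8.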

Finally, Table 7.9 contains the PROC REG code and results of using the CP

method of variable selection. As for the “adjrsq” method, the CP method fits models

with all possible combinations of variables and selects that model where CP

approaches “p,” the number of parameters in the model including the intercept. That

combination of variables with the CP value closest to “p” is r1, r2, r3, and r5 with

CP = 4.272. When fitting a model with all four of these variables, the significance

level corresponding to X3 is 0.3705, indicating that given the other variables are in

the model, variable X3 is not needed. Just as for the adjusted R2 method, the CP

method fits all possible combinations of the variables, which can become an unmanageable number when the number of possible covariates becomes large.
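Mallows' CP for a candidate model with p parameters (intercept included) is CP = SSE_p/s² – (n – 2p), where s² comes from the full model; by construction the full model has CP = p exactly, matching the value 6.000 for the five-covariate model in Table 7.9. A hedged numpy sketch on simulated data:

```python
import numpy as np

def mallows_cp(y, X_sub, s2_full, n):
    """Mallows' Cp = SSE_p / s2_full - (n - 2p), p = parameters incl. intercept."""
    Xc = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    sse = float(((y - Xc @ beta) ** 2).sum())
    p = Xc.shape[1]
    return sse / s2_full - (n - 2 * p)

rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 5))                 # five hypothetical residual covariates
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

# s2 from the full five-covariate fit (6 parameters, n - 6 error df)
Xf = np.column_stack([np.ones(n), X])
bf, *_ = np.linalg.lstsq(Xf, y, rcond=None)
s2_full = float(((y - Xf @ bf) ** 2).sum()) / (n - 6)

print(round(mallows_cp(y, X, s2_full, n), 6))  # 6.0: full model gives Cp = p
```

Subsets whose CP falls near their own p are the candidates worth keeping.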

Using the approach of not including any variables with large significance levels in the model, all of the procedures indicate that the needed variables are X1, X2, and X5. Remember that the degrees of freedom associated with the residual sum of squares for any of the above models are larger than they should be, since the regression code does not take into account the fact that the data being analyzed are residuals. In this case, four degrees of freedom for residual were used to estimate the treatment means when computing the residuals, so all of the degrees of freedom associated with a residual sum of squares are inflated by four. This means that the computed t-statistics, adjusted R2 values, and CP values are not exact. The values of these statistics could be recomputed before decisions are made concerning the variables to be included in the model, but the results from the model building procedures without recomputation provide good approximations and an adequate basis for making decisions.

TABLE 7.10
PROC GLM Code to Fit the Final Model with Three Covariates, Sums of Squares, and Estimates of the Slopes for Each of the Covariates

proc glm data=ex_7_1; classes treat;
  model y=treat x1 x2 x5/solution;
  lsmeans treat/pdiff stderr;

Source       df   SS        MS       FValue   ProbF
Model         6   127.097   21.183   7.606    0.0000
Error        33    91.903    2.785
Corr Total   39   219.000

Source   df   SS (III)   MS       FValue   ProbF
treat     3   91.799     30.600   10.988   0.0000
x1        1   17.678     17.678    6.348   0.0168
x2        1   13.081     13.081    4.697   0.0375
x5        1   17.419     17.419    6.255   0.0175

Parameter   Estimate   StdErr   tValue   Probt
x1           0.340     0.135     2.519   0.0168
x2          –0.833     0.384    –2.167   0.0375
x5           0.184     0.073     2.501   0.0175

Table 7.10 contains the PROC GLM code to fit the final model with X1, X2, and X5 as covariates. The mean square error has a value of 2.785, as compared to the value of 2.867 for the model with all five covariates (see Table 7.2). When covariates that are not needed are included in the model, the degrees of freedom for error are reduced more than the error sum of squares is reduced, thus increasing the value of the estimate of the variance. The significance levels corresponding to the statistics for testing that the individual slopes of the covariates are equal to zero are 0.0168, 0.0375, and 0.0175 for X1, X2, and X5, respectively. The significance level corresponding to source Treat is 0.0000, indicating that the intercepts are not equal, or that the distances between the various parallel hyperplanes are not zero. Table 7.11 contains the adjusted means, the predicted values on the hyperplanes at the average values of X1, X2, and X5, which are 25.68, 4.05, and 36.20, respectively. Using a Fisher’s protected LSD method to make pairwise comparisons of the treatment means indicates that Treatments 1 and 2 are not significantly different while all other comparisons have significance levels less than 0.05. One additional comparison, 3 vs. 4, is significant for the model with three covariates that was not significant for the model with five covariates.


TABLE 7.11
Adjusted Means Using the Final Model with Three Covariates

lsmeans treat/pdiff stderr;

treat   LSM Num   LSMean   StdErr   Probt
1       1         17.710   0.532    0.0000
2       2         18.366   0.544    0.0000
3       3         20.190   0.535    0.0000
4       4         21.734   0.541    0.0000

Significance levels for pairwise comparisons:

LSM Num   _1       _2       _3       _4
1                  0.3925   0.0025   0.0000
2         0.3925            0.0242   0.0001
3         0.0025   0.0242            0.0492
4         0.0000   0.0001   0.0492

TABLE 7.12
PROC GLM Code to Fit the Residual Model with Three Covariates to Provide Sums of Squares and Estimates of the Slopes for Each of the Covariates

proc glm data=resids;
  model ry=r1 r2 r5/solution;

Source            df   SS        MS       FValue   ProbF
Model              3    46.799   15.600   6.111    0.0018
Error             36    91.903    2.553
Corrected Total   39   138.702

Source   df   SS(III)   MS       FValue   ProbF
r1        1   17.678    17.678   6.925    0.0124
r2        1   13.081    13.081   5.124    0.0297
r5        1   17.419    17.419   6.823    0.0130

Parameter   Estimate   StdErr   tValue   Probt
Intercept    0.000     0.253     0.000   1.0000
r1           0.340     0.129     2.631   0.0124
r2          –0.833     0.368    –2.264   0.0297
r5           0.184     0.070     2.612   0.0130

For comparison purposes, the residuals of y were regressed on the residuals of

X1, X2, and X5 and the results are in Table 7.12. The error sum of squares is 91.903,

the same as in Table 7.10. The mean square error is 2.553 = 91.903/36 instead of

2.785 = 91.903/33 since the degrees of freedom for error from the residual model

are 36 instead of the 33 as in Table 7.10. The estimates of the slopes are identical

for both models (as shown by the theory), but the estimated standard errors from


the residual model are smaller than those from the final model. Again this is the

result of using 36 degrees of freedom for error rather than using 33 degrees of

freedom. The standard errors from Table 7.12 can be recomputed as

stderr(slope, final model) = stderr(slope, residual model) × √(36/33),

for example, 0.135 = 0.129 × √(36/33).
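The rescaling from the residual-model standard error (36 error degrees of freedom) to the final-model value (33 error degrees of freedom) can be checked in a couple of lines of Python, using the x1 values from Tables 7.10 and 7.12:

```python
import math

# Slope standard error for x1 from the residual model (Table 7.12, 36 error df),
# rescaled by sqrt(36/33) to the final-model value (Table 7.10, 33 error df).
se_residual = 0.129
se_final = se_residual * math.sqrt(36 / 33)
print(round(se_final, 3))  # 0.135, matching the x1 standard error in Table 7.10
```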

Even though the variable selection procedure is not exact, the results are adequate for selecting effective models for carrying out the analysis of covariance.

7.4 SOME THEORY

The analysis of covariance model can be expressed in general matrix notation as

y = Mµ + Xβ + ε    (7.7)

where y is the n × 1 vector of the dependent variable, M is the design matrix, µ is the vector of parameters corresponding to the treatment and design structures (all

β is the vector of slopes corresponding to each of the covariates, and ε is the error

distributed N(0, σ 2 In ). The estimates of the slopes can be obtained by using a

stepwise process where the first step is to fit the Mµ part of the model, computing

the residuals, and then the second step is to fit the Xβ part of the model, i.e., first fit

y = Mµ + ε    (7.8)

and compute the residuals as

r = (I − M M⁻)y

where M– denotes a generalized inverse of M (Graybill, 1976).

A model for these residuals is a model that is free of the Mµ effects since the

model for r is

r = (I − M M⁻)Xβ + ε⁺

where ε⁺ ~ N(0, σ²(I − M M⁻)). Next, the BLUE of β (assuming β is estimable) is

β̂ = [X′(I − M M⁻)′(I − M M⁻)X]⁻¹ X′(I − M M⁻)′(I − M M⁻)y
  = [X′(I − M M⁻)X]⁻¹ X′(I − M M⁻)y.    (7.9)


The estimate of β is a function of r = (I – M M–)y, the residuals of y from Model 7.8,

and is a function of (I – M M⁻)X, but each column of (I – M M⁻)X is a set of

residuals computed from fitting the model xk = Mµk + εk where xk is the kth column

of X. Thus, computing the residuals of y and of each candidate covariate from a

model with the design matrix of the treatment and design structures and then

performing a variable selection procedure using those residuals provides the appropriate estimates of the slopes. Since the covariance matrix of the residuals of y, r,

is not positive definite [it is of rank n – Rank(M)], the error degrees of freedom from using a variable selection method on the residuals are inflated by Rank(M). The correct

degrees of freedom could be used in the final steps of the variable selection procedure

to compute the appropriate significance levels. The overall effect of the inflated error

degrees of freedom depends on the sample size and the Rank (M). For example if

n = 100, R (M) = 30, and q = 10 (number of candidate covariates), there is not much

difference between t percentage points with 60 and 90 degrees of freedom. On the

other hand if n = 50, R(M) = 30, and q = 10, there is a big difference between t

percentage points with 10 and 40 degrees of freedom.
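The two-step equivalence in Equation 7.9 — the slopes from regressing residuals on residuals equal the slopes from the full covariance model — can be verified numerically. The following numpy sketch with simulated data is an illustration of the theory, not code from the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 40, 4
treat = np.repeat(np.arange(t), n // t)
M = np.eye(t)[treat]                      # n x t design matrix of treatment indicators
X = rng.normal(size=(n, 2))               # two hypothetical covariates
y = M @ np.array([17.0, 18.0, 20.0, 22.0]) + X @ np.array([0.3, -0.8]) + rng.normal(size=n)

# One-step fit of the covariance model y = M mu + X beta + e
coef, *_ = np.linalg.lstsq(np.column_stack([M, X]), y, rcond=None)
beta_full = coef[t:]

# Two-step fit: residuals of y regressed on residuals of the columns of X
Q = np.eye(n) - M @ np.linalg.pinv(M)     # I - M M^-, projector giving the residuals
beta_resid, *_ = np.linalg.lstsq(Q @ X, Q @ y, rcond=None)

print(np.allclose(beta_full, beta_resid)) # True: the slope estimates coincide
```

Only the error degrees of freedom differ between the two fits, which is exactly the inflation discussed above.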

7.5 WHEN SLOPES ARE POSSIBLY UNEQUAL

When slopes are unequal, the procedure in Section 7.1 may not determine the

appropriate covariates, particularly when some treatments have positive slopes and

others have negative slopes. To extend the procedure to handle unequal slopes for

each covariate, an independent variable needs to be constructed for each level of the

treatment (or levels of treatment combinations) which has the value of the covariate

corresponding to observations of that treatment and has the value zero for observations not belonging to that treatment. In effect, the following model needs to be

constructed

y_ij = α_i + β_i1 X_ij1 + β_i2 X_ij2 + … + β_ik X_ijk + ε_ij.

For “t” treatments and two covariates, construct the matrix model

$$
\begin{bmatrix} y_{11}\\ y_{12}\\ \vdots\\ y_{1n}\\ y_{21}\\ y_{22}\\ \vdots\\ y_{2n}\\ \vdots\\ y_{t1}\\ y_{t2}\\ \vdots\\ y_{tn} \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0 & X_{111} & 0 & \cdots & 0 & X_{112} & 0 & \cdots & 0\\
1 & 0 & \cdots & 0 & X_{121} & 0 & \cdots & 0 & X_{122} & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
1 & 0 & \cdots & 0 & X_{1n1} & 0 & \cdots & 0 & X_{1n2} & 0 & \cdots & 0\\
0 & 1 & \cdots & 0 & 0 & X_{211} & \cdots & 0 & 0 & X_{212} & \cdots & 0\\
0 & 1 & \cdots & 0 & 0 & X_{221} & \cdots & 0 & 0 & X_{222} & \cdots & 0\\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
0 & 1 & \cdots & 0 & 0 & X_{2n1} & \cdots & 0 & 0 & X_{2n2} & \cdots & 0\\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & X_{t11} & 0 & 0 & \cdots & X_{t12}\\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & X_{t21} & 0 & 0 & \cdots & X_{t22}\\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & X_{tn1} & 0 & 0 & \cdots & X_{tn2}
\end{bmatrix}
\begin{bmatrix} \alpha_1\\ \alpha_2\\ \vdots\\ \alpha_t\\ \beta_{11}\\ \beta_{21}\\ \vdots\\ \beta_{t1}\\ \beta_{12}\\ \beta_{22}\\ \vdots\\ \beta_{t2} \end{bmatrix}
+ \boldsymbol{\varepsilon},
$$

or, more compactly,

y = [D, x_11, x_21, …, x_t1, x_12, x_22, …, x_t2] β + ε

where D denotes the part of the design matrix with ones and zeros and

x′_is = (0, 0, …, 0, x_i1s, x_i2s, …, x_ins, 0, …, 0).

Next fit the models

y = Dα + ε
x_is = Dα_is + ε_is,   i = 1, 2, …, t;  s = 1, 2, …, k

and compute the residuals, denoted by r and r_is, i = 1, 2, …, t; s = 1, 2, …, k. Finally, the variable selection procedure can be applied to the resulting sets of residuals as in Section 7.1.
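Constructing the per-treatment covariate columns x_is is mechanical: each column carries the covariate's values for the observations in one treatment and zeros elsewhere. A small Python illustration with hypothetical data:

```python
import numpy as np

def split_covariate(x, treat, levels):
    """One column per treatment level: the covariate value where the observation
    belongs to that level, zero otherwise (the x_is columns of this section)."""
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat)
    return np.column_stack([np.where(treat == g, x, 0.0) for g in levels])

treat = ["A", "A", "B", "B", "C", "C"]
x = [2.9, 7.3, 8.1, 2.8, 5.2, 7.4]           # one covariate
X = split_covariate(x, treat, ["A", "B", "C"])
print(X.shape)                                # (6, 3): one column per treatment
print(X[:, 0].tolist())                       # [2.9, 7.3, 0.0, 0.0, 0.0, 0.0]
```

Repeating this for each of the k covariates yields the full [D, x_11, …, x_tk] design matrix, after which the residual computation proceeds as before.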

REFERENCES

Draper, N. R. and Smith, H. (1981). Applied Regression Analysis, Second Edition, New York: John Wiley & Sons.
Graybill, F. A. (1976). Theory and Application of the Linear Model, Pacific Grove, CA.
Ott, Lyman (1988). An Introduction to Statistical Methods and Data Analysis, Boston: PWS-Kent.
SAS Institute Inc. (1989). SAS/STAT® User’s Guide, Version 6, Fourth Edition, Volume 2, Cary, NC.
Yang, S. S. (1989). Personal communication.

EXERCISES

EXERCISE 7.1: Carry out an analysis of covariance for the following data set by

determining the appropriate model and then making the needed treatment comparisons. Y is the response variable and X, Z, and W are the covariates. Use a regression

model building strategy.

EXERCISE 7.2: Use the data in Section 4.6 with the variable selection procedures

to select variables to be included in the model. The discussion in Section 4.6 indicates

there are some unequal slopes, so the method in Section 7.5 will need to be utilized.

EXERCISE 7.3: Use the data in Section 4.4 with the variable selection process to

determine if the models can be improved by including the square of height, the

square of weight, and the cross-product of height and weight in addition to height


and weight as possible covariates. Make the necessary treatment comparisons using

the final model.

Data for Exercise 7.1

TRT   X     Z     W     Y      |  TRT   X     Z     W     Y
A     2.9   4.9   2.2   11.9   |  C     5.2   4.2   5.6   13.3
A     7.3   4.2   3.2   17.5   |  C     7.4   4.7   5.4   15.0
A     4.5   4.2   1.9   21.5   |  C     5.6   4.2   9.3   15.4
A     4.0   4.8   9.2   18.1   |  C     5.1   4.5   3.8   12.3
A     2.8   4.6   6.6    9.5   |  C     2.4   4.3   5.4   13.6
A     6.2   4.3   5.5   16.8   |  C     4.2   4.9   6.9   18.4
A     5.5   4.3   1.0   14.0   |  C     8.6   4.3   6.3   13.0
A     3.1   5.0   3.7   16.3   |  C     6.2   4.6   0.2   12.6
A     3.0   4.7   0.4   13.4   |  C     6.9   4.0   9.1   17.9
A     3.8   4.3   7.7   15.6   |  C     7.8   4.2   1.4   18.1
A     5.9   4.7   2.7   20.8   |  C     6.2   4.1   6.8   16.7
A     2.1   4.7   2.4   13.3   |  C     3.0   4.6   0.5   21.1
A     3.5   4.7   6.1   13.9   |  C     2.4   4.5   5.6   15.4
A     6.9   4.7   7.1   15.7   |  C     8.6   4.3   4.5   13.2
A     4.5   4.7   9.4   16.1   |  C     6.0   4.7   4.3   14.5
B     8.1   4.6   8.3   11.4   |  D     3.7   4.2   8.3   19.6
B     2.8   4.1   5.9   13.2   |  D     2.9   4.5   1.0   20.5
B     6.2   4.9   5.5   16.5   |  D     4.6   4.8   3.4   12.8
B     3.3   4.8   0.9    6.9   |  D     2.0   4.2   1.0   23.5
B     4.1   4.8   0.6    8.9   |  D     7.4   4.1   6.6   17.9
B     5.9   4.9   7.3   12.1   |  D     5.3   4.1   9.7   11.4
B     5.1   4.1   7.6    8.4   |  D     4.0   4.4   1.4   21.6
B     8.1   4.1   3.2   14.1   |  D     5.7   4.5   8.8   24.6
B     8.8   4.6   9.9   12.9   |  D     7.5   4.7   4.6   17.0
B     7.0   4.1   9.4   10.4   |  D     7.2   4.0   7.7   18.4
B     5.7   4.8   5.5   12.2   |  D     2.2   4.7   4.0   16.3
B     2.0   4.7   9.4   15.0   |  D     6.5   4.7   8.1   15.0
B     5.7   4.7   4.3   10.3   |  D     7.3   4.9   4.9   16.9
B     5.8   4.5   8.0   12.8   |  D     8.8   4.2   8.1   12.7
B     3.9   4.1   7.4   12.5   |  D     8.4   4.9   3.3   18.4

# 8 COMPARING MODELS FOR SEVERAL TREATMENTS

8.1 INTRODUCTION

Once an adequate covariance model has been selected to describe the relationship

between the dependent variable and the covariates, it often is of interest to see if

the models differ from one treatment to the next or from treatment combination to

treatment combination. If one is concerned about the experiment-wise error rate in

an analysis involving many tests of hypotheses, this procedure can provide that

protection if it is used as a first step in comparing the treatments’ models. Suppose

the selected analysis of covariance model is

y_ij = α_i + β_i1 x_1ij + β_i2 x_2ij + ⋯ + β_iq x_qij + ε_ij    (8.1)

for i = 1, 2, …, t and j = 1, 2, …, ni. The equal model hypothesis is

 α1   α 2 

 αt 

β   β 

β 

 11   21 

 t1 

H 0 : β12  = β22  = L = β t 2  vs. H a : ( not H 0 ) .

   

 

 M   M 

 M 

β1q  β2 q 

β tq 

   

 

This type of hypothesis can be tested by constructing a set of contrast statements in either PROC GLM or PROC MIXED, or the model comparison method can be used to compute the value of the test statistic. The methodology described in this chapter

is an application of the model comparison method that can easily be used to test the

equality of models in many different settings. Schaff et al. (1988) and Hinds and

Milliken (1987) used the method to compare nonlinear models. Section 8.2 describes

the methodology to develop the statistics to test the equal model hypothesis for a

one-way treatment structure, and methodology for the two-way treatment structure

is discussed in Section 8.3. For two-way and higher order treatment structures, this

process generates Type II sums of squares (Milliken and Johnson, 1992). Three

examples are used to demonstrate the methods.
