
5.1 Introduction: Why Model Building Is Important

262 Chapter 5 Principles of Model Building

together with data on an extensive list of independent variables, x1, x2, ..., xk,

that they thought were related to y. Among these independent variables were the

student’s IQ, scores on mathematics and verbal achievement examinations, rank in

class, and so on. They ﬁt the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

to the data, analyzed the results, and reached the conclusion that none of the

independent variables was "significantly related" to y. The goodness of fit of the

model, measured by the coefficient of determination $R^2$, was not particularly good,

and t tests on individual parameters did not lead to rejection of the null hypotheses

that these parameters equaled 0.

How could the researchers have reached the conclusion that there is no signiﬁcant relationship, when it is evident, just as a matter of experience, that some

of the independent variables studied are related to academic achievement? For

example, achievement on a college mathematics placement test should be related

to achievement in college mathematics. Certainly, many other variables will affect

achievement—motivation, environmental conditions, and so forth—but generally

speaking, there will be a positive correlation between entrance achievement test

scores and college academic achievement. So, what went wrong with the educational

researchers’ study?

Although you can never discard the possibility of computing error as a reason

for erroneous answers, most likely the difﬁculties in the results of the educational

study were caused by the use of an improperly constructed model. For example,

the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

assumes that the independent variables x1, x2, ..., xk affect mean achievement E(y)

independently of each other.∗ Thus, if you hold all the other independent variables

constant and vary only x1, E(y) will increase by the amount β1 for every unit increase

in x1. A 1-unit change in any of the other independent variables will increase E(y)

by the value of the corresponding β parameter for that variable.
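The constant-effect property described above can be checked numerically. The following is a minimal sketch, assuming a two-variable first-order model with made-up coefficients (the values 10, 2, and 0.5 are hypothetical and do not come from the study): the change in E(y) for a 1-unit increase in x1 comes out the same no matter where x2 is held.

```python
import numpy as np

# Hypothetical coefficients (beta0, beta1, beta2), chosen only for illustration.
beta = np.array([10.0, 2.0, 0.5])

def mean_response(x1, x2):
    """E(y) for the first-order model beta0 + beta1*x1 + beta2*x2."""
    return beta[0] + beta[1] * x1 + beta[2] * x2

# Change in E(y) when x1 goes from 3 to 4, evaluated at several fixed x2 values.
effects = [mean_response(4.0, x2) - mean_response(3.0, x2) for x2 in (0.0, 5.0, 50.0)]
print(effects)  # every entry equals beta1
```

Because the model has no term that ties x1 to x2, the level of x2 drops out of the difference. That is exactly the assumption the text goes on to question.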

Do the assumptions implied by the model agree with your knowledge about

academic achievement? First, is it reasonable to assume that the effect of time spent

on study is independent of native intellectual ability? We think not. No matter how

much effort some students invest in a particular subject, their rate of achievement is

low. For others, it may be high. Therefore, assuming that these two variables—effort

and native intellectual ability—affect E(y) independently of each other is likely to

be an erroneous assumption. Second, suppose that x5 is the amount of time a student

devotes to study. Is it reasonable to expect that a 1-unit increase in x5 will always

produce the same change β5 in E(y)? The changes in E(y) for a 1-unit increase in x5

might depend on the value of x5 (e.g., the law of diminishing returns). Consequently,

it is quite likely that the assumption of a constant rate of change in E(y) for 1-unit

increases in the independent variables will not be satisﬁed.

Clearly, the model

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

was a poor choice in view of the researchers’ prior knowledge of some of the variables

involved. Terms have to be added to the model to account for interrelationships

among the independent variables and for curvature in the response function. Failure

to include needed terms causes inﬂated values of SSE, nonsigniﬁcance in statistical

tests, and, often, erroneous practical conclusions.
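The effect of omitted terms on SSE can be illustrated with simulated data. This is a sketch under stated assumptions, not a reconstruction of the educational study: the data-generating equation, the sample size, and the names `x1` and `x2` are all hypothetical. The response is generated with an interaction term and a curvature term, and a first-order fit is compared with a fit that includes both.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)  # hypothetical predictor, e.g., study time
x2 = rng.uniform(0, 10, n)  # hypothetical predictor, e.g., an ability score

# Simulated response containing an interaction (x1*x2) and curvature (x1**2).
y = 5 + 2 * x1 + 1 * x2 + 0.8 * x1 * x2 - 0.15 * x1**2 + rng.normal(0, 2, n)

def sse(X, y):
    """Sum of squared errors of the least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

ones = np.ones(n)
X_first = np.column_stack([ones, x1, x2])                   # first-order terms only
X_full = np.column_stack([ones, x1, x2, x1 * x2, x1**2])    # adds needed terms

sse_first = sse(X_first, y)
sse_full = sse(X_full, y)
print(sse_first, sse_full)
```

The first-order fit leaves the interaction and curvature in the residuals, so its SSE is larger. A larger SSE inflates the estimate of the error variance, which in turn weakens the t tests on the individual parameters.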

∗ Keep in mind that we are discussing the deterministic portion of the model and that the word independent is

used in a mathematical rather than a probabilistic sense.


In this chapter, we discuss the most difﬁcult part of a multiple regression analysis:

the formulation of a good model for E(y). Although many of the models presented

in this chapter have already been introduced in optional Section 4.12, we assume

the reader has little or no background in model building. This chapter serves as a

basic reference guide to model building for teachers, students, and practitioners of

multiple regression analysis.

5.2 The Two Types of Independent Variables:

Quantitative and Qualitative

The independent variables that appear in a linear model can be one of two types.

Recall from Chapter 1 that a quantitative variable is one that assumes numerical

values corresponding to the points on a line (Definition 1.4). An independent

variable that is not quantitative, that is, one that is categorical in nature, is called

qualitative (Definition 1.5).

The nicotine content of a cigarette, prime interest rate, number of defects in a

product, and IQ of a student are all examples of quantitative independent variables.

On the other hand, suppose three different styles of packaging, A, B, and C, are

used by a manufacturer. This independent variable, style of packaging, is qualitative,

since it is not measured on a numerical scale. Certainly, style of packaging is an

independent variable that may affect sales of a product, and we would want to

include it in a model describing the product’s sales, y.

Deﬁnition 5.1 The different values of an independent variable used in regression are called its levels.

For a quantitative variable, the levels correspond to the numerical values it

assumes. For example, if the number of defects in a product ranges from 0 to 3, the

independent variable assumes four levels: 0, 1, 2, and 3.

The levels of a qualitative variable are not numerical. They can be deﬁned only

by describing them. For example, the independent variable style of packaging was

observed at three levels: A, B, and C.
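As a small illustration of Definition 5.1, the levels of a variable are simply the distinct values observed for it. The sketch below assumes two short, made-up samples (`defects` and `packaging` are hypothetical data, not taken from the text) and lists the levels of each.

```python
# Hypothetical observations of a quantitative and a qualitative variable.
defects = [0, 2, 1, 3, 0, 1, 2, 0]                     # number of defects
packaging = ["A", "C", "B", "A", "B", "C", "A", "B"]   # style of packaging

def levels(values):
    """Return the distinct observed values (the levels), in sorted order."""
    return sorted(set(values))

defect_levels = levels(defects)
packaging_levels = levels(packaging)
print(defect_levels)
print(packaging_levels)
```

The quantitative variable has four numerical levels, while the qualitative variable has three levels that can only be named, not measured.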

Example 5.1

In Chapter 4, we considered the problem of predicting executive salary as a function

of several independent variables. Consider the following four independent variables

that may affect executive salaries:

(a) Years of experience

(b) Gender of the employee

(c) Firm’s net asset value

(d) Rank of the employee

For each of these independent variables, give its type and describe the nature of the

levels you would expect to observe.

Solution

(a) The independent variable for the number of years of experience is quantitative,

since its values are numerical. We would expect to observe levels ranging from

0 to 40 (approximately) years.


(b) The independent variable for gender is qualitative, since its levels can only be

described by the nonnumerical labels "female" and "male."

(c) The independent variable for the ﬁrm’s net asset value is quantitative, with

a very large number of possible levels corresponding to the range of dollar

values representing various ﬁrms’ net asset values.

(d) Suppose the independent variable for the rank of the employee is observed at

three levels: supervisor, assistant vice president, and vice president. Since we

cannot assign a realistic measure of relative importance to each position, rank

is a qualitative independent variable.

Quantitative and qualitative independent variables are treated differently in

regression modeling. In the next section, we see how quantitative variables are

entered into a regression model.

5.2 Exercises

5.1 Buy-side versus sell-side analysts’ earnings forecasts. The Financial Analysts Journal (July/August

2008) published a study comparing the earnings

forecasts of buy-side and sell-side analysts. A team

of Harvard Business School professors used regression to model the relative optimism (y) of the

analysts’ 3-month horizon forecasts based on the following independent variables. Determine the type

(quantitative or qualitative) of each variable.

(a) Whether the analyst worked for a buy-side ﬁrm

or a sell-side ﬁrm.

(b) Number of days between forecast and ﬁscal

year-end (i.e., forecast horizon).

(c) Number of quarters the analyst had worked

with the ﬁrm.

5.2 Workplace bullying and intention to leave. Workplace bullying (e.g., work-related harassment,

persistent criticism, withholding key information,

spreading rumors, intimidation) has been shown

to have a negative psychological effect on victims, often leading the victim to quit or resign.

In Human Resource Management Journal (October

2008), researchers employed multiple regression to

model bullying victims’ intention to leave the ﬁrm

as a function of perceived organizational support

and level of workplace bullying. The dependent

variable in the analysis, intention to leave (y), was

measured on a quantitative scale. Identify the type

(qualitative or quantitative) of the two key independent variables in the study, level of bullying

(measured on a 50-point scale) and perceived organizational support (measured as "low," "neutral,"

or "high").

5.3 Expert testimony in homicide trials of battered

women. The Duke Journal of Gender Law and

Policy (Summer 2003) examined the impact of

expert testimony on the outcome of homicide trials

that involve battered woman syndrome. Multiple

regression was employed to model the likelihood

of changing a verdict from not guilty to guilty after

deliberations, y, as a function of juror gender and

whether or not expert testimony was given. Identify

the independent variables in the model as quantitative or qualitative.

5.4 Chemical composition of rain water. The Journal of Agricultural, Biological, and Environmental

Statistics (March 2005) presented a study of the

chemical composition of rain water. The nitrate

concentration, y (milligrams per liter), in a rain

water sample was modeled as a function of two

independent variables: water source (groundwater,

subsurface ﬂow, or overground ﬂow) and silica concentration (milligrams per liter). Identify the type

(quantitative or qualitative) for each independent

variable.

5.5 Psychological response of ﬁreﬁghters. The Journal of Human Stress (Summer 1987) reported on

a study of "psychological response of firefighters

to chemical fire." The researchers used multiple

regression to predict emotional distress as a function of the following independent variables. Identify

each independent variable as quantitative or qualitative. For qualitative variables, suggest several

levels that might be observed. For quantitative variables, give a range of values (levels) for which the

variable might be observed.

(a) Number of preincident psychological symptoms

(b) Years of experience

(c) Cigarette smoking behavior

(d) Level of social support

(e) Marital status

(f) Age

(g) Ethnic status

(h) Exposure to a chemical fire

(i) Education level

(j) Distance lived from site of incident

(k) Gender

5.6 Modeling a qualitative response. Which of the

assumptions about ε (Section 4.2) prohibit the use

of a qualitative variable as a dependent variable?

(We present a technique for modeling a qualitative

dependent variable in Chapter 9.)

5.3 Models with a Single Quantitative

Independent Variable

To write a prediction equation that provides a good model for a response (one that

will eventually yield good predictions), we have to know how the response might

vary as the levels of an independent variable change. Then we have to know how

to write a mathematical equation to model it. To illustrate (with a simple example),

suppose we want to model a student’s score on a statistics exam, y, as a function of

the single independent variable x, the amount of study time invested. It may be that

exam score, y, increases in a straight line as the amount of study time, x, varies from

1 hour to 6 hours, as shown in Figure 5.1a. If this were the entire range of x-values

for which you wanted to predict y, the model

E(y) = β0 + β1 x

would be appropriate.

Figure 5.1 Modeling exam score, y, as a function of study time, x (panels a and b)

Now, suppose you want to expand the range of values of x to x = 8 or x = 10

hours of studying. Will the straight-line model

E(y) = β0 + β1 x

be satisfactory? Perhaps, but making this assumption could be risky. As the amount

of studying, x, is increased, sooner or later the point of diminishing returns will

be reached. That is, the increase in exam score for a unit increase in study time

will decrease, as shown by the dashed line in Figure 5.1b. To produce this type of

curvature, you must know the relationship between models and graphs, and how

types of terms will change the shape of the curve.

A response that is a function of a single quantitative independent variable can

often be modeled by the ﬁrst few terms of a polynomial algebraic function. The

equation relating the mean value of y to a polynomial of order p in one independent

variable x is shown in the box.


A pth-Order Polynomial with One Independent Variable

$$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_p x^p$$

where p is an integer and β0, β1, ..., βp are unknown parameters that must be

estimated.

As we mentioned in Chapters 3 and 4, a ﬁrst-order polynomial in x (i.e., p = 1),

E(y) = β0 + β1 x

graphs as a straight line. The β interpretations of this model are provided in the

next box.

First-Order (Straight-Line) Model with One Independent Variable

E(y) = β0 + β1 x

Interpretation of model parameters

β0 : y-intercept; the value of E(y) when x = 0

β1 : Slope of the line; the change in E(y) for a 1-unit increase in x

In Chapter 4 we also covered a second-order polynomial model (p = 2), called

a quadratic. For convenience, the model is repeated in the following box.

A Second-Order (Quadratic) Model with One Independent Variable

$$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$$

where β0, β1, and β2 are unknown parameters that must be estimated.

Interpretation of model parameters

β0 : y-intercept; the value of E(y) when x = 0

β1 : Shift parameter; changing the value of β1 shifts the parabola to the right

or left (the turning point lies at x = −β1/(2β2), so increasing β1 shifts the parabola to the left when β2 > 0 and to the right when β2 < 0)

β2 : Rate of curvature
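The role of β1 as a shift parameter can be made concrete by locating the turning point of the parabola. Setting the derivative β1 + 2β2x equal to zero gives x = −β1/(2β2). The sketch below uses hypothetical coefficient values, and the helper name `vertex_x` is introduced here for illustration only. It shows that the direction of the shift depends on the sign of β2.

```python
def vertex_x(beta1, beta2):
    """x-coordinate of the turning point of beta0 + beta1*x + beta2*x**2."""
    if beta2 == 0:
        raise ValueError("beta2 must be nonzero for a parabola")
    return -beta1 / (2.0 * beta2)

# Hypothetical coefficients: raise beta1 from 2 to 4 and watch the turning point.
up_before, up_after = vertex_x(2.0, 1.0), vertex_x(4.0, 1.0)        # opens upward
down_before, down_after = vertex_x(2.0, -1.0), vertex_x(4.0, -1.0)  # opens downward

print(up_before, up_after)      # turning point moves left when beta2 > 0
print(down_before, down_after)  # turning point moves right when beta2 < 0
```

Note that β0 does not appear in the formula: changing the intercept moves the parabola up or down without moving its turning point left or right.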

Graphs of two quadratic models are shown in Figure 5.2. As we learned in

Chapter 4, the quadratic model is the equation of a parabola that opens either

upward, as in Figure 5.2a, or downward, as in Figure 5.2b. If the coefficient of $x^2$ is

positive, it opens upward; if it is negative, it opens downward. The parabola may be

shifted upward or downward, left or right. The least squares procedure uses only

the portion of the parabola that is needed to model the data. For example, if you ﬁt

a parabola to the data points shown in Figure 5.3, the portion shown as a solid curve

passes through the data points. The outline of the unused portion of the parabola is

indicated by a dashed curve.

Figure 5.2 Graphs for two second-order polynomial models (panels a and b)

Figure 5.3 Example of the use of a quadratic model (E(y) plotted against x; curves: true relationship and second-order model)

Figure 5.3 illustrates an important limitation on the use of prediction equations:

The model is valid only over the range of x-values that were used to ﬁt the model.

For example, the response might rise, as shown in the ﬁgure, until it reaches a

plateau. The second-order model might ﬁt the data very well over the range of

x-values shown in Figure 5.3, but would provide a very poor ﬁt if data were collected

in the region where the parabola turns downward.
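The danger of using a fitted model outside the range of the data can be sketched numerically. The example below assumes a hypothetical plateau-shaped true relationship (the function `true_mean` and the ranges used are illustrative choices, not values from Figure 5.3). It fits a quadratic over a limited range of x and then compares the error inside that range with the error at a point far outside it.

```python
import numpy as np

def true_mean(x):
    """Hypothetical response that rises and then levels off at a plateau."""
    return 10.0 * (1.0 - np.exp(-x))

# Fit a second-order model using only x-values between 0 and 3.
x_fit = np.linspace(0.0, 3.0, 31)
coefs = np.polyfit(x_fit, true_mean(x_fit), deg=2)

# Largest error inside the fitted range.
in_range_error = float(np.max(np.abs(np.polyval(coefs, x_fit) - true_mean(x_fit))))

# Error when the same equation is used well outside the fitted range.
x_new = 8.0
extrapolation_error = float(abs(np.polyval(coefs, x_new) - true_mean(x_new)))

print(in_range_error, extrapolation_error)
```

Inside the fitted range the parabola tracks the curve closely. Outside it, the parabola turns downward while the true response stays on its plateau, so the prediction error grows rapidly.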

How do you decide the order of the polynomial you should use to model a

response if you have no prior information about the relationship between E(y) and

x? If you have data, construct a scatterplot of the data points, and see whether you

can deduce the nature of a good approximating function. A pth-order polynomial,

when graphed, can exhibit up to (p − 1) peaks, troughs, or reversals in direction. Note that

the graphs of the second-order model shown in Figure 5.2 each have (p − 1) = 1

peak (or trough). Likewise, a third-order model (shown in the box) can have up to

(p − 1) = 2 peaks or troughs, as illustrated in Figure 5.4.
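The link between the order of a polynomial and its number of reversals can be checked by counting sign changes in the slope. The sketch below is one simple way to do this on a grid of x-values. The helper `count_turning_points` and the example polynomials are hypothetical illustrations, and the count is only as reliable as the grid is fine.

```python
import numpy as np

def count_turning_points(coefs, lo, hi, num=2001):
    """Count peaks and troughs of a polynomial on [lo, hi].

    coefs lists the coefficients from the highest power down (NumPy convention).
    A turning point is detected as a sign change in the derivative on a grid.
    """
    x = np.linspace(lo, hi, num)
    slope = np.polyval(np.polyder(coefs), x)
    signs = np.sign(slope)
    signs = signs[signs != 0]  # ignore grid points where the slope is exactly 0
    return int(np.sum(signs[1:] != signs[:-1]))

quadratic = [1.0, 0.0, 0.0]         # x^2
cubic_two = [1.0, 0.0, -3.0, 0.0]   # x^3 - 3x
cubic_none = [1.0, 0.0, 0.0, 0.0]   # x^3

n_quadratic = count_turning_points(quadratic, -3, 3)
n_cubic_two = count_turning_points(cubic_two, -3, 3)
n_cubic_none = count_turning_points(cubic_none, -3, 3)
print(n_quadratic, n_cubic_two, n_cubic_none)
```

The second cubic shows that p − 1 is an upper limit: x^3 flattens at the origin but never reverses direction, so it has no peak or trough at all.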

The graphs of most responses as a function of an independent variable x are, in

general, curvilinear. Nevertheless, if the rate of curvature of the response curve is

very small over the range of x that is of interest to you, a straight line might provide

an excellent ﬁt to the response data and serve as a very useful prediction equation.

If the curvature is expected to be pronounced, you should try a second-order model.

Third- or higher-order models would be used only where you expect more than one

reversal in the direction of the curve. These situations are rare, except where the

response is a function of time. Models for forecasting over time are presented in

Chapter 10.

Figure 5.4 Graphs of two third-order polynomial models (panels a and b)

Third-Order Model with One Independent Variable

$$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$$

Interpretation of model parameters

β0 : y-intercept; the value of E(y) when x = 0

β1 : Shift parameter (shifts the polynomial right or left on the x-axis)

β2 : Rate of curvature

β3 : The magnitude of β3 controls the rate of reversal of curvature for the polynomial

Example 5.2

To operate efﬁciently, power companies must be able to predict the peak power load

at their various stations. Peak power load is the maximum amount of power that

must be generated each day to meet demand. A power company wants to use daily

high temperature, x, to model daily peak power load, y, during the summer months

when demand is greatest. Although the company expects peak load to increase as

the temperature increases, the rate of increase in E(y) might not remain constant

as x increases. For example, a 1-unit increase in high temperature from 100°F to

101°F might result in a larger increase in power demand than would a 1-unit increase

from 80°F to 81°F. Therefore, the company postulates that the model for E(y) will

include a second-order (quadratic) term and, possibly, a third-order (cubic) term.

A random sample of 25 summer days is selected and both the peak load

(measured in megawatts) and high temperature (in degrees) recorded for each day.

The data are listed in Table 5.1.

(a) Construct a scatterplot for the data. What type of model is suggested

by the plot?

(b) Fit the third-order model, $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$, to the data. Is there

evidence that the cubic term, $\beta_3 x^3$, contributes information for the prediction

of peak power load? Test at α = .05.

(c) Fit the second-order model, $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$, to the data. Test the

hypothesis that the power load increases at an increasing rate with temperature.

Use α = .05.

(d) Give the prediction equation for the second-order model, part c. Are you

satisﬁed with using this model to predict peak power loads?

POWERLOADS

Table 5.1 Power load data

Temperature  Peak Load      Temperature  Peak Load      Temperature  Peak Load
(°F)         (megawatts)    (°F)         (megawatts)    (°F)         (megawatts)
 94          136.0          106          178.2           76          100.9
 96          131.7           67          101.6           68           96.3
 95          140.7           71           92.5           92          135.1
108          189.3          100          151.9          100          143.6
 67           96.5           79          106.2           85          111.4
 88          116.4           97          153.2           89          116.5
 89          118.5           98          150.1           74          103.9
 84          113.4           87          114.7           86          105.1
 90          132.0

Solution

(a) The scatterplot of the data, produced using MINITAB, is shown in Figure 5.5.

The nonlinear, upward-curving trend indicates that a second-order model

would likely ﬁt the data well.

(b) The third-order model is ﬁt to the data using MINITAB and the resulting

printout is shown in Figure 5.6. The p-value for testing

H0 : β3 = 0

Ha : β3 ≠ 0

highlighted on the printout is .911. Since this value exceeds α = .05, there is

insufﬁcient evidence of a third-order relationship between peak load and high

temperature. Consequently, we will drop the cubic term, $\beta_3 x^3$, from the model.

Figure 5.5 MINITAB scatterplot for power load data

Figure 5.6 MINITAB output for third-order model of power load

Figure 5.7 MINITAB output for second-order model of power load

(c) The second-order model is ﬁt to the data using MINITAB and the resulting

printout is shown in Figure 5.7. For this quadratic model, if β2 is positive, then

the peak power load y increases at an increasing rate with temperature x.

Consequently, we test

H0 : β2 = 0

Ha : β2 > 0

The test statistic, t = 7.93, and two-tailed p-value are both highlighted in

Figure 5.7. Since the one-tailed p-value, half the two-tailed value (p = 0/2 = 0 to the precision reported), is less than α = .05, we

reject H0 and conclude that peak power load increases at an increasing rate

with temperature.

(d) The prediction equation for the quadratic model, highlighted in Figure 5.7,

is $\hat{y} = 385 - 8.29x + .0598x^2$. The adjusted $R^2$ and standard deviation for the

model, also highlighted, are $R_a^2 = .956$ and s = 5.376. These values imply

that (1) more than 95% of the sample variation in peak power loads can be

explained by the second-order model, and (2) the model can predict peak

load to within about 2s = 10.75 megawatts of its true value. Based on this

high value of $R_a^2$ and reasonably small value of 2s, we recommend using this

equation to predict peak power loads for the company.
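The second-order fit in part (d) can also be reproduced without MINITAB. The following is a minimal sketch using NumPy least squares on the 25 observations of Table 5.1. The variable names are hypothetical, and the formulas for s and adjusted R² are the standard ones for a model with k = 2 predictor terms. The printed estimates can be compared with the values quoted from the MINITAB printout.

```python
import numpy as np

# Table 5.1: daily high temperature (°F) and peak load (megawatts), 25 days.
temp = np.array([94, 96, 95, 108, 67, 88, 89, 84, 90,
                 106, 67, 71, 100, 79, 97, 98, 87,
                 76, 68, 92, 100, 85, 89, 74, 86], dtype=float)
load = np.array([136.0, 131.7, 140.7, 189.3, 96.5, 116.4, 118.5, 113.4, 132.0,
                 178.2, 101.6, 92.5, 151.9, 106.2, 153.2, 150.1, 114.7,
                 100.9, 96.3, 135.1, 143.6, 111.4, 116.5, 103.9, 105.1])

# Design matrix for E(y) = beta0 + beta1*x + beta2*x^2.
X = np.column_stack([np.ones_like(temp), temp, temp**2])
beta_hat, *_ = np.linalg.lstsq(X, load, rcond=None)

resid = load - X @ beta_hat
n, k = len(load), 2
sse = float(resid @ resid)
s = (sse / (n - (k + 1))) ** 0.5                        # estimated standard deviation
sst = float(np.sum((load - load.mean()) ** 2))
r2_adj = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))    # adjusted R^2

print(beta_hat)   # estimates of beta0, beta1, beta2
print(s, r2_adj)
```

A positive estimate of β2 corresponds to the upward curvature seen in the scatterplot. The sketch does not compute the t statistic or p-value, which would require the standard errors of the estimates.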

5.3 Exercises

5.7 Order of polynomials. The accompanying graphs depict pth-order polynomials for one independent variable.

(a) For each graph, identify the order of the polynomial.

(b) Using the parameters β0, β1, β2, etc., write an appropriate model relating E(y) to x for each graph.

(c) The signs (+ or −) of many of the parameters in the models of part b can be determined by examining the graphs. Give the signs of those parameters that can be determined.

5.8 Cooling method for gas turbines. The Journal of Engineering for Gas Turbines and Power (January 2005) published a study of a high-pressure inlet fogging method for a gas turbine engine. The heat rate (kilojoules per kilowatt per hour) was measured for each in a sample of 67 gas turbines augmented with high-pressure inlet fogging. In addition, several other variables were measured, including cycle speed (revolutions per minute), inlet temperature (°C), exhaust gas temperature (°C), cycle pressure ratio, and air mass flow rate (kilograms per second). The data are saved in the GASTURBINE file. (The first and last five observations are listed in the table.) Consider using these variables as predictors of heat rate (y) in a regression model. Construct scatterplots relating heat rate to each of the independent variables. Based on the graphs, hypothesize a polynomial model relating y to each independent variable.

GASTURBINE (Data for first and last five gas turbines shown)

RPM     CPRATIO  INLET-TEMP  EXH-TEMP  AIRFLOW  HEATRATE
27245    9.2     1134        602         7      14622
14000   12.2      950        446        15      13196
17384   14.8     1149        537        20      11948
11085   11.8     1024        478        27      11289
14045   13.2     1149        553        29      11964
...
18910   14.0     1066        532         8      12766
 3600   35.0     1288        448       152       8714
 3600   20.0     1160        456        84       9469
16000   10.6     1232        560        14      11948
14600   13.4     1077        536        20      12414

Source: Bhargava, R., and Meher-Homji, C. B. "Parametric analysis of existing gas turbines with inlet evaporative and overspray fogging," Journal of Engineering for Gas Turbines and Power, Vol. 127, No. 1, Jan. 2005.

5.9 Study of tree frogs. The optomotor responses of tree frogs were studied in the Journal of Experimental Zoology (September 1993). Microspectrophotometry was used to measure the threshold quantal flux (the light intensity at which the optomotor response was first observed) of tree frogs tested at different spectral wavelengths. The data revealed the relationship between the log of quantal flux (y) and wavelength (x) shown in the graph on p. 271. Hypothesize a model for E(y) that corresponds to the graph.

5.10 Tire wear and pressure. Underinflated or overinflated tires can increase tire wear and decrease gas mileage. A new tire was tested for wear at different pressures with the results shown in the table.

TIRES2

PRESSURE                      MILEAGE
x, pounds per square inch     y, thousands
30                            29
31                            32
32                            36
33                            38
34                            37
35                            33
36                            26

(a) Graph the data in a scatterplot.

(b) If you were given the information for x = 30, 31, 32, and 33 only, what kind of model would you suggest? For x = 33, 34, 35, and 36? For all the data?

5.11 Assembly times and fatigue. A company is considering having the employees on its assembly line work 4 10-hour days instead of 5 8-hour days. Management is concerned that the effect of fatigue as a result of longer afternoons of work might increase assembly times to an unsatisfactory level. An experiment with the 4-day week is planned in which time studies will be conducted on some of the workers during the afternoons. It is believed that an adequate model of the relationship between assembly time, y, and time since lunch, x, should allow for the average assembly time to decrease for a while after lunch before it starts to increase as the workers become tired. Write a model relating E(y) and x that would reflect the management’s belief, and sketch the hypothesized shape of the model.

5.4 First-Order Models with Two or More

Quantitative Independent Variables

Like models for a single independent variable, models with two or more independent variables are classiﬁed as ﬁrst-order, second-order, and so forth, but it

is difﬁcult (most often impossible) to graph the response because the plot is in a

multidimensional space. For example, with one quantitative independent variable,

Figure 5.8 Response surface for first-order model with two quantitative independent variables (axes: x1, x2, y)

## 2011 (7th edition) william mendenhall a second course in statistics regression analysis prentice hall (2011)
