5.1 Introduction: Why Model Building Is Important
262 Chapter 5 Principles of Model Building
together with data on an extensive list of independent variables, x1 , x2 , . . . , xk ,
that they thought were related to y. Among these independent variables were the
student’s IQ, scores on mathematics and verbal achievement examinations, rank in
class, and so on. They ﬁt the model
E(y) = β0 + β1 x1 + β2 x2 + · · · + βk xk
to the data, analyzed the results, and reached the conclusion that none of the
independent variables was “significantly related” to y. The goodness of fit of the
model, measured by the coefficient of determination R², was not particularly good,
and t tests on individual parameters did not lead to rejection of the null hypotheses
that these parameters equaled 0.
How could the researchers have reached the conclusion that there is no signiﬁcant relationship, when it is evident, just as a matter of experience, that some
of the independent variables studied are related to academic achievement? For
example, achievement on a college mathematics placement test should be related
to achievement in college mathematics. Certainly, many other variables will affect
achievement—motivation, environmental conditions, and so forth—but generally
speaking, there will be a positive correlation between entrance achievement test
scores and college academic achievement. So, what went wrong with the educational
researchers’ study?
Although you can never discard the possibility of computing error as a reason
for erroneous answers, most likely the difﬁculties in the results of the educational
study were caused by the use of an improperly constructed model. For example,
the model
E(y) = β0 + β1 x1 + β2 x2 + · · · + βk xk
assumes that the independent variables x1 , x2 , . . . , xk affect mean achievement E(y)
independently of each other.∗ Thus, if you hold all the other independent variables
constant and vary only x1 , E(y) will increase by the amount β1 for every unit increase
in x1 . A 1-unit change in any of the other independent variables will increase E(y)
by the value of the corresponding β parameter for that variable.
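To make this assumption concrete, the following minimal Python sketch uses made-up coefficients (the values of B0, B1, and B2 are purely illustrative and are not estimates from any study). Under the first-order model, the change in E(y) for a 1-unit increase in x1 is the same at every level of x2.

```python
# Hypothetical coefficients for E(y) = b0 + b1*x1 + b2*x2 (illustrative only).
B0, B1, B2 = 10.0, 2.0, 0.5

def mean_response(x1, x2):
    """Mean response under the first-order (additive) model."""
    return B0 + B1 * x1 + B2 * x2

def unit_effect_of_x1(x1, x2):
    """Change in E(y) when x1 rises by one unit with x2 held fixed."""
    return mean_response(x1 + 1, x2) - mean_response(x1, x2)

# The effect of x1 does not depend on where x2 is held.
effects = [unit_effect_of_x1(3.0, x2) for x2 in (0.0, 50.0, 100.0)]
print(effects)
```

Each entry of `effects` equals B1, which is exactly what the first-order model assumes.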
Do the assumptions implied by the model agree with your knowledge about
academic achievement? First, is it reasonable to assume that the effect of time spent
on study is independent of native intellectual ability? We think not. No matter how
much effort some students invest in a particular subject, their rate of achievement is
low. For others, it may be high. Therefore, assuming that these two variables—effort
and native intellectual ability—affect E(y) independently of each other is likely to
be an erroneous assumption. Second, suppose that x5 is the amount of time a student
devotes to study. Is it reasonable to expect that a 1-unit increase in x5 will always
produce the same change β5 in E(y)? The changes in E(y) for a 1-unit increase in x5
might depend on the value of x5 (e.g., the law of diminishing returns). Consequently,
it is quite likely that the assumption of a constant rate of change in E(y) for 1-unit
increases in the independent variables will not be satisﬁed.
Clearly, the model
E(y) = β0 + β1 x1 + β2 x2 + · · · + βk xk
was a poor choice in view of the researchers’ prior knowledge of some of the variables
involved. Terms have to be added to the model to account for interrelationships
among the independent variables and for curvature in the response function. Failure
to include needed terms causes inﬂated values of SSE, nonsigniﬁcance in statistical
tests, and, often, erroneous practical conclusions.
∗ Keep in mind that we are discussing the deterministic portion of the model and that the word independent is
used in a mathematical rather than a probabilistic sense.
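As a rough illustration of what such added terms do, the Python sketch below extends a two-variable model with one interaction term (x1·x2) and one curvature term (x1²). The coefficients are hypothetical and chosen only for illustration; x1 is meant to stand for study time and x2 for ability, and this is one possible form rather than the model the researchers should have used.

```python
# Hypothetical coefficients (illustrative only); x1 = study time, x2 = ability.
B0, B1, B2, B3, B4 = 5.0, 1.0, 0.5, 0.25, -0.125

def mean_response(x1, x2):
    """E(y) with an interaction term (x1*x2) and a curvature term (x1**2)."""
    return B0 + B1 * x1 + B2 * x2 + B3 * x1 * x2 + B4 * x1 ** 2

def unit_effect_of_x1(x1, x2):
    """Change in E(y) when x1 rises by one unit with x2 held fixed."""
    return mean_response(x1 + 1, x2) - mean_response(x1, x2)

# Interaction: the payoff from one more unit of x1 depends on the level of x2.
print(unit_effect_of_x1(1.0, 0.0), unit_effect_of_x1(1.0, 4.0))

# Curvature: with B4 < 0 the payoff shrinks as x1 grows (diminishing returns).
print(unit_effect_of_x1(1.0, 4.0), unit_effect_of_x1(3.0, 4.0))
```

With these terms in the model, the effect of x1 is no longer a single constant, which is the behavior the first-order model cannot represent.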
In this chapter, we discuss the most difﬁcult part of a multiple regression analysis:
the formulation of a good model for E(y). Although many of the models presented
in this chapter have already been introduced in optional Section 4.12, we assume
the reader has little or no background in model building. This chapter serves as a
basic reference guide to model building for teachers, students, and practitioners of
multiple regression analysis.
5.2 The Two Types of Independent Variables: Quantitative and Qualitative
The independent variables that appear in a linear model can be one of two types.
Recall from Chapter 1 that a quantitative variable is one that assumes numerical
values corresponding to the points on a line (Definition 1.4). An independent
variable that is not quantitative, that is, one that is categorical in nature, is called
qualitative (Deﬁnition 1.5).
The nicotine content of a cigarette, prime interest rate, number of defects in a
product, and IQ of a student are all examples of quantitative independent variables.
On the other hand, suppose three different styles of packaging, A, B, and C, are
used by a manufacturer. This independent variable, style of packaging, is qualitative,
since it is not measured on a numerical scale. Certainly, style of packaging is an
independent variable that may affect sales of a product, and we would want to
include it in a model describing the product’s sales, y.
Deﬁnition 5.1 The different values of an independent variable used in regression are called its levels.
For a quantitative variable, the levels correspond to the numerical values it
assumes. For example, if the number of defects in a product ranges from 0 to 3, the
independent variable assumes four levels: 0, 1, 2, and 3.
The levels of a qualitative variable are not numerical. They can be deﬁned only
by describing them. For example, the independent variable style of packaging was
observed at three levels: A, B, and C.
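The idea of levels can be sketched in a few lines of Python. The observations below are hypothetical, and `variable_type` is only a crude rule of thumb introduced here for illustration: it would misclassify a qualitative variable whose categories happen to be coded as numbers.

```python
def levels(values):
    """Return the distinct observed values (levels) of a variable, sorted."""
    return sorted(set(values))

def variable_type(values):
    """Rule of thumb: all-numeric values -> quantitative, otherwise qualitative."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_numeric else "qualitative"

defects = [0, 2, 1, 3, 0, 1]            # hypothetical observations
packaging = ["A", "C", "B", "A", "B"]   # hypothetical observations

print(variable_type(defects), levels(defects))
print(variable_type(packaging), levels(packaging))
```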
Example 5.1
In Chapter 4, we considered the problem of predicting executive salary as a function
of several independent variables. Consider the following four independent variables
that may affect executive salaries:
(a) Years of experience
(b) Gender of the employee
(c) Firm’s net asset value
(d) Rank of the employee
For each of these independent variables, give its type and describe the nature of the
levels you would expect to observe.
Solution
(a) The independent variable for the number of years of experience is quantitative,
since its values are numerical. We would expect to observe levels ranging from
0 to 40 (approximately) years.
(b) The independent variable for gender is qualitative, since its levels can only be
described by the nonnumerical labels “female” and “male.”
(c) The independent variable for the ﬁrm’s net asset value is quantitative, with
a very large number of possible levels corresponding to the range of dollar
values representing various ﬁrms’ net asset values.
(d) Suppose the independent variable for the rank of the employee is observed at
three levels: supervisor, assistant vice president, and vice president. Since we
cannot assign a realistic measure of relative importance to each position, rank
is a qualitative independent variable.
Quantitative and qualitative independent variables are treated differently in
regression modeling. In the next section, we see how quantitative variables are
entered into a regression model.
5.2 Exercises
5.1 Buy-side versus sell-side analysts’ earnings forecasts. The Financial Analysts Journal (July/August
2008) published a study comparing the earnings
forecasts of buy-side and sell-side analysts. A team
of Harvard Business School professors used regression to model the relative optimism (y) of the
analysts’ 3-month horizon forecasts based on the following independent variables. Determine the type
(quantitative or qualitative) of each variable.
(a) Whether the analyst worked for a buy-side ﬁrm
or a sell-side ﬁrm.
(b) Number of days between forecast and ﬁscal
year-end (i.e., forecast horizon).
(c) Number of quarters the analyst had worked
with the ﬁrm.
5.2 Workplace bullying and intention to leave. Workplace bullying (e.g., work-related harassment,
persistent criticism, withholding key information,
spreading rumors, intimidation) has been shown
to have a negative psychological effect on victims, often leading the victim to quit or resign.
In Human Resource Management Journal (October
2008), researchers employed multiple regression to
model bullying victims’ intention to leave the ﬁrm
as a function of perceived organizational support
and level of workplace bullying. The dependent
variable in the analysis, intention to leave (y), was
measured on a quantitative scale. Identify the type
(qualitative or quantitative) of the two key independent variables in the study, level of bullying
(measured on a 50-point scale) and perceived organizational support (measured as “low,” “neutral,”
or “high”).
5.3 Expert testimony in homicide trials of battered
women. The Duke Journal of Gender Law and
Policy (Summer 2003) examined the impact of
expert testimony on the outcome of homicide trials
that involve battered woman syndrome. Multiple
regression was employed to model the likelihood
of changing a verdict from not guilty to guilty after
deliberations, y, as a function of juror gender and
whether or not expert testimony was given. Identify
the independent variables in the model as quantitative or qualitative.
5.4 Chemical composition of rain water. The Journal of Agricultural, Biological, and Environmental
Statistics (March 2005) presented a study of the
chemical composition of rain water. The nitrate
concentration, y (milligrams per liter), in a rain
water sample was modeled as a function of two
independent variables: water source (groundwater,
subsurface ﬂow, or overground ﬂow) and silica concentration (milligrams per liter). Identify the type
(quantitative or qualitative) for each independent
variable.
5.5 Psychological response of ﬁreﬁghters. The Journal of Human Stress (Summer 1987) reported on
a study of “psychological response of firefighters
to chemical fire.” The researchers used multiple
regression to predict emotional distress as a function of the following independent variables. Identify
each independent variable as quantitative or qualitative. For qualitative variables, suggest several
levels that might be observed. For quantitative variables, give a range of values (levels) for which the
variable might be observed.
(a) Number of preincident psychological symptoms
(b) Years of experience
(c) Cigarette smoking behavior
(d) Level of social support
(e) Marital status
(f) Age
(g) Ethnic status
(h) Exposure to a chemical fire
(i) Education level
(j) Distance lived from site of incident
(k) Gender
5.6 Modeling a qualitative response. Which of the
assumptions about ε (Section 4.2) prohibit the use
of a qualitative variable as a dependent variable?
(We present a technique for modeling a qualitative
dependent variable in Chapter 9.)
5.3 Models with a Single Quantitative Independent Variable
To write a prediction equation that provides a good model for a response (one that
will eventually yield good predictions), we have to know how the response might
vary as the levels of an independent variable change. Then we have to know how
to write a mathematical equation to model it. To illustrate (with a simple example),
suppose we want to model a student’s score on a statistics exam, y, as a function of
the single independent variable x, the amount of study time invested. It may be that
exam score, y, increases in a straight line as the amount of study time, x, varies from
1 hour to 6 hours, as shown in Figure 5.1a. If this were the entire range of x-values
for which you wanted to predict y, the model
E(y) = β0 + β1 x
would be appropriate.
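Figure 5.1 Modeling exam score, y, as a function of study time, x: (a) straight-line relationship for 1 to 6 hours of study; (b) diminishing returns at longer study times (dashed line)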
Now, suppose you want to expand the range of values of x to x = 8 or x = 10
hours of studying. Will the straight-line model
E(y) = β0 + β1 x
be satisfactory? Perhaps, but making this assumption could be risky. As the amount
of studying, x, is increased, sooner or later the point of diminishing returns will
be reached. That is, the increase in exam score for a unit increase in study time
will decrease, as shown by the dashed line in Figure 5.1b. To produce this type of
curvature, you must know the relationship between models and graphs, and how
types of terms will change the shape of the curve.
A response that is a function of a single quantitative independent variable can
often be modeled by the ﬁrst few terms of a polynomial algebraic function. The
equation relating the mean value of y to a polynomial of order p in one independent
variable x is shown in the box.
A pth-Order Polynomial with One Independent Variable
E(y) = β0 + β1 x + β2 x² + β3 x³ + · · · + βp x^p
where p is an integer and β0, β1, . . . , βp are unknown parameters that must be
estimated.
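For readers who want to evaluate such a polynomial numerically, here is a minimal Python sketch. The function name `poly_mean` and the coefficient values in the example call are our own illustrative choices; the function evaluates E(y) for any order p using Horner's rule, which avoids computing powers of x explicitly.

```python
def poly_mean(betas, x):
    """Evaluate E(y) = b0 + b1*x + ... + bp*x**p using Horner's rule.

    betas is the list [b0, b1, ..., bp]; its length minus one is the order p.
    """
    result = 0.0
    for b in reversed(betas):
        result = result * x + b
    return result

# Second-order example with hypothetical coefficients: 1 + 2x + 3x^2 at x = 2.
print(poly_mean([1.0, 2.0, 3.0], 2.0))
```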
As we mentioned in Chapters 3 and 4, a ﬁrst-order polynomial in x (i.e., p = 1),
E(y) = β0 + β1 x
graphs as a straight line. The β interpretations of this model are provided in the
next box.
First-Order (Straight-Line) Model with One Independent Variable
E(y) = β0 + β1 x
Interpretation of model parameters
β0 : y-intercept; the value of E(y) when x = 0
β1 : Slope of the line; the change in E(y) for a 1-unit increase in x
In Chapter 4 we also covered a second-order polynomial model (p = 2), called
a quadratic. For convenience, the model is repeated in the following box.
A Second-Order (Quadratic) Model with One Independent Variable
E(y) = β0 + β1 x + β2 x²
where β0, β1, and β2 are unknown parameters that must be estimated.
Interpretation of model parameters
β0 : y-intercept; the value of E(y) when x = 0
β1 : Shift parameter; changing the value of β1 shifts the parabola to the right
or left (the vertex lies at x = −β1/(2β2), so increasing the value of β1 shifts the
parabola to the right when β2 < 0 and to the left when β2 > 0)
β2 : Rate of curvature
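The horizontal position of the parabola can be checked directly. Setting the slope β1 + 2β2 x equal to zero gives the x-coordinate of the vertex, x = −β1/(2β2). The short Python sketch below (with hypothetical coefficient values and a helper name, `vertex_x`, introduced here) shows how the shift produced by a change in β1 depends on the sign of β2.

```python
def vertex_x(b1, b2):
    """x-coordinate of the vertex of E(y) = b0 + b1*x + b2*x**2."""
    if b2 == 0:
        raise ValueError("b2 must be nonzero for a parabola")
    return -b1 / (2.0 * b2)

# Parabola opening downward (b2 < 0): increasing b1 moves the vertex right.
print(vertex_x(2.0, -1.0), vertex_x(4.0, -1.0))

# Parabola opening upward (b2 > 0): increasing b1 moves the vertex left.
print(vertex_x(2.0, 1.0), vertex_x(4.0, 1.0))
```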
Graphs of two quadratic models are shown in Figure 5.2. As we learned in
Chapter 4, the quadratic model is the equation of a parabola that opens either
upward, as in Figure 5.2a, or downward, as in Figure 5.2b. If the coefficient of x² is
positive, it opens upward; if it is negative, it opens downward. The parabola may be
shifted upward or downward, left or right. The least squares procedure uses only
the portion of the parabola that is needed to model the data. For example, if you ﬁt
a parabola to the data points shown in Figure 5.3, the portion shown as a solid curve
passes through the data points. The outline of the unused portion of the parabola is
indicated by a dashed curve.
Figure 5.2 Graphs for two second-order polynomial models: (a) parabola opening upward; (b) parabola opening downward
Figure 5.3 Example of the use of a quadratic model (E(y) plotted against x, showing the true relationship and the second-order model)
Figure 5.3 illustrates an important limitation on the use of prediction equations:
The model is valid only over the range of x-values that were used to ﬁt the model.
For example, the response might rise, as shown in the ﬁgure, until it reaches a
plateau. The second-order model might ﬁt the data very well over the range of
x-values shown in Figure 5.3, but would provide a very poor ﬁt if data were collected
in the region where the parabola turns downward.
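One simple safeguard is to check whether a new x-value lies inside the range used to fit the model before trusting a prediction. The Python sketch below is a minimal illustration; the helper name `within_fitted_range` and the study-time values are hypothetical.

```python
def within_fitted_range(x_new, x_fitted):
    """True if x_new lies inside the range of x-values used to fit the model."""
    return min(x_fitted) <= x_new <= max(x_fitted)

study_hours = [1, 2, 3, 4, 5, 6]   # hypothetical x-values used in the fit

for x_new in (4.5, 10):
    if within_fitted_range(x_new, study_hours):
        print(x_new, "is inside the fitted range")
    else:
        print(x_new, "is outside the fitted range; prediction would be extrapolation")
```

A check like this does not make a model correct inside the range, but it flags the cases where the fitted curve has no data to support it.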
How do you decide the order of the polynomial you should use to model a
response if you have no prior information about the relationship between E(y) and
x? If you have data, construct a scatterplot of the data points, and see whether you
can deduce the nature of a good approximating function. A pth-order polynomial,
when graphed, will exhibit at most (p − 1) peaks, troughs, or reversals in direction. Note that
the graphs of the second-order model shown in Figure 5.2 each have (p − 1) = 1
peak (or trough). Likewise, a third-order model (shown in the box) can have up to
(p − 1) = 2 peaks or troughs, as illustrated in Figure 5.4.
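The link between the order of a polynomial and its reversals in direction can be explored numerically. The Python sketch below is a rough grid-based check rather than an exact root finder: it counts sign changes in the slope of a polynomial over an interval. The function name `count_turning_points` and the example coefficients are our own. It also shows that (p − 1) is an upper limit, since the cubic x³ has no reversal at all.

```python
def count_turning_points(betas, lo, hi, steps=10000):
    """Count sign changes of the slope of b0 + b1*x + ... + bp*x**p on [lo, hi]."""
    # Coefficients of the derivative: [1*b1, 2*b2, ..., p*bp].
    deriv = [k * b for k, b in enumerate(betas)][1:]

    def slope(x):
        result = 0.0
        for d in reversed(deriv):      # Horner's rule
            result = result * x + d
        return result

    changes = 0
    previous_positive = None
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        s = slope(x)
        if s == 0.0:                   # skip flat points; they carry no sign
            continue
        positive = s > 0
        if previous_positive is not None and positive != previous_positive:
            changes += 1
        previous_positive = positive
    return changes

print(count_turning_points([1.0, 0.0, 1.0], -2.0, 2.0))        # quadratic 1 + x^2
print(count_turning_points([0.0, -3.0, 0.0, 1.0], -3.0, 3.0))  # cubic x^3 - 3x
print(count_turning_points([0.0, 0.0, 0.0, 1.0], -3.0, 3.0))   # cubic x^3
```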
The graphs of most responses as a function of an independent variable x are, in
general, curvilinear. Nevertheless, if the rate of curvature of the response curve is
very small over the range of x that is of interest to you, a straight line might provide
an excellent ﬁt to the response data and serve as a very useful prediction equation.
If the curvature is expected to be pronounced, you should try a second-order model.
Third- or higher-order models would be used only where you expect more than one
reversal in the direction of the curve. These situations are rare, except where the
response is a function of time. Models for forecasting over time are presented in
Chapter 10.
Figure 5.4 Graphs of two third-order polynomial models (panels a and b)
Third-Order Model with One Independent Variable
E(y) = β0 + β1 x + β2 x² + β3 x³
Interpretation of model parameters
β0 : y-intercept; the value of E(y) when x = 0
β1 : Shift parameter (shifts the polynomial right or left on the x-axis)
β2 : Rate of curvature
β3 : The magnitude of β3 controls the rate of reversal of curvature for the
polynomial
Example 5.2
To operate efﬁciently, power companies must be able to predict the peak power load
at their various stations. Peak power load is the maximum amount of power that
must be generated each day to meet demand. A power company wants to use daily
high temperature, x, to model daily peak power load, y, during the summer months
when demand is greatest. Although the company expects peak load to increase as
the temperature increases, the rate of increase in E(y) might not remain constant
as x increases. For example, a 1-unit increase in high temperature from 100°F to
101°F might result in a larger increase in power demand than would a 1-unit increase
from 80°F to 81°F. Therefore, the company postulates that the model for E(y) will
include a second-order (quadratic) term and, possibly, a third-order (cubic) term.
A random sample of 25 summer days is selected and both the peak load
(measured in megawatts) and high temperature (in degrees) recorded for each day.
The data are listed in Table 5.1.
(a) Construct a scatterplot for the data. What type of model is suggested
by the plot?
(b) Fit the third-order model, E(y) = β0 + β1 x + β2 x² + β3 x³, to the data. Is there
evidence that the cubic term, β3 x³, contributes information for the prediction
of peak power load? Test at α = .05.
(c) Fit the second-order model, E(y) = β0 + β1 x + β2 x², to the data. Test the
hypothesis that the power load increases at an increasing rate with temperature.
Use α = .05.
(d) Give the prediction equation for the second-order model, part c. Are you
satisﬁed with using this model to predict peak power loads?
POWERLOADS
Table 5.1 Power load data
Temperature  Peak Load     Temperature  Peak Load     Temperature  Peak Load
(°F)         (megawatts)   (°F)         (megawatts)   (°F)         (megawatts)
94           136.0         106          178.2         76           100.9
96           131.7         67           101.6         68           96.3
95           140.7         71           92.5          92           135.1
108          189.3         100          151.9         100          143.6
67           96.5          79           106.2         85           111.4
88           116.4         97           153.2         89           116.5
89           118.5         98           150.1         74           103.9
84           113.4         87           114.7         86           105.1
90           132.0
Solution
(a) The scatterplot of the data, produced using MINITAB, is shown in Figure 5.5.
The nonlinear, upward-curving trend indicates that a second-order model
would likely ﬁt the data well.
(b) The third-order model is ﬁt to the data using MINITAB and the resulting
printout is shown in Figure 5.6. The p-value for testing
H0 : β3 = 0
Ha : β3 ≠ 0
highlighted on the printout is .911. Since this value exceeds α = .05, there is
insufﬁcient evidence of a third-order relationship between peak load and high
temperature. Consequently, we will drop the cubic term, β3 x³, from the model.
Figure 5.5 MINITAB scatterplot for power load data
Figure 5.6 MINITAB output for third-order model of power load
Figure 5.7 MINITAB output for second-order model of power load
(c) The second-order model is ﬁt to the data using MINITAB and the resulting
printout is shown in Figure 5.7. For this quadratic model, if β2 is positive, then
the peak power load y increases at an increasing rate with temperature x.
Consequently, we test
H0 : β2 = 0
Ha : β2 > 0
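The test statistic, t = 7.93, and two-tailed p-value are both highlighted in
Figure 5.7. Because t is positive, in the direction of Ha, the one-tailed p-value is
half the two-tailed value, p = 0/2 = 0 (to the precision shown on the printout).
Since this is less than α = .05, we
reject H0 and conclude that peak power load increases at an increasing rate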
with temperature.
(d) The prediction equation for the quadratic model, highlighted in Figure 5.7,
is ŷ = 385 − 8.29x + .0598x². The adjusted R² and standard deviation for the
model, also highlighted, are Rₐ² = .956 and s = 5.376. These values imply
that (1) more than 95% of the sample variation in peak power loads can be
explained by the second-order model, and (2) the model can predict peak
load to within about 2s = 10.75 megawatts of its true value. Based on this
high value of Rₐ² and reasonably small value of 2s, we recommend using this
equation to predict peak power loads for the company.
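Readers without MINITAB can reproduce the second-order fit with a short script. The Python sketch below is one way to do it using only the standard library; it is not the MINITAB procedure. It assumes the 25 (temperature, peak load) pairs are transcribed correctly from Table 5.1, and the helper names `solve` and `fit_quadratic` are our own. It solves the least squares normal equations and then computes s, adjusted R², and the t statistic for β2, which can be compared with the values quoted from Figure 5.7.

```python
import math

# Temperature (deg F) and peak load (megawatts), transcribed from Table 5.1.
TEMPS = [94, 106, 76, 96, 67, 68, 95, 71, 92, 108, 100, 100, 67,
         79, 85, 88, 97, 89, 89, 98, 74, 84, 87, 86, 90]
LOADS = [136.0, 178.2, 100.9, 131.7, 101.6, 96.3, 140.7, 92.5, 135.1,
         189.3, 151.9, 143.6, 96.5, 106.2, 111.4, 116.4, 153.2, 116.5,
         118.5, 150.1, 103.9, 113.4, 114.7, 105.1, 132.0]

def solve(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z

def fit_quadratic(x, y):
    """Least squares fit of E(y) = b0 + b1*x + b2*x**2 via the normal equations."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    betas = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(betas, r)) for r in rows]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    n = len(y)
    s = math.sqrt(sse / (n - 3))                     # estimated standard deviation
    ybar = sum(y) / n
    ss_total = sum((yi - ybar) ** 2 for yi in y)
    r2_adj = 1.0 - (sse / (n - 3)) / (ss_total / (n - 1))
    c22 = solve(xtx, [0.0, 0.0, 1.0])[2]             # (X'X)^-1 entry for b2
    t_b2 = betas[2] / (s * math.sqrt(c22))           # t statistic for H0: b2 = 0
    return betas, s, r2_adj, t_b2

betas, s, r2_adj, t_b2 = fit_quadratic(TEMPS, LOADS)
print("b0, b1, b2 =", [round(b, 4) for b in betas])
print("s =", round(s, 3), " adjusted R^2 =", round(r2_adj, 3),
      " t(b2) =", round(t_b2, 2))
```

Solving the normal equations directly is adequate for a quadratic in one variable, but it becomes numerically fragile for higher orders; statistical software typically uses more stable matrix decompositions.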
5.3 Exercises
5.7 Order of polynomials. The accompanying graphs depict pth-order polynomials for one independent
variable.
(a) For each graph, identify the order of the polynomial.
(b) Using the parameters β0, β1, β2, etc., write an
appropriate model relating E(y) to x for each
graph.
(c) The signs (+ or −) of many of the parameters
in the models of part b can be determined by
examining the graphs. Give the signs of those
parameters that can be determined.
5.8 Cooling method for gas turbines. The Journal of
Engineering for Gas Turbines and Power (January
2005) published a study of a high-pressure inlet
fogging method for a gas turbine engine. The heat
rate (kilojoules per kilowatt per hour) was measured for each in a sample of 67 gas turbines
augmented with high-pressure inlet fogging. In
addition, several other variables were measured,
including cycle speed (revolutions per minute),
inlet temperature (°C), exhaust gas temperature
(°C), cycle pressure ratio, and air mass flow rate
(kilograms per second). The data are saved in the
GASTURBINE file. (The first and last five observations are listed in the table.) Consider using
these variables as predictors of heat rate (y) in a
regression model. Construct scatterplots relating
heat rate to each of the independent variables.
Based on the graphs, hypothesize a polynomial
model relating y to each independent variable.
GASTURBINE (Data for first and last five gas turbines shown)
RPM     CPRATIO  INLET-TEMP  EXH-TEMP  AIRFLOW  HEATRATE
27245   9.2      1134        602       7        14622
14000   12.2     950         446       15       13196
17384   14.8     1149        537       20       11948
11085   11.8     1024        478       27       11289
14045   13.2     1149        553       29       11964
...
18910   14.0     1066        532       8        12766
3600    35.0     1288        448       152      8714
3600    20.0     1160        456       84       9469
16000   10.6     1232        560       14       11948
14600   13.4     1077        536       20       12414
Source: Bhargava, R., and Meher-Homji, C. B. “Parametric analysis of existing gas
turbines with inlet evaporative and overspray fogging,” Journal of Engineering for Gas
Turbines and Power, Vol. 127, No. 1, Jan. 2005.
5.9 Study of tree frogs. The optomotor responses of
tree frogs were studied in the Journal of Experimental Zoology (September 1993). Microspectrophotometry was used to measure the threshold
quantal flux (the light intensity at which the optomotor response was first observed) of tree frogs
tested at different spectral wavelengths. The data
revealed the relationship between the log of quantal flux (y) and wavelength (x) shown in the graph
on p. 271. Hypothesize a model for E(y) that corresponds to the graph.
5.10 Tire wear and pressure. Underinflated or overinflated tires can increase tire wear and decrease gas
mileage. A new tire was tested for wear at different
pressures with the results shown in the table.
TIRES2
PRESSURE                     MILEAGE
x, pounds per square inch    y, thousands
30                           29
31                           32
32                           36
33                           38
34                           37
35                           33
36                           26
(a) Graph the data in a scatterplot.
(b) If you were given the information for x = 30,
31, 32, and 33 only, what kind of model would
you suggest? For x = 33, 34, 35, and 36? For
all the data?
5.11 Assembly times and fatigue. A company is considering having the employees on its assembly line
work 4 10-hour days instead of 5 8-hour days. Management is concerned that the effect of fatigue
as a result of longer afternoons of work might
increase assembly times to an unsatisfactory level.
An experiment with the 4-day week is planned in
which time studies will be conducted on some of the
workers during the afternoons. It is believed that
an adequate model of the relationship between
assembly time, y, and time since lunch, x, should
allow for the average assembly time to decrease
for a while after lunch before it starts to increase
as the workers become tired. Write a model
relating E(y) and x that would reflect the management’s belief, and sketch the hypothesized shape of
the model.
5.4 First-Order Models with Two or More Quantitative Independent Variables
Like models for a single independent variable, models with two or more independent variables are classiﬁed as ﬁrst-order, second-order, and so forth, but it
is difﬁcult (most often impossible) to graph the response because the plot is in a
multidimensional space. For example, with one quantitative independent variable,
Figure 5.8 Response surface for first-order model with two quantitative independent variables (axes: y, x1, x2)