A.2  MEANS, VARIANCES, COVARIANCES, AND CORRELATIONS

For example, suppose all the ui have the same expected value and we write
E(ui) = μ, i = 1, . . . , n. The sample mean of the ui is ū = ∑ui/n = ∑(1/n)ui, and
the expected value of the sample mean is

E(\bar{u}) = E\left(\frac{1}{n}\sum u_i\right) = \frac{1}{n}\sum E(u_i) = \frac{1}{n}(n\mu) = \mu
We say that ū is an unbiased estimate of the population mean μ, since its
expected value is μ.
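
As an informal check of this result, the short simulation below (a sketch using numpy; the sample size, mean, and number of replications are arbitrary choices, not from the text) draws many samples and compares the average of the sample means with μ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 5.0, 25, 100_000      # arbitrary illustrative values

# Each row is one sample of n independent draws, all with mean mu
samples = rng.normal(loc=mu, scale=2.0, size=(reps, n))
ubar = samples.mean(axis=1)         # the sample mean of each replication

print(ubar.mean())                  # close to mu = 5.0, illustrating E(ubar) = mu
```
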
A.2.2  Variance and Var Notation
The symbol Var(ui) denotes the variance of ui. The variance is defined by the
equation Var(ui) = E[ui − E(ui)]², the expected squared difference between an
observed value for ui and its mean value. The larger Var(ui), the more variable
observed values for ui are likely to be. The symbol σ² is often used for a
variance, or σu² might be used for the variance of the identically distributed ui if
several variances are being discussed. The square root of a variance, often σ
or σu, is the standard deviation, and it is in the same units as the random
variable ui. For example, if the ui are heights in centimeters, then the units
of σu are also centimeters. The units of σu² are cm², which can be much harder
to interpret.
The general rule for the variance of a sum of uncorrelated random
variables is

Var\left(a_0 + \sum a_i u_i\right) = \sum a_i^2 Var(u_i)    (A.2)

The a0 term vanishes because the variance of a constant is 0. Assuming that
Var(ui) = σ², we can find the variance of the sample mean of independent,
identically distributed ui:

Var(\bar{u}) = Var\left(\frac{1}{n}\sum u_i\right) = \frac{1}{n^2}\sum Var(u_i) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}

The standard deviation of a sum is found by computing the variance of the
sum and then taking a square root.
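
The two results above are easy to verify numerically. The sketch below (with arbitrary weights, variance, and simulation sizes) compares (A.2) and the σ²/n formula for the sample mean against Monte Carlo estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma = 10, 200_000, 3.0
a0 = 2.0
a = rng.uniform(-1, 1, size=n)               # arbitrary fixed coefficients

u = rng.normal(0.0, sigma, size=(reps, n))   # independent, each Var(u_i) = sigma^2

# Variance of a linear combination of uncorrelated variables, formula (A.2)
lin_comb = a0 + u @ a
print(lin_comb.var(), (a ** 2).sum() * sigma ** 2)

# Variance of the sample mean: sigma^2 / n
ubar = u.mean(axis=1)
print(ubar.var(), sigma ** 2 / n)
```
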
A.2.3  Covariance and Correlation
The symbol Cov(ui, uj) denotes the covariance between the random variables
ui and uj and is also an expected value, defined by the equation

Cov(u_i, u_j) = E\{[u_i - E(u_i)][u_j - E(u_j)]\} = Cov(u_j, u_i)


The covariance describes the way two random variables vary jointly. If the
two variables are independent, then Cov(ui, uj) = 0, but zero correlation does
not imply independence. The variance is a special case of covariance, since
Cov(ui, ui) = Var(ui).
When covariance is nonzero, common language is to say that two variables
are correlated. Formally, the correlation coefficient is defined by

\rho(u_i, u_j) = \frac{Cov(u_i, u_j)}{\sqrt{Var(u_i)\,Var(u_j)}}

The correlation does not depend on units of measurement and has a value
between −1 and 1, with ρ(ui, uj) = 0 only if Cov(ui, uj) = 0.
The rule for covariances is

Cov(a_0 + a_1 u_1,\; a_3 + a_2 u_2) = a_1 a_2 Cov(u_1, u_2)

It is left as an exercise to show that, provided a_1 and a_2 have the same sign,

\rho(a_0 + a_1 u_1,\; a_3 + a_2 u_2) = \rho(u_1, u_2)

so the unit-free correlation coefficient does not change if the random variables
are rescaled or centered.
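
A brief numerical illustration of these covariance and correlation rules (a sketch; the coefficients and simulated data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 200_000
u1 = rng.normal(size=reps)
u2 = 0.5 * u1 + rng.normal(size=reps)        # correlated with u1

a0, a1, a3, a2 = 4.0, 2.5, -1.0, 3.0         # arbitrary constants, both slopes positive

cov = lambda x, y: np.cov(x, y)[0, 1]
corr = lambda x, y: np.corrcoef(x, y)[0, 1]

# Cov(a0 + a1*u1, a3 + a2*u2) = a1*a2*Cov(u1, u2)
print(cov(a0 + a1 * u1, a3 + a2 * u2), a1 * a2 * cov(u1, u2))

# Correlation is unchanged by rescaling and centering (same-sign slopes)
print(corr(a0 + a1 * u1, a3 + a2 * u2), corr(u1, u2))
```
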
The general form for the variance of a linear combination of random variables is

Var\left(a_0 + \sum_{i=1}^{n} a_i u_i\right) = \sum_{i=1}^{n} a_i^2 Var(u_i) + 2\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} a_i a_j Cov(u_i, u_j)    (A.3)
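
As a numerical check of (A.3) with correlated variables, the sketch below (with an arbitrary covariance matrix and coefficients, chosen only for illustration) compares the formula with a Monte Carlo variance.

```python
import numpy as np

rng = np.random.default_rng(3)
reps = 300_000
a0 = 1.0
a = np.array([1.0, -2.0, 0.5])               # arbitrary coefficients

# Correlated variables drawn from a chosen covariance matrix Sigma
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
u = rng.multivariate_normal(np.zeros(3), Sigma, size=reps)

# Right-hand side of (A.3): variance terms plus twice the pairwise covariances
n = len(a)
rhs = sum(a[i] ** 2 * Sigma[i, i] for i in range(n))
rhs += 2 * sum(a[i] * a[j] * Sigma[i, j]
               for i in range(n - 1) for j in range(i + 1, n))

lhs = (a0 + u @ a).var()                     # Monte Carlo left-hand side
print(lhs, rhs)
```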

A.2.4  Conditional Moments
Throughout the book, we use notation like E(Y|X) or E(Y|X = x) to denote
the mean of the random variable Y in the population for which the value of X
is fixed. Similarly, Var(Y|X) or Var(Y|X  =  x) is the variance of the random
variable Y in the population for which X is fixed.
There are simple relationships between the conditional mean and variance
of Y given X and the unconditional mean and variance (Casella and Berger,
2001):


E(Y) = E[E(Y|X)]    (A.4)

Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]    (A.5)

For example, suppose that when we condition on the predictor X we
have a simple linear regression mean function with constant variance,
E(Y|X = x) = β0 + β1x, Var(Y|X = x) = σ². In addition, suppose the unconditional
moments of the predictor are E(X) = μx and Var(X) = τx². Then for the unconditional
random variable Y,

E(Y) = E[E(Y|X = x)] = E[\beta_0 + \beta_1 x] = \beta_0 + \beta_1 \mu_x

Var(Y) = E[Var(Y|X = x)] + Var[E(Y|X = x)] = E[\sigma^2] + Var[\beta_0 + \beta_1 x] = \sigma^2 + \beta_1^2 \tau_x^2

The expected value of the unconditional variable Y is obtained by substituting
the expected value of the unconditional variable X into the conditional
expected value formula, and the unconditional variance of Y equals the
conditional variance plus an additional quantity that depends on both β1² and
on τx².
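
A short simulation of (A.4) and (A.5) under this simple linear regression setup (a sketch; the values of β0, β1, σ, μx, and τx are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 500_000
b0, b1, sigma = 1.0, 2.0, 1.5          # illustrative regression parameters
mu_x, tau_x = 3.0, 0.8                 # unconditional moments of X

x = rng.normal(mu_x, tau_x, size=reps)
y = b0 + b1 * x + rng.normal(0.0, sigma, size=reps)   # E(Y|X=x) = b0 + b1*x, Var = sigma^2

# (A.4): E(Y) = b0 + b1*mu_x;  (A.5): Var(Y) = sigma^2 + b1^2 * tau_x^2
print(y.mean(), b0 + b1 * mu_x)
print(y.var(), sigma ** 2 + b1 ** 2 * tau_x ** 2)
```
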
A.3  LEAST SQUARES FOR SIMPLE REGRESSION
The ols estimates of β0 and β1 in simple regression are the values that minimize
the residual sum of squares function,
RSS(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2    (A.6)

One method of finding the minimizer is to differentiate with respect to β0 and
β1, set the derivatives equal to 0, and solve
\frac{\partial RSS(\beta_0, \beta_1)}{\partial \beta_0} = -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0

\frac{\partial RSS(\beta_0, \beta_1)}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0

Upon rearranging terms, we get

\beta_0 n + \beta_1 \sum x_i = \sum y_i

\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i    (A.7)

Equations (A.7) are called the normal equations for the simple linear regression
model (2.1). The normal equations depend on the data only through the
sufficient statistics ∑xi, ∑yi, ∑xi², and ∑xi yi. Using the formulas

SXX = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2

SXY = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}    (A.8)

equivalent and numerically more stable sufficient statistics are given by
x̄, ȳ, SXX, and SXY. Solving (A.7), we get

\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \hat{\beta}_1 = \frac{SXY}{SXX}    (A.9)
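
The computation in (A.8) and (A.9) is easy to carry out directly. The following is a minimal sketch (my own illustrative function and data, not code from the book) that computes β̂0 and β̂1 from x̄, ȳ, SXX, and SXY.

```python
import numpy as np

def ols_simple(x, y):
    """Least squares estimates for simple regression via (A.8) and (A.9)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    sxy = np.sum((x - xbar) * (y - ybar))
    b1 = sxy / sxx                      # slope, (A.9)
    b0 = ybar - b1 * xbar               # intercept, (A.9)
    return b0, b1

# Small example with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
print(ols_simple(x, y))                 # roughly (0.05, 1.99)
```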

A.4  MEANS AND VARIANCES OF LEAST
SQUARES ESTIMATES
The least squares estimates are linear combinations of the observed values
y1, . . . , yn of the response, so we can apply the results of Appendix A.2 to the
estimates found in Appendix A.3 to get the means, variances, and covariances
of the estimates. Assume the simple regression model (2.1) is correct. The
estimator β̂1 given at (A.9) can be written as β̂1 = ∑ci yi, where for each i,
ci = (xi − x̄)/SXX. Since we are conditioning on the values of X, the ci are fixed
numbers. By (A.1),

E(\hat{\beta}_1|X) = E\left(\sum c_i y_i \,\Big|\, X = x\right) = \sum c_i E(y_i|X = x)
= \sum c_i (\beta_0 + \beta_1 x_i)
= \beta_0 \sum c_i + \beta_1 \sum c_i x_i
By direct summation, ∑ ci = 0 and ∑ ci xi = 1, giving
E(\hat{\beta}_1|X) = \beta_1

which shows that β̂1 is unbiased for β1. A similar computation will show that
β̂0 is an unbiased estimate of β0.
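
A quick numerical check of the identities ∑ci = 0 and ∑ci xi = 1 used above (a sketch with arbitrary made-up x values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # any fixed predictor values
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
c = (x - xbar) / sxx                       # c_i = (x_i - xbar)/SXX

print(c.sum())          # 0 (up to rounding)
print((c * x).sum())    # 1
```
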
Since the yi are assumed independent, the variance of β̂1 is found by an
application of (A.2),

Var(\hat{\beta}_1|X) = Var\left(\sum c_i y_i \,\Big|\, X = x\right)
= \sum c_i^2 Var(y_i|X = x)
= \sigma^2 \sum c_i^2
= \sigma^2 / SXX


This computation also used ∑ci² = ∑(xi − x̄)²/SXX² = 1/SXX. Computing the
variance of β̂0 requires an application of (A.3). We write

Var(\hat{\beta}_0|X) = Var(\bar{y} - \hat{\beta}_1 \bar{x}\,|\,X)
= Var(\bar{y}|X) + \bar{x}^2 Var(\hat{\beta}_1|X) - 2\bar{x}\,Cov(\bar{y}, \hat{\beta}_1|X)    (A.10)

To complete this computation, we need to compute the covariance,
Cov(\bar{y}, \hat{\beta}_1|X) = Cov\left(\frac{1}{n}\sum y_i,\ \sum c_j y_j \,\Big|\, X\right)
= \frac{1}{n}\sum_i \sum_j c_j\,Cov(y_i, y_j|X)
= \frac{\sigma^2}{n}\sum c_i
= 0
because the yi are independent and ∑ ci = 0. Substituting into (A.10) and
simplifying,
Var(\hat{\beta}_0|X) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SXX}\right)
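
These variance formulas can be checked by simulation. The sketch below (with invented parameter values; the x values are held fixed across replications to match the conditioning on X) compares Monte Carlo variances of β̂1 and β̂0 with σ²/SXX and σ²(1/n + x̄²/SXX).

```python
import numpy as np

rng = np.random.default_rng(5)
b0, b1, sigma = 1.0, 2.0, 1.5                 # illustrative true values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # fixed design
n, reps = len(x), 20_000
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)

b0_hat = np.empty(reps)
b1_hat = np.empty(reps)
for r in range(reps):
    y = b0 + b1 * x + rng.normal(0.0, sigma, size=n)
    b1_hat[r] = np.sum((x - xbar) * (y - y.mean())) / sxx
    b0_hat[r] = y.mean() - b1_hat[r] * xbar

print(b1_hat.var(), sigma ** 2 / sxx)                     # Var of slope: sigma^2/SXX
print(b0_hat.var(), sigma ** 2 * (1 / n + xbar ** 2 / sxx))
```
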
Finally,
Cov(\hat{\beta}_0, \hat{\beta}_1|X) = Cov(\bar{y} - \hat{\beta}_1 \bar{x},\ \hat{\beta}_1|X)
= Cov(\bar{y}, \hat{\beta}_1|X) - \bar{x}\,Cov(\hat{\beta}_1, \hat{\beta}_1|X)
= 0 - \sigma^2 \frac{\bar{x}}{SXX}
= -\sigma^2 \frac{\bar{x}}{SXX}

Further application of these results gives the variance of a fitted value,
ŷ = β̂0 + β̂1x:

Var(\hat{y}|X = x) = Var(\hat{\beta}_0 + \hat{\beta}_1 x\,|\,X = x)
= Var(\hat{\beta}_0|X = x) + x^2 Var(\hat{\beta}_1|X = x) + 2x\,Cov(\hat{\beta}_0, \hat{\beta}_1|X = x)
= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{SXX}\right) + \sigma^2\frac{x^2}{SXX} - 2\sigma^2\frac{x\bar{x}}{SXX}
= \sigma^2\left(\frac{1}{n} + \frac{(x - \bar{x})^2}{SXX}\right)    (A.11)


A prediction ỹ* at the future value x* is just β̂0 + β̂1x*. The variance of a
prediction consists of the variance of the fitted value at x* given by (A.11) plus
σ², the variance of the error that will be attached to the future value,

Var(\tilde{y}_*|X = x_*) = \sigma^2\left(\frac{1}{n} + \frac{(x_* - \bar{x})^2}{SXX}\right) + \sigma^2
as given by (2.16).
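
In code, (A.11) and the prediction variance are straightforward to evaluate. The sketch below is a minimal illustration (the function name and data are my own; in practice σ² is replaced by an estimate such as σ̂²).

```python
import numpy as np

def fit_and_pred_variance(x, sigma2, x_new):
    """Variance of the fitted value (A.11) and of a prediction at x_new."""
    x = np.asarray(x, float)
    n, xbar = len(x), x.mean()
    sxx = np.sum((x - xbar) ** 2)
    var_fit = sigma2 * (1 / n + (x_new - xbar) ** 2 / sxx)   # (A.11)
    var_pred = var_fit + sigma2                              # add the error variance
    return var_fit, var_pred

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(fit_and_pred_variance(x, sigma2=2.0, x_new=4.5))
```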

A.5  ESTIMATING E(Y|X) USING A SMOOTHER
For a 2D scatterplot of Y versus X, a scatterplot smoother provides an estimate
of the mean function E(Y|X  =  x) as x varies, without making parametric
assumptions about the mean function. Many smoothing methods are used, and
the smoother we use most often in this book is the simplest case of the loess
smoother, Cleveland (1979); see also the first step in Algorithm 6.1.1 in Härdle
(1990, p. 192). This smoother estimates E(Y|X = xg) by ỹg via a weighted least
squares (wls) simple regression, giving more weight to points close to xg than
to points distant from xg. Here is the method:
1. Select a value for a smoothing parameter f, a number between 0 and 1.
Values of f close to 1 will give curves that are too smooth and will be
close to a straight line, while small values of f give curves that are too
rough and match all the wiggles in the data. The value of f must be chosen
to balance the bias of oversmoothing with the variability of undersmoothing. Remarkably, for many problems f  ≈  2/3 is a good choice.
There is a substantial literature on the appropriate ways to estimate a
smoothing parameter for loess and for other smoothing methods, but
for the purposes of using a smoother to help us look at a graph, optimal
choice of a smoothing parameter is not critical.
2. Find the fn closest points to xg. For example, if n = 100, and f = 0.6, then
find the fn = 60 closest points to xg. Every time the value of xg is changed,
the points selected may change.
3. Among these fn nearest neighbors to xg, compute the wls estimates for
the simple regression of Y ∼ X, with weights determined so that points
close to xg have the highest weight, and the weights decline toward 0 for
points farther from xg. We use a triangular weight function that gives
maximum weight to data at xg, and weights that decrease linearly to 0 at
the edge of the neighborhood. If a different weight function is used,
answers are somewhat different.
4. The value of ỹg is the fitted value at xg from the wls regression using
the nearest neighbors found at step 2 as the data, and the weights from
step 3.

5. Repeat steps 1–4 for many values of xg that form a grid of points that
cover the interval on the x-axis of interest. Join the points. A minimal code
sketch of this procedure is given after the discussion of Figure A.1 below.

Figure A.1  Three choices of the smoothing parameter for a loess smooth (f = .1, f = 2/3, f = .95, plus the OLS line), plotting Height (dm) against Diameter at 137 cm above ground. The data used in this plot are discussed in Section 8.1.2.
Figure A.1 shows a plot of Height versus Diameter for western cedar
trees in the Upper Flat Creek data, along with four smoothers. The first
smoother is the ols simple regression line, which does not match the
data well because the mean function for the data in this figure is probably
curved, not straight. The loess smooth with f = 0.1 is, as expected, very
wiggly, matching the local variation rather than the mean. The line for
f = 2/3 seems to match the data very well, while the loess fit for f = .95
is nearly the same as for f = 2/3, although it tends toward oversmoothing and
attempts to match the ols line. We would conclude from this graph that a
straight-line mean function is likely to be inadequate because it does not
match the data very well. Loader (2004) presents a bootstrap-based lack-of-fit
test that compares parametric and nonparametric estimates of the
mean function.
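
The method in steps 1–5 can be sketched in a few lines of code. The following is my own simplified illustration of a nearest-neighbor smoother with triangular weights and local wls lines; it is not the loess implementation used for Figure A.1 and omits refinements such as the robustness iterations of full loess.

```python
import numpy as np

def loess_simple(x, y, f=2/3, num_grid=50):
    """Nearest-neighbor smoother: local WLS lines with triangular weights."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    k = max(2, int(round(f * n)))                   # step 2: use the fn closest points
    grid = np.linspace(x.min(), x.max(), num_grid)  # step 5: a grid of x_g values
    fitted = np.empty(num_grid)
    for g, xg in enumerate(grid):
        dist = np.abs(x - xg)
        idx = np.argsort(dist)[:k]                  # indices of the nearest neighbors
        d = dist[idx]
        h = d.max()
        w = 1.0 - d / h if h > 0 else np.ones(k)    # step 3: triangular weights
        xs, ys = x[idx], y[idx]
        # weighted least squares for the local simple regression of Y on X
        xw = np.average(xs, weights=w)
        yw = np.average(ys, weights=w)
        b1 = np.sum(w * (xs - xw) * (ys - yw)) / np.sum(w * (xs - xw) ** 2)
        b0 = yw - b1 * xw
        fitted[g] = b0 + b1 * xg                    # step 4: fitted value at x_g
    return grid, fitted

# Example with made-up data
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.3, size=200)
grid, smooth = loess_simple(x, y, f=2/3)
```
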
The loess smoother is an example of a nearest neighbor smoother. Local
polynomial regression smoothers and kernel smoothers are similar to loess,
except they give positive weight to all cases within a fixed distance of the point
of interest rather than a fixed number of points. There is a large literature on
nonparametric regression, for which scatterplot smoothing is a primary tool.
Recent references on this subject include Simonoff (1996), Bowman and Azzalini (1997), and Loader (1999).