5
Multiple Regression with a Single Dependent Variable
5.1
Statistical Inference: Least Squares and Maximum Likelihood
5.1.1
The Linear Statistical Model
The dependent variable $y_t$ is modeled as a linear function of K independent
variables:
$$\underset{T \times 1}{y} = \underset{T \times K}{X}\,\underset{K \times 1}{\beta} + \underset{T \times 1}{e} \tag{5.1}$$
where T = number of observations (for example, T periods), X = matrix of
K independent variables, β = vector of K weights applied to each independent
variable k, y = vector of the dependent variable for t = 1 to T, and e = vector of
residuals corresponding to a unique aspect of y that is not explained by X.
It should be noted that X is given, fixed, observed data. X is, in fact, not only
observable but is also measured without error (the case of measurement error is
discussed in Chap. 10). We assume that X is correctly specified. This means that
X contains the proper variables explaining the dependent variable with the proper
functional form (i.e., some of the variables expressed in X may have been
transformed, for example, by taking their logarithm). Finally, the first column of
X is typically a vector where each element is 1. This means that the first element of
the parameter vector β is a parameter that corresponds to a constant term that
applies equally to each value of the dependent variable $y_t$ from t = 1 to T.
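The structure of Eq. (5.1) can be made concrete with a small simulation. The sketch below is only illustrative: the sample size, the regressors, the values in `beta`, and the error scale are all assumed for the example and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

T, K = 50, 3  # T observations, K columns in X (including the constant)
X = np.column_stack([
    np.ones(T),                              # first column of ones -> constant term
    rng.normal(size=T),                      # hypothetical regressor
    np.log(rng.uniform(1.0, 10.0, size=T)),  # a regressor entered in log form
])
beta = np.array([1.0, 2.0, -0.5])  # illustrative weights (assumed values)
e = rng.normal(scale=0.3, size=T)  # part of y not explained by X
y = X @ beta + e                   # Eq. (5.1): y = X beta + e

print(X.shape, beta.shape, y.shape)
```

The column of ones plays the role of the constant term, and the log-transformed column illustrates that the functional form is part of the specification of X.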
5.1.1.1
Error Structure
Some assumptions are needed in order to make some statistical inferences. Not all
the assumptions below are necessarily used. In fact, in Sect. 5.1.4.3, we identify
which assumptions are necessary in order to be able to obtain the specific properties
of the estimators. Because y and X are given data points and β is the parameter
vector on which we want to make inferences, the assumptions can only be on the
unobserved factor e.
Assumption 1: Expected Value of Error Term
$$E[e] = 0 \tag{5.2}$$
Assumption 2: Covariance Matrix of Error Term
Homoscedasticity
Usually, the error terms $e_t$ of the observations are independently and
identically distributed with the same variance:
$$e_t \sim \text{iid} \;\Rightarrow\; E[ee'] = \sigma^2 I_T \tag{5.3}$$
where $I_T$ is the T × T identity matrix.
This means that the variances for each observation t are the same and that the
error terms are uncorrelated. The unknown parameters that need to be estimated
are β and $\sigma^2$.
Heteroscedasticity
More generally,
$$E[ee'] = \sigma^2 \Psi = \Phi \tag{5.4}$$
Note that Φ, a covariance matrix, is a symmetric matrix. Heteroscedasticity
occurs, therefore, when Ψ ≠ I. This occurs if either the diagonal elements of the
matrix Ψ are not identical (each error term $e_t$ has a different variance) or if its
off-diagonal elements are different from zero.
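The three situations described above (Ψ = I, unequal diagonal elements, and nonzero off-diagonal elements) can be sketched as small matrices. The values of `sigma2`, the diagonal entries, and the correlation pattern `rho ** |t - s|` are assumptions chosen only for illustration.

```python
import numpy as np

T = 4
sigma2 = 2.0  # assumed error variance
idx = np.arange(T)

# Homoscedastic case, Eq. (5.3): Psi = I
Psi_homo = np.eye(T)
# Unequal variances: Psi diagonal but not the identity
Psi_diag = np.diag([0.5, 1.0, 1.5, 2.0])
# Correlated errors: nonzero off-diagonal elements (assumed pattern rho**|t-s|)
rho = 0.6
Psi_corr = rho ** np.abs(idx[:, None] - idx[None, :])

results = {}
for name, Psi in [("homo", Psi_homo), ("diag", Psi_diag), ("corr", Psi_corr)]:
    Phi = sigma2 * Psi                      # Eq. (5.4)
    symmetric = np.allclose(Phi, Phi.T)     # a covariance matrix is symmetric
    psi_is_identity = np.allclose(Psi, np.eye(T))
    results[name] = (symmetric, psi_is_identity)
    print(name, symmetric, psi_is_identity)
```

Only the first case satisfies Ψ = I; the other two fall under Eq. (5.4).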
Assumption 3: Normality of Distribution
The probability density function of the error vector can be written formally as per
Eq. (5.5) for the case of homoscedasticity or Eq. (5.6) for the case of
heteroscedasticity:
$$e \sim N(0, \sigma^2 I) \tag{5.5}$$
or
$$e \sim N(0, \Phi) \tag{5.6}$$
5.1.2
Point Estimation
Point estimates are inferences that can be made without the normality assumption
of the distribution of the error term e. The problem can be defined as follows: to find
a suitable function of the observed random variables y, given x, that will yield the
“best” estimate of unknown parameters.
We will restrict the estimator $\hat{\beta}$ to the class of linear functions of y:
$$\underset{K \times 1}{\hat{\beta}} = \underset{K \times T}{A}\,\underset{T \times 1}{y} \tag{5.7}$$
The elements $\{a_{kt}\}$ of the matrix A are scalars that weight each observation; A is
a summarizing operator.
In order to solve the problem defined above, we need to (1) select a criterion,
(2) determine the A matrix and consequently $\hat{\beta}$, and (3) evaluate the sampling
performance of the estimator. These three issues are discussed in the following
sections.
5.1.2.1
Ordinary Least Squares Estimator
We now consider the case of homoscedasticity where
$$\Psi = I_T \tag{5.8}$$
The criterion we use to estimate the “best” parameter is to minimize the sum of
squared residuals:
$$\min_{\beta}\; l_1 = \underset{1 \times T}{e'}\,\underset{T \times 1}{e} = (y - X\beta)'(y - X\beta) \tag{5.9}$$
$$= y'y - 2y'X\beta + \beta'X'X\beta, \tag{5.10}$$
noting that $y'X\beta = \beta'X'y$ is a scalar.
We solve the problem of finding the parameters that minimize this least
squares quantity ($l_1$ in Eq. (5.9)) by taking the derivative with respect to the
parameter vector β, setting it to zero, and solving the resulting equation:
$$\frac{\partial l_1}{\partial \beta} = 2X'X\beta - 2X'y = 0 \tag{5.11}$$
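Note that the derivative in Eq. (5.11) is obtained by using the following matrix
derivative rules, also found in Appendix A (Chap. 14):
$$\frac{\partial\, a'v}{\partial v} = a, \quad \text{and} \quad \frac{\partial\, v'Av}{\partial v} = (A + A')v$$
and especially
$$\frac{\partial\, 2y'X\beta}{\partial \beta} = 2X'y \tag{5.12}$$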
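These derivative rules can be checked numerically. The sketch below compares each rule with a central finite-difference gradient; the helper `numerical_gradient`, the random matrix `A`, and the vectors `a` and `v` are hypothetical and serve only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n))  # arbitrary square matrix, not symmetric
a = rng.normal(size=n)
v = rng.normal(size=n)

def numerical_gradient(f, v, h=1e-6):
    """Central finite-difference approximation of the gradient of f at v."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h
        grad[i] = (f(v + step) - f(v - step)) / (2 * h)
    return grad

grad_linear = numerical_gradient(lambda u: a @ u, v)         # d(a'v)/dv
grad_quadratic = numerical_gradient(lambda u: u @ A @ u, v)  # d(v'Av)/dv

linear_ok = np.allclose(grad_linear, a, atol=1e-6)
quadratic_ok = np.allclose(grad_quadratic, (A + A.T) @ v, atol=1e-6)
print(linear_ok, quadratic_ok)
```

When A is symmetric, as X'X is, the second rule reduces to 2Av, which is the form that appears in Eq. (5.11).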
Therefore, applying these rules to Eq. (5.10), we obtain
$$\hat{\beta} = b = (X'X)^{-1}X'y \tag{5.13}$$
This assumes that X'X can be inverted. If perfect collinearity exists in the data,
i.e., if a variable $x_k$ is a linear combination of a subset of the other x variables,
the inverse does not exist (the determinant is zero). In a less strict case,
multicollinearity occurs when the determinant of X'X approaches zero. The matrix
is then still invertible and an estimate of β exists. We will briefly discuss the
problem in the subsection Computation of Covariance Matrix of Sect. 5.1.4.2.
b is a linear function of y:
$$b = Ay \tag{5.14}$$
where
$$A = (X'X)^{-1}X' \tag{5.15}$$
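A minimal numerical sketch of the OLS estimator follows, assuming simulated data (the values in `beta_true` and the error scale are made up for the example). It computes b from the normal equations, forms the weight matrix A, and compares the result with NumPy's least squares routine.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
beta_true = np.array([1.0, 2.0, -0.5])  # assumed values for the simulation
y = X @ beta_true + rng.normal(scale=0.5, size=T)

# Eq. (5.13): solve the normal equations X'X b = X'y (no explicit inverse needed)
b = np.linalg.solve(X.T @ X, X.T @ y)

# Eq. (5.15): the weight matrix A, so that b = A y as in Eq. (5.14)
A = np.linalg.inv(X.T @ X) @ X.T

# First-order condition, Eq. (5.11): 2X'Xb - 2X'y should be numerically zero
foc = 2 * X.T @ X @ b - 2 * X.T @ y

# Cross-check against NumPy's least squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

Solving the normal equations with `np.linalg.solve` rather than inverting X'X is a common numerical choice; the explicit inverse is formed here only to display A.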
5.1.2.2
Generalized Least Squares or Aitken Estimator
In the general case of heteroscedasticity, the covariance matrix of the error term
vector is positive definite symmetric:
$$\Psi \neq I_T \tag{5.16}$$
The criterion is the quadratic form of the error terms weighted by the inverse of
the covariance matrix. The rationale for that criterion is best understood in the case
where Ψ is diagonal. In such a case, it can be easily seen that the observations with
the largest variances are given smaller weights than the others.
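As a small worked illustration of this weighting, suppose Ψ is diagonal with elements $\psi_1, \ldots, \psi_T$ (the symbol $\psi_t$ is introduced here only for illustration and is not part of the original notation). The weighted quadratic form then reduces to a sum of squared errors, each divided by its own variance factor:

```latex
e'\Psi^{-1}e = \sum_{t=1}^{T} \frac{e_t^2}{\psi_t},
\qquad \Psi = \operatorname{diag}(\psi_1, \ldots, \psi_T)
```

Under this assumption, an observation with $\psi_t = 4$ receives one quarter of the weight of an observation with $\psi_t = 1$.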
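The objective is then
$$\min_{\beta}\; l_2 = e'\Psi^{-1}e = (y - X\beta)'\Psi^{-1}(y - X\beta) \tag{5.17}$$
$$= (y'\Psi^{-1} - \beta'X'\Psi^{-1})(y - X\beta) \tag{5.18}$$
$$= y'\Psi^{-1}y + \beta'X'\Psi^{-1}X\beta - \underset{1 \times K}{\beta'}\,\underset{K \times T}{X'}\,\underset{T \times T}{\Psi^{-1}}\,\underset{T \times 1}{y} - y'\Psi^{-1}X\beta \tag{5.19}$$
$$= \underset{1 \times 1}{y'\Psi^{-1}y + \beta'X'\Psi^{-1}X\beta - 2y'\Psi^{-1}X\beta} \tag{5.20}$$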
Minimization of the quadratic expression in Eq. (5.20) is performed by solving the
equation
$$\frac{\partial l_2}{\partial \beta} = 2(X'\Psi^{-1}X)\beta - 2X'\Psi^{-1}y = 0 \tag{5.21}$$
$$\Rightarrow\; \hat{\beta} = \hat{\beta}_{GLS} = (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}y \tag{5.22}$$
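The GLS formula in Eq. (5.22) can be sketched for the simplest heteroscedastic case, a diagonal Ψ that is assumed to be known. All numbers below (the `psi` values, `beta_true`, the sample size) are assumptions for the example. The sketch also shows the equivalent route of running OLS on data rescaled by the inverse square root of each diagonal element.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true = np.array([1.0, 2.0])          # assumed values
psi = rng.uniform(0.5, 4.0, size=T)       # assumed known diagonal of Psi
Psi_inv = np.diag(1.0 / psi)
e = rng.normal(size=T) * np.sqrt(psi)     # Var(e_t) = psi_t, i.e. sigma^2 = 1
y = X @ beta_true + e

# Eq. (5.22): (X' Psi^-1 X)^-1 X' Psi^-1 y, computed with a linear solve
beta_gls = np.linalg.solve(X.T @ Psi_inv @ X, X.T @ Psi_inv @ y)

# Equivalent route when Psi is diagonal: OLS on rows scaled by 1/sqrt(psi_t)
w = 1.0 / np.sqrt(psi)
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

# First-order condition, Eq. (5.21), evaluated at the GLS estimate
foc = 2 * (X.T @ Psi_inv @ X) @ beta_gls - 2 * X.T @ Psi_inv @ y
print(beta_gls)
```

The rescaling route makes the weighting explicit: observations with larger variance are shrunk before the ordinary least squares fit.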
Consequently, $\hat{\beta}$ is still a linear function of y, as in Eq. (5.14), but with the
linear weights given by
$$A = (X'\Psi^{-1}X)^{-1}X'\Psi^{-1} \tag{5.23}$$
5.1.3
Maximum Likelihood Estimation
So far, the estimators that we have derived are point estimates. They do not allow
the researcher to perform statistical tests of significance on the parameter vector β.
In this section, we derive the maximum likelihood estimators, which leads to the
presentation of the distributional properties of these estimators. The problem is to
find the value of the parameter β that will maximize the probability of obtaining the
observed sample.
The assumption that is needed to derive the maximum likelihood estimator is the
normal distribution of the error term:
$$e \sim N(0, \sigma^2 I_T) \tag{5.24}$$
It is then possible to write the likelihood function, which for the homoscedastic
case is
$$l_1(\beta, \sigma^2 \mid y) = (2\pi\sigma^2)^{-T/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\} \tag{5.25}$$
or for the case of heteroscedasticity
$$l_2(\beta, \sigma^2 \mid y) = (2\pi\sigma^2)^{-T/2}\,|\Psi|^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'\Psi^{-1}(y - X\beta)\right\} \tag{5.26}$$
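We can then maximize the likelihood or, equivalently, its logarithm:
$$\max\, l_1 \;\Leftrightarrow\; \max\, \operatorname{Ln} l_1 \;\Leftrightarrow\; \max \left[-\frac{T}{2}\operatorname{Ln}(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right] \tag{5.27}$$
which is equivalent to minimizing the negative of that expression, i.e.,
$$\min \left[\frac{T}{2}\operatorname{Ln}(2\pi\sigma^2) + \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right] \tag{5.28}$$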
This can be done by setting the derivative of the expression in Eq. (5.28) with
respect to β to zero:
$$\frac{\partial[-\operatorname{Ln}(l_1)]}{\partial \beta} = 0 \;\Rightarrow\; \tilde{\beta}_1 = (X'X)^{-1}X'y \tag{5.29}$$
which is simply the least squares estimator.
Similar computations lead to the maximum likelihood estimator in the case of
heteroscedasticity, which is identical to the generalized least squares (GLS)
estimator:
$$\tilde{\beta}_2 = (X'\Psi^{-1}X)^{-1}X'\Psi^{-1}y \tag{5.30}$$
We can now compute the maximum likelihood estimator of the variance by
finding the value of σ that maximizes the likelihood or, equivalently, minimizes
the expression in Eq. (5.28):
$$\min_{\sigma} \left[\frac{T}{2}\operatorname{Ln} 2\pi + T \operatorname{Ln} \sigma + \frac{1}{2}\sigma^{-2}(y - X\beta)'(y - X\beta)\right] \tag{5.31}$$
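This is solved by setting the derivative with respect to σ to zero:
$$\frac{\partial[-\operatorname{Ln}(l_1)]}{\partial \sigma} = \frac{T}{\sigma} + \frac{1}{2}\left(-2\sigma^{-3}\right)(y - X\beta)'(y - X\beta) = 0 \tag{5.32}$$
This results in
$$\frac{T}{\sigma} - \frac{1}{\sigma^3}(y - X\beta)'(y - X\beta) = 0 \;\Rightarrow\; \frac{1}{\sigma^3}(y - X\beta)'(y - X\beta) = \frac{T}{\sigma} \tag{5.33}$$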
which leads to the maximum likelihood estimator:
$$\tilde{\sigma}^2 = \frac{1}{T}(y - X\tilde{\beta}_1)'(y - X\tilde{\beta}_1) = \frac{1}{T}\hat{e}'\hat{e} \tag{5.34}$$
where $\hat{e}$ is the vector of residuals obtained when using the maximum likelihood
estimator of β to predict y.
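The maximum likelihood results in Eqs. (5.29) and (5.34) can be checked on simulated data. In the sketch below, the function `neg_log_lik` is a hypothetical helper that evaluates the expression minimized in Eq. (5.28); the data-generating values are assumed. The check confirms that moving either estimate away from its closed-form value raises the criterion.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 80
X = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.7, size=T)  # assumed true values

def neg_log_lik(beta, sigma2):
    """Expression minimized in Eq. (5.28), as a function of beta and sigma^2."""
    resid = y - X @ beta
    return 0.5 * T * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

beta_ml = np.linalg.solve(X.T @ X, X.T @ y)  # Eq. (5.29)
resid_ml = y - X @ beta_ml
sigma2_ml = resid_ml @ resid_ml / T          # Eq. (5.34): divides by T, not T - K

best = neg_log_lik(beta_ml, sigma2_ml)
# Perturbing either estimate should increase the criterion
worse_sigma = min(neg_log_lik(beta_ml, sigma2_ml * f) for f in (0.8, 0.9, 1.1, 1.2))
worse_beta = min(
    neg_log_lik(beta_ml + d, sigma2_ml)
    for d in (np.array([0.05, 0.0]), np.array([0.0, -0.05]))
)
print(best < worse_sigma, best < worse_beta)
```

Note that the divisor in Eq. (5.34) is T, which is what distinguishes the maximum likelihood estimator of the variance from the usual degrees-of-freedom-corrected estimator.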
The same computational approach can be applied for the heteroscedastic case.
5.1.4
Properties of Estimator
We have obtained estimators for the parameters β and σ. The next question is to
determine how good these estimators are. Two criteria are important for evaluating
them. Unbiasedness means that, on average, the estimator equals the true
parameter. The second criterion is that the estimator should have the smallest
possible variance.
5.1.4.1
Unbiasedness
Definition: An estimator is unbiased if its expected value is equal to the true
parameter, i.e.,
$$E[\hat{\beta}] = \beta \tag{5.35}$$
b and $\hat{\beta}$, and, a fortiori, the maximum likelihood estimators $\tilde{\beta}_1$ and $\tilde{\beta}_2$, are linear
functions of the random vector y. Consequently, they are also random vectors with
the following mean:
$$E[b] = E\left[(X'X)^{-1}X'y\right] = E\left[(X'X)^{-1}X'(X\beta + e)\right] \tag{5.36}$$
$$= E\Big[\underbrace{(X'X)^{-1}X'X}_{I}\,\beta + (X'X)^{-1}X'e\Big] \tag{5.37}$$
$$= \beta + (X'X)^{-1}X'\underbrace{E[e]}_{=0} = \beta \tag{5.38}$$
This proves that the least squares estimator is unbiased. Similarly, for the GLS
estimator,
$$E[\hat{\beta}] = E\left[(X'\Psi^{-1}X)^{-1}X'\Psi^{-1}y\right] = \beta + E\left[(X'\Psi^{-1}X)^{-1}X'\Psi^{-1}e\right] = \beta \tag{5.39}$$
This means that on average the GLS estimator is the true parameter and is thus
unbiased.
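Unbiasedness is a statement about repeated sampling, so it can be illustrated with a small Monte Carlo experiment. The sketch below holds X fixed, draws many error vectors with mean zero, and averages the resulting OLS estimates. The number of replications, the sample size, and `beta_true` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
T, R = 30, 5000                        # sample size and number of replications
X = np.column_stack([np.ones(T), rng.normal(size=T)])  # fixed across replications
beta_true = np.array([1.0, 2.0])       # assumed values
A = np.linalg.inv(X.T @ X) @ X.T       # Eq. (5.15)

E = rng.normal(scale=1.0, size=(R, T)) # each row is one draw of e with E[e] = 0
Y = X @ beta_true + E                  # R simulated samples of y, shape (R, T)
B = Y @ A.T                            # row r holds b = A y for replication r

b_mean = B.mean(axis=0)
print(b_mean)  # should be close to beta_true
```

Any single row of `B` can be far from `beta_true`; it is only the average over replications that settles near the true values.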
5.1.4.2
Best Linear Estimator
How do the linear rules above compare with other linear unbiased rules in terms of
the precision, i.e., in terms of the covariance matrix? We want an estimator that has
the smallest variance possible. This means that we need to compute the covariance
matrix of the estimator, and then we need to show that it has minimum variance.
Computation of Covariance Matrix
The covariance of the least squares estimator b is
$$\begin{aligned}
\underset{K \times K}{\Sigma_b} &= E\left[(b - E[b])(b - E[b])'\right] \\
&= E\left[(b - \beta)(b - \beta)'\right] \\
&= E\left[\left((X'X)^{-1}X'y - \beta\right)\left((X'X)^{-1}X'y - \beta\right)'\right] \\
&= E\left[\left((X'X)^{-1}X'(X\beta + e) - \beta\right)\left((X'X)^{-1}X'(X\beta + e) - \beta\right)'\right] \\
&= E\left[(X'X)^{-1}X'ee'X(X'X)^{-1}\right] \\
&= (X'X)^{-1}X'E[ee']X(X'X)^{-1} \\
&= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}
\end{aligned} \tag{5.40}$$
Therefore,
$$\underset{K \times K}{\Sigma_b} = \sigma^2(X'X)^{-1} \tag{5.41}$$
In the case of multicollinearity, the elements of $(X'X)^{-1}$ are very large (because
the determinant of X'X is close to zero). This means that the variance of the
estimator will be very large. Consequently, multicollinearity results in parameter
estimates that are unstable.
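The effect of multicollinearity on Eq. (5.41) can be seen by computing the covariance matrix for two designs: one where the second regressor is unrelated to the first, and one where it is almost a copy of it. The helper `cov_b`, the value of `sigma2`, and the 0.01 noise scale are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 100
sigma2 = 1.0                  # assumed error variance
x1 = rng.normal(size=T)
noise = rng.normal(size=T)

def cov_b(x2):
    """Sigma_b = sigma^2 (X'X)^-1 from Eq. (5.41) for X = [1, x1, x2]."""
    X = np.column_stack([np.ones(T), x1, x2])
    return sigma2 * np.linalg.inv(X.T @ X)

cov_independent = cov_b(noise)            # x2 unrelated to x1
cov_collinear = cov_b(x1 + 0.01 * noise)  # x2 almost equal to x1

var_independent = np.diag(cov_independent)
var_collinear = np.diag(cov_collinear)
print(var_independent)
print(var_collinear)
```

The variances of the two slope coefficients are several orders of magnitude larger in the near-collinear design, even though X'X is still invertible.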
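Following similar calculations, the variance–covariance matrix of the GLS
estimator $\hat{\beta}$ is
$$\Sigma_{\hat{\beta}} = E\left[(\hat{\beta} - \beta)(\hat{\beta} - \beta)'\right] = E\left[(X'\Psi^{-1}X)^{-1}X'\Psi^{-1}ee'\Psi^{-1}X(X'\Psi^{-1}X)^{-1}\right] \tag{5.42}$$
$$\Sigma_{\hat{\beta}} = \sigma^2(X'\Psi^{-1}X)^{-1} \tag{5.43}$$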
Best Linear Unbiased Estimator
Out of the class of linear unbiased rules, the ordinary least squares (OLS) (or the
GLS depending on the error term covariance structure) estimator is the best, i.e.,
provides minimum variance. We will provide the proof with the OLS estimator
when $\Psi = I_T$; however, the proof is similar for the GLS estimator when $\Psi \neq I_T$.
The problem is equivalent to minimizing the variance of a linear combination of
the K parameters for any linear combination.
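Let $\underset{K \times 1}{\varphi}$ be a vector of constants. The scalar θ is the linear combination of the
regression parameters β:
$$\underset{1 \times 1}{\theta} = \underset{1 \times K}{\varphi'}\,\underset{K \times 1}{\beta}$$
The least squares estimator of θ is
$$\hat{\theta}_{LS} = \varphi'b = \varphi'(X'X)^{-1}X'y \tag{5.44}$$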
The problem is therefore to determine if there exists another unbiased linear
estimator that is better than the least squares estimator.
An alternative linear estimator would be written in a general way as
$$\underset{1 \times 1}{\hat{\theta}} = \underset{1 \times T}{A'}\,\underset{T \times 1}{y} + \underset{1 \times 1}{a} \tag{5.45}$$
$\hat{\theta}$ should be unbiased. This means that
$$E[\hat{\theta}] = \varphi'\beta \quad \forall \beta. \tag{5.46}$$
By substitution of the expression of the estimator $\hat{\theta}$,
$$E[\hat{\theta}] = E[A'y + a] = A'E[y] + a \tag{5.47}$$
$$= A'X\beta + a \tag{5.48}$$
For $\hat{\theta}$ to be unbiased, Eq. (5.46) must hold, i.e.,
$$\varphi'\beta = A'X\beta + a \tag{5.49}$$
This can only be true for all β if
$$a = 0 \tag{5.50}$$
and
$$\varphi' = A'X \tag{5.51}$$
What is the value of A that will minimize the variance of the estimator? The
variance is
$$V[\hat{\theta}] = A'V[y]A \tag{5.52}$$
However,
$$\begin{aligned}
\underset{T \times T}{V[y]} &= V[X\beta + e] \\
&= E\left[\big((X\beta + e) - E(X\beta + e)\big)\big((X\beta + e) - E(X\beta + e)\big)'\right] \\
&= E[ee'] = \sigma^2 I
\end{aligned} \tag{5.53}$$
Therefore,
$$V[\hat{\theta}] = \sigma^2 A'A \tag{5.54}$$
The problem now is to minimize $V[\hat{\theta}]$ subject to the unbiasedness restrictions
stated in Eqs. (5.50) and (5.51), i.e.,
$$\min_{A}\; \sigma^2 A'A \quad \text{s.t.} \quad \varphi' = A'X$$
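This is a Lagrangian multiplier problem.
The Lagrangian is
$$L = \sigma^2\,\underset{1 \times T}{A'}\,\underset{T \times 1}{A} + 2\,\underset{1 \times K}{\lambda'}\Big(\underset{K \times 1}{\varphi} - \underset{K \times T}{X'}\,\underset{T \times 1}{A}\Big) \tag{5.55}$$
$$\frac{\partial L}{\partial A} = 2\sigma^2 A' - 2\lambda'X' = 0 \tag{5.56}$$
Therefore,
$$\begin{aligned}
\sigma^2 A' - \lambda'X' &= 0 \\
\sigma^2 A'X - \lambda'X'X &= 0 \\
\lambda' &= \sigma^2 A'X(X'X)^{-1} \\
\lambda' &= \sigma^2 \varphi'(X'X)^{-1}
\end{aligned} \tag{5.57}$$
where the last line uses the restriction $\varphi' = A'X$ of Eq. (5.51).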
In addition,
$$\frac{\partial L}{\partial \lambda} = \varphi' - A'X = 0 \tag{5.58}$$
Considering again the derivative with respect to A given in Eq. (5.56), i.e.,
$$\frac{\partial L}{\partial A} = 2\sigma^2 A' - 2\lambda'X'$$
and replacing λ by the expression obtained in Eq. (5.57), we obtain
$$\frac{\partial L}{\partial A} = 2\sigma^2 A' - 2\sigma^2 \varphi'(X'X)^{-1}X' = 0 \tag{5.59}$$
and, therefore,
$$A' = \varphi'(X'X)^{-1}X' \tag{5.60}$$
However,
$$\hat{\theta} = A'y$$
Thus, the minimum variance linear unbiased estimator of $\varphi'\beta$ is obtained by
replacing A' with the expression in Eq. (5.60):
$$\hat{\theta} = \varphi'(X'X)^{-1}X'y \tag{5.61}$$
which is the one obtained from the OLS estimator:
$$\hat{\theta} = \varphi'b \tag{5.62}$$
We have just shown that the OLS estimator has minimum variance.

Table 5.1 Properties of estimators

Property                      Assumption(s) needed
E[b|X] = β                    No. 1
V[b|X, σ²] = σ²(X'X)⁻¹        No. 1, 2
b is BLUE                     No. 1, 2
b is the MLE                  No. 3
b ~ N(β, σ²(X'X)⁻¹)           No. 3
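The minimum-variance result can be illustrated by comparing the OLS weights of Eq. (5.60) with one competing linear unbiased rule. The competitor used below, a weighted least squares rule with arbitrary positive weights `W`, is a hypothetical choice made for this sketch, as are `phi`, `sigma2`, and the simulated X. Both rules satisfy the restriction of Eq. (5.51), and their variances follow from Eq. (5.54).

```python
import numpy as np

rng = np.random.default_rng(7)
T = 40
sigma2 = 1.0                          # assumed error variance
X = np.column_stack([np.ones(T), rng.normal(size=T)])
phi = np.array([1.0, 3.0])            # arbitrary vector of constants

# OLS weights, Eq. (5.60): A' = phi'(X'X)^-1 X', stored as the T-vector A
A_ols = X @ np.linalg.solve(X.T @ X, phi)

# Competing linear unbiased rule (hypothetical): A' = phi'(X'WX)^-1 X'W
W = np.diag(rng.uniform(0.2, 5.0, size=T))
A_alt = W @ X @ np.linalg.solve(X.T @ W @ X, phi)

# Both satisfy the unbiasedness restriction phi' = A'X, Eq. (5.51)
unbiased_ols = np.allclose(A_ols @ X, phi)
unbiased_alt = np.allclose(A_alt @ X, phi)

# Variances from Eq. (5.54): sigma^2 A'A
var_ols = sigma2 * A_ols @ A_ols
var_alt = sigma2 * A_alt @ A_alt
print(unbiased_ols, unbiased_alt, var_ols, var_alt)
```

A single competitor does not prove the theorem, but it shows the mechanics: any other weights that respect the restriction give a variance at least as large as that of the OLS weights.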
5.1.4.3
Summary of Properties
Not all three assumptions discussed in Sect. 5.1.1 are needed for all the properties of
the estimator. Unbiasedness only requires assumption no.1. The computation of the
variance and the best linear unbiased estimator (BLUE) property of the estimator
only involve assumptions no.1 and no.2, and do not require the normal distributional assumption of the error term. Statistical tests about the significance of the
parameters can only be performed with assumption no.3 about the normal distribution of the error term. These properties are shown in Table 5.1.
5.1.5
R-Squared as a Measure of Fit
We first present the R-squared measure and its interpretation as a percentage of
explained variance in the presence of homoscedasticity. We then discuss the issues
that appear when the error term is heteroscedastic.
5.1.5.1
Normal Case of Homoscedasticity
$$y = \hat{y} + \hat{e} \tag{5.63}$$
Let $\bar{y}$ be the T × 1 vector in which each of the T elements is the mean of y.
Subtracting $\bar{y}$ from each side of Eq. (5.63):
$$y - \bar{y} = \hat{y} - \bar{y} + \hat{e} \tag{5.64}$$