Chapter 52. Random Vectors and Linear Statistical Models
Handbook of Linear Algebra
52.2 Introduction to Statistics and Random Variables
Notation:
In this chapter most uppercase, light-face italic letters, in particular X, Y , and Z, denote scalar random
variables, but the notation P (A) is reserved for the probability of the set A. Throughout this chapter, the
uppercase, light-face roman letter E denotes expectation.
Definitions:
The focus in statistics is on making inferences concerning a large body of data called a population based
on a subset collected from it called a sample. An experiment associated with this sample is repeatable and
its outcome is not predetermined. A simple event is one associated with an experiment, which cannot
be decomposed, and a simple event corresponds to one and only one sample point. The sample space
associated with an experiment is the set of all possible sample points.
Suppose that S is a sample space associated with an experiment. To every subset A of S we assign a real
number P (A), called the probability of A, so that the following axioms hold: (1) P (A) ≥ 0, (2) P (S) = 1,
(3) If $A_1, A_2, A_3, \ldots$ form a sequence of pairwise mutually exclusive subsets in S, i.e., $A_i \cap A_j = \emptyset$ if $i \neq j$, then $P(A_1 \cup A_2 \cup A_3 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i)$.
A random variable is a real-valued function for which the domain is a sample space.
A random variable which can assume a finite or countably infinite number of values is discrete.
The probability P (Y = y) that the random variable Y takes on the value y is defined as the sum of the
probabilities of all sample points in S that have the value y, and the probability function of Y is the set
of all the probabilities P (Y = y).
If the random variable Y has the probability function P (Y = 0) = p and P (Y = 1) = 1 − p for some
real number p in the interval [0, 1], then Y is a Bernoulli random variable.
The cumulative distribution function (cdf) of the random variable Y is F (y) = P (Y ≤ y) for −∞ < y < ∞. A random variable Y is continuous when its cdf F (y) is continuous for −∞ < y < ∞, and then its probability density function (pdf) is f (y) = d F (y)/d y.
Suppose that the continuous random variable Y has pdf f (y) = 1 for 0 ≤ y ≤ 1 and f (y) = 0
otherwise. Then Y follows a uniform distribution on the interval [0, 1].
The expectation (or expected value or mean or mean value) E(Y ) of the random variable Y is $E(Y) = \sum_{y} y\,P(Y = y)$ when Y is discrete with probability function P (Y = y) and is $E(Y) = \int_{-\infty}^{+\infty} y f(y)\,dy$ when Y is continuous with pdf f (y).
The variance $\sigma^2$ of the random variable Y is
$$\sigma^2 = \mathrm{var}(Y) = E\big[(Y - \mu)^2\big],$$
where µ = E(Y ), and the standard deviation is $\sigma = \sqrt{\sigma^2}$.
Facts:
1. The variance $\sigma^2 = \mathrm{var}(Y) = E(Y^2) - E^2(Y)$.
2. For any random variable Y , the expectation of its square $E(Y^2) \ge E^2(Y)$, with equality if and only if the random variable Y = E(Y ) with probability 1.
3. If Y is a Bernoulli random variable with probability function P (Y = 0) = p and P (Y = 1) = 1− p,
then the expectation E(Y ) = 1 − p and the variance var(Y ) = p(1 − p).
4. If the random variable Z follows a uniform distribution on the interval [0, 1], then the expectation
E(Z) = 1/2 and the variance var(Z) = 1/12.
Examples:
1. Every person’s blood type is A, B, AB, or O. In addition, each individual either has the Rhesus (Rh)
factor (+) or does not (–). A medical technician records a person’s blood type and Rh factor. The
sample space for this experiment is {A+, B+, AB+, O+, A−, B−, AB−, O−} with eight sample
points.
2. Consider the experiment of tossing a single fair coin and define the random variable Y = 0 if the
outcome is “heads,” and Y = 1 if the outcome is “tails.” Then Y is a Bernoulli random variable,
and P (Y = 0) = 1/2 = P (Y = 1), E(Y ) = 1/2, var(Y ) = 1/4.
3. Suppose that a bus always arrives at a particular stop in the interval between 12 noon and 1 p.m.
and that the probability that the bus will arrive in any given subinterval of time is proportional
only to the length of the subinterval. Let Y denote the length of time that a person arriving at
the stop at 12 noon must wait for the bus to arrive, and let us code 12 noon as 0 and measure
the time in hours. Then the random variable Y follows a uniform distribution on the interval
[0, 1].
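The moments stated in Facts 3 and 4 can be checked numerically. The following is only a sketch: the helper names `discrete_mean_var` and `uniform01_mean_var` are illustrative assumptions (they do not appear in the text), the value p = 1/4 is arbitrary, and the uniform case is approximated by a midpoint Riemann sum rather than integrated exactly.

```python
from fractions import Fraction


def discrete_mean_var(pmf):
    """Mean and variance of a discrete random variable given as {value: probability}."""
    mean = sum(y * p for y, p in pmf.items())
    second_moment = sum(y * y * p for y, p in pmf.items())
    # Fact 1: var(Y) = E(Y^2) - E^2(Y)
    return mean, second_moment - mean ** 2


def uniform01_mean_var(steps=100_000):
    """Approximate mean and variance of the uniform distribution on [0, 1].

    Uses a midpoint Riemann sum with pdf f(y) = 1 on [0, 1].
    """
    h = 1.0 / steps
    mids = [(i + 0.5) * h for i in range(steps)]
    mean = sum(y * h for y in mids)
    second_moment = sum(y * y * h for y in mids)
    return mean, second_moment - mean ** 2


# Bernoulli variable in the chapter's convention: P(Y = 0) = p, P(Y = 1) = 1 - p.
p = Fraction(1, 4)  # arbitrary illustrative value
bern_mean, bern_var = discrete_mean_var({0: p, 1: 1 - p})

unif_mean, unif_var = uniform01_mean_var()
```

With exact rational arithmetic the Bernoulli results equal 1 − p and p(1 − p); the uniform results agree with 1/2 and 1/12 up to the discretization error of the Riemann sum.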
52.3 Random Vectors: Basic Definitions and Facts
Linear algebra is extensively used in the study of random vectors, where we consider the simultaneous
behavior of two or more random variables assembled as a vector. In this section all vectors and matrices
are real.
Notation:
In this section uppercase, light-face italic letters, such as X, Y , and Z, denote scalar random variables and
lowercase bold roman letters, such as x, y, and z, denote random vectors. Uppercase, light-face italic letters
such as A and B denote nonrandom matrices.
Definitions:
Let $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{n \times q}$. Then the partitioned matrix [A | B] is the n × (k + q ) matrix formed by placing A next to B.
A k ×1 random vector y is a vector y = [Y1 , . . . , Yk ]T of k random variables Y1 , . . . , Yk . The expectation
(or expected value or mean vector) of y is the k × 1 vector E(y) = [E(Y1 ), . . . , E(Yk )]T . Sometimes, for
clarity, a vector of constants (belonging to Rn ) is called a nonrandom vector and a matrix of constants a
nonrandom matrix. (Random matrices are not considered in this chapter.)
The covariance cov(Y, Z) between the two random variables Y and Z is $\mathrm{cov}(Y, Z) = E[(Y - \mu)(Z - \nu)]$, where µ = E(Y ) and ν = E(Z).
The correlation (or correlation coefficient or product-moment correlation) cor(Y, Z) between the two random variables Y and Z is $\mathrm{cor}(Y, Z) = \mathrm{cov}(Y, Z)/\sqrt{\mathrm{var}(Y)\,\mathrm{var}(Z)}$.
The covariance matrix (or variance-covariance matrix or dispersion matrix) of the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ is the k × k matrix var(y) = Σ of variances and covariances of all the entries of y:
$$\mathrm{var}(\mathbf{y}) = \Sigma = [\sigma_{ij}] = [\mathrm{cov}(Y_i, Y_j)] = \big[E[(Y_i - \mu_i)(Y_j - \mu_j)]\big] = E\big[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})^T\big],$$
where µ = E(y). The determinant det Σ is the generalized variance of the random vector y. The variances $\sigma_{ii}$ are often denoted as $\sigma_i^2$ and, in this chapter, we will assume that they are all positive. If $\sigma_i = 0$, then the random variable $Y_i = E(Y_i)$ with probability 1, and then we interpret $Y_i$ as a constant. In statistics it is
quite common to denote standard deviations as σi . (The reader should note that in all the other chapters
of this book except in the two statistics chapters, σi denotes the i th largest singular value.)
The cross-covariance matrix cov(y, z) between the k × 1 random vector y = [Y1 , . . . , Yk ]T and the
q × 1 random vector z = [Z 1 , . . . , Zq ]T is the k × q matrix of all the covariances cov(Yi , Z j ); i = 1, . . . , k
and j = 1, . . . , q :
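$$\mathrm{cov}(\mathbf{y}, \mathbf{z}) = [\mathrm{cov}(Y_i, Z_j)] = \big[E[(Y_i - \mu_i)(Z_j - \nu_j)]\big] = E\big[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{z} - \boldsymbol{\nu})^T\big],$$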
where µ = [µi ] = E(y) and ν = [ν j ] = E(z). The random vectors y and z are uncorrelated whenever
the cross-covariance matrix cov(y, z) = 0.
The correlation matrix cor(y) = R, say, of the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ is the k × k matrix of correlations of all the entries in y: $\mathrm{cor}(\mathbf{y}) = R = [\rho_{ij}] = [\mathrm{cor}(Y_i, Y_j)] = [\sigma_{ij}/(\sigma_i \sigma_j)]$, where $\sigma_i = \sqrt{\sigma_{ii}}$ = standard deviation of $Y_i$; $\sigma_i, \sigma_j > 0$.
Let $\mathbf{1}_k$ denote the k × 1 column vector with every entry equal to 1. Then $J_k = \mathbf{1}_k \mathbf{1}_k^T$ is the k × k all-ones matrix (with all $k^2$ entries equal to 1) and $C_k = I_k - \frac{1}{k} J_k$ is the k × k centering matrix.
Suppose that the real positive numbers $p_1, p_2, \ldots, p_k$ are such that $\sum_{i=1}^{k} p_i = 1$. Then the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ follows a multinomial distribution with parameters n and $p_1, p_2, \ldots, p_k$ if the joint probability function of $Y_1, Y_2, \ldots, Y_k$ is given by
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{n!}{y_1!\,y_2! \cdots y_k!}\; p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k},$$
where for each i, $y_i = 0, 1, 2, \ldots, n$ and $\sum_{i=1}^{k} y_i = n$. When k = 2 the distribution is binomial, and when k = 3 the distribution is trinomial.
Let A and B be symmetric matrices in $\mathbb{R}^{k \times k}$. Then $A \succeq B$ means A − B is positive semidefinite and $A \succ B$ means A − B is positive definite. The partial ordering induced by $\succeq$ is called the partial semidefinite ordering (or Loewner partial ordering or Löwner partial ordering). (See Section 8.5 for more information.)
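Let the (k + q ) × 1 random vector x have covariance matrix Σ. Consider the following partitioning:
$$\mathbf{x} = \begin{bmatrix} \mathbf{y} \\ \mathbf{z} \end{bmatrix}, \quad E(\mathbf{x}) = \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\nu} \end{bmatrix}, \quad \mathrm{var}(\mathbf{x}) = \Sigma = \begin{bmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{bmatrix},$$
where y and z have k and q elements, respectively. Then the partial covariance matrix $\Sigma_{zz \cdot y}$ of the q × 1 random vector z after adjusting for (or controlling for or removing the effect of or allowing for) the k × 1 random vector y is the (uniquely defined) generalized Schur complement
$$\Sigma / \Sigma_{yy} = [\sigma_{ij \cdot y}] = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$$
of $\Sigma_{yy}$ in Σ; any generalized inverse $\Sigma_{yy}^{-}$ satisfying $\Sigma_{yy} \Sigma_{yy}^{-} \Sigma_{yy} = \Sigma_{yy}$ may be chosen. When $\Sigma_{yy}$ is positive definite, we use the inverse $\Sigma_{yy}^{-1}$ instead of a generalized inverse $\Sigma_{yy}^{-}$ and refer to $\Sigma / \Sigma_{yy} = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-1} \Sigma_{yz}$ as the Schur complement of $\Sigma_{yy}$ in Σ.
The q × 1 random vector $\mathbf{e}_{z \cdot y} = \mathbf{z} - \boldsymbol{\nu} - \Sigma_{zy} \Sigma_{yy}^{-} (\mathbf{y} - \boldsymbol{\mu})$ is the (uniquely defined) vector of residuals of the q × 1 random vector z from its regression on the k × 1 random vector y.
The i j th entry of the partial correlation matrix of the q × 1 random vector $\mathbf{z} = [Z_1, \ldots, Z_q]^T$ after adjusting for the k × 1 random vector y is the partial correlation coefficient between $Z_i$ and $Z_j$ after adjusting for y:
$$\rho_{ij \cdot y} = \frac{\sigma_{ij \cdot y}}{\sqrt{\sigma_{ii \cdot y}\,\sigma_{jj \cdot y}}}; \quad i, j = 1, \ldots, q,$$
which is well defined provided the diagonal entries of the associated partial covariance matrix are all positive.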
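As an illustration of the partial covariance and partial correlation matrices just defined, here is a minimal NumPy sketch. The function names and the 3 × 3 covariance matrix are hypothetical, and the Moore–Penrose inverse (`np.linalg.pinv`) is used as one admissible choice of generalized inverse of the leading block.

```python
import numpy as np


def partial_covariance(sigma, k):
    """Generalized Schur complement Sigma_zz - Sigma_zy Sigma_yy^- Sigma_yz.

    y is taken to be the first k coordinates of x, z the remaining ones.
    """
    s_yy = sigma[:k, :k]
    s_yz = sigma[:k, k:]
    s_zy = sigma[k:, :k]
    s_zz = sigma[k:, k:]
    # The Moore-Penrose inverse is one valid generalized inverse of Sigma_yy.
    return s_zz - s_zy @ np.linalg.pinv(s_yy) @ s_yz


def partial_correlation(sigma, k):
    """Partial correlation matrix of z after adjusting for y.

    Requires the diagonal of the partial covariance matrix to be positive.
    """
    part = partial_covariance(sigma, k)
    d = np.sqrt(np.diag(part))
    return part / np.outer(d, d)


# Hypothetical positive definite covariance matrix; y = first coordinate.
sigma = np.array([[4.0, 2.0, 1.0],
                  [2.0, 3.0, 1.5],
                  [1.0, 1.5, 2.0]])

part_cov = partial_covariance(sigma, k=1)
part_cor = partial_correlation(sigma, k=1)
```

Because the leading block here is nonsingular, the generalized Schur complement coincides with the ordinary Schur complement.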
Facts:
1. [WMS02, Th. 5.13, p. 265] Let the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ follow a multinomial distribution with parameters n and $p_1, \ldots, p_k$ and let the k × 1 vector $\mathbf{p} = [p_1, \ldots, p_k]^T$. Then:
• The random variable $Y_i$ can be represented as the sum of n independently and identically distributed Bernoulli random variables with parameter $p_i$; i = 1, . . . , k.
• The expectation $E(\mathbf{y}) = n\mathbf{p}$ and the covariance matrix
$$\mathrm{var}(\mathbf{y}) = n\big(\mathrm{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\big) = \Sigma_k,$$
say, where diag(p) is the k × k diagonal matrix formed from the k × 1 nonrandom vector p.
• The covariance matrix $\Sigma_k$ is singular since all its row (and column) totals are 0, and rank($\Sigma_k$) = k − 1.
2. When the k × 1 multinomial probability vector $\mathbf{p} = \mathbf{1}_k / k$, then the multinomial covariance matrix $\Sigma_k = \frac{n}{k} C_k$, where $C_k$ is the k × k centering matrix.
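3. The k × k covariance matrix $\mathrm{var}(\mathbf{y}) = \Sigma = E(\mathbf{y}\mathbf{y}^T) - \boldsymbol{\mu}\boldsymbol{\mu}^T$, where µ = E(y).
4. The k × k correlation matrix $\mathrm{cor}(\mathbf{y}) = [\mathrm{diag}(\Sigma)]^{-1/2}\,\Sigma\,[\mathrm{diag}(\Sigma)]^{-1/2}$, where Σ = var(y).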
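5. The k × q cross-covariance matrix
$$\mathrm{cov}(\mathbf{y}, \mathbf{z}) = \Sigma_{yz} = \Sigma_{zy}^T = [\mathrm{cov}(\mathbf{z}, \mathbf{y})]^T = E(\mathbf{y}\mathbf{z}^T) - \boldsymbol{\mu}\boldsymbol{\nu}^T,$$
where µ = E(y) and ν = E(z).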
6. [RM71, Lemma 2.2.4, p. 21]: The product $A B^{-} C$ (for A ≠ 0, C ≠ 0) is invariant with respect to the choice of $B^{-}$ ⇐⇒ range(C ) ⊆ range(B) and range($A^T$) ⊆ range($B^T$).
7. Consider the (k + q ) × (k + q ) covariance matrix
$$\Sigma = \begin{bmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{bmatrix}.$$
Then the range (or column space) range($\Sigma_{yz}$) ⊆ range($\Sigma_{yy}$) and range($\Sigma_{zy}$) ⊆ range($\Sigma_{zz}$), and, hence, the matrix $\Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$ and the generalized Schur complement $\Sigma / \Sigma_{yy} = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$ are invariant (unique) with respect to the choice of generalized inverse $\Sigma_{yy}^{-}$.
8. Let the k × 1 random vector $\mathbf{x} = [X_1, \ldots, X_k]^T$. Then the k × 1 centered vector $C_k \mathbf{x} = [X_1 - \bar{X}, \ldots, X_k - \bar{X}]^T$, where $C_k$ is the k × k centering matrix and the arithmetic mean (or average) $\bar{X} = \sum_{i=1}^{k} X_i / k$.
9. A nonsingular positive semidefinite matrix is positive definite.
10. A covariance matrix is always symmetric and positive semidefinite.
11. A cross-covariance matrix is usually rectangular.
12. A correlation matrix is always symmetric and positive semidefinite.
13. The diagonal entries of a correlation matrix are all equal to 1 and the off-diagonal entries are all at most equal to 1 in absolute value.
14. [PS05, p. 168] The (generalized) Schur complement of the leading principal submatrix of a positive semidefinite matrix is positive semidefinite.
15. Let y be a k × 1 random vector with covariance matrix Σ. Then the variance $\mathrm{var}(\mathbf{a}^T \mathbf{y}) = \mathbf{a}^T \Sigma\, \mathbf{a}$ for all nonrandom $\mathbf{a} \in \mathbb{R}^k$. (Since variance must be nonnegative this fact shows that a covariance matrix must be positive semidefinite.)
16. Let y be a k × 1 random vector with expectation µ = E(y), covariance matrix Σ = var(y), and let the matrix $A \in \mathbb{R}^{n \times k}$ and the nonrandom vector $\mathbf{b} \in \mathbb{R}^n$. Then the expectation $E(A\mathbf{y} + \mathbf{b}) = A E(\mathbf{y}) + \mathbf{b} = A\boldsymbol{\mu} + \mathbf{b}$ and the covariance matrix $\mathrm{var}(A\mathbf{y} + \mathbf{b}) = A\,\mathrm{var}(\mathbf{y})\,A^T = A \Sigma A^T$.
17. Let y be a k × 1 random vector with expectation µ = E(y), covariance matrix Σ = var(y), and let the matrix $A \in \mathbb{R}^{k \times k}$, not necessarily symmetric. Then $E(\mathbf{y}^T A \mathbf{y}) = \boldsymbol{\mu}^T A \boldsymbol{\mu} + \mathrm{tr}(A \Sigma)$.
18. [Rao73a, p. 522] Let y be a k × 1 random vector with expectation µ = E(y) and covariance matrix Σ = var(y), and let [µ | Σ] denote the k × (k + 1) partitioned matrix with µ as its first column. Then y − µ ∈ range(Σ) and y ∈ range([µ | Σ]), both with probability 1.
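19. Let the (k + q ) × 1 random vector x have covariance matrix Σ. Consider the following partitioning:
$$\mathbf{x} = \begin{bmatrix} \mathbf{y} \\ \mathbf{z} \end{bmatrix}, \quad E(\mathbf{x}) = \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\nu} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{bmatrix},$$
where y and z have k and q components, respectively. Then
• The variance $\mathrm{var}(\mathbf{a}^T \mathbf{y} + \mathbf{b}^T \mathbf{z}) = \mathbf{a}^T \Sigma_{yy} \mathbf{a} + 2\mathbf{a}^T \Sigma_{yz} \mathbf{b} + \mathbf{b}^T \Sigma_{zz} \mathbf{b}$ for all nonrandom $\mathbf{a} \in \mathbb{R}^k$ and for all nonrandom $\mathbf{b} \in \mathbb{R}^q$.
• The covariance matrix $\mathrm{var}(A\mathbf{y} + B\mathbf{z}) = A \Sigma_{yy} A^T + A \Sigma_{yz} B^T + B \Sigma_{zy} A^T + B \Sigma_{zz} B^T$ for all $A \in \mathbb{R}^{n \times k}$ and all $B \in \mathbb{R}^{n \times q}$.
• [PS05b, pp. 187–188] For any $A \in \mathbb{R}^{q \times k}$ the covariance matrix
$$\mathrm{var}(\mathbf{z} - A\mathbf{y}) \succeq \mathrm{var}(\mathbf{z} - \Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y})$$
with respect to the partial semidefinite ordering, and the partial covariance matrix
$$\mathrm{var}(\mathbf{z} - \Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y}) = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz} = \Sigma_{zz \cdot y} = \Sigma / \Sigma_{yy},$$
the generalized Schur complement of $\Sigma_{yy}$ in Σ.
• Let q = k. Then the covariance matrix var(y + z) = var(y) + var(z) if and only if cov(y, z) = −cov(z, y), i.e., the cross-covariance matrix cov(y, z) is skew-symmetric; the condition that cov(y, z) = 0 is sufficient, but not necessary (unless k = 1).
• The vector $\Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y}$ is not necessarily invariant with respect to the choice of generalized inverse $\Sigma_{yy}^{-}$, but its covariance matrix $\mathrm{var}(\Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y}) = \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$ is invariant (and, hence, unique).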
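The multinomial facts and the quadratic-form expectation E(yᵀAy) = µᵀAµ + tr(AΣ) lend themselves to a quick numerical check. The sketch below is illustrative only: the helper names, the parameter values n = 10 and p = (0.2, 0.3, 0.5), the nonsymmetric matrix A, and the Monte Carlo sample size are all assumptions made for the example.

```python
import numpy as np


def multinomial_cov(n, p):
    """Covariance matrix n (diag(p) - p p^T) of a multinomial random vector."""
    p = np.asarray(p, dtype=float)
    return n * (np.diag(p) - np.outer(p, p))


def centering_matrix(k):
    """Centering matrix C_k = I_k - (1/k) J_k."""
    return np.eye(k) - np.ones((k, k)) / k


n, p = 10, np.array([0.2, 0.3, 0.5])   # hypothetical parameters
sigma_k = multinomial_cov(n, p)
mu = n * p

# Hypothetical nonsymmetric A; E(y^T A y) = mu^T A mu + tr(A Sigma).
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0],
              [3.0, 0.0, 2.0]])
expected_quadratic = mu @ A @ mu + np.trace(A @ sigma_k)

# Monte Carlo estimate of E(y^T A y) from simulated multinomial vectors.
rng = np.random.default_rng(0)
draws = rng.multinomial(n, p, size=200_000).astype(float)
simulated_quadratic = np.mean(np.einsum("ij,jk,ik->i", draws, A, draws))
```

The simulated mean should agree with `expected_quadratic` only up to sampling error, whereas the rank and row-sum properties of `sigma_k` hold up to floating-point rounding.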
Examples:
1. Let the 4 × 1 random vector x have covariance matrix Σ. Consider the following partitioning:
$$\mathbf{x} = \begin{bmatrix} \mathbf{y} \\ \mathbf{z} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 1 & 0 & a & b \\ 0 & 1 & c & d \\ a & c & 1 & 0 \\ b & d & 0 & 1 \end{bmatrix},$$
where y and z each have 2 components. Then var(y + z) = var(y) + var(z) if and only if a = d = 0 and c = −b, with $b^2 \le 1$.
2. Let the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ follow a multinomial distribution with parameters n and $\mathbf{p} = [p_1, \ldots, p_k]^T$, with $p_1 + \cdots + p_k = 1$ and $p_1 > 0, \ldots, p_k > 0$, and let the k × k matrix $A = [a_{ij}]$. Then the expectation $E(\mathbf{y}^T A \mathbf{y}) = n(n-1)\,\mathbf{p}^T A \mathbf{p} + n \sum_{i=1}^{k} a_{ii} p_i$.
3. Let the 3 × 1 random vector $\mathbf{y} = [Y_1, Y_2, Y_3]^T$ follow a trinomial distribution with parameters n and $p_1, p_2, p_3$, with $p_1 + p_2 + p_3 = 1$ and $p_1 > 0$, $p_2 > 0$, $p_3 > 0$, and let the 3 × 1 vector $\mathbf{p} = [p_1, p_2, p_3]^T$. Then:
• The expectation $E(\mathbf{y}) = n[p_1, p_2, p_3]^T$ and the covariance matrix
$$\Sigma_3 = \mathrm{var}(\mathbf{y}) = n \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 & -p_1 p_3 \\ -p_1 p_2 & p_2(1 - p_2) & -p_2 p_3 \\ -p_1 p_3 & -p_2 p_3 & p_3(1 - p_3) \end{bmatrix},$$
which has rank equal to 2 since $\Sigma_3$ is singular and the determinant of the top left-hand 2 × 2 corner of $\Sigma_3$ equals $n^2 p_1 p_2 p_3 > 0$.
• The partial covariance matrix of $Y_1$ and $Y_2$ adjusting for $Y_3$ is the Schur complement $\Sigma_3 / p_3(1 - p_3) = nS$, say, where
$$S = \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2(1 - p_2) \end{bmatrix} - \frac{1}{p_3(1 - p_3)} \begin{bmatrix} -p_1 p_3 \\ -p_2 p_3 \end{bmatrix} \begin{bmatrix} -p_1 p_3 & -p_2 p_3 \end{bmatrix} = \frac{p_1 p_2}{p_1 + p_2} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},$$
which has rank equal to 1, and so rank($\Sigma_3$) = 2.
• When $p_1 = p_2 = p_3 = 1/3$, then the covariance matrix $\Sigma_3 = (n/3) C_3$ and the partial covariance matrix of $Y_1$ and $Y_2$ adjusting for $Y_3$ is $(n/3) C_2$; here, $C_h$ is the h × h centering matrix, h = 2, 3.
4. [PS05, p. 183] If the 3 × 3 symmetric matrix
$$R_3 = \begin{bmatrix} 1 & r_{12} & r_{13} \\ r_{12} & 1 & r_{23} \\ r_{13} & r_{23} & 1 \end{bmatrix} = \begin{bmatrix} R_2 & \mathbf{r}_2 \\ \mathbf{r}_2^T & 1 \end{bmatrix}$$
is a correlation matrix, then $r_{ij}^2 \le 1$ for all 1 ≤ i < j ≤ 3. But not all symmetric matrices with diagonal elements all equal to 1 and all off-diagonal elements $r_{ij}$ such that $r_{ij}^2 \le 1$ are correlation matrices. For example, consider $R_3$ with $r_{13}^2 \le 1$ and $r_{23}^2 \le 1$. Then $R_3$ is a correlation matrix if and only if
$$r_{13} r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \le r_{12} \le r_{13} r_{23} + \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}.$$
When $r_{13} = 0$ and $r_{12} = r_{23} = r$, say, then this condition becomes $r^2 \le 1/2$ and so the matrix
$$\begin{bmatrix} 1 & 0.8 & 0 \\ 0.8 & 1 & 0.8 \\ 0 & 0.8 & 1 \end{bmatrix}$$
is not a correlation matrix.
When $r_{12}^2 \le 1$, then the matrix $R_3$ is a correlation matrix if and only if any one of the following conditions holds:
• $\det(R_3) = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23} \ge 0$.
• (i) $\mathbf{r}_2 \in \mathrm{range}(R_2)$ and (ii) $1 \ge \mathbf{r}_2^T R_2^{-} \mathbf{r}_2$ for some and, hence, for every generalized inverse $R_2^{-}$.
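5. Let the random vector x be 2 × 1 and write
$$\mathbf{x} = \begin{bmatrix} Y \\ Z \end{bmatrix}, \quad E(\mathbf{x}) = \begin{bmatrix} \mu \\ \nu \end{bmatrix}, \quad \mathrm{var}(\mathbf{x}) = \Sigma = \begin{bmatrix} \sigma_y^2 & \sigma_{yz} \\ \sigma_{yz} & \sigma_z^2 \end{bmatrix},$$
with $\sigma_y^2 > 0$. Then the residual vector $\mathbf{e}_{z \cdot y} = \mathbf{z} - \boldsymbol{\nu} - \Sigma_{zy} \Sigma_{yy}^{-} (\mathbf{y} - \boldsymbol{\mu})$ of the random vector z from its regression on y becomes the scalar residual
$$e_{Z \cdot Y} = Z - \nu - \frac{\sigma_{yz}}{\sigma_y^2}\,(Y - \mu)$$
of the random variable Z from its regression on Y . The matrix of partial covariances of the random vector z after adjusting for y becomes the single partial variance
$$\sigma_{z \cdot y}^2 = \sigma_z^2 - \frac{\sigma_{yz}^2}{\sigma_y^2} = \sigma_z^2 (1 - \rho_{yz}^2)$$
of the random variable Z after adjusting for the random variable Y ; here, the correlation coefficient $\rho_{yz} = \sigma_{yz} / (\sigma_y \sigma_z)$.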
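The correlation-matrix conditions of Example 4 can be explored numerically. The sketch below is a hedged illustration: the function names are assumptions, positive semidefiniteness is tested through the smallest eigenvalue with a small tolerance, and the second matrix (with r = 0.7) is a hypothetical contrast case that is not in the text.

```python
import numpy as np


def is_correlation_matrix(r, tol=1e-12):
    """True if r is symmetric, has unit diagonal, and is positive semidefinite."""
    r = np.asarray(r, dtype=float)
    if not np.allclose(r, r.T) or not np.allclose(np.diag(r), 1.0):
        return False
    # Positive semidefinite iff the smallest eigenvalue is nonnegative.
    return bool(np.linalg.eigvalsh(r).min() >= -tol)


def r12_bounds(r13, r23):
    """Interval of r12 values for which R3 is a correlation matrix."""
    half_width = np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return r13 * r23 - half_width, r13 * r23 + half_width


def r3(r12, r13, r23):
    """Build the 3 x 3 symmetric matrix R3 with unit diagonal."""
    return np.array([[1.0, r12, r13],
                     [r12, 1.0, r23],
                     [r13, r23, 1.0]])


bad = r3(0.8, 0.0, 0.8)    # the matrix from Example 4: r^2 = 0.64 > 1/2
good = r3(0.7, 0.0, 0.7)   # hypothetical contrast: r^2 = 0.49 <= 1/2
lower, upper = r12_bounds(0.0, 0.8)
```

The eigenvalue test and the interval test are two views of the same condition, so they should agree on any input with unit diagonal.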
52.4 Linear Statistical Models: Basic Definitions and Facts
Notation:
In this section, the uppercase, light-face italic letter X is reserved for the nonrandom n × p model matrix and V is reserved for an n × n covariance matrix. The uppercase, light-face italic letter H is reserved for the (symmetric idempotent) n × n hat matrix $X(X^T X)^{-} X^T$ and M = I − H is reserved for the (symmetric idempotent) n × n residual matrix. The lowercase, bold-face roman letter y is reserved for an observable n × 1 random vector and x is reserved for a column of the n × p model matrix X.
Definitions:
The general linear model (or Gauss–Markov model or Gauß–Markov model) is the model
$$\mathcal{M} = \{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$$
defined by the equation y = Xβ + ε, where E(y) = Xβ, E(ε) = 0, var(y) = var(ε) = $\sigma^2 V$. The vector y is an n × 1 observable random vector, ε is an n × 1 unobservable random error vector, X is a known n × p model matrix (or design matrix, particularly when its entries are −1, 0, or +1), β is a p × 1 vector of unknown parameters, V is a known n × n positive semidefinite matrix, and $\sigma^2$ is an unknown positive constant. The realization of the n × 1 observable random vector y will also be denoted by y.
The classical theory of linear statistical models covers the full-rank model, where X has full column rank and V is positive definite. In the full-rank model, the ordinary least squares estimator is
$$\mathrm{OLSE}(\boldsymbol{\beta}) = \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} = X^{\dagger} \mathbf{y}$$
and the generalized least squares estimator (or Aitken estimator) is
$$\mathrm{GLSE}(\boldsymbol{\beta}) = \tilde{\boldsymbol{\beta}} = (X^T V^{-1} X)^{-1} X^T V^{-1} \mathbf{y},$$
where $X^{\dagger}$ denotes the Moore–Penrose inverse of X.
When either X or V is (or both X and V are) rank deficient, then it is usually assumed that rank(X) < rank(V ). The model $\mathcal{M} = \{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$ is called a weakly singular model (or Zyskind–Martin model) whenever range(X) ⊆ range(V ), and then rank(X) < rank(V ), and is consistent if the realization y satisfies y ∈ range([X | V ]).
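Let $\hat{\boldsymbol{\beta}}$ be any vector minimizing $\|\mathbf{y} - X\boldsymbol{\beta}\|^2 = (\mathbf{y} - X\boldsymbol{\beta})^T (\mathbf{y} - X\boldsymbol{\beta})$. Then $\hat{\mathbf{y}} = X\hat{\boldsymbol{\beta}} = \mathrm{OLSE}(X\boldsymbol{\beta})$ is the ordinary least squares estimator (OLSE) of Xβ. When rank(X) < p, then $\hat{\boldsymbol{\beta}}$ is an ordinary least squares solution to $\min_{\boldsymbol{\beta}} (\mathbf{y} - X\boldsymbol{\beta})^T (\mathbf{y} - X\boldsymbol{\beta})$. Moreover, $\hat{\boldsymbol{\beta}}$ is any solution to the normal equations $X^T X \hat{\boldsymbol{\beta}} = X^T \mathbf{y}$. The vector of OLS residuals is $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - X\hat{\boldsymbol{\beta}}$ and the residual sum of squares is $SSE = \mathbf{e}^T \mathbf{e} = (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}})$.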
The coefficient of determination (or coefficient of multiple determination or squared multiple correlation) $R^2 = 1 - SSE / (\mathbf{y}^T C_n \mathbf{y})$ identifies the proportion of variance explained in a multiple linear regression where the model matrix $X = [\mathbf{1}_n \mid \mathbf{x}_{[1]} \mid \cdots \mid \mathbf{x}_{[p-1]}]$ with p − 1 regressor vectors (or regressors) $\mathbf{x}_{[1]}, \ldots, \mathbf{x}_{[p-1]}$, each n × 1. In simple linear regression p = 2 and the model matrix $X = [\mathbf{1}_n \mid \mathbf{x}]$ with the single regressor vector x. The sample correlation coefficient $r = \mathbf{x}^T C_n \mathbf{y} / \sqrt{\mathbf{x}^T C_n \mathbf{x} \cdot \mathbf{y}^T C_n \mathbf{y}}$, where it is usually assumed that x is an n × 1 nonrandom vector (such as a regressor vector) and y is a realization of the n × 1 random vector y.
Let the matrix $A \in \mathbb{R}^{k \times n}$ and let the matrix $K \in \mathbb{R}^{k \times p}$. Then the linear estimator Ay is a linear unbiased estimator (LUE) of K β if E(Ay) = K β for all $\boldsymbol{\beta} \in \mathbb{R}^p$. Let the matrix $B \in \mathbb{R}^{k \times n}$. Then the LUE By of K β is the best linear unbiased estimator (BLUE) of K β if it has the smallest covariance matrix (in the positive semidefinite ordering) in that $\mathrm{var}(A\mathbf{y}) \succeq \mathrm{var}(B\mathbf{y})$ for all LUEs Ay of K β.
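The hat matrix $H = X(X^T X)^{-} X^T$ associated with the model matrix X is so named since $\hat{\mathbf{y}} = H\mathbf{y}$. The residual matrix is M = I − H and the vector of OLS residuals is $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - H\mathbf{y} = M\mathbf{y}$. Let the nonrandom vector $\mathbf{a} \in \mathbb{R}^n$. Then the linear estimator $\mathbf{a}^T \mathbf{y}$, which is unbiased for 0, i.e., $E(\mathbf{a}^T \mathbf{y}) = 0$, is a linear zero function.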
The Watson efficiency φ under the full-rank model $\mathcal{M} = \{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$, with the n × p model matrix X having full column rank equal to p < n and with the n × n covariance matrix V positive definite, measures the relative efficiency of the OLSE(β) = $\hat{\boldsymbol{\beta}}$ vs. the BLUE(β) = $\tilde{\boldsymbol{\beta}}$ and is defined by the ratio of the corresponding generalized variances:
$$\phi = \frac{\det[\mathrm{var}(\tilde{\boldsymbol{\beta}})]}{\det[\mathrm{var}(\hat{\boldsymbol{\beta}})]} = \frac{\det^2(X^T X)}{\det(X^T V X) \cdot \det(X^T V^{-1} X)}.$$
The Bloomfield–Watson efficiency ψ under the general linear model $\mathcal{M} = \{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$ with no rank assumptions measures the relative efficiency of the OLSE(Xβ) = $X\hat{\boldsymbol{\beta}}$ vs. the BLUE(β) = $\tilde{\boldsymbol{\beta}}$ and is defined by: $\psi = \frac{1}{2}\|HV - VH\|^2 = \|HVM\|^2$, where the norm $\|A\| = \mathrm{tr}^{1/2}(A^T A)$ is defined for any k × q matrix A.
The n × n covariance matrix $(1 - \rho) I_n + \rho \mathbf{1}_n \mathbf{1}_n^T = (1 - \rho) I_n + \rho J_n$ has intraclass correlation structure (or equicorrelation structure) and is the intraclass correlation matrix (or the equicorrelation matrix). The parameter ρ is the intraclass correlation (or intraclass correlation coefficient).
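The estimators and efficiencies defined in this section can be computed directly with NumPy. The following is a minimal sketch under stated assumptions: the function names are illustrative, the model matrix X is the one used in Example 1 of this section, the covariance matrix V = diag(1, 2, 1) and the realization y are hypothetical values, and `np.linalg.pinv` serves as one admissible generalized inverse of XᵀX.

```python
import numpy as np


def hat_matrix(X):
    """Orthogonal projector onto range(X): H = X (X^T X)^- X^T."""
    # pinv(X^T X) is one valid generalized inverse of X^T X.
    return X @ np.linalg.pinv(X.T @ X) @ X.T


def olse(X, y):
    """OLSE(beta) = (X^T X)^{-1} X^T y, assuming X has full column rank."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def glse(X, V, y):
    """GLSE(beta) = (X^T V^{-1} X)^{-1} X^T V^{-1} y, assuming V positive definite."""
    Vinv_X = np.linalg.solve(V, X)  # V^{-1} X
    return np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)


def watson_efficiency(X, V):
    """phi = det^2(X^T X) / (det(X^T V X) det(X^T V^{-1} X))."""
    num = np.linalg.det(X.T @ X) ** 2
    den = np.linalg.det(X.T @ V @ X) * np.linalg.det(X.T @ np.linalg.solve(V, X))
    return num / den


def bloomfield_watson_efficiency(X, V):
    """psi = ||H V M||^2 with the Frobenius norm."""
    H = hat_matrix(X)
    M = np.eye(X.shape[0]) - H
    return np.linalg.norm(H @ V @ M, "fro") ** 2


X = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, -1.0]])
V = np.diag([1.0, 2.0, 1.0])    # hypothetical covariance matrix
y = np.array([1.0, 2.0, 4.0])   # hypothetical realization

H = hat_matrix(X)
beta_hat = olse(X, y)
beta_tilde = glse(X, V, y)
phi = watson_efficiency(X, V)
psi = bloomfield_watson_efficiency(X, V)
```

Solving linear systems with `np.linalg.solve` instead of forming explicit inverses is a design choice for numerical stability; for V equal to the identity matrix the two estimators coincide and the Watson efficiency equals 1.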
Facts:
The following facts, except for those with a specific reference, can be found in [Gro04], [PS89], or [SJ03,
§4.1–4.3]. Throughout this set of facts, X denotes the n × p nonrandom model matrix.
1. The hat matrix $H = X(X^T X)^{-} X^T$ associated with the model matrix X is invariant (unique) with respect to choice of generalized inverse $(X^T X)^{-}$ and is a symmetric idempotent matrix: $H = H^T = H^2$, and rank(H) = tr(H) = rank(X). Moreover, the hat matrix H is the orthogonal projector onto range(X).
2. If the p × p matrix Q is nonsingular, then the hat matrix associated with the model matrix X Q
equals the hat matrix associated with the model matrix X.
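3. The residual sum of squares $SSE = \mathbf{y}^T M \mathbf{y} = (\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}}) = \mathbf{y}^T \mathbf{y} - \mathbf{y}^T X \hat{\boldsymbol{\beta}}$, where M is the residual matrix and $\hat{\boldsymbol{\beta}}$ = OLSE(β).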
4. In simple linear regression the coefficient of determination $R^2 = r^2$, the square of the sample correlation coefficient. In multiple linear regression with model matrix $X = [\mathbf{1}_n \mid X_0] = [\mathbf{1}_n \mid \mathbf{x}_{[1]} \mid \cdots \mid \mathbf{x}_{[p-1]}]$ and ( p − 1) × 1 nonrandom vector $\mathbf{a} \in \mathbb{R}^{p-1}$,
$$R^2 = \max_{\mathbf{a}} r_a^2 = \max_{\mathbf{a}} \frac{(\mathbf{a}^T X_0^T C_n \mathbf{y})^2}{\mathbf{a}^T X_0^T C_n X_0 \mathbf{a} \cdot \mathbf{y}^T C_n \mathbf{y}},$$
the square of the sample correlation coefficient $r_a$ between the variables whose observed values are in vectors y and $X_0 \mathbf{a}$.
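5. The vector $X\hat{\boldsymbol{\beta}}$ is invariant (unique) with respect to the choice of least squares solution $\hat{\boldsymbol{\beta}}$, but $\hat{\boldsymbol{\beta}}$ is unique if and only if X has full column rank equal to p ≤ n, and then $\hat{\boldsymbol{\beta}} = \mathrm{OLSE}(\boldsymbol{\beta}) = (X^T X)^{-1} X^T \mathbf{y} = X^{\dagger} \mathbf{y}$, where $X^{\dagger}$ is the Moore–Penrose inverse of X. The covariance matrix $\mathrm{var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1} X^T V X (X^T X)^{-1}$.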
6. The Watson efficiency φ is always positive, and φ ≤ 1 with equality if and only if OLSE(β) =
BLUE(β).
7. [DLL02, p. 477], [Gus97, p. 67] Bloomfield–Watson–Knott Inequality. The Watson efficiency
$$\phi = \frac{\det^2(X^T X)}{\det(X^T V X) \cdot \det(X^T V^{-1} X)} \ge \prod_{i=1}^{m} \frac{4 \lambda_i \lambda_{n-i+1}}{(\lambda_i + \lambda_{n-i+1})^2},$$
for all n × p model matrices X with full column rank p. Here m = min( p, n − p) and $\lambda_1 \ge \cdots \ge \lambda_n$ denote the necessarily positive eigenvalues of the n × n positive definite covariance matrix V . The ratios $4 \lambda_i \lambda_{n-i+1} / (\lambda_i + \lambda_{n-i+1})^2$ in the lower bound for the Watson efficiency are the squared antieigenvalues of the covariance matrix V .
8. [DLL02, p. 454] Let p = 1 and set the n × 1 model matrix X = x. Then the Bloomfield–Watson–Knott Inequality is the Kantorovich Inequality (or Frucht–Kantorovich Inequality):
$$\frac{(\mathbf{x}^T \mathbf{x})^2}{\mathbf{x}^T V \mathbf{x} \cdot \mathbf{x}^T V^{-1} \mathbf{x}} \ge \frac{4 \lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2},$$
where $\lambda_1$ and $\lambda_n$ are, respectively, the largest and smallest eigenvalues of the n × n positive definite covariance matrix V .
9. The Bloomfield–Watson efficiency
$$\psi = \tfrac{1}{2}\|HV - VH\|^2 = \|HVM\|^2 = \mathrm{tr}(HVMVH) = \mathrm{tr}(HVMV) = \mathrm{tr}(HV^2 - HVHV) = \mathrm{tr}(HV^2) - \mathrm{tr}\big[(HV)^2\big] \ge 0,$$
with equality if and only if OLSE(β) = BLUE(β) if and only if the Watson efficiency φ = 1.
10. [DLL02, p. 473] The Bloomfield–Watson Trace Inequality. Let A be a nonrandom symmetric n × n matrix, not necessarily positive semidefinite. Then for all the nonrandom matrices $U \in \mathbb{R}^{n \times p}$ that satisfy $U^T U = I_p$:
$$\mathrm{tr}(U^T A^2 U) - \mathrm{tr}\big[(U^T A U)^2\big] \le \frac{1}{4} \sum_{i=1}^{\min(p,\,n-p)} (\alpha_i - \alpha_{n-i+1})^2,$$
where $\alpha_1 \ge \cdots \ge \alpha_n$ denote the eigenvalues of the n × n matrix A.
11. The Bloomfield–Watson efficiency
$$\psi = \mathrm{tr}(HV^2) - \mathrm{tr}\big[(HV)^2\big] \le \frac{1}{4} \sum_{i=1}^{\min(p,\,n-p)} (\lambda_i - \lambda_{n-i+1})^2,$$
for all n × n hat matrices H with rank p (and so for all n × p model matrices X with full column rank p). Here, $\lambda_1 \ge \cdots \ge \lambda_n$ denote the necessarily positive eigenvalues of the n × n positive definite covariance matrix V .
12. The n × n intraclass correlation matrix $R_{ic} = (1 - \rho) I_n + \rho \mathbf{1}_n \mathbf{1}_n^T$ has eigenvalues 1 − ρ with multiplicity n − 1 and 1 + ρ(n − 1) with multiplicity 1, and so $R_{ic}$ is singular if and only if ρ = −1/(n − 1) or ρ = 1.
13. The intraclass correlation coefficient ρ is such that −1/(n − 1) ≤ ρ ≤ 1 and the n × n intraclass
correlation matrix is positive definite if and only if −1/(n − 1) < ρ < 1.
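14. The inverse of the n × n positive definite intraclass correlation matrix is
$$\big[(1 - \rho) I_n + \rho \mathbf{1}_n \mathbf{1}_n^T\big]^{-1} = \frac{1}{1 - \rho} \left( I_n - \frac{\rho}{1 + \rho(n - 1)}\, \mathbf{1}_n \mathbf{1}_n^T \right).$$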
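15. Gauss–Markov Theorem (or Gauß–Markov Theorem). In the full-rank model $\{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$, the generalized least squares estimator $\tilde{\boldsymbol{\beta}} = \mathrm{GLSE}(\boldsymbol{\beta}) = (X^T V^{-1} X)^{-1} X^T V^{-1} \mathbf{y} = \mathrm{BLUE}(\boldsymbol{\beta})$. In the full-rank model $\{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 I\}$, the ordinary least-squares estimator $\mathrm{OLSE}(\boldsymbol{\beta}) = \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} = X^{\dagger} \mathbf{y} = \mathrm{BLUE}(\boldsymbol{\beta})$.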
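16. In the model $\{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$, where V is positive definite, but with X possibly with less than full column rank, the
$$\mathrm{BLUE}(X\boldsymbol{\beta}) = X(X^T V^{-1} X)^{-} X^T V^{-1} \mathbf{y}.$$
17. [Sea97, §5.4] Let the matrix $K \in \mathbb{R}^{k \times p}$. Then K β is estimable ⇐⇒ ∃ matrix $A \in \mathbb{R}^{n \times k}$ : $K^T = X^T A$ ⇐⇒ range($K^T$) ⊆ range($X^T$) ⇐⇒ $K\hat{\boldsymbol{\beta}}$ is invariant for any choice of $\hat{\boldsymbol{\beta}} = (X^T X)^{-} X^T \mathbf{y}$.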
18. [Rao73b, p. 282] Consider the general linear model {y, Xβ, σ 2 V }, where X and V need not be
of full rank. Let the matrix G ∈ Rn×n . Then G y = BLUE(Xβ) ⇐⇒ G [X | V M] = [X | 0],
where the residual matrix M = I − H. Let the matrix A ∈ Rk×n and the matrix K ∈ Rk× p . Then
the corresponding condition for Ay to be the BLUE of an estimable parametric function K β is
A[X | V M] = [K | 0].
19. Let G 1 and G 2 both be n ×n. If G 1 y and G 2 y are two BLUEs of Xβ under the model {y, Xβ, σ 2 V },
then G 1 y = G 2 y for all y ∈ range([X | V ]). The matrix G yielding the BLUE is unique if and only
if range([X | V ]) = Rn .
20. Every linear zero function can be written as bT My for some nonrandom b ∈ Rn . Let the matrix
G ∈ Rn×n . Then an unbiased estimator G y = BLUE(Xβ) if and only if G y is uncorrelated with
every linear zero function.
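21. [Rao71] Let the matrix $A \in \mathbb{R}^{n \times n}$. Then the linear estimator Ay = BLUE(Xβ) under the model $\{\mathbf{y}, X\boldsymbol{\beta}, \sigma^2 V\}$ if and only if there exists a matrix Λ so that A is a solution to Pandora’s box
$$\begin{bmatrix} V & X \\ X^T & 0 \end{bmatrix} \begin{bmatrix} A^T \\ \Lambda \end{bmatrix} = \begin{bmatrix} 0 \\ X^T \end{bmatrix}.$$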
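22. [Rao71] Let the (n + p) × (n + p) matrix B be defined as any generalized inverse:
$$B = \begin{bmatrix} V & X \\ X^T & 0 \end{bmatrix}^{-} = \begin{bmatrix} B_1 & B_2 \\ B_3 & -B_4 \end{bmatrix}.$$
Let $\mathbf{k}^T \boldsymbol{\beta}$ be estimable; then the $\mathrm{BLUE}(\mathbf{k}^T \boldsymbol{\beta}) = \mathbf{k}^T \tilde{\boldsymbol{\beta}} = \mathbf{k}^T B_2^T \mathbf{y} = \mathbf{k}^T B_3 \mathbf{y}$, the variance $\mathrm{var}(\mathbf{k}^T \tilde{\boldsymbol{\beta}}) = \sigma^2 \mathbf{k}^T B_4 \mathbf{k}$, and the quadratic form $\mathbf{y}^T B_1 \mathbf{y} / f$ is an unbiased estimator of $\sigma^2$ with f = rank([V | X]) − rank(X).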
23. [PS89] In the model {y, Xβ, σ 2 V } with no rank assumptions, the OLSE(Xβ) = BLUE(Xβ) if and
only if any one of the following equivalent conditions holds:
• H V = V H.
• H V = H V H.
• H V M = 0.
• $X^T V L = 0$, where the n × l matrix L has range(L ) = range(M).
• range(V X) ⊆ range(X).
• range(V X) = range(X) ∩ range(V ).
• H V H ≤ V , i.e., V − H V H is positive semidefinite.
• rank(V − H V H) = rank(V ) − rank(H V H).
• rank(V − H V H) = rank(V ) − rank(V X).
• range(X) has a basis consisting of r eigenvectors of V , where r = rank(X).
• V can be expressed as $V = \alpha I + X A X^T + L B L^T$, where α ∈ R, range(L ) = range(M), and the p × p matrices A and B are symmetric, and such that V is positive semidefinite.
More conditions can be obtained by replacing V with its Moore–Penrose inverse V † and the hat
matrix H with the residual matrix M = I − H.
24. Suppose that the positive definite covariance matrix V has h distinct eigenvalues: $\lambda_{\{1\}} > \lambda_{\{2\}} > \cdots > \lambda_{\{h\}} > 0$ with multiplicities $m_1, \ldots, m_h$, $\sum_{i=1}^{h} m_i = n$, and with associated orthonormalized sets of eigenvectors $U_{\{1\}}, \ldots, U_{\{h\}}$, respectively, $n \times m_1, \ldots, n \times m_h$. Then OLSE(Xβ) = BLUE(Xβ) if and only if any one of the following equivalent conditions holds:
• $\mathrm{rank}(U_{\{1\}}^T X) + \cdots + \mathrm{rank}(U_{\{h\}}^T X) = \mathrm{rank}(X)$.
• $U_{\{i\}}^T H U_{\{i\}} = (U_{\{i\}}^T H U_{\{i\}})^2$ for all i = 1, . . . , h.
• $U_{\{i\}}^T H U_{\{j\}} = 0$ for all i ≠ j ; i, j = 1, . . . , h.
25. [Rao73b] Let the p × p matrix U be such that the n × n matrix $W = V + X U X^T$ has range(W) = range([X | V ]). Then the $\mathrm{BLUE}(X\boldsymbol{\beta}) = X(X^T W^{-} X)^{-} X^T W^{-} \mathbf{y}$.
26. When V is nonsingular, the n × n matrix G such that G y is the BLUE of Xβ is unique, but
when V is singular this may not be so. However, the numerical value of BLUE(Xβ) is unique with
probability 1.
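27. [SJ03, §7.4] The residual vector associated with the BLUE(Xβ) is
$$\tilde{\mathbf{e}} = \mathbf{y} - X\tilde{\boldsymbol{\beta}} = V M (M V M)^{-} M \mathbf{y} = M\mathbf{y} + H V M (M V M)^{-} M \mathbf{y},$$
which is invariant (unique) with respect to choice of generalized inverse $(M V M)^{-}$. The weighted sum of squares of BLUE residuals, which is needed when estimating $\sigma^2$, can be written as
$$(\mathbf{y} - X\tilde{\boldsymbol{\beta}})^T V^{-} (\mathbf{y} - X\tilde{\boldsymbol{\beta}}) = \tilde{\mathbf{e}}^T V^{-} \tilde{\mathbf{e}} = \mathbf{y}^T M (M V M)^{-} M \mathbf{y}.$$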
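Several of the facts above can be illustrated together with an intraclass correlation structure. In the sketch below, which is illustrative rather than definitive, the values n = 4 and ρ = 0.3, the model matrix X (whose first column is the all-ones vector), and the realization y are hypothetical; the matrix is built with the sign convention (1 − ρ)Iₙ + ρ1ₙ1ₙᵀ from the definitions. Because range(X) contains 1ₙ, the hat matrix commutes with V, so the OLS and BLUE fitted values should coincide.

```python
import numpy as np


def intraclass(n, rho):
    """Intraclass correlation matrix (1 - rho) I_n + rho 1_n 1_n^T."""
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))


def intraclass_inverse(n, rho):
    """Closed-form inverse, valid for -1/(n - 1) < rho < 1."""
    return (np.eye(n) - rho / (1 + rho * (n - 1)) * np.ones((n, n))) / (1 - rho)


n, rho = 4, 0.3                      # hypothetical values
V = intraclass(n, rho)
eigenvalues = np.sort(np.linalg.eigvalsh(V))

# Hypothetical model matrix whose first column is the all-ones vector.
X = np.column_stack([np.ones(n), np.array([0.0, 1.0, 2.0, 4.0])])
H = X @ np.linalg.pinv(X.T @ X) @ X.T
commutes = bool(np.allclose(H @ V, V @ H))   # condition H V = V H

y = np.array([1.0, 3.0, 2.0, 5.0])   # hypothetical realization
ols_fit = H @ y                       # OLSE(X beta)
Vinv_X = np.linalg.solve(V, X)        # V^{-1} X
blue_fit = X @ np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)   # BLUE(X beta)
```

The sorted eigenvalues should be 1 − ρ repeated n − 1 times followed by 1 + ρ(n − 1).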
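Examples:
1. Let n = 3 and p = 2 with the model matrix
$$X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 1 & -1 \end{bmatrix}.$$
Then X has full column rank equal to 2, the matrix $X^T X$ is nonsingular, and the hat matrix is
$$H = X(X^T X)^{-} X^T = X(X^T X)^{-1} X^T = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}$$
with rank(H) = tr(H) = 2. The OLSE(β) is
$$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} = \begin{bmatrix} \frac{1}{3}(y_1 + y_2 + y_3) \\ \frac{1}{2}(y_1 - y_3) \end{bmatrix},$$
where $\mathbf{y} = [y_1, y_2, y_3]^T$. The vector of OLS residuals is
$$M\mathbf{y} = \frac{1}{6} \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \frac{1}{6}(y_1 - 2y_2 + y_3) \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}$$
with residual sum of squares $SSE = (y_1 - 2y_2 + y_3)^2 / 6$.
Now let the variance $\sigma^2 = 1$ and let the covariance matrix
$$V = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \delta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$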