
# Chapter 52. Random Vectors and Linear Statistical Models


Handbook of Linear Algebra

## 52.2 Introduction to Statistics and Random Variables

Notation:

In this chapter most uppercase, light-face italic letters, in particular X, Y , and Z, denote scalar random

variables, but the notation P (A) is reserved for the probability of the set A. Throughout this chapter, the

uppercase, light-face roman letter E denotes expectation.

Definitions:

The focus in statistics is on making inferences concerning a large body of data called a population based

on a subset collected from it called a sample. An experiment associated with this sample is repeatable and

its outcome is not predetermined. A simple event is one associated with an experiment, which cannot

be decomposed, and a simple event corresponds to one and only one sample point. The sample space

associated with an experiment is the set of all possible sample points.

Suppose that S is a sample space associated with an experiment. To every subset A of S we assign a real

number P (A), called the probability of A, so that the following axioms hold: (1) P (A) ≥ 0, (2) P (S) = 1,

(3) If A1, A2, A3, . . . form a sequence of pairwise mutually exclusive subsets in S, i.e., Ai ∩ Aj = ∅ if i ≠ j,

then $P(A_1 \cup A_2 \cup A_3 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i)$.

A random variable is a real-valued function for which the domain is a sample space.

A random variable which can assume a finite or countably infinite number of values is discrete.

The probability P (Y = y) that the random variable Y takes on the value y is defined as the sum of the

probabilities of all sample points in S that have the value y, and the probability function of Y is the set

of all the probabilities P (Y = y).

If the random variable Y has the probability function P (Y = 0) = p and P (Y = 1) = 1 − p for some

real number p in the interval [0, 1], then Y is a Bernoulli random variable.

The cumulative distribution function (cdf) is F (y) = P (Y ≤ y) of the random variable Y for

−∞ < y < ∞. A random variable Y is continuous when its cdf F (y) is continuous for −∞ < y < ∞,

and then its probability density function (pdf) f (y) = d F (y)/d y.

Suppose that the continuous random variable Y has pdf f (y) = 1 for 0 ≤ y ≤ 1 and f (y) = 0

otherwise. Then Y follows a uniform distribution on the interval [0, 1].

The expectation (or expected value or mean or mean value) E(Y) of the random variable Y is $E(Y) = \sum_{y} y\,P(Y = y)$ when Y is discrete with probability function P(Y = y) and is $E(Y) = \int_{-\infty}^{+\infty} y f(y)\,dy$ when Y is continuous with pdf f(y).

The variance σ² of the random variable Y is

$$\sigma^2 = \operatorname{var}(Y) = E\big[(Y - \mu)^2\big],$$

where µ = E(Y), and the standard deviation is $\sigma = \sqrt{\sigma^2}$.

Facts:

1. The variance $\sigma^2 = \operatorname{var}(Y) = E(Y^2) - [E(Y)]^2$.

2. For any random variable Y, the expectation of its square satisfies $E(Y^2) \ge [E(Y)]^2$, with equality if and only if the random variable Y = E(Y) with probability 1.

3. If Y is a Bernoulli random variable with probability function P (Y = 0) = p and P (Y = 1) = 1− p,

then the expectation E(Y ) = 1 − p and the variance var(Y ) = p(1 − p).

4. If the random variable Z follows a uniform distribution on the interval [0, 1], then the expectation

E(Z) = 1/2 and the variance var(Z) = 1/12.
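Facts 1 and 3 can be checked by direct computation. The following is a minimal sketch, not part of the handbook: the helper name `mean_and_variance` and the value p = 1/3 are illustrative assumptions, and exact rational arithmetic is used so that the identities hold without rounding.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Mean and variance of a discrete random variable given as {value: probability}."""
    mean = sum(y * prob for y, prob in pmf.items())
    second_moment = sum(y * y * prob for y, prob in pmf.items())
    # Fact 1: var(Y) = E(Y^2) - [E(Y)]^2
    return mean, second_moment - mean ** 2

# Bernoulli variable in the chapter's convention: P(Y = 0) = p, P(Y = 1) = 1 - p.
p = Fraction(1, 3)  # illustrative value, chosen arbitrarily
mean, variance = mean_and_variance({0: p, 1: 1 - p})
print(mean, variance)
```

With this convention the mean comes out as 1 − p and the variance as p(1 − p), in agreement with Fact 3.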


Examples:

1. Every person’s blood type is A, B, AB, or O. In addition, each individual either has the Rhesus (Rh)

factor (+) or does not (–). A medical technician records a person’s blood type and Rh factor. The

sample space for this experiment is {A+, B+, AB+, O+, A−, B−, AB−, O−} with eight sample

points.

2. Consider the experiment of tossing a single fair coin and define the random variable Y = 0 if the

outcome is “heads,” and Y = 1 if the outcome is “tails.” Then Y is a Bernoulli random variable,

and P(Y = 0) = 1/2 = P(Y = 1), E(Y) = 1/2, var(Y) = 1/4.

3. Suppose that a bus always arrives at a particular stop in the interval between 12 noon and 1 p.m.

and that the probability that the bus will arrive in any given subinterval of time is proportional

only to the length of the subinterval. Let Y denote the length of time that a person arriving at

the stop at 12 noon must wait for the bus to arrive, and let us code 12 noon as 0 and measure

the time in hours. Then the random variable Y follows a uniform distribution on the interval

[0, 1].

## 52.3 Random Vectors: Basic Definitions and Facts

Linear algebra is extensively used in the study of random vectors, where we consider the simultaneous

behavior of two or more random variables assembled as a vector. In this section all vectors and matrices

are real.

Notation:

In this section uppercase, light-face italic letters, such as X, Y , and Z, denote scalar random variables and

lowercase bold roman letters, such as x, y, and z, denote random vectors. Uppercase, light-face italic letters

such as A and B denote nonrandom matrices.

Definitions:

Let A ∈ Rn×k and B ∈ Rn×q . Then the partitioned matrix [A | B] is the n × (k + q ) matrix formed by

placing A next to B.

A k ×1 random vector y is a vector y = [Y1 , . . . , Yk ]T of k random variables Y1 , . . . , Yk . The expectation

(or expected value or mean vector) of y is the k × 1 vector E(y) = [E(Y1 ), . . . , E(Yk )]T . Sometimes, for

clarity, a vector of constants (belonging to Rn ) is called a nonrandom vector and a matrix of constants a

nonrandom matrix. (Random matrices are not considered in this chapter.)

The covariance cov(Y, Z) between the two random variables Y and Z is $\operatorname{cov}(Y, Z) = E\big[(Y - \mu)(Z - \nu)\big]$, where µ = E(Y) and ν = E(Z).

The correlation (or correlation coefficient or product-moment correlation) cor(Y, Z) between the two random variables Y and Z is $\operatorname{cor}(Y, Z) = \operatorname{cov}(Y, Z)\big/\sqrt{\operatorname{var}(Y)\operatorname{var}(Z)}$.

The covariance matrix (or variance-covariance matrix or dispersion matrix) of the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ is the k × k matrix $\operatorname{var}(\mathbf{y}) = \Sigma$ of variances and covariances of all the entries of y:

$$\operatorname{var}(\mathbf{y}) = \Sigma = [\sigma_{ij}] = [\operatorname{cov}(Y_i, Y_j)] = \big[E\{(Y_i - \mu_i)(Y_j - \mu_j)\}\big] = E\big[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})^T\big],$$

where µ = E(y). The determinant det Σ is the generalized variance of the random vector y. The variances $\sigma_{ii}$ are often denoted as $\sigma_i^2$ and, in this chapter, we will assume that they are all positive. If $\sigma_i = 0$, then the random variable $Y_i = E(Y_i)$ with probability 1, and then we interpret $Y_i$ as a constant. In statistics it is quite common to denote standard deviations as $\sigma_i$. (The reader should note that in all the other chapters of this book except in the two statistics chapters, $\sigma_i$ denotes the i th largest singular value.)


The cross-covariance matrix cov(y, z) between the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ and the q × 1 random vector $\mathbf{z} = [Z_1, \ldots, Z_q]^T$ is the k × q matrix of all the covariances $\operatorname{cov}(Y_i, Z_j)$; i = 1, . . . , k and j = 1, . . . , q:

$$\operatorname{cov}(\mathbf{y}, \mathbf{z}) = [\operatorname{cov}(Y_i, Z_j)] = \big[E\{(Y_i - \mu_i)(Z_j - \nu_j)\}\big] = E\big[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{z} - \boldsymbol{\nu})^T\big],$$

where $\boldsymbol{\mu} = [\mu_i] = E(\mathbf{y})$ and $\boldsymbol{\nu} = [\nu_j] = E(\mathbf{z})$. The random vectors y and z are uncorrelated whenever the cross-covariance matrix cov(y, z) = 0.

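The correlation matrix cor(y) = R, say, of the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$, is the k × k matrix of correlations of all the entries in y:

$$\operatorname{cor}(\mathbf{y}) = R = [\rho_{ij}] = [\operatorname{cor}(Y_i, Y_j)] = \left[\frac{\sigma_{ij}}{\sigma_i \sigma_j}\right],$$

where $\sigma_i = \sqrt{\sigma_{ii}}$ is the standard deviation of $Y_i$; $\sigma_i, \sigma_j > 0$.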
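The definitions of the covariance and correlation matrices can be made concrete with a small finite joint distribution. The sketch below is illustrative only: the function names and the three-point joint probability function are assumptions introduced here, not taken from the handbook.

```python
from fractions import Fraction
from math import sqrt

def covariance_matrix(joint_pmf):
    """Covariance matrix E[(y - mu)(y - mu)^T] for a finite joint pmf {outcome tuple: prob}."""
    k = len(next(iter(joint_pmf)))
    mu = [sum(prob * y[i] for y, prob in joint_pmf.items()) for i in range(k)]
    return [[sum(prob * (y[i] - mu[i]) * (y[j] - mu[j]) for y, prob in joint_pmf.items())
             for j in range(k)] for i in range(k)]

def correlation_matrix(sigma):
    """rho_ij = sigma_ij / (sigma_i * sigma_j); requires all variances to be positive."""
    sd = [sqrt(sigma[i][i]) for i in range(len(sigma))]
    return [[float(sigma[i][j]) / (sd[i] * sd[j]) for j in range(len(sigma))]
            for i in range(len(sigma))]

# Hypothetical joint distribution of (Y1, Y2) on three points.
joint_pmf = {(0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4), (1, 1): Fraction(1, 2)}
sigma = covariance_matrix(joint_pmf)
R = correlation_matrix(sigma)
```

The covariance matrix is kept in exact fractions; the correlation matrix is returned in floating point because it involves square roots.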

Let $1_k$ denote the k × 1 column vector with every entry equal to 1. Then $J_k = 1_k 1_k^T$ is the k × k all-ones matrix (with all $k^2$ entries equal to 1) and $C_k = I_k - \tfrac{1}{k} J_k$ is the k × k centering matrix.
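A short sketch of the centering matrix in action follows. The helper names and the data vector are illustrative assumptions; the code only checks two properties that follow from the definition, namely that multiplying by the centering matrix subtracts the arithmetic mean and that the matrix is idempotent.

```python
from fractions import Fraction

def centering_matrix(k):
    """C_k = I_k - (1/k) J_k, with J_k the k x k all-ones matrix."""
    return [[Fraction(int(i == j)) - Fraction(1, k) for j in range(k)] for i in range(k)]

def matvec(a, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def matmul(a, b):
    """Matrix-matrix product for list-of-rows matrices."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

x = [2, 4, 9]                 # arbitrary illustrative data; the mean is 5
C3 = centering_matrix(3)
centered = matvec(C3, x)      # each entry is x_i minus the mean
```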

Suppose that the real positive numbers $p_1, p_2, \ldots, p_k$ are such that $\sum_{i=1}^{k} p_i = 1$. Then the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ follows a multinomial distribution with parameters n and $p_1, p_2, \ldots, p_k$ if the joint probability function of $Y_1, Y_2, \ldots, Y_k$ is given by

$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{n!}{y_1!\, y_2! \cdots y_k!}\; p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k},$$

where for each i, $y_i = 0, 1, 2, \ldots, n$ and $\sum_{i=1}^{k} y_i = n$. When k = 2 the distribution is binomial, and when k = 3 the distribution is trinomial.
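The joint probability function above translates directly into code. This is a minimal sketch under assumed names (`multinomial_pmf`, `compositions`) and arbitrarily chosen parameters; it evaluates the formula exactly and enumerates all admissible outcomes.

```python
from fractions import Fraction
from math import factorial

def multinomial_pmf(counts, probs):
    """P(Y1 = y1, ..., Yk = yk) = n! / (y1! ... yk!) * p1^y1 ... pk^yk, with n = sum(counts)."""
    coefficient = factorial(sum(counts))
    for y in counts:
        coefficient //= factorial(y)      # exact at every step
    value = Fraction(coefficient)
    for y, p in zip(counts, probs):
        value *= p ** y
    return value

def compositions(n, k):
    """Yield every tuple of k nonnegative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]   # illustrative trinomial parameters
total = sum(multinomial_pmf(y, probs) for y in compositions(4, 3))
binomial_case = multinomial_pmf((1, 2), [Fraction(1, 4), Fraction(3, 4)])   # k = 2
```

Summing over all outcomes gives total probability 1, and the k = 2 case reproduces the binomial probability C(3, 1)(1/4)(3/4)².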

Let the symmetric matrices $A \in R^{k \times k}$ and $B \in R^{k \times k}$. Then $A \succeq B$ means A − B is positive semidefinite and $A \succ B$ means A − B is positive definite. The partial ordering induced by $\succeq$ is called the partial semidefinite ordering (or Loewner partial ordering or Löwner partial ordering). (See Section 8.5.)

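Let the (k + q) × 1 random vector x have covariance matrix Σ. Consider the following partitioning:

$$\mathbf{x} = \begin{bmatrix} \mathbf{y} \\ \mathbf{z} \end{bmatrix}, \qquad E(\mathbf{x}) = \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\nu} \end{bmatrix}, \qquad \operatorname{var}(\mathbf{x}) = \Sigma = \begin{bmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{bmatrix},$$

where y and z have k and q elements, respectively. Then the partial covariance matrix $\Sigma_{zz \cdot y}$ of the q × 1 random vector z after adjusting for (or controlling for or removing the effect of or allowing for) the k × 1 random vector y is the (uniquely defined) generalized Schur complement

$$\Sigma / \Sigma_{yy} = [\sigma_{ij \cdot y}] = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$$

of $\Sigma_{yy}$ in Σ; any generalized inverse $\Sigma_{yy}^{-}$ satisfying $\Sigma_{yy} \Sigma_{yy}^{-} \Sigma_{yy} = \Sigma_{yy}$ may be chosen.

When $\Sigma_{yy}$ is positive definite, we use the inverse $\Sigma_{yy}^{-1}$ instead of a generalized inverse $\Sigma_{yy}^{-}$ and refer to $\Sigma / \Sigma_{yy} = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-1} \Sigma_{yz}$ as the Schur complement of $\Sigma_{yy}$ in Σ.

The q × 1 random vector $\mathbf{e}_{z \cdot y} = \mathbf{z} - \boldsymbol{\nu} - \Sigma_{zy} \Sigma_{yy}^{-} (\mathbf{y} - \boldsymbol{\mu})$ is the (uniquely defined) vector of residuals of the q × 1 random vector z from its regression on the k × 1 random vector y.

The ij th entry of the partial correlation matrix of the q × 1 random vector $\mathbf{z} = [Z_1, \ldots, Z_q]^T$ after adjusting for the k × 1 random vector y is the partial correlation coefficient between $Z_i$ and $Z_j$ after adjusting for y:

$$\rho_{ij \cdot y} = \frac{\sigma_{ij \cdot y}}{\sqrt{\sigma_{ii \cdot y}\, \sigma_{jj \cdot y}}}; \qquad i, j = 1, \ldots, q,$$

which is well defined provided the diagonal entries of the associated partial covariance matrix are all positive.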
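For the special case k = 1, where the vector being adjusted for is a single random variable with positive variance, the Schur complement reduces to an entrywise formula. The following sketch covers only that case; the function names and the 3 × 3 covariance matrix are illustrative assumptions.

```python
from fractions import Fraction

def partial_covariance_scalar(sigma, idx):
    """Schur complement of the scalar block sigma[idx][idx] > 0 in sigma:
    sigma_zz - sigma_zy * sigma_yy^{-1} * sigma_yz, where y is variable idx (case k = 1)."""
    keep = [i for i in range(len(sigma)) if i != idx]
    s_yy = sigma[idx][idx]
    return [[sigma[i][j] - sigma[i][idx] * sigma[idx][j] / s_yy for j in keep] for i in keep]

def partial_correlation(partial, i, j):
    """rho_ij.y = sigma_ij.y / sqrt(sigma_ii.y * sigma_jj.y)."""
    return float(partial[i][j]) / float(partial[i][i] * partial[j][j]) ** 0.5

# Hypothetical positive definite covariance matrix; variable 0 plays the role of y.
sigma = [[Fraction(v) for v in row] for row in [[4, 2, 2], [2, 3, 2], [2, 2, 5]]]
partial = partial_covariance_scalar(sigma, 0)
rho = partial_correlation(partial, 0, 1)
```

A general implementation for k > 1 would need a matrix (generalized) inverse of the block for y, which is deliberately left out of this sketch.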


Facts:

1. [WMS02, Th. 5.13, p. 265] Let the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ follow a multinomial distribution with parameters n and $p_1, \ldots, p_k$ and let the k × 1 vector $\mathbf{p} = [p_1, \ldots, p_k]^T$. Then:

   • The random variable $Y_i$ can be represented as the sum of n independently and identically distributed Bernoulli random variables with parameter $p_i$; i = 1, . . . , k.

   • The expectation $E(\mathbf{y}) = n\mathbf{p}$ and the covariance matrix $\operatorname{var}(\mathbf{y}) = n\big[\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\big] = \Sigma_k$, say, where diag(p) is the k × k diagonal matrix formed from the k × 1 nonrandom vector p.

   • The covariance matrix $\Sigma_k$ is singular since all its row (and column) totals are 0, and $\operatorname{rank}(\Sigma_k) = k - 1$.

2. When the k × 1 multinomial probability vector $\mathbf{p} = 1_k / k$, then the multinomial covariance matrix $\Sigma_k = \tfrac{n}{k} C_k$, where $C_k$ is the k × k centering matrix.

3. The k × k covariance matrix $\operatorname{var}(\mathbf{y}) = \Sigma = E(\mathbf{y}\mathbf{y}^T) - \boldsymbol{\mu}\boldsymbol{\mu}^T$, where µ = E(y).

4. The k × k correlation matrix $\operatorname{cor}(\mathbf{y}) = [\operatorname{diag}(\Sigma)]^{-1/2}\, \Sigma\, [\operatorname{diag}(\Sigma)]^{-1/2}$, where Σ = var(y).

5. The k × q cross-covariance matrix

$$\operatorname{cov}(\mathbf{y}, \mathbf{z}) = \Sigma_{yz} = \Sigma_{zy}^T = [\operatorname{cov}(\mathbf{z}, \mathbf{y})]^T = E(\mathbf{y}\mathbf{z}^T) - \boldsymbol{\mu}\boldsymbol{\nu}^T,$$

   where µ = E(y) and ν = E(z).

6. [RM71, Lemma 2.2.4, p. 21] The product $A B^{-} C$ (for A ≠ 0, C ≠ 0) is invariant with respect to the choice of $B^{-}$ ⟺ range(C) ⊆ range(B) and range($A^T$) ⊆ range($B^T$).

7. Consider the (k + q) × (k + q) covariance matrix

$$\Sigma = \begin{bmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{bmatrix}.$$

   Then the range (or column space) range($\Sigma_{yz}$) ⊆ range($\Sigma_{yy}$) and range($\Sigma_{zy}$) ⊆ range($\Sigma_{zz}$), and, hence, the matrix $\Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$ and the generalized Schur complement $\Sigma / \Sigma_{yy} = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$ are invariant (unique) with respect to the choice of generalized inverse $\Sigma_{yy}^{-}$.

8. Let the k × 1 random vector $\mathbf{x} = [X_1, \ldots, X_k]^T$. Then the k × 1 centered vector $C_k \mathbf{x} = [X_1 - \bar{X}, \ldots, X_k - \bar{X}]^T$, where $C_k$ is the k × k centering matrix and the arithmetic mean (or average) $\bar{X} = \sum_{i=1}^{k} X_i / k$.

9. A nonsingular positive semidefinite matrix is positive definite.

10. A covariance matrix is always symmetric and positive semidefinite.

11. A cross-covariance matrix is usually rectangular.

12. A correlation matrix is always symmetric and positive semidefinite.

13. The diagonal entries of a correlation matrix are all equal to 1 and the off-diagonal entries are all at most equal to 1 in absolute value.

14. [PS05, p. 168] The (generalized) Schur complement of the leading principal submatrix of a positive semidefinite matrix is positive semidefinite.

15. Let y be a k × 1 random vector with covariance matrix Σ. Then the variance $\operatorname{var}(\mathbf{a}^T \mathbf{y}) = \mathbf{a}^T \Sigma\, \mathbf{a}$ for all nonrandom $\mathbf{a} \in R^k$. (Since variance must be nonnegative this fact shows that a covariance matrix must be positive semidefinite.)

16. Let y be a k × 1 random vector with expectation µ = E(y), covariance matrix Σ = var(y), and let the matrix $A \in R^{n \times k}$ and the nonrandom vector $\mathbf{b} \in R^n$. Then the expectation $E(A\mathbf{y} + \mathbf{b}) = A E(\mathbf{y}) + \mathbf{b} = A\boldsymbol{\mu} + \mathbf{b}$ and the covariance matrix $\operatorname{var}(A\mathbf{y} + \mathbf{b}) = A \operatorname{var}(\mathbf{y}) A^T = A \Sigma A^T$.

17. Let y be a k × 1 random vector with expectation µ = E(y), covariance matrix Σ = var(y), and let the matrix $A \in R^{k \times k}$, not necessarily symmetric. Then $E(\mathbf{y}^T A \mathbf{y}) = \boldsymbol{\mu}^T A \boldsymbol{\mu} + \operatorname{tr}(A \Sigma)$.


18. [Rao73a, p. 522] Let y be a k × 1 random vector with expectation µ = E(y) and covariance matrix Σ = var(y), and let $[\boldsymbol{\mu} \mid \Sigma]$ denote the k × (k + 1) partitioned matrix with µ as its first column. Then $\mathbf{y} - \boldsymbol{\mu} \in \operatorname{range}(\Sigma)$ and $\mathbf{y} \in \operatorname{range}([\boldsymbol{\mu} \mid \Sigma])$, both with probability 1.

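19. Let the (k + q) × 1 random vector x have covariance matrix Σ. Consider the following partitioning:

$$\mathbf{x} = \begin{bmatrix} \mathbf{y} \\ \mathbf{z} \end{bmatrix}, \qquad E(\mathbf{x}) = \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\nu} \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{yy} & \Sigma_{yz} \\ \Sigma_{zy} & \Sigma_{zz} \end{bmatrix},$$

   where y and z have k and q components, respectively. Then

   • The variance $\operatorname{var}(\mathbf{a}^T \mathbf{y} + \mathbf{b}^T \mathbf{z}) = \mathbf{a}^T \Sigma_{yy} \mathbf{a} + 2\mathbf{a}^T \Sigma_{yz} \mathbf{b} + \mathbf{b}^T \Sigma_{zz} \mathbf{b}$ for all nonrandom $\mathbf{a} \in R^k$ and for all nonrandom $\mathbf{b} \in R^q$.

   • The covariance matrix $\operatorname{var}(A\mathbf{y} + B\mathbf{z}) = A \Sigma_{yy} A^T + A \Sigma_{yz} B^T + B \Sigma_{zy} A^T + B \Sigma_{zz} B^T$ for all $A \in R^{n \times k}$ and all $B \in R^{n \times q}$.

   • [PS05b, pp. 187–188] For any $A \in R^{q \times k}$ the covariance matrix

$$\operatorname{var}(\mathbf{z} - A\mathbf{y}) \succeq \operatorname{var}(\mathbf{z} - \Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y})$$

     with respect to the partial semidefinite ordering, and the partial covariance matrix

$$\operatorname{var}(\mathbf{z} - \Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y}) = \Sigma_{zz} - \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz} = \Sigma_{zz \cdot y} = \Sigma / \Sigma_{yy},$$

     the generalized Schur complement of $\Sigma_{yy}$ in Σ.

   • Let q = k. Then the covariance matrix var(y + z) = var(y) + var(z) if and only if cov(y, z) = −cov(z, y), i.e., the cross-covariance matrix cov(y, z) is skew-symmetric; the condition that cov(y, z) = 0 is sufficient, but not necessary (unless k = 1).

   • The vector $\Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y}$ is not necessarily invariant with respect to the choice of generalized inverse $\Sigma_{yy}^{-}$, but its covariance matrix $\operatorname{var}(\Sigma_{zy} \Sigma_{yy}^{-} \mathbf{y}) = \Sigma_{zy} \Sigma_{yy}^{-} \Sigma_{yz}$ is invariant (and, hence, unique).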
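The formula for the expectation of a quadratic form (Fact 17) can be verified exactly for a multinomial vector by brute-force enumeration. The sketch below is an illustration under assumed parameters (n = 3, a particular p, and an arbitrary nonsymmetric A); the helper names are not from the handbook.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def multinomial_pmf(counts, probs):
    """Multinomial joint probability with n = sum(counts), computed exactly."""
    coefficient = factorial(sum(counts))
    value = Fraction(1)
    for y, p in zip(counts, probs):
        coefficient //= factorial(y)
        value *= p ** y
    return coefficient * value

def quadratic_form(y, a):
    """y^T A y for a vector y and a square list-of-rows matrix A."""
    return sum(y[i] * a[i][j] * y[j] for i in range(len(y)) for j in range(len(y)))

n = 3
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
A = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]   # arbitrary, deliberately not symmetric
k = len(p)

# Left side: E(y^T A y) by enumerating all outcomes with y1 + ... + yk = n.
outcomes = [y for y in product(range(n + 1), repeat=k) if sum(y) == n]
lhs = sum(multinomial_pmf(y, p) * quadratic_form(y, A) for y in outcomes)

# Right side: mu^T A mu + tr(A Sigma) with mu = n p and Sigma = n [diag(p) - p p^T].
mu = [n * pi for pi in p]
sigma = [[n * ((p[i] if i == j else 0) - p[i] * p[j]) for j in range(k)] for i in range(k)]
trace_a_sigma = sum(A[i][j] * sigma[j][i] for i in range(k) for j in range(k))
rhs = quadratic_form(mu, A) + trace_a_sigma
```

Both sides agree exactly, and the same value is obtained from the closed form n(n − 1)pᵀAp + n Σ aᵢᵢpᵢ that results from substituting the multinomial mean and covariance matrix.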

Examples:

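1. Let the 4 × 1 random vector x have covariance matrix Σ. Consider the following partitioning:

$$\mathbf{x} = \begin{bmatrix} \mathbf{y} \\ \mathbf{z} \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 1 & 0 & a & b \\ 0 & 1 & c & d \\ a & c & 1 & 0 \\ b & d & 0 & 1 \end{bmatrix},$$

   where y and z each have 2 components. Then var(y + z) = var(y) + var(z) if and only if a = d = 0 and c = −b, with $b^2 \le 1$.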

2. Let the k × 1 random vector $\mathbf{y} = [Y_1, \ldots, Y_k]^T$ follow a multinomial distribution with parameters n and $\mathbf{p} = [p_1, \ldots, p_k]^T$, with $p_1 + \cdots + p_k = 1$ and $p_1 > 0, \ldots, p_k > 0$, and let the k × k matrix $A = [a_{ij}]$. Then the expectation $E(\mathbf{y}^T A \mathbf{y}) = n(n - 1)\,\mathbf{p}^T A \mathbf{p} + n \sum_{i=1}^{k} a_{ii} p_i$.

3. Let the 3 × 1 random vector $\mathbf{y} = [Y_1, Y_2, Y_3]^T$ follow a trinomial distribution with parameters n and $p_1, p_2, p_3$, with $p_1 + p_2 + p_3 = 1$ and $p_1 > 0$, $p_2 > 0$, $p_3 > 0$, and let the 3 × 1 vector $\mathbf{p} = [p_1, p_2, p_3]^T$. Then:

   • The expectation $E(\mathbf{y}) = n[p_1, p_2, p_3]^T$ and the covariance matrix

$$\Sigma_3 = \operatorname{var}(\mathbf{y}) = n \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 & -p_1 p_3 \\ -p_1 p_2 & p_2(1 - p_2) & -p_2 p_3 \\ -p_1 p_3 & -p_2 p_3 & p_3(1 - p_3) \end{bmatrix},$$

     which has rank equal to 2 since $\Sigma_3$ is singular and the determinant of the top left-hand 2 × 2 corner of $\Sigma_3$ equals $n^2 p_1 p_2 p_3 > 0$.

   • The partial covariance matrix of $Y_1$ and $Y_2$ adjusting for $Y_3$ is the Schur complement $\Sigma_3 / \big(n p_3(1 - p_3)\big) = nS$, say, where

$$S = \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2(1 - p_2) \end{bmatrix} - \frac{1}{p_3(1 - p_3)} \begin{bmatrix} -p_1 p_3 \\ -p_2 p_3 \end{bmatrix} \begin{bmatrix} -p_1 p_3 & -p_2 p_3 \end{bmatrix} = \frac{p_1 p_2}{p_1 + p_2} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},$$

     which has rank equal to 1, and so $\operatorname{rank}(\Sigma_3) = 2$.

   • When $p_1 = p_2 = p_3 = 1/3$, then the covariance matrix $\Sigma_3 = (n/3) C_3$ and the partial covariance matrix of $Y_1$ and $Y_2$ adjusting for $Y_3$ is $(n/3) C_2$; here, $C_h$ is the h × h centering matrix, h = 2, 3.

4. [PS05, p. 183] If the 3 × 3 symmetric matrix

$$R_3 = \begin{bmatrix} 1 & r_{12} & r_{13} \\ r_{12} & 1 & r_{23} \\ r_{13} & r_{23} & 1 \end{bmatrix} = \begin{bmatrix} R_2 & \mathbf{r}_2 \\ \mathbf{r}_2^T & 1 \end{bmatrix}$$

   is a correlation matrix, then $r_{ij}^2 \le 1$ for all 1 ≤ i < j ≤ 3. But not all symmetric matrices with diagonal elements all equal to 1 and all off-diagonal elements $r_{ij}$ such that $r_{ij}^2 \le 1$ are correlation matrices. For example, consider $R_3$ with $r_{13}^2 \le 1$ and $r_{23}^2 \le 1$. Then $R_3$ is a correlation matrix if and only if

$$r_{13} r_{23} - \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)} \;\le\; r_{12} \;\le\; r_{13} r_{23} + \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}.$$

   When $r_{13} = 0$ and $r_{12} = r_{23} = r$, say, then this condition becomes $r^2 \le 1/2$ and so the matrix

$$\begin{bmatrix} 1 & 0.8 & 0 \\ 0.8 & 1 & 0.8 \\ 0 & 0.8 & 1 \end{bmatrix}$$

   is not a correlation matrix.

   When $r_{12}^2 \le 1$, then the matrix $R_3$ is a correlation matrix if and only if any one of the following conditions holds:

   • $\det(R_3) = 1 - r_{12}^2 - r_{13}^2 - r_{23}^2 + 2 r_{12} r_{13} r_{23} \ge 0$.

   • (i) $\mathbf{r}_2 \in \operatorname{range}(R_2)$ and (ii) $1 \ge \mathbf{r}_2^T R_2^{-} \mathbf{r}_2$ for some and, hence, for every generalized inverse $R_2^{-}$.

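5. Let the random vector x be 2 × 1 and write

$$\mathbf{x} = \begin{bmatrix} Y \\ Z \end{bmatrix}, \qquad E(\mathbf{x}) = \begin{bmatrix} \mu \\ \nu \end{bmatrix}, \qquad \operatorname{var}(\mathbf{x}) = \Sigma = \begin{bmatrix} \sigma_y^2 & \sigma_{yz} \\ \sigma_{yz} & \sigma_z^2 \end{bmatrix},$$

   with $\sigma_y^2 > 0$. Then the residual vector $\mathbf{e}_{z \cdot y} = \mathbf{z} - \boldsymbol{\nu} - \Sigma_{zy} \Sigma_{yy}^{-} (\mathbf{y} - \boldsymbol{\mu})$ of the random vector z from its regression on y becomes the scalar residual

$$e_{Z \cdot Y} = Z - \nu - \frac{\sigma_{yz}}{\sigma_y^2} (Y - \mu)$$

   of the random variable Z from its regression on Y. The matrix of partial covariances of the random vector z after adjusting for y becomes the single partial variance

$$\sigma_{z \cdot y}^2 = \sigma_z^2 - \frac{\sigma_{yz}^2}{\sigma_y^2} = \sigma_z^2 (1 - \rho_{yz}^2)$$

   of the random variable Z after adjusting for the random variable Y; here, the correlation coefficient $\rho_{yz} = \sigma_{yz} / (\sigma_y \sigma_z)$.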
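The determinant condition of Example 4 gives a quick test for whether a 3 × 3 symmetric matrix with unit diagonal is a correlation matrix. The function below is a minimal sketch with a hypothetical name; exact fractions are used so that the sign of the determinant is not affected by rounding.

```python
from fractions import Fraction

def is_correlation_matrix_3x3(r12, r13, r23):
    """Determinant test from Example 4: with every r_ij^2 <= 1, R3 is a correlation
    matrix iff det(R3) = 1 - r12^2 - r13^2 - r23^2 + 2 r12 r13 r23 >= 0."""
    if max(r12 * r12, r13 * r13, r23 * r23) > 1:
        return False
    det = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    return det >= 0

# r13 = 0 and r12 = r23 = r: the condition reduces to r^2 <= 1/2.
rejected = is_correlation_matrix_3x3(Fraction(4, 5), 0, Fraction(4, 5))    # r^2 = 0.64
accepted = is_correlation_matrix_3x3(Fraction(7, 10), 0, Fraction(7, 10))  # r^2 = 0.49
```

The first call corresponds to the matrix with off-diagonal entries 0.8 shown in Example 4, which fails the test; the second uses r = 0.7, chosen here only because it falls just inside the bound.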

## 52.4 Linear Statistical Models: Basic Definitions and Facts

Notation:

In this section, the uppercase, light-face italic letter X is reserved for the nonrandom n × p model matrix

and V is reserved for an n × n covariance matrix. The uppercase, light-face italic letter H is reserved for the

(symmetric idempotent) n × n hat matrix X(X T X)− X T and M = I − H is reserved for the (symmetric

idempotent) n × n residual matrix. The lowercase, bold-face roman letter y is reserved for an observable

n × 1 random vector and x is reserved for a column of the n × p model matrix X.

Definitions:

The general linear model (or Gauss–Markov model or Gauß–Markov model) is the model

M = {y, Xβ, σ 2 V }

defined by the equation y = Xβ + ε, where E(y) = Xβ, E(ε) = 0, var(y) = var(ε) = σ 2 V. The vector

y is an n × 1 observable random vector, ε is an n × 1 unobservable random error vector, X is a known

n × p model matrix (or design matrix, particularly when its entries are −1, 0, or +1), β is a p × 1 vector

of unknown parameters, V is a known n × n positive semidefinite matrix, and σ 2 is an unknown positive

constant. The realization of the n × 1 observable random vector y will also be denoted by y.

The classical theory of linear statistical models covers the full-rank model, where X has full column

rank and V is positive definite. In the full-rank model, the ordinary least squares estimator

OLSE(β) = βˆ = (X T X)−1 X T y = X † y

and the generalized least squares estimator (or Aitken estimator)

GLSE(β) = β˜ = (X T V −1 X)−1 X T V −1 y,

where X † denotes the Moore–Penrose inverse of X.

When either X or V is (or both X and V are) rank deficient, then it is usually assumed that rank(X) <

rank(V ). The model M = {y, Xβ, σ 2 V } is called a weakly singular model (or Zyskind–Martin model)

whenever range(X) ⊆ range(V ), and then rank(X) ≤ rank(V ), and is consistent if the realization y

satisfies y ∈ range([X | V ]).

Let $\hat{\beta}$ be any vector minimizing $\|y - X\beta\|^2 = (y - X\beta)^T (y - X\beta)$. Then $\hat{y} = X\hat{\beta} = \mathrm{OLSE}(X\beta)$ is the ordinary least squares estimator (OLSE) of Xβ. When rank(X) < p, then $\hat{\beta}$ is an ordinary least squares solution to $\min_{\beta} (y - X\beta)^T (y - X\beta)$. Moreover, $\hat{\beta}$ is any solution to the normal equations $X^T X \hat{\beta} = X^T y$. The vector of OLS residuals is $e = y - \hat{y} = y - X\hat{\beta}$ and the residual sum of squares is $SSE = e^T e = (y - \hat{y})^T (y - \hat{y})$.

The coefficient of determination (or coefficient of multiple determination or squared multiple correlation) $R^2 = 1 - SSE / (y^T C_n y)$ identifies the proportion of variance explained in a multiple linear regression where the model matrix $X = [1_n \mid x_{[1]} \mid \cdots \mid x_{[p-1]}]$ with p − 1 regressor vectors (or regressors) $x_{[1]}, \ldots, x_{[p-1]}$, each n × 1. In simple linear regression p = 2 and the model matrix $X = [1_n \mid x]$ with the single regressor vector x. The sample correlation coefficient $r = x^T C_n y \big/ \sqrt{x^T C_n x \cdot y^T C_n y}$,

where it is usually assumed that x is an n × 1 nonrandom vector (such as a regressor vector) and y is a

realization of the n × 1 random vector y.
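In simple linear regression the coefficient of determination equals the squared sample correlation coefficient. The sketch below checks this on a small made-up data set, using the standard closed-form least squares slope and intercept for the model matrix [1 | x]; the data and helper name are illustrative assumptions.

```python
from fractions import Fraction

def centered_inner(u, v):
    """u^T C_n v: inner product of u and v after subtracting their means."""
    n = len(u)
    u_bar, v_bar = Fraction(sum(u), n), Fraction(sum(v), n)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

x = [1, 2, 3, 4]          # illustrative regressor values
y = [2, 3, 5, 4]          # illustrative responses

sxx, sxy, syy = centered_inner(x, x), centered_inner(x, y), centered_inner(y, y)
slope = sxy / sxx                                             # least squares slope
intercept = Fraction(sum(y), len(y)) - slope * Fraction(sum(x), len(x))
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - sse / syy                    # coefficient of determination
sample_r_squared = sxy ** 2 / (sxx * syy)    # square of the sample correlation
```

Because the arithmetic is exact, the two quantities agree as fractions rather than only up to rounding.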

Let the matrix A ∈ Rk×n and let the matrix K ∈ Rk× p . Then the linear estimator Ay is a linear

unbiased estimator (LUE) of K β if E(Ay) = K β for all β ∈ R p . Let the matrix B ∈ Rk×n . Then the

LUE By of Kβ is the best linear unbiased estimator (BLUE) of Kβ if it has the smallest covariance matrix (in the positive semidefinite ordering) in that $\operatorname{var}(Ay) \succeq \operatorname{var}(By)$ for all LUEs Ay of Kβ.

The hat matrix H = X(X T X)− X T associated with the model matrix X is so named since yˆ = Hy.

The residual matrix M = I − H and vector of OLS residuals is e = y − yˆ = y − Hy = My. Let the

nonrandom vector a ∈ Rn . Then the linear estimator aT y, which is unbiased for 0, i.e., E(aT y) = 0, is a

linear zero function.
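The hat matrix and residual matrix can be computed explicitly for a small full-column-rank model matrix. The following sketch uses a 3 × 2 matrix and a made-up realization of y; the helper names are assumptions, and the 2 × 2 inverse is written out by hand so that the example needs no external libraries.

```python
from fractions import Fraction

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    """Inverse of a nonsingular 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[1, 1], [1, 0], [1, -1]]          # full column rank, p = 2
Xt = transpose(X)
H = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)   # hat matrix X (X^T X)^{-1} X^T
M = [[Fraction(int(i == j)) - H[i][j] for j in range(3)] for i in range(3)]  # residual matrix

y = [3, 1, 2]                          # illustrative realization of y
fitted = [sum(h * yi for h, yi in zip(row, y)) for row in H]     # y_hat = H y
residuals = [sum(m * yi for m, yi in zip(row, y)) for row in M]  # e = M y
```

The computed H is symmetric and idempotent with trace 2, the rank of X, and the fitted values and residuals add back up to y.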

The Watson efficiency φ under the full-rank model M = {y, Xβ, σ 2 V }, with the n × p model matrix

X having full column rank equal to p < n and with the n × n covariance matrix V positive definite,

measures the relative efficiency of the OLSE(β) = βˆ vs. the BLUE(β) = β˜ and is defined by the ratio of

the corresponding generalized variances:

$$\phi = \frac{\det[\operatorname{var}(\tilde{\beta})]}{\det[\operatorname{var}(\hat{\beta})]} = \frac{\det^2(X^T X)}{\det(X^T V X) \cdot \det(X^T V^{-1} X)}.$$
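For a single regressor (p = 1) the determinants in the Watson efficiency are scalars, so the ratio is easy to evaluate. The sketch below further assumes a diagonal positive definite V, which is a simplification made here for illustration; the function name and numerical values are hypothetical.

```python
from fractions import Fraction

def watson_efficiency_p1(x, v_diag):
    """phi = (x^T x)^2 / (x^T V x * x^T V^{-1} x) for p = 1 and V = diag(v_diag) > 0."""
    xtx = sum(xi * xi for xi in x)
    xtvx = sum(vi * xi * xi for xi, vi in zip(x, v_diag))
    xtvinvx = sum(Fraction(xi * xi) / vi for xi, vi in zip(x, v_diag))
    return Fraction(xtx ** 2) / (xtvx * xtvinvx)

v_diag = [4, 2, 1]                       # eigenvalues of the diagonal V, largest first
lam_max, lam_min = max(v_diag), min(v_diag)
lower_bound = Fraction(4 * lam_max * lam_min, (lam_max + lam_min) ** 2)

phi_eigvec = watson_efficiency_p1([1, 0, 0], v_diag)   # x is an eigenvector of V
phi_worst = watson_efficiency_p1([1, 0, 1], v_diag)    # equal weight on the extreme eigenvectors
phi_other = watson_efficiency_p1([1, 1, 1], v_diag)
```

When x is an eigenvector of V the efficiency is 1, and the choice x = (1, 0, 1) attains the value 4λ₁λₙ/(λ₁ + λₙ)², which is the smallest possible efficiency for p = 1 by the Kantorovich inequality.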

The Bloomfield–Watson efficiency ψ under the general linear model M = {y, Xβ, σ²V} with no rank assumptions measures the relative efficiency of the OLSE(Xβ) = $X\hat{\beta}$ vs. the BLUE(β) = $\tilde{\beta}$ and is defined by $\psi = \tfrac{1}{2}\|HV - VH\|^2 = \|HVM\|^2$, where the norm $\|A\| = \operatorname{tr}^{1/2}(A^T A)$ is defined for any k × q matrix A.

The n ×n covariance matrix (1−ρ)In +ρ1n 1nT = (1−ρ)In +ρ J n has intraclass correlation structure

(or equicorrelation structure) and is the intraclass correlation matrix (or the equicorrelation matrix).

The parameter ρ is the intraclass correlation (or intraclass correlation coefficient).
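The intraclass correlation structure is simple enough to build and invert in closed form. The sketch below constructs the matrix (1 − ρ)Iₙ + ρ1ₙ1ₙᵀ, checks its row sum, and verifies a closed-form inverse of Sherman–Morrison type by multiplication; n = 4 and ρ = 1/3 are arbitrary choices, and the function names are assumptions.

```python
from fractions import Fraction

def intraclass_matrix(n, rho):
    """(1 - rho) I_n + rho 1_n 1_n^T: ones on the diagonal, rho elsewhere."""
    return [[Fraction(1) if i == j else rho for j in range(n)] for i in range(n)]

def intraclass_inverse(n, rho):
    """Closed form (1/(1-rho)) [I_n - rho/(1 + rho(n-1)) 1_n 1_n^T], for -1/(n-1) < rho < 1."""
    c = rho / (1 + rho * (n - 1))
    return [[(int(i == j) - c) / (1 - rho) for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

n, rho = 4, Fraction(1, 3)
R = intraclass_matrix(n, rho)
product = matmul(R, intraclass_inverse(n, rho))
identity = [[int(i == j) for j in range(n)] for i in range(n)]
row_sum = sum(R[0])     # eigenvalue belonging to the eigenvector 1_n: 1 + rho (n - 1)
```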

Facts:

The following facts, except for those with a specific reference, can be found in [Gro04], [PS89], or [SJ03,

§4.1–4.3]. Throughout this set of facts, X denotes the n × p nonrandom model matrix.

1. The hat matrix H = X(X T X)− X T associated with the model matrix X is invariant (unique)

with respect to choice of generalized inverse (X T X)− and is a symmetric idempotent matrix:

H = H T = H 2 , and rank(H) = tr(H) = rank(X). Moreover, the hat matrix H is the orthogonal

projector onto range(X).

2. If the p × p matrix Q is nonsingular, then the hat matrix associated with the model matrix X Q

equals the hat matrix associated with the model matrix X.

3. The residual sum of squares $SSE = y^T M y = (y - \hat{y})^T (y - \hat{y}) = y^T y - y^T X \hat{\beta}$, where M is the residual matrix and $\hat{\beta} = \mathrm{OLSE}(\beta)$.

4. In simple linear regression the coefficient of determination $R^2 = r^2$, the square of the sample correlation coefficient. In multiple linear regression with model matrix $X = [1_n \mid X_0] = [1_n \mid x_{[1]} \mid \cdots \mid x_{[p-1]}]$ and (p − 1) × 1 nonrandom vector $a \in R^{p-1}$,

$$R^2 = \max_{a} r_a^2 = \max_{a} \frac{(a^T X_0^T C_n y)^2}{a^T X_0^T C_n X_0 a \cdot y^T C_n y},$$

   the square of the sample correlation coefficient $r_a$ between the variables whose observed values are in vectors y and $X_0 a$.

5. The vector $X\hat{\beta}$ is invariant (unique) with respect to the choice of least squares solution $\hat{\beta}$, but $\hat{\beta}$ is unique if and only if X has full column rank equal to p ≤ n, and then $\hat{\beta} = \mathrm{OLSE}(\beta) = (X^T X)^{-1} X^T y = X^{\dagger} y$, where $X^{\dagger}$ is the Moore–Penrose inverse of X. The covariance matrix $\operatorname{var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1} X^T V X (X^T X)^{-1}$.


6. The Watson efficiency φ is always positive, and φ ≤ 1 with equality if and only if OLSE(β) =

BLUE(β).

7. [DLL02, p. 477], [Gus97, p. 67] Bloomfield–Watson–Knott Inequality. The Watson efficiency

$$\phi = \frac{\det^2(X^T X)}{\det(X^T V X) \cdot \det(X^T V^{-1} X)} \;\ge\; \prod_{i=1}^{m} \frac{4 \lambda_i \lambda_{n-i+1}}{(\lambda_i + \lambda_{n-i+1})^2},$$

   for all n × p model matrices X with full column rank p. Here m = min(p, n − p) and $\lambda_1 \ge \cdots \ge \lambda_n$ denote the necessarily positive eigenvalues of the n × n positive definite covariance matrix V. The ratios $4 \lambda_i \lambda_{n-i+1} / (\lambda_i + \lambda_{n-i+1})^2$ in the lower bound for the Watson efficiency are the squared antieigenvalues of the covariance matrix V.

8. [DLL02, p. 454] Let p = 1 and set the n × 1 model matrix X = x. Then the Bloomfield–Watson–Knott Inequality is the Kantorovich Inequality (or Frucht–Kantorovich Inequality):

$$\frac{(x^T x)^2}{x^T V x \cdot x^T V^{-1} x} \;\ge\; \frac{4 \lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2},$$

   where $\lambda_1$ and $\lambda_n$ are, respectively, the largest and smallest eigenvalues of the n × n positive definite covariance matrix V.

9. The Bloomfield–Watson efficiency

$$\psi = \tfrac{1}{2}\|HV - VH\|^2 = \|HVM\|^2 = \operatorname{tr}(HVMVH) = \operatorname{tr}(HVMV) = \operatorname{tr}(HV^2 - HVHV) = \operatorname{tr}(HV^2) - \operatorname{tr}\big[(HV)^2\big] \ge 0,$$

   with equality if and only if OLSE(β) = BLUE(β) if and only if the Watson efficiency φ = 1.

10. [DLL02, p. 473] The Bloomfield–Watson Trace Inequality. Let A be a nonrandom symmetric n × n

matrix, not necessarily positive semidefinite. Then for all the nonrandom matrices U ∈ Rn× p that

satisfy U T U = I p :

$$\operatorname{tr}(U^T A^2 U) - \operatorname{tr}\big[(U^T A U)^2\big] \;\le\; \frac{1}{4} \sum_{i=1}^{\min(p,\, n-p)} (\alpha_i - \alpha_{n-i+1})^2,$$

where α1 ≥ · · · ≥ αn denote the eigenvalues of the n × n matrix A.

11. The Bloomfield–Watson efficiency

$$\psi = \operatorname{tr}(HV^2) - \operatorname{tr}\big[(HV)^2\big] \;\le\; \frac{1}{4} \sum_{i=1}^{\min(p,\, n-p)} (\lambda_i - \lambda_{n-i+1})^2,$$

for all n × n hat matrices H with rank p (and so for all n × p model matrices X with full column

rank p). Here, λ1 ≥ · · · ≥ λn denote the necessarily positive eigenvalues of the n × n positive

definite covariance matrix V .

12. The n × n intraclass correlation matrix $R_{ic} = (1 - \rho) I_n + \rho 1_n 1_n^T$ has eigenvalues 1 − ρ with multiplicity n − 1 and 1 + ρ(n − 1) with multiplicity 1, and so $R_{ic}$ is singular if and only if ρ = −1/(n − 1) or ρ = 1.

13. The intraclass correlation coefficient ρ is such that −1/(n − 1) ≤ ρ ≤ 1 and the n × n intraclass

correlation matrix is positive definite if and only if −1/(n − 1) < ρ < 1.

14. The inverse of the n × n positive definite intraclass correlation matrix

$$\big[(1 - \rho) I_n + \rho 1_n 1_n^T\big]^{-1} = \frac{1}{1 - \rho} \left( I_n - \frac{\rho}{1 + \rho(n - 1)}\, 1_n 1_n^T \right).$$


15. Gauss–Markov Theorem (or Gauß–Markov Theorem). In the full-rank model {y, Xβ, σ 2 V },

the generalized least squares estimator β˜ = GLSE(β) = (X T V −1 X)−1 X T V −1 y = BLUE(β).

In the full-rank model {y, Xβ, σ 2 I }, the ordinary least-squares estimator OLSE(β) = βˆ =

(X T X)−1 X T y = X † y = BLUE(β).

16. In the model {y, Xβ, σ 2 V }, where V is positive definite, but with X possibly with less than full

column rank, the

BLUE(Xβ) = X(X T V −1 X)− X T V −1 y.

17. [Sea97, §5.4] Let the matrix K ∈ Rk× p . Then K β is estimable ⇐⇒ ∃ matrix A ∈ Rn×k : K T =

X T A ⇐⇒ range(K T ) ⊆ range(X T ) ⇐⇒ K βˆ is invariant for any choice of βˆ = (X T X)− X T y.

18. [Rao73b, p. 282] Consider the general linear model {y, Xβ, σ 2 V }, where X and V need not be

of full rank. Let the matrix G ∈ Rn×n . Then G y = BLUE(Xβ) ⇐⇒ G [X | V M] = [X | 0],

where the residual matrix M = I − H. Let the matrix A ∈ Rk×n and the matrix K ∈ Rk× p . Then

the corresponding condition for Ay to be the BLUE of an estimable parametric function K β is

A[X | V M] = [K | 0].

19. Let G 1 and G 2 both be n ×n. If G 1 y and G 2 y are two BLUEs of Xβ under the model {y, Xβ, σ 2 V },

then G 1 y = G 2 y for all y ∈ range([X | V ]). The matrix G yielding the BLUE is unique if and only

if range([X | V ]) = Rn .

20. Every linear zero function can be written as bT My for some nonrandom b ∈ Rn . Let the matrix

G ∈ Rn×n . Then an unbiased estimator G y = BLUE(Xβ) if and only if G y is uncorrelated with

every linear zero function.

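21. [Rao71] Let the matrix $A \in R^{n \times n}$. Then the linear estimator Ay = BLUE(Xβ) under the model {y, Xβ, σ²V} if and only if there exists a matrix Λ so that A is a solution to Pandora's box

$$\begin{bmatrix} V & X \\ X^T & 0 \end{bmatrix} \begin{bmatrix} A^T \\ \Lambda \end{bmatrix} = \begin{bmatrix} 0 \\ X^T \end{bmatrix}.$$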

22. [Rao71] Let the (n + p) × (n + p) matrix B be defined as any generalized inverse:

$$B = \begin{bmatrix} V & X \\ X^T & 0 \end{bmatrix}^{-} = \begin{bmatrix} B_1 & B_2 \\ B_3 & -B_4 \end{bmatrix}.$$

   Let $k^T \beta$ be estimable; then the BLUE($k^T \beta$) = $k^T \tilde{\beta} = k^T B_2^T y = k^T B_3 y$, the variance $\operatorname{var}(k^T \tilde{\beta}) = \sigma^2 k^T B_4 k$, and the quadratic form $y^T B_1 y / f$ is an unbiased estimator of σ² with f = rank([V | X]) − rank(X).

23. [PS89] In the model {y, Xβ, σ²V} with no rank assumptions, the OLSE(Xβ) = BLUE(Xβ) if and only if any one of the following equivalent conditions holds:

   • HV = VH.

   • HV = HVH.

   • HVM = 0.

   • $X^T V L = 0$, where the n × l matrix L has range(L) = range(M).

   • range(VX) ⊆ range(X).

   • range(VX) = range(X) ∩ range(V).

   • HVH ≤ V, i.e., V − HVH is positive semidefinite.

   • rank(V − HVH) = rank(V) − rank(HVH).

   • rank(V − HVH) = rank(V) − rank(VX).

   • range(X) has a basis consisting of r eigenvectors of V, where r = rank(X).

   • V can be expressed as $V = \alpha I + X A X^T + L B L^T$, where α ∈ R, range(L) = range(M), and the p × p matrices A and B are symmetric, and such that V is positive semidefinite.


More conditions can be obtained by replacing V with its Moore–Penrose inverse V † and the hat

matrix H with the residual matrix M = I − H.

24. Suppose that the positive definite covariance matrix V has h distinct eigenvalues: $\lambda_{\{1\}} > \lambda_{\{2\}} > \cdots > \lambda_{\{h\}} > 0$ with multiplicities $m_1, \ldots, m_h$, $\sum_{i=1}^{h} m_i = n$, and with associated orthonormalized sets of eigenvectors $U_{\{1\}}, \ldots, U_{\{h\}}$, respectively, $n \times m_1, \ldots, n \times m_h$. Then OLSE(Xβ) = BLUE(Xβ) if and only if any one of the following equivalent conditions holds:

   • $\operatorname{rank}(U_{\{1\}}^T X) + \cdots + \operatorname{rank}(U_{\{h\}}^T X) = \operatorname{rank}(X)$.

   • $U_{\{i\}}^T H U_{\{i\}} = (U_{\{i\}}^T H U_{\{i\}})^2$ for all i = 1, . . . , h.

   • $U_{\{i\}}^T H U_{\{j\}} = 0$ for all i ≠ j; i, j = 1, . . . , h.

25. [Rao73b] Let the p × p matrix U be such that the n × n matrix W = V + XU X T has range(W) =

range([X | V ]). Then the BLUE(Xβ) = X(X T W − X)− X T W − y.

26. When V is nonsingular, the n × n matrix G such that G y is the BLUE of Xβ is unique, but

when V is singular this may not be so. However, the numerical value of BLUE(Xβ) is unique with

probability 1.

27. [SJ03, §7.4] The residual vector associated with the BLUE(Xβ) is

$$\tilde{e} = y - X\tilde{\beta} = VM(MVM)^{-} M y = My + HVM(MVM)^{-} M y,$$

   which is invariant (unique) with respect to choice of generalized inverse $(MVM)^{-}$. The weighted sum of squares of BLUE residuals, which is needed when estimating σ², can be written as

$$(y - X\tilde{\beta})^T V^{-} (y - X\tilde{\beta}) = \tilde{e}^T V^{-} \tilde{e} = y^T M (MVM)^{-} M y.$$
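The commutation condition HV = VH for OLSE(Xβ) = BLUE(Xβ) and the Bloomfield–Watson efficiency ψ can be evaluated together on a small case. The sketch below takes the hat matrix of the 3 × 2 model matrix with rows (1, 1), (1, 0), (1, −1) and a diagonal covariance matrix diag(1, δ, 1); the helper names and the two values of δ tried are illustrative assumptions.

```python
from fractions import Fraction

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def bloomfield_watson(h, v):
    """psi = (1/2) ||HV - VH||^2 with the norm ||A||^2 = tr(A^T A)."""
    hv, vh = matmul(h, v), matmul(v, h)
    size = len(h)
    return sum((hv[i][j] - vh[i][j]) ** 2 for i in range(size) for j in range(size)) / 2

# Hat matrix of the model matrix X = [[1, 1], [1, 0], [1, -1]].
H = [[Fraction(5, 6), Fraction(1, 3), Fraction(-1, 6)],
     [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)],
     [Fraction(-1, 6), Fraction(1, 3), Fraction(5, 6)]]

def diag_v(delta):
    """Covariance matrix diag(1, delta, 1)."""
    return [[1, 0, 0], [0, delta, 0], [0, 0, 1]]

psi_equal = bloomfield_watson(H, diag_v(1))     # V = I: H and V commute
psi_unequal = bloomfield_watson(H, diag_v(2))   # delta = 2: HV differs from VH
```

With δ = 1 the matrices commute and ψ = 0, so OLSE and BLUE of Xβ coincide; with δ = 2 the commutator is nonzero and ψ is strictly positive.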

Examples:

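1. Let n = 3 and p = 2 with the model matrix $X = \begin{bmatrix} 1 & 1 \\ 1 & 0 \\ 1 & -1 \end{bmatrix}$. Then X has full column rank equal to 2, the matrix $X^T X$ is nonsingular, and the hat matrix is

$$H = X(X^T X)^{-} X^T = X(X^T X)^{-1} X^T = \frac{1}{6} \begin{bmatrix} 5 & 2 & -1 \\ 2 & 2 & 2 \\ -1 & 2 & 5 \end{bmatrix}$$

   with rank(H) = tr(H) = 2. The OLSE(β) is

$$\hat{\beta} = (X^T X)^{-1} X^T y = \begin{bmatrix} \tfrac{1}{3}(y_1 + y_2 + y_3) \\ \tfrac{1}{2}(y_1 - y_3) \end{bmatrix},$$

   where $y = [y_1, y_2, y_3]^T$. The vector of OLS residuals is

$$My = \frac{1}{6} \begin{bmatrix} 1 & -2 & 1 \\ -2 & 4 & -2 \\ 1 & -2 & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \frac{1}{6} (y_1 - 2y_2 + y_3) \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}$$

   with residual sum of squares $SSE = (y_1 - 2y_2 + y_3)^2 / 6$.

   Now let the variance σ² = 1 and let the covariance matrix

$$V = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \delta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$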
