A.3 Expectation, Variance and Covariance
A.3.2 Variance
The variance of a random variable X is defined as
$$\operatorname{Var}(X) = E\big[\{X - E(X)\}^2\big].$$
The variance of X may also be written as
$$\operatorname{Var}(X) = E(X^2) - E(X)^2$$
and
$$\operatorname{Var}(X) = \frac{1}{2}\, E\{(X_1 - X_2)^2\},$$
where X1 and X2 are independent copies of X, i.e. X, X1 and X2 are independent
and identically distributed. The square root $\sqrt{\operatorname{Var}(X)}$ of the variance of X is called
the standard deviation.
For any real numbers a and b, we have
$$\operatorname{Var}(a \cdot X + b) = a^2 \cdot \operatorname{Var}(X).$$
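The identities above can be checked exactly on a small example. The following sketch is an illustration added here, not part of the original text; it assumes a fair six-sided die as the random variable X and uses exact rational arithmetic, so no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Illustrative assumption: X is a fair six-sided die (values 1..6, probability 1/6 each).
values = range(1, 7)
p = Fraction(1, 6)

def expectation(g):
    """E{g(X)} for the fair die."""
    return sum(p * g(x) for x in values)

mean = expectation(lambda x: x)
var_def = expectation(lambda x: (x - mean) ** 2)          # E[{X - E(X)}^2]
var_shortcut = expectation(lambda x: x ** 2) - mean ** 2  # E(X^2) - E(X)^2

# (1/2) E{(X1 - X2)^2}, summed over all 36 equally likely pairs of independent copies.
var_pairs = Fraction(1, 2) * sum(
    p * p * (x1 - x2) ** 2 for x1, x2 in product(values, repeat=2)
)

# Var(a X + b) computed from the definition, for the arbitrary choice a = 2, b = 3.
a, b = 2, 3
mean_lin = expectation(lambda x: a * x + b)
var_lin = expectation(lambda x: (a * x + b - mean_lin) ** 2)

print(var_def, var_shortcut, var_pairs, var_lin)
```

All three expressions for the variance agree, and the variance of the linear transformation equals a² times the variance of X.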
A.3.3 Moments
Let k denote a positive integer. The kth moment $m_k$ of a random variable X is
defined as
$$m_k = E(X^k),$$
whereas the kth central moment is
$$c_k = E\{(X - m_1)^k\}.$$
The expectation E(X) of a random variable X is therefore its first moment, whereas
the variance Var(X) is the second central moment. Those two quantities are therefore often referred to as “the first two moments”. The third and fourth moments of a
random variable quantify its skewness and kurtosis, respectively. If the kth moment
of a random variable exists, then all lower moments also exist.
A.3.4 Conditional Expectation and Variance
For continuous random variables, the conditional expectation of Y given X = x,
$$E(Y \mid X = x) = \int y f(y \mid x)\, dy, \tag{A.17}$$
A
Probabilities, Random Variables and Distributions
is the expectation of Y with respect to the conditional density of Y given X = x,
f (y | x). It is a real number (if it exists). For discrete random variables, the integral
in (A.17) has to be replaced with a sum, and the conditional density has to be replaced with the conditional probability mass function f (y | x) = Pr(Y = y | X = x).
The conditional expectation can be interpreted as the mean of Y , if the value x of the
random variable X is already known. The conditional variance of Y given X = x,
$$\operatorname{Var}(Y \mid X = x) = E\big[\{Y - E(Y \mid X = x)\}^2 \,\big|\, X = x\big], \tag{A.18}$$
is the variance of Y , conditional on X = x. Note that the outer expectation is again
with respect to the conditional density of Y given X = x.
However, if we now consider the realisation x of the random variable X in
g(x) = E(Y | X = x) as unknown, then g(X) is a function of the random variable
X and therefore also a random variable. This is the conditional expectation of Y
given X. Similarly, the random variable h(X) with h(x) = Var(Y | X = x) is called
the conditional variance of Y given X. Note that the nomenclature is somewhat
confusing since only the addendum “of Y given X” indicates that we do not refer
to the numbers (A.17) or (A.18), respectively, but to the corresponding random
variables.
These definitions give rise to two useful results. The law of total expectation
states that the expectation of the conditional expectation of Y given X equals the
(ordinary) expectation of Y for any two random variables X and Y :
$$E(Y) = E\{E(Y \mid X)\}. \tag{A.19}$$
Equation (A.19) is also known as the law of iterated expectations. The law of total
variance provides a useful decomposition of the variance of Y :
$$\operatorname{Var}(Y) = E\{\operatorname{Var}(Y \mid X)\} + \operatorname{Var}\{E(Y \mid X)\}. \tag{A.20}$$
These two results are particularly useful if the first two moments of Y | X = x and
X are known. Calculation of expectation and variance of Y via (A.19) and (A.20) is
then often simpler than directly based on the marginal distribution of Y .
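A short worked example may help here. It is an illustration added under stated assumptions, not taken from the text: suppose X follows a gamma distribution with shape α and rate β, so that E(X) = α/β and Var(X) = α/β², and suppose Y given X = x is Poisson with mean x, so that E(Y | X) = Var(Y | X) = X. Then (A.19) and (A.20) give the first two moments of Y without deriving its marginal distribution:

```latex
\begin{align*}
E(Y) &= E\{E(Y \mid X)\} = E(X) = \frac{\alpha}{\beta}, \\
\operatorname{Var}(Y) &= E\{\operatorname{Var}(Y \mid X)\} + \operatorname{Var}\{E(Y \mid X)\}
  = E(X) + \operatorname{Var}(X)
  = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2}
  = \frac{\alpha}{\beta}\left(1 + \frac{1}{\beta}\right).
\end{align*}
```

These values agree with the mean and variance listed for the Poisson-gamma distribution in Table A.1 when ν = 1.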
A.3.5 Covariance
Let (X, Y) denote a bivariate random variable with joint probability mass or density function $f_{X,Y}(x, y)$. The covariance of X and Y is defined as
$$\operatorname{Cov}(X, Y) = E\big[\{X - E(X)\}\{Y - E(Y)\}\big] = E(XY) - E(X)\, E(Y),$$
where $E(XY) = \iint xy\, f_{X,Y}(x, y)\, dx\, dy$, see (A.16). Note that Cov(X, X) = Var(X)
and Cov(X, Y ) = 0 if X and Y are independent.
For any real numbers a, b, c and d, we have
Cov(a · X + b, c · Y + d) = a · c · Cov(X, Y ).
The covariance matrix of a p-dimensional random variable $X = (X_1, \dots, X_p)^\top$
is
$$\operatorname{Cov}(X) = E\big[\{X - E(X)\} \cdot \{X - E(X)\}^\top\big].$$
The covariance matrix can also be written as
$$\operatorname{Cov}(X) = E(X \cdot X^\top) - E(X) \cdot E(X)^\top$$
and has entry Cov(Xi , Xj ) in the ith row and j th column. In particular, on the
diagonal of Cov(X) there are the variances of the components of X.
If X is a p-dimensional random variable and A is a q × p matrix, we have
$$\operatorname{Cov}(A \cdot X) = A \cdot \operatorname{Cov}(X) \cdot A^\top.$$
In particular, for the bivariate random variable (X, Y ) and matrix A = (1, 1), we
have
Var(X + Y ) = Var(X) + Var(Y ) + 2 · Cov(X, Y ).
If X and Y are independent, then
Var(X + Y ) = Var(X) + Var(Y )
and
$$\operatorname{Var}(X \cdot Y) = E(X)^2 \operatorname{Var}(Y) + E(Y)^2 \operatorname{Var}(X) + \operatorname{Var}(X) \operatorname{Var}(Y).$$
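The sum and product formulas can be verified exactly on small discrete distributions. The sketch below is an added illustration; both joint probability mass functions are hypothetical and chosen only to keep the arithmetic transparent.

```python
from fractions import Fraction

def expect(g, pmf):
    """E{g(X, Y)} under a joint pmf given as {(x, y): probability}."""
    return sum(prob * g(x, y) for (x, y), prob in pmf.items())

def var(g, pmf):
    """Var{g(X, Y)} computed from the definition."""
    m = expect(g, pmf)
    return expect(lambda x, y: (g(x, y) - m) ** 2, pmf)

def cov(pmf):
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    ex = expect(lambda x, y: x, pmf)
    ey = expect(lambda x, y: y, pmf)
    return expect(lambda x, y: x * y, pmf) - ex * ey

# Hypothetical joint pmf of a dependent pair (X, Y).
joint = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10), (1, 1): Fraction(4, 10),
}
var_x = var(lambda x, y: x, joint)
var_y = var(lambda x, y: y, joint)
var_sum = var(lambda x, y: x + y, joint)
sum_formula = var_x + var_y + 2 * cov(joint)

# Hypothetical independent pair: the joint pmf is the product of the marginals.
px = {1: Fraction(1, 2), 3: Fraction(1, 2)}
py = {0: Fraction(1, 4), 2: Fraction(3, 4)}
indep = {(x, y): px[x] * py[y] for x in px for y in py}
ex = expect(lambda x, y: x, indep)
ey = expect(lambda x, y: y, indep)
vx = var(lambda x, y: x, indep)
vy = var(lambda x, y: y, indep)
var_prod = var(lambda x, y: x * y, indep)
prod_formula = ex ** 2 * vy + ey ** 2 * vx + vx * vy

print(var_sum, sum_formula, var_prod, prod_formula)
```

In the dependent example the covariance term is needed to reproduce Var(X + Y); in the independent example the covariance is zero and the product formula matches the directly computed Var(X · Y).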
A.3.6 Correlation
The correlation of X and Y is defined as
$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \operatorname{Var}(Y)}},$$
as long as the variances Var(X) and Var(Y ) are positive.
An important property of the correlation is
$$|\operatorname{Corr}(X, Y)| \le 1, \tag{A.21}$$
which can be shown with the Cauchy–Schwarz inequality (after Augustin Louis
Cauchy, 1789–1857, and Hermann Amandus Schwarz, 1843–1921). This inequality
states that for two random variables X and Y with finite second moments $E(X^2)$ and
$E(Y^2)$,
$$E(X \cdot Y)^2 \le E(X^2)\, E(Y^2). \tag{A.22}$$
Applying (A.22) to the random variables X − E(X) and Y − E(Y ), one obtains
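$$E\big[\{X - E(X)\}\{Y - E(Y)\}\big]^2 \le E\big[\{X - E(X)\}^2\big]\, E\big[\{Y - E(Y)\}^2\big],$$
from which
$$\operatorname{Corr}(X, Y)^2 = \frac{\operatorname{Cov}(X, Y)^2}{\operatorname{Var}(X) \operatorname{Var}(Y)} \le 1,$$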
i.e. (A.21), easily follows. If Y = a · X + b for some a > 0 and b, then Corr(X, Y ) =
+1. If a < 0, then Corr(X, Y ) = −1.
Let Σ denote the covariance matrix of a p-dimensional random variable
$X = (X_1, \dots, X_p)^\top$. The correlation matrix R of X can be obtained via
$$R = S \Sigma S,$$
where S denotes the diagonal matrix with entries equal to the inverse standard deviations
$1/\sqrt{\operatorname{Var}(X_i)}$, i = 1, . . . , p, of the components of X. A correlation matrix R has the entry Corr(Xi , Xj ) in the ith row and jth column. In particular, the diagonal elements
are all one.
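The rescaling can be sketched in a few lines. This is an added illustration with a hypothetical 3 × 3 covariance matrix; S is taken to hold the inverse standard deviations 1/√Var(Xi) on its diagonal, which is what makes the diagonal of R equal to one.

```python
import math

# Hypothetical covariance matrix (symmetric, positive definite); values are illustrative.
sigma = [
    [4.0, 2.0, 0.6],
    [2.0, 9.0, -1.5],
    [0.6, -1.5, 1.0],
]
p = len(sigma)

# Diagonal entries of S: inverse standard deviations 1 / sqrt(Var(X_i)).
s = [1.0 / math.sqrt(sigma[i][i]) for i in range(p)]

# Because S is diagonal, (S Sigma S)_{ij} = s_i * Sigma_{ij} * s_j.
R = [[s[i] * sigma[i][j] * s[j] for j in range(p)] for i in range(p)]

for row in R:
    print([round(v, 4) for v in row])
```

Every off-diagonal entry of R is Cov(Xi, Xj) divided by the product of the two standard deviations, in line with the definition of the correlation.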
A.3.7 Jensen’s Inequality
Let X denote a random variable with finite expectation E(X), and g(x) a convex
function (if the second derivative $g''(x)$ exists, this is equivalent to $g''(x) \ge 0$ for
all x ∈ R). Then
$$E\{g(X)\} \ge g\{E(X)\}.$$
If g(x) is even strictly convex ($g''(x) > 0$ for all real x) and X is not a constant, i.e.
not degenerate, then
$$E\{g(X)\} > g\{E(X)\}.$$
For (strictly) concave functions g(x) (if the second derivative $g''(x)$ exists, this is
equivalent to the fact that for all x ∈ R, $g''(x) \le 0$ and $g''(x) < 0$, respectively), the
analogous results
$$E\{g(X)\} \le g\{E(X)\}$$
and
$$E\{g(X)\} < g\{E(X)\},$$
respectively, can be obtained.
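Two standard special cases, added here as an illustration and assuming that the expectations involved exist, show how the inequality is typically used:

```latex
\begin{align*}
g(x) = x^2 \ \text{(convex)}: \quad
  & E(X^2) \ge E(X)^2, \quad \text{i.e. } \operatorname{Var}(X) = E(X^2) - E(X)^2 \ge 0, \\
g(x) = \log x \ \text{(concave, } X > 0\text{)}: \quad
  & E(\log X) \le \log E(X).
\end{align*}
```

The first case recovers the non-negativity of the variance; the second is the form that appears when expectations of log-ratios are bounded.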
A.3.8 Kullback–Leibler Discrepancy and Information Inequality
Let $f_X(x)$ and $f_Y(y)$ denote two density or probability functions, respectively, of
random variables X and Y. The quantity
$$D(f_X \,\|\, f_Y) = E\left\{\log \frac{f_X(X)}{f_Y(X)}\right\} = E\{\log f_X(X)\} - E\{\log f_Y(X)\}$$
is called the Kullback–Leibler discrepancy from fX to fY (after Solomon Kullback,
1907–1994, and Richard Leibler, 1914–2003) and quantifies effectively the “distance” between fX and fY . However, note that in general
$$D(f_X \,\|\, f_Y) \ne D(f_Y \,\|\, f_X)$$
since $D(f_X \,\|\, f_Y)$ is not symmetric in $f_X$ and $f_Y$, so $D(\cdot \,\|\, \cdot)$ is not a distance in the
usual sense.
If X and Y have equal support, then the information inequality holds:
$$D(f_X \,\|\, f_Y) = E\left\{\log \frac{f_X(X)}{f_Y(X)}\right\} \ge 0,$$
where equality holds if and only if fX (x) = fY (x) for all x ∈ R.
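For discrete distributions the expectation is a finite sum, which makes the asymmetry and the information inequality easy to inspect numerically. The sketch below is an added illustration; the two probability functions are hypothetical and share the same three-point support.

```python
import math

def kl(f, g):
    """D(f || g) = sum_x f(x) * log{f(x) / g(x)} for pmfs on a common finite support."""
    return sum(fx * math.log(fx / gx) for fx, gx in zip(f, g))

# Hypothetical probability functions on the same three support points.
f_x = [0.5, 0.4, 0.1]
f_y = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

d_xy = kl(f_x, f_y)   # discrepancy from f_X to f_Y
d_yx = kl(f_y, f_x)   # discrepancy from f_Y to f_X
d_xx = kl(f_x, f_x)   # zero, since both arguments coincide

print(d_xy, d_yx, d_xx)
```

Both discrepancies are positive but differ from each other, while the discrepancy of a distribution from itself is zero.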
A.4 Convergence of Random Variables
After a definition of the different modes of convergence of random variables, several
limit theorems are described, which have important applications in statistics.
A.4.1 Modes of Convergence
Let X1 , X2 , . . . be a sequence of random variables. We say:
1. Xn → X converges in rth mean, r ≥ 1, written as $X_n \xrightarrow{r} X$, if $E(|X_n|^r) < \infty$
   for all n and
   $$E(|X_n - X|^r) \to 0 \quad \text{as } n \to \infty.$$
   The case r = 2, called convergence in mean square, is often of particular interest.
2. Xn → X converges in probability, written as $X_n \xrightarrow{P} X$, if
   $$\Pr(|X_n - X| > \varepsilon) \to 0 \quad \text{as } n \to \infty \text{ for all } \varepsilon > 0.$$
3. Xn → X converges in distribution, written as $X_n \xrightarrow{D} X$, if
   $$\Pr(X_n \le x) \to \Pr(X \le x) \quad \text{as } n \to \infty$$
   for all points x ∈ R at which the distribution function $F_X(x) = \Pr(X \le x)$ is
   continuous.
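The following relationships between the different modes of convergence can be established:
$$X_n \xrightarrow{r} X \implies X_n \xrightarrow{P} X \quad \text{for any } r \ge 1,$$
$$X_n \xrightarrow{P} X \implies X_n \xrightarrow{D} X,$$
$$X_n \xrightarrow{D} c \implies X_n \xrightarrow{P} c,$$
where c ∈ R is a constant.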
A.4.2 Continuous Mapping and Slutsky’s Theorem
The continuous mapping theorem states that any continuous function g : R → R is
limit-preserving for convergence in probability and convergence in distribution:
$$X_n \xrightarrow{P} X \implies g(X_n) \xrightarrow{P} g(X),$$
$$X_n \xrightarrow{D} X \implies g(X_n) \xrightarrow{D} g(X).$$
Slutsky’s theorem states that the limits of $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} a \in R$ are preserved
under addition and multiplication:
$$X_n + Y_n \xrightarrow{D} X + a,$$
$$X_n \cdot Y_n \xrightarrow{D} a \cdot X.$$
A.4.3 Law of Large Numbers
Let X1 , X2 , . . . be a sequence of independent and identically distributed random
variables with finite expectation μ. Then
$$\frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{P} \mu \quad \text{as } n \to \infty.$$
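A small simulation makes the statement tangible. The sketch below is an added illustration, assuming independent draws from the uniform distribution on [0, 1] (so μ = 0.5); the seed and sample sizes are arbitrary choices.

```python
import random

random.seed(1)  # fixed seed so that the run is reproducible

mu = 0.5  # expectation of the uniform distribution on [0, 1]

def sample_mean(n):
    """Arithmetic mean of n independent U(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

# Sample means for increasing n; the deviation from mu tends to shrink.
means = {n: sample_mean(n) for n in (10, 1_000, 100_000)}
for n, m in means.items():
    print(n, round(m, 4), round(abs(m - mu), 4))
```

A single run only shows one realisation of the sequence of means; the law of large numbers is a statement about the probability of large deviations, which becomes small as n grows.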
A.4.4 Central Limit Theorem
Let X1 , X2 , . . . denote a sequence of independent and identically distributed random
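variables with mean μ = E(Xi ) < ∞ and finite, non-zero variance ($0 < \operatorname{Var}(X_i) = \sigma^2 < \infty$). Then, as n → ∞,
$$\frac{1}{\sqrt{n\sigma^2}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \xrightarrow{D} Z,$$
where Z ∼ N(0, 1). A more compact notation is
$$\frac{1}{\sqrt{n\sigma^2}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \overset{a}{\sim} N(0, 1),$$
where $\overset{a}{\sim}$ stands for “is asymptotically distributed as”.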
If $X_1, X_2, \dots$ denotes a sequence of independent and identically distributed
p-dimensional random variables with mean $\mu = E(X_i)$ and finite, positive definite
covariance matrix $\Sigma = \operatorname{Cov}(X_i)$, then, as n → ∞,
$$\frac{1}{\sqrt{n}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \xrightarrow{D} Z,$$
where $Z \sim N_p(0, \Sigma)$ denotes a p-dimensional normal distribution with expectation 0 and covariance matrix Σ, compare Appendix A.5.3. In more compact notation we have
$$\frac{1}{\sqrt{n}} \left( \sum_{i=1}^{n} X_i - n\mu \right) \overset{a}{\sim} N_p(0, \Sigma).$$
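The univariate statement can be illustrated by simulation. The following sketch is an added example under stated assumptions: the Xi are uniform on [0, 1] (μ = 1/2, σ² = 1/12), each sum has n = 30 terms, and the seed and number of replications are arbitrary.

```python
import math
import random
from statistics import NormalDist

random.seed(2)  # fixed seed so that the run is reproducible

n = 30                         # number of summands per replication
reps = 20_000                  # number of simulated standardised sums
mu, sigma2 = 0.5, 1.0 / 12.0   # mean and variance of U(0, 1)

def standardised_sum():
    """(sum of X_i - n * mu) / sqrt(n * sigma^2) for n uniform draws."""
    total = sum(random.random() for _ in range(n))
    return (total - n * mu) / math.sqrt(n * sigma2)

z = [standardised_sum() for _ in range(reps)]

# Compare the empirical distribution function at 1.96 with the N(0, 1) value.
empirical = sum(v <= 1.96 for v in z) / reps
print(round(empirical, 4), round(NormalDist().cdf(1.96), 4))
```

The empirical proportion should lie close to the standard normal value of about 0.975, up to simulation error.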
A.4.5 Delta Method
Consider $T_n = \frac{1}{n} \sum_{i=1}^{n} X_i$, where the $X_i$s are independent and identically distributed random variables with finite expectation μ and variance σ². Suppose g(·)
is (at least in a neighbourhood of μ) continuously differentiable with derivative $g'$
and $g'(\mu) \ne 0$. Then
$$\sqrt{n}\, \{g(T_n) - g(\mu)\} \overset{a}{\sim} N\big(0,\ g'(\mu)^2 \cdot \sigma^2\big) \quad \text{as } n \to \infty.$$
Somewhat simplifying, the delta method states that
$$g(Z) \overset{a}{\sim} N\big(g(\nu),\ g'(\nu)^2 \cdot \tau^2\big) \quad \text{if } Z \overset{a}{\sim} N(\nu, \tau^2).$$
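As a worked illustration (added here, assuming μ > 0 so that the logarithm is differentiable at μ), take g(x) = log x with derivative g'(x) = 1/x:

```latex
\begin{align*}
g(x) = \log x, \quad g'(\mu) = \frac{1}{\mu} \ne 0
\quad \Longrightarrow \quad
\sqrt{n}\, \{\log(T_n) - \log(\mu)\} \overset{a}{\sim} N\!\left(0,\ \frac{\sigma^2}{\mu^2}\right)
\quad \text{as } n \to \infty.
\end{align*}
```

The asymptotic standard deviation of log(Tn) on the √n scale is therefore σ/μ, the coefficient of variation of the Xi.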
Now consider $T_n = \frac{1}{n}(X_1 + \dots + X_n)$, where the p-dimensional random variables $X_i$ are independent and identically distributed with finite expectation μ and
covariance matrix Σ. Suppose that $g : R^p \to R^q$ (q ≤ p) is a mapping continuously differentiable in a neighbourhood of μ with q × p Jacobian matrix D (cf.
Appendix B.2.2) of full rank q. Then
$$\sqrt{n}\, \{g(T_n) - g(\mu)\} \overset{a}{\sim} N_q\big(0,\ D \Sigma D^\top\big) \quad \text{as } n \to \infty.$$
Somewhat simplifying, the multivariate delta method states that if $Z \overset{a}{\sim} N_p(\nu, T)$,
then
$$g(Z) \overset{a}{\sim} N_q\big(g(\nu),\ D T D^\top\big).$$
A.5 Probability Distributions
In this section we summarise the most important properties of the probability distributions used in this book. A random variable is denoted by X, and its probability or
density function is denoted by f (x). The probability or density function is defined
for values in the support T of each distribution and is always zero outside of T . For
each distribution, the mean E(X), variance Var(X) and mode Mod(X) are listed, if
appropriate.
In the first row we list the name of the distribution, an abbreviation and the core
of the corresponding R-function (e.g. norm), indicating the parametrisation implemented in R. Depending on the first letter, these functions can be conveniently used
as follows:
r stands for random and generates independent random numbers or vectors from
the distribution considered. For example, rnorm(n, mean = 0, sd = 1)
generates n random numbers from the standard normal distribution.
d stands for density and returns the probability and density function, respectively.
For example, dnorm(x) gives the density of the standard normal distribution.
p stands for probability and gives the distribution function F (x) = Pr(X ≤ x) of
X. For example, if X is standard normal, then pnorm(0) returns 0.5, while
pnorm(1.96) is 0.975002 ≈ 0.975.
q stands for quantile and gives the quantile function. For example, qnorm(0.975)
is 1.959964 ≈ 1.96.
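The numerical values quoted above can be cross-checked outside R. The sketch below is an added illustration in Python, not R code: it uses `statistics.NormalDist` from the Python standard library as a rough counterpart of the `norm` family, and the seed is an arbitrary choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

draws = std_normal.samples(5, seed=3)   # rough counterpart of rnorm(5)
density_at_0 = std_normal.pdf(0.0)      # rough counterpart of dnorm(0)
prob_0 = std_normal.cdf(0.0)            # rough counterpart of pnorm(0)
prob_196 = std_normal.cdf(1.96)         # rough counterpart of pnorm(1.96)
quantile = std_normal.inv_cdf(0.975)    # rough counterpart of qnorm(0.975)

print(prob_0, round(prob_196, 6), round(quantile, 6))
```

The distribution and quantile functions are inverse to each other, which is why 1.96 and 0.975 appear as a pair.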
The first argument of each function is not listed since it depends on the particular
function used. It is either the number n of random variables generated, a value x in
the domain T of the random variable or a probability p ∈ [0, 1]. The arguments x
and p can be vectors, as well as some parameter values. The option log = TRUE
in d-type functions returns the log of the density; in p- and q-type functions the corresponding option is log.p = TRUE, which means that probabilities are returned, respectively supplied, on the log scale. For example, multiplication of very small numbers, which may cause numerical problems,
can be replaced by addition of the log numbers and subsequent application of the
exponential function exp() to the obtained sum.
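The numerical problem mentioned here is easy to reproduce. The sketch below is an added illustration in Python rather than R; the 400 identical observations are hypothetical and only serve to make the product of standard normal densities small enough to underflow.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()

# Hypothetical data: 400 observations, all equal to 3.0.
x = [3.0] * 400

# Direct product of densities: each factor is about 0.0044,
# so the product falls below the smallest positive double and becomes 0.0.
direct = math.prod(std_normal.pdf(v) for v in x)

# Sum of log-densities, using log phi(v) = -0.5 * log(2 * pi) - 0.5 * v^2.
log_lik = sum(-0.5 * math.log(2.0 * math.pi) - 0.5 * v * v for v in x)

print(direct, round(log_lik, 4))
```

The sum of logs remains a finite number that can be compared or differenced; exp() is applied at the end only when the result is within floating-point range.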
With the option lower.tail = FALSE, available in p- and q-type functions,
the upper tail of the distribution function Pr(X > x) and the upper quantile z with
Pr(X > z) = p, respectively, are returned. Further details can be found in the documentation to each function, e.g. by typing ?rnorm.
A.5.1 Univariate Discrete Distributions
Table A.1 gives some elementary facts about the most important univariate discrete
distributions used in this book. The function sample can be applied in various settings, for example to simulate discrete random variables with finite support or for
resampling. Functions for the beta-binomial distribution (except for the quantile
function) are available in the package VGAM. The density and random number generator functions of the noncentral hypergeometric distribution are available in the
package MCMCpack.
Table A.1 Univariate discrete distributions
Urn model:
sample(x, size, replace = FALSE, prob = NULL)
A sample of size size is drawn from an urn with elements x. The corresponding sample
probabilities are listed in the vector prob, which does not need to be normalised. If prob = NULL,
all elements are equally likely. If replace = TRUE, these probabilities do not change after the
first draw. The default, however, is replace = FALSE, in which case the probabilities are updated
draw by draw. The call sample(x) takes a random sample of size length(x) without
replacement, hence returns a random permutation of the elements of x. The call sample(x,
replace = TRUE) returns a random sample from the empirical distribution function of x, which
is useful for (nonparametric) bootstrap approaches.
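Bernoulli: B(π)
_binom(. . . , size = 1, prob = π)
0 < π < 1
T = {0, 1}
$f(x) = \pi^x (1 - \pi)^{1 - x}$
$\operatorname{Mod}(X) = \begin{cases} 0, & \pi \le 0.5, \\ 1, & \pi \ge 0.5. \end{cases}$
$E(X) = \pi$
$\operatorname{Var}(X) = \pi(1 - \pi)$
If Xi ∼ B(π), i = 1, . . . , n, are independent, then $\sum_{i=1}^{n} X_i \sim \operatorname{Bin}(n, \pi)$.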
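Binomial: Bin(n, π)
_binom(. . . , size = n, prob = π)
0 < π < 1, n ∈ N
T = {0, . . . , n}
$f(x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
$\operatorname{Mod}(X) = \begin{cases} \lfloor z_m \rfloor, \ z_m = (n + 1)\pi, & z_m \notin N, \\ z_m - 1 \text{ and } z_m, & \text{else.} \end{cases}$
$E(X) = n\pi$
$\operatorname{Var}(X) = n\pi(1 - \pi)$
The case n = 1 corresponds to a Bernoulli distribution with success probability π. If
$X_i \sim \operatorname{Bin}(n_i, \pi)$, i = 1, . . . , n, are independent, then $\sum_{i=1}^{n} X_i \sim \operatorname{Bin}\big(\sum_{i=1}^{n} n_i, \pi\big)$.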
Geometric: Geom(π)
_geom(. . . , prob = π)
0 < π < 1
T = N
$f(x) = \pi (1 - \pi)^{x - 1}$
$E(X) = 1/\pi$
$\operatorname{Var}(X) = (1 - \pi)/\pi^2$
Caution: The functions in R relate to the random variable X − 1, i.e. the number of failures until a
success has been observed. If Xi ∼ Geom(π), i = 1, . . . , n, are independent, then
$\sum_{i=1}^{n} X_i \sim \operatorname{NBin}(n, \pi)$.
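Hypergeometric: HypGeom(n, N, M)
_hyper(. . . , m = M, n = N − M, k = n)
N ∈ N, M ∈ {0, . . . , N}, n ∈ {1, . . . , N}
T = {max{0, n + M − N}, . . . , min{n, M}}
$f(x) = C \cdot \binom{M}{x} \binom{N - M}{n - x}$, with $C = \binom{N}{n}^{-1}$
$\operatorname{Mod}(X) = \begin{cases} \lfloor x_m \rfloor, & x_m \notin N, \\ x_m - 1 \text{ and } x_m, & \text{else,} \end{cases}$ with $x_m = \frac{(n + 1)(M + 1)}{N + 2}$
$E(X) = n \frac{M}{N}$
$\operatorname{Var}(X) = n \frac{M}{N} \cdot \frac{N - M}{N} \cdot \frac{N - n}{N - 1}$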
Noncentral hypergeometric: NCHypGeom(n, N, M, θ)
{r, d}noncenhypergeom(. . . , n1 = M, n2 = N − M, m1 = n, psi = θ)
N ∈ N, M ∈ {0, . . . , N}, n ∈ {0, . . . , N}, θ ∈ R+
T = {max{0, n + M − N}, . . . , min{n, M}}
$f(x) = C \cdot \binom{M}{x} \binom{N - M}{n - x} \theta^x$, with $C = \Big\{ \sum_{x \in T} \binom{M}{x} \binom{N - M}{n - x} \theta^x \Big\}^{-1}$
$\operatorname{Mod}(X) = \left\lfloor \frac{-2c}{b + \operatorname{sign}(b) \sqrt{b^2 - 4ac}} \right\rfloor$, with
$a = \theta - 1$,
$b = -\{(M + n + 2)\theta + N - M - n\}$,
$c = \theta (M + 1)(n + 1)$
This distribution arises if X ∼ Bin(M, π1 ) is independent of Y ∼ Bin(N − M, π2 ) and Z = X + Y;
then X | Z = n ∼ NCHypGeom(n, N, M, θ) with the odds ratio $\theta = \frac{\pi_1 (1 - \pi_2)}{(1 - \pi_1) \pi_2}$. For θ = 1, this
reduces to HypGeom(n, N, M).
Negative binomial: NBin(r, π)
_nbinom(. . . , size = r, prob = π)
0 < π < 1, r ∈ N
T = {r, r + 1, . . . }
$f(x) = \binom{x - 1}{r - 1} \pi^r (1 - \pi)^{x - r}$
$\operatorname{Mod}(X) = \begin{cases} \lfloor z_m \rfloor, \ z_m = 1 + \frac{r - 1}{\pi}, & z_m \notin N, \\ z_m - 1 \text{ and } z_m, & \text{else.} \end{cases}$
$E(X) = \frac{r}{\pi}$
$\operatorname{Var}(X) = \frac{r(1 - \pi)}{\pi^2}$
Caution: The functions in R relate to the random variable X − r, i.e. the number of failures until r
successes have been observed. The NBin(1, π) distribution is a geometric distribution with
parameter π. If $X_i \sim \operatorname{NBin}(r_i, \pi)$, i = 1, . . . , n, are independent, then
$\sum_{i=1}^{n} X_i \sim \operatorname{NBin}\big(\sum_{i=1}^{n} r_i, \pi\big)$.
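Poisson: Po(λ)
_pois(. . . , lambda = λ)
λ > 0
T = N0
$f(x) = \frac{\lambda^x}{x!} \exp(-\lambda)$
$\operatorname{Mod}(X) = \begin{cases} \lfloor \lambda \rfloor, & \lambda \notin N, \\ \lambda - 1 \text{ and } \lambda, & \text{else.} \end{cases}$
$E(X) = \lambda$
$\operatorname{Var}(X) = \lambda$
If $X_i \sim \operatorname{Po}(\lambda_i)$, i = 1, . . . , n, are independent, then $\sum_{i=1}^{n} X_i \sim \operatorname{Po}\big(\sum_{i=1}^{n} \lambda_i\big)$.
Moreover, $X_1 \mid \{X_1 + X_2 = n\} \sim \operatorname{Bin}(n, \lambda_1/(\lambda_1 + \lambda_2))$.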
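Poisson-gamma: PoG(α, β, ν)
_nbinom(. . . , size = α, prob = β/(β + ν))
α, β > 0
T = N0
$f(x) = C \cdot \frac{\Gamma(\alpha + x)}{x!} \left( \frac{\nu}{\beta + \nu} \right)^x$, with $C = \left( \frac{\beta}{\beta + \nu} \right)^{\alpha} \frac{1}{\Gamma(\alpha)}$
$\operatorname{Mod}(X) = \begin{cases} \left\lceil \frac{\nu(\alpha - 1)}{\beta} - 1 \right\rceil, & \alpha\nu > \beta + \nu, \\ 0 \text{ and } 1, & \alpha\nu = \beta + \nu, \\ 0, & \alpha\nu < \beta + \nu. \end{cases}$
$E(X) = \nu \frac{\alpha}{\beta}$
$\operatorname{Var}(X) = \alpha \frac{\nu}{\beta} \left( 1 + \frac{\nu}{\beta} \right)$
The gamma function Γ(x) is described in Appendix B.2.1. The Poisson-gamma distribution
generalises the negative binomial distribution, since $X + \alpha \sim \operatorname{NBin}\big(\alpha, \frac{\beta}{\beta + \nu}\big)$, if α ∈ N. In R there
is only one function for both distributions.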
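Beta-binomial: BeB(n, α, β)
{r, d, p}betabin.ab(. . . , size = n, α, β)
α, β > 0, n ∈ N
T = {0, . . . , n}
$f(x) = \binom{n}{x} \frac{B(\alpha + x, \beta + n - x)}{B(\alpha, \beta)}$
$\operatorname{Mod}(X) = \begin{cases} \lfloor x_m \rfloor, & x_m \notin N, \\ x_m - 1 \text{ and } x_m, & \text{else,} \end{cases}$ with $x_m = \frac{(n + 1)(\alpha - 1)}{\alpha + \beta - 2}$
$E(X) = n \frac{\alpha}{\alpha + \beta}$
$\operatorname{Var}(X) = n \frac{\alpha\beta}{(\alpha + \beta)^2} \cdot \frac{\alpha + \beta + n}{\alpha + \beta + 1}$
The beta function B(x, y) is described in Appendix B.2.1. The BeB(n, 1, 1) distribution is a
discrete uniform distribution with support T and $f(x) = (n + 1)^{-1}$. For n = 1, the beta-binomial
distribution BeB(1, α, β) reduces to the Bernoulli distribution B(π) with success probability
π = α/(α + β).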
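Entries of Table A.1 can be checked by direct enumeration of the probability function. The sketch below is an added illustration for the binomial distribution with the arbitrary parameter values n = 10 and π = 0.3; it compares the enumerated mean, variance and mode with nπ, nπ(1 − π) and the floor of (n + 1)π, which is the mode whenever (n + 1)π is not an integer.

```python
import math

def binom_pmf(x, n, pi):
    """f(x) = C(n, x) * pi^x * (1 - pi)^(n - x)."""
    return math.comb(n, x) * pi ** x * (1.0 - pi) ** (n - x)

n, pi = 10, 0.3   # illustrative parameter values
support = range(n + 1)
pmf = [binom_pmf(x, n, pi) for x in support]

# Moments and mode obtained by enumeration over the support.
mean = sum(x * f for x, f in zip(support, pmf))
variance = sum((x - mean) ** 2 * f for x, f in zip(support, pmf))
mode = max(support, key=lambda x: pmf[x])

# Closed-form counterparts.
z_m = (n + 1) * pi   # about 3.3, not an integer, so the mode is floor(z_m)
print(mean, n * pi)
print(variance, n * pi * (1.0 - pi))
print(mode, math.floor(z_m))
```

The same enumeration approach works for any of the discrete distributions with finite support in the table.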
A.5.2 Univariate Continuous Distributions
Table A.2 gives some elementary facts about the most important univariate continuous distributions used in this book. The density and random number generator
functions of the inverse gamma distribution are available in the package MCMCpack.
The distribution and quantile function (as well as random numbers) can be calculated with the corresponding functions of the gamma distribution. Functions relating to the general t distribution are available in the package sn. The functions
_t(. . . , df = α) available by default in R cover the standard t distribution. The lognormal, folded normal, Gumbel and the Pareto distributions are available in the
package VGAM. The gamma–gamma distribution is currently not available.