Example 5.1: A Weighted Histogram of Total Cholesterol Using the 2005–2006 NHANES Data


Applied Survey Data Analysis

Figure 5.2
A weighted histogram of total cholesterol, based on 2005–2006 NHANES data.
[Histogram not reproduced; x-axis: Total Cholesterol (mg/dl), 100 to 600; y-axis: Density, 0 to 0.015.]

of the 2005–2006 NHANES mobile examination center (MEC) analysis weight,
WTMEC2YR. Since WTMEC2YR is a population scale weight, trimming the decimal places will have no significant effect on the graphical representation of the
estimated distribution of total cholesterol.
Figure 5.2 displays the histogram output from Stata graphics. Since NHANES
sample data inputs have been weighted, the distributional form of the histogram is
representative of the estimated distribution for the NHANES survey population of
U.S. adults age 18 and older.

Example 5.2: Weighted Boxplots of Total Cholesterol for U.S.
Adult Men and Women Using the 2005–2006 NHANES Data
For scientific reports and papers, boxplots are a very useful tool for graphical
presentation of the estimated population (weighted) distributions of single survey
variables. The following Stata command requests a pair of boxplots that represent
the estimated gender-specific population distributions of total serum cholesterol
in U.S. adults:
graph box lbxtc [pweight=wtmec2yr] if age18p==1, by(female)

Note that the Stata graph box command permits the specification of the
analysis weight as a pweight (or probability weight) variable—no conversion to
integer values is needed. As we will see in upcoming examples, Stata’s commands
designed for the analysis of (complex sample) survey data all expect users to initially identify a probability sampling weight variable, which may not necessarily
take on integer values.
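To make concrete what a weighted distribution means for the quartiles shown in a boxplot, the following Python sketch computes weighted quantiles from non-integer weights. It is only an illustration: the data values are invented, the function name weighted_quantile is hypothetical, and the rule used (the smallest value whose cumulative weight share reaches q) is one common definition that may differ from the interpolation Stata applies internally.

```python
def weighted_quantile(values, weights, q):
    """Return the smallest value whose cumulative weight share reaches q.

    Assumption: a simple "inverse weighted CDF" definition, with no
    interpolation between adjacent values.
    """
    pairs = sorted(zip(values, weights))      # sort cases by value
    total = sum(w for _, w in pairs)          # estimated population size
    cumulative = 0.0
    for value, w in pairs:
        cumulative += w
        if cumulative >= q * total:
            return value
    return pairs[-1][0]

# Invented toy data: five cases with non-integer probability weights.
cholesterol = [150, 180, 200, 240, 300]
weights = [3000.5, 1200.25, 800.0, 2500.75, 500.5]

q1 = weighted_quantile(cholesterol, weights, 0.25)
median = weighted_quantile(cholesterol, weights, 0.50)
q3 = weighted_quantile(cholesterol, weights, 0.75)
print(q1, median, q3)
```

In this toy data the unweighted median is 200, while the weighted median is lower because the cases with small cholesterol values carry most of the weight.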

© 2010 by Taylor and Francis Group, LLC


Descriptive Analysis for Continuous Variables

Figure 5.3
Boxplots of the gender-specific weighted distributions of total serum cholesterol for U.S. adults.
(From 2005–2006 NHANES data.)
[Boxplots not reproduced; panels: Male, Female; y-axis: Total Cholesterol (mg/dl), 100 to 600.]

The pair of boxplots generated by the Stata command is displayed in
Figure 5.3. These boxplots suggest that male and female adults in the NHANES
survey population have very similar distributions on this particular continuous
variable.

5.3.2 Estimation of Population Totals
The estimation of a population total and its sampling variance has played
a central role in the development of probability sampling theory. In survey
practice, optimal methods for estimation of population totals are extremely
important in government and academic surveys that focus on agriculture
(e.g., acres in production), business (e.g., total employment), and organizational research (e.g., hospital costs). Acres of corn planted per year, total natural gas production for a 30-day period, and total expenditures attributable
to a prospective change in benefit eligibility are all research questions that
require efficient estimation of population totals. In those agencies and disciplines where estimation of population totals plays a central role, advanced
“model-assisted” estimation procedures and specialized software are the
norm. Techniques such as the generalized regression (GREG) estimator or
calibration estimators integrate the survey data with known population
controls on the distribution of the population weighting factors to produce
efficient weighted estimates of population totals (DeVille and Särndal, 1992;
Valliant, Dorfman, and Royall, 2000).
Most statistical software packages do not currently support advanced
techniques for estimating population totals such as the GREG or calibration
methods. Stata and other software systems that support complex sample
survey data analysis do provide the capability to compute simple weighted
or expansion estimates of finite population totals and also include a limited
set of options for including population controls in the form of post-stratified
estimation (see Theory Box 5.1).
In the case of a complex sample design including stratification (with strata
indexed by $h = 1, \ldots, H$) and clustering (with clusters within stratum $h$ indexed
by $\alpha = 1, 2, \ldots, a_h$), the simple weighted estimator for the population total can
be written as follows:


$$\hat{Y}_w = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} \qquad (5.1)$$

A closed-form, unbiased estimate of the variance of this estimator is



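$$var(\hat{Y}_w) = \sum_{h=1}^{H} \frac{a_h}{(a_h - 1)} \left[ \sum_{\alpha=1}^{a_h} \left( \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} \right)^{2} - \frac{\left( \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} \right)^{2}}{a_h} \right] \qquad (5.2)$$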

(Recall from Section 3.6.2 that this simple estimator for the variance of a
weighted total plays an important role in the computation of Taylor series
linearization (TSL) estimates of sampling variances for more complex estimators that can be approximated as linear functions of estimated totals.)
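As a rough illustration of Equations 5.1 and 5.2, the following Python sketch computes the weighted total and its variance estimate from a small invented data set with two strata and two clusters per stratum. The function name, the record layout (stratum, cluster, weight, y), and the data are assumptions made for this example; they do not come from any of the survey data sets discussed in this chapter.

```python
from collections import defaultdict

def weighted_total_and_variance(records):
    """Apply Equations 5.1 and 5.2 to (stratum, cluster, weight, y) records."""
    # Weighted cluster totals: sum over i of w * y within each cluster.
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, w, y in records:
        cluster_totals[stratum][cluster] += w * y

    total = 0.0
    variance = 0.0
    for stratum, clusters in cluster_totals.items():
        z = list(clusters.values())
        a_h = len(z)                      # number of clusters in stratum h
        if a_h < 2:
            raise ValueError(f"stratum {stratum} needs at least 2 clusters")
        stratum_total = sum(z)
        total += stratum_total            # Equation 5.1
        sum_of_squares = sum(v * v for v in z)
        # Equation 5.2: between-cluster variation of weighted totals.
        variance += a_h / (a_h - 1) * (sum_of_squares - stratum_total ** 2 / a_h)
    return total, variance

# Invented toy data: (stratum, cluster, weight, y)
toy = [
    (1, 1, 10.0, 2.0),
    (1, 1, 10.0, 4.0),
    (1, 2, 12.0, 6.0),
    (2, 1, 8.0, 1.0),
    (2, 2, 8.0, 3.0),
    (2, 2, 8.0, 2.0),
]

y_hat, var_hat = weighted_total_and_variance(toy)
print(y_hat, var_hat, var_hat ** 0.5)
```

Note that only the weighted cluster totals within each stratum enter the variance computation; the variation among elements inside a cluster is never used directly.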
This simple weighted estimator of the finite population total is often
labeled the Horvitz–Thompson or H–T estimator (Horvitz and Thompson,
1952). Practically speaking, this labeling is convenient, but the estimator in
Equation 5.1 and the variance estimator in Equation 5.2 make additional
assumptions beyond those that are explicit in Horvitz and Thompson’s original derivation. Theory Box 5.1 provides interested readers with a short summary of the theory underlying the H–T estimator.
Two major classes of total statistics can be estimated using Equation 5.1. If
$y_{h\alpha i}$ is a binary indicator for an attribute (e.g., 1 = has the disease, 0 = disease
free), the result is an estimate of the size of the subpopulation that shares that
attribute:



$$\hat{Y}_w = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} = \hat{M} \le \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i} \cdot 1 = \hat{N}$$

Theory Box 5.1 The Horvitz–Thompson Estimator of a Population Total
The Horvitz–Thompson estimator (Horvitz and Thompson, 1952) of
the population total for a variable Y is written as follows:
$$\hat{Y} = \sum_{i=1}^{N} \frac{\delta_i Y_i}{p_i} = \sum_{i=1}^{n} \frac{y_i}{p_i} \qquad (5.3)$$

In Equation 5.3, $\delta_i = 1$ if element $i$ is included in the sample and 0 otherwise,
and $p_i$ is the probability of inclusion in the sample for element $i$. The H–T
estimator is an unbiased estimator for the population total $Y$, because the
only random variable defined in the estimator is the indicator of inclusion in the sample (the $y_i$ and $p_i$ values are fixed in the population):


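$$E(\hat{Y}) = E\left( \sum_{i=1}^{N} \frac{\delta_i Y_i}{p_i} \right) = \sum_{i=1}^{N} \frac{E(\delta_i) Y_i}{p_i} = \sum_{i=1}^{N} \frac{p_i Y_i}{p_i} = \sum_{i=1}^{N} Y_i = Y \qquad (5.4)$$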

An unbiased estimator of the sampling variance of the H–T estimator
is


$$var(\hat{Y}) = \sum_{i=1}^{n} y_i^2 \frac{(1 - p_i)}{p_i^2} + \sum_{i=1}^{n} \sum_{j \ne i} \frac{y_i y_j}{p_{ij}} \left( \frac{p_{ij} - p_i p_j}{p_i p_j} \right) \qquad (5.5)$$

In this expression, $p_{ij}$ represents the probability that both elements $i$
and $j$ are included in the sample; these joint inclusion probabilities must
be supplied to statistical software to compute these variance estimates.
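The role of the joint inclusion probabilities can be checked numerically in a case where they are known. The Python sketch below assumes simple random sampling without replacement of n = 4 elements from N = 10, so that every inclusion probability is n/N and every joint inclusion probability is n(n − 1)/[N(N − 1)]. Under that assumption, Equation 5.5 should reproduce the familiar simple random sampling result N²(1 − n/N)s²/n. The data values and function names are invented for this illustration.

```python
def ht_total(y, p):
    """Equation 5.3: sum of sampled values divided by inclusion probabilities."""
    return sum(yi / pi for yi, pi in zip(y, p))

def ht_variance(y, p, p_joint):
    """Equation 5.5; p_joint[i][j] is the joint inclusion probability (i != j)."""
    n = len(y)
    var = sum(y[i] ** 2 * (1 - p[i]) / p[i] ** 2 for i in range(n))
    for i in range(n):
        for j in range(n):
            if j != i:
                pij = p_joint[i][j]
                var += (y[i] * y[j] / pij) * (pij - p[i] * p[j]) / (p[i] * p[j])
    return var

# Assumed design: simple random sampling without replacement.
N, n = 10, 4
y = [3.0, 5.0, 2.0, 6.0]                      # invented sample values
p = [n / N] * n
pij = n * (n - 1) / (N * (N - 1))
p_joint = [[pij] * n for _ in range(n)]       # diagonal entries are never used

total = ht_total(y, p)
variance = ht_variance(y, p, p_joint)

# Textbook SRS formula for comparison: N^2 * (1 - n/N) * s^2 / n
mean = sum(y) / n
s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
srs_variance = N ** 2 * (1 - n / N) * s2 / n
print(total, variance, srs_variance)
```

For designs with unequal probabilities the joint inclusion probabilities rarely have such a simple form, which is one reason the estimator in Equation 5.2 is used in practice.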
The H–T estimator weights each sample observation inversely proportionate to its sample selection probability, $w_{HT,i} = w_{sel,i} = 1/p_i$, and does
not explicitly consider nonresponse adjustment or post-stratification
(Section 2.7). In fact, when the analysis weight incorporates all three of
these conventional weight factors, the variance estimator in Equation 5.2
does not fully reflect the stochastic sample-to-sample variability associated with the nonresponse mechanism, nor does it capture true gains
in precision that may have been achieved through the poststratification
of the weights to external population controls. Because survey nonresponse is a stochastic process that operates on the selected sample, the
variance estimator could (in theory) explicitly capture this added component of sample-to-sample variability (Valliant, 2004). This method
assumes that the data user can access the individual components of the
survey weight. Stata does provide the capability to directly account for
the reduction in sampling variance due to the poststratification using the
poststrata() and postweight() options on the svyset command.
The effect of nonresponse and poststratification weighting on the
sampling variance of estimated population totals and other descriptive
statistics may also be captured through the use of replicate weights, in
which the nonresponse adjustment and the poststratification controls
are separately developed for each balanced repeated replication (BRR)
or jackknife repeated replication (JRR) replicate sample of cases.
Alternatively, if y is a continuous measure of an attribute of the sample case
(e.g., acres of corn, monthly income, annual medical expenses), the result is
an estimate of the population total of y,
$$\hat{Y}_w = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i} = \hat{Y}$$

Example 5.3 will illustrate the estimation of a subpopulation total, and
Example 5.4 will illustrate the estimation of a population total.
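Before turning to the examples, the two classes of totals can be illustrated with a small Python sketch. The weights and values below are invented, and the stratum and cluster indices are dropped because they do not affect the point estimate (the triple sum in Equation 5.1 is simply a sum over all sample elements).

```python
# Invented population-scale weights for five sample cases.
weights = [1500.0, 2200.0, 1800.0, 2500.0, 2000.0]

# Class 1: binary indicator (1 = has the attribute, 0 = does not).
has_attribute = [1, 0, 1, 0, 0]
m_hat = sum(w * y for w, y in zip(weights, has_attribute))  # subpopulation size
n_hat = sum(weights)                                        # population size

# Class 2: continuous measure (e.g., acres of corn per case).
acres = [10.0, 20.0, 30.0, 40.0, 50.0]
y_hat = sum(w * y for w, y in zip(weights, acres))          # population total

print(m_hat, n_hat, y_hat)
```

Because the binary indicator can only switch weights off, the estimated subpopulation size can never exceed the sum of the weights.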
Example 5.3: Using the NCS-R Data to Estimate the Total Count
of U.S. Adults with Lifetime Major Depressive Episodes (MDE)
The MDE variable in the NCS-R data set is a binary indicator (1 = yes, 0 = no) of
whether an NCS-R respondent reported a major depressive episode at any point
in his or her lifetime. The aim of this example analysis is to estimate the total
number of individuals who have experienced a lifetime major depressive episode
along with the standard error of the estimate (and the 95% confidence interval).
For this analysis, the NCS-R survey weight variable NCSRWTSH is selected to
analyze all respondents completing the Part I survey (n = 9,282), where the lifetime
diagnosis of MDE was assessed. Because the NCS-R data producers normalized
the values of the NCSRWTSH variable so that the weights would sum to the sample
size, the weight values must be expanded back to the population scale to obtain an
unbiased estimate of the population total. This is accomplished by multiplying the
Part I weight for each case by the ratio of the NCS-R survey population total (N =
209,128,094 U.S. adults age 18+) divided by the count of sample observations (n
= 9,282). The SECLUSTR variable contains the codes representing NCS-R sampling
error clusters while SESTRAT is the sampling error stratum variable:
gen ncsrwtsh_pop = ncsrwtsh * (209128094 / 9282)
svyset seclustr [pweight = ncsrwtsh_pop], strata(sestrat)
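The rescaling step performed by the gen command can be sketched in Python as follows. The four toy weights are invented and, like the normalized NCS-R weights, sum to the number of cases; the function name is hypothetical. The last line computes the expansion factor implied by the population and sample counts quoted above.

```python
def expand_to_population_scale(normalized_weights, population_total):
    """Multiply weights that sum to n by N / n so that they sum to N."""
    n = len(normalized_weights)
    factor = population_total / n
    return [w * factor for w in normalized_weights]

# Invented toy weights that sum to the sample size (4).
toy_weights = [0.5, 1.5, 1.0, 1.0]
expanded = expand_to_population_scale(toy_weights, 1000)
print(expanded, sum(expanded))

# Expansion factor implied by the NCS-R figures quoted in the text.
ncsr_factor = 209_128_094 / 9_282
print(round(ncsr_factor, 1))
```

The relative sizes of the weights are unchanged by this step, so estimates of means and proportions are unaffected; only estimated totals and their standard errors move to the population scale.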

Once the complex design features of the NCS-R sample have been identified
using the svyset command, the svy: total command is issued to obtain an
unbiased weighted estimate of the population total along with a standard error for
the estimate. The Stata estat effects command is then used to compute an
estimate of the design effect for this estimated total:
svy: total mde
estat effects
n        df    $\hat{Y}_w$    $se(\hat{Y}_w)$    $CI_{.95}(\hat{Y}_w)$         $d^2(\hat{Y}_w)$
9,282    42    40,092,206     2,567,488          (34,900,000, 45,300,000)      9.03

The resulting Stata output indicates that 9,282 observations have been analyzed
and that there are 42 design-based degrees of freedom. The weighted estimate
of the total population of U.S. adults who have experienced an episode of major
depression in their lifetime is $\hat{Y}_w = 40,092,206$. The estimated value of the design
effect for the weighted estimate of the population total is $d^2(\hat{Y}_w) = 9.03$, suggesting
that the NCS-R variance of the estimated total is approximately nine times greater
than that expected for a simple random sample of the same size.
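The reported interval and design effect can be cross-checked with a few lines of arithmetic. The Python sketch below assumes the interval is a t-based interval on 42 degrees of freedom and uses an approximate critical value of 2.018; that value is an assumption of this sketch, not a figure taken from the Stata output. It also shows the standard error and effective sample size that the design effect implies for a simple random sample.

```python
import math

total = 40_092_206      # estimated total from the table
se = 2_567_488          # its standard error
deff = 9.03             # estimated design effect
n = 9_282               # number of observations analyzed
t_crit = 2.018          # assumed approximate t critical value, 42 df, 95% level

lower = total - t_crit * se
upper = total + t_crit * se

# Standard error a simple random sample of the same size would imply,
# and the effective sample size n / deff.
se_srs = se / math.sqrt(deff)
n_effective = n / deff

print(round(lower), round(upper))
print(round(se_srs), round(n_effective))
```

The limits agree with the tabulated interval once they are rounded to three significant digits, and the implied effective sample size is roughly one-ninth of the nominal sample size.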
Weighted estimates of population totals can also be computed for subpopulations. Consider subpopulations of NCS-R adults classified by marital status (married,
separated/widowed/divorced, and never married). Under the complex NCS-R sample design, correct unconditional subpopulation analyses (see Section 4.5.2) can be
specified in Stata by adding the over() option to the svy: total command:
svy: total mde, over(mar3cat)
estat effects

Subpopulation     n       Estimated Total Lifetime MDE    Standard Error    95% Confidence Interval       $d^2(\hat{Y})$
Married           5322    20,304,190                      1,584,109         (17,100,000, 23,500,000)      6.07
Sep./Wid./Div.    2017    10,360,671                      702,601           (8,942,723, 11,800,000)       2.22
Never Married     1943    9,427,345                       773,137           (7,867,091, 11,000,000)       2.95

Note that the MAR3CAT variable is included in parentheses to request estimates
for each subpopulation defined by the levels of that variable.

Example 5.4: Using the HRS Data to Estimate Total Household Assets
Next, consider the example problem of estimating the total value of household
assets for the HRS target population (U.S. households with adults born prior to
1954). We first identify the HRS variables containing the sampling error computation units, or ultimate clusters (SECU) and the sampling error stratum codes
(STRATUM). We also specify the KWGTHH variable as the survey weight variable
for the analysis, because we are performing an analysis at the level of the HRS
household financial unit. The HRS data set includes an indicator variable (KFINR
for 2006) that identifies the individual respondent who is the financial reporter for
each HRS sample household. This variable is used to create a subpopulation indicator (FINR) that restricts the estimation to only sample members who are financial
reporters for their HRS household unit. We then apply the svy: total command
to the H8ATOTA variable, measuring the total value of household assets:
gen finr=1
replace finr=0 if kfinr !=1
svyset secu [pweight=kwgthh], strata(stratum)
svy, subpop(finr): total h8atota
n         df    $\hat{Y}_w$                 $se(\hat{Y}_w)$             $CI_{.95}(\hat{Y}_w)$
11,942    56    $\$2.84 \times 10^{13}$     $\$1.60 \times 10^{12}$     $(2.52 \times 10^{13}, 3.16 \times 10^{13})$

The Stata output indicates that the 2006 HRS target population includes approximately 53,853,000 households (not shown). In 2006, these 53.9 million estimated
households owned household assets valued at an estimated $\hat{Y}_w = \$2.84 \times 10^{13}$,
with a 95% confidence interval (CI) of $(\$2.52 \times 10^{13}, \$3.16 \times 10^{13})$.
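As a rough check on these figures, the interval can be reproduced from the estimate and its standard error, and the two estimated totals can be combined into an approximate mean per household. The critical value of about 2.003 for 56 degrees of freedom is an assumption of this sketch, and all results are limited by the three significant digits reported above.

```latex
\begin{align*}
\hat{Y}_w \pm t_{0.975,\,56} \cdot se(\hat{Y}_w)
  &\approx 2.84 \times 10^{13} \pm 2.003 \times (1.60 \times 10^{12}) \\
  &\approx 2.84 \times 10^{13} \pm 0.32 \times 10^{13}
   = (2.52 \times 10^{13},\ 3.16 \times 10^{13}) \\[4pt]
\frac{se(\hat{Y}_w)}{\hat{Y}_w}
  &\approx \frac{1.60 \times 10^{12}}{2.84 \times 10^{13}} \approx 0.056 \\[4pt]
\frac{\hat{Y}_w}{\hat{N}}
  &\approx \frac{2.84 \times 10^{13}}{53{,}853{,}000} \approx \$527{,}000 \text{ per household}
\end{align*}
```

The last line is a ratio of two estimated totals, which is the form of estimator taken up in the next section.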

5.3.3 Means of Continuous, Binary, or Interval Scale Data
The estimation of the population mean, $\bar{Y}$, for a continuous, binary, or interval scale variable is a very common task for survey researchers seeking to
describe the central tendencies of these variables in populations of interest.
An estimator of the population mean, $\bar{Y}$, can be written as a nonlinear ratio
of two estimated finite population totals:
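$$\bar{y}_w = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i}}{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}} \qquad (5.6)$$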

Note that if y is a binary variable coded 1 or 0, the weighted mean estimates the population proportion or prevalence, P, of “1s” in the population
(see Chapter 6):
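$$\bar{y}_w = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i}}{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}} = \frac{\hat{M}}{\hat{N}} = p \qquad (5.7)$$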
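A minimal Python sketch of the ratio form in Equations 5.6 and 5.7 follows. The weights and values are invented, and the function name weighted_mean is hypothetical; only the point estimate is computed here, not its sampling variance.

```python
def weighted_mean(weights, values):
    """Ratio of two weighted totals: sum(w * y) / sum(w)."""
    numerator = sum(w * y for w, y in zip(weights, values))
    denominator = sum(weights)
    return numerator / denominator

# Invented weights for four sample cases.
weights = [400.0, 100.0, 300.0, 200.0]

# Continuous variable: weighted mean versus the unweighted sample mean.
cholesterol = [180.0, 220.0, 200.0, 240.0]
mean_w = weighted_mean(weights, cholesterol)
mean_unweighted = sum(cholesterol) / len(cholesterol)

# Binary variable coded 1/0: the weighted mean is the estimated proportion.
high_cholesterol = [0, 1, 0, 1]
p_hat = weighted_mean(weights, high_cholesterol)

print(mean_w, mean_unweighted, p_hat)
```

Both the numerator and the denominator vary from sample to sample, which is why the estimator is a ratio rather than a linear function of the data.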

Since $\bar{y}_w$ is not a linear statistic, a closed-form formula for the variance of
this estimator does not exist. As a result, variance estimation methods such
as TSL, BRR, or JRR (see Chapter 3) must be used to compute an estimate of
