Descriptive Analysis for Continuous Variables

Example 5.8: Estimating Population Quantiles for Total Household Assets Using the HRS Data

Table 5.1
Estimation of Percentiles of the Distribution of 2006 HRS Total Household Assets

                 SUDAAN (TSL)                          WesVar PC (BRR)
Percentile       \(\hat{Q}_p\)    se(\(\hat{Q}_p\))    \(\hat{Q}_p\)    se(\(\hat{Q}_p\))
Q25              $39,852          $3,167               $39,852          $3,249
Q50 (Median)     $183,309         $10,233              $183,309         $9,978
Q75              $495,931         $17,993              $495,931         $17,460

The following SUDAAN syntax generates the quantile estimates and standard errors, using an unconditional subclass analysis approach:
proc descript ;
nest stratum secu ;
weight kwgthh ;
subpopn finr = 1 ;
var h8atota ;
percentiles 25 75 / median ;
setenv decwidth = 1 ;
run ;

Table 5.1 summarizes the results provided in the SUDAAN output and compares
the estimates with those generated using the BRR approach to variance estimation
in the WesVar PC software. The estimated median of the total household assets for
the HRS target population is $183,309. The estimate of the mean total household
assets from Example 5.7 was $527,313, nearly three times the median, suggesting
that the distribution of total household assets is strongly right-skewed, with a
long tail toward the higher dollar values.
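
The contrast between a weighted median and a weighted mean can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the asset values and weights below are hypothetical, and `weighted_quantile` uses a simple inverse weighted-CDF definition, which is not necessarily the interpolation rule that SUDAAN or WesVar applies. It also produces point estimates only, with no design-based standard errors.

```python
def weighted_quantile(values, weights, p):
    """Smallest value whose cumulative share of the total weight reaches p."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for value, w in pairs:
        running += w
        if running / total >= p:
            return value
    return pairs[-1][0]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical household assets and sampling weights (not HRS data).
assets = [5_000, 40_000, 180_000, 500_000, 4_000_000]
wts = [3000, 2500, 2000, 1500, 1000]

median = weighted_quantile(assets, wts, 0.50)   # 40000
mean = weighted_mean(assets, wts)               # 522500.0
print(median, mean)
```

In this toy example a single very large value pulls the weighted mean far above the weighted median, the same pattern seen in the HRS estimates.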
The analysis of quantiles of the distribution of total household assets for HRS
households was repeated using the WesVar PC software (readers are referred to
the ASDA Web site for the menu steps needed to perform this analysis in WesVar).
WesVar allows for the use of BRR to estimate the standard errors of estimated
quantiles. From the side-by-side comparison in Table 5.1, WesVar’s weighted estimates of the quantiles agree exactly with those reported by SUDAAN. However,
WesVar’s BRR estimates of the corresponding standard errors differ slightly from
the TSL standard errors computed by SUDAAN, as expected, because the two methods
approximate the sampling variance differently. The resulting inferences about the
population quantiles would not differ substantially in this example.
The DESCRIPT procedure in the SUDAAN software can also be used to estimate quantiles for subpopulations of interest. For example, the same quantiles
could be estimated for the subpopulation of adults age 75 and older in the HRS
population using the following syntax:
proc descript ;
nest stratum secu ;
weight kwgthh ;
subpopn kage > 74 & finr = 1 ;
var h8atota ;
percentiles 25 75 / median ;
setenv decwidth = 1 ;
run ;

© 2010 by Taylor and Francis Group, LLC

Applied Survey Data Analysis

Note the use of the SUBPOPN statement to identify that the estimate is based on
respondents who are 75 years of age and older and are the financial reporter for
their HRS household unit. The estimated quantile values and standard errors (in
parentheses) generated by this subpopulation analysis are
\(\hat{Q}_{25,75+}\) = $40,329.4 ($4,434.8); \(\hat{Q}_{50,75+}\) = $177,781.3 ($11,142.9);
and \(\hat{Q}_{75,75+}\) = $461,308.3 ($27,478.0).

5.4 Bivariate Relationships between Two Continuous Variables
Four basic analytic approaches can be used to examine bivariate relationships
between two continuous survey variables: (1) generation of a scatterplot; (2)
computation of a pair-wise correlation coefficient, r; (3) estimation of the ratio
of two continuous variables, \(\hat{R} = \hat{Y} / \hat{X}\); and (4) estimation of the coefficients
of the simple linear regression of one variable on another, \(\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \cdot X\). The
first three of these techniques are reviewed in this section. Chapter 7 will
address the estimation of linear regression models for continuous dependent
variables in detail.
5.4.1 X–Y Scatterplots
A comparison of the 2005–2006 NHANES MEC measures of high-density
lipoprotein (HDL; the y variable) and total serum cholesterol (the x variable)
illustrates how a simple scatterplot can be used to gain insight into the bivariate relationship of two survey variables. Figure 5.4 presents an unweighted
scatterplot comparing the two variables. The figure suggests a positive
relationship between HDL and total serum cholesterol but also illustrates
considerable variability in the relationship and the presence of a few points
that stand out as potential outliers. A drawback to the simple X–Y scatterplot summary is that there is no practical way to incorporate the population
weights and the corresponding population frequency associated with each
point in the two-dimensional X–Y space. Stata does provide a unique display
in which the area of the dot representing an X–Y point is proportional to its
survey weight. Unfortunately, in survey data sets with hundreds or even
thousands of data points, it is not possible to obtain the resolution needed to
evaluate single points or to detect patterns in the weighted scatterplot display. One technique for introducing information on the effect of the weights
in the plotted X–Y relationship is to overlay on the scatterplot the line representing the weighted regression of y on x, \(\hat{y}_{wls} = \hat{\beta}_0 + \hat{\beta}_1 \cdot x\).

Figure 5.4
HDL versus total serum cholesterol (mg/dl) in U.S. adults (unweighted points, with the fit of a
weighted regression line included). [Scatterplot: direct HDL cholesterol (mg/dl) on the y-axis
against total cholesterol (mg/dl) on the x-axis, with the fitted values overlaid.]

To do this in the Stata software, we use the twoway graphing command,
where the first command in parentheses defines the scatterplot and the second command overlays the weighted estimate of the simple linear regression
of HDL on total cholesterol (note that the probability weight
variable is defined in square brackets for the second command):
twoway (scatter lbdhdd lbxtc if age18p==1) (lfit lbdhdd ///
lbxtc if age18p==1 [pweight=wtmec2yr])
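
The arithmetic behind the overlaid line can be sketched in Python. The function below is a minimal illustration of weighted least squares for a single predictor, assuming the weights enter the normal equations as in the expression for \(\hat{y}_{wls}\) above; the name `weighted_line` and the data values are hypothetical and are not taken from NHANES or from Stata's implementation.

```python
def weighted_line(x, y, w):
    """Weighted least squares intercept and slope for the regression of y on x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical total cholesterol (x), HDL (y), and sampling weights.
tc = [150, 180, 200, 240, 300]
hdl = [42, 55, 48, 60, 58]
wt = [1200, 800, 1500, 400, 300]

b0_w, b1_w = weighted_line(tc, hdl, wt)            # weighted fit
b0_u, b1_u = weighted_line(tc, hdl, [1] * len(tc))  # unweighted fit
print(b0_w, b1_w, b0_u, b1_u)
```

Comparing the weighted and unweighted coefficients gives a quick sense of how much the weights shift the fitted line, which is the information the overlay adds to the unweighted scatterplot.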

5.4.2 Product Moment Correlation Statistic (r)
The product moment correlation (r) is a standard measure of the association between two continuous variables. Unfortunately, few current software
systems provide the capability to estimate single correlations (or weighted
correlation matrices) and confidence intervals for r that account for complex
sample design features. A reasonable alternative would be to first standardize the two variables for which a correlation is desired and then compute a
weighted estimate of the slope parameter in a simple linear regression model
relating the two variables (see Chapter 7).
A weighted estimator of the product moment correlation statistic is written
as follows:

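\[
r_w = \frac{s_{xy,w}}{s_{x,w} \cdot s_{y,w}}
    = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\,(y_{h\alpha i} - \bar{y}_w)(x_{h\alpha i} - \bar{x}_w)}
           {\sqrt{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\,(y_{h\alpha i} - \bar{y}_w)^2 \cdot
                  \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\,(x_{h\alpha i} - \bar{x}_w)^2}}
\tag{5.14}
\]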

Kish and Frankel (1974) included the pair-wise correlation statistic in
their simulation studies and found that TSL, BRR, and JRR performed
similarly in the estimation of standard errors for estimates of pair-wise
correlation statistics.
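
As a rough illustration of Equation 5.14, the weighted point estimate can be computed directly from the case-level weights. The sketch below is an assumption-laden example: `weighted_corr` is a hypothetical helper, the data are invented, and it returns only the point estimate \(r_w\), not a design-based standard error (which would require TSL, BRR, or JRR).

```python
from math import sqrt


def weighted_corr(x, y, w):
    """Weighted product moment correlation, following Equation 5.14."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (yi - ybar) * (xi - xbar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical total cholesterol (x), HDL (y), and sampling weights.
tc = [150, 180, 200, 240, 300]
hdl = [42, 55, 48, 60, 58]
wt = [1200, 800, 1500, 400, 300]

r_w = weighted_corr(tc, hdl, wt)
print(r_w)
```

Because the stratum and cluster subscripts in Equation 5.14 only index the cases, the point estimate reduces to sums over all sampled elements; the design structure matters for the variance, not for \(r_w\) itself.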
5.4.3 Ratios of Two Continuous Variables
Occasionally, survey analysts need to estimate the ratio of two continuous survey variables (e.g., the ratio of HDL cholesterol level to total cholesterol). The
ratio estimator of the population mean (Equation 5.6) of a single variable can be
generalized to an estimator of the population ratio of two survey variables:
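\[
\hat{R} = \frac{\hat{Y}}{\hat{X}}
        = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, y_{h\alpha i}}
               {\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}\, x_{h\alpha i}}
\tag{5.15}
\]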

The following TSL approximation provides estimates of the sampling variance of ratio estimates:


\[
var(\hat{R}) \approx \frac{var(\hat{Y}) + \hat{R}^2 \cdot var(\hat{X}) - 2 \cdot \hat{R} \cdot cov(\hat{Y}, \hat{X})}{\hat{X}^2}
\tag{5.16}
\]

BRR and JRR estimation options now available in the major statistical software packages provide an appropriate alternative to TSL for estimating standard errors of \(\hat{R}\).
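
Equations 5.15 and 5.16 can be traced numerically with a short Python sketch. Several things here are assumptions made for illustration: the variances and covariance of the weighted totals are computed with the usual with-replacement (ultimate cluster) formula based on PSU totals within strata, the helper names `design_cov` and `ratio_tsl` are hypothetical, and the eight data records are invented rather than drawn from NHANES.

```python
from collections import defaultdict
from math import sqrt


def design_cov(strata, psu, u, v):
    """With-replacement covariance of the totals of u and v from PSU totals."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for h, a, ui, vi in zip(strata, psu, u, v):
        totals[(h, a)][0] += ui
        totals[(h, a)][1] += vi
    by_stratum = defaultdict(list)
    for (h, _), pair in totals.items():
        by_stratum[h].append(pair)
    cov = 0.0
    for pairs in by_stratum.values():
        a_h = len(pairs)  # number of PSUs in stratum h (must be at least 2)
        ubar = sum(p[0] for p in pairs) / a_h
        vbar = sum(p[1] for p in pairs) / a_h
        cov += a_h / (a_h - 1) * sum((p[0] - ubar) * (p[1] - vbar) for p in pairs)
    return cov


def ratio_tsl(strata, psu, w, y, x):
    """Ratio estimate (5.15) and its TSL standard error (5.16)."""
    wy = [wi * yi for wi, yi in zip(w, y)]
    wx = [wi * xi for wi, xi in zip(w, x)]
    Y, X = sum(wy), sum(wx)
    R = Y / X
    var_R = (design_cov(strata, psu, wy, wy)
             + R ** 2 * design_cov(strata, psu, wx, wx)
             - 2 * R * design_cov(strata, psu, wy, wx)) / X ** 2
    return R, sqrt(max(var_R, 0.0))  # guard against tiny negative rounding error


# Hypothetical design: 2 strata, 2 PSUs per stratum, 2 cases per PSU.
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 1, 1, 2, 2]
w = [100, 100, 120, 120, 80, 80, 90, 90]
hdl = [50, 60, 45, 55, 65, 40, 52, 58]
tc = [200, 220, 180, 210, 230, 170, 190, 205]

R_hat, se_R = ratio_tsl(strata, psu, w, hdl, tc)
print(R_hat, se_R)
```

The negative covariance term in Equation 5.16 is what makes a ratio of two positively correlated totals more stable than either total alone: if y were exactly proportional to x, the three terms in the numerator would cancel.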
Theory Box 5.2 Ratio Estimators of Population Totals
If two survey variables x and y are highly correlated and a control total
for the variable x, \(X_{pop}\), is known from an auxiliary data source, then the
ratio estimator of the population total, \(\hat{Y}_R = \hat{R} \cdot X_{pop}\), may provide a more
precise estimate than the simple weighted estimator of the population
total, \(\hat{Y}_w\), described in Section 5.3.2. Because \(X_{pop}\) is assumed to be free
of sampling variability, the standard error of the ratio estimator of the
population total is \(se(\hat{Y}_R) = se(\hat{R}) \cdot X_{pop}\).

Example 5.9: Estimating the Population Ratio of High-Density to Total Cholesterol for U.S. Adults
A weighted estimate of this ratio based on 2005–2006 NHANES respondents
age 18+ is obtained in Stata using the svy: ratio command. Note that we
once again define the subpopulation indicator (AGE18P) first before running the
analysis.
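* Indicator for the adult subpopulation (1 = age 18+, 0 = under 18)
gen age18p = 1 if age >= 18 & age != .
replace age18p = 0 if age < 18
* Declare the NHANES design: PSU, MEC weight, and strata
svyset sdmvpsu [pweight=wtmec2yr], strata(sdmvstra)
* Unconditional subpopulation estimate of the HDL to total cholesterol ratio
svy, subpop(age18p): ratio (lbdhdd/lbxtc)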

n       df    \(\hat{R}\)    se(\(\hat{R}\))    \(CI_{.95}(\hat{R})\)
4996    15    0.275          0.002              (0.271, 0.280)

Note that the two variables defining the numerator and denominator of the ratio
are indicated in parentheses in the svy: ratio command, with the numerator
variable listed before a forward slash (/) and the denominator variable listed after
the slash.

5.5 Descriptive Statistics for Subpopulations
Section 4.5.3 discussed the important analytical differences between conditional and unconditional subpopulation analyses. Section 5.3.2 presented
examples of unconditional subpopulation analyses in the estimation of totals.
Stata’s svy commands provide two options for correctly specifying subpopulation analyses. The over() option requests unconditional subclass analyses for subpopulations defined based on all levels of a categorical variable.
This option is only available in Stata’s descriptive analysis procedures. The
more general subpop() option can be used with all svy commands in Stata.
This option requests analysis for a specific subpopulation identified by an
indicator variable, coded 1 if the case belongs to the subpopulation of interest
and 0 otherwise. Procedures for survey data analysis in other software packages will have similar command options available for these subpopulation
analyses (see the book Web site for examples).

Example 5.10: Estimating the Proportions of Males and Females Age > 70 with Diabetes Using the HRS Data
This first subpopulation analysis example uses the 2006 HRS data set to estimate the prevalence of diabetes for U.S. males and females over 70 years
of age. Because the HRS variable DIABETES is equal to 1 for respondents with
diabetes and 0 otherwise, estimating the mean of this binary variable will result
in the prevalence estimates of interest. The following example command uses
a logical condition in the subpop() option to specify the age range and then
the over() option to perform separate subpopulation analyses for males and
females:
svy, subpop(if kage > 70): mean diabetes, over(gender)

Gender    df    \(\bar{y}_w\)    se(\(\bar{y}_w\))    \(CI_{.95}(\bar{y}_w)\)
Male      56    0.235            0.008                (0.219, 0.252)
Female    56    0.184            0.009                (0.167, 0.201)

We see that a higher proportion of elderly men (23.5%) than elderly women (18.4%)
is estimated to have diabetes.
It might be tempting for an analyst to make the mistake of taking a conditional approach to these subpopulation analyses, using a Stata command like the
following:
svy: mean diabetes if kage > 70

Use of the subsetting if modifier to define a subpopulation for stratified samples is inappropriate because all cases not satisfying the condition are temporarily
deleted from the analysis, along with their sample design information. This effectively fixes the strata sample sizes for variance estimation calculations (when in
fact the strata subpopulation sample sizes should be treated as random variables).
See Chapter 4 for more details.
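
The practical difference between the two approaches can be seen in a small Python sketch. It is a simplified illustration, not Stata's implementation: it assumes a with-replacement linearization variance for a domain mean, uses hypothetical helper names (`total_se`, `domain_mean`), and works on ten invented records in which one PSU contains no subpopulation members. In the unconditional analysis that PSU stays in the design and contributes a total of zero; in the conditional analysis it disappears along with its design information.

```python
from collections import defaultdict
from math import sqrt


def total_se(strata, psu, z):
    """With-replacement standard error of the total of z from PSU totals."""
    totals = defaultdict(float)
    for h, a, zi in zip(strata, psu, z):
        totals[(h, a)] += zi
    by_stratum = defaultdict(list)
    for (h, _), t in totals.items():
        by_stratum[h].append(t)
    var = 0.0
    for ts in by_stratum.values():
        a_h = len(ts)
        tbar = sum(ts) / a_h
        var += a_h / (a_h - 1) * sum((t - tbar) ** 2 for t in ts)
    return sqrt(var)


def domain_mean(strata, psu, w, y, d, unconditional=True):
    """Weighted domain mean and linearized SE; d is the 0/1 subpopulation flag."""
    rows = list(zip(strata, psu, w, y, d))
    if not unconditional:
        # Conditional approach: drop the cases AND their design information.
        rows = [r for r in rows if r[4] == 1]
    n_hat = sum(wi * di for _, _, wi, _, di in rows)
    mean = sum(wi * di * yi for _, _, wi, yi, di in rows) / n_hat
    # Linearized variable: zero for every case outside the subpopulation.
    z = [wi * di * (yi - mean) / n_hat for _, _, wi, yi, di in rows]
    return mean, total_se([r[0] for r in rows], [r[1] for r in rows], z)


# Hypothetical data: stratum 1 has 3 PSUs, and PSU 3 has no one over age 70.
strata = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
psu = [1, 1, 2, 2, 3, 3, 1, 1, 2, 2]
w = [10] * 10
diabetes = [1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
over70 = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]

mean_u, se_u = domain_mean(strata, psu, w, diabetes, over70, unconditional=True)
mean_c, se_c = domain_mean(strata, psu, w, diabetes, over70, unconditional=False)
print(mean_u, se_u, mean_c, se_c)
```

Both calls return the same point estimate, but the standard errors differ because the conditional version treats stratum 1 as if it had been designed with two PSUs rather than three.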

Example 5.11: Estimating Mean Systolic Blood Pressure for Males and Females Age > 45 Using the NHANES Data
Consider an example based on NHANES of estimating the mean systolic blood
pressure for male and female adults over the age of 45. We once again illustrate
a combination of options in Stata to set up this analysis, using a logical condition
in the subpop() option to specify the age range and then the over() option to
perform subpopulation analyses for males and females in this age range separately.
We also request design effects for each subpopulation estimate using the estat
effects postestimation command:
svy, subpop(if age > 45): mean bpxsy1, over(gender)
estat effects

Gender    df    \(\bar{y}_w\)    se(\(\bar{y}_w\))    \(CI_{.95}(\bar{y}_w)\)    \(d^2(\bar{y}_w)\)
Male      15    128.96           0.76                 (127.35, 130.58)           2.60
Female    15    132.09           1.06                 (129.82, 134.36)           3.62

Survey analysts should be aware that restricting analysis to a subpopulation of
the full sample may result in a reduction in the effective degrees of freedom for
variance estimation. Since Stata employs the variable degrees of freedom method
discussed in Section 3.5.2, its programs ignore any original design strata that do
not contain one or more observations from the subpopulation of interest. Stata
signals that complete strata have been dropped by including a note in the output
indicating that one or more design strata were “omitted because they contain
no subpopulation members.” Approaches to this issue are not currently uniform
across the different major statistical software packages.
The greatest reductions in effective degrees of freedom for variance estimation can occur when survey analysts are interested in estimation for rare
subpopulations that comprise only a small percentage of the survey population
or when the subpopulation of interest is defined by a domain of sample strata,
such as a single census region. (Refer to Figure 4.4 for an illustration of how
subpopulations may distribute across the strata and clusters of a complex sample design.)
For example, the following Stata svy: mean command requests estimates of
mean systolic blood pressure for four education groupings of African Americans
age 80 and older:
svy, subpop(if age > 80 & black==1): mean bpxsy1, over(edcat)

Stata reports (results not shown) that for one or more of these detailed subpopulation estimates, only 12 of the 15 design strata and 24 of the 30 clusters in the
2005–2006 NHANES design contain one or more eligible respondents from the
target subpopulation. Consequently, 24 − 12 = 12 degrees of freedom (clusters minus strata) are assumed in developing confidence intervals or evaluating test statistics from this analysis.
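
The degrees-of-freedom count described above can be sketched as follows. This is an illustration of the "clusters minus strata" rule under one stated assumption: strata with no subpopulation members are dropped, and all clusters in the remaining strata are counted. The function name `subpop_df` and the toy design (15 strata with 2 clusters each, mimicking the layout described in the text) are hypothetical; the exact rule applied by any given software package should be checked in its documentation.

```python
def subpop_df(strata, psu, in_subpop):
    """Clusters minus strata, ignoring strata with no subpopulation members."""
    kept = {h for h, d in zip(strata, in_subpop) if d}
    clusters = {(h, a) for h, a in zip(strata, psu) if h in kept}
    return len(clusters) - len(kept)


# Hypothetical design: 15 strata x 2 clusters, one record per cluster.
strata = [h for h in range(1, 16) for _ in (1, 2)]
psu = [a for _ in range(1, 16) for a in (1, 2)]

everyone = [True] * len(strata)
rare_group = [h <= 12 for h in strata]  # no members in strata 13-15

print(subpop_df(strata, psu, everyone))    # 15
print(subpop_df(strata, psu, rare_group))  # 12
```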
Procedures for subpopulation analyses focused on estimation of percentiles/
quantiles are currently available in the SUDAAN, WesVar PC, and SAS (Version
9.2 and higher) software (see Example 5.8). Examples of subpopulation analyses
using these other software systems are available on the book Web site.

5.6 Linear Functions of Descriptive Estimates and Differences of Means
The capability to estimate functions of descriptive statistics, especially differences of means or proportions, is an important feature of many survey analyses.
In general, many important functions of descriptive statistics can be written as
linear combinations of the descriptive statistics of interest. Examples of such
