Example 5.11: Estimating Mean Systolic Blood Pressure for Males and Females Age > 45 Using the NHANES Data


Descriptive Analysis for Continuous Variables




Table columns: $se(\bar{y}_w)$, $CI_{.95}(\bar{y}_w)$, $d^2(\bar{y}_w)$







Survey analysts should be aware that restricting analysis to a subpopulation of
the full sample may result in a reduction in the effective degrees of freedom for
variance estimation. Since Stata employs the variable degrees of freedom method
discussed in Section 3.5.2, its programs ignore any original design strata that do
not contain one or more observations from the subpopulation of interest. Stata
signals that complete strata have been dropped by including a note in the output
indicating that one or more design strata were “omitted because they contain
no subpopulation members.” Approaches to this issue are not currently uniform
across the different major statistical software packages.
The greatest reductions in effective degrees of freedom for variance estimation can occur when survey analysts are interested in estimation for rare
subpopulations that comprise only a small percent of the survey population
or when the subpopulation of interest is defined by a domain of sample strata,
such as a single census region. (Refer to Figure 4.4 for an illustration of how
subpopulations may distribute across the strata and clusters of a complex sample design.)
For example, the following Stata svy: mean command requests estimates of
mean systolic blood pressure for four education groupings of African Americans
age 80 and older:
svy: mean bpxsy1, subpop(if age >80 & black==1) over(edcat)

Stata reports (results not shown) that for one or more of these detailed subpopulation estimates, only 12 of the 15 design strata and 24 of the 30 clusters in the
2005–2006 NHANES design contain one or more eligible respondents from the
target subpopulation. Consequently, 12 degrees of freedom (24 clusters minus 12 strata) are assumed in developing confidence intervals or evaluating test statistics from this analysis.
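The degrees-of-freedom arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the "clusters minus strata" rule as applied to the counts quoted above; the function name `design_df` is hypothetical and is not part of Stata or any survey package.

```python
def design_df(n_clusters: int, n_strata: int) -> int:
    """Design-based degrees of freedom: number of clusters minus number of strata.

    Only strata (and their clusters) that contain at least one subpopulation
    member are counted, following the variable degrees of freedom method
    described in the text.
    """
    if n_clusters < n_strata:
        raise ValueError("each counted stratum must contain at least one cluster")
    return n_clusters - n_strata


# Counts quoted in the text for the 2005-2006 NHANES design.
full_df = design_df(n_clusters=30, n_strata=15)    # all strata and clusters
subpop_df = design_df(n_clusters=24, n_strata=12)  # strata/clusters with eligible respondents

print(full_df, subpop_df)
```

Under this rule the subpopulation analysis works with fewer degrees of freedom than an analysis of the full sample, which widens the t-based confidence intervals.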
Procedures for subpopulation analyses focused on estimation of percentiles/
quantiles are currently available in the SUDAAN, WesVar PC, and SAS (Version
9.2 and higher) software (see Example 5.8). Examples of subpopulation analyses
using these other software systems are available on the book Web site.

5.6 Linear Functions of Descriptive Estimates and Differences of Means
The capability to estimate functions of descriptive statistics, especially differences of means or proportions, is an important feature of many survey analyses.
In general, many important functions of descriptive statistics can be written as
linear combinations of the descriptive statistics of interest. Examples of such

© 2010 by Taylor and Francis Group, LLC


Applied Survey Data Analysis

linear combinations of estimates include differences of means, weighted sums
of means used to build economic indices, or computation of a moving average
of three monthly survey means, such as the following:


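$$\hat{\Delta} = \bar{y}_1 - \bar{y}_2$$

$$f(\bar{y}_1, \ldots, \bar{y}_4) = .25 \cdot \bar{y}_1 + .40 \cdot \bar{y}_2 + 1.5 \cdot \bar{y}_3 + 1.0 \cdot \bar{y}_4$$

$$\bar{y}_{moving} = \tfrac{1}{3} \cdot \bar{y}_{t1} + \tfrac{1}{3} \cdot \bar{y}_{t2} + \tfrac{1}{3} \cdot \bar{y}_{t3}$$
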
Consider the general form of a linear combination of j = 1, …, J descriptive
statistics (e.g., estimates of means for J subpopulations):

$$f(\theta_1, \ldots, \theta_J) = \sum_{j=1}^{J} a_j \theta_j \qquad (5.17)$$

In Equation 5.17, θj represents the statistic of interest for the j-th subpopulation, while the aj terms represent constants defined by the analyst. This function is estimated by simply substituting estimates of the descriptive statistics
into the expression
$$f(\hat{\theta}_1, \ldots, \hat{\theta}_J) = \sum_{j=1}^{J} a_j \hat{\theta}_j \qquad (5.18)$$

The variance of this estimator would then be calculated as follows:

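$$\operatorname{var}\left( \sum_{j=1}^{J} a_j \hat{\theta}_j \right) = \sum_{j=1}^{J} a_j^2 \operatorname{var}(\hat{\theta}_j) + 2 \times \sum_{j=1}^{J-1} \sum_{k>j}^{J} a_j a_k \operatorname{cov}(\hat{\theta}_j, \hat{\theta}_k) \qquad (5.19)$$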
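As a rough numerical illustration of the estimator and its variance, the following Python sketch evaluates both for a vector of constants and a variance-covariance matrix of the estimates. The function names and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any survey data set.

```python
def lincom_estimate(a, theta_hat):
    """Point estimate of sum_j a[j] * theta_hat[j]."""
    return sum(aj * tj for aj, tj in zip(a, theta_hat))


def lincom_variance(a, V):
    """Variance of sum_j a[j] * theta_hat[j].

    V is the J x J variance-covariance matrix of the estimates
    (V[j][j] = var(theta_hat_j), V[j][k] = cov(theta_hat_j, theta_hat_k)).
    """
    J = len(a)
    # Sum of a_j^2 * var(theta_hat_j)
    total = sum(a[j] ** 2 * V[j][j] for j in range(J))
    # Plus 2 * sum over j < k of a_j * a_k * cov(theta_hat_j, theta_hat_k)
    total += 2 * sum(
        a[j] * a[k] * V[j][k] for j in range(J - 1) for k in range(j + 1, J)
    )
    return total


# Hypothetical example: difference of two means, a = (1, -1).
a = [1, -1]
theta_hat = [120.0, 125.0]   # made-up estimated means
V = [[4.0, 1.0],
     [1.0, 9.0]]             # made-up variances (diagonal) and covariance

print(lincom_estimate(a, theta_hat))  # 120 - 125
print(lincom_variance(a, V))          # 4 + 9 - 2*1
```

With a positive covariance, the variance of the difference (11.0 here) is smaller than the sum of the two variances (13.0), which is why ignoring the covariance term can misstate the standard error of a contrast.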

Note that the variance of the linear combination incorporates the variances
of the individual component estimates as well as possible covariances of the
estimated statistics.
Covariance between descriptive estimates can arise in several ways.
First, the statistics of interest may represent overlapping subpopulations
(see Kish, 1987). An example would be a contrast comparing the mean
systolic blood pressure for men to that for the total population, such as
$\hat{\Delta} = \bar{y}_{total} - \bar{y}_{male}$, or a longitudinal analysis in which the mean blood pressures of a sample panel of individuals are compared at two points in time,
$\hat{\Delta} = \bar{y}_{time2} - \bar{y}_{time1}$. Due to the intraclass correlation among the sample elements within sample design clusters of complex sample designs, a degree



of covariance is possible even when estimates are based on nonoverlapping
samples, such as $\hat{\Delta} = \bar{y}_{female} - \bar{y}_{male}$.
Under conditions where samples are overlapping or the complex design
itself induces a covariance in the sample estimates, statistical software must
compute and store the covariances of estimates to correctly compute the variance of a linear combination of means or other descriptive statistics. Stata’s
lincom command is one example of a postestimation command that enables
analysts to correctly compute estimates and standard errors for linear combinations of estimates. SUDAAN’s contrast option is another.
5.6.1 Differences of Means for Two Subpopulations
Analysts of survey data are frequently interested in making inferences about
differences of descriptive statistics for two subpopulations. The inference
can be based on a 95% confidence interval for the estimated difference of the
two means, such as

$$CI = (\bar{y}_{female} - \bar{y}_{male}) \pm t_{df,.975} \cdot se(\bar{y}_{female} - \bar{y}_{male})$$

or can employ a two-sample t-test based on the Student t statistic,

$$t = (\bar{y}_{female} - \bar{y}_{male}) / se(\bar{y}_{female} - \bar{y}_{male})$$
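The two inferential summaries above can be sketched in Python as follows. This is a minimal illustration, not a survey-analysis routine: the function name `contrast_inference` is hypothetical, the input values are made up, and the critical value `t_crit` (the .975 quantile of the t distribution for the design degrees of freedom) is assumed to be supplied by the analyst from a table or statistical software.

```python
def contrast_inference(diff, se, t_crit):
    """Return the confidence interval and t statistic for an estimated difference.

    diff   : estimated difference of the two subpopulation means
    se     : design-based standard error of that difference
    t_crit : t critical value for the design degrees of freedom (assumed given)
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    half_width = t_crit * se
    return {
        "ci": (diff - half_width, diff + half_width),
        "t": diff / se,
    }


# Hypothetical inputs chosen for easy arithmetic.
result = contrast_inference(diff=-5.0, se=2.0, t_crit=2.0)
print(result["ci"])  # interval centered on the estimated difference
print(result["t"])   # estimated difference divided by its standard error
```

The two summaries lead to the same conclusion: the interval excludes zero exactly when the absolute value of the t statistic exceeds the critical value.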


Applying the general formula for the variance of a linear combination
(5.19), the standard error of the difference in the means of the two subpopulation samples can be expressed as

$$se(\bar{y}_1 - \bar{y}_2) = \sqrt{\operatorname{var}(\bar{y}_1) + \operatorname{var}(\bar{y}_2) - 2\operatorname{cov}(\bar{y}_1, \bar{y}_2)}$$
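For readers who want the intermediate step, the standard error of the difference follows from the general variance formula (5.19) by setting J = 2 with constants a_1 = 1 and a_2 = -1. The derivation below is a worked sketch using only the symbols already defined in this section:

```latex
\begin{aligned}
\operatorname{var}(\bar{y}_1 - \bar{y}_2)
  &= (1)^2 \operatorname{var}(\bar{y}_1) + (-1)^2 \operatorname{var}(\bar{y}_2)
     + 2(1)(-1)\operatorname{cov}(\bar{y}_1, \bar{y}_2) \\
  &= \operatorname{var}(\bar{y}_1) + \operatorname{var}(\bar{y}_2)
     - 2\operatorname{cov}(\bar{y}_1, \bar{y}_2), \\
se(\bar{y}_1 - \bar{y}_2)
  &= \sqrt{\operatorname{var}(\bar{y}_1 - \bar{y}_2)}.
\end{aligned}
```

The negative sign on the covariance term comes from the product of the two constants, so a positive covariance between the two means reduces the standard error of their difference.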


Under simple random sampling, the covariance of estimates for distinct
samples is zero; however, for clustered samples or samples that share elements, the covariance of the two sample means may be nonzero and generally positive in value. Example 5.12 estimates the contrast in the mean value
of total household assets for two subpopulations of HRS households: one
subpopulation in which the household head has less than a high school
education and a second subpopulation of households headed by a college-educated adult.
Example 5.12: Estimating Differences in Mean Total Household Assets
between HRS Subpopulations Defined by Educational Attainment Level
Testing differences of subpopulation means using the svy: mean procedure in
Stata is a two-step process. First, the mean of total household assets is estimated


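for subpopulations defined by the levels of the EDCAT variable in the HRS data:
* Flag the subpopulation of interest: finr = 1 when kfinr == 1, otherwise 0
gen finr = 1
replace finr = 0 if kfinr != 1
* Declare the complex design (cluster, weight, and stratum variables)
svyset secu [pweight = kwgthh], strata(stratum) ///
    vce(linearized) singleunit(missing)
* Estimate mean total household assets by education category
svy, subpop(finr): mean h8atota, over(edcat)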
Education of Head   Stata Label   ȳw    se(ȳw)   CI.95(ȳw)
0–11 yrs            1             ...   ...      ($129,184, $227,588)
12 yrs              2             ...   ...      ($294,171, $362,613)
13–15 yrs           3             ...   ...      ($401,369, $509,545)
16+ yrs             4             ...   ...      ($902,646, $1,311,762)

Stata automatically saves these estimated means along with their sampling variances and covariances in memory until the next command is issued. Stata assigns
the four subpopulations of household heads with 0–11, 12, 13–15, and 16+ years
of education to internal reference labels 1, 2, 3, and 4, respectively.
Following the computation of the subpopulation means, the lincom postestimation command is used to estimate the difference in means for household
heads with 0–11 (subpopulation 1) versus 16+ years of education (subpopulation 4):
lincom [h8atota]1 - [h8atota]4
Education of Head   ȳ0–11 − ȳ16+   se(ȳ0–11 − ȳ16+)   CI.95(ȳ0–11 − ȳ16+)
0–11 vs. 16+        –$928,818      ...                (–$1,145,669, –$711,967)

After this postestimation command is submitted, Stata outputs the estimated difference of the two subpopulation means ($\hat{\Delta} = \bar{y}_{0-11} - \bar{y}_{16+}$ = –$928,818). The 95%
confidence interval for the population difference does not include 0, suggesting
that households headed by a college graduate have a significantly higher mean
of total household assets compared with households in which the head does not
have a high school diploma.
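Because each reported interval has the form estimate ± t · se, it is symmetric about its point estimate, so the midpoints of the intervals reported in this example can be used as a quick consistency check on the contrast. The short Python sketch below does this with the confidence limits quoted in Example 5.12; the variable and function names are hypothetical.

```python
def midpoint(interval):
    """Midpoint of a symmetric confidence interval, i.e. the point estimate."""
    lower, upper = interval
    return (lower + upper) / 2


# 95% confidence limits as reported in Example 5.12 (dollars).
ci_0_11 = (129_184, 227_588)          # heads with 0-11 years of education
ci_16_plus = (902_646, 1_311_762)     # heads with 16+ years of education
ci_contrast = (-1_145_669, -711_967)  # lincom interval for the difference

diff_from_means = midpoint(ci_0_11) - midpoint(ci_16_plus)
diff_from_lincom = midpoint(ci_contrast)

print(diff_from_means, diff_from_lincom)
```

Both routes give the same difference as the value reported by lincom, and the contrast interval lies entirely below zero, consistent with the interpretation in the text.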
To display the estimated variance–covariance matrix for the subpopulation estimates of mean household assets, the Stata postestimation command vce is used:
vce

The output produced by this command is shown in the symmetric 4 × 4 matrix
in Table 5.2. Note that in this example, the covariance of the estimated means for
the 0–11 and 16+ household heads is small and negative ($-3.438 \times 10^{8}$).
When applying the formula for the standard error of the contrast we get Table
