
Example 5.11: Estimating Mean Systolic Blood Pressure for Males and Females Age > 45 Using the NHANES Data


Descriptive Analysis for Continuous Variables

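| Gender | df | \(\bar{y}_w\) | \(se(\bar{y}_w)\) | \(CI_{.95}(\bar{y}_w)\) | \(d^2(\bar{y}_w)\) |
|---|---|---|---|---|---|
| Male | 15 | 128.96 | 0.76 | (127.35, 130.58) | 2.60 |
| Female | 15 | 132.09 | 1.06 | (129.82, 134.36) | 3.62 |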
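The confidence limits in the table can be reproduced, up to rounding, from the reported means and standard errors. The sketch below is written in Python rather than Stata. The function name `t_interval` is hypothetical, and the critical value (the 0.975 quantile of Student's t with 15 degrees of freedom) is typed in from standard tables rather than computed.

```python
# Reproduce CI.95 = mean +/- t(df, .975) * se for the Example 5.11 table.
# Assumption: T_975_DF15 is the tabulated 0.975 quantile of Student's t
# with 15 degrees of freedom; it is hard-coded, not computed.
T_975_DF15 = 2.131449


def t_interval(mean, se, t_crit):
    """Return the (lower, upper) limits of mean +/- t_crit * se."""
    half_width = t_crit * se
    return mean - half_width, mean + half_width


# Weighted means and standard errors as printed in the table.
estimates = {"Male": (128.96, 0.76), "Female": (132.09, 1.06)}

intervals = {
    group: t_interval(mean, se, T_975_DF15)
    for group, (mean, se) in estimates.items()
}

for group, (lower, upper) in intervals.items():
    print(f"{group}: ({lower:.2f}, {upper:.2f})")
```

Because the printed standard errors are rounded to two decimals, the computed limits agree with the table only to within about 0.01.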

Survey analysts should be aware that restricting analysis to a subpopulation of

the full sample may result in a reduction in the effective degrees of freedom for

variance estimation. Since Stata employs the variable degrees of freedom method

discussed in Section 3.5.2, its programs ignore any original design strata that do

not contain one or more observations from the subpopulation of interest. Stata

signals that complete strata have been dropped by including a note in the output

indicating that one or more design strata were “omitted because they contain

no subpopulation members.” Approaches to this issue are not currently uniform

across the different major statistical software packages.

The greatest reductions in effective degrees of freedom for variance estimation can occur when survey analysts are interested in estimation for rare

subpopulations that comprise only a small percent of the survey population

or when the subpopulation of interest is defined by a domain of sample strata,

such as a single census region. (Refer to Figure 4.4 for an illustration of how

subpopulations may distribute across the strata and clusters of a complex sample design.)

For example, the following Stata svy: mean command requests estimates of

mean systolic blood pressure for four education groupings of African Americans

age 80 and older:

svy, subpop(if age > 80 & black == 1): mean bpxsy1, over(edcat)

Stata reports (results not shown) that for one or more of these detailed subpopulation estimates, only 12 of the 15 design strata and 24 of the 30 clusters in the

2005–2006 NHANES design contain one or more eligible respondents from the

target subpopulation. Consequently, 12 degrees of freedom are assumed in developing confidence intervals or evaluating test statistics from this analysis.
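The degrees-of-freedom count described here can be sketched as follows. This is an illustration in Python, not Stata's implementation. The function name `design_df` is hypothetical, and the sketch assumes the usual rule that design degrees of freedom equal the number of clusters minus the number of strata, counted over the strata that contain subpopulation members. The two critical values are tabulated 0.975 quantiles of Student's t, typed in rather than computed.

```python
def design_df(n_clusters, n_strata):
    """Design degrees of freedom: number of clusters minus number of strata."""
    return n_clusters - n_strata


# Full 2005-2006 NHANES design: 30 clusters in 15 strata.
full_df = design_df(30, 15)

# Strata and clusters that still contain subpopulation members.
subpop_df = design_df(24, 12)

# Assumption: tabulated 0.975 quantiles of Student's t, keyed by df.
T_975 = {15: 2.131449, 12: 2.178813}

# Ratio of confidence-interval half-widths for the same standard error.
widening = T_975[subpop_df] / T_975[full_df]

print(full_df, subpop_df, round(widening, 3))
```

For a fixed standard error, moving from 15 to 12 degrees of freedom widens the interval by roughly 2 percent; the loss is larger when fewer strata retain subpopulation members.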

Procedures for subpopulation analyses focused on estimation of percentiles/

quantiles are currently available in the SUDAAN, WesVar PC, and SAS (Version

9.2 and higher) software (see Example 5.8). Examples of subpopulation analyses

using these other software systems are available on the book Web site.

5.6 Linear Functions of Descriptive Estimates and Differences of Means

The capability to estimate functions of descriptive statistics, especially differences of means or proportions, is an important feature of many survey analyses.

In general, many important functions of descriptive statistics can be written as

linear combinations of the descriptive statistics of interest. Examples of such

© 2010 by Taylor and Francis Group, LLC


Applied Survey Data Analysis

linear combinations of estimates include differences of means, weighted sums

of means used to build economic indices, or computation of a moving average

of three monthly survey means, such as the following:

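\[
\hat{\Delta} = \bar{y}_1 - \bar{y}_2
\]

\[
\hat{I} = 0.25 \cdot \bar{y}_1 + 0.40 \cdot \bar{y}_2 + 1.5 \cdot \bar{y}_3 + 1.0 \cdot \bar{y}_4
\]

\[
\bar{y}_{moving} = \tfrac{1}{3} \cdot \bar{y}_{t1} + \tfrac{1}{3} \cdot \bar{y}_{t2} + \tfrac{1}{3} \cdot \bar{y}_{t3}
\]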

Consider the general form of a linear combination of j = 1, …, J descriptive

statistics (e.g., estimates of means for J subpopulations):

\[
f(\theta_1, \ldots, \theta_J) = \sum_{j=1}^{J} a_j \theta_j \tag{5.17}
\]

In Equation 5.17, \(\theta_j\) represents the statistic of interest for the j-th subpopulation, while the \(a_j\) terms represent constants defined by the analyst. This function is estimated by simply substituting estimates of the descriptive statistics into the expression

\[
f(\hat{\theta}_1, \ldots, \hat{\theta}_J) = \sum_{j=1}^{J} a_j \hat{\theta}_j \tag{5.18}
\]

The variance of this estimator would then be calculated as follows:

\[
var\left( \sum_{j=1}^{J} a_j \hat{\theta}_j \right) = \sum_{j=1}^{J} a_j^2 \, var(\hat{\theta}_j) + 2 \times \sum_{j=1}^{J-1} \sum_{k>j}^{J} a_j a_k \, cov(\hat{\theta}_j, \hat{\theta}_k) \tag{5.19}
\]

Note that the variance of the linear combination incorporates the variances

of the individual component estimates as well as possible covariances of the

estimated statistics.
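The variance formula can be sketched numerically. The Python functions below are illustrative only: the names `lincom_variance` and `quadratic_form` and the 3 × 3 covariance matrix `V` are hypothetical and do not come from any survey. The first function follows the variance-plus-covariance form of Equation 5.19 term by term; the second computes the equivalent matrix form a'Va as a cross-check.

```python
def lincom_variance(a, V):
    """Variance of sum_j a[j] * theta_hat[j], written as in Equation 5.19."""
    J = len(a)
    # First term: squared coefficients times the variances.
    var_terms = sum(a[j] ** 2 * V[j][j] for j in range(J))
    # Second term: each pair (j, k) with k > j, counted once and doubled.
    cov_terms = sum(
        a[j] * a[k] * V[j][k] for j in range(J - 1) for k in range(j + 1, J)
    )
    return var_terms + 2 * cov_terms


def quadratic_form(a, V):
    """Equivalent matrix form a' V a, used here only as a cross-check."""
    J = len(a)
    return sum(a[j] * V[j][k] * a[k] for j in range(J) for k in range(J))


# Hypothetical variance-covariance matrix of three estimated means.
V = [
    [4.0, 1.0, 0.5],
    [1.0, 9.0, 2.0],
    [0.5, 2.0, 16.0],
]

moving_average = [1 / 3, 1 / 3, 1 / 3]  # three-period moving average
difference = [1.0, -1.0, 0.0]           # first mean minus second mean

var_ma = lincom_variance(moving_average, V)
var_diff = lincom_variance(difference, V)

print(var_ma, var_diff)
```

With the difference coefficients (1, −1, 0), the positive covariance of 1.0 between the first two estimates is subtracted twice, so the variance of the difference (11) is smaller than the sum of the two variances (13).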

Covariance between descriptive estimates can arise in several ways.

First, the statistics of interest may represent overlapping subpopulations

(see Kish, 1987). An example would be a contrast comparing the mean

systolic blood pressure for men to that for the total population, such as

\(\hat{\Delta} = \bar{y}_{total} - \bar{y}_{male}\), or a longitudinal analysis in which the mean blood pressures of a sample panel of individuals are compared at two points in time,

\(\hat{\Delta} = \bar{y}_{time2} - \bar{y}_{time1}\). Second, due to the intraclass correlation among the sample elements within sample design clusters of complex sample designs, a degree


of covariance is possible even when estimates are based on nonoverlapping

samples, such as \(\hat{\Delta} = \bar{y}_{female} - \bar{y}_{male}\).

Under conditions where samples are overlapping or the complex design

itself induces a covariance in the sample estimates, statistical software must

compute and store the covariances of estimates to correctly compute the variance of a linear combination of means or other descriptive statistics. Stata’s

lincom command is one example of a postestimation command that enables

analysts to correctly compute estimates and standard errors for linear combinations of estimates. SUDAAN’s contrast option is another.

5.6.1 Differences of Means for Two Subpopulations

Analysts of survey data are frequently interested in making inferences about

differences of descriptive statistics for two subpopulations. The inference

can be based on a 95% confidence interval for the estimated difference of the

two means, such as

\[
CI = (\bar{y}_{female} - \bar{y}_{male}) \pm t_{df,.975} \cdot se(\bar{y}_{female} - \bar{y}_{male}) \tag{5.20}
\]

or can employ a two-sample t-test based on the Student t statistic,

\[
t = (\bar{y}_{female} - \bar{y}_{male}) / se(\bar{y}_{female} - \bar{y}_{male}) \tag{5.21}
\]

Applying the general formula for the variance of a linear combination (Equation 5.19) with \(a_1 = 1\) and \(a_2 = -1\), the standard error of the difference between the two subpopulation sample means can be expressed as

\[
se(\bar{y}_1 - \bar{y}_2) = \sqrt{var(\bar{y}_1) + var(\bar{y}_2) - 2\,cov(\bar{y}_1, \bar{y}_2)} \tag{5.22}
\]

Under simple random sampling, the covariance of estimates for distinct

samples is zero; however, for clustered samples or samples that share elements, the covariance of the two sample means may be nonzero and generally positive in value. Example 5.12 estimates the contrast in the mean value

of total household assets for two subpopulations of HRS households: one

subpopulation in which the household head has less than a high school

education and a second subpopulation of households headed by a college-educated adult.

Example 5.12: Estimating Differences in Mean Total Household Assets between HRS Subpopulations Defined by Educational Attainment Level

Testing differences of subpopulation means using the svy: mean procedure in

Stata is a two-step process. First, the mean of total household assets is estimated


for subpopulations defined by the levels of the EDCAT variable in the HRS data

set:

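* Subpopulation indicator: 1 when kfinr == 1, 0 otherwise.
gen finr = 1
replace finr = 0 if kfinr != 1
* Declare the complex design: cluster variable, household weight, strata.
svyset secu [pweight = kwgthh], strata(stratum) ///
    vce(linearized) singleunit(missing)
* Mean total household assets by education category within the subpopulation.
svy, subpop(finr): mean h8atota, over(edcat)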

| Education of Head | Stata Label | \(\bar{y}_w\) | \(se(\bar{y}_w)\) | \(CI_{.95}(\bar{y}_w)\) |
|---|---|---|---|---|
| 0–11 yrs | 1 | $178,386 | $24,561 | ($129,184, $227,588) |
| 12 yrs | 2 | $328,392 | $17,082 | ($294,171, $362,613) |
| 13–15 yrs | 3 | $455,457 | $27,000 | ($401,369, $509,545) |
| 16+ yrs | 4 | $1,107,204 | $102,113 | ($902,646, $1,311,762) |

Stata automatically saves these estimated means along with their sampling variances and covariances in memory until the next estimation command is issued. Stata assigns

the four subpopulations of household heads with 0–11, 12, 13–15, and 16+ years

of education to internal reference labels 1, 2, 3, and 4, respectively.

Following the computation of the subpopulation means, the lincom postestimation command is used to estimate the difference in means for household

heads with 0–11 (subpopulation 1) versus 16+ years of education (subpopulation 4):

lincom [h8atota]1 - [h8atota]4

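| Education of Head | \(\bar{y}_{0-11} - \bar{y}_{16+}\) | \(se(\bar{y}_{0-11} - \bar{y}_{16+})\) | \(CI_{.95}(\bar{y}_{0-11} - \bar{y}_{16+})\) |
|---|---|---|---|
| 0–11 vs. 16+ | –$928,818 | $108,250 | (–$1,145,669, –$711,967) |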

After this postestimation command is submitted, Stata outputs the estimated difference of the two subpopulation means (\(\hat{\Delta} = \bar{y}_{0-11} - \bar{y}_{16+}\) = –$928,818). The 95%

confidence interval for the population difference does not include 0, suggesting

that households headed by a college graduate have a significantly higher mean

of total household assets compared with households in which the head does not

have a high school diploma.
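The reported contrast can also be read in terms of the t statistic of Equation 5.21. The short Python sketch below uses only the numbers printed in the lincom results; the variable names are hypothetical. The design degrees of freedom for this analysis are not stated in this section, so the critical value is backed out from the printed interval rather than looked up.

```python
# Values as printed in the lincom results for Example 5.12.
diff = -928_818.0      # estimated difference, 0-11 yrs minus 16+ yrs
se_diff = 108_250.0    # standard error of the difference
ci_lower, ci_upper = -1_145_669.0, -711_967.0

# Student t statistic in the form of Equation 5.21.
t_stat = diff / se_diff

# The printed interval is symmetric about the estimate, so its half-width
# divided by the standard error recovers the critical value that was used.
half_width = (ci_upper - ci_lower) / 2
implied_t_crit = half_width / se_diff

print(f"t = {t_stat:.2f}, implied critical value = {implied_t_crit:.3f}")
```

The t statistic is about −8.58 in absolute terms far beyond the implied critical value of about 2.003, which is consistent with the interval excluding 0.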

To display the estimated variance–covariance matrix for the subpopulation estimates of mean household assets, the Stata postestimation command vce is used:

vce

The output produced by this command is shown in the symmetric 4 × 4 matrix in Table 5.2. Note that in this example, the covariance of the estimated means for the 0–11 and 16+ household heads is small and negative (\(-3.438 \times 10^{8}\)).

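Applying the formula for the standard error of the contrast (Equation 5.22) to these variances and covariances reproduces, up to rounding, the standard error of $108,250 reported by lincom.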
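The standard error of the contrast can be checked by hand from Equation 5.22, using the rounded standard errors from the table of subpopulation means and the covariance quoted above. This is a sketch with rounded inputs, so the result matches the lincom output only approximately:

\[
\begin{aligned}
se(\bar{y}_{0-11} - \bar{y}_{16+}) &= \sqrt{var(\bar{y}_{0-11}) + var(\bar{y}_{16+}) - 2\,cov(\bar{y}_{0-11}, \bar{y}_{16+})} \\
&\approx \sqrt{(24{,}561)^2 + (102{,}113)^2 - 2 \times (-3.438 \times 10^{8})} \\
&\approx \sqrt{6.032 \times 10^{8} + 1.0427 \times 10^{10} + 6.876 \times 10^{8}} \\
&\approx \sqrt{1.1718 \times 10^{10}} \approx 108{,}249
\end{aligned}
\]

Because the covariance is negative, subtracting it twice increases the variance of the difference slightly relative to the sum of the two variances.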
