Example 5.13: Estimating Differences in Mean Total Household Assets from 2004 to 2006 Using Data from the HRS
Applied Survey Data Analysis
and WEIGHT. Each record also includes the permanently assigned stratum and
cluster codes. As described in Example 5.4, the HRS public use data sets include
an indicator variable for each wave of data that identifies the respondents who
are the household financial reporters (JFINR for 2004; KFINR for 2006). Using
the over(year) option, estimates of mean total household assets are computed
separately for 2004 and 2006. The subpop(finr0406) option restricts the
estimates to the financial reporters for each of these two data collection years.
The postestimation lincom statement estimates the difference of means for the
two time periods, its linearized standard error, and a 95% confidence interval
for the difference:
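* Recoded weight for the stacked file: 2004 household weight, replaced by the 2006 weight on 2006 records
gen weight = jwgthh
replace weight = kwgthh if year == 2006
* Flag the household financial reporters in each wave
gen finr04 = 1 if (year==2004 & jfinr==1)
gen finr06 = 1 if (year==2006 & kfinr==1)
* Combined subpopulation indicator: financial reporter in either wave
gen finr0406 = 1 if finr04==1 | finr06==1
* Declare the design: SECU clusters, recoded weight, sampling error strata
svyset secu [pweight = weight], strata(stratum)
* Weighted means by year within the subpopulation, then the 2004 minus 2006 contrast
svy, subpop(finr0406): mean totassets, over(year)
lincom [totassets]2004 - [totassets]2006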
Contrast         ȳ_2004 − ȳ_2006    se(ȳ_2004 − ȳ_2006)    CI.95(ȳ_2004 − ȳ_2006)
2004 vs. 2006    –$115,526           $20,025                 (–$155,642, –$75,411)
Note that the svyset command has been used again to specify the recoded
sampling weight variable (WEIGHT) in the stacked data set. The svy: mean
command is then used to request weighted estimates and linearized standard
errors (and the covariances of the estimates, which are saved internally) for each
subpopulation defined by the YEAR variable. The resulting estimate of the difference of means is Δ̂ = ȳ_2004 − ȳ_2006 = –$115,526, with a linearized standard error
of $20,025. The analysis provides evidence that the mean total household assets
increased significantly from 2004 to 2006 for this population.
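As a quick arithmetic check on the reported interval, the sketch below backs out the critical value implied by the published confidence limits and forms the t statistic for the difference. It is written in Python only so that the numbers can be reproduced without the HRS data; the variable names are illustrative and are not part of the Stata output.

```python
# Figures reported in the text for the 2004 vs. 2006 contrast.
diff = -115_526.0                        # estimated difference of means (2004 minus 2006)
se_diff = 20_025.0                       # linearized standard error of the difference
ci_low, ci_high = -155_642.0, -75_411.0  # reported 95% confidence limits

# A symmetric design-based CI has the form diff +/- t_crit * se_diff,
# so the half-width divided by the SE recovers the critical value used.
half_width = (ci_high - ci_low) / 2
t_crit = half_width / se_diff

# Test statistic for the null hypothesis of no change in mean total assets.
t_stat = diff / se_diff

print(f"implied critical value: {t_crit:.4f}")
print(f"t statistic: {t_stat:.3f}")
```

The implied critical value sits slightly above the normal-theory 1.96, which is what one expects when the interval is based on a Student t distribution with the design-based degrees of freedom rather than on the standard normal.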
5.7 Exercises
1. This exercise serves to illustrate the effects of complex sample
designs (i.e., design effects) on the variances of estimated means, due
to stratification and clustering in sample selection (see Section 2.6.1
for a review). The following table lists values for an equal probability
(self-weighting) sample of n = 36 observations.
Observations                      STRATUM   CLUSTER
 7.0685   13.7441    7.2293          1         1
13.6760    7.2293   13.7315          1         2
13.2310   10.8922   12.3425          1         3
10.9647   11.2793   11.8507          1         4
11.3274   16.4423   11.9133          1         5
17.3248   12.1142   16.7290          1         6
19.7091   12.9173   18.3800          2         7
13.6724   16.2839   14.6646          2         8
15.3685   15.3004   13.5876          2         9
15.9246   14.0902   16.4873          2        10
20.2603   12.0955   18.1224          2        11
12.4546   18.4702   14.6783          2        12
© 2010 by Taylor and Francis Group, LLC
Any software procedures can be used to answer the following four
questions.
a. Assume the sample of n = 36 observations is a simple random
sample from the population. Compute the sample mean, the
standard error of the mean, and a 95% confidence interval for
the population mean. (Ignore the finite population correction
[fpc], stratification, and the clustering in calculating the standard error.)
b. Next, assume that the n = 36 observations are selected as a stratified random sample of 18 observations from each of two strata.
(Ignore the fpc and the apparent clustering.) Assume that the
population size of each of the two strata is equal. Compute the
sample mean, the standard error of the mean and a 95% confidence interval for the population mean. (Ignore the fpc and the
clustering in calculating the standard error.) What is the estimated DEFT(ȳ) for the standard error of the sample mean (i.e.,
the square root of the design effect)?
c. Assume now that the n = 36 observations are selected as an equal
probability sample of 12 clusters with exactly three observations
from each cluster. Compute the sample mean, the standard error
of the mean, a 95% confidence interval for the population mean,
and the estimate of DEFT(ȳ). Ignore the fpc and the stratification in calculating the standard error. Use the simple model of
the design effect for sample means to derive an estimate of roh
(2.11), the synthetic intraclass correlation (this may take a negative value for this “synthetic” data set).
d. Finally, assume that the n = 36 observations are selected as an
equal probability stratified cluster sample of observations. (Two
strata, six clusters per stratum, three observations per cluster.)
Compute the sample mean, the standard error of the mean, the
95% CI, and estimates of DEFT(ȳ) and roh.
2. Using the NCS-R data and a statistical software procedure of your
choice, compute a weighted estimate of the total number of U.S.
adults who have ever been diagnosed with alcohol dependence (ALD)
along with a 95% confidence interval for the total. Make sure to
incorporate the complex design when computing the estimate and
the confidence interval. Compute a second 95% confidence interval
using an alternative variance estimation technique, and compare the
two resulting confidence intervals. Would your inferences change at
all depending on the variance estimation approach?
3. (Requires SUDAAN or SAS Version 9.2+) Using the SUDAAN or
SAS software and the 2005–2006 NHANES data set, estimate the
25th percentile, the median, and the 75th percentile of systolic blood
pressure (BPXSY1) for U.S. adults over the age of 50. You will need to
create a subpopulation indicator of those aged 51 and older for this
analysis. Remember to perform an appropriate subpopulation analysis for this population subclass. Compute 95% confidence intervals
for each percentile.
4. Download the NCS-R data set from the book Web site and consider
the following questions. For this exercise, the SESTRAT variable
identifies the stratum codes for computation of sampling errors, the
SECLUSTR variable identifies the sampling error computation units,
and the NCSRWTSH variable contains the final sampling weights
for Part 1 of the survey for each sampled individual.
a. How many sampling error calculation strata are specified for the
NCS-R sampling error calculation model?
b. How many SECUs (or clusters) are there in total?
c. How many degrees of freedom for variance estimation does the
NCS-R provide?
d. What is the expected loss, $L_w$, due to random weighting in survey
estimation for total population estimates? Hint: $L_w = CV^2(\text{weight})$;
see Section 2.7.5.
e. What is the average cluster size, b, for total sample analyses of
variables with no item-missing data?
5. Using the statistical software procedure of your choice, estimate
the proportion of persons in the 2006 HRS target population with
arthritis (ARTHRITIS = 1). Use Taylor series linearization to estimate
a standard error for this proportion. Then, answer the following
questions:
a. What is the design effect for the total sample estimate of the proportion of persons with arthritis (ARTHRITIS = 1)? What is the
design effect for the estimated proportion of respondents age 70
and older (AGE70 = 1)? Hint: Use the standard variance formula,
var(p) = p × (1 – p)/(n – 1), to obtain the variance of the proportion p under SRS. Use the weighted estimate of p provided by the
software procedure to compute the SRS variance. Don’t confuse
standard errors and variances (squares of standard errors).
b. Construct a 95% confidence interval for the mean of DIABETES.
Based on this confidence interval, would you say the proportion
of individuals with diabetes in the 2006 HRS target population is
significantly different from 0.25?
6. Examine the CONTENTS listing for the NCS-R data set on the book
Web site. Choose a dependent variable of interest (e.g., MDE: 1 = Yes,
0 = No). Develop a one-sample hypothesis (e.g., the prevalence of
lifetime major depressive episodes in the U.S. adult population is p
= 0.20). Write down your hypothesis before you actually look at the
sample estimates. Perform the required analysis using a software
procedure of your choosing, computing the weighted sample-based
estimate of the population parameter and a 95% confidence interval for the desired parameter. Test your hypothesis using the confidence interval. Write a one-paragraph statement of your hypothesis
and a summary of the results of your sample-based estimation and
your inference/conclusion based on the 95% CI.
7. Two subclasses are defined for NCS-R respondents based on their
response to a question on diagnosis of a major depressive episode
(MDE) (1 = Yes, 0 = No). For these two subclasses, use the software
procedure of your choice to estimate the difference of means and
standard error of the difference for body mass index (BMI). Make
sure to use the unconditional approach to subclass analysis in this
case, given that these subclasses can be thought of as cross-classes
(see Section 4.5). Use the output from this analysis to replicate the
following summary table.
Subclass                     Variable    ȳ [se(ȳ)]
MDE = 1 (Yes)                BMI         27.59 (.131)
MDE = 0 (No)                 BMI         26.89 (.102)
Difference in Means (1–0)    BMI         .693 (.103)
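The design-effect quantities that Exercise 1 asks for can be sketched as follows. This is a minimal illustration in Python on a small made-up data set, not the exercise data, and it assumes an equal-probability (self-weighting) sample, with-replacement selection of clusters within strata, and the familiar simple model deff = 1 + (b − 1)·roh for clusters of size b. The function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def srs_se(y):
    """Standard error of the mean treating the data as a simple random sample."""
    n = len(y)
    mean = sum(y) / n
    s2 = sum((v - mean) ** 2 for v in y) / (n - 1)
    return sqrt(s2 / n)

def design_se(y, stratum, cluster):
    """Linearized SE of the unweighted mean for a stratified cluster sample."""
    n = len(y)
    mean = sum(y) / n
    # Cluster totals of the linearized variate z_i = (y_i - mean) / n.
    totals = defaultdict(float)
    for v, h, c in zip(y, stratum, cluster):
        totals[(h, c)] += (v - mean) / n
    by_stratum = defaultdict(list)
    for (h, _c), z in totals.items():
        by_stratum[h].append(z)
    # Between-cluster variation within each stratum (needs >= 2 clusters per stratum).
    var = 0.0
    for z in by_stratum.values():
        a_h = len(z)
        zbar = sum(z) / a_h
        var += a_h / (a_h - 1) * sum((v - zbar) ** 2 for v in z)
    return sqrt(var)

def roh_from_deff(deff, b):
    """Solve deff = 1 + (b - 1) * roh for roh (assumed simple design-effect model)."""
    return (deff - 1) / (b - 1)

# Hypothetical toy data: 2 strata, 2 clusters per stratum, 2 observations per cluster.
y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0]
stratum = [1, 1, 1, 1, 2, 2, 2, 2]
cluster = [1, 1, 2, 2, 3, 3, 4, 4]

deft = design_se(y, stratum, cluster) / srs_se(y)
roh = roh_from_deff(deft ** 2, b=2)
print(f"DEFT = {deft:.3f}, roh = {roh:.3f}")
```

Passing a single stratum code for every observation gives the clustered-only calculation, and giving every observation its own cluster code within its stratum gives the stratified-only calculation, so the same function covers each design considered in the exercise.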
6 Categorical Data Analysis
If the same group of individuals is classified in two or more different
ways, as persons may be classified as inoculated and not inoculated,
and also may be attacked and not attacked by disease, then we may
require to know if the two classifications are independent.
—R. A. Fisher (1925)
6.1 Introduction
A casual perusal of the codebook and variable descriptions for most public-use survey data sets quickly leads to the observation that the responses to
the majority of survey questions in the social sciences and public health and
related fields of research are measured as a binary choice (e.g., yes, no), a selection from multinomial response categories (e.g., ethnicity), a choice from an
ordinal scale (e.g., strongly agree to strongly disagree), or possibly a discrete
count of events. This chapter covers procedures for simple univariate, bivariate, and selected multivariate analyses for such categorical survey responses,
focusing on the adaptation of established analytic techniques to complex
sample survey data. For readers interested in a fuller discussion of these basic
techniques of categorical data analysis, we recommend Agresti (2002).
The outline of this chapter parallels that of Chapter 5. Section 6.2 highlights
several important considerations in categorical data analysis for complex
sample surveys. Basic methods for analyzing a single categorical variable are
described in Section 6.3, including estimation of category proportions, goodness-of-fit (GOF) tests to compare the sample estimates with hypothesized
population values, and graphical display of the estimated population distribution across the K categories of the single variable. Section 6.4 extends the
discussion to bivariate analyses including statistics that measure association
and tests of hypotheses concerning independence of two categorical variables. The chapter concludes in Section 6.5 with coverage of two techniques
for multivariate categorical data: the Cochran–Mantel–Haenszel (CMH) test,
and a brief discussion of simple log-linear models for cross-tabulated data.
Multivariate regression modeling and related methods for categorical data
will be introduced later in Chapters 8 and 9.
6.2 A Framework for Analysis of Categorical Survey Data
We begin this chapter by introducing a framework for the analysis of categorical data collected in complex sample surveys. We introduce methods for
accommodating complex sample design features and important considerations for both univariate and bivariate analyses of categorical data. Sections
6.3 and 6.4 go into more detail about these analysis approaches.
6.2.1 Incorporating the Complex Design and Pseudo-Maximum Likelihood
The simplicity of categorical survey responses can belie the range of sophistication of categorical data analysis techniques—techniques that range from
the simplest estimation of category proportions to complex multivariate
and even multilevel regression models. The majority of the standard estimators and test statistics for categorical data analysis are derived under the
method of maximum likelihood and assume that the data are independent
and identically distributed (i.i.d.) according to a discrete probability distribution. Under simple random sampling assumptions, categorical variables
are assumed to follow one of several known probability distributions or
sampling models (Agresti, 2002)—that is, the binomial, the multinomial,
the Poisson, the product-multinomial, or, more rarely, a hypergeometric
model. Unfortunately, due to sample weighting, clustering, and stratification, the true likelihood of the sample survey data is generally difficult to
specify analytically. Therefore, the simple elegance of maximum likelihood
methods for estimation and inference does not easily transfer to complex
sample survey data. This chapter will introduce the methods and software
that survey statisticians have developed to adjust standard analyses for the
complex sample design effects, including weighted estimates of proportions, design-based estimates of sampling variance, and generalized design
effect corrections for key test statistics. Later, in Chapters 8 and 9, different forms of the generalized linear model and pseudo-maximum likelihood techniques will be discussed for regression modeling of categorical
data that follow an approximate binomial, multinomial, Poisson, or negative
binomial distribution.
6.2.2 Proportions and Percentages
In Section 6.3 we discuss estimation of proportions and their standard
errors. In this chapter, we denote estimates of population proportions as p
and population proportions as π. Many software packages choose to output estimates of percentages and standard errors on the percentage scale.
Translation between estimates and standard errors for proportions and percentages simply involves the following multiplicative scaling:
$$
\text{percent} = 100 \cdot p, \qquad se(\text{percent}) = 100 \cdot se(p), \qquad var(\text{percent}) = 100^2 \cdot var(p)
$$
While these relationships may be obvious to experienced analysts, our
experience suggests that it is easy for mistakes to be made, especially in
translating standard errors from the percentage to the proportion scale.
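A short sketch of the scaling, using hypothetical values and hypothetical helper names, makes the point concrete: the standard error moves between scales by a factor of 100, while the variance moves by a factor of 100².

```python
def to_percent_scale(p, se_p):
    """Return (percent, se, variance) on the percentage scale."""
    se_percent = 100 * se_p
    return 100 * p, se_percent, se_percent ** 2

def to_proportion_scale(percent, se_percent):
    """Return (proportion, se, variance) on the proportion scale."""
    se_p = se_percent / 100
    return percent / 100, se_p, se_p ** 2

# Hypothetical estimate: p = 0.25 with se(p) = 0.02.
percent, se_percent, var_percent = to_percent_scale(0.25, 0.02)
p, se_p, var_p = to_proportion_scale(percent, se_percent)

print(percent, se_percent, var_percent)
print(p, se_p, var_p)
```

A common slip is to divide a variance reported on the percentage scale by 100 rather than by 10,000 when converting back to the proportion scale.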
6.2.3 Cross-Tabulations, Contingency Tables, and Weighted Frequencies
Categorical data analysis becomes more complex (and also more interesting) when the analysis is expanded from a single variable to include two
or more categorical variables. With two categorical variables, the preferred
summary is typically a two-way data display with r = 1, …, R rows and c =
1, …, C columns, often referred to in statistical texts and literature as a cross-tabulation or a contingency table. Cross-tabs and contingency tables are
not limited to two dimensions but may include a third (or higher) dimension, that is, l = 1, …, L layers or subtables based on categories of a third
variable. For simplicity of presentation, the majority of the examples in this
chapter will be based on the cross-tabulation of two categorical variables
(see Section 6.4).
Consider the simple R = 2 by C = 2 (2 × 2) tabular display of observed
sample frequencies shown in Figure 6.1.
Under simple random sampling (SRS), one-, two-, three-, or higher-dimension arrays of unweighted sample frequencies like the 2 × 2 array illustrated
in Figure 6.1 can be used directly to estimate statistics of interest such as
the row proportion in category 1, $p_{1|1} = n_{11} / n_{1+}$, or to derive tests of hypotheses
for the relationships between categorical variables, such as the Pearson chi-square ($\chi^2$) test. However, because individuals can be selected to a survey
sample with varying probabilities, estimates and test statistics computed
from the unweighted sample frequencies may be biased for the true properties of the survey population. Therefore, it is necessary to translate from
unweighted sample counts to weighted frequencies as shown in Figure 6.2,
                        Variable 2
Variable 1          0         1         Row Margin
0                   n00       n01       n0+
1                   n10       n11       n1+
Column Margin       n+0       n+1       n++

Figure 6.1
Bivariate distribution of observed sample frequencies.
                        Variable 2
Variable 1          0         1         Row Margin
0                   N̂00       N̂01       N̂0+
1                   N̂10       N̂11       N̂1+
Column Margin       N̂+0       N̂+1       N̂++

Figure 6.2
Bivariate distribution of weighted sample frequencies.
where for example, the weighted frequency (or estimated population count)
in cell (0,1) is
$$
\hat{N}_{01} = \sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i \in (0,1)} w_{h\alpha i}
$$
The weighted proportions estimated from these weighted sample frequencies, such as $p_{rc} = \hat{N}_{rc} / \hat{N}_{++}$, will reflect the relative size of the total
population in the corresponding cell, row margin, or column margin of the
cross-tabulation.
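The step from unweighted counts to weighted frequencies can be sketched in a few lines. The records and weights below are invented for illustration. For the point estimates, the triple sum over strata, clusters, and elements reduces to a sum of the weights of all sample cases falling in a cell; the stratum and cluster codes matter only for variance estimation.

```python
from collections import defaultdict

# Hypothetical records: (variable 1, variable 2, sampling weight).
records = [
    (0, 0, 1200.0),
    (0, 1, 800.0),
    (1, 0, 500.0),
    (1, 1, 1500.0),
    (1, 1, 1000.0),
    (0, 0, 1000.0),
]

n = defaultdict(int)        # unweighted sample counts n_rc
N_hat = defaultdict(float)  # weighted frequencies (estimated population counts)
for r, c, w in records:
    n[(r, c)] += 1
    N_hat[(r, c)] += w

n_total = sum(n.values())
N_total = sum(N_hat.values())

# Total (cell) proportions on the unweighted and weighted scales.
unweighted_p = {cell: n[cell] / n_total for cell in n}
weighted_p = {cell: N_hat[cell] / N_total for cell in N_hat}

for cell in sorted(n):
    print(cell, n[cell], N_hat[cell],
          round(unweighted_p[cell], 3), round(weighted_p[cell], 3))
```

In this toy data the cases in cell (1,1) carry larger weights than average, so the weighted proportion for that cell exceeds the unweighted one.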
We discuss bivariate analyses in more detail in Section 6.4.
6.3 Univariate Analysis of Categorical Data
Simple descriptive analyses of a single categorical variable in a survey data
set can take a number of forms, including estimation of a simple population
proportion π for binary responses; estimation of the population proportion,
$\pi_k$, for each of the k = 1, …, K categories of a multicategory nominal or ordinal response variable; and, with care, estimation of the means for ordinally
scaled responses (see Section 5.3.3). To draw inferences about the population
parameters being estimated, analysts can construct 100(1 – α)% confidence
intervals (CIs) for the parameters or perform Student t or simple $\chi^2$ hypothesis tests. Properly weighted graphical displays of the frequency distribution of the categorical variable are also very effective tools for presentation
of results.
6.3.1 Estimation of Proportions for Binary Variables
Estimation of a single population proportion, π, for a binary response variable requires only a straightforward extension of the ratio estimator (Section
5.3.3) for the population mean of a continuous random variable. By recoding
the original response categories to a single indicator variable $y_i$ with possible
values 1 and 0 (e.g., yes = 1, no = 0), the ratio mean estimator estimates the
proportion or prevalence, π, of “1s” in the population:
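$$
p = \frac{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i} \, I(y_i = 1)}{\sum_{h=1}^{H} \sum_{\alpha=1}^{a_h} \sum_{i=1}^{n_{h\alpha}} w_{h\alpha i}} = \frac{\hat{N}_1}{\hat{N}} \qquad (6.1)
$$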
Most software procedures for survey data analysis provide the user with
at least two alternatives for calculating ratio estimates of proportions. The
first alternative is to use the single variable option in procedures such as
Stata svy: prop and svy: tab or SAS PROC SURVEYFREQ. These programs are designed to estimate the univariate proportions for the discrete
categories of a single variable as well as total, row, and column proportions
for cross-tabulations of two or more categorical variables. The second alternative for estimating single proportions is to first generate an indicator variable for the category of interest (with possible values 1 if a case falls into the
category of interest, or 0 if a case does not fall into the category of interest)
and then to apply a procedure designed for estimation of means (e.g., Stata
svy: mean or SAS PROC SURVEYMEANS). Both approaches will yield
identical estimates of the population proportion and the standard error, but,
as explained next, estimated confidence intervals developed by the different
procedures may not agree exactly due to differences in the methods used to
derive the lower and upper confidence limits.
An application of Taylor series linearization (TSL) to the ratio estimator of
π in Equation 6.1 results in the following TSL variance estimator for the ratio
estimate of a simple proportion:
$$
v(p) \doteq \frac{V(\hat{N}_1) + p^2 \cdot V(\hat{N}) - 2 \cdot p \cdot Cov(\hat{N}_1, \hat{N})}{\hat{N}^2} \qquad (6.2)
$$
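To make Equation 6.2 concrete, the following sketch computes the three ingredients from weighted cluster totals within strata and then combines them. It assumes the usual with-replacement (ultimate cluster) approximation for the variances and covariance of the estimated totals, which the text does not spell out at this point; the data and the function names are hypothetical.

```python
from collections import defaultdict

def cov_of_totals(u, v, stratum, cluster):
    """With-replacement covariance of two estimated totals, built from cluster totals."""
    tu, tv = defaultdict(float), defaultdict(float)
    for ui, vi, h, a in zip(u, v, stratum, cluster):
        tu[(h, a)] += ui
        tv[(h, a)] += vi
    by_stratum = defaultdict(list)
    for key in tu:
        by_stratum[key[0]].append((tu[key], tv[key]))
    cov = 0.0
    for pairs in by_stratum.values():
        a_h = len(pairs)  # clusters in this stratum (must be at least 2)
        mu = sum(x for x, _ in pairs) / a_h
        mv = sum(z for _, z in pairs) / a_h
        cov += a_h / (a_h - 1) * sum((x - mu) * (z - mv) for x, z in pairs)
    return cov

def tsl_proportion(y, w, stratum, cluster):
    """Weighted proportion (Equation 6.1) and its TSL variance (Equation 6.2)."""
    wy = [wi * yi for wi, yi in zip(w, y)]
    n1_hat, n_hat = sum(wy), sum(w)
    p = n1_hat / n_hat
    v_n1 = cov_of_totals(wy, wy, stratum, cluster)   # V(N1-hat)
    v_n = cov_of_totals(w, w, stratum, cluster)      # V(N-hat)
    c_n1_n = cov_of_totals(wy, w, stratum, cluster)  # Cov(N1-hat, N-hat)
    v_p = (v_n1 + p ** 2 * v_n - 2 * p * c_n1_n) / n_hat ** 2
    return p, v_p

# Hypothetical data: 2 strata, 2 clusters per stratum, 3 cases per cluster.
y = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
w = [10.0, 12.0, 9.0, 11.0, 10.0, 8.0, 15.0, 14.0, 13.0, 9.0, 10.0, 12.0]
stratum = [1] * 6 + [2] * 6
cluster = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]

p, v_p = tsl_proportion(y, w, stratum, cluster)
print(f"p = {p:.4f}, se(p) = {v_p ** 0.5:.4f}")
```

Under the same assumptions, the result is algebraically identical to the variance of the total of the linearized variate $w_{h\alpha i}(y_i - p)/\hat{N}$, which is how many software procedures organize the calculation.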
Replication techniques (jackknife repeated replication [JRR], balanced
repeated replication [BRR]) can also be used to compute estimates of the variances of these estimated proportions and the corresponding design-based
confidence intervals and test statistics.
If the analyst estimates a proportion as the mean of an indicator variable
(e.g., using Stata’s svy: mean procedure), a standard design-based confidence interval is constructed for the proportion, $CI(p) = p \pm t_{1-\alpha/2,\,df} \cdot se(p)$.
One complication that arises when proportions are viewed simply as the
mean of a binary variable is that the true proportion, π, is constrained to lie
in the interval (0,1). When the estimated proportion of interest is extreme (i.e.,
close to 0 or close to 1), the standard design-based confidence limits may be
less than 0 or exceed 1. To address this problem, alternative computations
of design-based confidence intervals for proportions have been proposed,
including the logit transformation procedure and the modified Wilson procedure (Hsu and Rust, 2007).
Stata’s svy: tab command uses the logit transformation procedure by
default. Implementation of this procedure for constructing a confidence
interval is a two-step process. First, using the weighted estimate of the proportion, one constructs a 95% CI for the logit transform of p:
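$$
CI[\mathrm{logit}(p)] = \{A, B\} = \left\{ \ln\left(\frac{p}{1-p}\right) - \frac{t_{1-\alpha/2,\,df} \cdot se(p)}{p \cdot (1-p)}, \; \ln\left(\frac{p}{1-p}\right) + \frac{t_{1-\alpha/2,\,df} \cdot se(p)}{p \cdot (1-p)} \right\} \qquad (6.3)
$$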
where
p = the weighted estimate of the proportion of interest;
se(p) = the design-based Taylor Series approximation to the standard error
of the estimated proportion.
Next, the two confidence limits on the logit scale, A and B, are transformed
back to the original (0,1) scale:
$$
CI(p) = \left\{ \frac{e^{A}}{1 + e^{A}}, \; \frac{e^{B}}{1 + e^{B}} \right\} \qquad (6.4)
$$
Although procedures such as Stata svy: tab and SUDAAN PROC
CROSSTAB default to the logit transformation formula for estimating CI(p)
for all values of p, the adjusted CIs generally do not differ from the standard symmetric CIs unless p < 0.10 or p > 0.90. Interested readers can find a
description of the modified Wilson procedure in Theory Box 6.1.
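The two-step calculation in Equations 6.3 and 6.4 can be sketched as follows, alongside the standard symmetric interval for comparison. The estimate, standard error, and critical value are hypothetical inputs; in practice the critical value comes from the Student t distribution with the design-based degrees of freedom, and the function names are illustrative rather than taken from any package.

```python
from math import exp, log

def standard_ci(p, se, t_crit):
    """Symmetric design-based interval p +/- t * se(p)."""
    return p - t_crit * se, p + t_crit * se

def logit_ci(p, se, t_crit):
    """Interval built on the logit scale (6.3) and transformed back (6.4)."""
    center = log(p / (1 - p))
    half_width = t_crit * se / (p * (1 - p))
    a, b = center - half_width, center + half_width
    return exp(a) / (1 + exp(a)), exp(b) / (1 + exp(b))

# Hypothetical rare characteristic: p = 0.02, se(p) = 0.015, critical value 2.0.
p, se, t_crit = 0.02, 0.015, 2.0
std_low, std_high = standard_ci(p, se, t_crit)
lgt_low, lgt_high = logit_ci(p, se, t_crit)

print(f"standard: ({std_low:.4f}, {std_high:.4f})")
print(f"logit:    ({lgt_low:.4f}, {lgt_high:.4f})")
```

With these inputs the symmetric interval has a negative lower limit, whereas the back-transformed logit interval stays inside (0,1) and is asymmetric around p.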
Example 6.1: Estimating the Proportion of the U.S. Adult Population with an Irregular Heart Beat
In this example, data from the 2005–2006 NHANES are used to estimate the
proportion of U.S. adults with an irregular heartbeat (defined by BPXPULS = 2 in
the NHANES data set). To enable a direct comparison of the results from analyses
performed using Stata svy: tab, svy: prop, and svy: mean, we generate
a binary indicator, equal to 1 for respondents reporting an irregular heartbeat,
and 0 otherwise. The recoding of the original response categories for BPXPULS
would not be necessary for estimation performed using either svy: tab or svy: