
Example 5.13: Estimating Differences in Mean Total Household Assets from 2004 to 2006 Using Data from the HRS


Applied Survey Data Analysis

and WEIGHT. Each record also includes the permanently assigned stratum and cluster codes. As described in Example 5.4, the HRS public use data sets include an indicator variable for each wave of data that identifies the respondents who are the household financial reporters (JFINR for 2004; KFINR for 2006). Using the over(year) option, estimates of mean total household assets are computed separately for 2004 and 2006. The subpop(finr0406) option restricts the estimates to the financial reporters for each of these two data collection years. The postestimation lincom statement estimates the difference of means for the two time periods, its linearized standard error, and a 95% confidence interval for the difference:

// stack 2004 and 2006 records; pick the wave-specific household weight
gen weight = jwgthh
replace weight = kwgthh if year == 2006
// flag the household financial reporters in each wave
gen finr04 = 1 if (year==2004 & jfinr==1)
gen finr06 = 1 if (year==2006 & kfinr==1)
gen finr0406 = 1 if finr04==1 | finr06==1
// declare the design and estimate the two subpopulation means
svyset secu [pweight = weight], strata(stratum)
svy, subpop(finr0406): mean totassets, over(year)
// difference of the 2004 and 2006 means
lincom [totassets]2004 - [totassets]2006

Contrast         ȳ2004 − ȳ2006    se(ȳ2004 − ȳ2006)    CI.95(ȳ2004 − ȳ2006)
2004 vs. 2006    –$115,526        $20,025              (–$155,642, –$75,411)

Note that the svyset command has been used again to specify the recoded sampling weight variable (WEIGHT) in the stacked data set. The svy: mean command is then used to request weighted estimates and linearized standard errors (and the covariances of the estimates, which are saved internally) for each subpopulation defined by the YEAR variable. The resulting estimate of the difference of means is Δ̂ = ȳ2004 − ȳ2006 = –$115,526, with a linearized standard error of $20,025. The analysis provides evidence that mean total household assets increased significantly from 2004 to 2006 for this population.
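The arithmetic behind the lincom result can be sketched in a few lines of Python. The point estimate and standard error below are taken from the table above; the variance components v04, v06, and cov are hypothetical stand-ins (Stata holds the actual covariance matrix internally), and the t quantile assumes roughly 56 design degrees of freedom for the HRS, which is an illustrative assumption:

```python
import math

# Hypothetical Var/Cov of the two subpopulation means (not actual HRS output):
# the SE of a difference combines them as sqrt(V1 + V2 - 2*Cov)
v04, v06, cov = 2.5e8, 2.0e8, 2.4e7
se_diff_from_cov = math.sqrt(v04 + v06 - 2 * cov)

# 95% CI for the published difference, with a hardcoded t quantile
# (use scipy.stats.t.ppf(0.975, df) if SciPy is available)
diff, se = -115_526.0, 20_025.0
t975_56 = 2.0032          # approximate t(0.975) for an assumed 56 df
ci = (diff - t975_56 * se, diff + t975_56 * se)
```

With these inputs the interval reproduces the published one to within rounding of the t quantile.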

5.7 Exercises

1. This exercise serves to illustrate the effects of complex sample designs (i.e., design effects) on the variances of estimated means, due to stratification and clustering in sample selection (see Section 2.6.1 for a review). The following table lists values for an equal-probability (self-weighting) sample of n = 36 observations.

Observations                    STRATUM   CLUSTER
 7.0685  13.6760  13.7441          1         1
 7.2293   7.2293  13.7315          1         2
13.2310  10.9647  11.3274          1         3
17.3248  19.7091  13.6724          1         4
15.3685  15.9246  20.2603          1         5
12.4546  10.8922  11.2793          1         6
16.4423  12.1142  12.9173          2         7
16.2839  15.3004  14.0902          2         8
12.0955  18.4702  12.3425          2         9
11.8507  11.9133  16.7290          2        10
18.3800  14.6646  13.5876          2        11
16.4873  18.1224  14.6783          2        12

© 2010 by Taylor and Francis Group, LLC

Any software procedure can be used to answer the following four questions.

a. Assume the sample of n = 36 observations is a simple random sample from the population. Compute the sample mean, the standard error of the mean, and a 95% confidence interval for the population mean. (Ignore the finite population correction [fpc], stratification, and clustering in calculating the standard error.)

b. Next, assume that the n = 36 observations are selected as a stratified random sample of 18 observations from each of two strata, and that the population sizes of the two strata are equal. Compute the sample mean, the standard error of the mean, and a 95% confidence interval for the population mean. (Ignore the fpc and the apparent clustering in calculating the standard error.) What is the estimated DEFT(ȳ) for the standard error of the sample mean (i.e., the square root of the design effect)?

c. Assume now that the n = 36 observations are selected as an equal-probability sample of 12 clusters with exactly three observations from each cluster. Compute the sample mean, the standard error of the mean, a 95% confidence interval for the population mean, and the estimate of DEFT(ȳ). (Ignore the fpc and the stratification in calculating the standard error.) Use the simple model of the design effect for sample means to derive an estimate of roh (2.11), the synthetic intraclass correlation (this may take a negative value for this "synthetic" data set).

d. Finally, assume that the n = 36 observations are selected as an equal-probability stratified cluster sample of observations (two strata, six clusters per stratum, three observations per cluster). Compute the sample mean, the standard error of the mean, the 95% CI, and estimates of DEFT(ȳ) and roh.
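Parts (a) through (d) turn on a handful of formulas: an SRS standard error, a between-cluster standard error, DEFT as the ratio of the two, and roh backed out of deff = 1 + roh(b − 1). The following Python sketch illustrates the computations on a small hypothetical data set (not the 36 values above), ignoring the fpc throughout:

```python
import math
import statistics

# Hypothetical data: 4 clusters of 3 observations each
y = [4.0, 5.0, 7.0, 6.0, 8.0, 9.0, 2.0, 3.0, 4.0, 7.0, 8.0, 6.0]
clusters = [y[i:i + 3] for i in range(0, len(y), 3)]
n, a, b = len(y), len(clusters), 3

ybar = statistics.mean(y)
se_srs = statistics.stdev(y) / math.sqrt(n)        # ignore clustering (part a)

# Ultimate-cluster SE: variance of the cluster means over the number of clusters
cluster_means = [statistics.mean(c) for c in clusters]
se_clu = statistics.stdev(cluster_means) / math.sqrt(a)

deft = se_clu / se_srs                             # estimated DEFT(ybar)
roh = (deft ** 2 - 1) / (b - 1)                    # from deff = 1 + roh*(b - 1)
```

The same template applies to the exercise data once the stratum and cluster codes are attached to each observation.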


2. Using the NCS-R data and a statistical software procedure of your choice, compute a weighted estimate of the total number of U.S. adults who have ever been diagnosed with alcohol dependence (ALD), along with a 95% confidence interval for the total. Make sure to incorporate the complex design when computing the estimate and the confidence interval. Compute a second 95% confidence interval using an alternative variance estimation technique, and compare the two resulting confidence intervals. Would your inferences change at all depending on the variance estimation approach?
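As a sketch of what an "alternative variance estimation technique" involves, the hypothetical example below estimates a weighted total from four clusters and computes both an ultimate-cluster (linearization-style) variance and a delete-one-cluster jackknife variance. For a simple total the two coincide exactly, so any differences between methods show up only for nonlinear statistics such as means and ratios:

```python
import math

# Hypothetical clusters of (weight, 0/1 indicator) pairs, not NCS-R records
clusters = [
    [(1.5, 1), (2.0, 0), (1.0, 1)],
    [(2.5, 0), (1.0, 0), (1.5, 1)],
    [(1.0, 1), (2.0, 1), (2.0, 0)],
    [(1.5, 0), (1.0, 1), (2.5, 1)],
]
a = len(clusters)

cluster_totals = [sum(w * y for w, y in c) for c in clusters]
total = sum(cluster_totals)                     # weighted total, sum of w_i * y_i

# Ultimate-cluster variance of a total: a/(a-1) * sum of (z_c - zbar)^2
zbar = total / a
var_lin = a / (a - 1) * sum((z - zbar) ** 2 for z in cluster_totals)

# Delete-one-cluster jackknife: drop cluster c, rescale the rest by a/(a-1)
jk = [a / (a - 1) * (total - z) for z in cluster_totals]
jk_mean = sum(jk) / a
var_jk = (a - 1) / a * sum((t - jk_mean) ** 2 for t in jk)

se_lin, se_jk = math.sqrt(var_lin), math.sqrt(var_jk)
```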

3. (Requires SUDAAN or SAS Version 9.2+.) Using the SUDAAN or SAS software and the 2005–2006 NHANES data set, estimate the 25th percentile, the median, and the 75th percentile of systolic blood pressure (BPXSY1) for U.S. adults over the age of 50. You will need to create a subpopulation indicator of those aged 51 and older for this analysis. Remember to perform an appropriate subpopulation analysis for this population subclass. Compute 95% confidence intervals for each percentile.
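Design-based percentile estimation starts from the weighted empirical distribution: order the values and find where the cumulative weight first reaches the desired fraction of the total weight. The minimal sketch below uses hypothetical (value, weight) pairs, not NHANES records, and omits the interpolation refinements that SUDAAN and SAS apply:

```python
def weighted_quantile(pairs, q):
    """Smallest value whose cumulative weight reaches q of the total weight."""
    pairs = sorted(pairs)                     # sort by value
    wtot = sum(w for _, w in pairs)
    cum = 0.0
    for value, w in pairs:
        cum += w
        if cum >= q * wtot:
            return value
    return pairs[-1][0]

# Hypothetical systolic pressures with weights
data = [(118, 1.0), (150, 2.0), (124, 1.5), (136, 1.0), (129, 0.5)]
quartiles = [weighted_quantile(data, q) for q in (0.25, 0.50, 0.75)]
```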

4. Download the NCS-R data set from the book Web site and consider the following questions. For this exercise, the SESTRAT variable identifies the stratum codes for computation of sampling errors, the SECLUSTR variable identifies the sampling error computation units, and the NCSRWTSH variable contains the final sampling weights for Part 1 of the survey for each sampled individual.

a. How many sampling error calculation strata are specified for the NCS-R sampling error calculation model?

b. How many SECUs (or clusters) are there in total?

c. How many degrees of freedom for variance estimation does the NCS-R provide?

d. What is the expected loss, Lw, due to random weighting in survey estimation for total population estimates? Hint: Lw = CV²(weight); see Section 2.7.5.

e. What is the average cluster size, b, for total sample analyses of variables with no item-missing data?
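Parts (c) through (e) are simple arithmetic once the design counts are read off the data set. A Python sketch with hypothetical counts (not the actual NCS-R design figures) shows the computations:

```python
import statistics

# Hypothetical design counts read off the data set
n_strata, n_secus = 42, 84
df = n_secus - n_strata                 # part (c): design-based degrees of freedom

# Part (d): Lw = CV^2(weight), the relative variance of the final weights
weights = [0.8, 1.2, 1.0, 1.5, 0.5, 1.0]     # hypothetical final weights
Lw = statistics.pvariance(weights) / statistics.mean(weights) ** 2

# Part (e): average cluster size for full-sample analyses
n_total = 9_282                         # hypothetical total sample size
b = n_total / n_secus
```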

5. Using the statistical software procedure of your choice, estimate the proportion of persons in the 2006 HRS target population with arthritis (ARTHRITIS = 1). Use Taylor series linearization to estimate a standard error for this proportion. Then, answer the following questions:

a. What is the design effect for the total sample estimate of the proportion of persons with arthritis (ARTHRITIS = 1)? What is the design effect for the estimated proportion of respondents age 70 and older (AGE70 = 1)? Hint: Use the standard variance formula, var(p) = p × (1 – p)/(n – 1), to obtain the variance of the proportion p under SRS, with the weighted estimate of p provided by the software procedure. Don't confuse standard errors and variances (squares of standard errors).

b. Construct a 95% confidence interval for the mean of DIABETES. Based on this confidence interval, would you say the proportion of individuals with diabetes in the 2006 HRS target population is significantly different from 0.25?
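The hint in part (a) amounts to dividing the design-based variance by the SRS variance formula. A sketch with hypothetical inputs (not actual HRS output):

```python
# Hypothetical values standing in for software output
p_wtd = 0.337       # weighted estimate of the proportion
se_design = 0.006   # design-based (linearized) SE of the proportion
n = 18_469          # nominal sample size

var_srs = p_wtd * (1 - p_wtd) / (n - 1)   # SRS variance using the weighted p
deff = se_design ** 2 / var_srs           # ratio of variances, not of SEs
deft = deff ** 0.5                        # square root of the design effect
```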

6. Examine the CONTENTS listing for the NCS-R data set on the book Web site. Choose a dependent variable of interest (e.g., MDE: 1 = Yes, 0 = No). Develop a one-sample hypothesis (e.g., the prevalence of lifetime major depressive episodes in the U.S. adult population is π = 0.20). Write down your hypothesis before you actually look at the sample estimates. Perform the required analysis using a software procedure of your choosing, computing the weighted sample-based estimate of the population parameter and a 95% confidence interval for the desired parameter. Test your hypothesis using the confidence interval. Write a one-paragraph statement of your hypothesis and a summary of the results of your sample-based estimation and your inference/conclusion based on the 95% CI.

7. Two subclasses are defined for NCS-R respondents based on their response to a question on diagnosis of a major depressive episode (MDE) (1 = Yes, 0 = No). For these two subclasses, use the software procedure of your choice to estimate the difference of means and the standard error of the difference for body mass index (BMI). Make sure to use the unconditional approach to subclass analysis in this case, given that these subclasses can be thought of as cross-classes (see Section 4.5). Use the output from this analysis to replicate the following summary table.

Subclass                     Variable    ȳ [se(ȳ)]
MDE = 1 (Yes)                BMI         27.59 (0.131)
MDE = 0 (No)                 BMI         26.89 (0.102)
Difference in Means (1–0)    BMI         0.693 (0.103)

6 Categorical Data Analysis

If the same group of individuals is classified in two or more different ways, as persons may be classified as inoculated and not inoculated, and also may be attacked and not attacked by disease, then we may require to know if the two classifications are independent.

—R. A. Fisher (1925)

6.1 Introduction

A casual perusal of the codebook and variable descriptions for most public-use survey data sets quickly leads to the observation that the responses to the majority of survey questions in the social sciences, public health, and related fields of research are measured as a binary choice (e.g., yes, no), a selection from multinomial response categories (e.g., ethnicity), a choice from an ordinal scale (e.g., strongly agree to strongly disagree), or possibly a discrete count of events. This chapter covers procedures for simple univariate, bivariate, and selected multivariate analyses for such categorical survey responses, focusing on the adaptation of established analytic techniques to complex sample survey data. For readers interested in a fuller discussion of these basic techniques of categorical data analysis, we recommend Agresti (2002).

The outline of this chapter parallels that of Chapter 5. Section 6.2 highlights several important considerations in categorical data analysis for complex sample surveys. Basic methods for analyzing a single categorical variable are described in Section 6.3, including estimation of category proportions, goodness-of-fit (GOF) tests to compare the sample estimates with hypothesized population values, and graphical display of the estimated population distribution across the K categories of the single variable. Section 6.4 extends the discussion to bivariate analyses, including statistics that measure association and tests of hypotheses concerning independence of two categorical variables. The chapter concludes in Section 6.5 with coverage of two techniques for multivariate categorical data: the Cochran–Mantel–Haenszel (CMH) test and a brief discussion of simple log-linear models for cross-tabulated data. Multivariate regression modeling and related methods for categorical data will be introduced later in Chapters 8 and 9.


6.2 A Framework for Analysis of Categorical Survey Data

We begin this chapter by introducing a framework for the analysis of categorical data collected in complex sample surveys. We introduce methods for accommodating complex sample design features and important considerations for both univariate and bivariate analyses of categorical data. Sections 6.3 and 6.4 go into more detail about these analysis approaches.

6.2.1 Incorporating the Complex Design and Pseudo-Maximum Likelihood

The simplicity of categorical survey responses can belie the range of sophistication of categorical data analysis techniques, which range from the simplest estimation of category proportions to complex multivariate and even multilevel regression models. The majority of the standard estimators and test statistics for categorical data analysis are derived under the method of maximum likelihood and assume that the data are independent and identically distributed (i.i.d.) according to a discrete probability distribution. Under simple random sampling assumptions, categorical variables are assumed to follow one of several known probability distributions or sampling models (Agresti, 2002): the binomial, the multinomial, the Poisson, the product-multinomial, or, more rarely, a hypergeometric model. Unfortunately, due to sample weighting, clustering, and stratification, the true likelihood of the sample survey data is generally difficult to specify analytically. Therefore, the simple elegance of maximum likelihood methods for estimation and inference does not easily transfer to complex sample survey data. This chapter will introduce the methods and software that survey statisticians have developed to adjust standard analyses for the complex sample design effects, including weighted estimates of proportions, design-based estimates of sampling variance, and generalized design effect corrections for key test statistics. Later, in Chapters 8 and 9, different forms of the generalized linear model and pseudo-maximum likelihood techniques will be discussed for regression modeling of categorical data that follow an approximate binomial, multinomial, Poisson, or negative binomial distribution.

6.2.2 Proportions and Percentages

In Section 6.3 we discuss estimation of proportions and their standard errors. In this chapter, we denote estimates of population proportions by p and the population proportions themselves by π. Many software packages choose to output estimates of percentages and standard errors on the percentage scale. Translation between estimates and standard errors for proportions and percentages simply involves the following multiplicative scaling:


percent = 100 · p,  se(percent) = 100 · se(p),  var(percent) = 100² · var(p)

While these relationships may be obvious to experienced analysts, our experience suggests that it is easy for mistakes to be made, especially in translating standard errors from the percentage to the proportion scale.
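Written out as a short check of the kind of translation the text warns about (the estimate and SE here are hypothetical):

```python
# Proportion-scale estimate and SE (hypothetical values)
p, se_p = 0.237, 0.014

percent = 100 * p                      # estimate scales by 100
se_percent = 100 * se_p                # SE also scales by 100
var_percent = 100 ** 2 * se_p ** 2     # but the variance scales by 100^2
```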

6.2.3 Cross-Tabulations, Contingency Tables, and Weighted Frequencies

Categorical data analysis becomes more complex (and also more interesting) when the analysis is expanded from a single variable to include two or more categorical variables. With two categorical variables, the preferred summary is typically a two-way data display with r = 1, …, R rows and c = 1, …, C columns, often referred to in statistical texts and literature as a cross-tabulation or a contingency table. Cross-tabs and contingency tables are not limited to two dimensions but may include a third (or higher) dimension, that is, l = 1, …, L layers or subtables based on categories of a third variable. For simplicity of presentation, the majority of the examples in this chapter will be based on the cross-tabulation of two categorical variables (see Section 6.4).

Consider the simple R = 2 by C = 2 (2 × 2) tabular display of observed sample frequencies shown in Figure 6.1.

Under simple random sampling (SRS), one-, two-, three-, or higher-dimension arrays of unweighted sample frequencies like the 2 × 2 array illustrated in Figure 6.1 can be used directly to estimate statistics of interest, such as the row proportion in category 1, p(1|1) = n11/n1+, or to derive tests of hypotheses for the relationships between categorical variables, such as the Pearson chi-square (χ²) test. However, because individuals can be selected into a survey sample with varying probabilities, estimates and test statistics computed from the unweighted sample frequencies may be biased for the true properties of the survey population. Therefore, it is necessary to translate from unweighted sample counts to weighted frequencies as shown in Figure 6.2,

                 Variable 2
Variable 1       0        1        Row Margin
0                n00      n01      n0+
1                n10      n11      n1+
Column Margin    n+0      n+1      n++

Figure 6.1 Bivariate distribution of observed sample frequencies.


                 Variable 2
Variable 1       0        1        Row Margin
0                N̂00      N̂01      N̂0+
1                N̂10      N̂11      N̂1+
Column Margin    N̂+0      N̂+1      N̂++

Figure 6.2 Bivariate distribution of weighted sample frequencies.

where, for example, the weighted frequency (or estimated population count) in cell (0,1) is

N̂01 = Σ(h=1…H) Σ(α=1…a_h) Σ(i∈(0,1)) w_hαi

The weighted proportions estimated from these weighted sample frequencies, such as p_rc = N̂_rc / N̂_++, will reflect the relative size of the total population in the corresponding cell, row margin, or column margin of the cross-tabulation.

We discuss bivariate analyses in more detail in Section 6.4.
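The translation from the unweighted counts of Figure 6.1 to the weighted frequencies of Figure 6.2 is just accumulation of weights into cells. A short sketch with hypothetical records of the form (row category, column category, weight):

```python
from collections import defaultdict

# Hypothetical survey records: (row category, column category, weight)
records = [(0, 1, 1.5), (0, 0, 2.0), (1, 1, 1.0),
           (1, 0, 0.5), (0, 1, 1.0), (1, 1, 2.0)]

Nhat = defaultdict(float)
for r, c, w in records:
    Nhat[(r, c)] += w                  # weighted cell frequency N̂_rc
Ntot = sum(Nhat.values())              # N̂_++

p_01 = Nhat[(0, 1)] / Ntot             # weighted total proportion for cell (0,1)
row0 = Nhat[(0, 0)] + Nhat[(0, 1)]     # row margin N̂_0+
p_1_given_0 = Nhat[(0, 1)] / row0      # weighted row proportion
```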

6.3 Univariate Analysis of Categorical Data

Simple descriptive analyses of a single categorical variable in a survey data set can take a number of forms, including estimation of a simple population proportion π for binary responses; estimation of the population proportion π_k for each of the k = 1, …, K categories of a multicategory nominal or ordinal response variable; and, with care, estimation of the means for ordinally scaled responses (see Section 5.3.3). To draw inferences about the population parameters being estimated, analysts can construct 100(1 – α)% confidence intervals (CIs) for the parameters or perform Student t or simple χ² hypothesis tests. Properly weighted graphical displays of the frequency distribution of the categorical variable are also very effective tools for presentation of results.

6.3.1 Estimation of Proportions for Binary Variables

Estimation of a single population proportion, π, for a binary response variable requires only a straightforward extension of the ratio estimator (Section 5.3.3) for the population mean of a continuous random variable. By recoding the original response categories to a single indicator variable y_i with possible values 1 and 0 (e.g., yes = 1, no = 0), the ratio mean estimator estimates the proportion or prevalence, π, of "1s" in the population:

p = [Σ(h=1…H) Σ(α=1…a_h) Σ(i=1…n_hα) w_hαi · I(y_i = 1)] / [Σ(h=1…H) Σ(α=1…a_h) Σ(i=1…n_hα) w_hαi] = N̂_1 / N̂   (6.1)

Most software procedures for survey data analysis provide the user with at least two alternatives for calculating ratio estimates of proportions. The first alternative is to use the single-variable option in procedures such as Stata svy: prop and svy: tab or SAS PROC SURVEYFREQ. These programs are designed to estimate the univariate proportions for the discrete categories of a single variable as well as total, row, and column proportions for cross-tabulations of two or more categorical variables. The second alternative for estimating single proportions is to first generate an indicator variable for the category of interest (with value 1 if a case falls into the category of interest and 0 if it does not) and then to apply a procedure designed for estimation of means (e.g., Stata svy: mean or SAS PROC SURVEYMEANS). Both approaches will yield identical estimates of the population proportion and its standard error, but, as explained next, the confidence intervals developed by the different procedures may not agree exactly due to differences in the methods used to derive the lower and upper confidence limits.
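The equivalence of the two alternatives is easy to see in code. In the hypothetical sketch below, the ratio estimator of Equation 6.1 and the weighted mean of the recoded indicator produce the same value:

```python
# Hypothetical (response, weight) pairs standing in for survey records
data = [("yes", 1.2), ("no", 0.8), ("yes", 1.0), ("no", 1.5), ("yes", 0.5)]

# Alternative 1: proportion as a ratio of estimated totals, N̂_1 / N̂
n1_hat = sum(w for resp, w in data if resp == "yes")
n_hat = sum(w for _, w in data)
p_prop = n1_hat / n_hat

# Alternative 2: weighted mean of a recoded 0/1 indicator variable
y = [1 if resp == "yes" else 0 for resp, _ in data]
w = [wt for _, wt in data]
p_mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
```

The two procedures differ only in how the confidence interval is subsequently constructed, as discussed next.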

An application of Taylor series linearization (TSL) to the ratio estimator of π in Equation 6.1 results in the following TSL variance estimator for the ratio estimate of a simple proportion:

v(p) ≈ [V(N̂_1) + p² · V(N̂) − 2 · p · Cov(N̂_1, N̂)] / N̂²   (6.2)

Replication techniques (jackknife repeated replication [JRR], balanced repeated replication [BRR]) can also be used to compute estimates of the variances of these estimated proportions and the corresponding design-based confidence intervals and test statistics.

If the analyst estimates a proportion as the mean of an indicator variable (e.g., using Stata's svy: mean procedure), a standard design-based confidence interval is constructed for the proportion: CI(p) = p ± t(1−α/2,df) · se(p). One complication that arises when proportions are viewed simply as the mean of a binary variable is that the true proportion, π, is constrained to lie in the interval (0,1). When the estimated proportion of interest is extreme (i.e., close to 0 or close to 1), the standard design-based confidence limits may be less than 0 or exceed 1. To address this problem, alternative computations of design-based confidence intervals for proportions have been proposed, including the logit transformation procedure and the modified Wilson procedure (Hsu and Rust, 2007).

Stata's svy: tab command uses the logit transformation procedure by default. Implementation of this procedure for constructing a confidence interval is a two-step process. First, using the weighted estimate of the proportion, one constructs a 95% CI for the logit transform of p:

CI[logit(p)] = {A, B} = { ln[p/(1 − p)] − t(1−α/2,df) · se(p) / [p · (1 − p)],  ln[p/(1 − p)] + t(1−α/2,df) · se(p) / [p · (1 − p)] }   (6.3)

where p is the weighted estimate of the proportion of interest and se(p) is the design-based Taylor series approximation to the standard error of the estimated proportion.

Next, the two confidence limits on the logit scale, A and B, are transformed back to the original (0,1) scale:

CI(p) = ( e^A / (1 + e^A),  e^B / (1 + e^B) )   (6.4)

Although procedures such as Stata svy: tab and SUDAAN PROC CROSSTAB default to the logit transformation formula for estimating CI(p) for all values of p, the adjusted CIs generally do not differ from the standard symmetric CIs unless p < 0.10 or p > 0.90. Interested readers can find a description of the modified Wilson procedure in Theory Box 6.1.
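Equations 6.3 and 6.4 are straightforward to implement. The sketch below is a hedged reimplementation (not Stata's actual code) and shows the limits staying inside (0, 1) for an extreme proportion whose symmetric interval would cross zero:

```python
import math

def logit_ci(p, se_p, t):
    """Logit-transform CI of Equations (6.3)-(6.4): build the interval on the
    logit scale, then map both limits back with the inverse-logit transform."""
    half = t * se_p / (p * (1 - p))          # SE on the logit scale (delta method)
    center = math.log(p / (1 - p))
    A, B = center - half, center + half
    expit = lambda x: math.exp(x) / (1 + math.exp(x))
    return expit(A), expit(B)                # limits stay inside (0, 1)

p, se_p, t = 0.002, 0.002, 1.96              # hypothetical extreme proportion
lo, hi = logit_ci(p, se_p, t)
sym_lo = p - t * se_p                        # symmetric lower limit falls below 0
```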

Example 6.1: Estimating the Proportion of the U.S. Adult Population with an Irregular Heart Beat

In this example, data from the 2005–2006 NHANES are used to estimate the proportion of U.S. adults with an irregular heartbeat (defined by BPXPULS = 2 in the NHANES data set). To enable a direct comparison of the results from analyses performed using Stata svy: tab, svy: prop, and svy: mean, we generate a binary indicator, equal to 1 for respondents reporting an irregular heartbeat and 0 otherwise. The recoding of the original response categories for BPXPULS would not be necessary for estimation performed using either svy: tab or svy:

