6 Biplot Display of PCA, EFA, and DCA Results
Unsupervised Learning Methods ◾ 83
◾◾ New: Factor pattern plots are generated before and after factor rotation using the new SAS 9.2 statistical graphics feature.
◾◾ New: Plots for assessing the significance and the nature of factor loadings are generated using the new SAS 9.2 statistical graphics feature.
◾◾ New: Confidence interval estimates for factor loadings are produced when ML factor analysis is used.
◾◾ Biplot displays showing the interrelationship between the principal component or factor scores and the correlations among the multiattributes are produced for all combinations of selected principal components or factors.
◾◾ Options for saving the output tables and graphics in WORD, HTML, PDF, and TXT formats are available.
Software requirements for using the FACTOR2 macro are the following:
◾◾ SAS/BASE, SAS/STAT, SAS/GRAPH, and SAS/IML must be licensed and
installed at your site.
◾◾ SAS version 9.1.3 or above is required for full utilization.
4.7.1 Steps Involved in Running the FACTOR2 Macro
1. Create or open a temporary SAS dataset from n × p coordinate data containing p correlated continuous variables and n observations. If a coordinate (n × p) dataset is not available and only a correlation matrix is available, then create a special correlation SAS dataset (see Figure 4.15).
2. Open the FACTOR2.SAS macro-call file in the SAS EDITOR window. Instructions for downloading the macro-call and sample data files from this book’s Web site are given in Appendix 1. Click the RUN icon to submit the macro-call file FACTOR2.SAS and open the MACRO–CALL window called FACTOR2 (Figure 4.1).
Special note to SAS Enterprise Guide (EG) CODE window users: Because the user-friendly SAS macro applications included in this book use SAS WINDOW/DISPLAY commands, which are not compatible with SAS EG, open the traditional FACTOR2 macro-call file included in \dmsas2e\maccal\nodisplay\ in the SAS editor. Read the instructions given in Appendix 3 regarding using the traditional macro-call files in the SAS EG/SAS Learning Edition (LE) code window.
3. Input the appropriate parameters in the macro-call window by following the instructions provided in the FACTOR2 macro help file in Appendix 2. Users can choose either the scatter plot analysis option or the PCA/EFA analysis option. Options for checking multivariate normality assumptions and detecting outliers are also available. After inputting all the required macro parameters, check that the cursor is in the last input field, and then hit the ENTER key (not the RUN icon) to submit the macro.
© 2010 by Taylor and Francis Group, LLC
K10535_Book.indb 83
5/18/10 3:36:56 PM
84 ◾ Statistical Data Mining Using SAS Applications
Figure 4.1 Screen copy of FACTOR2 macro-call window showing the macro-call
parameters required for performing PCA.
4. Examine the LOG window (only in DISPLAY mode) for any macro execution errors. If you see any errors in the LOG window, activate the EDITOR window, resubmit the FACTOR2.SAS macro-call file, check your macro input values, and correct any input errors.
5. Save the output files. If no errors are found in the LOG window, activate the EDITOR window, resubmit the FACTOR2.SAS macro-call file, and change the macro input value from DISPLAY to any other desirable format. The PCA or EFA SAS output files and exploratory graphs can then be saved in a user-specified format in a user-specified folder.
4.7.2 Case Study 1: Principal Component Analysis of 1993 Car Attribute Data
4.7.2.1 Study Objectives
1. Variable reduction: Reduce the six highly correlated attributes of the coordinate data to fewer dimensions (2 or 3) without losing much of the variation in the dataset.
2. Scoring observations: Group or rank the observations in the dataset based on composite scores generated by an optimally weighted linear combination of the original variables.
3. Interrelationships: Investigate the interrelationships between the observations and the multiattributes, and group similar observations and similar variables.
4.7.2.2 Data Descriptions
Data name: SAS dataset CARS93
Multiattributes:
  Y2: Midprice
  Y4: City gas mileage/gallon
  X4: HP
  X8: Passenger capacity
  X11: Width of the vehicle
  X15: Physical weight
Number of observations: 92
Car93 data source (reference 18): Lock, R. H. (1993)
Open the FACTOR2.SAS macro-call file in the SAS EDITOR window, and click
RUN to open the FACTOR2 macro-call window (Figure 4.1). Input the appropriate
macro-input values by following the suggestions given in the help file (Appendix 2).
Exploratory analysis: Input Y2, Y4, X4, X8, X11, and X15 as the multiattributes in (#3). Input YES in the macro-call field #2 to perform data exploration and
create a scatter plot matrix. (PCA will not be performed when you choose to run
data exploration.) Submit the FACTOR2 macro, and SAS will output descriptive
statistics, correlation matrices, and scatter plot matrices. Only selected output and
graphics generated by the FACTOR2 macro are interpreted in the following text.
The descriptive simple statistics of all multiattributes generated by the SAS
PROC CORR are presented in Table 4.2. The number of observations (N) per
variable is useful in checking for missing values for any given attribute and providing information on the size of the n × p coordinate data. The estimates of central
tendency (mean) and the variability (standard deviation) provide information on
the nature of multiattributes that can be used to decide whether to use standardized or unstandardized data in the PCA analysis. The minimum and the maximum
values describe the range of variation in each attribute and help to check for any
extreme outliers.
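As a rough illustration of how the means and standard deviations can inform the choice between standardized and unstandardized data, the following Python sketch compares the spread of the six attributes using the values reported in Table 4.2. The function names and the ratio threshold of 10 are assumptions made for this example; they are not part of the FACTOR2 macro.

```python
# (mean, standard deviation) pairs as reported in Table 4.2.
SIMPLE_STATS = {
    "Y2": (19.50, 9.65),
    "Y4": (22.36, 5.61),
    "X4": (143.82, 52.37),
    "X8": (5.08, 1.038),
    "X11": (69.37, 3.77),
    "X15": (3073.0, 589.89),
}


def sd_ratio(stats):
    """Return the ratio of the largest to the smallest standard deviation."""
    sds = [sd for _mean, sd in stats.values()]
    return max(sds) / min(sds)


def suggest_standardizing(stats, threshold=10.0):
    """Hypothetical rule of thumb: standardize when the SDs differ widely."""
    return sd_ratio(stats) > threshold


if __name__ == "__main__":
    print(f"Largest/smallest SD ratio: {sd_ratio(SIMPLE_STATS):.1f}")
    print("Standardize before PCA:", suggest_standardizing(SIMPLE_STATS))
```

Because weight (pounds) varies on a scale hundreds of times larger than passenger capacity (persons), an unstandardized PCA would be dominated by weight, which is why the analysis in this case study works with standardized variables.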
Table 4.2 PROC CORR Output and Simple Statistics: FACTOR2 Macro
| Variable | N | Mean | Standard Deviation | Sum | Minimum | Maximum | Label |
|---|---|---|---|---|---|---|---|
| Y2 | 93 | 19.50 | 9.65 | 1814 | 7.40 | 61.90 | Midrange price (in $1000) |
| Y4 | 93 | 22.36 | 5.61 | 2080 | 15.00 | 46.00 | City MPG (miles per gallon by EPA rating) |
| X4 | 93 | 143.82 | 52.37 | 13376 | 55.00 | 300.00 | HP (maximum) |
| X8 | 93 | 5.08 | 1.038 | 473.00 | 2.00 | 8.00 | Passenger capacity (persons) |
| X11 | 93 | 69.37 | 3.77 | 6452 | 60.00 | 78.00 | Car width (inches) |
| X15 | 93 | 3073 | 589.89 | 285780 | 1695 | 4105 | Weight (pounds) |
The degree of linear association among the variables, measured by the Pearson correlation coefficient (r), and its statistical significance are presented in Table 4.3. The absolute value of r ranged from about 0.01 to 0.87. The statistical significance of r varied from no correlation (p-value: 0.9298) to a highly significant correlation (p-value < 0.0001). Among the 15 possible pairs of correlations, 13 pairs were highly significant, indicating that these data are suitable for performing PCA. The
scatter plot matrix among the six attributes presented in Figure 4.2 reveals the
strength of correlation, presence of any outliers, and the nature of bidirectional
variation. In addition, each scatter plot shows the linear regression line, 95% mean
confidence interval band, and a horizontal line (Y-bar line), which passes through the
mean of the Y-variable. If this Y-bar line intersects the confidence band lines, that
is, the confidence band region does not enclose the Y-bar line, then the correlation
between the X and Y variable is statistically significant. For example, among the 15
scatter plots present in Figure 4.2, only in two scatter plots (Y2 versus X8; X4 versus
X8) did the Y-bar lines not intersect the confidence band. Only these two correlations are statistically not significant (Table 4.3).
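The idea that a correlation is significant when its confidence region excludes "no association" can be sketched numerically. The Python example below uses Fisher's z approximation to build a 95% confidence interval for r with N = 93 (Table 4.2). This is a swapped-in technique chosen because it needs only the standard library; it is not the regression confidence band drawn by the macro, nor the test that produced the p-values in Table 4.3, so its results agree only approximately. The function names are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = atanh(r)                      # variance-stabilizing transform
    se = 1.0 / sqrt(n - 3)            # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return tanh(z - crit * se), tanh(z + crit * se)


def is_significant(r, n, level=0.95):
    """True when the interval for r excludes zero."""
    low, high = fisher_ci(r, n, level)
    return low > 0.0 or high < 0.0


if __name__ == "__main__":
    n = 93  # N reported in Table 4.2
    pairs = [("Y2-X8", 0.05), ("X4-X8", 0.009),
             ("Y2-X11", 0.45), ("Y4-X15", -0.84)]
    for pair, r in pairs:
        low, high = fisher_ci(r, n)
        print(f"{pair}: r = {r:+.3f}, 95% CI = ({low:+.3f}, {high:+.3f}), "
              f"significant = {is_significant(r, n)}")
```

Under this approximation the two pairs involving X8 that the text flags (Y2 versus X8 and X4 versus X8) have intervals that contain zero, while the other pairs shown do not.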
To perform PCA, input Y2, Y4, X4, X8, X11, and X15 as the multiattributes in (#3). Leave the macro-call field #2 blank to perform PCA. (PCA will not be performed when you input YES to data exploration.) Input the appropriate macro-input values by following the suggestions given in the help file (Appendix 2).
Table 4.3 Pearson Correlation Coefficients and Their Statistical Significance Levels (p-values): PROC CORR Output from FACTOR2 Macro
| Variable | Label | Y2 | Y4 | X4 | X8 | X11 | X15 |
|---|---|---|---|---|---|---|---|
| Y2 | Midrange price (in $1000) | 1 | −0.59 (<0.0001) | 0.78 (<0.0001) | 0.05 (0.5817) | 0.45 (<0.0001) | 0.64 (<0.0001) |
| Y4 | City MPG (miles per gallon by EPA rating) | −0.59 (<0.0001) | 1 | −0.67 (<0.0001) | −0.41 (<0.0001) | −0.72 (<0.0001) | −0.84 (<0.0001) |
| X4 | HP (maximum) | 0.78 (<0.0001) | −0.67 (<0.0001) | 1 | 0.009 (0.9298) | 0.64 (<0.0001) | 0.73 (<0.0001) |
| X8 | Passenger capacity (persons) | 0.05 (0.5817) | −0.41 (<0.0001) | 0.009 (0.9298) | 1 | 0.48 (<0.0001) | 0.55 (<0.0001) |
| X11 | Car width (inches) | 0.45 (<0.0001) | −0.72 (<0.0001) | 0.64 (<0.0001) | 0.48 (<0.0001) | 1 | 0.87 (<0.0001) |
| X15 | Weight (pounds) | 0.64 (<0.0001) | −0.84 (<0.0001) | 0.73 (<0.0001) | 0.55 (<0.0001) | 0.87 (<0.0001) | 1 |
Note: Each cell shows the correlation coefficient, with its statistical significance (p-value) in parentheses.
In PCA, the number of standardized multiattributes determines the number of eigenvalues. An eigenvalue greater than 1 indicates that the PC accounts for more of the variance than one of the original variables in standardized data. This can be confirmed by visually examining the improved scree plot (Figure 4.3) of eigenvalues and the parallel analysis of eigenvalues. This enhanced scree plot shows the rate of change in the magnitude of the eigenvalues for an increasing number of PCs. The point at which the rate of decline levels off in the scree plot indicates the optimum number of PCs to extract. Also, the intersection point between the scree plot and the parallel analysis plot reveals that the first two eigenvalues, which account for 86.3% of the total variation, could be retained as the significant PCs (Figure 4.3).
If the data is standardized, that is, normalized to zero mean and 1 standard
deviation, the sum of the eigenvalues is equal to the number of variables used.
The magnitude of the eigenvalue is usually expressed as a percentage of the total
variance. The information in Table 4.4 indicates that the first eigenvalue accounts
for about 66% of the variation, the second for 20%, and the proportions drop off
gradually for the rest of the eigenvalues. Cumulatively, the first two eigenvalues
together account for 86% of the variation in the dataset. A two-dimensional view
(of the six-dimensional dataset) can be created by projecting all data points onto the
[Figure 4.2 image: 15 pairwise scatter plots among Y2, Y4, X4, X8, X11, and X15, each annotated with its correlation coefficient, for example X11 versus X15: r = 0.87; Y4 versus X15: r = −0.84; Y2 versus X4: r = 0.79; X4 versus X8: r = 0.01; Y2 versus X8: r = 0.06.]
Figure 4.2 Scatter plot matrix illustrating the degree of linear correlation among the six attributes derived using the SAS macro FACTOR2.
[Figure 4.3 image: three panels: "Scree Plot" (eigenvalue versus factor number, 1 to 6), "Variance Explained" (proportion and cumulative proportion versus factor number), and "Scree Plot and Parallel Analysis" (eigenvalue versus number of PC, with scree plot points marked e and parallel analysis points marked p).]
Figure 4.3 PCA scree plot (New: ODS Graphics feature) illustrating the relationship between number of PCs and the rate of decline of eigenvalue and the parallel
analysis plot derived using the SAS macro FACTOR2.
Table 4.4 Eigenvalues in Principal Component Analysis: PROC FACTOR Output from FACTOR2 Macro
| PC | Eigenvalue (a) | Difference | Proportion | Cumulative |
|---|---|---|---|---|
| 1 | 3.97215807 | 2.76389648 | 0.6620 | 0.6620 |
| 2 | 1.20826159 | 0.83065407 | 0.2014 | 0.8634 |
| 3 | 0.37760751 | 0.11365635 | 0.0629 | 0.9263 |
| 4 | 0.26395117 | 0.14171050 | 0.0440 | 0.9703 |
| 5 | 0.12224066 | 0.06645966 | 0.0204 | 0.9907 |
| 6 | 0.05578100 | — | 0.0093 | 1.0000 |
(a) Eigenvalues of the correlation matrix: Total = 6, average = 1.
plane defined by the axes of the first two PC. This two-dimensional view will retain
86% of the information from the six-dimensional plot.
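The Proportion and Cumulative columns of Table 4.4 follow directly from the eigenvalues. The short Python sketch below recomputes them and counts the eigenvalues greater than 1; the function names are hypothetical and the eigenvalues are copied from Table 4.4.

```python
# Eigenvalues of the correlation matrix, copied from Table 4.4.
EIGENVALUES = [3.97215807, 1.20826159, 0.37760751,
               0.26395117, 0.12224066, 0.05578100]


def variance_explained(eigenvalues):
    """Return (proportions, cumulative proportions) of total variance."""
    total = sum(eigenvalues)  # equals the number of variables when standardized
    proportions = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


def count_above_one(eigenvalues):
    """Number of PCs whose eigenvalue exceeds 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)


if __name__ == "__main__":
    props, cums = variance_explained(EIGENVALUES)
    for i, (ev, p, c) in enumerate(zip(EIGENVALUES, props, cums), start=1):
        print(f"PC{i}: eigenvalue = {ev:.4f}, proportion = {p:.4f}, "
              f"cumulative = {c:.4f}")
    print("Eigenvalues greater than 1:", count_above_one(EIGENVALUES))
```

The eigenvalues sum to 6, the number of standardized variables, and the first two components together reproduce the cumulative proportion of 0.8634 reported in the table.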
The new variables PC1 and PC2 are linear combinations of the six standardized variables, and the magnitude of the eigenvalues accounts for the variation in the new PC scores. The eigenvectors presented in Table 4.5 provide the weights for transforming the six standardized variables into PCs. For example, PC1 is derived by performing the following linear transformation using these eigenvectors:
PC1 = 0.37781 Y2 − 0.44702 Y4 + 0.41786 X4 + 0.23403 X8 + 0.43847 X11 + 0.48559 X15
The sum of the squared eigenvector elements for a given PC equals one.
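A minimal Python sketch of this transformation is given below. It standardizes a car's attributes with the means and standard deviations from Table 4.2 and applies the PC1 eigenvector from Table 4.5 as weights. The two example cars are hypothetical values chosen within the observed ranges, the function names are assumptions, and the scores reported by the macro may be rescaled relative to this raw linear combination.

```python
# Means and standard deviations from Table 4.2.
MEANS = {"Y2": 19.50, "Y4": 22.36, "X4": 143.82,
         "X8": 5.08, "X11": 69.37, "X15": 3073.0}
SDS = {"Y2": 9.65, "Y4": 5.61, "X4": 52.37,
       "X8": 1.038, "X11": 3.77, "X15": 589.89}

# First eigenvector (PC1 weights) from Table 4.5.
PC1_WEIGHTS = {"Y2": 0.37781, "Y4": -0.44702, "X4": 0.41786,
               "X8": 0.23403, "X11": 0.43847, "X15": 0.48559}


def standardize(car):
    """Convert raw attribute values to z-scores."""
    return {name: (car[name] - MEANS[name]) / SDS[name] for name in PC1_WEIGHTS}


def pc1_score(car):
    """Weighted sum of the standardized attributes using the PC1 weights."""
    z = standardize(car)
    return sum(PC1_WEIGHTS[name] * z[name] for name in PC1_WEIGHTS)


def squared_length(weights):
    """Sum of squared eigenvector elements (should be close to 1)."""
    return sum(w * w for w in weights.values())


# Hypothetical cars, for illustration only.
LARGE_CAR = {"Y2": 40.0, "Y4": 17.0, "X4": 250.0,
             "X8": 6.0, "X11": 77.0, "X15": 4000.0}
SMALL_CAR = {"Y2": 8.0, "Y4": 45.0, "X4": 60.0,
             "X8": 4.0, "X11": 62.0, "X15": 1700.0}

if __name__ == "__main__":
    print(f"Squared length of PC1 eigenvector: {squared_length(PC1_WEIGHTS):.4f}")
    print(f"PC1 score, large car: {pc1_score(LARGE_CAR):.2f}")
    print(f"PC1 score, small car: {pc1_score(SMALL_CAR):.2f}")
```

A car whose attributes all equal the sample means scores exactly zero; the heavy, powerful, expensive car scores positive and the small, fuel-efficient car scores negative.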
PC loadings presented in Table 4.6 are the correlation coefficients between the first two PC scores and the original variables. They measure the importance of each variable
Table 4.5 Eigenvectors in PCA Analysis: PROC FACTOR Output from FACTOR2 Macro
| Variable | Label | Eigenvector 1 | Eigenvector 2 |
|---|---|---|---|
| Y2 | Midrange price (in $1000) | 0.37781 | −0.44215 |
| Y4 | City MPG (miles per gallon by EPA rating) | −0.44702 | −0.05055 |
| X4 | HP (maximum) | 0.41786 | −0.42666 |
| X8 | Passenger capacity (persons) | 0.23403 | 0.75256 |
| X11 | Car width (inches) | 0.43847 | 0.19308 |
| X15 | Weight (pounds) | 0.48559 | 0.12758 |
Table 4.6 Principal Component (PC) Loadings for the First Two PC: PROC FACTOR Output from FACTOR2 Macro
| Variable | Label | FACTOR (PC) 1 | FACTOR (PC) 2 |
|---|---|---|---|
| Y2 | Midrange price (in $1000) | 0.75298 | −0.48602 |
| Y4 | City MPG (miles per gallon by EPA rating) | −0.89092 | −0.05557 |
| X4 | HP (maximum) | 0.83281 | −0.46899 |
| X8 | Passenger capacity (persons) | 0.46643 | 0.82722 |
| X11 | Car width (inches) | 0.87388 | 0.21224 |
| X15 | Weight (pounds) | 0.96780 | 0.14024 |
in accounting for the variability in the PC. That is, the larger the loadings in absolute terms, the more influential the variables are in forming the new PC, and vice versa. The high correlations between PC1 and midrange price, city MPG, HP, car width, and weight indicate that these variables are associated with the direction of the maximum amount of variation in this dataset. The first PC loading pattern suggests that heavy, big, very powerful, and highly priced cars are less fuel efficient. The strong correlation between passenger capacity and PC2 indicates that PC2 is mainly attributed to the passenger capacity of the vehicle, which is responsible for the next largest variation in the data, perpendicular to PC1. A visual display of the degree and the direction of PC loadings is presented in Figure 4.4. The regression plot between PC scores and the original variables derived using the SAS macro FACTOR2 displays the statistical significance of the linear association between the original variable and the derived PC scores (Figure 4.5).
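For a PCA of the correlation matrix, each loading equals the corresponding eigenvector element multiplied by the square root of that PC's eigenvalue. The Python sketch below applies this relationship to the values in Tables 4.4 and 4.5 and compares the result with Table 4.6; the function name is hypothetical, and small differences in the last decimal place come from rounding in the printed tables.

```python
from math import sqrt

# First two eigenvalues from Table 4.4.
EIGENVALUES = (3.97215807, 1.20826159)

# Eigenvector elements (PC1, PC2) from Table 4.5.
EIGENVECTORS = {
    "Y2": (0.37781, -0.44215),
    "Y4": (-0.44702, -0.05055),
    "X4": (0.41786, -0.42666),
    "X8": (0.23403, 0.75256),
    "X11": (0.43847, 0.19308),
    "X15": (0.48559, 0.12758),
}

# Loadings (PC1, PC2) as reported in Table 4.6.
TABLE_4_6 = {
    "Y2": (0.75298, -0.48602),
    "Y4": (-0.89092, -0.05557),
    "X4": (0.83281, -0.46899),
    "X8": (0.46643, 0.82722),
    "X11": (0.87388, 0.21224),
    "X15": (0.96780, 0.14024),
}


def loadings(eigenvectors, eigenvalues):
    """Scale each eigenvector element by the square root of its eigenvalue."""
    return {
        name: tuple(w * sqrt(ev) for w, ev in zip(weights, eigenvalues))
        for name, weights in eigenvectors.items()
    }


if __name__ == "__main__":
    computed = loadings(EIGENVECTORS, EIGENVALUES)
    for name, (l1, l2) in computed.items():
        r1, r2 = TABLE_4_6[name]
        print(f"{name}: computed = ({l1:+.5f}, {l2:+.5f}), "
              f"Table 4.6 = ({r1:+.5f}, {r2:+.5f})")
```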
Table 4.7 presents a partial list of the first two PC scores, computed as linear combinations of the standardized variables using the eigenvectors as the weights. The cars with the lowest (negative) PC1 scores are less expensive, small, and less powerful, but they are highly fuel efficient. Similarly, expensive, large, and powerful cars with low fuel efficiency are listed at the end of the table with large positive PC1 scores.
A biplot display of both PC (PC1 and PC2) scores and PC loadings (Figure 4.6) is very effective in studying the relationships among observations, the relationships among variables, and the interrelationship between observations and variables. The X and Y axes of the PCA biplot represent the standardized PC1 and PC2 scores, respectively. To display the relationships among the variables, the PC loading values for each PC are overlaid on the same plot after being multiplied by the maximum value of the corresponding PC scores. For example, PC1 loading values are multiplied by the maximum value of the PC1 scores, and the PC2 loadings are multiplied by the maximum value of the PC2 scores. This transformation places both the variables and the observations on the same scale in the biplot display, since the range of PC loadings is usually narrower than that of the PC scores.
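The rescaling described above can be sketched in a few lines of Python. The loadings are taken from Table 4.6, but the PC scores are hypothetical values invented for this example (the actual scores are in Table 4.7), and the function name is an assumption.

```python
# PC loadings (PC1, PC2) from Table 4.6.
LOADINGS = {
    "Y2": (0.75298, -0.48602),
    "Y4": (-0.89092, -0.05557),
    "X4": (0.83281, -0.46899),
    "X8": (0.46643, 0.82722),
    "X11": (0.87388, 0.21224),
    "X15": (0.96780, 0.14024),
}

# Hypothetical standardized PC scores for five observations.
PC1_SCORES = [-2.1, -0.4, 0.3, 1.2, 2.5]
PC2_SCORES = [-1.5, 0.2, 0.9, -0.6, 3.0]


def scale_loadings(loadings, pc1_scores, pc2_scores):
    """Multiply each loading by the maximum score of its own PC."""
    max1 = max(pc1_scores)
    max2 = max(pc2_scores)
    return {name: (l1 * max1, l2 * max2) for name, (l1, l2) in loadings.items()}


if __name__ == "__main__":
    for name, (x, y) in scale_loadings(LOADINGS, PC1_SCORES, PC2_SCORES).items():
        # Each variable is drawn as a vector from the origin (0, 0) to (x, y).
        print(f"{name}: vector end point = ({x:+.3f}, {y:+.3f})")
```

After rescaling, a variable with a loading near 1 on PC1 reaches about as far along the X axis as the observation with the largest PC1 score, so the variable vectors and the observation points share one plotting scale.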
[Figure 4.4 image: top panel, "Factor (PC) Loadings" of X11, X15, X4, X8, Y2, and Y4 on Factor 1 and Factor 2, plotted on a −1.0 to 1.0 scale and flagged by magnitude (>0.7, 0.4 to 0.7, −0.4 to −0.7, < −0.7, Ns); bottom panel, "Regression Plots of Factor Scores and Attributes" for Factor 1 and Factor 2 against Y2, Y4, X4, X8, X11, and X15.]
Figure 4.4 Factor loadings plot (New: statistical graphics feature) and the regression plot between PC scores and the original variables derived using the SAS macro FACTOR2.
Only cars with large (>75th percentile) or small (<25th percentile) PC scores are identified by their ID numbers on the biplot, to avoid crowding the display with too many ID values. Cars with similar characteristics are displayed together in the biplot
observational space since they have similar PC1 and PC2 scores. For example, small
compact cars with relatively higher gas mileage such as “Geo Metro (ID 12)” and
[Figure 4.5 image: two panels plotting attribute values against Factor 1 (66.2%) and Factor 2 (20.1%) scores, each with an "Attribute mean" reference line; labeled attributes include Y2, Y4, X4, X8, X11, and X15.]
Figure 4.5 Assessing the significance and the nature of factor loadings (New: statistical graphics feature) derived using the SAS macro FACTOR2.
“Ford Fiesta (ID 7)” are grouped close together. Similarly, cars with different attributes are displayed far apart since their PC1, PC2, or both PC scores are different. For example, small compact cars with relatively higher gas mileage such as “Geo Metro (ID 12)” and large expensive cars with relatively lower gas mileage such as “Lincoln Town Car (ID 42)” are far apart.
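The rule of labeling only the observations outside the 25th to 75th percentile range can be sketched as follows. The scores are hypothetical, the rule is applied to a single PC's scores (whether the macro applies it to PC1, PC2, or both is not stated here), and the quartiles come from Python's `statistics.quantiles`, whose interpolation method may differ from the one SAS uses.

```python
from statistics import quantiles

# Hypothetical PC scores keyed by car ID, for illustration only.
SCORES_BY_ID = {1: -2.4, 2: -1.1, 3: -0.3, 4: 0.0,
                5: 0.4, 6: 0.9, 7: 1.6, 8: 2.8}


def ids_to_label(scores_by_id):
    """Return the IDs whose score is below Q1 or above Q3."""
    q1, _median, q3 = quantiles(list(scores_by_id.values()), n=4)
    return sorted(car_id for car_id, score in scores_by_id.items()
                  if score < q1 or score > q3)


if __name__ == "__main__":
    print("IDs shown on the biplot:", ids_to_label(SCORES_BY_ID))
```

Observations near the center of the plot stay unlabeled, which keeps the extreme and most informative cars readable.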
The correlations among the multivariate attributes used in the PCA are revealed by the angles between any two PC loading vectors. For each variable, a PC loading vector is created by connecting the X-Y origin (0,0) to the point given by the multiplied values of the PC1 and PC2 loadings in the biplot. The angles between any two variable vectors will be: