2.5 ROLLUP WITH SUMS, AVERAGES, AND COUNTS

/* At the end of each group: compute the average from the accumulated
   total and count, output the record, and reset the counter */
_AVG=_TOT/_NT;
output;
_NT=0;
end;
drop &Value;
run;

Furthermore, the code inside the %do loop must also be modified to transpose the values of the
average variable, _AVG. The code is then as follows:

Step 4
%do i=1 %to &N;
proc transpose data=_Temp1
               out=_R_&i
               prefix=%substr(&&Cat_&i, 1, &Nchars)_;
  by &IDVar &TypeVar;
  ID &TimeVar;
  var _AVG;
  where &TypeVar="&&Cat_&i";
run;
%end;

The complete modified code that rolls up the average value is included in the
macro ABRollup().

2.6 CALCULATION OF THE MODE
Another useful summary statistic is the mode, which is used in both the rollup
stage and exploratory data analysis (EDA). The mode is the most common
category of transaction. For nominal variables, the mode plays the role that the
average or the sum plays for continuous variables. For example, when customers
use different payment methods, it may be beneficial to identify the payment
method most frequently used by each customer.
The computation of the mode on the entity level of the mining view from a
transaction dataset is a demanding task, because we need to search for the frequencies
of the different categories for each unique value of the entity variable. The macro
VarMode(), whose parameters are listed in Table 2.5, is based on a classic SQL query
for finding the mode on the entity level from a transaction table. The variable being
searched is XVar, and the entity level is identified through the unique values of the
variable IDVar:
%macro VarMode(TransDS, XVar, IDVar, OutDS);
/* A classic implementation of the mode of transactional
   data using SQL */
proc sql noprint;
create table &OutDS as
SELECT &IDVar, MIN(&XVar) AS mode
FROM (
      SELECT &IDVar, &XVar
      FROM &TransDS p1
      GROUP BY &IDVar, &XVar
      HAVING COUNT(*) =
            (SELECT MAX(CNT)
             FROM (SELECT COUNT(*) AS CNT
                   FROM &TransDS p2
                   WHERE p2.&IDVar = p1.&IDVar
                   GROUP BY p2.&XVar
                  ) AS p3
            )
     ) AS p
GROUP BY p.&IDVar;
quit;
%mend;

Table 2.5 Parameters of VarMode() Macro

Header:      VarMode(TransDS, XVar, IDVar, OutDS)

Parameter    Description
TransDS      Input transaction dataset
XVar         Variable for which the mode is to be calculated
IDVar        ID variable
OutDS        The output dataset with the mode for unique IDs

The query works by computing the frequency of each XVar category for each entity,
identified as CNT, and taking the maximum of these counts. The query then creates a
new table containing IDVar and the XVar category whose frequency is equal to the
maximum count, that is, the mode.
The preceding compound SELECT statement is computationally demanding
because of the use of several layers of GROUP BY and HAVING clauses. Indexing
should always be considered when dealing with large datasets. Sometimes it is
even necessary to partition the transaction dataset into smaller datasets before
applying such a query to overcome memory limitations.
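As a brief illustration of these points, the following sketch creates an index on the ID variable before invoking the macro; the dataset name Trans and the variables PayMethod and CustID are hypothetical, and the call finds the most frequent payment method per customer:

/* Hypothetical example: index the transaction table on the ID
   variable, then compute the mode of the payment method per customer */
proc sql;
   create index CustID on work.Trans (CustID);
quit;

%VarMode(Trans, PayMethod, CustID, CustModes);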

2.7 DATA INTEGRATION
The data necessary to compile the mining view usually comes from many different
tables. The rollup and summarization operations described in the last two sections
can be performed on the data coming from each of these data sources independently.
Finally, we assemble all these segments into one mining view. The most commonly
used assembly operations are merging and concatenation. Merging
is used to collect data for the same key variable (e.g., customer ID) from different
sources. Concatenation is used to assemble different portions of the same data
fields for different segments of the key variable. It is most useful when preparing
the scoring view with a very large number of observations (many millions). In this
case, it is more efficient to partition the data into smaller segments, prepare each
segment, and finally concatenate them together.

2.7.1  Merging
SAS provides several options for merging and concatenating tables together using
DATA step commands. However, we could also use SQL queries, through PROC
SQL, to perform the same operations. In general, SAS DATA step options are more
efficient in merging datasets than PROC SQL is. However, DATA step merging may
require sorting of the datasets before merging them, which could be a slow
process for large datasets. On the other hand, the performance of SQL queries can
be enhanced significantly by creating indexes on the key variables used in
merging.
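As a hedged sketch of the PROC SQL route, the following query reproduces the result of Example 2.1 below as a full join after indexing the key variable; the datasets Left and Right come from that example, and the output name Both_sql is hypothetical:

/* Illustrative PROC SQL merge: index the key variable, then use a full
   join so that nonmatched observations from both sides are kept.
   COALESCE prefers the Right value of Status, mimicking the DATA step
   overwrite for these data. */
proc sql;
   create index ID on work.Left (ID);
   create index ID on work.Right (ID);
   create table Both_sql as
   select coalesce(l.ID, r.ID) as ID,
          l.Age,
          coalesce(r.Status, l.Status) as Status,
          r.Balance
   from Left as l
        full join Right as r
          on l.ID = r.ID;
quit;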
Because of the requirement that the mining view have a unique record
per category of key variable, most merging operations required to integrate
different pieces of the mining view are of the type called match-merge with
nonmatched observations. We demonstrate this type of merging with a simple
example.
Example 2.1
We start with two datasets, Left and Right, as shown in Table 2.6.
The two tables can be joined using the MERGE-BY commands within a DATA step
operation as follows:
DATA Left;
 INPUT ID Age Status $;
 datalines;
 1  30  Gold
 2  20  .
 4  40  Gold
 5  50  Silver
 ;
RUN;
DATA Right;
INPUT ID Balance Status $;
 datalines;
 2  3000  Gold
 4  4000  Silver
;
RUN;

Table 2.6 Two Sample Tables: Left and Right

Table: Left
ID   Age   Status
 1    30   Gold
 2    20   .
 4    40   Gold
 5    50   Silver

Table: Right
ID   Balance   Status
 2      3000   Gold
 4      4000   Silver

Table 2.7 Result of Merging: Dataset Both

Obs   ID   Age   Status   Balance
  1    1    30   Gold           .
  2    2    20   Gold        3000
  3    4    40   Silver      4000
  4    5    50   Silver         .

DATA Both;
 MERGE Left Right;
 BY ID;
RUN;
PROC PRINT DATA=Both;
RUN;

The result of the merging is the dataset Both, given in Table 2.7, which shows that the
MERGE-BY commands merged the two datasets as needed, using ID as the key variable.
Notice, however, that the common field Status was overwritten by values from the Right
dataset, so we have to be careful about this side effect. In most practical cases, common
fields should hold identical values; in our case, where the variable represents a customer
designation status (Gold or Silver), the customer should have the same status in the
different datasets. Checking these status values should therefore be one of the data
integrity tests performed before merging.
Merging datasets using this technique is very efficient. It can be used with more than
two datasets as long as all the datasets in the MERGE statement have the common variable
used in the BY statement. The only possible difficulty is that SAS requires that all the datasets be sorted by the BY variable. Sorting very large datasets can sometimes be slow.
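As a brief sketch of the required preparation, using the Left and Right datasets of Example 2.1, the sort can be performed with PROC SORT before the merge:

/* Sort both datasets by the key variable before the MERGE-BY step */
proc sort data=Left;  by ID; run;
proc sort data=Right; by ID; run;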

You have probably realized by now that writing a general macro to merge a
list of datasets using an ID variable is a simple task. Assuming that all the datasets
have been sorted using ID before attempting to merge them, the macro would
simply be given as follows:
%macro MergeDS(List, IDVar, ALL);
DATA &ALL;
  MERGE &List;
  BY &IDVar;
run;
%mend;

Finally, calling this macro to merge the two datasets in Table 2.6 would simply
be as follows:
%let List=Left Right;
%let IDVar=ID;
%let ALL = Both;
%MergeDS(&List, &IDVar, &ALL);

2.7.2  Concatenation
Concatenation is used to attach the contents of one dataset to the end of another
dataset without duplicating the common fields. Fields that exist in only one of the two
datasets are filled with missing values in the observations contributed by the other
dataset. Concatenating datasets in this fashion does not check the uniqueness of the
ID variable. However, if the data acquisition and rollup procedures were performed
correctly, such a problem should not exist.
Performing concatenation in SAS is straightforward. We list the datasets to be
concatenated in a SET statement within the destination dataset. This is illustrated
in the following example.

Example 2.2
Start with two datasets TOP and BOTTOM, as shown in Tables 2.8 and 2.9.
We then use the following code to implement the concatenation of the two datasets into
a new dataset:
DATA TOP;
 input ID Age Status $;
 datalines;
 1  30  Gold
 2  20  .
 3  30  Silver
 4  40  Gold
 5  50  Silver
 ;
run;

Table 2.8 Table: TOP

Obs   ID   Age   Status
  1    1    30   Gold
  2    2    20   .
  3    3    30   Silver
  4    4    40   Gold
  5    5    50   Silver

Table 2.9 Table: BOTTOM

Obs   ID   Balance   Status
  1    6      6000   Gold
  2    7      7000   Silver

DATA BOTTOM;
input ID Balance Status $;
 datalines;
 6  6000  Gold
 7  7000  Silver
 ;
run;
DATA BOTH;
 SET TOP BOTTOM;
run;

The resulting dataset is shown in Table 2.10.
As in the case of merging datasets, we may include a list of several datasets in the SET
statement to concatenate. The resulting dataset will contain all the records of the contributing datasets in the same order in which they appear in the SET statement.
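As a minimal sketch (the segment dataset names below are hypothetical), concatenating more than two datasets simply lists them all in the SET statement:

/* Hypothetical example: concatenate three scoring-view segments */
DATA Scoring_All;
 SET Segment1 Segment2 Segment3;
run;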

Table 2.10 Table: BOTH

Obs   ID   Age   Status   Balance
  1    1    30   Gold           .
  2    2    20   .              .
  3    3    30   Silver         .
  4    4    40   Gold           .
  5    5    50   Silver         .
  6    6     .   Gold        6000
  7    7     .   Silver      7000

The preceding process can be packed into the following macro:
%macro ConcatDS(List, ALL);
DATA &ALL;
 SET &List;
run;
%mend;

To use this macro to achieve the same result as in the previous example, we use
the following calling code:
%let List=TOP BOTTOM;
%let ALL = BOTH;
%ConcatDS(&List, &ALL);

CHAPTER 3
Data Preprocessing

Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent data because of their typically huge size (often several gigabytes or more)
and their likely origin from multiple, heterogeneous sources. Low-quality data will
lead to low-quality mining results.
How can the data be preprocessed in order to help improve the quality of the
data and, consequently, of the mining results? How can the data be preprocessed
so as to improve the efficiency and ease of the mining process?
There are a number of data preprocessing techniques. Data cleaning can be
applied to remove noise and correct inconsistencies in the data. Data integration
merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied. For example,
normalization may improve the accuracy and efficiency of mining algorithms
involving distance measurements. Data reduction can reduce the data size by
aggregating, eliminating redundant features, or clustering, for instance. These
techniques are not mutually exclusive; they may work together. For example, data
cleaning can involve transformations to correct wrong data, such as by transforming all entries for a date field to a common format. Data preprocessing techniques,
when applied before mining, can substantially improve the overall quality of the
patterns mined or the time required for the actual mining.
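As a small illustrative sketch of such a cleaning transformation (the dataset Raw and the variable date_str are hypothetical), a character date field recorded in mixed formats can be converted to a single SAS date variable:

/* Hypothetical example: standardize a character date field */
data Clean;
 set Raw;
 date = input(date_str, anydtdte32.); /* ANYDTDTE. accepts several common date representations */
 format date yymmdd10.;
 drop date_str;
run;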
In Section 3.1 of this chapter, we introduce the basic concepts of data preprocessing. Section 3.2 presents descriptive data summarization, which serves as a
foundation for data preprocessing. Descriptive data summarization helps us study
the general characteristics of the data and identify the presence of noise or outliers, which is useful for successful data cleaning and data integration. The methods
for data preprocessing are organized into the following categories: data cleaning
(Section 3.3), data integration and transformation (Section 3.4), and data
reduction (Section 3.5). Concept hierarchies can be used in an alternative form
of data reduction where we replace low-level data (such as raw values for age)
with higher-level concepts (such as youth, middle-aged, or senior). This form
of data reduction is the topic of Section 3.6, wherein we discuss the automatic
generation of concept hierarchies from numeric data using data discretization
techniques. The automatic generation of concept hierarchies from categorical data
is also described.

3.1 WHY PREPROCESS THE DATA?
Imagine that you are a manager at AllElectronics and have been charged with
analyzing the company’s data with respect to the sales at your branch. You immediately set out to perform this task. You carefully inspect the company’s database
and data warehouse, identifying and selecting the attributes or dimensions to be
included in your analysis, such as item, price, and units_sold. Alas! You notice
that several of the attributes for various tuples have no recorded value. For your
analysis, you would like to include information as to whether each item purchased
was advertised as on sale, yet you discover that this information has not been
recorded. Furthermore, users of your database system have reported errors,
unusual values, and inconsistencies in the data recorded for some transactions. In
other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only
aggregate data), noisy (containing errors, or outlier values that deviate from the
expected), and inconsistent (e.g., containing discrepancies in the department
codes used to categorize items). Welcome to the real world!
Incomplete, noisy, and inconsistent data are commonplace properties of large
real-world databases and data warehouses. Incomplete data can occur for a number
of reasons. Attributes of interest may not always be available, such as customer
information for sales transaction data. Other data may not be included simply
because it was not considered important at the time of entry. Relevant data may
not be recorded because of a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted.
Furthermore, the recording of the history or modifications to the data may have
been overlooked. Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
There are many possible reasons for noisy data (having incorrect attribute
values). The data collection instruments used may be faulty. There may have been
human or computer errors occurring at data entry. Errors in data transmission can
also occur. There may be technology limitations, such as limited buffer size for
coordinating synchronized data transfer and consumption. Incorrect data may also
result from inconsistencies in naming conventions or data codes used, or inconsistent formats for input fields, such as date. Duplicate tuples also require data
cleaning.
Data cleaning routines work to “clean” the data by filling in missing values,
smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any
data mining that has been applied to it. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most
mining routines have some procedures for dealing with incomplete or noisy data,
they are not always robust. Instead, they may concentrate on avoiding overfitting
the data to the function being modeled. Therefore, a useful preprocessing step is
to run your data through some data cleaning routines. Section 3.3 discusses
methods for cleaning up your data.
Getting back to your task at AllElectronics, suppose that you would like to
include data from multiple sources in your analysis. This would involve integrating
multiple databases, data cubes, or files, that is, data integration. Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for
customer identification may be referred to as customer_id in one data store and
cust_id in another. Naming inconsistencies may also occur for attribute values.
For example, the same first name could be registered as “Bill” in one database but
“William” in another, and “B.” in the third. Furthermore, you suspect that some
attributes may be inferred from others (e.g., annual revenue). Having a large
amount of redundant data may slow down or confuse the knowledge discovery
process. Clearly, in addition to data cleaning, steps must be taken to help avoid
redundancies during data integration. Typically, data cleaning and data integration
are performed as a preprocessing step when preparing the data for a data warehouse. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
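As a hedged sketch of reconciling such naming inconsistencies (the dataset name Store2_Customers is hypothetical; the column names follow the example in the text), one source can simply be renamed to match the other before integration:

/* Hypothetical example: align the key attribute name across two sources */
proc datasets library=work nolist;
   modify Store2_Customers;
   rename cust_id = customer_id;
quit;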
Getting back to your data, you have decided, say, that you would like to use
a distance-based mining algorithm for your analysis, such as neural networks,
nearest-neighbor classifiers, or clustering. Such methods provide better results if
the data to be analyzed have been normalized, that is, scaled to a specific range
such as (0.0, 1.0). Your customer data, for example, contain the attributes age
and annual salary. The annual salary attribute usually takes much larger
values than age. Therefore, if the attributes are left unnormalized, the distance
measurements taken on annual salary will generally outweigh distance measurements taken on age. Furthermore, it would be useful for your analysis to obtain
aggregate information as to the sales per customer region—something that is not
part of any precomputed data cube in your data warehouse. You soon realize that
data transformation operations, such as normalization and aggregation, are
additional data preprocessing procedures that would contribute toward the
success of the mining process. Data integration and data transformation are discussed in Section 3.4.
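As a minimal hedged sketch of such min-max scaling (the dataset Customers and the variables age and salary are hypothetical), each attribute can be rescaled to the 0-1 range by subtracting its minimum and dividing by its range:

/* Hypothetical example: min-max normalization of age and salary;
   PROC SQL remerges the summary statistics onto each row */
proc sql;
   create table Customers_norm as
   select *,
          (age - min(age)) / (max(age) - min(age)) as age_norm,
          (salary - min(salary)) / (max(salary) - min(salary)) as salary_norm
   from Customers;
quit;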
“Hmmm,” you wonder, as you consider your data even further. “The dataset I
have selected for analysis is huge, which is sure to slow down the mining process.
Is there any way I can reduce the size of my dataset without jeopardizing the data
mining results?” Data reduction obtains a reduced representation of the dataset
that is much smaller in volume yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction. These include
data aggregation (e.g., building a data cube), attribute subset selection (e.g.,
removing irrelevant attributes through correlation analysis), dimensionality reduc-