5.2 Prototype Selection, Feature Selection, and Data Compaction






multiple samples from the entire dataset, it compares the average minimum dissimilarity over all objects in the dataset. The suggested sample size is 40 + 2k. In practice, however, CLARA requires a large amount of time when k is greater than 100. It is observed that CLARA does not always generate good prototypes, and it is computationally expensive for large datasets, with a complexity of O(kc² + k(n − k)), where c is the sample size. CLARANS (Clustering Large Applications based on RANdomized Search) combines the sampling technique with PAM. CLARANS replaces the build phase of PAM by a random selection of objects. Whereas CLARA considers a fixed sample at every stage, CLARANS introduces randomness in the sample selection. Once a set of objects is selected, new objects are examined in the swap phase until preset limits on the number of local minima and on the maximum number of neighbors searched are reached. CLARANS generates better cluster representatives than PAM and CLARA. The computational complexity of CLARANS is O(n²).

Other schemes for prototype selection include Support Vector Machine (SVM) and Genetic Algorithm (GA) based schemes. SVMs are known to be expensive, as they take O(n³) time. The GA-based schemes need multiple scans of the dataset, which could be prohibitive when large datasets are processed.

As an illustration of prototype selection, we compare two algorithms that require a single database scan, viz., the Condensed Nearest-Neighbor (CNN) rule and the Leader clustering algorithm. The outline of CNN is provided in Algorithm 5.1. CNN starts with the first sample as a selected point (placed in BIN-2). Subsequently, the other patterns are classified using the selected pattern(s), and the first incorrectly classified sample is included as an additional selected point. Continuing in this manner with the selected patterns, all remaining patterns are classified to generate the final set of representative patterns.

Algorithm 5.1 (Condensed Nearest Neighbor rule)

Step 1: Set up two bins called BIN-1 and BIN-2. The first sample is placed in BIN-2.
Step 2: The second sample is classified by the NN rule, using the current contents of BIN-2 as the reference set. If it is classified correctly, it is placed in BIN-1; otherwise, it is placed in BIN-2.
Step 3: Proceeding in this manner, the ith sample is classified by the current contents of BIN-2. If classified correctly, it is placed in BIN-1; otherwise, it is placed in BIN-2.
Step 4: After one pass through the original sample set, the procedure continues to loop through BIN-1 until termination in one of the following ways:
(a) BIN-1 is exhausted, with all its members transferred to BIN-2, or
(b) one complete pass is made through BIN-1 with no transfers to BIN-2.
Step 5: The final contents of BIN-2 are used as the reference points for the NN rule; the contents of BIN-1 are discarded.
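As a rough illustration of Algorithm 5.1, the following Python sketch implements the two-bin procedure for binary patterns, using a 1-NN rule with the Hamming distance; the function and array names (condense, X, y) are ours and not part of the original text.

import numpy as np

def condense(X, y):
    # Condensed Nearest Neighbor (Algorithm 5.1): return the indices of BIN-2,
    # the retained reference set for the NN rule.
    bin2 = [0]                    # Step 1: the first sample goes to BIN-2
    bin1 = []
    for i in range(1, len(X)):    # Steps 2-3: one pass over the original sample set
        d = np.abs(X[bin2] - X[i]).sum(axis=1)        # Hamming distances to BIN-2
        if y[bin2[int(np.argmin(d))]] == y[i]:
            bin1.append(i)        # correctly classified -> BIN-1
        else:
            bin2.append(i)        # misclassified -> becomes a new reference point
    transferred = True            # Step 4: loop through BIN-1 until no transfers occur
    while transferred and bin1:
        transferred = False
        remaining = []
        for i in bin1:
            d = np.abs(X[bin2] - X[i]).sum(axis=1)
            if y[bin2[int(np.argmin(d))]] == y[i]:
                remaining.append(i)
            else:
                bin2.append(i)
                transferred = True
        bin1 = remaining
    return bin2                   # Step 5: BIN-2 is the condensed reference set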






The Leader clustering algorithm is provided in Sect. 2.5.2.1, and discussions related to the Leader algorithm are provided in Sect. 5.3.4.

A comparative study is conducted between CNN and Leader by providing all 6670 patterns as training data and 3333 patterns as test data, which are classified with the help of the Nearest-Neighbor Classifier (NNC). Table 5.1 provides the results. In the table, Classification Accuracy is denoted by CA, and CPU time refers to the processing time on a Pentium III 500 MHz computer, measured as the time elapsed between the first and last computations. The table provides a comparison between the two methods.

Table 5.1 Comparison between CNN and Leader

Method   Distance threshold   No. of prototypes   CA (%)   CPU time (sec)
CNN      -                    1610                86.77    942.76
Leader   5                    6149                91.24    1171.81
Leader   10                   5581                91.27    1066.54
Leader   15                   4399                90.40    896.84
Leader   18                   3564                90.20    735.53
Leader   20                   3057                88.03    655.44
Leader   22                   2542                87.04    559.52
Leader   25                   1892                84.88    434.00
Leader   27                   1526                81.70    363.00

It demonstrates the effect of the threshold on (a) the number of prototypes selected, (b) the CA, and (c) the processing time. A finite set of thresholds is chosen to demonstrate the effect of the distance threshold. It should be noted that, for binary patterns, the Hamming and Euclidean distances provide equivalent information, while the Hamming distance reduces the computation time by avoiding the squared deviations and the square root. Hence, we choose the Hamming distance as the dissimilarity measure.
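This equivalence is easy to check: for binary features, each coordinate contributes either 0 or 1 to both measures, so the squared Euclidean distance equals the Hamming distance and nearest-neighbor rankings under the two coincide. A minimal check in Python (the example vectors are ours):

import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 0, 1, 0, 0])

hamming = int(np.sum(x != y))             # number of differing bits
sq_euclidean = int(np.sum((x - y) ** 2))  # squared Euclidean distance

assert hamming == sq_euclidean == 3       # identical for binary patterns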

The exercises indicate that, compared to the Leader algorithm, CNN requires more time to obtain the same classification accuracy. CNN, however, provides a smaller but fixed set of prototypes for a chosen order of the input data. The Leader algorithm offers a way of improving the classification accuracy by means of threshold-based prototype selection and thus provides greater flexibility. In view of this, and based on the earlier comparative study with PAM and CLARA, Leader is chosen for prototype selection in this study. We use the NNC as the classifier and, in order to achieve efficient classification, we use the set of prototypes obtained using the Leader algorithm. Our scheme offers the flexibility to select prototype sets of different sizes.

Dimensionality reduction is achieved through either feature selection or feature extraction. Feature selection removes redundant features, typically through optimal feature selection by deterministic or random search algorithms. Some of the conventional algorithms include feature selection on an individual-merit basis, the branch-and-bound algorithm, sequential forward and backward selection, the plus l-take away r algorithm, max-min feature selection, etc. Feature extraction methods utilize all the information contained in the feature space to obtain a transformed space of lower dimension.

With these philosophical and historical notes in mind, and in order to obtain generalization and regularization, we examine a large handwritten digit dataset in terms of feature selection and data reduction and ask whether there exists an equivalence between the two. Four different approaches are presented, and the results of the exercises are provided to drive home the issues involved. We classify the large handwritten digit data by combining dimensionality reduction and prototype selection. The compactness achieved by dimensionality reduction is indicated by means of the number of combinations of distinct subsequences. A detailed study of subsequence-based lossy compression is presented in Chap. 4. The concepts of frequent items and the Leader clustering algorithm are used in this work.

The handwritten digit dataset considered for the study consists of 10 classes with a nearly equal number of patterns per class. The data, consisting of about 10,000 labeled patterns, are divided into training and test patterns in the approximate ratio of 67 % to 33 %. About 7 % of the total dataset is used as validation data and is taken out of the training dataset. Each pattern consists of 192 binary features.



5.2.1 Data Compression Through Prototype and Feature Selection

In Chap. 4, we observed that increasing the frequent-item support up to a certain value leads to data compaction without a significant reduction in classification accuracy. We explore whether such a compaction would lead to the selection of better prototypes than selection without it. Similarly, we study whether the same compaction would result in a more representative feature set. We propose to evaluate both of these activities through the classification of unseen data.

5.2.1.1 Feature Selection Through Frequent Features

In this chapter, we examine whether the frequent-item support helps in arriving at such a discriminative feature set. We select such feature sets for varying support values and evaluate each selected set by classifying unseen patterns.

5.2.1.2 Prototype Selection for Data Reduction

The use of representative patterns in place of the original dataset reduces the input data size. We make use of an efficient pattern clustering scheme known as Leader clustering, which is discussed in Sect. 2.5.2.1. The algorithm generates a clustering in a single database scan, and the leaders form the cluster representatives. The clustering algorithm requires a suitable value of the dissimilarity threshold; since such a value is data dependent, a random sample of the input data is used to arrive at the threshold. Each cluster is formed with reference to a leader. The leaders are retained, and the remaining patterns are discarded. The representativeness of the leaders is evaluated by classifying unseen patterns with the help of the set of leaders.
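A minimal sketch of this single-scan selection, assuming binary patterns and the Hamming distance (any dissimilarity measure can be substituted); the function name leader_prototypes and its arguments are ours:

import numpy as np

def leader_prototypes(X, threshold):
    # Single database scan of Leader clustering: the first pattern becomes the first
    # leader; every subsequent pattern within the threshold of some leader joins that
    # leader's cluster, otherwise it becomes a new leader.
    leaders = [0]
    for i in range(1, len(X)):
        distances = np.abs(X[leaders] - X[i]).sum(axis=1)  # Hamming distances to all leaders
        if distances.min() > threshold:
            leaders.append(i)                              # no existing leader is close enough
    return leaders

# The leaders alone are retained as prototypes, e.g.
# prototypes = X_train[leader_prototypes(X_train, t)] for a chosen dissimilarity threshold t.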



5.2.1.3 Sequencing Feature Selection and Prototype Selection for Generalization

Prototype selection and feature selection reduce the data size and the dimensionality, respectively. It is instructive to examine whether feature selection using frequent items followed by prototype selection, or vice versa, has any impact on classification accuracy. We experiment with both orderings to evaluate their relative performance, as sketched below.
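A compact sketch of the two orderings, with the support-based feature masking and the Leader selection written inline for self-containment (all helper names are ours), might look as follows:

import numpy as np

def mask_infrequent(X, eps):
    # Switch off (zero) every feature whose support, i.e. column count, is below eps.
    return X * (X.sum(axis=0) >= eps)

def leader_indices(X, t):
    # Single-scan Leader clustering with the Hamming distance; returns leader indices.
    leaders = [0]
    for i in range(1, len(X)):
        if np.abs(X[leaders] - X[i]).sum(axis=1).min() > t:
            leaders.append(i)
    return leaders

def features_then_prototypes(X, eps, t):
    Xf = mask_infrequent(X, eps)        # frequent-feature selection first
    return Xf[leader_indices(Xf, t)]    # then prototype selection

def prototypes_then_features(X, eps, t):
    Xp = X[leader_indices(X, t)]        # prototype selection first
    return mask_infrequent(Xp, eps)     # then frequent-feature selection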



5.2.1.4 Class-Wise Data vs Entire Data

Given a multiclass labeled dataset, we examine the relative performance of considering the dataset class-wise versus as a single large set of multiclass data. We observe from Fig. 5.4 and Table 5.6 that patterns belonging to different class labels require different numbers of effective features for their representation. A class-wise feature set or set of patterns is therefore likely to be a better representative of the class. On the other hand, it is interesting to examine whether there could be a common distance threshold for prototype selection and a common support threshold for selecting a feature set to represent the entire dataset.



5.3 Background Material

Consider training data containing n patterns, each having d features. Let ε and ζ

be the minimum support for considering any feature for the study and the distance threshold for selecting prototypes, respectively. For continuity of notation,

we follow the same terminology as provided in Table 4.1. Also, the terms defined

in Sect. 4.3 are valid in the current work too. Additional terms are provided below.

1. Leader. Leaders are the cluster representatives obtained by using the Leader clustering algorithm.
2. Distance threshold for clustering (ζ). It is the threshold value of the distance used for computing leaders.

Illustration 5.1 (Leaders, choice of first leader, and impact of threshold) In order to illustrate the computation of leaders and the impact of the threshold on the number of leaders, we consider the iris dataset from the UCI-ML repository and demonstrate the concepts on the iris-versicolor data. We take the petal length and petal width as the two features per pattern and, in applying the Leader algorithm, use the Euclidean distance as the dissimilarity measure. To start with, we consider a distance threshold (ζ) of 1.4 cm and take the first pattern as the first leader. The result is shown in Fig. 5.1: the data forms two clusters, with the members of each cluster shown by different symbols and the leaders marked with a superscribed square. To demonstrate the order dependence of leaders, we keep the same distance threshold but select pattern no. 16 as the first leader. As shown in Fig. 5.2, we still obtain two clusters, but with a different first leader and a different number of members per cluster. As a third example, we consider a distance threshold of 0.5 cm and obtain seven clusters, as shown in Fig. 5.3, where the leaders are again marked with a superscribed square. When we consider a large threshold of, say, 5.0 cm, we obtain a single cluster, which is essentially a scatter plot of all the patterns.

Fig. 5.1 Leader clustering with a threshold of 1.4 cm on the iris-versicolor dataset. The first leader is selected as pattern sl. no. 1

Fig. 5.2 Leader clustering with a threshold of 1.4 cm on the iris-versicolor dataset. The first leader is selected as pattern sl. no. 16

Fig. 5.3 Leader clustering with a threshold of 0.5 cm on the iris-versicolor dataset. The first leader is selected as pattern sl. no. 16; the data is grouped into seven clusters

3. Transaction. A transaction is represented using the set of items that could possibly be purchased. In any given transaction, all or a subset of these items is purchased; thus, a transaction indicates the presence or absence of each item. This is analogous to a pattern whose binary-valued features indicate presence or absence.

Illustration 5.2 (Transaction or Binary-Valued Pattern) Consider a transaction with six items. We represent an item bought as “1” and an item not bought as “0”. We represent five transactions with the corresponding itemsets in Table 5.2. For example, in transaction 3, items 3, 5, and 6 are purchased.

Table 5.2 Transactions and items

Transaction No.   Item 1   Item 2   Item 3   Item 4   Item 5   Item 6
1                 1        1        0        0        1        0
2                 0        1        1        0        1        1
3                 0        0        1        0        1        1
4                 1        0        0        0        0        1
5                 1        0        1        1        1        0

The Leader clustering algorithm is explained in Sect. 2.5.2.1. We use (a) pattern and transaction and (b) item and feature interchangeably in the current work. The following subsections describe some of the important concepts used in explaining the proposed method. As compared to k-means clustering, the Leader clustering algorithm identifies prototypes in a single data scan and does not involve iteration. However, leaders are order dependent: we may arrive at different sets of leaders depending on the choice of the first leader. Further, the centroid in k-means most often does not coincide with any of the input patterns, whereas a leader is always one of the input patterns.



5.3.1 Computation of Frequent Features

In the current section, we describe an experimental setup in which we examine which of the features help discrimination, using the frequent-item support. This is done by counting the number of occurrences of each feature in the training data. If the count is less than a given support threshold ε, the feature is set to be absent in all the patterns. After identifying the features with support less than ε as infrequent, the training dataset is modified to contain the “frequent features” only. As noted earlier, the value of ε is a trade-off between a minimal description of the data under study and the maximal compactness that could be achieved. The actual value depends on the size of the training data, such as class-wise data of 600 patterns each or the full data of 6000 patterns.
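A minimal sketch of this support-based masking, assuming the training patterns form a binary NumPy array with one row per pattern (the names are ours):

import numpy as np

def frequent_feature_mask(X, eps):
    # Support of a feature = number of training patterns in which it occurs (value 1).
    # Features with support below eps are treated as infrequent and are switched off
    # (set to 0) in every pattern.
    support = X.sum(axis=0)       # per-feature occurrence counts over the training data
    keep = support >= eps         # frequent features
    return X * keep, keep

# Example: X_frequent, keep = frequent_feature_mask(X_train, eps),
# where eps is chosen relative to the data size, e.g. class-wise data of 600 patterns
# versus the full data of 6000 patterns.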

To illustrate the concept, we consider one arbitrary pattern from each class and display each such pattern with frequent features having supports of 1, 100, 200, 300, and 400; each of these support values is out of 600 patterns. Figure 5.4 contains these patterns. It is interesting to note that, although a pattern becomes less decipherable to the human eye with increasing support, as shown in the figure, it remains sufficient for the machine to classify it correctly. Later in the current chapter, we demonstrate the advantage of this concept.






Fig. 5.4 Sample patterns with frequent features having support 1, 100, 200, 300, and 400



5.3.2 Distinct Subsequences

The concepts of sequence, subsequence, and length of a subsequence are used in the context of demonstrating the compactness of a pattern, as discussed in Chap. 4. For example, consider a pattern containing the binary features {0111 0110 1101 0001 1011 0010 . . .}, written here with a block length of 4. The corresponding values of the blocks, which are the decimal equivalents of the 4-bit blocks, are {7, 6, 13, 1, 11, 2, . . .}. When the pattern is arranged as a 16 × 12 matrix, each row of the matrix contains three blocks of 4 bits each, e.g., {(7, 6, 13), (1, 11, 2), . . .}. Let each set of three such codes form a subsequence, e.g., {(7, 6, 13)}. All such distinct subsequences in the training set are counted. The original data of 6000 training patterns consists of 6000 · 192 features; when arranged as subsequences, the corresponding number of distinct subsequences is 690.

We count the frequency of each subsequence, i.e., its number of occurrences, and order the subsequences in descending order of frequency; they are numbered sequentially for internal use. For example, the two most frequent distinct subsequences, {(0, 6, 0)} and {(0, 3, 0)}, are repeated 8642 and 6447 times, respectively. As the minimum support value ε is increased, some of the binary feature values are set to zero. This leads to a reduction in the number of distinct subsequences, and we show later that it also provides better generalization.
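The counting described above can be sketched as follows, assuming each pattern is given as a 192-character binary string arranged as a 16 × 12 matrix, i.e., three 4-bit blocks per row (the helper names are ours):

from collections import Counter

def subsequences(pattern_bits):
    # Split a 192-bit pattern into rows of 12 bits (three 4-bit blocks each) and
    # return each row as a tuple of its three decimal block values.
    rows = [pattern_bits[i:i + 12] for i in range(0, len(pattern_bits), 12)]
    return [tuple(int(row[j:j + 4], 2) for j in range(0, 12, 4)) for row in rows]

def distinct_subsequence_counts(patterns):
    # Frequency of every distinct subsequence over the training set,
    # ordered by decreasing frequency.
    counts = Counter(s for p in patterns for s in subsequences(p))
    return counts.most_common()

# For the original 6000 training patterns, the text reports 690 distinct subsequences,
# the most frequent ones being (0, 6, 0) and (0, 3, 0).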



5.3.3 Impact of Support on Distinct Subsequences

As discussed in Sect. 5.3.2, the number of distinct subsequences reduces with increasing ε. For example, consider the pattern {(1101 1010 1011 1100 1010 1011 . . .)}. The corresponding 4-bit block values are {(13, 10, 11), (12, 10, 11), . . .}. Suppose that, with the chosen support, feature number 4 of this pattern becomes absent. The pattern then becomes {(1100 1010 1011 1100 1010 1011 . . .)}, and the original distinct subsequences {(13, 10, 11), (12, 10, 11), . . .} reduce to {(12, 10, 11), (12, 10, 11), . . .}, where {(13, 10, 11)} is replaced by {(12, 10, 11)}. This results in a reduction in the number of distinct subsequences.
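The example above can be reproduced in a few lines of Python (the blocks4 helper is ours):

def blocks4(bits):
    # Decimal values of consecutive 4-bit blocks of a binary string.
    return [int(bits[i:i + 4], 2) for i in range(0, len(bits), 4)]

before = "110110101011" + "110010101011"   # subsequences (13, 10, 11) and (12, 10, 11)
after = before[:3] + "0" + before[4:]      # feature 4 becomes infrequent and is set to 0

print(blocks4(before))   # [13, 10, 11, 12, 10, 11] -> two distinct subsequences
print(blocks4(after))    # [12, 10, 11, 12, 10, 11] -> one distinct subsequence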






5.3.4 Computation of Leaders

The Leader computation algorithm is described in Sect. 2.5.2 of Chap. 2. The leaders are considered as prototypes, and they alone are used further, either for classification or for computing frequent items, depending on the adopted approach. This constitutes the data reduction.



5.3.5 Classification of Validation Data

The algorithm is tested against the validation data using the k-Nearest-Neighbor Classifier (kNNC). Each time, the prototypes alone are used to classify the test patterns. Different approaches are followed to generate the prototypes; depending on the approach, the prototypes are either in their original form or in a new form with a reduced number of features. The schemes are discussed in the following section.
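As a sketch of this evaluation step, assuming scikit-learn is available (the random arrays below merely stand in for the actual prototype and validation sets, and the choice of k is illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_proto = np.random.randint(0, 2, size=(500, 192))   # prototype patterns (placeholder data)
y_proto = np.random.randint(0, 10, size=500)         # their class labels
X_val = np.random.randint(0, 2, size=(100, 192))     # validation patterns in the same feature form
y_val = np.random.randint(0, 10, size=100)

knnc = KNeighborsClassifier(n_neighbors=3, metric="hamming")
knnc.fit(X_proto, y_proto)                 # the prototypes alone form the reference set
accuracy = knnc.score(X_val, y_val)        # classification accuracy (CA) on the validation data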



5.4 Preliminary Analysis

We carry out an elaborate preliminary analysis to arrive at the various parameters and to study the sensitivity of these parameters.

Table 5.3 contains the results of preliminary experiments that consider a training dataset of 6670 patterns, obtained by combining the training and validation data. The exercises provide insight into the choice of thresholds, the number of patterns per class, the reduced set of training patterns, and the classification accuracy on the test data. It can be noticed from the table that, as the distance threshold (ζ) increases, the number of prototypes reduces, and that the classification reaches its best accuracy for an optimal set of thresholds, beyond which it starts to fall. The table lists class-wise thresholds for different discrete choices. One threshold set that provides the best classification accuracy is {3.5, 3.5, 3.5, 3.8, 3.5, 3.7, 3.5, 3.5, 3.5, 3.5}, with an accuracy of 94.0 %.

Table 5.3 Experiments with Leader clustering with class-wise thresholds and prototypes

Distance threshold (same for all classes)   No. of prototypes per class (classes 0-9)          #Prototypes   CA (%)
4.0                                         412, 21, 580, 545, 500, 606, 398, 243, 528, 374    4207          92.53
3.0                                         609, 74, 663, 657, 649, 661, 686, 549, 642, 624    5764          93.61
3.2                                         561, 49, 647, 640, 629, 657, 577, 452, 628, 565    5405          93.61
3.4                                         542, 42, 641, 630, 612, 653, 548, 409, 606, 536    5219          93.85
3.6                                         506, 34, 628, 606, 593, 648, 516, 363, 593, 497    4984          93.88
3.7                                         418, 30, 616, 589, 567, 640, 489, 322, 570, 463    4764          93.82

Table 5.4 shows the impact of various class-wise support thresholds for a chosen set of Leader distance thresholds. For this study, we use a common class-wise distance threshold of 3.5 for prototype selection with the Leader clustering algorithm. Column 1 of the table contains the class-wise support values, column 2 the total number of prototypes, i.e., the sum of the class-wise prototypes generated with the common distance threshold of 3.5, and column 3 the classification accuracy using kNNC. The table reveals an interesting aspect: when only the patterns with frequent features are selected, the number of representative patterns also reduces.

Table 5.4 Experiments with support values for a common prototype-selection threshold

Support threshold   No. of prototypes   Classification accuracy (%)
5                   4981                93.82
10                  4974                93.82
15                  4967                93.61
20                  4962                93.58
25                  4948                93.67
30                  4935                93.61
35                  4928                93.49
40                  4915                93.70
45                  4899                93.55
50                  4887                93.79
55                  4875                93.76

We study the sensitivity of the support in terms of the number of distinct subsequences. We consider the 6670 patterns in full and apply the support threshold. We present the numbers of distinct subsequences and evaluate the patterns with the reduced numbers of distinct features using kNNC. Table 5.5 contains the results: column 1 gives the support threshold for the features, column 2 the corresponding number of distinct subsequences, and column 3 the classification accuracy of the validation patterns with the considered distinct subsequences.



Table 5.5 Distinct subsequences and classification accuracy for varying support and a constant set of input patterns

Support threshold   Distinct subsequences   Classification accuracy (%)
0                   690                     92.47
15                  648                     92.35
25                  599                     92.53
45                  553                     92.56
55                  533                     92.89
70                  490                     92.29
80                  468                     92.26
90                  422                     92.20
100                 395                     92.32



5.5 Proposed Approaches

With the background of the discussion provided in Sect. 5.2, we propose the following four approaches.

• Patterns with frequent features only, considered in both class-wise and combined

sets of multi-class data

• Cluster representatives only in both class-wise and entire datasets

• Frequent item selection followed by prototype selection in both class-wise and

entire datasets

• Prototype selection followed by frequent items in both class-wise and combined

datasets

In the following subsections, we elaborate on each of these approaches.



5.5.1 Patterns with Frequent Items Only

In this approach, we consider the entire training data. For a chosen ε, we form patterns containing frequent features only. With this training data containing frequent features, we classify the validation patterns using kNNC. By varying ε, the Classification Accuracy (CA) is computed; the value of ε that provides the best CA is identified, and the results are tabulated. The entire exercise is repeated considering class-wise data with class-wise support as well as the full data. It should be noted that the support value depends on the data size. The procedure can be summarized as follows.

Class-Wise Data
• Consider class-wise data of 600 patterns per class.
• By changing the support in small discrete steps, carry out the following steps.


