
7.4 Efficient Approaches to Large-Scale Feature Selection


working of the algorithm. Experiments and results are discussed in Sect. 7.4.4. The

work is summarized in Sect. 7.4.5.

7.4.1 An Overview of Genetic Algorithms

Genetic algorithms are search and optimization methods based on the mechanisms

of natural genetics and evolution. These algorithms draw their analogy from the competition and survival of the fittest observed in nature. GAs

have advantages over conventional optimization methods in finding a global or near-global optimal solution while avoiding local optima. Over the years, their applications have spread to almost all engineering disciplines, and a number of developments and variants have matured into topics of their own, such as multiobjective genetic algorithms, interactive

genetic algorithms, etc. In the current section, we briefly discuss the basic concepts

with a focus on the implementation of a simple genetic algorithm (SGA) and a few applications. A brief discussion on SGA can be found in Chap. 3. The discussion

provided in the present section forms the background to subsequent material.

SGA is characterized by the following:

• A population of chromosomes, i.e., binary strings of finite length.

• A fitness function and a problem encoding mechanism.

• Selection of individual strings.

• Genetic operators, viz., crossover and mutation.

• Termination and other control mechanisms.

Each of these topics has been studied in depth in the research literature. Since the current section is intended to provide a complete discussion with a focus on implementation aspects, interested readers are directed to the references listed at the end of the section. We also intentionally avoid discussion

on other evolutionary algorithms.

Objective Function. SGA is intended to find an optimal set of parameters that optimizes

a function. For example, find a set of parameters, x1 , x2 , . . . , xn , that maximizes

a function f (x1 , x2 , . . . , xn ).

Chromosomes. A bit-string or chromosome consists of a finite number of bits, l, called the length of the chromosome. Bit-string encoding is a classical method adopted by researchers. The chromosomes are used to encode parameters that represent a solution to the optimization problem. Alternate encoding mechanisms include Gray code, floating-point encoding, etc. SGA makes use of a population of chromosomes with a finite population size, C. Each bit of the bit-string is called an allele in genetic terms, and the two terms are used interchangeably in the literature.

Encoding Mechanism and Fitness Function. We find an optimal value of f (x1 , x2 ,

. . . , xn ) through the set of parameters x1 , x2 , . . . , xn . The value of f (·) is called


7 Optimal Dimensionality Reduction

the fitness function. Given the values of x1 , x2 , . . . , xn , the fitness can be computed. We encode the chromosome to represent the set of parameters. This

forms the key step of a GA. Encoding depends on the nature of the optimization

problem. The following are two examples of encoding mechanisms. It should be

noted that the mechanisms are problem dependent, and one can find novel ways

of encoding a given problem.

Example 1. Suppose that we need to select a subset of features out of a group of

features that represent a pattern. The chromosome length is considered equal to

the total number of features in the pattern, and each bit of the chromosome represents whether the corresponding feature is considered. The fitness function in

this case can be the classification accuracy based on the selected set of features.

Example 2. Suppose that we need to find values of two parameters that minimize

(maximize) a given function and the parameters assume real values. The chromosome is divided into two parts representing the two parameters. The number of bits needed to cover the expected range of real values of each parameter determines the corresponding part lengths, viz., l1 and l2 . The length of the chromosome is given by l1 + l2 .
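As an illustration of this two-part encoding, the following Python sketch (our own illustrative reading, not the book's implementation) splits a chromosome at l1 and maps each binary substring linearly onto the expected real range of its parameter:

```python
def decode(chrom, l1, range1, range2):
    # Split the chromosome into two parts of lengths l1 and l - l1, then map
    # each binary substring linearly onto the real range of its parameter.
    def to_real(bits, lo, hi):
        n = int("".join(map(str, bits)), 2)             # binary -> integer
        return lo + n * (hi - lo) / (2 ** len(bits) - 1)
    return to_real(chrom[:l1], *range1), to_real(chrom[l1:], *range2)
```

For example, an 8-bit chromosome with l1 = l2 = 4 and expected ranges [0, 1] and [−5, 5] decodes its all-ones first part to 1.0 and its all-zeros second part to −5.0.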

Selection Mechanism. Selection refers to carrying individual chromosomes from one generation into the next while giving emphasis to highly fit individuals in the current generation. Many selection schemes are used in practice. For example, the roulette wheel selection scheme assigns each chromosome a sector of a roulette wheel such that the angle subtended by the sector is proportional to its fitness. This ensures that more copies of highly fit individuals move on to the next generation.
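Roulette wheel selection can be sketched as follows: one spin corresponds to a uniform draw over the total fitness, and each chromosome owns a cumulative span proportional to its fitness (an illustrative sketch, not the book's code):

```python
import random

def roulette_select(population, fitnesses):
    # Spin the wheel once: each chromosome owns a sector (cumulative span)
    # proportional to its fitness, so fitter individuals are picked more often.
    total = sum(fitnesses)
    pick = random.uniform(0, total)
    cum = 0.0
    for chrom, fit in zip(population, fitnesses):
        cum += fit
        if pick <= cum:
            return chrom
    return population[-1]   # guard against floating-point round-off
```

A chromosome with zero fitness owns a zero-width sector and is effectively never selected.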

Crossover. Pairs of individuals, s1 and s2 , are chosen at random from the population and are subjected to crossover. Crossover takes place when a generated random number in the range [0, 1] falls below the prechosen probability of crossover, Pc . In the “single-point crossover” scheme, a position k within the chromosome is chosen at random from the numbers 1, 2, . . . , (l − 1) with equal probability. Crossover takes place at k, resulting in two new offspring: offspring 1 contains bits 1 to k of s1 and bits (k + 1) to l of s2 , and offspring 2 contains bits 1 to k of s2 and bits (k + 1) to l of s1 . The operation is depicted in Fig. 7.1. Other crossover schemes include two-point crossover, uniform crossover, etc.

Mutation. Mutation of a bit consists of changing it from 0 to 1 or vice versa based on the probability of mutation, Pm . This provides better exploration of the solution space by restoring genetic material that could possibly be lost through generations. The activity consists of generating a random number in the range [0, 1]; if the random number is less than Pm , mutation is carried out. The bit position to mutate is chosen uniformly at random from 1, 2, . . . , l. A higher value of Pm causes more frequent disruption. The operation is depicted in Fig. 7.2.
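The mutation step just described, with a single draw deciding whether mutation occurs and a uniformly chosen bit position, might be sketched as:

```python
import random

def mutate(chrom, pm):
    # One draw in [0, 1] decides whether mutation occurs; if it does, a single
    # bit position chosen uniformly from 1..l is flipped (0 -> 1 or 1 -> 0).
    s = chrom[:]
    if random.random() < pm:
        k = random.randrange(len(s))
        s[k] = 1 - s[k]
    return s
```

With Pm = 0 the chromosome is returned unchanged; with Pm = 1 exactly one bit is flipped.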

Termination. Many criteria exist for termination of the algorithm. Some approaches

are (a) when there is no significant improvement in the fitness value, (b) a limit

on number of iterations, etc.

7.4 Efficient Approaches to Large-Scale Feature Selection


Fig. 7.1 Crossover operation

Fig. 7.2 Mutation operation

Control Parameters. The choice of the population size C and the values of Pc and Pm affects the solution quality and the speed of convergence. Although a large population size helps assure convergence, it increases computation time. The choice of these parameters is problem dependent. We demonstrate the effect of their variability in Sect. 7.4.4. Adaptive schemes for choosing the values of Pc and Pm show improvement in the final fitness value.

SGA. With the above background, we briefly discuss the working of a Simple Genetic

Algorithm as given below. After encoding the parameters of an optimization

problem, consider n chromosomes, each of length l. Initialize the population

with a probability of initialization, PI . With PI = 0, all the alleles are considered

for each chromosome, and with PI = 1, none are considered. Thus, as the value

of PI varies from 0 to 1, more alleles with value 0 are expected, thereby resulting

in fewer features being selected for the chromosome. In Sect. 7.4.4,

we demonstrate the effect of variation of PI and provide a discussion. As the

next step, we evaluate the fitness function to obtain the fitness value of each chromosome in the population.
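The initialization just described, where PI = 0 selects every allele and PI = 1 selects none, can be read as setting an allele to 1 when a uniform draw exceeds PI. A sketch under that reading (ours, not the book's code):

```python
import random

def init_population(C, l, pi):
    # An allele is set to 1 when a uniform draw in [0, 1) exceeds PI, so
    # PI = 0 selects (almost surely) every feature and PI = 1 selects none.
    return [[1 if random.random() > pi else 0 for _ in range(l)]
            for _ in range(C)]
```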

Until convergence based on the chosen criteria is obtained, each iteration selects the population for the next generation and performs crossover (Pc ) and mutation (Pm ) operations to obtain new offspring. The fitness function is then computed for the new population.
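The working just described can be sketched end-to-end in Python (a compact illustrative sketch with hypothetical default parameter values and a generation-limit termination criterion; the fitness function is supplied by the caller):

```python
import random

def sga(fitness, C=20, l=12, pi=0.5, pc=0.8, pm=0.05, max_gen=50):
    # Simple Genetic Algorithm: initialize with PI, then repeat roulette
    # selection, single-point crossover (Pc), and mutation (Pm), keeping
    # track of the best chromosome seen so far.
    pop = [[1 if random.random() > pi else 0 for _ in range(l)]
           for _ in range(C)]
    best = max(pop, key=fitness)
    for _ in range(max_gen):                      # termination: generation limit
        fits = [fitness(s) for s in pop]
        total = sum(fits) or 1.0

        def select():
            pick, cum = random.uniform(0, total), 0.0
            for s, f in zip(pop, fits):
                cum += f
                if pick <= cum:
                    return s
            return pop[-1]

        nxt = []
        while len(nxt) < C:
            s1, s2 = select()[:], select()[:]
            if random.random() < pc:              # single-point crossover
                k = random.randint(1, l - 1)
                s1, s2 = s1[:k] + s2[k:], s2[:k] + s1[k:]
            for s in (s1, s2):
                if random.random() < pm:          # mutation of one bit
                    j = random.randrange(l)
                    s[j] = 1 - s[j]
                nxt.append(s)
        pop = nxt[:C]
        cand = max(pop, key=fitness)
        if fitness(cand) > fitness(best):
            best = cand
    return best
```

On the classic one-max problem (fitness = number of 1-bits), selection pressure quickly drives the best chromosome toward the all-ones string.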



Simple Genetic Algorithm

Step 1: Initialize population containing ‘C’
        strings of length ‘l’, each with
        probability of initialization, PI;
Step 2: Compute fitness of each chromosome;
while (termination criterion not met)
{
    Step 3: Select population for the next
            generation;
    Step 4: Perform crossover based on Pc and
            mutation based on Pm;
    Step 5: Compute fitness of each updated
            chromosome;
}

Steady-State Genetic Algorithm (SSGA)

In the general framework of Genetic Algorithms, we choose the entire feature set of a pattern as a chromosome. Since the features are binary, the alleles indicate the presence or absence of the corresponding feature in a pattern. The genetic operators of selection, crossover, and mutation, with the corresponding probability of initialization (PI ), probability of crossover (Pc ), and probability of mutation (Pm ), are used. As in the case of SGA, the given dataset is divided into training, validation, and test data. The classification accuracy on validation data using the nearest-neighbor classifier (NNC) forms the fitness function. Table 7.1 contains the terminology used in this section.

In the case of SSGA, we retain a chosen percentage of highly fit individuals from generation to generation, thereby preventing the loss of such individuals along the way; this retained fraction is related to what is termed the generation gap. Thus, SSGA permits larger Pm values as compared to SGA.

7.4.2 Proposed Schemes

We propose the algorithms shown in Algorithm 7.1 and Algorithm 7.2 for the study.

Algorithm 7.1 integrates run-length compression of data, classification of compressed data, SSGA, and knowledge acquired through preliminary analysis with



a generation gap of 40 %. Algorithm 7.2 integrates the concept of frequent features

in addition to GA-based optimal feature selection.

Algorithm 7.1 (Algorithm for Feature Selection using Compressed Data Classification and Genetic Algorithms)

Step 1: Consider a population of ‘C’ chromosomes, with each chromosome consisting of ‘l’ features. Initialize each chromosome by setting features to ‘1’ (selected) according to a given probability of initialization, PI .

Step 2: For each chromosome in the population,

(a) Consider those selected features in the chromosome

(b) With the selected features in training and validation data sets, compress

the data

(c) Compute classification accuracy of validation data directly using the

compressed form. The classification accuracy forms the fitness function

(d) Record the number of alleles, classification accuracy for each chromosome, and generation-wise average fitness value.

Step 3: In computing next generation of chromosomes, carry out the following


(a) sort the chromosomes in the descending order of their fitness

(b) preserve 40 % of highly fit individuals for the next generation

(c) the remaining 60 % of the next population are obtained by subjecting randomly selected individuals from current population to cross-over

and mutation with respective probabilities Pc and Pm .

Step 4: Repeat Steps 2 and 3 till there is no significant change in the average fitness

between successive generations.

In the framework of optimal feature selection using genetic algorithms, each

chromosome is considered to represent the entire candidate feature set. The population containing C chromosomes is initialized in Step 1. Since the features are binary valued, the initialization is carried out by setting a feature to “1” according to a given

probability of initialization, PI . Based on the binary value 1 or 0 of an allele, the

corresponding feature is considered either selected or not, respectively.

In Step 2, for each initialized chromosome, the original training and validation data are updated to contain only the selected features. The data is compressed using the run-length compression algorithm. The validation data is classified in its compressed

form, and the average classification accuracy is recorded.

In Step 3, the subsequent population is generated. The best 40 % of the current population are preserved. The remaining 60 % are generated by subjecting individuals selected from the current population to the genetic operators of selection, single-point crossover, and mutation with preselected probabilities. The terminating criterion is the percentage change of fitness between two successive generations.
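Step 3, with its 40 % generation gap, can be sketched as follows (an illustrative reading; the fitness function is assumed to be supplied by the caller):

```python
import random

def next_generation(pop, fitness, keep=0.4, pc=0.8, pm=0.05):
    # Sort by fitness, preserve the top 40 % (the generation gap), and rebuild
    # the remaining 60 % by crossing over and mutating randomly chosen
    # individuals from the current population.
    pop = sorted(pop, key=fitness, reverse=True)
    C, l = len(pop), len(pop[0])
    nxt = [s[:] for s in pop[:int(keep * C)]]     # elite survives unchanged
    while len(nxt) < C:
        s1, s2 = random.choice(pop)[:], random.choice(pop)[:]
        if random.random() < pc:                  # single-point crossover
            k = random.randint(1, l - 1)
            s1, s2 = s1[:k] + s2[k:], s2[:k] + s1[k:]
        for s in (s1, s2):
            if random.random() < pm:              # mutation of one bit
                j = random.randrange(l)
                s[j] = 1 - s[j]
            if len(nxt) < C:
                nxt.append(s)
    return nxt
```

Because the elite fraction is copied unchanged, the best fitness in the population never decreases between generations.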

Elaborate experimentation is carried out by changing the population initialization procedure, such as (a) a preselected population, (b) preselecting some features as unused, and (c) initialization using the probability of initialization, and by varying the values of the probabilities of initialization, cross-over, mutation, etc. The nature of the exercises and the

results are discussed in the following section.

Genetic Algorithms (GAs) are well studied for feature selection and feature extraction. We restrict our study to feature selection. Given a feature set of size l, the problem of dimensionality reduction can be defined as arriving at a subset of the original feature set of dimension d < l such that the best classification accuracy is obtained. The single dominant computational block is the evaluation of the fitness function; if it could be sped up, the overall algorithm would be faster. To achieve this, we propose to compress the training and validation data and compute the classification accuracy directly on the compressed data without having to uncompress it. In the current section, before discussing the proposed procedure, we present compressed data classification and the Steady-State Genetic Algorithm for feature selection in the following subsections.

Compressed Data Classification

We make use of the algorithm discussed in Chap. 3 to compress the input binary data and operate directly on the compressed data, without decompressing, for classification using runs. This forms a fully lossless compression–decompression scenario. It is possible to perform this when classification is achieved with the help of the Manhattan distance function. The distance function on the compressed data results in the same classification accuracy as that obtained on the original data, as shown in Chap. 3. The compression algorithm is applied to large data and is noticed to reduce processing requirements significantly.

Frequent Features

Although Genetic Algorithms provide an optimal feature subset, it is interesting to explore whether the input set of features can be reduced by simpler means. The frequent pattern approach, as discussed in Sects. 2.4.1 and 4.5.2, provides frequently occurring features, which could possibly help discrimination too. A binary-valued pattern can be considered as a transaction, with each bit representing the presence or absence of the corresponding feature. The support of an item can be defined as the percentage of transactions in the given database that contain the item. We make use of the concept of support in identifying the feature set that is frequent above a chosen threshold. This results in a reduction in the number of features that need to be explored for an optimal set. In Sect. 7.4.3, as part of the preliminary analysis on the considered data, we demonstrate this aspect. Figure 7.3 demonstrates the concept of support. The terms support and percentage-support are used equivalently in the present chapter.
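Treating each binary pattern as a transaction, feature-wise support can be computed and thresholded as follows (a small sketch; the threshold values below are illustrative):

```python
def frequent_features(patterns, min_support):
    # Support of a feature = percentage of patterns (transactions) in which
    # the feature is present; keep only features meeting the threshold.
    n = len(patterns)
    l = len(patterns[0])
    counts = [sum(p[j] for p in patterns) for j in range(l)]
    return [j for j, c in enumerate(counts) if 100.0 * c / n >= min_support]
```

The GA then needs to explore subsets only of the features returned here, shortening the chromosome.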

Algorithms 7.1 and 7.2 are studied in detail in the following sections.

Algorithm 7.2 (Optimal Feature Selection using Genetic Algorithms combined

with frequent features)

Step 1: Identify frequent features based on a chosen support threshold.



Fig. 7.3 The figure depicts

the concepts of transaction,

items, and support

Step 2: Consider only those frequent features for further exploration.

Step 3: All steps of Algorithm 7.1.

We briefly elaborate each of the steps along with results of preliminary analysis.

7.4.3 Preliminary Analysis

Preliminary analysis brings out insights into the data and forms domain

knowledge. The analysis primarily consists of computation of measures of central

tendency and dispersion, feature occupancy of patterns, class-wise variability, and

inter-class similarities. The results of the analysis help in choosing appropriate parameters and forming the experimental setup.

We consider 10-class handwritten digit data consisting of 10,000 192-featured

patterns. Each digit is formed as a 16 × 12 matrix with binary-valued features. The

data is divided into three mutually exclusive sets for training, validation, and testing.

In order to perform optimal feature selection, it is useful to understand basic statistics on the number of nonzero features in the training data. Although care is taken while forming the handwritten dataset in terms of centering and depicting all variations of the collected digits through the pattern matrix, it is possible that some regions within the 16 × 12 matrix are not fully utilized, depending on the class label. Figure 7.4 contains these details. The topmost figure depicts class-wise details of the average number of nonzero features. It can be seen that the digits 0, 2, and 8 contain about 68 nonzero features each, with digit 1 requiring the least number of nonzero features, about 30, for



Fig. 7.4 Statistics of features in the training dataset

Fig. 7.5 The figure contains

nine 3-featured patterns

occupying different feature

locations in 3 × 3 pattern

representation. It can be

observed that all locations are

occupied cumulatively at the

end of 9 sample patterns

representation. The middle figure indicates the standard deviation of the number of

nonzero features, indicating comparatively a larger dispersion for the digits 0, 2, 3, 5,

and 8. The third plot in the figure provides an interesting aspect of the occupancy of features within a digit. Considering the digit 0, although on average 68 nonzero features suffice to represent the digit, about 175 of the 192 feature locations are occupied by one training pattern or another. Similar observations can be seen

for other digits too.

When the objective is to find an optimal subset of features, this provides a

glimpse of the complexity involved. Figure 7.5 summarizes this argument: although

the average number of features per pattern is small, all the feature locations can be

occupied at least once. We consider a pattern such as handwritten digit “1” in a 3 × 3

pattern-representation. The average number of features needed to represent the digit

is 3. It can be noted here that all the feature locations are occupied after passing

through 9 patterns.



Fig. 7.6 The figure contains patterns with frequent features excluded with minimum support thresholds of 13, 21, 52, and 70. The excluded feature regions are depicted as the gray and black portions; the remainder corresponds to the retained feature set for exploration

Redundancy of Features Vis-à-Vis Support

We make use of the concept of support, as discussed above in the context of Fig. 7.3, to identify the features that occur above a prechosen support threshold. We compute the empirical probability of occurrence for each feature and vary the support threshold to find the set of frequent features. We will later examine experimentally whether the excluded features have an impact on feature selection. Figure 7.6 contains an image of a 192-featured pattern with excluded features corresponding to various support thresholds. The figure indicates features of low minimum support; note that in this case they occur on the edges of the pattern. As the support threshold is increased, the pattern representability will be affected.

Data Compression and Statistics

The considered patterns consist of binary-valued features. The data is compressed

using the run-length compression scheme as discussed in Chap. 3. The scheme consists of the following steps.

• Consider each pattern.

• Form runs of continuous occurrences of each feature value. For ease of dissimilarity computation, consider each pattern as starting with a feature value of 1, so that the first run corresponds to the number of leading 1s. In case the first feature of the pattern is 0, the corresponding run length would be 0.

The compression results in an unequal number of runs for the various patterns, as shown in Fig. 7.7. The dissimilarity computation in the compressed domain is based on the

work in Chap. 3.
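The run-length scheme above, and a Manhattan distance computed directly on the runs without decompressing, can be sketched as follows. This is an illustrative reconstruction of the Chap. 3 idea, not its exact implementation:

```python
def rle_encode(bits):
    # Runs of consecutive equal values; the first run counts leading 1s and
    # is 0 when the pattern starts with 0, as described above.
    runs, cur, count = [], 1, 0
    for b in bits:
        if b == cur:
            count += 1
        else:
            runs.append(count)
            cur, count = 1 - cur, 1
    runs.append(count)
    return runs

def manhattan_runs(ra, rb):
    # Walk both run lists in lock-step without decompressing; run values
    # alternate 1, 0, 1, ... by construction, and only segments where the
    # two patterns disagree contribute to the distance.
    dist, i, j, va, vb = 0, 0, 0, 1, 1
    la, lb = ra[0], rb[0]
    while i < len(ra) and j < len(rb):
        m = min(la, lb)
        if va != vb:
            dist += m
        la, lb = la - m, lb - m
        if la == 0:
            i += 1
            la = ra[i] if i < len(ra) else 0
            va = 1 - va
        if lb == 0:
            j += 1
            lb = rb[j] if j < len(rb) else 0
            vb = 1 - vb
    return dist
```

For binary features the Manhattan distance coincides with the Hamming distance, so the compressed-domain result matches the distance on the original bit vectors, which is why classification accuracy is preserved.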

7.4.4 Experimental Results

Experimentation is planned to explore each of the parameters of Table 7.1 in order

to arrive at a minimal set of features that provides the best classification accuracy.

We initially study the choice of the probabilities of initialization, cross-over, and mutation based on a few generations of execution of the genetic algorithm. After choosing



Fig. 7.7 Statistics of runs in compressed patterns. For each class label, the vertical bar indicates

the range of number of runs in the patterns. For example, for class label “0”, the compressed

image length ranges from 34 to 67. The discontinuities indicate that there are no patterns that

have compressed lengths of 36 to 39. The figure provides range of compressed pattern lengths

corresponding to the original pattern length of 192 for all the patterns

Table 7.1 Terminology

Population size (C)
No. of generations
Length of chromosome (l)
Probability of initialization (PI )
Probability of cross-over (Pc )
Probability of mutation (Pm )
Support threshold

appropriate values for these three probabilities, we proceed with feature selection. We also bring out a comparison of computation time with and without compression. All the exercises are carried out with run-length-encoded lossless compression, and classification is performed directly in the compressed domain.

Choice of Probabilities

In order to choose appropriate values for the probabilities of cross-over, mutation, and initialization, exercises are carried out using the proposed algorithm for 10–15 generations. For these exercises, we consider the complete set containing 192 features.

Fig. 7.8 Result of genetic algorithms after 10 generations on the sensitivity of the probabilities of initialization, cross-over, and mutation. The two plots in each case indicate the number of features of the best chromosome across 10 generations and the corresponding fitness value

Figure 7.8 contains the results of these exercises. The objective of the study is to

obtain a subset of features that provides a reasonable classification accuracy.

Choice of Probability of Initialization (PI ). A feature is included when a generated random number exceeds the probability of initialization (PI ). As PI increases, the number of selected features reduces. When PI = 0, all features are considered for the study. The classification accuracy of the corresponding best-fit chromosome reduces as PI increases since the representability reduces.
