7.4 Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms
working of the algorithm. Experiments and results are discussed in Sect. 7.4.4. The
work is summarized in Sect. 7.4.5.
7.4.1 An Overview of Genetic Algorithms
Genetic algorithms (GAs) are search and optimization methods based on the mechanisms of natural genetics and evolution; they draw their analogy from competition and survival of the fittest in nature. GAs have an advantage over conventional optimization methods in finding a global or near-global optimal solution while avoiding local optima. Over the years, their applications have spread rapidly to almost all engineering disciplines. Since their introduction, a number of developments and variants have matured into topics of their own, such as multiobjective genetic algorithms and interactive genetic algorithms. In the current section, we briefly discuss the basic concepts with a focus on the implementation of a simple genetic algorithm (SGA) and a few applications. A brief discussion of SGA can be found in Chap. 3. The discussion provided in the present section forms the background to the subsequent material.
SGA is characterized by the following.
• Population of chromosomes or binary strings of finite length.
• Fitness function and problem encoding mechanism.
• Selection of individual strings.
• Genetic operators, viz., cross-over and mutation.
• Termination and other control mechanisms.
It should be noted that each of these topics has been studied in depth in the research literature. Since the current section is intended to make the discussion self-contained with a focus on implementation aspects, interested readers are directed to the references listed at the end of the section. We also intentionally avoid discussion of other evolutionary algorithms.
Objective Function. SGA is intended to find an optimal set of parameters that optimizes a function: for example, find a set of parameters, x1 , x2 , . . . , xn , that maximizes a function f (x1 , x2 , . . . , xn ).
Chromosomes. A bit-string or chromosome consists of a finite number of bits, l, called the length of the chromosome. Bit-string encoding is the classical method adopted by researchers. The chromosomes are used to encode the parameters that represent a solution to the optimization problem. Alternative encoding mechanisms include Gray code, floating-point encoding, etc. SGA makes use of a population of chromosomes with a finite population size, C. Each bit of the bit-string is called an allele in genetic terms; both terms are used interchangeably in the literature.
Encoding Mechanism and Fitness Function. We find an optimal value of f (x1 , x2 ,
. . . , xn ) through the set of parameters x1 , x2 , . . . , xn . The value of f (·) is called
the fitness function. Given the values of x1 , x2 , . . . , xn , the fitness can be computed. Encoding the chromosome to represent the set of parameters forms a key step of a GA. Encoding depends on the nature of the optimization problem. The following are two examples of encoding mechanisms; note that the mechanisms are problem dependent, and one can find novel ways of encoding a given problem.
Example 1. Suppose that we need to select a subset of features out of the group of features that represent a pattern. The chromosome length is taken equal to the total number of features in the pattern, and each bit of the chromosome indicates whether the corresponding feature is selected. The fitness function in this case can be the classification accuracy obtained with the selected set of features.
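To make Example 1 concrete, the following Python sketch shows how a bit-string chromosome selects a feature subset. The function name `select_features` and the sample values are illustrative, not from the text.

```python
# Sketch of Example 1: a chromosome is a bit-string over the full feature set;
# bit i equal to 1 means feature i is selected.

def select_features(pattern, chromosome):
    """Return only those components of `pattern` whose chromosome bit is 1."""
    return [x for x, bit in zip(pattern, chromosome) if bit == 1]

# A 6-featured pattern and a chromosome selecting features 0, 2, and 5.
pattern = [7, 3, 9, 1, 4, 8]
chromosome = [1, 0, 1, 0, 0, 1]
reduced = select_features(pattern, chromosome)  # → [7, 9, 8]
```

The reduced pattern would then be fed to a classifier, whose accuracy on validation data serves as the chromosome's fitness.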
Example 2. Suppose that we need to find values of two real-valued parameters that minimize (maximize) a given function. The chromosome is divided into two parts representing the two parameters. The number of bits needed to cover the expected range of real values of each parameter determines the corresponding lengths, viz., l1 and l2 . The length of the chromosome is then l1 + l2 .
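Example 2 can be sketched as follows. Mapping a part's integer value linearly into the parameter's real range is one common decoding choice; the function name, ranges, and bit lengths here are illustrative.

```python
def decode(chromosome, l1, l2, lo1, hi1, lo2, hi2):
    """Split the chromosome into parts of lengths l1 and l2 and map each
    part's integer value linearly into the corresponding real range."""
    def to_int(bits):
        value = 0
        for b in bits:
            value = value * 2 + b
        return value
    part1 = chromosome[:l1]
    part2 = chromosome[l1:l1 + l2]
    x1 = lo1 + to_int(part1) * (hi1 - lo1) / (2 ** l1 - 1)
    x2 = lo2 + to_int(part2) * (hi2 - lo2) / (2 ** l2 - 1)
    return x1, x2

# An 8-bit chromosome: first 4 bits encode x1, last 4 bits encode x2.
x1, x2 = decode([1, 1, 1, 1, 0, 0, 0, 0], 4, 4, 0.0, 15.0, 0.0, 15.0)
# part1 = 1111 → 15 → x1 = 15.0; part2 = 0000 → 0 → x2 = 0.0
```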
Selection Mechanism. Selection refers to carrying individual chromosomes from the previous generation into the next generation of evolution while giving emphasis to highly fit individuals in the current generation. Many selection schemes are used in practice. In the Roulette wheel selection scheme, for example, each individual is assigned a sector of a roulette wheel such that the angle subtended by the sector is proportional to the individual's fitness; this ensures that, in expectation, more copies of highly fit individuals move on to the next generation.
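Roulette wheel selection can be sketched in Python as follows. This is an illustrative implementation assuming strictly positive fitness values; the interface is not from the text.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness
    (all fitnesses are assumed positive)."""
    total = sum(fitnesses)
    r = rng.uniform(0, total)          # a point on the wheel's circumference
    cum = 0.0
    for individual, f in zip(population, fitnesses):
        cum += f                       # sector boundaries accumulate fitness
        if r <= cum:
            return individual
    return population[-1]              # guard against floating-point round-off

# With fitnesses 1, 1, and 8, individual 'c' should dominate the draws.
rng = random.Random(7)
picks = [roulette_select(['a', 'b', 'c'], [1, 1, 8], rng) for _ in range(200)]
```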
Crossover. Pairs of individuals, s1 and s2 , are chosen at random from the population and subjected to crossover. Crossover takes place when a random number generated in the range [0, 1] falls below the prechosen probability of crossover, Pc . In the "single-point crossover" scheme, a position k within the chromosome is chosen at random from the numbers 1, 2, . . . , (l − 1) with equal probability. Crossover at k results in two new offspring: offspring 1 contains the alleles in positions 1 to k of s1 and (k + 1) to l of s2 , and offspring 2 contains the alleles in positions 1 to k of s2 and (k + 1) to l of s1 . The operation is depicted in Fig. 7.1. Other crossover schemes include two-point crossover, uniform crossover, etc.
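Single-point crossover can be sketched as follows; the function name and interface are illustrative.

```python
import random

def single_point_crossover(s1, s2, pc, rng=random):
    """With probability pc, cut both parents at a random point k in 1..l-1
    and swap tails; otherwise return copies of the parents unchanged."""
    if rng.random() >= pc:
        return list(s1), list(s2)
    k = rng.randint(1, len(s1) - 1)    # crossover point
    return s1[:k] + s2[k:], s2[:k] + s1[k:]

# With pc = 1.0 crossover always fires; with pc = 0.0 it never does.
c1, c2 = single_point_crossover([1, 1, 1, 1], [0, 0, 0, 0], pc=1.0)
c3, c4 = single_point_crossover([1, 0], [0, 1], pc=0.0)
```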
Mutation. Mutation of a bit consists of changing it from 0 to 1 or vice versa based on the probability of mutation, Pm . This provides better exploration of the solution space by restoring genetic material that could possibly be lost through the generations. The activity consists of generating a random number in the range [0, 1]; if the random number is less than Pm , mutation is carried out, and the bit position to mutate is chosen uniformly at random from 1, 2, . . . , l. A higher value of Pm causes more frequent disruption. The operation is depicted in Fig. 7.2.
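The mutation step described above (flip one randomly chosen bit with probability Pm) can be sketched as follows; a per-bit flipping variant is also common, but this sketch follows the text's single-position formulation.

```python
import random

def mutate(chromosome, pm, rng=random):
    """With probability pm, flip one randomly chosen bit of the chromosome;
    the original chromosome is left untouched."""
    child = list(chromosome)
    if rng.random() < pm:
        pos = rng.randrange(len(child))   # position chosen uniformly at random
        child[pos] = 1 - child[pos]
    return child

# pm = 1.0 forces exactly one flip; pm = 0.0 leaves the chromosome unchanged.
flipped = mutate([0, 0, 0, 0], pm=1.0)
same = mutate([1, 0, 1], pm=0.0)
```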
Termination. Many criteria exist for terminating the algorithm, for example, (a) when there is no significant improvement in the fitness value, or (b) when a limit on the number of iterations is reached.
Fig. 7.1 Crossover operation
Fig. 7.2 Mutation operation
Control Parameters. The choice of the population size C and of the values of Pc and Pm affects the solution and the speed of convergence. Although a large population size helps assure convergence, it increases computation time. The choice of these parameters is problem dependent; we demonstrate the effect of their variability in Sect. 7.4.4. Adaptive schemes for choosing the values of Pc and Pm show improvement in the final fitness value.
SGA. With the above background, we briefly discuss the working of a Simple Genetic Algorithm, given below. After encoding the parameters of the optimization problem, consider C chromosomes, each of length l. Initialize the population with a probability of initialization, PI . With PI = 0, all the alleles are set for each chromosome, and with PI = 1, none are. Thus, as the value of PI varies from 0 to 1, more alleles with value 0 are expected, resulting in fewer features being selected per chromosome. In Sect. 7.4.4, we demonstrate the effect of varying PI and provide a discussion. As the next step, we evaluate the objective function to obtain the fitness value of each chromosome.
Until convergence according to the chosen criteria is obtained, in each iteration we select the population for the next generation and perform crossover (with probability Pc ) and mutation (with probability Pm ) to obtain new offspring, and then compute the fitness of the new population.
Simple Genetic Algorithm
{
  Step 1: Initialize population containing 'C'
          strings of length 'l', each with
          probability of initialization, PI;
  Step 2: Compute fitness of each chromosome;
  while termination criterion not met
  {
    Step 3: Select population for the next
            generation;
    Step 4: Perform crossover based on Pc and
            mutation based on Pm;
    Step 5: Compute fitness of each updated
            chromosome;
  }
}
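The pseudocode above can be sketched as runnable Python. This is an illustrative implementation, not the book's code: OneMax (counting 1-bits) stands in for a real fitness function, and all parameter defaults are assumptions for the demonstration.

```python
import random

def simple_ga(fitness, l, C=20, pc=0.8, pm=0.02, generations=50, seed=0):
    """A compact SGA: random initialization, then repeated fitness-proportional
    selection, single-point crossover, and per-bit mutation. Returns the best
    chromosome seen. `fitness` maps a bit-list to a non-negative number."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(l)] for _ in range(C)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        fits = [fitness(ch) for ch in pop]
        total = sum(fits) or 1.0

        def pick():
            # Roulette wheel selection over the current population.
            r = rng.uniform(0, total)
            cum = 0.0
            for ch, f in zip(pop, fits):
                cum += f
                if r <= cum:
                    return ch
            return pop[-1]

        nxt = []
        while len(nxt) < C:
            p1, p2 = pick(), pick()
            if rng.random() < pc:                      # single-point crossover
                k = rng.randint(1, l - 1)
                p1, p2 = p1[:k] + p2[k:], p2[:k] + p1[k:]
            # Per-bit mutation on both offspring.
            nxt.extend([[1 - b if rng.random() < pm else b for b in ch]
                        for ch in (p1, p2)])
        pop = nxt[:C]
        best = max(pop + [best], key=fitness)
    return best

# OneMax as a stand-in fitness: the count of 1-bits, maximized at all-ones.
best = simple_ga(sum, l=16)
```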
7.4.1.1 Steady-State Genetic Algorithm (SSGA)
In the general framework of Genetic Algorithms, we choose the entire feature set of a pattern as a chromosome. Since the features are binary, they indicate the presence or absence of the corresponding feature in a pattern. The genetic operators of selection, cross-over, and mutation are used, with the corresponding probability of initialization (PI ), probability of cross-over (Pc ), and probability of mutation (Pm ). As in the case of SGA, the given dataset is divided into training, validation, and test data. Classification accuracy on the validation data using the nearest-neighbor classifier (NNC) forms the fitness function. Table 7.1 contains the terminology used in this section.
In the case of SSGA, we retain a chosen percentage of highly fit individuals from generation to generation, thereby preventing the loss of such individuals over the generations; the fraction of the population replaced in each generation is termed the generation gap. Because its best individuals are protected, SSGA permits larger Pm values as compared to SGA.
7.4.2 Proposed Schemes
We propose the algorithms shown in Algorithm 7.1 and Algorithm 7.2 for this study. Algorithm 7.1 integrates run-length compression of the data, classification of the compressed data, SSGA, and knowledge acquired through preliminary analysis, with a generation gap of 40 %. Algorithm 7.2 integrates the concept of frequent features in addition to GA-based optimal feature selection.
Algorithm 7.1 (Algorithm for Feature Selection using Compressed Data Classification and Genetic Algorithms)
Step 1: Consider a population of C chromosomes, with each chromosome consisting of l features. Initialize each chromosome by marking each feature as selected ('1') with a given probability, PI .
Step 2: For each chromosome in the population,
(a) consider the features selected in the chromosome;
(b) with the selected features, compress the training and validation data sets;
(c) compute the classification accuracy of the validation data directly in the compressed form; the classification accuracy forms the fitness function;
(d) record the number of alleles and the classification accuracy for each chromosome, and the generation-wise average fitness value.
Step 3: In computing the next generation of chromosomes, carry out the following steps:
(a) sort the chromosomes in descending order of their fitness;
(b) preserve the top 40 % of highly fit individuals for the next generation;
(c) obtain the remaining 60 % of the next population by subjecting randomly selected individuals from the current population to cross-over and mutation with respective probabilities Pc and Pm .
Step 4: Repeat Steps 2 and 3 till there is no significant change in the average fitness between successive generations.
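The generation update of Step 3 (a 40 % generation gap) can be sketched as follows; the function name and interface are illustrative, and the crossover/mutation operators are the single-point and bit-flip variants described earlier.

```python
import random

def next_generation(population, fitnesses, pc, pm, keep_frac=0.4, rng=random):
    """Keep the top `keep_frac` of individuals by fitness, and fill the rest
    with crossed-over, mutated copies of randomly selected individuals."""
    order = sorted(range(len(population)),
                   key=lambda i: fitnesses[i], reverse=True)
    n_keep = int(keep_frac * len(population))
    nxt = [list(population[i]) for i in order[:n_keep]]   # elite survivors
    while len(nxt) < len(population):
        p1, p2 = rng.choice(population), rng.choice(population)
        if rng.random() < pc:                             # single-point crossover
            k = rng.randint(1, len(p1) - 1)
            p1 = p1[:k] + p2[k:]
        # Per-bit mutation with probability pm.
        child = [1 - b if rng.random() < pm else b for b in p1]
        nxt.append(child)
    return nxt

# Five chromosomes; the two fittest survive unchanged, three are regenerated.
rng = random.Random(1)
population = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 0, 0],
              [1, 1, 0, 0], [0, 1, 1, 1]]
fitnesses = [0, 4, 1, 2, 3]
new_pop = next_generation(population, fitnesses, pc=0.9, pm=0.05, rng=rng)
```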
In the framework of optimal feature selection using genetic algorithms, each chromosome represents the entire candidate feature set. The population containing C chromosomes is initialized in Step 1. Since the features are binary valued, the initialization is carried out by setting each feature to "1" with a given probability of initialization, PI . Based on the binary value 1 or 0 of an allele, the corresponding feature is considered selected or not, respectively.
In Step 2, for each initialized chromosome, the original training and validation data are updated to contain only the selected features. The data is compressed using the run-length compression algorithm. The validation data is classified in its compressed form, and the average classification accuracy is recorded.
In Step 3, the subsequent population is generated. The best 40 % of the current population are preserved. The remaining 60 % are generated by subjecting the current population to the genetic operators of selection, single-point cross-over, and mutation with preselected probabilities. The termination criterion checks the percentage change of fitness between two successive generations.
Elaborate experimentation is carried out by changing the population initialization procedure, namely (a) a preselected population, (b) preselecting some features as unused, and (c) initialization using the probability of initialization, and by varying the values of the probabilities of initialization, cross-over, mutation, etc. The nature of these exercises and the results are discussed in the following section.
Genetic Algorithms (GAs) are well studied for feature selection and feature extraction; we restrict our study to feature selection. Given a feature set of size l, the problem of dimensionality reduction can be defined as arriving at a subset of the original feature set of dimension d < l such that the best classification accuracy is obtained. The single dominant computation block is the evaluation of the fitness function; if it can be sped up, the overall procedure speeds up correspondingly. To achieve this, we propose to compress the training and validation data and compute the classification accuracy directly on the compressed data, without having to uncompress it.
In the current section, before discussing the proposed procedure, we present compressed data classification and the Steady-State Genetic Algorithm for feature selection in the accompanying subsections.
7.4.2.1 Compressed Data Classification
We make use of the algorithm discussed in Chap. 3 to compress the input binary data and operate directly on the compressed data, without decompressing, for classification using runs. The compression–decompression is fully lossless. Operating directly on the compressed data is possible when classification is achieved with the help of the Manhattan distance function: the distance function on the compressed data results in the same classification accuracy as that obtained on the original data, as shown in Chap. 3. The compression algorithm is applied to large data, and it is observed to reduce processing requirements significantly.
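The idea of computing the Manhattan distance directly on run-length-encoded binary patterns can be sketched as follows. This is an illustrative reconstruction under the stated convention (runs alternate in value, the first run counting leading 1s), not the exact algorithm of Chap. 3.

```python
def manhattan_rle(r1, r2):
    """Manhattan (for binary patterns, Hamming) distance between two
    equal-length binary patterns given only their run-length encodings.
    Runs alternate in value; by convention the first run counts leading 1s
    (it is 0 when the pattern starts with 0)."""
    i = j = 0
    v1 = v2 = 1                  # value of the current run in each pattern
    rem1, rem2 = r1[0], r2[0]    # unconsumed length of the current run
    dist = 0
    while i < len(r1) and j < len(r2):
        step = min(rem1, rem2)
        if v1 != v2:
            dist += step         # values differ over `step` aligned positions
        rem1 -= step
        rem2 -= step
        if rem1 == 0:
            i += 1
            if i < len(r1):
                rem1, v1 = r1[i], 1 - v1
        if rem2 == 0:
            j += 1
            if j < len(r2):
                rem2, v2 = r2[j], 1 - v2
    return dist

# 1011 encodes as runs [1, 1, 2]; 1101 as [2, 1, 1]; they differ in 2 positions.
d = manhattan_rle([1, 1, 2], [2, 1, 1])
```

Because only run boundaries are visited, the cost is proportional to the number of runs rather than the pattern length, which is where the speed-up on compressible data comes from.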
7.4.2.2 Frequent Features
Although Genetic Algorithms provide an optimal feature subset, it is interesting to explore whether the input set of features can be reduced by simpler means. The frequent-pattern approach, as discussed in Sects. 2.4.1 and 4.5.2, provides frequently occurring features, which could possibly help discrimination too. A binary-valued pattern can be considered a transaction, with each bit representing the presence or absence of the corresponding feature. The support of an item is the percentage of transactions in the given database that contain the item. We make use of the concept of support to identify the set of features that are frequent above a chosen threshold. This reduces the number of features that need to be explored for an optimal set. In Sect. 7.4.3, as part of the preliminary analysis of the considered data, we demonstrate this aspect. Figure 7.3 illustrates the concept of support. The terms support and percentage-support are used interchangeably in the present chapter.
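Computing per-feature support and keeping only the frequent features can be sketched as follows; the function name and sample data are illustrative.

```python
def frequent_features(patterns, min_support):
    """Treat each binary pattern as a transaction; a feature's support is the
    percentage of patterns in which it is 1. Return the indices of features
    whose support meets `min_support` (in percent)."""
    n = len(patterns)
    counts = [sum(p[i] for p in patterns) for i in range(len(patterns[0]))]
    return [i for i, c in enumerate(counts) if 100.0 * c / n >= min_support]

# Four 4-featured transactions; feature 2 never occurs and falls below 50 %.
patterns = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
]
kept = frequent_features(patterns, min_support=50)  # features 0, 1, and 3
```

Only the kept features would then be explored by the GA, shrinking the search space before optimization begins.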
Algorithms 7.1 and 7.2 are studied in detail in the following sections.
Algorithm 7.2 (Optimal Feature Selection using Genetic Algorithms combined
with frequent features)
Step 1: Identify frequent features based on a chosen support threshold.
Fig. 7.3 The figure depicts the concepts of transaction, items, and support
Step 2: Consider only those frequent features for further exploration.
Step 3: All steps of Algorithm 7.1.
We briefly elaborate each of the steps along with results of preliminary analysis.
7.4.3 Preliminary Analysis
Preliminary analysis of the data brings out insights into the data and builds domain knowledge. The analysis primarily consists of computing measures of central tendency and dispersion, the feature occupancy of patterns, class-wise variability, and inter-class similarities. The results of the analysis help in choosing appropriate parameters and in forming the experimental setup.
We consider 10-class handwritten digit data consisting of 10,000 192-featured
patterns. Each digit is formed as a 16 × 12 matrix with binary-valued features. The
data is divided into three mutually exclusive sets for training, validation, and testing.
In order to find an optimal feature subset, it is useful to understand basic statistics on the number of nonzero features in the training data. Although care is taken while forming the handwritten dataset in terms of centering and depicting all variations of the collected digits through the pattern matrix, it is possible that some regions within the 16 × 12 matrix are not fully utilized, depending on the class label. Figure 7.4 contains these details. The topmost plot depicts the class-wise average number of nonzero features. It can be seen that the digits 0, 2, and 8 contain about 68 nonzero features each, with digit 1 requiring the fewest nonzero features, about 30, for its representation. The middle plot indicates the standard deviation of the number of nonzero features, showing a comparatively larger dispersion for the digits 0, 2, 3, 5, and 8. The third plot provides an interesting aspect of the occupancy of features within a digit. Considering the digit 0, although on average 68 nonzero features suffice to represent the digit, the nonzero features occupy about 175 of the 192 feature locations by one training pattern or another. Similar observations can be made for the other digits.

Fig. 7.4 Statistics of features in the training dataset

Fig. 7.5 The figure contains nine 3-featured patterns occupying different feature locations in a 3 × 3 pattern representation. It can be observed that all locations are occupied cumulatively at the end of the 9 sample patterns
When the objective is to find an optimal subset of features, this provides a glimpse of the complexity involved. Figure 7.5 summarizes the argument that although the average number of features per pattern is small, every feature location can be occupied at least once. Consider a pattern such as the handwritten digit "1" in a 3 × 3 pattern representation, for which the average number of features needed is 3; after passing through the 9 sample patterns, all feature locations have been occupied.
Fig. 7.6 The figure contains patterns with frequent features excluded at minimum support thresholds of 13, 21, 52, and 70. The excluded feature regions are depicted as the gray and black portions; the remaining features form the feature set retained for exploration
7.4.3.1 Redundancy of Features Vis-a-Vis Support
We make use of the concept of support, as discussed in Fig. 7.3 and Sect. 7.4.2.2, to identify the features that occur above a prechosen support threshold. We compute the empirical probability of occurrence for each feature and vary the support threshold to find the set of frequent features. We will later examine experimentally whether the excluded features have an impact on feature selection. Figure 7.6 contains an image of a 192-featured pattern with the excluded features corresponding to various support thresholds. The features excluded at low minimum support occur, in this case, on the edges of the pattern. As the support threshold is increased, the representability of the pattern is affected.
7.4.3.2 Data Compression and Statistics
The considered patterns consist of binary-valued features. The data is compressed using the run-length compression scheme discussed in Chap. 3. The scheme consists of the following steps.
• Consider each pattern.
• Form runs of consecutive occurrences of the same feature value. For ease of dissimilarity computation, consider each pattern as starting with a feature value of 1, so that the first run corresponds to the number of leading 1s. In case the first feature of the pattern is 0, the corresponding run length is 0.
The compression results in an unequal number of runs for the various patterns, as shown in Fig. 7.7. The dissimilarity computation in the compressed domain is based on the work in Chap. 3.
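The steps above can be sketched as a small encoder; the function name is illustrative.

```python
def run_length_encode(pattern):
    """Encode a binary pattern as alternating run lengths. By convention the
    first run counts leading 1s, so it is 0 when the pattern starts with 0."""
    runs = []
    current, count = 1, 0          # assume the pattern starts with a run of 1s
    for bit in pattern:
        if bit == current:
            count += 1
        else:
            runs.append(count)     # close the current run
            current, count = 1 - current, 1
    runs.append(count)             # close the final run
    return runs

runs = run_length_encode([0, 0, 1, 1, 1, 0])  # → [0, 2, 3, 1]
```

Note that the run lengths always sum to the original pattern length, so the encoding is lossless.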
7.4.4 Experimental Results
Experimentation is planned to explore each of the parameters in Table 7.1 in order to arrive at a minimal set of features that provides the best classification accuracy. We initially study the choice of the probabilities of initialization, cross-over, and mutation based on a few generations of execution of the genetic algorithm. After choosing
Fig. 7.7 Statistics of runs in compressed patterns. For each class label, the vertical bar indicates the range of the number of runs in the patterns. For example, for class label "0", the compressed image length ranges from 34 to 67; the discontinuities indicate that there are no patterns with compressed lengths of 36 to 39. The figure provides the range of compressed pattern lengths corresponding to the original pattern length of 192 for all the patterns
Table 7.1 Terminology

  Term   Description
  C      Population size
  t      No. of generations
  l      Length of chromosome
  PI     Probability of initialization
  Pc     Probability of cross-over
  Pm     Probability of mutation
  ε      Support threshold
appropriate values for these three probabilities, we proceed with feature selection. We also bring out a comparison of computation time with and without compression. All the exercises are carried out with run-length-encoded lossless compression, and classification is performed directly in the compressed domain.
7.4.4.1 Choice of Probabilities
In order to choose appropriate values for the probabilities of cross-over, mutation, and initialization, exercises are carried out using the proposed algorithm for 10–15 generations. For these exercises, we consider the complete set containing 192 features.

Fig. 7.8 Result of genetic algorithms after 10 generations on the sensitivity of the probabilities of initialization, cross-over, and mutation. The two plots in each case indicate the number of features for the best chromosome across 10 generations and the corresponding fitness value
Figure 7.8 contains the results of these exercises. The objective of the study is to
obtain a subset of features that provides a reasonable classification accuracy.
Choice of Probability of Initialization (PI ). A feature is included when the random number drawn for it exceeds the probability of initialization (PI ). As PI increases, the number of selected features reduces; when PI = 0, all features are considered for the study. The classification accuracy of the corresponding best-fit chromosome reduces as PI increases, since the representability reduces in