1.5 Heuristics, Metaheuristics, and Hyper-Heuristics
problem. By searching over a large set of feasible solutions, metaheuristics can often
find good solutions with less computational effort than calculus-based methods, or
simple heuristics, can.
Metaheuristics can be single-solution-based or population-based. Single-solution
based metaheuristics are based on a single solution at any time and comprise
local search-based metaheuristics such as SA, Tabu search, iterated local search
[40,42], guided local search [61], pattern search or random search [31], Solis–Wets
algorithm [54], and variable neighborhood search [45]. In population-based metaheuristics, a number of solutions are updated iteratively until the termination condition is satisfied. Population-based metaheuristics are generally categoried into EAs
and swarm-based algorithms. Single-solution-based metaheuristics are regarded to
be more exploitation-oriented, whereas population-based metaheuristics are more
exploration-oriented.
The idea of hyper-heuristics can be traced back to the early 1960s [23]. Hyper-heuristics can be thought of as heuristics to choose heuristics or as search algorithms
that explore the space of problem solvers. A hyper-heuristic is a heuristic search
method that seeks to automate the process of selecting, combining, generating, or
adapting several simpler heuristics to efficiently solve hard search problems. The low-level heuristics are simple local search operators or domain-dependent heuristics,
which operate directly on the solution space for a given problem instance. Unlike
metaheuristics that search in a space of problem solutions, hyper-heuristics always
search in a space of low-level heuristics.
Heuristic selection and heuristic generation are currently the two main methodologies in hyper-heuristics. In the first method, the hyper-heuristic chooses heuristics
from a set of known domain-dependent low-level heuristics. In the second method,
the hyper-heuristic evolves new low-level heuristics by utilizing the components
of the existing ones. Hyper-heuristics can be based on genetic programming [11] or grammatical evolution [10], which makes them excellent candidates for heuristic generation.
Several Single-Solution-Based Metaheuristics
Search strategies that randomly generate initial solutions and perform a local search
are also called multi-start descent search methods. However, randomly creating an initial solution and performing a local search often results in low solution quality, as the complete search space is searched uniformly and the search cannot focus on promising areas of the search space.
Variable neighborhood search [45] combines local search strategies with dynamic
neighborhood structures that adapt to the search progress. The local search is an intensification step, focusing the search in the direction of high-quality solutions. Diversification results from changing neighborhoods, which allows the method to escape easily from local optima. With an increasing cardinality of the
neighborhoods, diversification gets stronger as the shaking steps can choose from a
larger set of solutions and local search covers a larger area of the search space.
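The shake-then-descend loop described above can be sketched in a few lines. This is a minimal illustration, not the exact scheme of [45]; the coordinate-descent local search and the neighborhood (shaking) functions are assumptions made for the example.

```python
import random

def local_search(f, x, step=0.1, iters=50):
    """Illustrative local search: coordinate-wise descent with a fixed step."""
    fx = f(x)
    for _ in range(iters):
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            break
    return x, fx

def vns(f, x0, neighborhoods, max_iters=100, seed=0):
    """Variable neighborhood search (minimization). `neighborhoods` is a
    list of shaking functions of increasing cardinality; each takes the
    current solution and a random generator and returns a random neighbor."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    for _ in range(max_iters):
        k = 0
        while k < len(neighborhoods):
            x1 = neighborhoods[k](x, rng)   # shaking (diversification)
            x2, f2 = local_search(f, x1)    # descent (intensification)
            if f2 < fx:
                x, fx, k = x2, f2, 0        # improved: restart with smallest neighborhood
            else:
                k += 1                      # no improvement: try a larger neighborhood
    return x, fx
```

On a simple sphere function, shaking radii of increasing size play the role of the growing neighborhoods.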
Guided local search [61] uses a similar principle and dynamically changes the
fitness landscape subject to the progress that is made during the search so that local
search can escape from local optima. The neighborhood structure remains constant.
It starts from a random solution x0 and performs a local search returning the local
optimum x1 . To escape the local optimum, a penalty is added to the fitness function
f such that the resulting fitness function h allows local search to escape. A new local
search is started from x1 using the modified fitness function h. Search continues until
a termination criterion is met.
Iterated local search [40,42] connects otherwise unrelated local search phases, as it creates initial solutions not randomly but from solutions found in previous local search
runs. If the perturbation steps are too small, the search cannot escape from a local
optimum. If perturbation is too strong, the search has the same behavior as multi-start
descent search methods. The modification step as well as the acceptance criterion
can depend on the search history.
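A minimal sketch of this perturb-and-restart scheme follows; the hill-climbing routine and the quartic test function used below are illustrative assumptions, not taken from [40,42].

```python
import random

def iterated_local_search(f, x0, perturb, local_search, max_iters=100, seed=0):
    """Iterated local search (minimization): perturb the current local
    optimum, restart local search from the perturbed point, and keep the
    better of the two local optima (a greedy acceptance criterion)."""
    rng = random.Random(seed)
    x = local_search(f, x0)
    fx = f(x)
    for _ in range(max_iters):
        y = local_search(f, perturb(x, rng))  # new local search run
        fy = f(y)
        if fy < fx:                           # accept only improvements
            x, fx = y, fy
    return x, fx

def hill_climb(f, x, step=0.05, iters=100):
    """Illustrative local search phase: small coordinate-wise descent moves."""
    fx = f(x)
    for _ in range(iters):
        moved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = list(x)
                y[i] += d
                if f(y) < fx:
                    x, fx, moved = y, f(y), True
        if not moved:
            break
    return x
```

The perturbation strength controls the trade-off discussed above: too small and the search stays trapped, too large and it degenerates into multi-start descent.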
1.6 Optimization
Optimization can generally be categorized into discrete or continuous optimization,
depending on whether the variables are discrete or continuous ones. There may be
limits or constraints on the variables. Optimization can be a static or a dynamic
problem depending upon whether the output is a function of time. Traditionally,
optimization is solved by calculus-based methods, random search, or enumerative search. Heuristics-based optimization is the topic treated in this book.
Optimization techniques can generally be divided into derivative methods and
nonderivative methods, depending on whether or not derivatives of the objective
function are required for the calculation of the optimum. Derivative methods are
calculus-based methods, which can be either gradient search methods or second-order methods. These methods are local optimizers. Gradient descent is also
known as steepest descent. It searches for a local minimum by taking steps along
the negative direction of the gradient of the function. Examples of second-order
methods are Newton’s method, the Gauss-Newton method, quasi-Newton methods,
the trust-region method, and the Levenberg-Marquardt method. Conjugate gradient and natural gradient methods can also be viewed as reduced forms of the quasi-Newton method.
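The steepest-descent rule mentioned above can be sketched in a few lines; the quadratic test function and learning rate below are assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, lr=0.1, iters=100):
    """Steepest descent: repeatedly step along the negative gradient."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Minimize f(x) = x1^2 + 2*x2^2, whose gradient is (2*x1, 4*x2);
# the unique (local and global) minimum is at the origin.
x_min = gradient_descent(lambda x: [2 * x[0], 4 * x[1]], [4.0, -3.0])
```

As a local optimizer, the method converges here only because the test function is convex; on a multimodal function it would stop at whichever local minimum the starting point leads to.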
Derivative methods can also be classified into model-based and metric-based
methods. Model-based methods improve the current point by a local approximating
model. Newton and quasi-Newton methods are model-based methods. Metric-based
methods perform a transformation of the variables and then apply a gradient search
method to improve the point. The steepest-descent, quasi-Newton, and conjugate
gradient methods belong to this latter category.
Methods that do not require gradient information to perform a search and sequentially explore the solution space are called direct search methods. They maintain
a group of points. They utilize some sort of deterministic exploration methods to
search the space and almost always utilize a greedy method to update the maintained
points. Simplex search and pattern search are two examples of effective direct search methods.

Figure 1.3 The landscape of the Rosenbrock function f(x) with two variables x1, x2 ∈ [−204.8, 204.8]. The grid spacing is 1. There are many local minima, and the global minimum 0 is at (1, 1).
Typical nonderivative methods for multivariable functions are random-restart
hill-climbing, random search, many heuristic and metaheuristic methods, and their
hybrids. Hill-climbing attempts to optimize a discrete or continuous function for
a local optimum. When operating on continuous space, it is called gradient ascent.
Other nonderivative search methods include univariant search parallel to an axis (i.e.,
coordinate search method), sequential simplex method, and acceleration methods in
direct search such as the Hooke-Jeeves method, Powell's method, and Rosenbrock's method. Interior-point methods represent state-of-the-art techniques for solving linear, quadratic, and nonlinear optimization programs.
Example 1.1: The Rosenbrock function
f(x) = Σ_{i=1}^{n−1} [ 100 (x_{i+1} − x_i²)² + (1 − x_i)² ]
has the global minimum f (x) = 0 at xi = 1, i = 1, . . . , n. Our simulation is limited
to the two-dimensional case (n = 2), with x1 , x2 ∈ [−204.8, 204.8]. The landscape
of this function is shown in Figure 1.3.
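The function in Example 1.1 is straightforward to code and to check at the known optimum:

```python
def rosenbrock(x):
    """Rosenbrock function: sum over i of 100*(x[i+1] - x[i]^2)^2 + (1 - x[i])^2."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

print(rosenbrock([1.0, 1.0]))  # 0.0, the global minimum at (1, 1)
```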
1.6.1 Lagrange Multiplier Method
The Lagrange multiplier method can be used to analytically solve continuous function optimization problem subject to equality constraints [24]. By introducing the
Lagrangian formulation, the dual problem associated with the primal problem is
obtained, based on which the optimal values of the Lagrange multipliers can be
found.
Let f (x) be the objective function and hi (x) = 0, i = 1, . . . , m, be the constraints.
The Lagrange function can be constructed as
L(x; λ1, . . . , λm) = f(x) + Σ_{i=1}^{m} λi hi(x),    (1.1)
where λi , i = 1, . . . , m, are called the Lagrange multipliers.
The constrained optimization problem is converted into an unconstrained optimization problem: Optimize L (x; λ1 , . . . , λm ). By setting
∂L(x; λ1, . . . , λm)/∂x = 0,    (1.2)

∂L(x; λ1, . . . , λm)/∂λi = 0, i = 1, . . . , m,    (1.3)
and solving the resulting set of equations, we can obtain the x position at the extremum
of f (x) under the constraints.
To deal with inequality constraints, the Karush-Kuhn-Tucker (KKT) theorem, as a generalization of the Lagrange multiplier method, introduces a slack variable into each
inequality constraint before applying the Lagrange multiplier method. The conditions
derived from the procedure are known as the KKT conditions [24].
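As a small worked instance (the specific objective and constraint are chosen for illustration): minimize f(x, y) = x² + y² subject to h(x, y) = x + y − 1 = 0. Conditions (1.2)-(1.3) give 2x + λ = 0, 2y + λ = 0, and x + y − 1 = 0, hence x = y = 1/2 and λ = −1, which can be verified numerically:

```python
def lagrangian_residuals(x, y, lam):
    """Left-hand sides of the stationarity conditions (1.2)-(1.3) for
    L(x, y; lam) = x^2 + y^2 + lam*(x + y - 1); all zero at the optimum."""
    return (2 * x + lam, 2 * y + lam, x + y - 1)

print(lagrangian_residuals(0.5, 0.5, -1.0))  # (0.0, 0.0, 0.0)
```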
1.6.2 Direction-Based Search and Simplex Search
In direct search, generally the gradient information cannot be obtained; thus, it is
impractical to implement a step in the negative gradient direction for a minimum
problem. However, when the objectives of a group of solutions are available, the
best one can guide the search direction of the other solutions. Many direction-based
search methods and EAs are inspired by this intuitive idea.
Some of the direct search methods use improvement direction information to
search the objective space. Thus, it is useful to embed these directions into an EA as
either a local search method or an exploration operator.
Simplex search [47], introduced by Nelder and Mead in 1965, is a well-known deterministic direction-based search method. MATLAB contains a direct search toolbox
based on simplex search. Scatter search [26] includes the elitism mechanism into
simplex search. Like simplex search, for a group of points, the algorithm finds new
points, accepts the better ones, and discards the worse ones. Differential evolution
(DE) [56] uses the directional information from the current population. The mutation
operator of DE needs three randomly selected different individuals from the current
population for each individual to form a simplex-like triangle.
Simplex Search
Simplex search is a group-based deterministic local search method capable of exploring the objective space very fast. Thus many EAs use simplex search as a local search
method after mutation.
A simplex is a collection of n + 1 points in n-dimensional space. In an optimization
problem involving n variables, the simplex method searches for an optimal solution by evaluating a set of n + 1 points. The method continuously forms new simplices
by replacing the point having the worst performance in a simplex with a new point.
The new point is generated by reflection, expansion, and contraction operations.
In a multidimensional space, the subtraction of two vectors means a new vector
starting at one vector and ending at the other, like x2 − x1 . We often refer to the
subtraction of two vectors as a direction. Addition of two vectors can be implemented
in a triangular way, moving the start of one vector to the end of the other to form
another vector. The expression x3 + (x2 − x1 ) can be regarded as the destination of
a moving point that starts at x3 and has a length and direction of x2 − x1 .
For every new simplex, the points are ranked according to their objective values. Then simplex search repeats reflection, expansion, contraction, and shrinking in a very efficient and deterministic way. Vertices of the simplex move toward the optimal point, and the simplex becomes smaller and smaller. Stopping criteria can be a predetermined maximum number of iterations, the edge length of the simplex, or the improvement rate of the best point.
Simplex search for minimization is shown in Algorithm 1.1. The coefficients for
the reflection, expansion, contraction, and shrinking operations are typically selected
as α = 1, β = 2, γ = −1/2, and δ = 1/2. The initial simplex is important. The
search may easily get stuck for too small an initial simplex. This simplex should be
selected depending on the nature of the problem.
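A compact Python sketch of Algorithm 1.1 with these coefficients follows. This is a minimal reading of the algorithm, not a production Nelder-Mead implementation; tie-breaking details are simplified.

```python
def simplex_search(f, simplex, alpha=1.0, beta=2.0, gamma=-0.5, delta=0.5,
                   iters=200):
    """Simplex search for minimization, following Algorithm 1.1."""
    pts = [list(p) for p in simplex]   # n + 1 points in n dimensions
    n = len(pts[0])
    for _ in range(iters):
        pts.sort(key=f)                # pts[0] = xl (best), pts[-1] = xh (worst)
        xl, xh = pts[0], pts[-1]
        # Centroid of all points except the worst one.
        cen = [sum(p[i] for p in pts[:-1]) / n for i in range(n)]
        xr = [cen[i] + alpha * (cen[i] - xh[i]) for i in range(n)]   # reflection
        if f(xl) <= f(xr) < f(xh):
            pts[-1] = xr
        elif f(xr) < f(xl):                                          # expansion
            xe = [cen[i] + beta * (cen[i] - xh[i]) for i in range(n)]
            pts[-1] = xe if f(xe) < f(xl) else xr
        else:                                                        # contraction
            xc = [cen[i] + gamma * (cen[i] - xh[i]) for i in range(n)]
            if f(xc) < f(xh):
                pts[-1] = xc
            else:                                                    # shrinking
                pts = [xl] + [[xl[i] + delta * (p[i] - xl[i]) for i in range(n)]
                              for p in pts[1:]]
    pts.sort(key=f)
    return pts[0], f(pts[0])
```

With γ = −1/2, the contraction point lies midway between the centroid and the worst vertex, inside the simplex.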
1.6.3 Discrete Optimization Problems
The discrete optimization problem is also known as combinatorial optimization problem (COP). Any problem that has a large set of discrete solutions and a cost function
for rating those solutions relative to one another is a COP. COPs are known to be
NP-complete.1 The goal for COPs is to find an optimal solution or sometimes a
nearly optimal solution. In COPs, the number of solutions grows exponentially with
the size of the problem n at O(n!) or O (en ) such that no algorithm can find the global
minimum solution in a polynomial computational time.
1 Namely, nondeterministic polynomial-time complete.

Definition 1.1 (Discrete optimization problem). A discrete optimization problem is denoted as (X, f, Ω), or as minimizing the objective function

min f(x), x ∈ X, subject to Ω,    (1.4)

where X ⊂ R^N is the search space defined over a finite set of N discrete decision variables x = (x1, x2, . . . , xN)^T, f : X → R is the objective function, and Ω is the set of constraints on x. Space X is constructed according to all the constraints imposed on the problem.

Algorithm 1.1 (Simplex Search).
1. Initialize parameters.
   Randomize the set of individuals xi.
2. Repeat:
   a. Find the worst and best individuals as xh and xl.
      Calculate the centroid x̄ of all xi, i ≠ h.
   b. Enter reflection mode:
      xr = x̄ + α(x̄ − xh);
   c. if f(xl) < f(xr) < f(xh), xh ← xr;
      else if f(xr) < f(xl), enter expansion mode:
         xe = x̄ + β(x̄ − xh);
         if f(xe) < f(xl), xh ← xe;
         else xh ← xr;
      end
      else if f(xr) > f(xi), ∀i ≠ h, enter contraction mode:
         xc = x̄ + γ(x̄ − xh);
         if f(xc) < f(xh), xh ← xc;
         else enter shrinking mode:
            xi = xl + δ(xi − xl), ∀i ≠ l;
         end
      end
   until termination condition is satisfied.
Definition 1.2 (Feasible solution). A vector x that satisfies the set of constraints for
an optimization problem is called a feasible solution.
The traveling salesman problem (TSP) is perhaps the most famous COP. Given a set
of points, either nodes on a graph or cities on a map, find the shortest possible tour
that visits every point exactly once and then returns to its starting point. There are
(n − 1)!/2 possible tours for an n-city TSP. TSP arises in numerous applications,
from routing of wires on a printed circuit board (PCB), VLSI circuit design, to fast
food delivery.
The multiple traveling salesmen problem (MTSP) generalizes TSP to more than
one salesman. Given a set of cities and a depot, m salesmen must visit all cities
according to the constraints that the route formed by each salesman must start and
end at the depot, that each intermediate city must be visited once and by a single
salesman, and that the cost of the routes must be minimum. TSP with a time window
is a variant of TSP in which each city is visited within a given time window.
The vehicle routing problem concerns the transport of items between depots and
customers by means of a fleet of vehicles. It can be used for logistics and public
services, such as milk delivery, mail or parcel pick-up and delivery, school bus
routing, solid waste collection, dial-a-ride systems, and job scheduling. Two well-known routing problems are TSP and MTSP.
The location-allocation problem is defined as follows. Given a set of facilities,
each of which serves a certain number of nodes on a graph, the objective is to place
the facilities on the graph so that the average distance between each node and its
serving facility is minimized.
1.6.4 P, NP, NP-Hard, and NP-Complete
An issue related to the efficiency and efficacy of an algorithm is how hard the problem
itself is. The optimization problem is first transformed into a decision problem.
Problems that can be solved using a polynomial-time algorithm are tractable. A
polynomial-time algorithm has an upper bound O(nk ) on its running time, where k is
a constant and n is the problem size (input size). Usually, tractable problems are easy
to solve as running time increases relatively slowly with n. In contrast, problems are
intractable if they cannot be solved by a polynomial-time algorithm and there is a
lower bound Ω(k^n) on the running time, where k > 1 is a constant and n is the input size.
The complexity class P (standing for polynomial time complexity) is defined as
the set of decision problems that can be solved by a deterministic Turing machine
using an algorithm with worst-case polynomial time complexity. P problems are
usually easy as there are algorithms that solve them in polynomial time.
The class NP (standing for nondeterministic polynomial time complexity) is the
set of all decision problems that can be verified by a nondeterministic Turing machine
using a nondeterministic algorithm in worst-case polynomial time. Although nondeterministic algorithms cannot be executed directly on conventional computers, this
concept is important and helpful for the analysis of the computational complexity
of problems. All problems in P also belong to the class NP, i.e., P ⊆ NP. There are
also problems where correct solutions cannot be verified in polynomial time.
All decision problems in P are tractable. Those problems that are in NP, but not in
P, are difficult as no polynomial-time algorithms exist for them. There are problems
in NP where no polynomial algorithm is available and which can be transformed into
one another with polynomial effort. A problem is said to be NP-hard if every problem in NP is polynomial-time reducible to it; equivalently, an algorithm that solves an NP-hard problem can be used to solve any problem in NP with only polynomial overhead. Therefore, NP-hard problems are at least as hard as any other problem in NP, and they are not necessarily in NP.
The set of NP-complete problems is a subset of NP [14]. A decision problem A is
said to be NP-complete, if A is in NP and A is also NP-hard. NP-complete problems
are the hardest problems in NP. They all have the same complexity. They are difficult
as no polynomial-time algorithms are known. Decision problems that are not in NP
are even more difficult. The relationship between all these classes is illustrated in
Figure 1.4.
Figure 1.4 The relationship between P, NP, NP-complete, and NP-hard classes.
Most practical COPs are NP-complete or NP-hard. At present, no algorithm with polynomial time complexity can guarantee that an optimal solution will be found.
1.6.5 Multiobjective Optimization Problem
A multiobjective optimization problem (MOP) requires finding a variable vector x
in the domain X that optimizes the objective vector f (x).
Definition 1.3 (Multiobjective optimization problem). An MOP is to optimize a system with k conflicting objectives
min f (x) = (f1 (x), f2 (x), . . . , fk (x))T , x ∈ X
(1.5)
gi (x) ≤ 0, i = 1, 2, . . . , m,
(1.6)
hi (x) = 0, i = 1, 2, . . . , p,
(1.7)
subject to
where x = (x1 , x2 , . . . , xn )T ∈ Rn , the objective functions fi : Rn → R, i = 1, . . . , k,
and gi , hj : Rn → R, i = 1, . . . , m, j = 1, . . . , p are the constraint functions of the
problem.
Objectives conflict when improving the quality of one objective tends to simultaneously decrease the quality of another. The solution to
an MOP is not a single optimal solution, but a set of solutions representing the best
trade-offs among the objectives.
In order to optimize a system with conflicting objectives, the weighted sum of these objectives is usually used as the compromise of the system

F(x) = Σ_{i=1}^{k} wi f̄i(x),    (1.8)

where f̄i(x) = fi(x)/|max(fi(x))| are the normalized objectives, and Σ_{i=1}^{k} wi = 1.
For many problems, there are difficulties in normalizing the individual objectives,
and also in selecting the weights. The lexicographic order optimization is based on
the ranking of the objectives in terms of their importance.
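The weighted-sum scalarization (1.8) can be written directly; the two objectives and the normalization constants in the usage example are assumptions made for illustration.

```python
def weighted_sum(objectives, weights, f_max):
    """Build the scalarized objective F(x) of (1.8): each objective is
    normalized by the magnitude of its maximum, with weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-12
    def F(x):
        return sum(w * f(x) / abs(m)
                   for f, w, m in zip(objectives, weights, f_max))
    return F

# Two conflicting objectives on [0, 2]: f1(x) = x^2 and f2(x) = (x - 2)^2,
# both with maximum magnitude 4 on this interval.
F = weighted_sum([lambda x: x ** 2, lambda x: (x - 2) ** 2],
                 [0.5, 0.5], [4.0, 4.0])
```

With equal weights, the compromise F is minimized at x = 1, the midpoint between the two individual optima.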
The Pareto method is a popular method for multiobjective optimization. It is based
on the principle of nondominance. The Pareto optimum gives a set of solutions for
which there is no way of improving one criterion without deteriorating another
criterion. In MOPs, the concept of dominance provides a means by which multiple
solutions can be compared and subsequently ranked.
Definition 1.4 (Pareto dominance). A variable vector x1 ∈ R^n is said to dominate another vector x2 ∈ R^n, denoted x1 ≺ x2, if and only if x1 is better than or equal to x2 in all attributes, and strictly better in at least one attribute, i.e., for minimization, ∀i: fi(x1) ≤ fi(x2) ∧ ∃j: fj(x1) < fj(x2).
For two solutions x1 , x2 , if x1 is better in all objectives than x2 , x1 is said to
strongly dominate x2 . If x1 is not worse than x2 in all objectives and better in at least
one objective, x1 is said to dominate x2 . A nondominated set is a set of solutions that
are not weakly dominated by any other solution in the set.
Definition 1.5 (Nondominance). A variable vector x1 ∈ X ⊂ Rn is nondominated
with respect to X , if there does not exist another vector x2 ∈ X such that x2 ≺ x1 .
Definition 1.6 (Pareto optimality). A variable vector x∗ ∈ F ⊂ Rn (F is the feasible region) is Pareto optimal if it is nondominated with respect to F .
Definition 1.7 (Pareto optimal frontier). The Pareto optimal frontier P* is defined as the set in R^n formed by all Pareto optimal solutions: P* = {x ∈ F | x is Pareto optimal}.
The Pareto optimal frontier is a set of optimal nondominated solutions, which
may be infinite.
Definition 1.8 (Pareto front). The Pareto front PF ∗ is defined by
PF ∗ = {f (x) ∈ Rk |x ∈ P ∗ }.
(1.9)
The Pareto front is the image set of the Pareto optimal frontier mapping into the
objective space.
Obtaining the Pareto front of an MOP is the main goal of multiobjective optimization. A good approximation contains a limited number of points, which should be as close as possible to the exact Pareto front and uniformly spread, so that no regions are left unexplored.
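Dominance checking and nondominated filtering per Definitions 1.4-1.5 (written here for minimization) take only a few lines:

```python
def dominates(f1, f2):
    """True if objective vector f1 Pareto-dominates f2 (minimization):
    no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))

def nondominated(points):
    """Return the objective vectors not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```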
An illustration of Pareto optimal solutions for a two-dimensional problem with
two objectives is given in Figure 1.5. The upper border from points A to B of the
domain X , denoted P ∗ , contains all Pareto optimal solutions. The frontier from points
f A to f B along the lower border of the domain Y , denoted PF ∗ , contains all Pareto
frontier in the objective space. For two points a and b, their mapping f a dominates f b ,
Figure 1.5 An illustration of Pareto optimal solutions for a two-dimensional problem with two objectives. X ⊂ R^n is the domain of x, and Y ⊂ R^m is the domain of f(x).

Figure 1.6 Different Pareto fronts. a Convex. b Concave. c Discontinuous.
denoted f a ≺ f b . Hence, the decision vector xa is a nondominated solution. Figure 1.6
illustrates that Pareto fronts can be convex, concave, or discontinuous.
Definition 1.9 (ε-dominance). A variable vector x1 ∈ R^n is said to ε-dominate another vector x2 ∈ R^n, denoted x1 ≺ε x2, if and only if x1 is better than or equal to εx2 in all attributes, and strictly better in at least one attribute, i.e., for minimization, ∀i: fi(x1) ≤ fi(εx2) ∧ ∃j: fj(x1) < fj(εx2) [69].
If ε = 1, ε-dominance is the same as Pareto dominance; otherwise, the area dominated by x1 is enlarged or shrunk. Thus, ε-dominance relaxes the area of Pareto dominance by a factor of ε.
1.6.6 Robust Optimization
The robustness of a particular solution can be confirmed by resampling or by reusing
neighborhood solutions. Resampling is reliable, but computationally expensive. In
20
1 Introduction
contrast, the method of reusing neighborhood solutions is cheap but unreliable. A
confidence measure increases the reliability of the latter method. In [44], confidence-based operators are defined for robust metaheuristics. The confidence metric and five
confidence-based operators are employed to design confidence-based robust PSO
and confidence-based robust GA. History can be utilized in helping to estimate the
expected fitness of an individual to produce more robust solutions in EAs.
The confidence metric defines the confidence level of a robust solution. The highest
confidence is achieved when there are a large number of solutions available with
greatest diversity within a suitable neighborhood around the solution in the parameter
space. Mathematically, confidence is expressed by [44]
C = n/(rσ),    (1.10)
where n is the number of sampled points in the neighborhood, r is the radius of the
neighborhood, and σ is the distribution of the available points in the neighborhood.
1.7 Performance Indicators
For evaluating different EAs or iterative algorithms, one can use overall performance indicators and evolving performance indicators.
Overall Performance Indicators
The overall performance indicators provide a general description of the performance. Overall performance can be compared according to efficacy, efficiency, and reliability on a benchmark problem over many runs.
Efficacy evaluates the quality of the results without caring about the speed of an
algorithm. Mean best fitness (MBF) is defined as the average of the best fitness in the
last population over all runs. The best fitness values thus far can be used as a more
absolute measure for efficacy.
Reliability indicates the extent to which the algorithm can provide acceptable
results. Success rate (SR) is defined as the percentage of runs terminated with success.
A run is counted as successful if the difference between the best fitness value in the last generation f* and a predefined value f^o falls below a predefined threshold ε.
Efficiency requires finding the global optimal solution rapidly. Average number
of evaluations to a solution (AES) is defined as the average number of evaluations
it takes for the successful runs. If an algorithm has no successful runs, its AES is
undefined.
Low SR and high MBF may indicate that the algorithm converges slowly, while
high SR and low MBF may indicate that the algorithm is basically reliable, but may
provide very bad results occasionally. It is desirable to have a smaller AES and a larger SR; thus, a small AES/SR criterion considers reliability and efficiency at the same time.
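The three indicators can be computed from per-run records as follows; this is a sketch for minimization, and the run data in the test are invented.

```python
def performance_indicators(best_fitness, evals, f_target, eps):
    """Compute MBF, SR, and AES from per-run results: best_fitness[i] is
    the best fitness of run i, evals[i] the number of evaluations it used.
    A run succeeds when its best fitness is within eps of f_target."""
    runs = len(best_fitness)
    mbf = sum(best_fitness) / runs
    successes = [i for i in range(runs)
                 if abs(best_fitness[i] - f_target) < eps]
    sr = len(successes) / runs
    # AES is undefined (None) when there are no successful runs.
    aes = (sum(evals[i] for i in successes) / len(successes)
           if successes else None)
    return mbf, sr, aes
```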