3.3 Choosing the Threshold: The Influence of the Scaling Factor
Exploring Sparse Covariance Estimation …
273
The maximal entry of the population covariance was chosen to make the parameter data-dependent and not too small. In future research, we will take a closer look at this choice. We investigated seven values for the factor ρ, with
ρ ∈ {0.001, 0.01, 0.1, 1, 2, 5, 10}. The goal of this paper is to gain more insight
into the question of whether larger or smaller scalings are preferable. The experiments
are conducted for N = 20. Of course, the search space dimensionality may additionally influence the parameter settings and will be investigated in future research.
Finally, the decision of using the maximal entry of the population covariance matrix is
reconsidered by additionally investigating a data-independent choice, setting δ equal to ρ:

δ = ρ.    (21)

This serves as a contrast to the data-dependent first variant (20). Tables 2, 3 and 4 show the results on selected functions of
the test suite. For each function, 30 runs were conducted. The subset includes two
separable functions, the sphere with id 1 and the ellipsoidal with id 2, followed by
the attractive sector with id 6 and the rotated Rosenbrock (id 9) as representatives
Table 2 The results for different settings of the parameter ρ in (15) on selected functions for the
CMSA-Thres-ES
Expected running time (ERT in number of function evaluations) divided by the respective best ERT
measured during BBOB-2009 in dimension 20. The ERT and, in braces, as dispersion measure, the
half difference between the 90 and 10 %-tiles of bootstrapped run lengths appear for each algorithm and
target; the corresponding best ERT is given in the first row. The different target f-values are shown in the
top row. #succ is the number of trials that reached the (final) target f_opt + 10^−8. The median number
of conducted function evaluations is additionally given in italics if the target in the last column
was never reached. Entries followed by a star are statistically significantly better (according to
the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = 10^−k
when the number k following the star is larger than 1, with Bonferroni correction by the number of
instances
274
S. Meyer-Nieberg and E. Kropat
Table 3 The results for different settings of the parameter ρ in (15) on selected functions for the
CMSA-Diag-ES
Please refer to Table 2 for more information
Table 4 The results for different settings of the parameter ρ in (21) on selected functions for the
CMSA-Diag-ES
Please refer to Table 2 for more information
for functions with moderate conditioning. The group of ill-conditioned functions is
represented by another ellipsoidal with id 10 and the bent cigar (id 12). The parameter investigation does not consider the multimodal functions, since our experiments
showed that these pose challenges for the algorithms, especially for larger search
space dimensionalities. The first series of experiments considers CMSA-ESs which
apply (15). Table 2 provides the findings for the variant which also subjects the diagonal entries of the covariance matrix to thresholding. It reveals different experimental
findings for the functions. In the case of the sphere, the choice of the scaling factor
does not affect the outcome considerably. This changes when the other functions are
considered. In the case of the ellipsoidal with id 2, the size of the parameter does
not influence whether the optimization is successful or not. However, it influences
the number of fitness evaluations necessary to reach the intermediate target precisions.
In the case of the ellipsoidal, choices between 0.01 and 1 lead to lower values.
Larger and smaller values lead to a gradual loss of performance. The case of the
attractive sector (id 6) shows that the choice of the scaling factor may strongly influence the optimization success. Before continuing, it should be taken into account,
however, that in all cases the number of successful outcomes is less than 30 % of
all runs. Therefore, it cannot be excluded that initialization effects may have played
a role. For this reason, experiments with more repeats will be conducted in future
work. Here, we can state that too small choices do not allow the ES to achieve the
final target precision. Otherwise, successful runs are recorded. However, no clear
dependency of the performance on the factor ρ emerges. For the present experiment
series, the factor ρ = 1 should be preferred. The remaining functions of
the subset do not pose the same challenge for the ES as the attractive sector. The
majority of the runs leads to successful outcomes. In the case of the Rosenbrock
function, setting ρ = 0.1 is beneficial for the number of successes as well as for the
number of fitness evaluations required to reach the various target precisions. The
findings roughly transfer to the next function, the ill-conditioned ellipsoidal. Here,
however, the outcomes do not vary as much as previously. The bent cigar represents
an outlier of the function set chosen. In contrast to the other functions, the ESs with
low scaling factors ρ perform better. Since the factor is coupled with the maximal
entry of the covariance matrix, two potential explanations can be given: First, in
some cases, thresholding should be conducted only for a few values. Or, second, the
estimated covariance matrix contains extreme elements which require a reconsideration of choosing (15). To summarize the findings so far: relatively robust choices of
the scaling factor are medium-sized values between 0.1 and 1
for the experiments conducted so far. This, however, may not be the optimal choice
for all functions of the test suite. Future research will therefore investigate adaptive
techniques. The findings transfer to the ES variant that does not apply thresholding
to the diagonal entries of the covariance matrix, as Table 3 shows.
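As an illustration, the thresholding step with the scaling factor ρ can be sketched as follows. This is a minimal sketch under our own naming, using plain hard thresholding with the data-dependent cut-off δ = ρ · max|σ̂_ij| (and the data-independent alternative δ = ρ from (21)); it is not the paper's exact adaptive rule (15), only the universal variant it is coupled with.

```python
import numpy as np

def threshold_covariance(cov, rho, keep_diagonal=False, data_dependent=True):
    """Entry-wise hard thresholding of a covariance estimate.

    The cut-off is delta = rho * max|cov| for the data-dependent variant
    and delta = rho for the data-independent one.
    """
    delta = rho * np.max(np.abs(cov)) if data_dependent else rho
    sparse = np.where(np.abs(cov) >= delta, cov, 0.0)
    if keep_diagonal:
        # CMSA-Diag-ES style: leave the diagonal entries untouched
        np.fill_diagonal(sparse, np.diag(cov))
    return sparse

rng = np.random.default_rng(0)
samples = rng.standard_normal((10, 5))   # e.g. 10 offspring in N = 5 dimensions
cov = np.cov(samples, rowvar=False)      # sample covariance estimate
sparse_cov = threshold_covariance(cov, rho=0.5)
```

The `keep_diagonal` flag mirrors the distinction between the CMSA-Thres-ES (complete thresholding) and the CMSA-Diag-ES (diagonal left unchanged).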
Table 4 reports the results for the data-independent scaling factor setting
δ = ρ, see (21). Here, experiments were only conducted for the CMSA-Diag-ES.
The results vary strongly over the subset of fitness functions. In the case of the separable functions, the performance can be influenced by the choice of the scaling factor.
Again, this is more pronounced for the ellipsoidal than for the sphere. Interestingly, in the case of the remaining functions, a correct choice of the factor is decisive,
leading either to successful runs for all 30 trials or to a complete miss of the final target
precision. In general, larger values are detrimental and smaller values achieve better outcomes. The exception is the attractive sector (id 6), where a scaling factor of
one leads to successes in nearly all trials. This setting cannot be transferred to the
other functions. As we have seen, choosing the scaling factor represents an important
point in thresholding. Therefore, future research will focus on potential adaptation
approaches.
For the remainder of the paper, the data-dependent version (15) is used. The
first series of experiments indicated that the values ρ = 0.1 and ρ = 1 lead to
comparably good results; therefore, the parameter was set to 0.5 for the evaluation
experiments with the black-box optimization benchmarking test suite. The results
are provided in the next subsection.
3.4 Results and Discussion
The findings are interesting, indicating advantages for thresholding in many but not
all cases. The result of the comparison depends on the function class. In the case of
the separable functions with ids 1–5, the strategies behave on the whole very similarly
for both dimensionalities, 10D and 20D. This can be seen, for example, in the empirical
cumulative distribution function plots in Figs. 2 and 3.
Concerning the particular functions, differences are revealed, as Tables 5 and 6
show for the expected running time (ERT), which is provided for several precision
targets. The expected running time is given relative to the best results achieved
during the black-box optimization workshop in 2009. The first line of the outcomes
for each function reports the ERT of the best algorithm of 2009. However, not only
the ERT values but also the number of successes is important. The ERT can only be
measured if the algorithm achieved the respective target in a run. If the number
of trials in which the final optimization target has been reached is low, then the results for the
remaining targets should be discussed with care. If only a few runs contribute to the
result, the findings may be strongly influenced by initialization effects. To summarize,
only a few cases end with differences that are statistically significant. To achieve significance,
an algorithm has to perform significantly better than both competing methods: the
other thresholding variant and the original CMSA-ES.
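The ERT measure used throughout these tables can be illustrated with a small sketch (function name and run data are ours, chosen for illustration): the total number of evaluations spent over all runs is divided by the number of successful runs.

```python
def expected_running_time(evals_per_run, reached_target):
    """ERT: the sum of function evaluations spent over all runs
    (successful and unsuccessful) divided by the number of runs that
    reached the target; infinite if no run succeeded."""
    n_succ = sum(reached_target)
    if n_succ == 0:
        return float("inf")
    return sum(evals_per_run) / n_succ

# hypothetical data: five runs with a budget of 5000 evaluations,
# three of which reached the target precision
evals = [1200, 5000, 900, 5000, 1500]
hit = [True, False, True, False, True]
ert = expected_running_time(evals, hit)   # (1200+5000+900+5000+1500)/3
```

This also makes the caveat above concrete: with few successful runs, the divisor is small and the estimate becomes unstable.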
In the case of the sphere (function with id 1), slight advantages for the thresholding
variants are revealed. A similar observation can be made for the second function,
the separable ellipsoid. Here, both thresholded ESs are faster, with the one that only
shrinks the off-diagonal elements benefiting more strongly (Table 6). This is probably due to
the more regular structure that is enforced.
No strategy is able to reach the required target precision in the case of the separable
Rastrigin (id 3) and the separable Rastrigin-Bueche (id 4). Since all strategies only
achieve the lowest target precision of 10^1, a comparison is not performed. The linear
slope is solved quickly by all strategies, with the original CMSA-ES being the best.
In the case of the function class containing test functions with low to moderate conditioning (ids 6–9), different findings emerge for the two search space
dimensionalities. This is also shown by the empirical cumulative distribution function plots in Figs. 2 and 3, especially for N = 10. Also in the case of N = 10, Table 5
shows that the strategies with thresholding achieve a better performance in a majority
of cases. In addition, thresholding that is not applied to the diagonal appears to lead
[Figure 2: six ECDF panels (separable fcts, moderate fcts, ill-conditioned fcts, multi-modal fcts, weakly structured multi-modal fcts, all functions), each showing curves for best 2009, CMSA, Diag, and Thres.]
Fig. 2 Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in 10^[−8..2] for all functions and subgroups
in 10-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each
single target
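The ECDF curves in these figures aggregate, for each evaluation budget, the fraction of (run, target) pairs solved within that budget. A minimal sketch of the underlying computation (without the bootstrapping step; names and data are ours):

```python
import numpy as np

def runtime_ecdf(evals_to_target, budgets):
    """Fraction of (run, target) pairs solved within each budget.

    evals_to_target: evaluation counts needed to reach each target,
    with infinity where a target was never reached."""
    data = np.asarray(evals_to_target, dtype=float).ravel()
    return [float(np.mean(data <= b)) for b in budgets]

# hypothetical: three (run, target) pairs, one never solved
ecdf = runtime_ecdf([10, 100, float("inf")], budgets=[50, 1000])
```

Unsolved pairs enter as infinity, so they lower the curve at every budget rather than being dropped.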
to a well-performing strategy, with the exception of f6, where the CMSA-Diag-ES
appears highly successful.
The results for f6, the so-called attractive sector, in 10D are astonishing. While
the original CMSA-ES could only reach the required target precision in six of the
15 runs, the thresholding variants were able to succeed 13 times (CMSA-Thres-ES)
[Figure 3: three ECDF panels (separable fcts, moderate fcts, ill-conditioned fcts), each showing curves for best 2009, CMSA, Diag, and Thres.]
Fig. 3 Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in 10^[−8..2] for all functions and subgroups
in 20-D. The “best 2009” line corresponds to the best ERT observed during BBOB 2009 for each
single target
and 15 times (CMSA-Diag-ES). The latter additionally achieved lower expected running times.
For N = 20, only a minority of runs were successful for all strategies, with
the CMSA-Diag-ES reaching the final target precision in nearly half of the trials.
Experiments with a larger number of fitness evaluations must be conducted in order
to investigate the findings more closely.
The same holds for the step ellipsoid (id 7), which cannot be solved with the required
target precision by any strategy. The exception is one run of the CMSA-Thres-ES,
which reached the target precision of 10^−8 for N = 10. Since this
may be due to the initialization, it is not taken into further consideration. Concerning
the lower precision targets, sometimes the CMSA-ES and sometimes the CMSA-Diag-ES appears superior. However, more research is required, since the number of
runs entering the data for some of the target precisions is low and initial positions
may be influential.
On the original Rosenbrock function (id 8), the CMSA-ES and the strategies with
thresholding show a similar behavior, with the CMSA-ES performing better for the
first intermediate target precision, whereas the CMSA-Diag-ES shows slightly lower
Table 5 Expected running time (ERT in number of function evaluations) divided by the respective
best ERT measured during BBOB-2009 in dimension 10
The ERT and, in braces, as dispersion measure, the half difference between the 90 and 10 %-tiles of
bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the
first row. The different target f-values are shown in the top row. #succ is the number of trials that
reached the (final) target f_opt + 10^−8. The median number of conducted function evaluations is
additionally given in italics if the target in the last column was never reached. Entries followed
by a star are statistically significantly better (according to the rank-sum test) when compared to all
other algorithms of the table, with p = 0.05 or p = 10^−k when the number k following the star is
larger than 1, with Bonferroni correction by the number of instances
Table 6 Expected running time (ERT in number of function evaluations) divided by the respective
best ERT measured during BBOB-2009 in dimension 20
The ERT and, in braces, as dispersion measure, the half difference between the 90 and 10 %-tiles of
bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the
first row. The different target f-values are shown in the top row. #succ is the number of trials that
reached the (final) target f_opt + 10^−8. The median number of conducted function evaluations is
additionally given in italics if the target in the last column was never reached. Entries followed
by a star are statistically significantly better (according to the rank-sum test) when compared to all
other algorithms of the table, with p = 0.05 or p = 10^−k when the number k following the star is
larger than 1, with Bonferroni correction by the number of instances
running times for the latter. The version which subjects the complete covariance matrix to
thresholding always performs slightly worse. The roles of the CMSA-Diag-ES and
the CMSA-Thres-ES change for the rotated Rosenbrock (id 9). Here, the best results
are observed for the complete thresholding variant.
In the case of ill-conditioned functions (ids 10–14), the findings are mixed. In
general, thresholding without including the diagonal does not appear to improve
the performance. The strategy performs worst of all, an indicator that keeping the
diagonal unchanged may sometimes be inappropriate due to the space transformation.
However, since there are interactions with the choice of the thresholding parameters,
which may have resulted in comparatively too large diagonal elements, we need to
address this issue further before coming to a conclusion. First of all, for N = 10,
all strategies are successful in all cases for the ellipsoid (id 10), the discus (id 11),
the bent cigar (id 12), and the sum of different powers (id 14). Only the CMSA-ES
reaches the optimization target in the case of the sharp ridge (id 13). This, however,
happens only twice. The reasons for this require further analysis. Either the findings may be
due to a violation of the sparseness assumption or, considering that this is only a
weak assumption, the choice of the thresholding parameters for this function should
be reconsidered.
All strategies exhibit problems in the case of the group of multi-modal functions: Rastrigin (id 15), Weierstrass (id 16), Schaffer F7 with condition number 10
(id 17), Schaffer F7 with condition number 1000 (id 18), and Griewank-Rosenbrock F8F2
(id 19). Partly, this may be due to the maximal number of fitness evaluations permitted. Even the best performing methods of the 2009 BBOB workshop required more
evaluations than we allowed in total. Thus, experiments with larger values for the
maximal function evaluations should be conducted in future research. Concerning
the preliminary targets with lower precision, the CMSA-ES achieves the best results
in a majority of cases. However, the same argumentation as for the step ellipsoid
applies.
In the case of N = 20, the number of function evaluations that were necessary
for the best algorithms of 2009 to reach even the lower precision target
of 10^−1 exceeds the budget chosen here. Therefore, the function group is excluded
from the analysis for N = 20 and not shown in Fig. 3 and Table 6.
The remaining group consists of multi-modal functions with weak global structures. Here, especially the functions with numbers 20 (Schwefel x sin(x)),
23 (Katsuura), and 24 (Lunacek bi-Rastrigin) represent challenges for the algorithms. In the case of N = 10, the strategies can only reach the first targets of 10^1 and 10^0.
Again, the maximal number of function evaluations should be increased to allow a
more detailed analysis on these functions. For the remaining functions,
function 21, Gallagher 101 peaks, and function 22, Gallagher 21 peaks, the results
indicate a better performance for the CMSA-ES versions with thresholding compared with the original algorithm. Again, for similar reasons as for the first group
of multi-modal functions, the results are only shown for N = 10.
4 Conclusions and Outlook
This paper presents covariance matrix adaptation techniques for evolution strategies.
The original versions are based on the sample covariance, an estimator known to
be problematic. Especially in high-dimensional search spaces, where the population
size does not exceed the search space dimensionality, the agreement between the estimator and the true covariance may be low. Therefore, thresholding, a comparatively
simple estimation technique in computational terms, has been integrated into the covariance
adaptation process. Thresholding stems from estimation theory for high-dimensional
spaces and assumes an approximately sparse structure of the covariance matrix. The
matrix entries are therefore thresholded, meaning that a thresholding function is applied to them.
The paper considered adaptive entry-wise thresholding. Since the covariance matrix
cannot be assumed to be sparse in general, a basis transformation was carried out and
the thresholding process was performed in the transformed space. The performance
of the resulting new covariance matrix adapting evolution strategies was compared to
that of the original variant on the black-box optimization benchmarking test suite. Two main
variants were considered: a CMSA-ES which subjected the complete covariance matrix to
thresholding and a variant which left the diagonal elements unchanged. While the latter is more common in statistics, it is not easy to justify this preference in optimization.
The first findings were interesting, with the new variants performing better for several
function classes. While this is promising, more experiments and analyses are required
and will be conducted in future research. The choice of the thresholding function
and the scaling parameter for the threshold represent important research questions.
In this paper, the scaling factor was analyzed by a small series of experiments. The
findings were then used in the benchmarking experiments. They represent, however,
only the first steps of the research. Techniques to make the parameter adaptive are
currently under investigation.
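The thresholding-after-basis-transformation idea summarized above can be sketched as follows. For illustration we assume the transformation is the eigenbasis of a previous covariance estimate; the paper's exact choice of transformation may differ, and all names are ours.

```python
import numpy as np

def threshold_in_transformed_basis(cov_new, cov_prev, rho):
    """Rotate the new covariance estimate into the eigenbasis of a
    previous estimate, hard-threshold the entries there (where an
    approximately sparse structure is more plausible), and rotate back."""
    _, basis = np.linalg.eigh(cov_prev)            # orthonormal eigenvectors
    transformed = basis.T @ cov_new @ basis        # representation in that basis
    transformed = (transformed + transformed.T) / 2  # guard against round-off asymmetry
    delta = rho * np.max(np.abs(transformed))      # data-dependent cut-off
    sparse = np.where(np.abs(transformed) >= delta, transformed, 0.0)
    return basis @ sparse @ basis.T                # back to original coordinates

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 4))
cov_new = a @ a.T                                  # symmetric positive semi-definite
cov_prev = np.diag([1.0, 2.0, 3.0, 4.0])
result = threshold_in_transformed_basis(cov_new, cov_prev, rho=0.5)
```

Since the rotation is orthogonal, symmetry of the estimate is preserved by the round trip.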