
3 Choosing the Threshold: The Influence of the Scaling Factor

Exploring Sparse Covariance Estimation …


The maximal entry of the population covariance was chosen to make the parameter data-dependent and not too small. In future research, we will take a closer look at the choice above. We investigated seven values for the factor ρ with ρ ∈ {0.001, 0.01, 0.1, 1, 2, 5, 10}. The goal of this paper is to gain more insight into the question of whether larger or smaller scalings are preferable. The experiments are conducted for N = 20. Of course, the search space dimensionality may additionally influence the parameter settings and will be investigated in future research.
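As a minimal sketch of the data-dependent choice described above (assuming, as stated, that the threshold is the factor ρ times the maximal absolute entry of the covariance matrix; the function name is illustrative, not from the paper):

```python
import numpy as np

def data_dependent_threshold(cov, rho):
    # delta = rho times the maximal absolute entry of the covariance
    # matrix, following the data-dependent choice described in the text
    # (the exact form of the paper's Eq. (20) may differ).
    return rho * np.max(np.abs(cov))

# Toy covariance with one dominant entry.
cov = np.array([[4.0, 0.2, 0.01],
                [0.2, 1.0, 0.05],
                [0.01, 0.05, 0.5]])

for rho in (0.001, 0.01, 0.1, 1.0):
    print(rho, data_dependent_threshold(cov, rho))
```

Coupling the threshold to the largest entry keeps δ on the scale of the current covariance, which is the motivation given in the text for making the parameter data-dependent.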

Finally, the decision of using the maximal entry of the population covariance matrix is reconsidered by additionally investigating a data-independent choice, setting δ equal to ρ,

δ = ρ, (21)

in contrast to the first variant (20). Tables 2, 3 and 4 show the results on selected functions of

the test suite. For each function, 30 runs were conducted. The subset includes two separable functions, the sphere with id 1 and the ellipsoidal with id 2, followed by the attractive sector with id 6 and the rotated Rosenbrock (id 9) as representatives

Table 2 The results for different settings of the parameter ρ in (15) on selected functions for the

CMSA-Thres-ES

Expected running time (ERT, in number of function evaluations) divided by the respective best ERT measured during BBOB-2009 in dimension 20. The ERT and, in braces, as dispersion measure, the half difference between the 90 and 10 %-tile of bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the first row. The different target f-values are shown in the top row. #succ is the number of trials that reached the (final) target f_opt + 10^−8. The median number of conducted function evaluations is additionally given in italics if the target in the last column was never reached. Entries followed by a star are statistically significantly better (according to the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = 10^−k when the number k following the star is larger than 1, with Bonferroni correction by the number of instances
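The dispersion measure quoted in the caption can be computed directly from the bootstrapped run lengths; a minimal sketch (the bootstrap itself, i.e. the simulated restarts, is not shown here):

```python
import numpy as np

def half_interdecile_range(run_lengths):
    # Dispersion measure from the table caption: half the difference
    # between the 90 %-tile and the 10 %-tile of the (bootstrapped)
    # run lengths.
    p10, p90 = np.percentile(run_lengths, [10, 90])
    return (p90 - p10) / 2.0

lengths = [100, 120, 130, 150, 200, 210, 400, 800, 900, 1000]
print(half_interdecile_range(lengths))  # spread of the middle 80 %
```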


S. Meyer-Nieberg and E. Kropat

Table 3 The results for different settings of the parameter ρ in (15) on selected functions for the

CMSA-Diag-ES

Please refer to Table 2 for more information

Table 4 The results for different settings of the parameter ρ in (21) on selected functions for the

CMSA-Diag-ES

Please refer to Table 2 for more information

for functions with moderate conditioning. The group of ill-conditioned functions is represented by another ellipsoidal with id 10 and the bent cigar (id 12). The parameter investigation does not consider the multimodal functions since our experiments showed that these pose challenges for the algorithms, especially for larger search space dimensionalities. The first series of experiments considers CMSA-ESs which apply (15). Table 2 provides the findings for the variant which also subjects the diagonal entries of the covariance matrix to thresholding. It reveals different experimental


findings for the functions. In the case of the sphere, the choice of the scaling factor does not affect the outcome considerably. This changes when the other functions are considered. In the case of the ellipsoidal with id 2, the size of the parameter does not influence whether the optimization is successful or not. However, it influences the number of fitness evaluations necessary to reach the intermediate target precisions. In the case of the ellipsoidal, choices between 0.01 and 1 lead to lower values. Larger and smaller values lead to a gradual loss of performance. The case of the attractive sector (id 6) shows that the choice of the scaling factor may strongly influence the optimization success. Before continuing, it should be taken into account, however, that in all cases the number of successful outcomes is less than 30 % of all runs. Therefore, it cannot be excluded that initialization effects may have played a role. For this reason, experiments with more repeats will be conducted in future work. Here, we can state that too small choices do not allow the ES to achieve the final target precision. Otherwise, successful runs are recorded. However, no clear dependency of the performance on the factor ρ emerges. Regarding the present series of experiments, the factor ρ = 1 should be preferred. The remaining functions of the subset do not pose the same challenge for the ES as the attractive sector. The majority of the runs leads to successful outcomes. In the case of the Rosenbrock function, setting ρ = 0.1 is beneficial for the number of successes as well as for the number of fitness evaluations required for reaching the various target precisions. The findings roughly transfer to the next function, the ill-conditioned ellipsoidal. Here, however, the outcomes do not vary as much as previously. The bent cigar represents an outlier of the function set chosen. In contrast to the other functions, the ESs with low scaling factors ρ perform better. Since the factor is coupled with the maximal entry of the covariance matrix, two potential explanations can be given: First, in some cases, thresholding should be conducted only for few values. Or, second, the estimated covariance matrix contains extreme elements which require a reconsideration of choosing (15). To summarize the findings so far: Relatively robust choices of the scaling factor are medium-sized values which lie between 0.1 and 1 for the experiments conducted so far. This, however, may not be the optimal choice for all functions of the test suite. Future research will therefore investigate adaptive techniques. The findings transfer to the ES variant that does not apply thresholding to the diagonal entries of the covariance matrix, as Table 3 shows.
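The two variants differ only in whether the diagonal is thresholded. A minimal sketch of entry-wise hard thresholding under this distinction (hard thresholding is one admissible thresholding function; the paper's actual choice may differ):

```python
import numpy as np

def threshold_cov(cov, delta, include_diagonal):
    # Entry-wise hard thresholding: entries with |c_ij| <= delta are set
    # to zero.  include_diagonal=True corresponds to the CMSA-Thres-ES
    # style (threshold everything), include_diagonal=False to the
    # CMSA-Diag-ES style (diagonal left unchanged).
    keep = np.abs(cov) > delta
    if not include_diagonal:
        keep |= np.eye(cov.shape[0], dtype=bool)
    return np.where(keep, cov, 0.0)

cov = np.array([[0.05, 0.20, 0.01],
                [0.20, 1.00, 0.30],
                [0.01, 0.30, 0.02]])
print(threshold_cov(cov, 0.1, include_diagonal=True))
print(threshold_cov(cov, 0.1, include_diagonal=False))
```

In the toy matrix the two small diagonal entries survive only in the second call, which mirrors the difference discussed between the two strategies.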

Table 4 reports the results for the data-independent scaling factor setting δ = ρ, see (21). Here, experiments were only conducted for the CMSA-Diag-ES. The results vary strongly over the subset of fitness functions. In the case of the separable functions, the performance can be influenced by the choice of the scaling factor. Again, this is more pronounced for the ellipsoidal than for the sphere. Interestingly, in the case of the remaining functions, a correct choice of the factor is decisive, leading either to successful runs for all 30 trials or to a complete miss of the final target precision. In general, larger values are detrimental and smaller values achieve better outcomes. The exception is the attractive sector (id 6), where a scaling factor of one leads to successes in nearly all trials. This setting cannot be transferred to the other functions. As we have seen, choosing the scaling factor represents an important


point in thresholding. Therefore, future research will focus on potential adaptation

approaches.

For the remainder of the paper, the data-dependent version (15) is used. The first series of experiments indicated that the values ρ = 0.1 and ρ = 1 lead to comparably good results; therefore, the parameter was set to 0.5 for the evaluation experiments with the black-box optimization benchmarking test suite. The results are provided in the next subsection.

3.4 Results and Discussion

The findings are interesting, indicating advantages for thresholding in many but not in all cases. The result of the comparison depends on the function class. In the case of the separable functions with ids 1–5, the strategies behave on the whole very similarly for both dimensionalities, 10D and 20D. This can be seen in the empirical cumulative distribution function plots in Figs. 2 and 3, for example.
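The bootstrapped distributions underlying these plots follow the usual simulated-restart idea: recorded trials are drawn with replacement, chaining unsuccessful trials until a successful one is hit, and the accumulated evaluations form one bootstrapped run length. A minimal sketch of that idea (the BBOB post-processing implementation differs in detail):

```python
import random

def bootstrapped_run_length(evals, successes, rng):
    # Simulated restarts: draw recorded trials with replacement and
    # accumulate their evaluations until a successful trial is drawn.
    if not any(successes):
        raise ValueError("at least one successful trial is required")
    total = 0
    while True:
        i = rng.randrange(len(evals))
        total += evals[i]
        if successes[i]:
            return total

rng = random.Random(42)
# One failed trial (budget 1000 evaluations) and one success after 400.
samples = [bootstrapped_run_length([1000, 400], [False, True], rng)
           for _ in range(5)]
print(samples)
```

Every sample is 400 plus some multiple of 1000, reflecting the number of simulated restarts before success.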

Concerning the particular functions, differences are revealed, as Tables 5 and 6 show for the expected running time (ERT), which is provided for several precision targets. The expected running time is given relative to the best results achieved during the black-box optimization workshop in 2009. The first line of the outcomes for each function reports the ERT of the best algorithm of 2009. However, not only the ERT values but also the number of successes is important. The ERT can only be measured if the algorithm achieved the respective target in the run. If the number of trials in which the full optimization objective has been reached is low, then the remaining targets should be discussed with care. If only a few runs contribute to the result, the findings may be strongly influenced by initialization effects. To summarize, only a few cases end with differences that are statistically significant. To achieve this, the algorithm has to perform significantly better than both competing methods: the other thresholding variant and the original CMSA-ES.
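The ERT referred to above follows the standard BBOB definition: the total number of function evaluations spent over all trials, successful and unsuccessful, divided by the number of successful trials. A minimal sketch:

```python
def expected_running_time(evals, successes):
    # Standard BBOB-style ERT: all evaluations spent over all trials
    # (successful and unsuccessful) divided by the number of successful
    # trials; infinite if no trial reached the target.
    n_succ = sum(successes)
    if n_succ == 0:
        return float("inf")
    return sum(evals) / n_succ

# Three of four trials reach the target; the failed trial spent its
# full budget of 10000 evaluations.
print(expected_running_time([500, 700, 10000, 600],
                            [True, True, False, True]))
```

This makes clear why a low success count matters: a single failed trial contributes its entire budget to the numerator, so ERT values based on few successes are volatile.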

In the case of the sphere (function with id 1), slight advantages for the thresholding variants are revealed. A similar observation can be made for the second function, the separable ellipsoid. Here, both thresholded ESs are faster, the one that only shrinks the off-diagonal elements more strongly so (Table 6). This is probably due to the more regular structure that is enforced.

No strategy is able to reach the required target precision in the case of the separable Rastrigin (id 3) and the separable Rastrigin-Bueche (id 4). Since all strategies only achieve the lowest target precision of 10^1, a comparison is not performed. The linear slope is solved fast by all, with the original CMSA-ES being the best strategy.

In the case of the function class containing test functions with low to moderate conditioning (ids 6–9), different findings can be made for the two search space dimensionalities. This is also shown by the empirical cumulative distribution function plots in Figs. 2 and 3, especially for N = 10. Also in the case of N = 10, Table 5 shows that the strategies with thresholding achieve a better performance in a majority of cases. In addition, thresholding that is not applied to the diagonal appears to lead

[Figure 2 appears here with six panels: separable fcts, moderate fcts, ill-conditioned fcts, multi-modal fcts, weakly structured multi-modal fcts, and all functions; each panel's legend lists best 2009, Thres, Diag, and CMSA.]

Fig. 2 Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in 10^[−8..2] for all functions and subgroups in 10-D. The "best 2009" line corresponds to the best ERT observed during BBOB 2009 for each single target

to a well-performing strategy, with the exception of f6, where the CMSA-Diag-ES appears highly successful.

The results for f6, the so-called attractive sector, in 10D are astonishing. While the original CMSA-ES could only reach the required target precision in six of the 15 runs, the thresholding variants were able to succeed 13 times (CMSA-Thres-ES)


[Figure 3 appears here with three panels: separable fcts, moderate fcts, and ill-conditioned fcts; each panel's legend lists best 2009, Diag, Thres, and CMSA.]

Fig. 3 Bootstrapped empirical cumulative distribution of the number of objective function evaluations divided by dimension (FEvals/DIM) for 50 targets in 10^[−8..2] for all functions and subgroups in 20-D. The "best 2009" line corresponds to the best ERT observed during BBOB 2009 for each single target

and 15 times (CMSA-Diag-ES). The latter achieved lower expected running times in addition. For N = 20, only a minority of runs were successful for all strategies, with the CMSA-Diag-ES reaching the final target precision in nearly half of the trials. Experiments with a larger number of fitness evaluations must be conducted in order to investigate the findings more closely.

The same holds for the step ellipsoid (id 7), which cannot be solved with the required target precision by any strategy. The exception is one run of the CMSA-Thres-ES which is able to reach the target precision of 10^−8 in one case for N = 10. Since this may be due to the initialization, it is not taken into further consideration. Concerning the lower precision targets, sometimes the CMSA-ES and sometimes the CMSA-Diag-ES appears superior. However, more research is required, since the number of runs entering the data for some of the target precisions is low and initial positions may be influential.

On the original Rosenbrock function (id 8), the CMSA-ES and the strategies with thresholding show a similar behavior, with the CMSA-ES performing better for the first intermediate target precision, whereas the CMSA-Diag-ES shows slightly lower


Table 5 Expected running time (ERT, in number of function evaluations) divided by the respective best ERT measured during BBOB-2009 in dimension 10

The ERT and, in braces, as dispersion measure, the half difference between the 90 and 10 %-tile of bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the first row. The different target f-values are shown in the top row. #succ is the number of trials that reached the (final) target f_opt + 10^−8. The median number of conducted function evaluations is additionally given in italics if the target in the last column was never reached. Entries followed by a star are statistically significantly better (according to the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = 10^−k when the number k following the star is larger than 1, with Bonferroni correction by the number of instances


Table 6 Expected running time (ERT, in number of function evaluations) divided by the respective best ERT measured during BBOB-2009 in dimension 20

The ERT and, in braces, as dispersion measure, the half difference between the 90 and 10 %-tile of bootstrapped run lengths appear for each algorithm and target; the corresponding best ERT is given in the first row. The different target f-values are shown in the top row. #succ is the number of trials that reached the (final) target f_opt + 10^−8. The median number of conducted function evaluations is additionally given in italics if the target in the last column was never reached. Entries followed by a star are statistically significantly better (according to the rank-sum test) when compared to all other algorithms of the table, with p = 0.05 or p = 10^−k when the number k following the star is larger than 1, with Bonferroni correction by the number of instances

running times for the latter. The version which subjects the complete covariance to thresholding always performs slightly worse. The roles of the CMSA-Diag-ES and the CMSA-Thres-ES change for the rotated Rosenbrock (id 9). Here, the best results can be observed for the complete thresholding variant.

In the case of the ill-conditioned functions (ids 10–14), the findings are mixed. In general, thresholding without including the diagonal does not appear to improve the performance. The strategy performs worst of all, an indicator that keeping the diagonal unchanged may sometimes be inappropriate due to the space transformation. However, since there are interactions with the choice of the thresholding parameters, which may have resulted in comparatively too large diagonal elements, we need to address this issue further before coming to a conclusion. First of all, for N = 10, all strategies are successful in all cases for the ellipsoid (id 10), the discus (id 11), the bent cigar (id 12), and the sum of different powers (id 14). Only the CMSA-ES reaches the optimization target in the case of the sharp ridge (id 13). This, however,


happens only twice. The reasons for this require further analysis. Either the findings may be due to a violation of the sparseness assumption, or, considering that this is only a weak assumption, the choice of the thresholding parameters and the function should be reconsidered.

All strategies exhibit problems in the case of the group of multi-modal functions: Rastrigin (id 15), Weierstrass (id 16), Schaffer F7 with condition number 10 (id 17), Schaffer F7 with condition 1000 (id 18), and Griewank-Rosenbrock F8F2 (id 19). Partly, this may be due to the maximal number of fitness evaluations permitted. Even the best performing methods of the 2009 BBOB workshop required more evaluations than we allowed in total. Thus, experiments with larger values for the maximal number of function evaluations should be conducted in future research. Concerning the preliminary targets with lower precision, the CMSA-ES achieves the best results in a majority of cases. However, the same argumentation as for the step ellipsoid applies.

In the case of N = 20, the number of function evaluations that was necessary for the best algorithms of 2009 to reach even the lower precision target of 10^−1 exceeds the budget chosen here. Therefore, the function group is excluded from the analysis for N = 20 and not shown in Fig. 3 and Table 6.

The remaining group consists of multi-modal functions with weak global structure. Here, especially the functions with numbers 20 (Schwefel x sin(x)), 23 (Katsuura), and 24 (Lunacek bi-Rastrigin) represent challenges for the algorithms. In the case of N = 10, the strategies can only reach the first targets of 10^1 and 10^0. Again, the maximal number of function evaluations should be increased to allow a more detailed analysis of these functions. For the remaining functions, function 21, Gallagher 101 peaks, and function 22, Gallagher 21 peaks, the results indicate a better performance for the CMSA-ES versions with thresholding compared with the original algorithm. Again, due to similar reasons as for the first group of multi-modal functions, the results are only shown for N = 10.

4 Conclusions and Outlook

This paper presents covariance matrix adaptation techniques for evolution strategies. The original versions are based on the sample covariance, an estimator known to be problematic. Especially in high-dimensional search spaces, where the population size does not exceed the search space dimensionality, the agreement of the estimator and the true covariance may be low. Therefore, thresholding, a computationally simple estimation technique, has been integrated into the covariance adaptation process. Thresholding stems from estimation theory for high-dimensional spaces and assumes an approximately sparse structure of the covariance matrix. The matrix entries are therefore thresholded, meaning a thresholding function is applied. The paper considered adaptive entry-wise thresholding. Since the covariance matrix cannot be assumed to be sparse in general, a basis transformation was carried out and the thresholding process was performed in the transformed space. The performance


of the resulting new covariance matrix adapting evolution strategies was compared to the original variant on the black-box optimization benchmarking test suite. Two main variants were considered: a CMSA-ES which subjected the complete covariance to thresholding and a variant which left the diagonal elements unchanged. While the latter is more common in statistics, it is not easy to justify this preference in optimization. The first findings were interesting, with the new variants performing better for several function classes. While this is promising, more experiments and analyses are required and will be conducted in future research. The choice of the thresholding function and the scaling parameter for the threshold represent important research questions. In this paper, the scaling factor was analyzed by a small series of experiments. The findings were then used in the benchmarking experiments. They represent, however, only the first steps of the research. Techniques to make the parameter adaptive are currently under investigation.
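The transformed-space thresholding summarized above can be sketched as follows. The concrete basis transformation of the paper is not reproduced here; as an illustrative assumption, the change of basis uses the eigenvectors of a reference covariance matrix:

```python
import numpy as np

def threshold_in_transformed_space(sample_cov, reference_cov, delta):
    # Sketch of the idea: change the basis (here: eigenvectors of a
    # reference covariance, an illustrative choice, not necessarily the
    # transformation used in the paper), apply entry-wise hard
    # thresholding with the diagonal kept, and map back.
    _, vecs = np.linalg.eigh(reference_cov)
    transformed = vecs.T @ sample_cov @ vecs
    keep = np.abs(transformed) > delta
    keep |= np.eye(keep.shape[0], dtype=bool)
    thresholded = np.where(keep, transformed, 0.0)
    return vecs @ thresholded @ vecs.T

sample = np.array([[1.0, 0.05],
                   [0.05, 2.0]])
# With the identity as reference, the sketch reduces to plain
# entry-wise thresholding of the sample covariance.
print(threshold_in_transformed_space(sample, np.eye(2), 0.1))
```

The point of the transformation is that approximate sparsity, which cannot be assumed for the covariance in the original coordinates, becomes a plausible assumption in the transformed space.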

