
A Machine Learning-Based Approximation of Strong Branching

Author: 短短的路走走停停 · Published 2021-08-12 · Column: 程序猿声

Paper-reading notes reflecting my own understanding; if you spot any errors, please point them out, thanks! The paper falls into the category of machine learning alongside optimization algorithms.

01 Introduction

Most current mixed-integer programming (MIP) solvers are based on the branch-and-bound (B&B) algorithm. In recent years, many features such as cutting planes, presolve, heuristics, and advanced branching strategies have been added to MIP solvers to improve solving efficiency.

The branching strategy has a strong influence on how effectively a MIP is solved, and it has been studied extensively in the literature.

  • The simplest branching strategy is most-infeasible branching, which at each node branches on the variable whose fractional part is closest to 0.5.
  • Pseudocost branching keeps a record of the dual bound increases observed during past branchings and uses this history to estimate the dual bound increase of each candidate variable. Its drawback is that little information is available at the beginning of the search.
  • Strong branching overcomes this drawback: by solving the linear programming (LP) relaxations obtained after branching on each fractional variable, it explicitly evaluates the dual bound increases and then branches on the variable with the largest increase. Its effectiveness is evident, but because many candidates must be evaluated and each LP relaxation takes time to solve, strong branching is computationally expensive (a rough sketch of the score computation is given right after this list).
  • There are of course many other branching strategies, such as reliability branching, inference branching, and nonchimerical branching; interested readers can consult the literature.
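For illustration only (this is not the paper's code), the following sketch shows how a full strong branching score could be computed for the fractional candidates at a node; solve_lp_relaxation is a hypothetical helper that re-solves the node LP with one extra bound, and the product rule used to combine the two increases is one common choice among several.

```python
import math

def strong_branching_scores(lp_value, x_lp, candidates, solve_lp_relaxation):
    """Toy sketch of full strong branching (minimization assumed).

    lp_value:  LP relaxation objective at the current node
    x_lp:      dict mapping variable index -> fractional LP value
    candidates: indices of fractional variables
    solve_lp_relaxation(i, bound, sense): hypothetical helper that re-solves
        the node LP with x_i <= bound (sense="down") or x_i >= bound
        (sense="up") and returns the new objective value.
    """
    scores = {}
    for i in candidates:
        # dual bound increases when rounding x_i down and up
        down_obj = solve_lp_relaxation(i, math.floor(x_lp[i]), "down")
        up_obj = solve_lp_relaxation(i, math.ceil(x_lp[i]), "up")
        delta_down = down_obj - lp_value
        delta_up = up_obj - lp_value
        # combine the two increases; the product is a common choice
        scores[i] = max(delta_down, 1e-6) * max(delta_up, 1e-6)
    # branch on the candidate with the largest score
    best = max(scores, key=scores.get)
    return best, scores
```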

The contribution of this paper is to learn an approximation of strong branching with machine learning and to integrate the learned model into the B&B framework in order to speed up the solution process.

The binary MILPs addressed in this paper are written in a standard form in which $c \in \mathbb{R}^n$ is the vector of cost coefficients, $A \in \mathbb{R}^{m\times n}$ is the coefficient matrix, and $b \in \mathbb{R}^m$ is the right-hand side vector; $I$ and $C$ denote the index sets of the integer (binary) and continuous variables, respectively.
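Written out under those conventions (a reconstruction from the symbols above; the paper's exact write-up may differ in minor details), the model reads

$$\min_{x}\ c^{\top}x \quad \text{s.t.}\quad Ax \le b,\qquad x_i \in \{0,1\}\ \ \forall i\in I,\qquad x_i \ge 0\ \ \forall i\in C.$$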

02 Learning Branching Decisions

Rather than proposing a new branching heuristic, this paper uses machine learning to build a branching strategy by imitating the decisions produced by another strategy. The goal is to obtain a more efficient approximation of strong branching.

2.1 Data Set Generation

The first step is to collect data pairs $(\phi_i, y_i)$, where $\phi_i$ is the feature vector of a candidate branching variable at the current node and $y_i$ is the label associated with those features, and then to train a model on these pairs. The data can be collected by solving the training problems with strong branching and recording the intermediate information of the search, namely the features and label values of the branching variables at every node; these records are finally used as input to the machine learning algorithm to train the model (a small sketch of this collection loop is given below).
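As a minimal sketch under stated assumptions (not the authors' code): extract_features and run_strong_branching_bnb are hypothetical helpers, and the label is taken here to be the strong-branching score of the candidate variable.

```python
def collect_training_data(training_problems, extract_features, run_strong_branching_bnb):
    """Hypothetical collection loop: run B&B with strong branching on each
    training problem and record one (features, label) pair per candidate
    variable at every explored node."""
    dataset = []  # list of (phi_i, y_i) pairs
    for problem in training_problems:
        # run_strong_branching_bnb is assumed to yield, for each node,
        # the candidate variables together with their strong-branching scores
        for node, candidates, sb_scores in run_strong_branching_bnb(problem):
            for i in candidates:
                phi_i = extract_features(problem, node, i)  # feature vector
                y_i = sb_scores[i]                          # label (SB score)
                dataset.append((phi_i, y_i))
    return dataset
```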

2.2 Machine Learning Algorithm

The learning algorithm used in this paper is Extremely Randomized Trees (ExtraTrees). ExtraTrees is a direct modification of random forests, and it was chosen because it is fairly robust to the choice of its parameters. A minimal training sketch is given below.
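As an illustrative sketch (not the authors' implementation), an ExtraTrees regressor can be trained on the collected pairs with scikit-learn. The mapping of the paper's parameters $N$, $k$, and $n_{min}$ (reported later in Section 04) onto n_estimators, max_features, and min_samples_split is my assumption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# X: one row per recorded candidate variable (its feature vector phi_i)
# y: the corresponding strong-branching-based label y_i
# `dataset` is the list of (phi_i, y_i) pairs from the previous sketch.
X = np.array([phi for phi, _ in dataset])
y = np.array([label for _, label in dataset])

# Assumed parameter mapping: N = 100 trees, k = |phi| (all features
# considered at each split), n_min = 20 (minimum samples to split a node).
model = ExtraTreesRegressor(
    n_estimators=100,
    max_features=None,      # use all features, i.e. k = |phi|
    min_samples_split=20,   # assumed to correspond to n_min
    random_state=0,
)
model.fit(X, y)

# At branching time, score every fractional candidate and pick the best.
def choose_branching_variable(candidate_features):
    scores = model.predict(np.array(candidate_features))
    return int(np.argmax(scores))
```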

03 Features Describing the Current Subproblem

Features are crucial to a machine learning algorithm and play a key role in the effectiveness of the method. On the one hand, they must be complete and precise enough to describe the subproblem accurately; on the other hand, they must be cheap to compute. There is a trade-off between the two, because some features contribute significantly to the quality of the method but are very expensive to compute. For example, the increase in the LP objective obtained when branching on a variable is an excellent feature, and it is exactly what strong branching uses, but it is too costly to compute and is therefore not suitable for the approach in this paper.

The authors discuss at length which features are chosen and the reasoning behind them; interested readers can consult the paper. The detailed feature descriptions below are important, so they are quoted directly from the original paper.

3.1 Static Problem Features

  • The first set of features are computed from the sole parameters $c$, $A$, and $b$. They are calculated once and for all, and they represent the static state of the problem. Their goal is to give an overall description of the problem.

  • Besides the sign of the element $c_i$, we also use $|c_i| / \sum_{j:c_j \ge 0}|c_j|$ and $|c_i| / \sum_{j:c_j < 0}|c_j|$. Distinguishing both is important, because the sign of the coefficient in the cost function is of utmost importance for evaluating the impact of a variable on the objective value.

  • The second class of static features is meant to represent the influence of the coefficients of variable $i$ in the coefficient matrix $A$. We develop three measures—namely, $M^1_j$, $M^2_j$, and $M^3_j$—that describe variable $i$ within the problem in terms of the constraint $j$. Once the values of the measure $M_j$ are computed, the corresponding features added to the feature vector $\phi_i$ are given by $\min_j M_j$ and $\max_j M_j$. The rationale behind this choice is that, when it comes to describing the constraints of a given problem, only the extreme values are relevant.

  • The first measure $M^1_j$ is composed of two parts: $M^{1+}_j$ computed by $A_{ji}/|b_j|,\ \forall j$ such that $b_j \ge 0$, and $M^{1-}_j$ computed by $A_{ji}/|b_j|,\ \forall j$ such that $b_j < 0$. The minimum and maximum values of $M^{1+}_j$ and $M^{1-}_j$ are used as features, to indicate by how much a variable contributes to the constraint violations.

  • Measure $M^2_j$ models the relationship between the cost of a variable and the coefficients of the same variable in the constraints. Similarly to the first measure, $M^2_j$ is split in $M^{2+}_j = |c_i|/A_{ji},\ \forall j$ with $c_i \ge 0$, and $M^{2-}_j = |c_i|/A_{ji},\ \forall j$ with $c_i < 0$. As for the previous measure, the feature vector $\phi_i$ contains both the minimum and the maximum values of $M^{2+}_j$ and $M^{2-}_j$.

  • Finally, the third measure $M^3_j$ represents the inter-variable relationships within the constraints. The measure is split into $M^{3+}_j = |A_{ji}|/\sum_{k:A_{jk} \ge 0}|A_{jk}|$ and $M^{3-}_j = |A_{ji}|/\sum_{k:A_{jk} < 0}|A_{jk}|$. $M^{3+}_j$ is in turn divided in $M^{3++}_j$ and $M^{3+-}_j$ that are calculated using the formula of $M^{3+}_j$ for $A_{ji} \ge 0$ and $A_{ji} < 0$, respectively. The same splitting is performed for $M^{3-}_j$. Again, the minimum and maximum of the four $M^3_j$ computed for all constraints are added to the features. (A small sketch of how a few of these static features could be computed follows this list.)
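For concreteness, here is a rough numpy sketch (my own, not the paper's code) of how the cost-based static features and the $M^1$ measure could be computed for one variable $i$; the other measures follow the same pattern, and the zero fallback for empty constraint groups is an assumption of this sketch.

```python
import numpy as np

def static_features(c, A, b, i):
    """Rough sketch of a few static features for variable i (0-based).

    c: cost vector (n,), A: constraint matrix (m, n), b: rhs vector (m,).
    Returns a plain Python list of feature values.
    """
    feats = [float(np.sign(c[i]))]
    pos, neg = np.abs(c[c >= 0]).sum(), np.abs(c[c < 0]).sum()
    feats.append(abs(c[i]) / pos if pos > 0 else 0.0)
    feats.append(abs(c[i]) / neg if neg > 0 else 0.0)

    # M^1: A_ji / |b_j|, split by the sign of b_j, summarized by min and max
    nonzero = np.abs(b) > 0
    m1_plus = A[(b >= 0) & nonzero, i] / np.abs(b[(b >= 0) & nonzero])
    m1_minus = A[b < 0, i] / np.abs(b[b < 0])
    for m in (m1_plus, m1_minus):
        feats.extend([m.min(), m.max()] if m.size else [0.0, 0.0])
    return feats
```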

3.2 Dynamic Problem Features

The second type of features are related to the solution of the problem at the current B&B node. Those features contain the proportion of fixed variables at the current solution, the up and down fractionalities of variable $i$, the up and down Driebeek penalties (Driebeek 1966) corresponding to variable $i$, normalized by the objective value at the current node, and the sensitivity range of the objective function coefficient of variable $i$, also normalized by $|c_i|$.
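As a small aside, the up and down fractionalities mentioned above are cheap to compute; the sketch below follows the standard definition (my interpretation, not quoted from the paper).

```python
import math

def fractionalities(x_i):
    """Down/up fractionality of the LP value x_i of a variable."""
    down = x_i - math.floor(x_i)   # distance covered when rounding down
    up = math.ceil(x_i) - x_i      # distance covered when rounding up
    return down, up
```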

3.3 Dynamic Optimization Features

The last set of features is meant to represent the overall state of the optimization. These features summarize global information that is not available from the single current node. When branching is performed on a variable, the objective increases are stored for that variable. From these numbers, we extract statistics for each variable: the minimum, the maximum, the mean, the standard deviation, and the quartiles of the objective increases. These statistics are used as features to describe the variable for which they were computed.

As those features should be independent of the scale of the problem, we divide each objective increase by the objective value at the current node, such that the computed statistics correspond to the relative objective increase for each variable. Finally, the last feature added to this subset is the number of times variable i has been chosen as branching variable, normalized by the total number of branchings performed.
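For illustration, the per-variable statistics described above could be computed roughly as follows; this sketch assumes the relative objective increases have already been logged per variable, which is an assumption of the sketch rather than a detail taken from the paper.

```python
import numpy as np

def optimization_features(increase_log, n_branchings, i):
    """Sketch of the dynamic optimization features for variable i.

    increase_log: dict mapping variable index -> list of relative objective
                  increases recorded when branching on that variable (each
                  increase already divided by the node objective value)
    n_branchings: total number of branchings performed so far
    """
    rel = np.array(increase_log.get(i, []), dtype=float)
    if rel.size == 0:
        stats = [0.0] * 7
    else:
        q1, q2, q3 = np.percentile(rel, [25, 50, 75])
        stats = [rel.min(), rel.max(), rel.mean(), rel.std(), q1, q2, q3]
    # how often variable i has been chosen, normalized by total branchings
    freq = len(increase_log.get(i, [])) / max(n_branchings, 1)
    return stats + [freq]
```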

04 Experiments

  • (step 1) we generate a data set $D_{sb}$ using strong branching,
  • (step 2) we learn from it a branching heuristic, and
  • (step 3) we compare the learned heuristic with other branching strategies on various problems.

4.1 Problem Sets

  • two types of problem sets, namely randomly generated problem sets and standard benchmark problems from the MIPLIB.
  • The random problem sets are used for both training (steps 1 and 2) and assessing (step 3) the heuristics. MIPLIB problems are only used for assessment (step 3).

The possible constraints are chosen among set covering (SC), multiknapsack (MKN), bin packing (BP), and equality (EQ) constraints. We generated problems that contain constraints of type BP-EQ, BP-SC, and MKN-SC.

As some of those problems are going to be used to generate the training data set, we randomly split each family into a train and a test set. In the end, we have six data sets: BPEQ_train, BPEQ_test, BPSC_train, BPSC_test, MKNSC_train, and MKNSC_test.

The chosen parameters of ExtraTrees are $N = 100$, $k = |\phi|$, and $n_{min} = 20$.

4.3 Results

We compare our approach to five other branching strategies, namely random branching (random), most-infeasible branching (MIB), nonchimerical branching (NCB) (Fischetti and Monaci 2012), full strong branching (FSB), and reliability branching (RB) (Achterberg et al. 2005).

In these tables (Tables 4, 5, and 7), Cl. Gap refers to the closed gap and S/T indicates the number of problems solved within the provided nodes or time limit versus the total number of problems. Nodes and Time, respectively, represent the number of explored nodes and the time spent (in seconds) before the optimization either finds the optimal solution or stops earlier because of one stopping criterion.

The normal learned branching strategy (i.e., ($n_{min}=20$, all)) is learned based on a data set containing samples from the three types of training problems (i.e., BPSC, BPEQ, and MKNSC).

  • Those results show that our approach succeeds in efficiently imitating FSB. Indeed, the experiments performed with a limit on the number of nodes show that the closed gap is only 9% smaller, while the time spent is reduced by 85% compared to FSB.
  • The experiments with a time limit show that the reduced time required to take a decision allows the learned strategy to explore more nodes and thus to close more of the gap than FSB. While these results are encouraging, they are still slightly worse than the results obtained with RB, which is both closer to FSB and faster than our approach.
  • We separated the problems that were solved by all methods from the problems that were not solved by at least one of the compared methods.
  • The results obtained with the learned branching strategy are still a little below the results obtained with reliability branching. The results presented here are averaged over all considered problems.

Finally, Tables 6 and 7 report the results from our last set of experiments.

Additionally, in another experiment, we let CPLEX use cuts and heuristics (with default CPLEX parameters) in the course of the optimization to observe their impact on the efficiency of each branching strategy.

  • Overall, our method compares favorably to its competitors when cuts and heuristics are used by CPLEX. Indeed, in that case, our learned branching strategy is the fastest (almost three times faster than the second fastest method (i.e., MIB)) to solve all 30 considered problems.
  • Note that the apparent bad results of RB are due to three problems that are especially hard for that branching heuristic (air04, air05, and mod011).
  • Things appear to be different when cuts and heuristics are not used. Indeed, based on the results of Table 7, our method seems to be very slow, but the large number of nodes and the large amount of time is actually due to a small number of problems for which the method does not work well.
  • A possible explanation for why our method does not perform well on those problems is that these problems, being too large, are not well represented in the data set used to learn the branching strategy.

Reference

  • [1] Alejandro Marcos Alvarez, Quentin Louveaux, Louis Wehenkel (2017) A Machine Learning-Based Approximation of Strong Branching. INFORMS Journal on Computing 29(1):185-195. http://dx.doi.org/10.1287/ijoc.2016.0723