Tutorial: Hyper-parameter Optimisation for Machine Learning Models using MiP-EGO

When you start using machine learning, the choices you have to make can be overwhelming. There are many machine learning algorithms and models already available in packages like scikit-learn. Which of these algorithms should you use? Which will perform best for your specific dataset or the problem you want to solve?

And once you have chosen one of these models, each algorithm usually has many hyper-parameters. Hyper-parameters are options of the algorithm that are, most of the time, essential to its performance.

For example, if we want to perform a classification task, we might want to use a Support Vector Classifier (SVC) due to its often outstanding performance. However, SVC has a staggering 15 hyper-parameters. Not all of them are essential to tune, but the C, kernel, degree and gamma parameters are crucial for its performance.
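You can check this yourself by inspecting the hyper-parameters scikit-learn exposes on SVC (the exact count and defaults can vary slightly between scikit-learn versions):

```python
from sklearn.svm import SVC

# List all hyper-parameters of SVC and their default values
params = SVC().get_params()
print(len(params))          # number of hyper-parameters
print(params['C'], params['kernel'], params['degree'], params['gamma'])
```

The four parameters highlighted above are all in this list, alongside options such as tol and class_weight that usually matter less.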

How to best tune these hyper-parameters?

There are a few strategies we can use when choosing hyper-parameters. In some cases we can simply try the default values first. For example, if we want to use Random Forests, we can start with the default parameter settings to see whether this kind of model performs well for us; Random Forests are quite robust and do not require a lot of hyper-parameter tuning. However, for models such as SVC this is not the case, and the defaults can give us a wrong impression from the start.
If there is only a limited set of options for the hyper-parameters, we can apply grid search or random search; see the scikit-learn documentation for a code example.
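As a minimal illustration, a grid search over two SVC hyper-parameters with scikit-learn's GridSearchCV could look like this (the grid values here are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# A small, hand-picked grid over two of SVC's hyper-parameters
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=4)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note that this grid only covers 6 of the infinitely many possible configurations, which is exactly the limitation discussed next.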
But grid search and random search are far from optimal: they only evaluate a subset of the parameter space, and they usually take a long time to execute if done properly. Instead, you should use a hyper-parameter optimisation algorithm. One package that can be very handy here is one I developed together with my colleague Hao Wang: MiP-EGO, which stands for Mixed-Integer Parallel Efficient Global Optimisation. The algorithm is designed to handle all kinds of mixed-integer optimisation problems (meaning the problem can have integer, real-valued and categorical inputs); hyper-parameter optimisation is just one kind of problem it can solve efficiently. It is also very easy to set up and run, and it is sample-efficient, meaning it finds a good solution with only a few evaluations. MiP-EGO uses Bayesian optimisation as the underlying technique and builds a surrogate model to learn about the problem it is optimising. For the scientific background, see the original EGO paper and the MiP-EGO paper.

To use it though, you do not necessarily need to know all the underlying details (though it is interesting!).

So let's start by installing MiP-EGO. Make sure you have Python 3 installed, and use pip or pip3 to install the mipego package. Installing mipego automatically installs numpy and a couple of other required packages in case you do not have them already.

pip install mipego

For this example we will use the Iris dataset and an SVC from scikit-learn. First, import the required libraries:

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

#import our package, the surrogate model and the search space classes
from mipego import ParallelBO
from mipego.Surrogate import RandomForest
from mipego.SearchSpace import ContinuousSpace, NominalSpace, OrdinalSpace

After importing all required libraries, let us load the Iris dataset from sklearn and define the data and target of the task.


# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

After loading the data we define the search space of the optimization problem. You can easily create variables using the ContinuousSpace, OrdinalSpace and NominalSpace classes; adding them together with + gives you the complete search space object.

# First we need to define the Search Space
# the search space consists of one continuous variable
# one ordinal (integer) variable
# and two categorical (nominal) variables.
Cvar = ContinuousSpace([1.0, 20.0], 'C') # one continuous variable with label C
degree = OrdinalSpace([2,6], 'degree') 
gamma = NominalSpace(['scale', 'auto'], 'gamma') 
kernel = NominalSpace(['linear', 'poly', 'rbf', 'sigmoid'], 'kernel') 

#the complete search space is just the sum of the parameter spaces
search_space = Cvar + gamma + degree + kernel

After defining the search space, we have to define the objective function that we want to optimize. In our case this means training an SVC model with the hyper-parameters given by a configuration (c) and computing a cross-validation score. Here we simply use the cross_val_score method from scikit-learn and return the mean cross-validation score multiplied by -1, because the optimizer treats the problem as a minimization problem by default.


#now we define the objective function (the model optimization)
def train_model(c):
    #define the model
    # We will use a Support Vector Classifier
    svm = SVC(kernel=c['kernel'], gamma=c['gamma'], C=c['C'], degree=c['degree'])
    cv = KFold(n_splits=4, shuffle=True, random_state=42)

    # Compute the 4-fold cross-validation score
    cv_score = cross_val_score(svm, X=X_iris, y=y_iris, cv=cv)

    #by default mip-ego minimises, so we reverse the accuracy
    return -1 * np.mean(cv_score)
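Before handing the objective function to the optimizer, it is worth sanity-checking it with a hand-picked configuration. This standalone sketch repeats the imports and data loading from above so it can run on its own:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

X_iris, y_iris = load_iris(return_X_y=True)

def train_model(c):
    # Train an SVC with the given configuration and return the
    # negated mean cross-validation accuracy (lower is better)
    svm = SVC(kernel=c['kernel'], gamma=c['gamma'], C=c['C'], degree=c['degree'])
    cv = KFold(n_splits=4, shuffle=True, random_state=42)
    return -1 * np.mean(cross_val_score(svm, X=X_iris, y=y_iris, cv=cv))

# Evaluate one configuration by hand; the degree value is ignored
# by the rbf kernel but must still be present in the dict
score = train_model({'C': 1.0, 'kernel': 'rbf', 'gamma': 'scale', 'degree': 3})
print(score)
```

If this returns a value close to -1 (high accuracy, negated), the objective function is wired up correctly.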

Once we have defined the objective function and the search space, it is time to create a standard RandomForest surrogate model and execute the Parallel Bayesian Optimization optimizer.
We set several options here: we want to evaluate at most 40 configurations (max_FEs), start with 5 random configurations (DoE_size), evaluate 3 configurations in parallel (n_job and n_point), and optimize using the MGFI acquisition function (the default for parallel optimization).

model = RandomForest(levels=search_space.levels)
opt = ParallelBO(
    search_space=search_space, 
    obj_fun=train_model, 
    model=model, 
    max_FEs=40, 
    DoE_size=5,    # the initial DoE size
    eval_type='dict',  # has to be dict for parallel evaluations
    acquisition_fun='MGFI',
    acquisition_par={'t' : 2},
    n_job=3,       # number of processes
    n_point=3,     # number of the candidate solution proposed in each iteration
    verbose=True   # turn this off, if you prefer no output
)
xopt, fopt, stop_dict = opt.run()

After running the optimizer we get the best configuration found so far (xopt), the performance of that configuration (fopt), and the reason the optimizer stopped (most likely because max_FEs was reached).

The output for me was:

xopt: {'C': 6.099174228041696, 'gamma': 'scale', 'degree': 2, 'kernel': 'rbf'}
fopt: -0.9797297297297298
stop criteria: {'max_FEs': 41}

Now we know that we can use an SVC with an rbf kernel and C equal to 6.1 to solve the Iris classification task well (with a mean accuracy of 0.98, where an SVC with default parameters has a mean accuracy of 0.96). Of course this is a relatively easy test problem, but more advanced classification and regression problems benefit even more from optimizing the hyper-parameters.
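To reproduce the comparison with the default settings, you can run both configurations through the same cross-validation (exact scores may vary slightly between scikit-learn versions):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=42)

# Default SVC versus the configuration found by the optimizer
default_score = np.mean(cross_val_score(SVC(), X, y, cv=cv))
tuned_score = np.mean(cross_val_score(SVC(C=6.1, kernel='rbf', gamma='scale'), X, y, cv=cv))
print(default_score, tuned_score)
```

The same KFold splitter (with the same random_state) is used for both models, so the two scores are directly comparable.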

Here is the complete source code:

#import packages
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

#import our package, the surrogate model and the search space classes
from mipego import ParallelBO
from mipego.Surrogate import RandomForest
from mipego.SearchSpace import ContinuousSpace, NominalSpace, OrdinalSpace

# Load the dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# First we need to define the Search Space
# the search space consists of one continuous variable
# one ordinal (integer) variable
# and two categorical (nominal) variables.
Cvar = ContinuousSpace([1.0, 20.0], 'C') # one continuous variable with label C
degree = OrdinalSpace([2,6], 'degree') 
gamma = NominalSpace(['scale', 'auto'], 'gamma') 
kernel = NominalSpace(['linear', 'poly', 'rbf', 'sigmoid'], 'kernel') 

#the complete search space is just the sum of the parameter spaces
search_space = Cvar + gamma + degree + kernel

#now we define the objective function (the model optimization)
def train_model(c):
    #define the model
    # We will use a Support Vector Classifier
    svm = SVC(kernel=c['kernel'], gamma=c['gamma'], C=c['C'], degree=c['degree'])
    cv = KFold(n_splits=4, shuffle=True, random_state=42)

    # Compute the 4-fold cross-validation score
    cv_score = cross_val_score(svm, X=X_iris, y=y_iris, cv=cv)

    #by default mip-ego minimises, so we reverse the accuracy
    return -1 * np.mean(cv_score)


model = RandomForest(levels=search_space.levels)
opt = ParallelBO(
    search_space=search_space, 
    obj_fun=train_model, 
    model=model, 
    max_FEs=40, 
    DoE_size=5,    # the initial DoE size
    eval_type='dict',
    acquisition_fun='MGFI',
    acquisition_par={'t' : 2},
    n_job=3,       # number of processes
    n_point=3,     # number of the candidate solution proposed in each iteration
    verbose=True   # turn this off, if you prefer no output
)
xopt, fopt, stop_dict = opt.run()

print('xopt: {}'.format(xopt))
print('fopt: {}'.format(fopt))
print('stop criteria: {}'.format(stop_dict))
