
API

find_best_model

find_best_model(
    X,
    y,
    model,
    search_space,
    optimizing_metric,
    k_outer=5,
    skip_outer_folds=None,
    k_inner=5,
    skip_inner_folds=None,
    n_initial_points=5,
    n_calls=10,
    calibrate="no",
    calibrate_params=None,
    other_metrics=None,
    skopt_func=gp_minimize,
    verbose=False,
)

Performs nested cross-validation to find the best classification model. The inner loop performs hyperparameter tuning (using a skopt primitive) and the outer loop computes metrics to assess the quality of the model without risk of overfitting bias.

After the nested loop, the whole procedure is repeated on the full dataset to return a single model trained on all available data.
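
A minimal usage sketch is shown below. It is not taken from the library's own examples: the import path (inferred from the source file listed at the bottom of this page) and the use of the sklearn scorer name "balanced_accuracy" as optimizing_metric are assumptions.

# Minimal sketch: tune a RandomForestClassifier with nested cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from skopt.space import Integer
from nestedcvtraining.api import find_best_model  # assumed import path

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search_space = [
    Integer(50, 300, name="n_estimators"),
    Integer(2, 10, name="max_depth"),
]

best_model, best_params, loop_info = find_best_model(
    X, y,
    model=RandomForestClassifier(random_state=0),
    search_space=search_space,
    optimizing_metric="balanced_accuracy",  # sklearn "greater is better" scorer name, assumed accepted
    k_outer=3,
    k_inner=3,
    n_calls=10,
    verbose=True,
)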

Parameters:

X : array-like of shape (n_samples, n_features), required
    Features.

y : array-like of shape (n_samples,), required
    Targets to predict. They must be discrete (classification); both binary
    and multiclass targets are supported.

model : estimator object, required
    Estimator object implementing fit and predict; the object used to fit the
    data. It can be a complex object like a pipeline or another composite
    object with hyperparameters to tune.

search_space : list of tuple, required
    Search space dimensions provided as a list. Each search dimension should be
    defined as an instance of a skopt Dimension object, that is, Real, Integer
    or Categorical. In addition to the constraints and the prior (when
    applicable), the Dimension object should be named after the corresponding
    parameter of the model, using the double-underscore convention for nested
    objects. See the examples page, and the sketch after this parameter list,
    for several ways to define the search space.

optimizing_metric : str or callable, required
    Strategy used to evaluate the performance of the cross-validated model on
    each inner test set, in order to find the best hyperparameters. It should
    follow the sklearn convention of "greater is better". It can be either:

      • a single string
      • a callable

k_outer : int, default=5
    Number of folds for the outer cross-validation.

skip_outer_folds : list, default=None
    If set, list of outer folds to skip, to reduce computational cost.

k_inner : int, default=5
    Number of folds for the inner cross-validation.

skip_inner_folds : list, default=None
    If set, list of inner folds to skip, to reduce computational cost.

n_initial_points : int, default=5
    Number of initial points to use in the skopt optimization.

n_calls : int, default=10
    Number of additional calls to use in the skopt optimization.

calibrate : str, default="no"
    Whether to calibrate the output probabilities. Options:

      • "no": no calibration at all.
      • "only_best": only the best model of each inner loop is calibrated.
      • "all": all inner models are calibrated (possibly more accurate results,
        but much more time-consuming).

calibrate_params : dict, default=None
    Dictionary of parameters for CalibratedClassifierCV.

other_metrics : dict, default=None
    If not empty, every metric specified here is computed and included in the
    report, both over the inner folds (during tuning) and over the outer folds
    (during performance evaluation). Provide it as a dictionary with metric
    names as keys and callables or strings as values. See the examples page,
    and the sketch after this parameter list.

skopt_func : callable, default=gp_minimize
    Minimization function of the skopt library to use. Available options are:

      • gp_minimize: Bayesian optimization using Gaussian processes.
      • dummy_minimize: random search.
      • forest_minimize: sequential optimization using decision trees.
      • gbrt_minimize: sequential optimization using gradient-boosted trees.

verbose : bool, default=False
    Whether to trace progress or not.
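
The formats of search_space, other_metrics and calibrate_params can be sketched as follows. The pipeline step names, dimension names and metric choices are illustrative only, and passing a scorer built with make_scorer as a callable metric is an assumption.

# Illustrative parameter formats for a two-step pipeline (hypothetical names).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from skopt.space import Categorical, Real

# Composite model: its hyperparameters are addressed below with the
# double-underscore convention "<step name>__<param name>".
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search_space = [
    Real(1e-3, 1e2, prior="log-uniform", name="clf__C"),
    Categorical(["lbfgs", "liblinear"], name="clf__solver"),
]

# Extra metrics reported over inner and outer folds: metric names as keys,
# strings or callables as values (a make_scorer callable is assumed to work).
other_metrics = {
    "accuracy": "accuracy",
    "f1": make_scorer(f1_score),
}

# Forwarded to sklearn's CalibratedClassifierCV when calibrate != "no".
calibrate_params = {"method": "isotonic"}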

Returns:

model : estimator
    Model trained on the full dataset using the same procedure as in the inner
    cross-validation.

params : dict
    Dictionary of (hyper)parameters of the best model.

loop_info : dataclass
    Dataclass with information about the optimization process. The loop_info
    object has a to_dataframe() method that converts the information into a
    dataframe, for easier exploration of the results (see the sketch below).
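
A short sketch of how the returned values might be used, assuming the best_model, best_params and loop_info names from the sketch near the top of this page, and that the underlying classifier exposes predict_proba:

# The returned model is already fitted on the full dataset.
predictions = best_model.predict(X)
probabilities = best_model.predict_proba(X)  # calibrated if calibrate != "no"

# Best hyperparameters found by the final optimization on the full data.
print(best_params)

# Per-fold information collected during the nested loop, as a dataframe.
print(loop_info.to_dataframe())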

Source code in nestedcvtraining/api.py
def find_best_model(
        X,
        y,
        model,
        search_space,
        optimizing_metric,
        k_outer=5,
        skip_outer_folds=None,
        k_inner=5,
        skip_inner_folds=None,
        n_initial_points=5,
        n_calls=10,
        calibrate="no",
        calibrate_params=None,
        other_metrics=None,
        skopt_func=gp_minimize,
        verbose=False,
):
    """Performs nested cross validation to find the best classification model.
    The inner loop does hyperparameter tuning (using a `skopt` primitive)
    and the outer loop computes metrics for assessing the quality of the model
    without risk of overfitting bias.

    After the nested loop, the whole procedure is used with the full dataset to return
    a single model trained on all available data.

    Args:
        X (array-like of shape (n_samples, n_features) ): Features
        y (array-like of shape (n_samples,) ): Targets to predict.
            It has to be discrete (for classification), and both binary and multiclass
            targets are supported.
        model (estimator object): estimator object implementing `fit` and `predict`
            The object to use to fit the data. It can be a complex object like a pipeline
            or another composite object with hyperparameters to tune.
        search_space (list of tuple): Search space dimensions provided as a list.
            Each search dimension should be defined as an instance of
            a `skopt` `Dimension` object, that is, `Real`, `Integer` or
            `Categorical`.
            In addition to the constraints and the prior, when applicable,
            the `Dimension` object should have as name the name of the param
            in the model, using the double underscore convention for nested objects.
            See [examples](examples.md) for several examples for how to provide the search space.
        optimizing_metric (str or callable): Strategy to evaluate
            the performance of the cross-validated model on
            each inner test set, to find the best hyperparameters. It should
            follow the `sklearn` convention of *greater is better*.
            One can use:

            - a single string
            - a callable
        k_outer (int): Number of folds for the outer cross-validation.
        skip_outer_folds (list): If set, list of folds to skip during the loop,
            to reduce computational cost.
        k_inner (int): Number of folds for the inner cross-validation.
        skip_inner_folds (list): If set, list of folds to skip during the loop,
            to reduce computational cost.
        n_initial_points (int): Number of initial points to use in Skopt Optimization.
        n_calls (int): Number of additional calls to use in Skopt Optimization.
        calibrate (str): Whether to calibrate the output probabilities.
            Options:

            - "no" if no calibration at all.
            - "only_best" if only the best model on the inner loop should be calibrated
            - "all" if all inner models should be calibrated (maybe more accurate results,
               but much more time-consuming)
        calibrate_params (dict): Dictionary of params for the CalibratedClassifierCV
        other_metrics (dict): If not empty, in the report output every metric specified in this parameter
            will be computed, showing the results over the inner folds (during tuning)
            and over the outer folds (during performance evaluation).
            The parameter should be provided as a dictionary with metric names as keys
            and callables or str as values. See [examples](examples.md) for examples.
        skopt_func (callable): Minimization function of the skopt library to be used.
            Available options are:

            - gp_minimize: performs bayesian optimization using Gaussian Processes.
            - dummy_minimize: performs a random search.
            - forest_minimize: performs sequential optimization using decision trees.
            - gbrt_minimize: performs sequential optimization using gradient boosting trees.
        verbose (bool): Whether to trace progress or not.

    Returns:
        model (estimator): Model trained with the full dataset using the same procedure
            as in the inner cross validation.
        params (dict): Dictionary of (hyper)parameters of the best model.
        loop_info (dataclass) : Dataclass with information about the optimization process.
            The loop_info object has a method `to_dataframe()` that converts the
            information into a dataframe, for easier exploration of results.
    """
    X, y = check_X_y(X, y,
                     accept_sparse=['csc', 'csr', 'coo'],
                     force_all_finite=False, allow_nd=True)

    if skip_inner_folds is None:
        skip_inner_folds = []
    if skip_outer_folds is None:
        skip_outer_folds = []
    if other_metrics is None:
        other_metrics = []

    if calibrate_params is None:
        calibrate_params = dict()

    outer_cv = StratifiedKFold(n_splits=k_outer)
    loop_info = Report()
    for k, (train_index, test_index) in enumerate(outer_cv.split(X, y)):
        print(f"Looping over {k} outer fold")
        if k not in skip_outer_folds:
            _, _, inner_loop_info = train_model(
                X_outer_train=X[train_index], y_outer_train=y[train_index],
                model=model, search_space=search_space,
                X_outer_test=X[test_index], y_outer_test=y[test_index],
                k_inner=k_inner, skip_inner_folds=skip_inner_folds, outer_kfold=k, outer_test_indexes=test_index,
                n_initial_points=n_initial_points, n_calls=n_calls,
                calibrate=calibrate, calibrate_params=calibrate_params, optimizing_metric=optimizing_metric,
                other_metrics=other_metrics, verbose=verbose, skopt_func=skopt_func)
            loop_info._extend(inner_loop_info)

    # After assessing the procedure, we repeat it on the full dataset:
    final_model, final_params, _ = train_model(
        X_outer_train=X, y_outer_train=y,
        model=model, search_space=search_space,
        X_outer_test=[], y_outer_test=[],
        k_inner=k_inner, skip_inner_folds=skip_inner_folds, outer_kfold=None, outer_test_indexes=None,
        n_initial_points=n_initial_points, n_calls=n_calls,
        calibrate=calibrate, calibrate_params=calibrate_params, optimizing_metric=optimizing_metric, other_metrics={},
        verbose=verbose, skopt_func=skopt_func)
    return final_model, final_params, loop_info
