Automating predictions with OptiML

In this guide you'll learn what OptiML is and how you can use it to automate the creation of models.


What is OptiML?

OptiML is an automated optimization process for model selection and parametrization (or hyper-parametrization) to solve classification and regression problems. Selecting the right algorithm and its optimal parameter values is currently a manual and time-consuming process which requires a high level of expertise. This is because different supervised learning algorithms can be used to solve similar problems and it is very difficult to know in advance which one is the most suitable for your problem until you try them all.

OptiML uses Bayesian parameter optimization for model selection and parameter tuning. It sequentially tries groups of parameters, training and evaluating models with them, and uses the results to choose the next group of parameters to try. The model search is guided by an optimization metric that can be configured by the user. When the process finishes, a list of the top-performing models is returned in an interface where you can compare them and select the one that best fits your needs.
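The sequential search loop can be sketched as follows. This is only a minimal illustration of metric-guided sequential search, not BigML's actual Bayesian optimizer; the `evaluate` function and both parameters are hypothetical stand-ins.

```python
import random

def evaluate(params):
    # Hypothetical stand-in for training a model with `params` and
    # returning the optimization metric measured on a holdout set.
    return 1.0 - (params["depth"] - 7) ** 2 / 100 - abs(params["rate"] - 0.1)

def sequential_search(n_rounds=20, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_rounds):
        # A real Bayesian optimizer proposes the next group of parameters
        # from a surrogate model fitted to past results; this sketch simply
        # samples near the best configuration found so far.
        if best_params is None:
            params = {"depth": rng.randint(1, 15), "rate": rng.uniform(0.01, 1.0)}
        else:
            params = {
                "depth": max(1, best_params["depth"] + rng.randint(-2, 2)),
                "rate": max(0.01, best_params["rate"] * rng.uniform(0.5, 1.5)),
            }
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Each round trains and evaluates a candidate, and the results steer where the next group of parameters is drawn from.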


OptiML with custom configurations - Basic

On the dataset view, go to the configuration menu and select the OptiML option.

Configure menu with OptiML option highlighted

The OptiML configuration screen will open, with the following options.

  • Objective Field: a numeric or categorical field that you want to predict.
    OptiML configuration menu with Objective Field highlighted
  • Max. Training Time: The maximum time allowed for the OptiML run. Since at least four models are guaranteed, the maximum training time can be overrun significantly when big datasets are coupled with short times. Conversely, if all the model candidates are trained faster than the maximum time set, the OptiML will finish earlier.
    The main challenge is to set a time high enough to allow the creation of as many model candidates as possible. The optimum time depends on your dataset characteristics. For example, a higher number of rows and fields, or certain field types, can slow down the OptiML creation.
    OptiML configuration menu with Max. Training Time highlighted
  • Number of model candidates: The maximum number of different models, each with a unique configuration, to be trained and evaluated during the OptiML process. The default is 128, which is usually enough to find the best model, but you can set it anywhere from 4 up to 200.
    The top-performing half of the model candidates will be returned in the final result. Therefore, if you choose to evaluate 128 models, the top-performing 64 models will be returned, unless the maximum training time halts the process first.
    The OptiML will usually create a higher number of models than the candidates requested due to cross-validation. These extra models have the same parametrization as the candidates but they use different dataset samples.
    OptiML configuration menu with Number of model candidates highlighted
  • OptiML name: the name of your OptiML request.
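The options above map to arguments of an OptiML request when using BigML's Python bindings (the `bigml` package). The sketch below builds such a request; the argument names follow BigML's API documentation for OptiML resources but should be treated as assumptions for your binding version, and the objective field name and dataset ID are placeholders.

```python
# Sketch of an OptiML request; field names mirror the UI options above.
optiml_args = {
    "objective_field": "churn",   # numeric or categorical field to predict
    "max_training_time": 1800,    # seconds; at least 4 models are guaranteed
    "model_candidates": 128,      # between 4 and 200
    "name": "churn optiml",       # name of the OptiML request
}

# With credentials configured, the request itself would look like:
# from bigml.api import BigML
# api = BigML()
# optiml = api.create_optiml("dataset/5f6e...", optiml_args)

# The top-performing half of the candidates is returned in the result:
returned_models = optiml_args["model_candidates"] // 2
```

With the default of 128 candidates, `returned_models` is 64, matching the behavior described above.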

OptiML with custom configurations - Advanced

If you want to configure which models are optimized and how, you can do so in the Advanced configuration dropdown.


Models to Optimize

Select the types of models that you want to optimize. The options are the following.

  • Decision Trees
  • Ensembles (up to 256 trees)
  • Logistic Regressions (classification only)
  • Deepnets

By default, all model types are optimized, except logistic regression, which cannot be trained for regression problems. At least one model per requested type is guaranteed to be evaluated during the optimization process.
However, one or more of the selected algorithms may not land among the top-performing half of the models, in which case they will not be included in the final list of models.

OptiML advanced configuration menu with Models to Optimize highlighted

Evaluation

By default, the module performs Monte Carlo cross-validation during the optimization process to evaluate the models and select the top-performing ones. Cross-validation usually yields more accurate results than a single evaluation because it avoids the error that can arise from randomly selecting an overly optimistic train/test split. Smaller datasets, which are presumably less representative of the whole population, show higher variation in performance depending on the random split.

OptiML advanced configuration menu with Evaluations highlighted
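Monte Carlo cross-validation can be sketched as repeated random train/test splits whose metric scores are averaged. This is a generic illustration, not BigML's internal implementation; `train_and_score` is a hypothetical stand-in for training a model on the train split and scoring it on the test split.

```python
import random

def monte_carlo_cv(instances, train_and_score, n_splits=5, train_rate=0.8, seed=0):
    """Average the metric over several random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = instances[:]
        rng.shuffle(shuffled)          # new random split each round
        cut = int(len(shuffled) * train_rate)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```

Because each candidate configuration is scored on several different random splits, a single unluckily optimistic split cannot dominate the comparison.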



Optimization Metric and Positive Class

The optimization metric is the one used for model selection during the optimization process. For classification problems, the maximum phi coefficient is used by default; for regression problems, the default is R squared. You can also choose from the metrics below.

Classification Metrics

  • Accuracy: Percentage of correctly classified instances over the total instances evaluated. It can take values between 0% and 100%.
  • Precision: The percentage of correctly predicted instances over the total instances predicted for the positive class. It can take values between 0% and 100%.
  • Recall: The percentage of correctly classified instances over the total actual instances for the positive class. It can take values between 0% and 100%.
  • F-measure: The balanced harmonic mean between precision and recall. It can take values between 0 and 1.
  • Phi Coefficient: The correlation coefficient between the predicted and actual values. It can take values between -1 and 1.
  • Max. phi coefficient: The maximum phi coefficient taking into account all the possible probability thresholds of the evaluation. It can take values between -1 and 1.
  • ROC AUC: The Area Under the ROC Curve measures the trade-off between sensitivity and specificity. It can take values between 0 and 1.
  • Precision-Recall AUC: The Area Under the Precision-Recall Curve measures the trade-off between precision and recall. It can take values between 0 and 1.
  • K-S Statistic: Measures the maximum difference between the True Positive Rate (TPR) and the False Positive Rate (FPR) over all possible thresholds. It can take values between 0% and 100%.
  • Kendall’s Tau: Based on all possible pairs of rankings of the instances in the testing dataset. It measures the degree of correlation between the ranked instances. It can take values between -1 and 1.
  • Spearman’s Rho: The degree of correlation between the model predictions and the actual values of the testing dataset. It can take values between -1 and 1.
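The basic classification metrics can all be computed from a binary confusion matrix, as the following self-contained sketch shows (these are the standard textbook formulas, which is also how the ranges quoted above arise):

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F-measure and the phi
    coefficient from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)            # correct over predicted positives
    recall = tp / (tp + fn)               # correct over actual positives
    f_measure = 2 * precision * recall / (precision + recall)
    # Phi: correlation between predicted and actual labels, in [-1, 1].
    phi = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure, "phi": phi}
```

For example, a matrix with 40 true positives, 10 false positives, 20 false negatives and 30 true negatives yields an accuracy of 0.7 and a precision of 0.8.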

Regression Metrics

  • R Squared: Measures how much better the model is than always predicting the mean value of the target variable. It can take values from -1 to 1.

Choosing the right metric for the OptiML to optimize depends on your data characteristics, and mostly on whether you need to set a threshold.
When evaluating or predicting with your model, you can set a probability (or confidence) threshold for a selected class, known as the positive class. The model then predicts the positive class only if the probability (or confidence) is greater than the threshold set; otherwise it predicts the negative class.
Depending on whether you need to set a threshold, you can follow the rules below:

  • If you need to set a threshold, use the maximum phi, the K-S statistic, the ROC AUC, the precision-recall AUC, the Kendall’s Tau or the Spearman’s Rho.
  • If you need to set a threshold and your problem is a ranking problem where you care more about the instances with higher probability (for example, customer churn with a limited budget, where you can only target the top n customers most likely to churn), then the decision is reduced to the maximum phi or the K-S statistic.
  • If you do not need to set a threshold and your data is balanced (i.e., all classes have a similar number of instances), then you can use the accuracy, the recall or the precision.
  • If you do not need to set a threshold and your data is unbalanced (i.e., some classes have significantly more instances than others), then you can use the phi coefficient or the f-measure.
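The thresholded prediction rule described above can be sketched in a few lines; the class names and probabilities below are purely illustrative:

```python
def predict_with_threshold(probabilities, positive_class, threshold):
    """Predict the positive class only when its probability exceeds the
    threshold; otherwise predict the most likely remaining class.

    `probabilities` maps each class label to its predicted probability.
    """
    if probabilities[positive_class] > threshold:
        return positive_class
    rest = {c: p for c, p in probabilities.items() if c != positive_class}
    return max(rest, key=rest.get)
```

Lowering the threshold makes the model predict the positive class more often (raising recall at the cost of precision), which is exactly the trade-off the threshold-aware metrics above summarize.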

Apart from the optimization metric, for classification problems you can also select a positive class to be optimized; otherwise, the average metric for all classes will be optimized.

OptiML advanced configuration menu with optimization metric and positive class options highlighted

Weights

Some instances in your dataset may be more important than others, so you can assign specific weights to instances in two ways: using Objective Weights or a Weight Field.

Objective Weights

In classification problems, it is not unusual for a dataset to have an unbalanced objective field, where some categories are common and others very rare. In some cases, such as fraud prediction, you may be more interested in predicting the rare values than in successfully predicting the frequent ones. In that case, you may want to assign more weight to the scarce instances so they are equivalent to the abundant ones.

The module oversamples your weighted instances, replicating them as many times as the weight establishes.
For example, if the majority class, FALSE, has a weight of 1 and the minority class, TRUE, has a weight of 3, each TRUE instance is replicated three times.
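The oversampling described above can be sketched as follows; the field name `y` and the FALSE/TRUE weights are taken from the example, everything else is illustrative:

```python
def oversample(instances, label_field, weights):
    """Replicate each instance as many times as its class weight, e.g.
    weights {"FALSE": 1, "TRUE": 3} triples every TRUE instance."""
    out = []
    for row in instances:
        out.extend([row] * weights[row[label_field]])
    return out
```

A dataset of one FALSE and one TRUE instance thus grows to four rows: one FALSE and three copies of the TRUE instance.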

OptiML advanced configuration menu with weight options highlighted

Weight Field

The Weight field option allows you to assign individual weights to each instance by choosing a special weight field. It can be used for both regression and classification models. The selected field must be numeric and it must not contain any negative or missing values. The weight field will be excluded from the input fields when building the models. You can select an existing field in your dataset or you may create a new one in order to assign customized weights.

However, the weight field works differently for deepnets: it modifies the loss function to include the instance weight. The outcome is similar to the oversampling technique.

OptiML advanced configuration menu with weight options highlighted

Missing Values

When training the OptiML, the module may encounter missing values in your dataset, which can be either considered or ignored.
The way to include or exclude missing values from your models depends on the model type:

  • For decision trees and ensembles, you can configure the Missing splits parameter.
  • For logistic regressions and deepnets, you can configure the Default numeric value and the Missing numerics parameters.

Missing Splits

By default, decision trees and ensembles ignore instances that contain missing values (numeric or categorical) for a given field. If the missing values in your data may be important to predict the objective field, then you can include them using the Missing splits option.
If missing values are included, you may find rules with predicates of the following kind: field x = "is missing" or field x = "y or is missing".

OptiML advanced configuration menu with missing splits option highlighted

Default Numeric Value and Missing Numerals

By default, logistic regressions and deepnets always include missing values (categorical and numeric), but you can choose to exclude missing numeric values. If the missing numeric values in your data are not important to predict the objective field, you can exclude or replace them using the Default numeric value or the Missing numerics parameters.

If the missing numeric values are not important for your objective field and you do not mind removing instances to train the models, you can disable the Missing numerics option.
If you do not want to include the missing values but also do not want to remove the instances that contain them, you can use the Default numeric value. This option replaces the missing numeric values with the mean, median, zero, maximum or minimum of the field.
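The Default numeric value strategies amount to a simple per-field replacement, which can be sketched as follows (missing values are represented here as `None`; the strategy labels mirror the UI options):

```python
from statistics import mean, median

def fill_missing(values, strategy):
    """Replace missing numeric values (None) using one of the
    Default numeric value strategies."""
    present = [v for v in values if v is not None]
    fill = {
        "mean": mean(present),
        "median": median(present),
        "zero": 0,
        "maximum": max(present),
        "minimum": min(present),
    }[strategy]
    return [fill if v is None else v for v in values]
```

For instance, filling `[1, None, 3]` with the mean yields `[1, 2, 3]`, while the zero strategy yields `[1, 0, 3]`.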

OptiML advanced configuration menu with missing numerics and default numerics options highlighted


Sampling

You can specify a subset of the instances of your dataset to create the OptiML. The Rate option allows you to specify the percentage of instances you want to use to train your models.
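Conceptually, the rate just selects a random subset of the given size, as in this illustrative sketch (not BigML's internal sampler):

```python
import random

def sample_rate(instances, rate, seed=0):
    """Draw a random subset covering `rate` percent of the instances."""
    k = round(len(instances) * rate / 100)
    return random.Random(seed).sample(instances, k)
```

With a rate of 80, a 200-row dataset contributes 160 randomly chosen rows to training.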

OptiML advanced configuration menu with sampling options highlighted

Notifications

You can enable notifications before creating an OptiML. If you click the bell, you'll receive an e-mail when the OptiML has been created.

OptiML advanced configuration menu with notification bell highlighted