Automating predictions with OptiML
In this guide you'll learn what OptiML is and how you can use it to automate the creation of models.
OptiML is an automated optimization process for model selection and parametrization (or hyper-parametrization) to solve classification and regression problems. Selecting the right algorithm and its optimal parameter values is currently a manual and time-consuming process which requires a high level of expertise. This is because different supervised learning algorithms can be used to solve similar problems and it is very difficult to know in advance which one is the most suitable for your problem until you try them all.
OptiML uses Bayesian parameter optimization for model selection and parameter tuning. It sequentially tries groups of parameters, training and evaluating models with them, and uses the results to choose the next group of parameters to try. The model search is guided by an optimization metric that you can configure. When the process finishes, the top-performing models are returned in a list so that you can compare them and select the one that best fits your needs.
On the dataset view, go to the configuration menu and select the
The OptiML configuration screen will open, with the following options.
- Objective Field: a numeric or categorical field that you want to predict.
- Max. Training Time: The maximum time allowed for the OptiML to run. Since at least four models are guaranteed, the maximum training time can be overrun significantly when big datasets are coupled with low time limits. Conversely, if all the model candidates are trained faster than the maximum time set, the OptiML will finish early.
The main challenge is to set a time high enough to allow as many model candidates as possible to be created. The optimum time depends on your dataset characteristics. For example, a higher number of rows and fields, or certain field types, can slow down the OptiML creation.
- Number of model candidates: The maximum number of different models, each with a unique configuration, to be trained and evaluated during the OptiML process. The default is 128, which is usually enough to find the best model, but you can set it to any value from 4 to 200.
The top-performing half of the model candidates will be returned in the final result. Therefore, if you choose to evaluate 128 models, the 64 top-performing models will be returned, unless the maximum training time halts the process first.
The OptiML will usually create a higher number of models than the candidates requested due to cross-validation. These extra models have the same parametrization as the candidates but they use different dataset samples.
- OptiML name: the name of your OptiML request.
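The relationship between the number of candidates and the size of the final list can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function name is hypothetical, rounding down for odd counts is an assumption, and the sketch ignores the case where the maximum training time stops the process early.

```python
# Limits and default taken from the option description above.
MIN_CANDIDATES = 4
MAX_CANDIDATES = 200
DEFAULT_CANDIDATES = 128


def expected_models_returned(number_of_candidates=DEFAULT_CANDIDATES):
    """Return the size of the final list for a run that is not cut short.

    Hypothetical helper: the top-performing half of the candidates is kept.
    Rounding down for odd counts is an assumption of this sketch.
    """
    if not MIN_CANDIDATES <= number_of_candidates <= MAX_CANDIDATES:
        raise ValueError(
            f"number_of_candidates must be between {MIN_CANDIDATES} "
            f"and {MAX_CANDIDATES}"
        )
    return number_of_candidates // 2
```

For example, `expected_models_returned(128)` gives 64, which matches the example in the option description.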
If you want to configure which models are optimized and how, you can do so in the Advanced configuration dropdown.
Select the types of models that you want to optimize. The options are the following.
- Decision Trees
- Ensembles (up to 256 trees)
- Logistic Regressions (only classifications)
By default all types of models are optimized, except in the case of logistic regression, which cannot be trained
for regression problems. At least one model per type requested is guaranteed to be evaluated during the
However, one or more of the selected algorithms may not rank among the top-performing half of the models, in which case they will not be included in the final list of models.
By default, the module performs Monte Carlo cross-validation during the optimization process to evaluate the models and select the top-performing ones. Cross-validation evaluations usually yield more accurate results than single evaluations, since they avoid the potential error of randomly selecting an overly optimistic test split. Smaller datasets, which are presumably less representative of the whole population, usually show a higher variation in performance depending on the random split.
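To make the idea of Monte Carlo cross-validation concrete, here is a minimal Python sketch that repeatedly draws random train/test splits and averages the resulting scores. The function names, the 20% test fraction, and the number of repeats are assumptions chosen for illustration; they are not the values OptiML uses internally.

```python
import random


def monte_carlo_splits(n_rows, test_fraction=0.2, repeats=5, seed=0):
    """Yield (train_indices, test_indices) for several independent random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_rows * test_fraction))
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        # The first n_test shuffled rows are held out, the rest are used for training.
        yield indices[n_test:], indices[:n_test]


def cross_validated_score(rows, train_and_score, **split_args):
    """Average the score returned by train_and_score(train, test) over all splits."""
    scores = []
    for train_idx, test_idx in monte_carlo_splits(len(rows), **split_args):
        train = [rows[i] for i in train_idx]
        test = [rows[i] for i in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)
```

Here `train_and_score` stands for any function you supply that trains a model on the first argument and returns its evaluation metric on the second. Averaging over several splits dampens the effect of a single lucky or unlucky split.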
There are some problems that cannot be evaluated using a random split. For example, problems where you need to
train the model using past data and evaluate it with the most recent data like sales forecasting or stock market
For those cases, you can select a test dataset that OptiML will use during the optimization process to evaluate the models. To avoid unrealistically good evaluations due to the lack of cross-validation, the module takes several subsets of the training data to create the models and evaluates them using the testing dataset.
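As a rough illustration of evaluating against a fixed test dataset, the following Python sketch holds out the most recent rows and scores models trained on several random subsets of the older rows. All names, the number of subsets, and the 80% subset size are assumptions made for this example rather than documented OptiML settings.

```python
import random


def chronological_split(rows, key, test_fraction=0.2):
    """Sort rows by time and hold out the most recent ones as the test set."""
    ordered = sorted(rows, key=key)
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def training_subsets(train_rows, n_subsets=3, rate=0.8, seed=0):
    """Draw several random subsets (without replacement) of the training rows."""
    rng = random.Random(seed)
    size = max(1, round(len(train_rows) * rate))
    return [rng.sample(train_rows, size) for _ in range(n_subsets)]


def score_on_fixed_test(train_rows, test_rows, train_and_score, **subset_args):
    """Average train_and_score(subset, test_rows) over all training subsets."""
    subsets = training_subsets(train_rows, **subset_args)
    scores = [train_and_score(subset, test_rows) for subset in subsets]
    return sum(scores) / len(scores)
```

The test rows never change between runs, so the variation in the averaged score comes only from which training rows each model sees.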
Deepnets are an exception
Deepnets have their own optimization logic, so the test dataset selection does not apply to them. Cross-validation is always performed to optimize deepnets.
The optimization metric is the one used for model selection during the optimization process. For classification problems, the maximum phi coefficient is used by default; for regression problems, the default is R squared. You can also use any of the metrics listed in the table below.
Deepnets are an exception
Deepnets have their own optimization logic, so the optimization metric and positive class configurations do not apply to them. For classification problems, deepnets are always optimized using a combination of several metrics for all the classes in the objective field.
|Metric|Description|
|---|---|
|Accuracy|The percentage of correctly classified instances over the total instances evaluated. It can take values between 0% and 100%.|
|Precision|The percentage of correctly predicted instances over the total instances predicted for the positive class. It can take values between 0% and 100%.|
|Recall|The percentage of correctly classified instances over the total actual instances for the positive class. It can take values between 0% and 100%.|
|F-measure|The balanced harmonic mean between precision and recall. It can take values between 0 and 1.|
|Phi Coefficient|The correlation coefficient between the predicted and actual values. It can take values between -1 and 1.|
|Max. phi coefficient|The maximum phi coefficient taking into account all the possible probability thresholds of the evaluation. It can take values between -1 and 1.|
|ROC AUC|The Area Under the ROC Curve measures the trade-off between sensitivity and specificity. It can take values between 0 and 1.|
|Precision-Recall AUC|The Area Under the Precision-Recall Curve measures the trade-off between precision and recall. It can take values between 0 and 1.|
|K-S Statistic|Measures the maximum difference between the True Positive Rate (TPR) and the False Positive Rate (FPR) over all possible thresholds. It can take values between 0% and 100%.|
|Kendall’s Tau|Based on all possible pairs of rankings considering each instance in the testing dataset. It measures the degree of correlation between the ranked instances. It can take values between -1 and 1.|
|Spearman’s Rho|The degree of correlation between the model predictions and the actual values of the testing dataset. It can take values between -1 and 1.|
|R Squared|Measures how much better the model is than always predicting the mean value of the target variable. It can take values from -1 to 1.|
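To see how the threshold-free classification metrics relate to each other, the following Python sketch computes them from the four cells of a binary confusion matrix using their standard textbook definitions. The function name is hypothetical, and the sketch returns fractions between 0 and 1 where the table above quotes percentages.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from a binary confusion matrix.

    tp, fp, tn, fn are the counts of true positives, false positives,
    true negatives and false negatives for the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # The phi coefficient is the correlation between predicted and actual classes.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    phi = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "phi": phi,
    }
```

With 40 true positives, 10 false positives, 40 true negatives and 10 false negatives, accuracy, precision, recall and F-measure all equal 0.8, while the phi coefficient is 0.6.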
Choosing the right metric for the OptiML to optimize depends on your data characteristics, and mostly on whether you need to set a threshold.
When evaluating or predicting with your model, you can set a probability (or confidence) threshold for a selected class, known as the positive class. The model then predicts the positive class only if its probability (or confidence) is greater than the threshold; otherwise it predicts the negative class.
Depending on whether you need to set a threshold, you can follow the rules below:
- If you need to set a threshold, use the maximum phi, the K-S statistic, the ROC AUC, the precision-recall AUC, Kendall’s Tau or Spearman’s Rho.
- If you need to set a threshold and your problem is a ranking problem where you care more about the instances with higher probability (for example, customer churn, where a limited budget means you can only target the top n customers most likely to churn), then the choice is reduced to the maximum phi or the K-S statistic.
- If you do not need to set a threshold and your data is balanced (i.e., all classes have a similar number of instances), then you can use the accuracy, the recall or the precision.
- If you do not need to set a threshold and your data is unbalanced (i.e., some classes have significantly more instances than others), then you can use the phi coefficient or the F-measure.
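The threshold-based metrics in the rules above can be illustrated with a short Python sketch that sweeps candidate thresholds over a set of predicted probabilities and keeps the best phi coefficient and the largest gap between the true positive rate and the false positive rate. The function names and the choice of candidate thresholds (zero plus every observed probability) are assumptions of this example, so the results may differ from the way OptiML evaluations compute these values.

```python
import math


def confusion_at(threshold, probabilities, labels):
    """Count TP, FP, TN, FN when predicting positive for probability > threshold."""
    tp = fp = tn = fn = 0
    for probability, is_positive in zip(probabilities, labels):
        predicted_positive = probability > threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def phi_coefficient(tp, fp, tn, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def threshold_metrics(probabilities, labels):
    """Return the maximum phi and the K-S statistic over all candidate thresholds."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    max_phi, ks = -1.0, 0.0
    for threshold in sorted({0.0, *probabilities}):
        tp, fp, tn, fn = confusion_at(threshold, probabilities, labels)
        max_phi = max(max_phi, phi_coefficient(tp, fp, tn, fn))
        # K-S: largest difference between TPR and FPR, here as a fraction.
        ks = max(ks, tp / positives - fp / negatives)
    return {"max_phi": max_phi, "ks_statistic": ks}
```

Both metrics summarize how well the model separates the classes at its best threshold, which is why they suit problems where you intend to pick a threshold afterwards.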
Apart from the optimization metric, for classification problems you can also select a positive class to be optimized; otherwise, the average metric for all classes will be optimized.
Some instances in your dataset may be more important than others. You can assign specific weights to instances in two ways: using Objective weights and using a Weight field.
In classification problems, it is not unusual for a dataset to have an unbalanced objective field, where some categories are common and others very rare. In some cases, such as fraud prediction, you may be more interested in predicting rare values than in successfully predicting frequent ones. In this case, you may want to assign more weight to the scarce instances so they are equivalent to the abundant ones.
The module oversamples your weighted instances, replicating them as many times as the weight establishes.
For example, the majority class, FALSE, has a weight of 1, while the minority class, TRUE, has a weight of 3.
This option is valid only for classification models and you can combine it with the Weight field. When combining it with the Weight field, both weights are multiplied. For example, if you assign a weight of 3 to the “True” class and the weight field assigns a weight of 2 to a given instance labeled as “True”, that instance will have a total weight of 6.
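The way both kinds of weights combine, and the oversampling they lead to, can be sketched as follows. This is a simplified illustration with hypothetical function names; it assumes integer weights and represents instances as plain dictionaries, which is not necessarily how the module implements replication.

```python
def total_weight(label, objective_weights, weight_field_value=1):
    """Multiply the class weight (default 1) by the instance's weight field value."""
    return objective_weights.get(label, 1) * weight_field_value


def oversample(rows, objective_field, objective_weights, weight_field=None):
    """Replicate each row as many times as its total weight (integers assumed)."""
    replicated = []
    for row in rows:
        field_value = row[weight_field] if weight_field else 1
        weight = total_weight(row[objective_field], objective_weights, field_value)
        replicated.extend([row] * int(weight))
    return replicated
```

For instance, `total_weight("True", {"True": 3}, 2)` returns 6, matching the example above.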
The Weight field option allows you to assign individual weights to each instance by choosing a special weight field. It can be used for both regression and classification models. The selected field must be numeric and it must not contain any negative or missing values. The weight field will be excluded from the input fields when building the models. You can select an existing field in your dataset or you may create a new one in order to assign customized weights.
However, the weight field works differently for deepnets: it modifies the loss function to include the instance weight. The outcome is similar to the oversampling technique.
When training the OptiML, the module may encounter missing values in your dataset, which can be either considered or ignored.
The way to include or exclude missing values from your models depends on the model type:
- For decision trees and ensembles, you can configure the Missing splits parameter.
- For logistic regressions and deepnets, you can configure the Default numeric value and the Missing numerics parameters.
By default, decision trees and ensembles ignore instances that contain missing values (numeric or categorical) for a given field. If the missing values in your data may be important for predicting the objective field, you can include them using the Missing splits option.
If missing values are included, you may find rules with predicates of the following kind:
field x = "is
field x = "y or is missing".
The module includes missing values following the MIA (missing incorporated in attributes) approach.
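To illustrate how predicates that include missing values behave, the short Python sketch below evaluates the two kinds of rules mentioned above. It represents a missing value as `None`, and both function names are hypothetical; it shows only the predicate logic, not how the module chooses its splits.

```python
MISSING = None  # assumption: missing values are represented as None


def is_missing(value):
    """Predicate of the kind 'field is missing'."""
    return value is MISSING


def equals_or_missing(value, target):
    """Predicate of the kind 'field = target or is missing'."""
    return is_missing(value) or value == target
```

With these predicates, an instance whose field is empty follows the same branch as instances that match the target value, instead of being ignored.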
Default Numeric Value and Missing Numerics
By default, logistic regressions and deepnets always include missing values (categorical and numeric), but you can choose to exclude the missing numeric values. If the missing numeric values in your data are not important for predicting the objective field, you can exclude or replace them using the Missing numerics or the Default numeric value parameters.
If the missing numeric values are not important for your objective field and you do not mind removing instances to train the models, you may opt to disable the Missing numerics option.
If you do not want to include the missing values but you also do not want to remove the instances that contain them, you can use the Default numeric value option, which replaces the missing numeric values with the mean, median, zero, maximum or minimum.
The missing categorical values are always included for logistic regressions and deepnets.
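The effect of these two options on a numeric field can be sketched in Python as follows. The function and argument names are illustrative only, missing values are represented as `None`, and letting the default numeric value take precedence when both options are set is an assumption of this sketch.

```python
import statistics

# Replacement strategies listed above, computed from the non-missing values.
REPLACEMENTS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "zero": lambda values: 0,
    "maximum": max,
    "minimum": min,
}


def prepare_numeric(rows, field, missing_numerics=True, default_numeric_value=None):
    """Return the rows as they would be used for training, for one numeric field."""
    if default_numeric_value is not None:
        present = [row[field] for row in rows if row[field] is not None]
        fill = REPLACEMENTS[default_numeric_value](present)
        # Keep every instance, replacing missing values with the chosen statistic.
        return [{**row, field: fill} if row[field] is None else row for row in rows]
    if not missing_numerics:
        # Missing numerics disabled: drop the instances with a missing value.
        return [row for row in rows if row[field] is not None]
    # Default behavior: keep the instances and their missing values.
    return list(rows)
```

Replacing values keeps the dataset size intact, whereas disabling missing numerics trades instances for a cleaner input field.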
You can use a subset of the instances in your dataset to create the OptiML. The rate option allows you to specify the percentage of instances you want to use to train your models.
You can enable notifications before creating an OptiML. If you click the bell, you'll receive an e-mail notifying you when the OptiML has been created.
To find out more about OptiML and its capabilities, please refer to its documentation.