Manipulating a dataset

In this guide you'll learn how to manipulate and transform your dataset so that it is ready for machine learning.


Introduction

The Data Intelligence Module allows you to easily sample, split, filter, and transform your dataset to be machine-learning ready.
In the CONFIGURE DATASET menu, you can find the options to manipulate your dataset, as shown below.

Options for manipulating data, with Sampling and Filtering options highlighted

Splitting Datasets

For most machine learning tasks, it is essential to evaluate your model to get an estimate of its performance.
To do so, you need two different subsets of your data. Use the bigger subset (called training data) to build your model, and later test the performance of your model against the smaller subset (called test data).

It is important to note that the test data is data that the algorithm never saw when building the model. Evaluating against it therefore measures how your model is likely to perform when a new, unseen case appears.

In the Data Intelligence module, you can split your dataset using either the 1-click option or custom configurations.


Splitting Dataset with 1-Click

This option divides your dataset into two subsets: 80% of your data to train the model and the remaining 20% to test it. The module provides two splitting options, random and linear. If you are training a classification or regression model, you usually use the random split, which randomly assigns instances to each subset.
If you are training a time series model, you need to use the linear split, which assumes that the instances are chronologically ordered in the dataset and takes the first 80% for training and the last 20% for testing.
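
The difference between the two options can be sketched in plain Python. This is only an illustration of the behavior described above, not the module's own code or API; the function names `random_split` and `linear_split` are hypothetical.

```python
import random

def random_split(rows, train_pct=80):
    """Shuffle a copy of the rows, then cut it into training and test."""
    shuffled = list(rows)
    random.shuffle(shuffled)
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]

def linear_split(rows, train_pct=80):
    """Keep the original order: first part for training, last part for test."""
    ordered = list(rows)
    cut = len(ordered) * train_pct // 100
    return ordered[:cut], ordered[cut:]

# Ten instances, assumed to be in chronological order.
instances = list(range(10))
train, test = linear_split(instances)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

With the linear split, the test set always holds the most recent instances, which is why it suits time series. The random split mixes instances from the whole dataset into both subsets.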

In the dataset view, go to 1-Click options and select the most suitable option for your use case - 1-CLICK RANDOM TRAINING | TEST or 1-CLICK LINEAR TRAINING | TEST (Time Series).

Options for 1-click splitting highlighted

After the request is processed, both subsets are automatically created and displayed in your Dashboard.
You can see the two separate subsets in the dataset list view.

Split datasets

Splitting Dataset with custom configurations

In the Dataset view, click on the configure option menu and select TRAINING | TEST SPLIT.

Split datasets

A new screen will appear, with the following options.

  • Percentage slider: Used to split your data into training and test subsets. Default is 80%/20%.
  • Seed: Input for a string. The string is used to generate deterministic samples and get repeatable results.
    If you use the same seed for a given dataset, each time you make the training/test split the training and test subsets will contain the same instances.
    Otherwise, the instances for each subset will be randomly selected and you will get different training and test sets each time you make a split for a given dataset.
  • Linear split: If selected, the subsets will be created taking into account the order of the instances in your dataset (the first subset of instances for training and the last subset for testing).
    Activate this option if you want to train and test a time series model, since its instances are chronologically ordered.
  • Dataset names: The names of your training and test datasets.

Split Dataset configuration screen
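
To illustrate why a seed gives repeatable results, here is a minimal Python sketch. It is an assumption about the general technique (seeding a random number generator with a string), not a description of how the module implements it; `seeded_split` is a hypothetical name.

```python
import random

def seeded_split(rows, train_pct=80, seed=None):
    """Split rows into (training, test) using a seedable shuffle.

    With seed=None the generator is seeded from system entropy, so each
    call gives a different split. With a fixed string seed, the shuffle
    order (and therefore both subsets) is the same on every call.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_pct // 100
    return shuffled[:cut], shuffled[cut:]

instances = list(range(100))
first = seeded_split(instances, seed="my-seed")
second = seeded_split(instances, seed="my-seed")
print(first == second)  # True: same seed, same subsets
```

A linear split would simply skip the shuffle, so the seed has no effect on it.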

Sampling Datasets - Basic options

If you have very large datasets, sampling may be a good way of getting results and iterating faster.

On the configure option menu, select the SAMPLE option.

Sample dataset option highlighted

After that, you can configure the sampling rate by moving the slider, and you can also name your sample dataset.

Basic sampling menu

Sampling Datasets - Advanced options

If you want more control over how your dataset is sampled, you can configure advanced options in the sampling configuration panel.
The following options are available.

  • Range: Specify a subset of instances, when the instances are ordered, from which to sample.
    For example, choose a range from instances 100 to 200. The specified rate will be applied over the subset configured.
    This option may be useful when you have temporal data and you want to train your model with historical data and test it with the most recent data, to check whether it can predict over time.
  • Sampling: By default, the module selects the instances for your sample by using a random number generator, which means two samples from the same dataset will likely be different even when using the same rates and row ranges. The exception is a rate of 100% without replacement, which always returns every instance.
    If you choose deterministic sampling, the random-number generator will always use the same seed, thus producing repeatable results. This lets you work with identical samples from the same dataset.
  • Replacement: If selected, a single instance can be selected multiple times. By default, the module generates samples without replacement.
  • Out-of-bag: Creates a sample containing only the out-of-bag instances for the currently defined rate, that is, the instances that a regular sample at that rate would leave out. When replacement is not selected, the sample therefore contains a fraction of the instances equal to one minus the configured rate (for example, a rate of 80% yields 20% of the instances).
    This can be useful for splitting a dataset into training and testing subsets. It is only selectable when the sample rate is less than 100%.

Advanced sampling menu
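
The options above can be combined as in the following Python sketch. It only mirrors the behavior described in the list and is not the module's API: the function `sample`, its parameter names, and the treatment of the range as a Python-style slice (start included, end excluded) are all assumptions made for illustration.

```python
import random

def sample(rows, rate, row_range=None, seed=None,
           replacement=False, out_of_bag=False):
    """Return a sample of rows (hypothetical helper, not the module's API)."""
    rows = list(rows)
    if row_range is not None:
        # Range: restrict sampling to a slice of the ordered instances.
        start, end = row_range
        rows = rows[start:end]
    # A fixed seed makes the sample deterministic; None gives a new one each time.
    rng = random.Random(seed)
    count = round(len(rows) * rate)
    positions = range(len(rows))
    if replacement:
        # With replacement, the same instance may be picked more than once.
        picked = [rng.randrange(len(rows)) for _ in range(count)]
    else:
        picked = rng.sample(positions, count)
    if out_of_bag:
        # Out-of-bag: keep only the instances that were NOT picked.
        chosen = set(picked)
        picked = [i for i in positions if i not in chosen]
    return [rows[i] for i in picked]

data = list(range(1000))
in_bag = sample(data, 0.8, seed="demo")
oob = sample(data, 0.8, seed="demo", out_of_bag=True)
print(len(in_bag), len(oob))  # 800 200
```

Because both calls use the same seed and rate, the in-bag and out-of-bag samples are complementary, which is what makes the out-of-bag option usable as a training/test split.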

Filtering Datasets

On the configure option menu, select the FILTER option.

Configuration menu with Filter option highlighted

A screen will open showing the configuration panel for filtering. To create a filter, select a field, a predefined operation, and a value. When you are done, click on CREATE DATASET to create a filtered dataset.

Filter Configuration menu
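
Conceptually, a filter keeps only the instances for which the chosen operation holds between the field and the value. The Python sketch below illustrates that idea. The set of operations, the function name `filter_dataset`, and the sample fields are hypothetical and do not reflect the operations actually offered in the panel.

```python
import operator

# Hypothetical set of predefined operations, keyed by a display symbol.
OPERATIONS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def filter_dataset(rows, field, operation, value):
    """Keep the rows where `row[field] <operation> value` is true."""
    compare = OPERATIONS[operation]
    return [row for row in rows if compare(row[field], value)]

rows = [
    {"age": 25, "plan": "basic"},
    {"age": 41, "plan": "premium"},
    {"age": 33, "plan": "basic"},
]
adults_over_30 = filter_dataset(rows, "age", ">", 30)
print(adults_over_30)  # the rows with age 41 and 33
```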