Manipulating a dataset
In this guide you'll learn how to manipulate and transform your dataset to make it machine-learning ready.
The Machine Learning Module allows you to easily sample, split, filter, and transform your dataset. In the
CONFIGURE DATASET menu, you can find the options to manipulate your
dataset.
For most Machine Learning tasks, it is essential to evaluate your model to get an estimate of its performance.
To do so, you need two different subsets of your data. Use the bigger subset (called training data) to build your model, and later test the performance of your model against the smaller subset (called test data).
It is important to note that the test data is data that the algorithm never saw when building the model. By doing this, you will be able to measure the real performance of your model when a new case appears.
In the Machine Learning Module, you can split your dataset using either a 1-click option or a custom configuration.
The 1-click option divides your dataset into two subsets: 80% of your data to train the model and the remaining 20% to test
it. The module provides two different splitting options: a random and a linear option. If you are training a
classification or regression model, you usually use the random split, which randomly takes instances for each subset.
If you are training a time series model, you need to use the linear split which assumes that the instances are chronologically ordered in the dataset and takes the first 80% for training and the last 20% for testing.
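The difference between the two 1-click options can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the module's implementation: the function name `split_80_20` and its parameters are hypothetical, and rows are represented as a simple list.

```python
import random

def split_80_20(rows, linear=False, rng=None):
    """Return (training, test) lists holding 80% and 20% of the rows.

    Hypothetical helper for illustration only.
    linear=True keeps the original order (first 80% / last 20%);
    linear=False shuffles the rows before cutting.
    """
    ordered = list(rows)
    if not linear:
        # Random split: shuffle a copy so the caller's data is untouched.
        (rng or random).shuffle(ordered)
    cut = len(ordered) * 80 // 100
    return ordered[:cut], ordered[cut:]

rows = list(range(10))  # stand-in for chronologically ordered instances
train, test = split_80_20(rows, linear=True)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

With `linear=True` the test subset always holds the most recent rows, which is why this variant suits time series data.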
In the dataset view, go to 1-Click options and select the most suitable option
for your use case:
1-CLICK RANDOM TRAINING | TEST or
1-CLICK LINEAR TRAINING | TEST (Time Series).
After the request is processed, both subsets are automatically created and displayed in your Dashboard.
You can see the two separate subsets in the dataset list view.
In the Dataset view, click on the configure option menu and select
TRAINING | TEST SPLIT.
A new screen will appear, with the following options.
- Percentage slider: Used to split your data into training and test subsets. The default is 80%/20%.
- Seed: A string input. The string is used to generate deterministic samples.
If you use the same seed for a given dataset, each time you make the training/test split the training and test subsets will contain the same instances.
Otherwise, the instances for each subset will be randomly selected and you will get different training and test sets each time you make a split for a given dataset.
- Linear split: If selected, the subsets will be created taking into account the order of the
instances in your dataset (the first subset of instances for training and the last subset for testing).
Activate this option if you want to train and test a time series model, since the instances are chronologically ordered.
- Dataset names: The names of your training and test datasets.
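The effect of the seed option can be illustrated with Python's standard `random` module, which accepts a string as a seed. This is a minimal sketch under stated assumptions, not the module's own algorithm: `seeded_split` and `train_percent` are hypothetical names chosen for the example.

```python
import random

def seeded_split(rows, seed, train_percent=80):
    """Split rows into (training, test) using a string seed.

    Hypothetical helper: the same seed on the same rows always
    produces the same two subsets.
    """
    rng = random.Random(seed)  # a private generator seeded with the string
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

first = seeded_split(range(20), seed="my-seed")
second = seeded_split(range(20), seed="my-seed")
print(first == second)  # True
```

Calling the function with a different seed, or with no fixed seed at all, would generally place different instances in each subset.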
If you have very large datasets, sampling may be a good way of getting results and iterating faster.
On the configure option menu, select the
After that, you can configure the sampling rate by moving the slider, and you can also name your sample dataset.
If you prefer to sample your dataset differently, you can configure the advanced sampling options in the
configuration panel.
The following options will appear.
- Range: Specify a subset of instances, when the instances are ordered, from which to sample.
For example, choose a range from instances 100 to 200. The specified rate will be applied over the subset configured.
This option may be useful when you have temporal data and you want to train your model with historical data and test it with the most recent data, to check whether it predicts well on later instances.
- Sampling: By default, the module selects the instances for the sample by using a random
number generator, which means two samples from the same dataset will likely be different even when using the
same rates and row ranges, except when the rate is 100% and replacement is not used.
If you choose deterministic sampling, the random-number generator will always use the same seed, thus producing repeatable results. This lets you work with identical samples from the same dataset.
- Replacement: If selected, a single instance can be selected multiple times. By default, the module generates samples without replacement.
- Out-of-bag: Creates a sample containing only the out-of-bag instances for the currently defined
rate, that is, the instances that the regular sample would leave out. The sample will contain a fraction of the
instances equal to one minus the configured rate (when replacement is false).
This can be useful for splitting a dataset into training and testing subsets. It is only selectable when the sample rate is less than 100%.
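The way these options interact can be sketched in plain Python. The function below is a hypothetical illustration, not the module's code: the name `sample`, its parameters, and the choice to treat the range as a half-open interval (start included, end excluded) are all assumptions made for the example.

```python
import random

def sample(rows, rate, row_range=None, seed=None,
           replacement=False, out_of_bag=False):
    """Hypothetical sketch of rate, range, seed, replacement and out-of-bag."""
    # Range: restrict sampling to a slice of the ordered rows (assumed half-open).
    start, end = row_range if row_range else (0, len(rows))
    pool = list(rows[start:end])
    # Seed: seed=None gives a different sample on each call;
    # a fixed seed gives repeatable (deterministic) samples.
    rng = random.Random(seed)
    size = round(len(pool) * rate)
    indices = list(range(len(pool)))
    if replacement:
        # The same instance may be picked more than once.
        picked = [rng.choice(indices) for _ in range(size)]
    else:
        picked = rng.sample(indices, size)
    if out_of_bag:
        # Keep only the instances that were NOT picked.
        chosen = set(picked)
        return [pool[i] for i in indices if i not in chosen]
    return [pool[i] for i in picked]

rows = list(range(1000))
in_bag = sample(rows, 0.8, row_range=(100, 200), seed="demo")
oob = sample(rows, 0.8, row_range=(100, 200), seed="demo", out_of_bag=True)
print(len(in_bag), len(oob))  # 80 20
```

Because both calls use the same seed, rate, and range, the regular sample and the out-of-bag sample are complementary: together they cover every instance in the range exactly once, which is what makes the pair usable as a training and test split.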
On the configure option menu, select the
A screen will open showing the configuration panel for filtering. You need to select a field, a
pre-defined operation and a value to create a filter. When everything is done, click on
CREATE DATASET to create a filtered dataset.
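Conceptually, a filter built from a field, an operation, and a value keeps only the instances for which the comparison holds. The sketch below shows that idea in plain Python; the set of operations, the function name `filter_rows`, and the sample fields `age` and `plan` are hypothetical and may not match the operations the panel offers.

```python
import operator

# Assumed set of comparison operations, for illustration only.
OPERATIONS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def filter_rows(rows, field, operation, value):
    """Return the rows whose `field` satisfies `operation` against `value`."""
    compare = OPERATIONS[operation]
    return [row for row in rows if compare(row[field], value)]

rows = [
    {"age": 25, "plan": "basic"},
    {"age": 41, "plan": "pro"},
    {"age": 37, "plan": "basic"},
]
print(filter_rows(rows, "age", ">", 30))
# [{'age': 41, 'plan': 'pro'}, {'age': 37, 'plan': 'basic'}]
```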