Transforming your data
This guide shows you how to add fields, aggregate instances, and join, merge, and order datasets.
If you need to create new fields (i.e., feature engineering), the ML module lets you do so by applying common operations to your existing data or by writing custom operations with Flatline formulas.
To start, access the configuration option menu and select ADD FIELDS.
This leads you to a configuration panel for adding fields, where you can name the new field, choose the operation you wish to apply, and select the existing field used to generate the new one.
For details on all the options for adding fields, such as discretization, replacing missing values, normalizing, and many others, see Subsection 8.1 of this documentation, which defines each of the operations you can apply to an existing field to create a new one.
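To make the idea of deriving a new field from an existing one concrete, here is a minimal sketch in plain Python. It is not the module's API and it is not a Flatline formula; the `add_field` helper, the `Amount` field, and the three operations chosen (mean imputation, min-max normalization, a two-bin discretization) are illustrative assumptions only.

```python
from statistics import mean

def add_field(rows, new_name, operation):
    """Return copies of the rows, each extended with new_name = operation(row).

    Hypothetical helper for illustration; not part of the ML module.
    """
    return [{**row, new_name: operation(row)} for row in rows]

# Toy dataset; "Amount" is an assumed field name.
rows = [
    {"CustomerID": "A", "Amount": 10.0},
    {"CustomerID": "B", "Amount": None},   # missing value
    {"CustomerID": "C", "Amount": 30.0},
]

# 1. Replace missing values with the mean of the known values.
fill_value = mean(r["Amount"] for r in rows if r["Amount"] is not None)
rows = add_field(
    rows, "AmountFilled",
    lambda r: fill_value if r["Amount"] is None else r["Amount"],
)

# 2. Normalize the filled field to the range [0, 1] (min-max scaling).
low = min(r["AmountFilled"] for r in rows)
high = max(r["AmountFilled"] for r in rows)
rows = add_field(
    rows, "AmountNorm",
    lambda r: (r["AmountFilled"] - low) / (high - low),
)

# 3. Discretize the normalized field into two categories.
rows = add_field(
    rows, "AmountBand",
    lambda r: "high" if r["AmountNorm"] >= 0.5 else "low",
)

for r in rows:
    print(r)
```

Each step leaves the original field untouched and appends a new one, which mirrors the way the ADD FIELDS panel generates a new field from an existing field.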
The aggregating instances option allows you to group the rows of a dataset by a given field.
The example above can be easily executed in the machine learning module by following these steps:
Find the AGGREGATE INSTANCES option in the dataset configuration menu.
When the configuration panel has been displayed, select a field to aggregate your instances. You can select any type of field (numeric, categorical, text or datetime fields) and your instances will be grouped by the unique values of this field. In this case, we select “CustomerID” because we want a dataset with one row per customer.
The next steps are described in Subsection 8.2 of this documentation.
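As a rough illustration of what grouping by “CustomerID” produces, the sketch below aggregates a toy list of purchases into one row per customer using plain Python. The `aggregate_instances` function, the `Amount` field, and the choice of count and sum as aggregations are assumptions made for this example; the aggregation functions actually offered by the module are covered in Subsection 8.2.

```python
from collections import defaultdict

def aggregate_instances(rows, group_field, value_field):
    """Group rows by the unique values of group_field.

    Hypothetical helper: returns one row per group with a count of
    instances and the sum of value_field.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return [
        {
            group_field: key,
            "count": len(values),
            f"sum_{value_field}": sum(values),
        }
        for key, values in groups.items()
    ]

# Toy dataset with one row per purchase.
purchases = [
    {"CustomerID": "A", "Amount": 10.0},
    {"CustomerID": "B", "Amount": 5.0},
    {"CustomerID": "A", "Amount": 7.5},
]

per_customer = aggregate_instances(purchases, "CustomerID", "Amount")
print(per_customer)
# [{'CustomerID': 'A', 'count': 2, 'sum_Amount': 17.5},
#  {'CustomerID': 'B', 'count': 1, 'sum_Amount': 5.0}]
```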
It is very common to have the data scattered in two or more different datasets. Our module allows you to join several datasets to combine their fields and instances based on one or more related fields.
First of all, you need to find both sources in the machine learning module and create datasets from each source. When the datasets are created, find the JOIN DATASETS option in the dataset configuration menu.
This option will display the join configuration panel in which you need to input parameters.
For the remaining steps and details on configuring these parameters, see Subsection 8.3 of this documentation.
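Conceptually, a join combines the fields of two datasets for the rows that share a value in a related field. The following plain-Python sketch assumes an inner join on a single field; the `inner_join` function and the `Country` and `Amount` fields are hypothetical, and the join types and parameters the module actually supports are the ones described in Subsection 8.3.

```python
def inner_join(left_rows, right_rows, key):
    """Combine fields from both datasets for rows whose key values match.

    Hypothetical helper: rows without a match on the other side are
    omitted, which is the behavior of an inner join.
    """
    # Index the right dataset by the join field for fast lookup.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})
    return joined

orders = [
    {"CustomerID": "A", "Amount": 10.0},
    {"CustomerID": "A", "Amount": 7.5},
    {"CustomerID": "C", "Amount": 3.0},   # no matching customer
]
customers = [
    {"CustomerID": "A", "Country": "ES"},
    {"CustomerID": "B", "Country": "US"},  # no matching order
]

joined = inner_join(orders, customers, "CustomerID")
print(joined)
# [{'CustomerID': 'A', 'Amount': 10.0, 'Country': 'ES'},
#  {'CustomerID': 'A', 'Amount': 7.5, 'Country': 'ES'}]
```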
If you have instances spread across different datasets and want to merge them all into a single dataset, use the merging datasets option.
From one of the datasets, open the CONFIGURE DATASET menu. By convention, this first dataset defines the final dataset fields. All datasets should have the same field names and IDs.
If this first dataset has fields that are not found in the other datasets, the merge will fail with an error. However, if the other datasets have fields that are not found in the first dataset, you can still execute the merge, and these extra fields will be dropped from the final dataset.
Find the MERGE DATASETS option in the dataset configuration menu.
Select the datasets you want to merge.
See Subsection 8.4 of this documentation for the next steps to complete your merge.
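The field rules for merging can be summarized in a short plain-Python sketch: the first dataset defines the final fields, a field missing from a later dataset is an error, and extra fields in later datasets are dropped. The `merge_datasets` function and the `Amount` and `Coupon` fields are hypothetical, and the sketch matches fields by name only; it does not model field IDs.

```python
def merge_datasets(first, *others):
    """Append the rows of the other datasets to the first one.

    Hypothetical helper. The first dataset defines the final fields:
    - a field of the first dataset missing elsewhere raises an error;
    - extra fields in the other datasets are dropped.
    """
    fields = list(first[0].keys())
    merged = [dict(row) for row in first]
    for position, dataset in enumerate(others, start=2):
        for row in dataset:
            missing = [f for f in fields if f not in row]
            if missing:
                raise ValueError(
                    f"dataset {position} is missing fields: {missing}"
                )
            # Keep only the fields defined by the first dataset.
            merged.append({f: row[f] for f in fields})
    return merged

january = [{"CustomerID": "A", "Amount": 10.0}]
february = [{"CustomerID": "B", "Amount": 5.0, "Coupon": "X1"}]

# "Coupon" is not in the first dataset, so it is dropped.
merged = merge_datasets(january, february)
print(merged)
# [{'CustomerID': 'A', 'Amount': 10.0}, {'CustomerID': 'B', 'Amount': 5.0}]

# Reversing the order fails: "Coupon" is missing from the january rows.
try:
    merge_datasets(february, january)
except ValueError as error:
    print("merge failed:", error)
```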
The ordering instances option allows you to sort the rows of a dataset by one or more selected fields in ascending or descending order. The instances are sorted by the first selected field, then by the second field, and so on. You can select up to 8 different sorting fields. This option is very useful for time series, when you have a dataset containing a date field and need to sort your instances chronologically. Follow these steps:
From the dataset view, click on the ORDER INSTANCES menu option.
You cannot select the full date-time field to sort instances, but you can select the expanded fields (year, month, day of month, etc.) to do so. Remember that when you select multiple fields to sort your instances, the first field determines the primary order, the second field orders the instances that share the same value of the first field, and so on. That’s why we need to select the larger date unit first, in this case the year, and then the next date unit, in this case the month.
A new dataset will be created with the sorted instances. You can see the confirmation message in blue at the top of the dataset view.
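The effect of sorting by year first and month second can be sketched in plain Python. The `order_instances` function and the `Sales` field are hypothetical; the sketch relies on Python's stable sort, applying the least significant field first so that the first selected field ends up deciding the primary order, and it returns a new list in the same way the module creates a new dataset.

```python
MAX_SORT_FIELDS = 8  # limit stated in this guide

def order_instances(rows, sort_fields):
    """Return a new list of rows sorted by several fields.

    Hypothetical helper. sort_fields is a list of (field, descending)
    pairs, most significant field first.
    """
    if len(sort_fields) > MAX_SORT_FIELDS:
        raise ValueError(f"at most {MAX_SORT_FIELDS} sorting fields allowed")
    ordered = list(rows)
    # Stable sort: apply the least significant field first, so that the
    # most significant field (applied last) decides the primary order.
    for field, descending in reversed(sort_fields):
        ordered.sort(key=lambda row: row[field], reverse=descending)
    return ordered

sales = [
    {"year": 2021, "month": 3, "Sales": 40},
    {"year": 2020, "month": 12, "Sales": 25},
    {"year": 2021, "month": 1, "Sales": 30},
    {"year": 2020, "month": 5, "Sales": 10},
]

# Chronological order: year ascending, then month ascending.
chronological = order_instances(sales, [("year", False), ("month", False)])
print([(r["year"], r["month"]) for r in chronological])
# [(2020, 5), (2020, 12), (2021, 1), (2021, 3)]
```

Selecting the month before the year would group all Januaries together regardless of year, which is why the larger date unit goes first.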