Creating and configuring a dataset

In this guide, you'll learn what a dataset is and how to create and configure one in the Data Intelligence Module.


Introduction

A dataset is a structured version of your data that you can transform and use to build machine learning models. The Datasets section lists all the datasets you have created.

Screenshot of Dataset section

The Datasets section computes general statistics for each dataset and individual statistics for each field. The per-field statistics are described below.

  • Count: the number of instances that contain data for this field.
  • Missing: the number of instances that have no value for this field.
  • Errors: the total number of format errors for this field, along with a sample of the ill-formatted tokens.
  • Histogram: the underlying distribution of the field's values. Depending on the size of your dataset and the number of unique values, the histogram may be exact or an approximation.
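To make these statistics concrete, here is a minimal Python sketch that computes them for a single numeric field. It is an illustration only, not the module's implementation: the function name `field_summary`, the choice of empty strings and `None` as missing markers, and the decision to count ill-formatted tokens separately from valid and missing values are all assumptions.

```python
from collections import Counter


def field_summary(values, parse=float, missing_tokens=("", None), max_samples=5):
    """Summarize one field: count, missing, errors, and an exact histogram.

    Hypothetical helper for illustration; the real module may count
    ill-formatted tokens differently.
    """
    count = 0
    missing = 0
    errors = 0
    error_samples = []
    histogram = Counter()

    for raw in values:
        if raw in missing_tokens:
            missing += 1
            continue
        try:
            parsed = parse(raw)
        except (TypeError, ValueError):
            # Ill-formatted token: count it and keep a small sample.
            errors += 1
            if len(error_samples) < max_samples:
                error_samples.append(raw)
            continue
        count += 1
        histogram[parsed] += 1

    return {
        "count": count,
        "missing": missing,
        "errors": errors,
        "error_samples": error_samples,
        "histogram": dict(sorted(histogram.items())),
    }


summary = field_summary(["1.5", "2", "", "abc", "2", None])
```

For this sample input, the summary reports a count of 3, 2 missing values, and 1 error with "abc" as the sampled token. The histogram here is exact; for large datasets with many unique values, a binned approximation would be the usual substitute.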

Creating a dataset with 1-click

First, access the Machine Learning tool in the Data Intelligence Module.

Then, in the source view of the Sources section, click the 1-click action menu and select the 1-CLICK DATASET option.

Screenshot of Sources section in source view, with 1-click dataset highlighted

You can also create a 1-click dataset from the pop-up menu in the source list view.

Screenshot of Sources section in source list view, with 1-click dataset highlighted

Creating a dataset with custom configurations

In the Sources section, in source view, click CONFIGURE DATASET. A new screen appears.
To change the default name, type a new name in the dataset name box.

Screenshot of Configure Dataset screen, with configure dataset button and dataset name box highlighted

By default, the dataset uses all the data in your source. To use only part of the data, move the dataset size slider.

Screenshot of Configure Dataset screen, with dataset size slider highlighted

You can also exclude fields when creating a dataset. The following image shows the options to include or exclude all fields at once.

Screenshot of Configure Dataset screen, with include all and exclude all options highlighted

You can also select or deselect fields one at a time.

Screenshot of Configure Dataset screen, with select field option highlighted
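The effect of these settings (name, size, and field selection) can be sketched in a few lines of Python. This is only a conceptual model, not the module's API: `configure_dataset`, `size_fraction`, and `excluded_fields` are hypothetical names, and this guide does not state how the module chooses rows when the slider is below 100%, so the sketch simply keeps the leading rows.

```python
def configure_dataset(rows, name, size_fraction=1.0, excluded_fields=()):
    """Build a dataset from a list of row dicts (hypothetical helper).

    size_fraction mimics the dataset size slider; excluded_fields mimics
    deselecting fields on the Configure Dataset screen.
    """
    if not 0 < size_fraction <= 1:
        raise ValueError("size_fraction must be greater than 0 and at most 1")

    # Assumption: a smaller size keeps the leading rows of the source.
    keep = max(1, int(len(rows) * size_fraction)) if rows else 0
    excluded = set(excluded_fields)

    selected = [
        {field: value for field, value in row.items() if field not in excluded}
        for row in rows[:keep]
    ]
    return {"name": name, "rows": selected}


source_rows = [
    {"id": 1, "age": 34, "plan": "basic"},
    {"id": 2, "age": 51, "plan": "pro"},
    {"id": 3, "age": 27, "plan": "basic"},
    {"id": 4, "age": 45, "plan": "pro"},
]

dataset = configure_dataset(
    source_rows,
    name="customers-sample",
    size_fraction=0.5,
    excluded_fields=["id"],
)
```

With the defaults (`size_fraction=1.0` and no excluded fields), the result contains every row and every field, which corresponds to creating the dataset without changing any setting.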

When everything is set, click the Create dataset button.


Updating fields

You can change field names, labels/tags and descriptions by clicking the edit icon next to the field name.

You can also set a field as the objective field for models, ensembles, and logistic regressions. The objective field is the field these models learn to predict, which makes it a key field.

Screenshot of Update Field screen

You can also set a field as non-preferred. Non-preferred fields are not taken into account when generating your models.
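As a rough illustration of how the objective and non-preferred settings interact, the sketch below derives the input fields a model would train on. The function `model_inputs` and the sample field names are hypothetical, and the sketch assumes that the objective field itself is never used as an input.

```python
def model_inputs(fields, objective_field, non_preferred=()):
    """Return the fields a model would use as inputs (hypothetical helper)."""
    if objective_field not in fields:
        raise ValueError(f"unknown objective field: {objective_field}")

    # Assumption: the objective field is predicted, never used as an input,
    # and non-preferred fields are skipped entirely.
    skipped = set(non_preferred) | {objective_field}
    return [field for field in fields if field not in skipped]


fields = ["id", "age", "plan", "churned"]
inputs = model_inputs(fields, objective_field="churned", non_preferred=["id"])
```

Here the model would predict `churned` from `age` and `plan`, while `id` is left out because it is marked as non-preferred.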