Creating and configuring a dataset
In this guide you'll learn what a dataset is and how to create and configure one in the Machine Learning Module.
A dataset is a structured version of your data that you can transform for use in machine learning models. The Datasets section lists all the datasets you have created.
The Datasets section computes general statistics for the dataset as a whole and individual statistics for each field. The statistics shown for each field are described below.
- Count: the number of instances containing data for this field.
- Missing: the number of instances with no value for this field.
- Errors: information about ill-formatted values, including the total number of format errors for the field and a sample of the ill-formatted tokens.
- Histograms: show the underlying distribution of the data in each field. Depending on the size of your dataset and the number of unique values, a histogram may be exact or an approximation.
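To make the per-field statistics concrete, here is a minimal Python sketch of how they could be computed for a single numeric field. It is an illustration only, not the module's implementation: the function name field_stats is hypothetical, and it assumes that None and empty strings count as missing, that a value is an error when it cannot be parsed as a number, and that Count covers only the values that parsed correctly.

```python
from collections import Counter


def field_stats(values, parse=float):
    """Summarise one field: count, missing, errors, and an exact histogram.

    Assumptions (not taken from the guide): None and blank strings are
    "missing"; a value is an "error" when `parse` rejects it; "count"
    covers only the values that parsed correctly.
    """
    missing = 0
    error_tokens = []
    parsed = []
    for raw in values:
        if raw is None or str(raw).strip() == "":
            missing += 1
            continue
        try:
            parsed.append(parse(raw))
        except (TypeError, ValueError):
            error_tokens.append(raw)
    return {
        "count": len(parsed),
        "missing": missing,
        # Total number of format errors plus a small sample of bad tokens.
        "errors": {"total": len(error_tokens), "sample": error_tokens[:5]},
        # Exact histogram: one bucket per distinct value.
        "histogram": Counter(parsed),
    }


ages = ["34", "27", "", "n/a", "34", None, "51"]
stats = field_stats(ages)
print(stats["count"], stats["missing"], stats["errors"])
```

For very large fields, an exact histogram with one bucket per distinct value becomes impractical, which is one reason a histogram may be shown as an approximation.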
In the Sources section, in the source view, click the 1-click action menu and select the 1-CLICK DATASET option.
You can also create a 1-click dataset from the pop-up menu in the source list view.
In the Sources section, in the source view, click DATASET. A new screen appears.
Change the default name by typing a new name in the dataset name box.
By default, all the data in your source is used in the dataset. If you don't want to use all of your data, move the dataset size slider to use only part of it.
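As a rough picture of what reducing the dataset size means, the sketch below keeps a percentage of the rows. The guide does not say how the rows are chosen, so random sampling without replacement is an assumption here, and the names sample_rows, size_percent, and seed are hypothetical.

```python
import random


def sample_rows(rows, size_percent, seed=0):
    """Keep about size_percent % of the rows.

    Assumption: rows are picked at random without replacement; the
    module itself may choose them differently.
    """
    if not 0 < size_percent <= 100:
        raise ValueError("size_percent must be greater than 0 and at most 100")
    if not rows:
        return []
    # Always keep at least one row so the result is never empty.
    k = max(1, round(len(rows) * size_percent / 100))
    return random.Random(seed).sample(rows, k)


rows = list(range(1000))
subset = sample_rows(rows, 25)
print(len(subset))
```

Fixing the seed makes the sample reproducible, so the same slider position always yields the same subset in this sketch.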
You can also exclude fields when creating a dataset. The following image shows the include all/exclude all fields options.
Exclude all fields
This option is useful when your source has many fields but you only want to include a few of them. Keep in mind that a dataset cannot be created with no fields, so include at least one.
You can also select or deselect fields manually, one at a time.
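The field selection logic described above can be sketched as follows. This is a hypothetical helper, not part of the module: choose_fields, start, and toggled are illustrative names. It starts from either "include all" or "exclude all", flips the fields toggled manually, and rejects an empty selection.

```python
def choose_fields(all_fields, start="include_all", toggled=()):
    """Return the fields that end up in the dataset.

    start:   "include_all" or "exclude_all" (the two bulk options).
    toggled: fields flipped manually after the bulk option was applied.
    """
    if start not in ("include_all", "exclude_all"):
        raise ValueError("start must be 'include_all' or 'exclude_all'")
    toggled = set(toggled)
    unknown = toggled - set(all_fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if start == "include_all":
        chosen = [f for f in all_fields if f not in toggled]
    else:
        chosen = [f for f in all_fields if f in toggled]
    if not chosen:
        # Mirrors the rule that a dataset with no fields cannot be created.
        raise ValueError("a dataset needs at least one field")
    return chosen


source_fields = ["id", "age", "income", "city", "churned"]
few = choose_fields(source_fields, "exclude_all", ["age", "churned"])
print(few)
```

Starting from "exclude all" keeps the list of toggled fields short when only a few fields out of many are wanted.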
When everything is set, click the Create dataset button.
You can change field names, labels/tags and descriptions by clicking the edit icon next to the field name.
You can also set a field as the objective field for models, ensembles, and logistic regressions. These models predict the value of this field, which makes it a key field.
You can also set a field as non-preferred. Non-preferred fields are not taken into account when generating your models.
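The effect of the objective and non-preferred settings on a model can be pictured with the short sketch below. It is an assumption-laden illustration, with model_inputs as a hypothetical name: it treats the objective field as the value to predict and leaves it, together with every non-preferred field, out of the model's inputs.

```python
def model_inputs(fields, objective, non_preferred=()):
    """Return the fields a model would use as inputs.

    Assumption: the objective field is the prediction target, so it is
    not an input; non-preferred fields are skipped as well.
    """
    if objective not in fields:
        raise ValueError(f"objective field {objective!r} is not in the dataset")
    skipped = set(non_preferred) | {objective}
    return [f for f in fields if f not in skipped]


fields = ["id", "age", "income", "churned"]
inputs = model_inputs(fields, objective="churned", non_preferred=["id"])
print(inputs)
```

Identifier-like fields such as the hypothetical "id" above are typical candidates for non-preferred, because they carry no information that generalizes to new instances.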