Data Sampling

Some data mining algorithms become exponentially more complex as data volume grows, requiring excessive time and resources. To address this, we use a representative data sample to build the model. Megaladata's Sampling component offers various selection methods for you to choose from.

This demo shows how to use sampling to increase the speed of neural network training by 20 times while maintaining the quality of the model.

Launch demo

Download example

Algorithm Description

Initial data

a) Import data

The Data Import supernode has loaded a dataset containing over 900,000 records of user visit information.

The Field Features node has been used to modify the data types of the PageViews, Bounces, and NewVisits fields. The primary goal is to classify users into channel groups (ChannelGrouping). To analyze the source data, the pre-configured Statistics Visualizer can be utilized. The Bounces and NewVisits fields contain both true and null values. Given their logical nature, it's likely that rows with missing values should be interpreted as false.

Data Preparation

a) Preparing data for ML

In the Null Replacement node (utilising the Calculator component), we use the NVL function to replace empty values ​​with false:

NVL(new_visits, false)
NVL(bounces, false)

Clicking the 'Replace field' button will replace the old new_visits and bounces fields with the ones we've created using the NVL function. From now on, only the new fields with null values replaced with false will be show in the dataset.

Replace field
Picture 1. Replace field

Machine Learning

We train neural network models on both the original and sampled datasets. Then, we compare the coefficients of statistical indicators used to interpret the training results.

a) Machine Learning

1) A correlation analysis is conducted to identify the most relevant fields for model training.

Correlation Analysis node settings:

  • Set 1 - All fields except ChannelGrouping
  • Set 2 - ChannelGrouping
  • The remaining settings are default

Results of processing:

 Field1  Field2  Pearson's coefficient
SocialEngagementType ChannelGrouping null
Date ChannelGrouping -0.22
PageViews ChannelGrouping -0.07
Visits_ID ChannelGrouping -0.04
Continent ChannelGrouping 0.04
Subcontinent ChannelGrouping 0.06
Country ChannelGrouping -0.04
Visits ChannelGrouping null
Hits ChannelGrouping -0.08
Source ChannelGrouping 0.87
Medium ChannelGrouping 0.10
Browser ChannelGrouping -0.04
Operating System ChannelGrouping 0.17
IsMobile ChannelGrouping -0.23
NewVisits (replace) ChannelGrouping 0.08
Bounces (replace) ChannelGrouping 0.06

The SocialEngagementType and Visits fields have values null​​, so they will not be used for further analysis.

2) Data sampling was performed using the Sampling node.

We set up the node as follows:

  • The Sampling Method is Sequence
  • Relative percentage is 20%
  • The remaining settings are default

After sampling, 180,711 records remained out of the original 903,553.

3) At the next stage, we set up two identical Neural Network (classification) nodes. The first node will receive the whole original dataset as input, while the second node will process the sample dataset.

The following settings are for the Neural Network (classification) nodes:

  • Field ChannelGrouping: Usage type set to Output
  • Other fields except SocialEngagementType, Visits, Visitid: Usage type set to Input
  • In Normalization Settings: Fields Date and PageViews: Normalizer set to Standardization, for the rest it is set to Indicator
  • Train set - 80%, Test set - 20%
  • The remaining settings are default

Results of running the two models:

Comparison of models
Picture 2. Comparison of models

The results of training the nodes are the same. However, the first node took about 3 hours to complete, while the second one only needed about 10 minutes. As the number of records in the training set increases, the training time of the model tends to increase dramatically. Thus, sampling is an efficient way to speed up the workflow.


Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.

Download example

results matching ""

    No results matching ""