Clustering

Clustering is a fundamental task in data analysis and data mining. It involves grouping objects or observations into non-overlapping clusters.

In the field of data mining, clustering is used for customer and market segmentation, medical diagnostics, social and demographic studies, assessing creditworthiness of borrowers, and many other areas.

To illustrate the task, the "Fisher's Iris" dataset is commonly used. This dataset was used by R. Fisher to demonstrate the functionality of his developed discriminant analysis method.

Launch demo

Download example

Algorithm Description

1. Data Import

Import of the "Fisher's Iris" dataset

The set consists of data on 150 iris specimens. Four characteristics were measured for each (in centimeters).

| Name | Label |
|---|---|
| sepal_length | Sepal Length |
| sepal_width | Sepal Width |
| petal_length | Petal Length |
| petal_width | Petal Width |
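
The same dataset ships with scikit-learn, so the import step can be reproduced outside Megaladata for experimentation. A minimal sketch, assuming pandas and scikit-learn are available (the column names simply mirror the table above):

```python
# Load Fisher's Iris dataset: 150 specimens, 4 measurements in centimeters.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
df["class"] = [iris.target_names[t] for t in iris.target]

print(df.shape)     # (150, 5)
print(df.head())
```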

2.1 EM Clustering

EM clustering is based on the assumption that every observation belongs to all clusters, but with different probabilities. An object is assigned to the cluster for which this probability is highest.
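
In scikit-learn, this soft-assignment scheme corresponds to fitting a Gaussian mixture with the EM algorithm. A minimal sketch, assuming scikit-learn is available (it does not reproduce the exact settings of the Megaladata node):

```python
# EM clustering as a Gaussian mixture: each observation gets a membership
# probability for every cluster; the final label is the most probable cluster.
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
gm = GaussianMixture(n_components=3, random_state=0).fit(X)

probabilities = gm.predict_proba(X)              # shape (150, 3), rows sum to 1
cluster_number = probabilities.argmax(axis=1)    # same as gm.predict(X)
membership_probability = probabilities.max(axis=1)

print(cluster_number[:5], membership_probability[:5].round(2))
```

These two quantities correspond to the Cluster Number and Membership Probability columns described in the output set below.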

a) EM Clustering Settings

The following settings are specified for the EM Clustering node:

  • For the fields sepal_length, sepal_width, petal_length, petal_width, the assignment is set to Used
  • The parameter Given number of clusters is set to 3
  • Other settings are left at their defaults

If the settings are changed, retrain the model.

Interpretation of Results

b) Output Set Cluster Division

Two new columns are added to the original dataset in the output:

  • Cluster Number
  • Membership Probability

| Cluster Number | Membership Probability | Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
|---|---|---|---|---|---|---|
| 1 | 1.00 | 5.10 | 3.50 | 1.40 | 0.20 | Iris-setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 2 | 1.00 | 7.00 | 3.20 | 4.70 | 1.40 | Iris-versicolor |
| ... | ... | ... | ... | ... | ... | ... |
| 0 | 1.00 | 5.90 | 3.00 | 5.10 | 1.80 | Iris-virginica |

c) Visualization of Results

The results of EM Clustering can be viewed in the Cluster Profiles visualizer:

Figure 1. Cluster Profiles

In the Cluster Profiles visualizer, it is possible to view statistical indicators that can be used to compare clusters with each other:

Figure 2. Cluster Profiles (comparison)

The algorithm identified 3 clusters, matching the number of original classes, and the clusters are approximately equal in size, which indicates good performance of the EM clustering algorithm.

2.2 K-means Clustering

K-means clustering is used when the number of clusters is known in advance.
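
For a quick check outside the platform, the same step can be sketched with scikit-learn's KMeans. This is an illustration only; Megaladata's node settings and initialization may differ:

```python
# K-means with a fixed number of clusters (3). For every observation we keep
# the cluster number and the distance to its cluster center, mirroring the
# two output columns described below.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

cluster_number = km.labels_
distance_to_center = km.transform(X).min(axis=1)   # distance to the assigned center

print(np.bincount(cluster_number))                 # number of objects per cluster
print(distance_to_center[:5].round(2))
```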

a) K-means Clustering Settings

The following settings are specified for the K-means Clustering node:

  • For the fields sepal_length, sepal_width, petal_length, petal_width, the assignment is set to Used
  • The parameter Given number of clusters is set to 3
  • Other settings are left at their defaults

If the settings are changed, retrain the model.

Interpretation of Results

b) Output Set Cluster Division

Two new columns are added to the original dataset in the output:

  • Cluster Number
  • Distance to Cluster Center

| Cluster Number | Distance to Cluster Center | Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
|---|---|---|---|---|---|---|
| 2 | 0.23 | 5.10 | 3.50 | 1.40 | 0.20 | Iris-setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 0 | 0.95 | 7.00 | 3.20 | 4.70 | 1.40 | Iris-versicolor |
| ... | ... | ... | ... | ... | ... | ... |
| 0 | 1.06 | 5.90 | 3.00 | 5.10 | 1.80 | Iris-virginica |

c) Visualization of Results

The results of K-means Clustering can be viewed in the Cluster Profiles visualizer:

Figure 3. Cluster Profiles

In the Cluster Profiles visualizer, it is possible to view statistical indicators that can be used to compare clusters with each other:

Figure 4. Cluster Profiles (comparison)

The algorithm identified 3 clusters, matching the number of classes in the input dataset. However, the clusters contain significantly different numbers of objects, so on this dataset k-means clustering is less accurate than EM.
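
Such accuracy claims can be backed up numerically by cross-tabulating cluster labels against the known classes, or by an agreement measure such as the adjusted Rand index. A hedged sketch, assuming scikit-learn and pandas are available; the exact figures depend on settings and random initialization:

```python
# Compare clusterings against the known Iris classes: a contingency table shows
# how classes spread over clusters, and the adjusted Rand index summarizes the
# agreement in a single number (1.0 = perfect match).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

iris = load_iris()
X, y = iris.data, iris.target

labels_em = GaussianMixture(n_components=3, random_state=0).fit_predict(X)
labels_km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(pd.crosstab(y, labels_km, rownames=["class"], colnames=["cluster"]))
print("ARI, EM:     ", round(adjusted_rand_score(y, labels_em), 2))
print("ARI, k-means:", round(adjusted_rand_score(y, labels_km), 2))
```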

2.3 G-means Clustering

G-means clustering is used when the initial number of clusters is unknown. The algorithm automatically determines the appropriate number of clusters.
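
G-means is not available in scikit-learn, but its core idea (Hamerly & Elkan) can be outlined: start with a small number of clusters and repeatedly split any cluster whose points do not look Gaussian along the axis joining two candidate child centers, as judged by an Anderson-Darling test. The following is a simplified sketch under these assumptions, not Megaladata's actual implementation:

```python
# Simplified G-means sketch: grow the number of clusters until every cluster
# passes a Gaussianity test along its main split direction.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris


def looks_gaussian(points):
    """Split the cluster in two and Anderson-Darling-test the projection of
    its points onto the axis joining the two child centers."""
    if len(points) < 8:                    # too few points to test reliably
        return True
    child = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = child.cluster_centers_[0] - child.cluster_centers_[1]
    projected = points @ v / np.linalg.norm(v)
    result = anderson(projected, dist="norm")
    # critical_values[-1] corresponds to the 1% significance level
    return result.statistic < result.critical_values[-1]


def g_means(X, max_k=10):
    k = 1
    while k <= max_k:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # keep a cluster if it looks Gaussian, otherwise split it in two
        new_k = sum(1 if looks_gaussian(X[km.labels_ == c]) else 2
                    for c in range(k))
        if new_k == k:
            return km
        k = new_k
    return km


model = g_means(load_iris().data)
print("clusters found:", model.n_clusters)
```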

a) G-means Clustering Settings

The following settings are specified for the G-means Clustering node:

  • For the fields sepal_length, sepal_width, petal_length, petal_width, the assignment is set to Used
  • The flag "Automatic determination of the number of clusters" is set
  • Other settings are left at their defaults

If the settings are changed, retrain the model.

Interpretation of Results

b) Output Set Cluster Division

Two new columns are added to the original dataset in the output:

  • Cluster Number
  • Distance to Cluster Center

| Cluster Number | Distance to Cluster Center | Sepal Length | Sepal Width | Petal Length | Petal Width | Class |
|---|---|---|---|---|---|---|
| 0 | 0.23 | 5.10 | 3.50 | 1.40 | 0.20 | Iris-setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 1 | 1.23 | 7.00 | 3.20 | 4.70 | 1.40 | Iris-versicolor |
| ... | ... | ... | ... | ... | ... | ... |
| 1 | 0.56 | 5.90 | 3.00 | 5.10 | 1.80 | Iris-virginica |

c) Visualization of Results

The results of G-means Clustering can be viewed in the Cluster Profiles visualizer:

Figure 5. Cluster Profiles

In the Cluster Profiles visualizer, it is possible to view statistical indicators that can be used to compare clusters with each other:

Figure 6. Cluster Profiles (comparison)

The algorithm identified 2 clusters, which, first, does not match the number of classes in the original dataset and, second, results in an uneven distribution of objects across clusters. Thus, on this dataset g-means clustering proved the least accurate of the three methods, and its results can be considered unsatisfactory.


Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.

Download example
