Python in Megaladata

Megaladata offers extensive functionality for a wide range of data analysis tasks. However, some algorithms are not yet available as standard components. In such cases, calculations written in code can be embedded into the workflow.

One of the supported programming languages in Megaladata is Python, which is often used for data analysis.

This example demonstrates how to implement the Decision Tree algorithm using Python, predicting the risk of ischemic heart disease over the next 10 years.

For the Python node to work, you may first need to configure Megaladata and install Python. The demo example uses the pandas, numpy, and scikit-learn (imported as sklearn) libraries. See the instructions for installing libraries.

Note: When run on the online demo (Launch Demo), the example cannot demonstrate all of its capabilities. We recommend installing the example locally.

Launch demo

Download example

Algorithm Description

1. Data Import

In the Data Import submodule, the source file is imported.

The dataset contains information about 4238 patients.

Fields used for prediction:

Name              Type     Description
male              Integer  Sex (1 = male, 0 = female)
age               Integer  Patient's age
education         Integer  Education (0 to 4)
current_smoker    Integer  Is the patient currently a smoker? (1 = yes, 0 = no)
cigs_per_day      Integer  Average number of cigarettes smoked per day
BPMeds            Integer  Has the patient been taking blood pressure medication? (1 = yes, 0 = no)
prevalent_stroke  Integer  Has the patient had a stroke before? (1 = yes, 0 = no)
diabets           Integer  Does the patient have diabetes? (1 = yes, 0 = no)
tot_chol          Integer  Total cholesterol level
sys_bp            Real     Systolic blood pressure
dia_bp            Real     Diastolic blood pressure
BMI               Real     Body Mass Index
heart_rate        Integer  Heart rate
glucose           Integer  Glucose level

Predicted field:

  • TenYearCHD (10-year risk of ischemic heart disease, IHD): 1 = yes, 0 = no.

2. Data Preprocessing

In the Preprocessing submodule, the data is prepared for analysis, that is, brought into line with the requirements of the task at hand.

In our case, the data does not require significant processing. For the correct operation of the Decision Tree algorithm, it is only necessary to exclude or fill in missing values.

There are few missing values in the presented dataset, so you can simply delete rows with empty field values.

Configuring the Missing Values Filling Node

Rows with empty fields are removed using the Missing Values Filling node. The Delete records processing method is set for each field.
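For comparison, the same "delete records" idea can be sketched in pandas. This is an illustration only and not part of the demo workflow: the values below are made up, and only the column names are taken from the dataset description above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names follow the dataset description.
patients = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "BMI": [26.97, 28.73, np.nan, 28.58],
    "TenYearCHD": [0, 0, 0, 1],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = patients.isna().sum()

# Drop every row that has at least one empty field, then renumber the rows.
cleaned = patients.dropna().reset_index(drop=True)
```

Checking `missing_per_column` first is a cheap way to confirm that few rows are affected before deleting them.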

3. Decision Tree

Decision trees are a tool for data mining and predictive analytics. They can solve both classification and regression tasks.
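To give a feel for how a classification tree chooses a split, here is a minimal pure-Python sketch based on Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The helper names and the sample values are hypothetical, and the real library implementation is considerably more elaborate.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


# Made-up ages (already sorted) and outcomes, for illustration only.
ages = [35, 42, 50, 58, 63, 70]
outcomes = [0, 0, 0, 1, 1, 1]

# Candidate thresholds are the midpoints between neighbouring values.
candidates = [(a + b) / 2 for a, b in zip(ages, ages[1:])]

# The best split is the one that leaves the least impurity behind.
best_threshold = min(candidates, key=lambda t: split_impurity(ages, outcomes, t))
```

A tree repeats this search over all features at every node until a stopping rule is met.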

Calculation

To access data ports and other built-in objects in the context of code execution, the following are provided:

  • Input data sets (InputTables, InputTable)
  • Input variables (InputVariables)
  • Output data set (OutputTable)
  • Necessary enumerations (DataType, DataKind, UsageType)

The above objects are imported from the built-in module "builtin_data". By default, an import line is added to the text of the code executed by the node.

Pass the prepared data to the Python node and allow output columns to be formed from the code. The program code is as follows:

import builtin_data
from builtin_data import InputTable, InputTables, InputVariables, OutputTable, DataType, DataKind, UsageType
import numpy as np
import pandas as pd
from builtin_pandas_utils import to_data_frame, prepare_compatible_table, fill_table

The input port is optional:

if InputTable:
    # Create a pd.DataFrame from input set #1
    input_frame = to_data_frame(InputTable)

Split attributes and labels. Here X is all columns from the set except "TenYearCHD", and Y is "TenYearCHD":

X = input_frame.drop('TenYearCHD', axis=1)
Y = input_frame['TenYearCHD']

Split the data into training and testing sets:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
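The split above is drawn at random on every run, so the output table changes between executions. If repeatable results are needed, one option is to fix the seed and stratify by the label. The sketch below uses synthetic stand-in data, and the seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data: 100 rows, 20% positive labels.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(30, 70, size=100),
    "sys_bp": rng.normal(130, 15, size=100),
})
Y = pd.Series([1] * 20 + [0] * 80, name="TenYearCHD")

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.20,
    random_state=42,  # assumed seed; any fixed integer gives a repeatable split
    stratify=Y,       # keep the class ratio the same in both parts
)
```

Stratifying can help when the positive class is rare, because a purely random split may leave the test set with very few positive cases.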

Train the model on the training data:

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)

Make predictions on the test data:

Y_pred = classifier.predict(X_test)
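The tutorial code does not use Y_test after the split. One common way to check the predictions is to compare them with the true labels using scikit-learn's metrics. In the sketch below, the two short lists are made-up stand-ins for Y_test and Y_pred.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels standing in for Y_test and Y_pred.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_hat = [0, 0, 1, 0, 1, 0, 0, 0]

# Share of rows where the prediction matches the true label.
accuracy = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_hat)
```

For a medical risk label, the false negatives in the confusion matrix are usually worth inspecting alongside the overall accuracy.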

Merge attributes and labels:

X_test = X_test.reset_index()
Y_pred = pd.Series(Y_pred)
df = pd.DataFrame()
df = pd.concat([df, Y_pred], axis=1)
X_test = pd.concat([X_test, df], axis=1)
print(X_test)
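pd.concat with axis=1 aligns rows by index label, which is why the index is reset before merging. In the code above, reset_index() also keeps the old labels as an extra "index" column, and the prediction column is named 0. A shorter variant, sketched here on made-up data with a hypothetical column name, drops the old labels and assigns the predictions by position.

```python
import numpy as np
import pandas as pd

# Stand-in for X_test after a random split: the row labels are not 0..n-1.
X_test = pd.DataFrame({"age": [61, 39, 48]}, index=[17, 4, 9])
Y_pred = np.array([1, 0, 0])  # stand-in for classifier.predict(X_test)

# Discard the old labels so the rows are numbered 0..n-1.
result = X_test.reset_index(drop=True)

# A NumPy array is assigned by position; the column name is hypothetical.
result["TenYearCHD_pred"] = Y_pred
```

Whether to keep the original row labels depends on whether you need to trace predictions back to the source rows.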

Output the result to a table:

prepare_compatible_table(OutputTable, X_test, with_index=False)
fill_table(OutputTable, X_test, with_index=False)

Complete Code

import builtin_data
from builtin_data import InputTable, InputTables, InputVariables, OutputTable, DataType, DataKind, UsageType
import numpy as np
import pandas as pd
from builtin_pandas_utils import to_data_frame, prepare_compatible_table, fill_table

#Input port optional
if InputTable:
    # Create pd.DataFrame based on input set #1
    input_frame = to_data_frame(InputTable)

#Separate attributes and labels. X is all columns from the set except "TenYearCHD". Y is "TenYearCHD"
X = input_frame.drop('TenYearCHD', axis=1)
Y = input_frame['TenYearCHD']

#Divide the data randomly into training and test sets. 
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

#Training the model on training data
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)

#Prediction on test data
Y_pred = classifier.predict(X_test)

#Merging attributes and labels.
X_test = X_test.reset_index()
Y_pred = pd.Series(Y_pred)
df = pd.DataFrame()
df = pd.concat([df, Y_pred], axis=1)
X_test = pd.concat([X_test, df], axis=1)
print(X_test)

#Outputting the result to a table
prepare_compatible_table(OutputTable, X_test, with_index=False)
fill_table(OutputTable, X_test, with_index=False)

Results Output

The prediction results are displayed in the last column of the output data set:

Figure 1. Resulting set

Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.

Download Demo Example
