Python in Megaladata
Megaladata offers extensive functionality for solving a wide range of tasks. However, some algorithms are not yet implemented in Megaladata as standard components. In such cases, calculations performed by custom code can be embedded into the workflow.
One of the supported programming languages in Megaladata is Python, which is often used for data analysis.
This example demonstrates how to implement the Decision Tree algorithm in Python to predict the risk of ischemic heart disease over the next 10 years.
For the Python node to work, preliminary configuration of Megaladata and installation of Python may be required. The demo example uses the pandas, numpy, and scikit-learn libraries. See the instructions for installing libraries.
Note: when running on the demo stand (Launch Demo), the demo example cannot demonstrate all of its capabilities. We recommend installing the example locally.
Algorithm Description
1. Data Import
In the Data Import submodule, the source file is imported.
The dataset contains information about 4238 patients.
Fields used for prediction:
Name | Description |
---|---|
male | Gender (1 — male, 0 — female) |
age | Patient's age |
education | Education level (0 — 4) |
current_smoker | Is the patient currently a smoker? (1 — yes, 0 — no) |
cigs_per_day | Average number of cigarettes smoked per day |
BPMeds | Has the patient been taking blood pressure medication? (1 — yes, 0 — no) |
prevalent_stroke | Has the patient had a stroke before? (1 — yes, 0 — no) |
diabets | Does the patient have diabetes? (1 — yes, 0 — no) |
tot_chol | Total cholesterol level |
sys_bp | Systolic blood pressure |
dia_bp | Diastolic blood pressure |
BMI | Body mass index |
heart_rate | Heart rate |
glucose | Glucose level |
Predicted field:
- TenYearCHD (10-year risk of ischemic heart disease, IHD): 1 — yes, 0 — no.
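The expected shape of the input data can be sketched with a small synthetic pandas frame. The column names follow the table above; the values are made up for illustration and are not taken from the real dataset:

```python
import pandas as pd

# Hypothetical sample rows illustrating the schema described above;
# the values are invented, only the column names follow the article.
sample = pd.DataFrame({
    "male": [1, 0],
    "age": [52, 47],
    "education": [2, 3],
    "current_smoker": [1, 0],
    "cigs_per_day": [20.0, 0.0],
    "BPMeds": [0, 1],
    "prevalent_stroke": [0, 0],
    "diabets": [0, 1],
    "tot_chol": [240.0, 210.0],
    "sys_bp": [140.0, 120.0],
    "dia_bp": [90.0, 80.0],
    "BMI": [27.5, 24.1],
    "heart_rate": [75, 68],
    "glucose": [85.0, 103.0],
    "TenYearCHD": [1, 0],   # predicted (target) column
})

X = sample.drop("TenYearCHD", axis=1)  # the 14 predictor columns
Y = sample["TenYearCHD"]               # the target
print(X.shape)  # (2, 14)
```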
2. Data Preprocessing
In the Preprocessing submodule, the data is prepared for analysis: it is brought into line with the requirements dictated by the specifics of the task at hand.
In our case, the data does not require significant processing. For the correct operation of the Decision Tree algorithm, it is only necessary to exclude or fill in missing values.
There are few missing values in the presented dataset, so you can simply delete rows with empty field values.
Empty values are cleaned up using the Missing Values Filling node. For each field, the processing method is set to Delete records.
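Outside Megaladata, the same cleanup step can be sketched in pandas. Using dropna is an assumption about equivalent behavior, not the node's actual implementation:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing field value (illustrative only).
df = pd.DataFrame({
    "age": [52, 47, 61],
    "glucose": [85.0, np.nan, 99.0],
    "TenYearCHD": [1, 0, 0],
})

# Equivalent of the "Delete records" method: drop every row
# that has at least one empty field value.
clean = df.dropna().reset_index(drop=True)
print(len(clean))  # 2
```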
3. Decision Tree
Decision trees are one of the tools for data mining and predictive analytics. They allow solving both classification and regression tasks.
To access data ports and other built-in objects in the context of code execution, the following are provided:
- Input data sets (InputTables, InputTable)
- Input variables (InputVariables)
- Output data set (OutputTable)
- Necessary enumerations (DataType, DataKind, UsageType)
The above objects are imported from the built-in builtin_data module. By default, the corresponding import line is added to the code executed by the node.
Pass the prepared data to the Python node and enable the formation of output columns from the code. Program code:
```python
import builtin_data
from builtin_data import InputTable, InputTables, InputVariables, OutputTable, DataType, DataKind, UsageType
import numpy as np
import pandas as pd
from builtin_pandas_utils import to_data_frame, prepare_compatible_table, fill_table
```
The input port is optional, so check that data has actually arrived:

```python
if InputTable:
    # Create a pd.DataFrame from input set No. 1
    input_frame = to_data_frame(InputTable)
```
Split attributes and labels. X contains all columns from the set except "TenYearCHD"; Y is "TenYearCHD":

```python
X = input_frame.drop('TenYearCHD', axis=1)
Y = input_frame['TenYearCHD']
```
Split the data into training and testing sets:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
```
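This split is random on every run. If reproducible results are needed, a fixed random_state (and, optionally, stratification by the target) can be passed; this is a suggested refinement shown here on synthetic data, not part of the original example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared patient frame (illustrative values).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(30, 70, 100),
    "sys_bp": rng.normal(130, 15, 100),
})
Y = pd.Series(rng.integers(0, 2, 100), name="TenYearCHD")

# random_state fixes the shuffle; stratify keeps class proportions
# the same in the training and test parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=42, stratify=Y)
print(len(X_test))  # 20
```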
Train the model on the training data:
```python
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)
```
Make predictions on the test data:
```python
Y_pred = classifier.predict(X_test)
```
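The example stops at prediction; quality can also be checked against the held-out labels with standard scikit-learn metrics. This is an optional, self-contained sketch on synthetic data, not part of the node's code:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the patient set.
X, Y = make_classification(n_samples=400, n_features=14, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)

# Share of correctly classified test samples.
print(accuracy_score(Y_test, Y_pred))
```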
Merge attributes and labels:
```python
X_test = X_test.reset_index()
Y_pred = pd.Series(Y_pred)
df = pd.DataFrame()
df = pd.concat([df, Y_pred], axis=1)
X_test = pd.concat([X_test, df], axis=1)
print(X_test)
```
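Note that pd.concat attaches the unnamed prediction series as a column named 0. Giving the series an explicit name keeps the output table readable; the column name below is an assumption, not part of the original code:

```python
import pandas as pd

# Tiny stand-in for the test attributes and predictions.
X_test = pd.DataFrame({"age": [52, 47]}, index=[10, 3])
Y_pred = [1, 0]

# reset_index(drop=True) aligns rows by position without keeping the
# old index as a column; naming the series labels the new column.
X_test = X_test.reset_index(drop=True)
pred = pd.Series(Y_pred, name="TenYearCHD_pred")
result = pd.concat([X_test, pred], axis=1)
print(list(result.columns))  # ['age', 'TenYearCHD_pred']
```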
Output the result to a table:
```python
prepare_compatible_table(OutputTable, X_test, with_index=False)
fill_table(OutputTable, X_test, with_index=False)
```
The prediction results are displayed in the last column of the output data set.
Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.