Python in Megaladata
Megaladata offers extensive functionality for solving a wide range of tasks. However, some algorithms are not yet implemented in Megaladata as standard components. In such cases, calculations performed by custom code can be embedded into the workflow.
One of the supported programming languages in Megaladata is Python, which is often used for data analysis.
This example demonstrates how to implement the Decision Tree algorithm in Python to predict the risk of ischemic heart disease over the next 10 years.
For the Python node to work, preliminary configuration of Megaladata and installation of Python may be required. The demo example uses the pandas, numpy, and scikit-learn libraries. See the instructions for installing libraries.
Note: when running on the demo stand (Launch Demo), the demo example cannot demonstrate all of its capabilities. We recommend installing the example locally.
Algorithm Description
1. Data Import
In the Data Import submodule, the source file is imported.
The dataset contains information about 4238 patients.
Fields used for prediction:
Name | Description |
---|---|
male | Gender (1 — male, 0 — female) |
age | Patient's age |
education | Education level (0 — 4) |
current_smoker | Is the patient currently a smoker? (1 — yes, 0 — no) |
cigs_per_day | Average number of cigarettes smoked per day |
BPMeds | Has the patient been taking blood pressure medication? (1 — yes, 0 — no) |
prevalent_stroke | Has the patient had a stroke before? (1 — yes, 0 — no) |
diabets | Does the patient have diabetes? (1 — yes, 0 — no) |
tot_chol | Total cholesterol level |
sys_bp | Systolic blood pressure |
dia_bp | Diastolic blood pressure |
BMI | Body mass index |
heart_rate | Heart rate |
glucose | Glucose level |
Predicted field:
- TenYearCHD (10-year risk of ischemic heart disease, IHD): 1 — yes, 0 — no.
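The expected shape of the input data can be sketched with a small synthetic pandas frame. The column names follow the table above; the values are made up for illustration and are not taken from the real dataset:

```python
import pandas as pd

# Hypothetical sample rows illustrating the schema described above;
# the values are invented, only the column names follow the article.
sample = pd.DataFrame({
    "male": [1, 0],
    "age": [52, 47],
    "education": [2, 3],
    "current_smoker": [1, 0],
    "cigs_per_day": [20.0, 0.0],
    "BPMeds": [0, 1],
    "prevalent_stroke": [0, 0],
    "diabets": [0, 1],
    "tot_chol": [240.0, 210.0],
    "sys_bp": [140.0, 120.0],
    "dia_bp": [90.0, 80.0],
    "BMI": [27.5, 24.1],
    "heart_rate": [75, 68],
    "glucose": [85.0, 103.0],
    "TenYearCHD": [1, 0],   # predicted (target) column
})

X = sample.drop("TenYearCHD", axis=1)  # the 14 predictor columns
Y = sample["TenYearCHD"]               # the target
print(X.shape)  # (2, 14)
```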
2. Data Preprocessing
In the Preprocessing submodule, the data is prepared for analysis: it is brought into line with the requirements dictated by the specifics of the task at hand.
In our case, the data does not require significant processing. For the correct operation of the Decision Tree algorithm, it is only necessary to exclude or fill in missing values.
There are few missing values in the presented dataset, so you can simply delete rows with empty field values.
Empty values are cleaned up using the Missing Values Filling node. For each field, the processing method is set to Delete records.
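Outside Megaladata, the same cleanup step can be sketched in pandas. Using dropna is an assumption about equivalent behavior, not the node's actual implementation:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing field value (illustrative only).
df = pd.DataFrame({
    "age": [52, 47, 61],
    "glucose": [85.0, np.nan, 99.0],
    "TenYearCHD": [1, 0, 0],
})

# Equivalent of the "Delete records" method: drop every row
# that has at least one empty field value.
clean = df.dropna().reset_index(drop=True)
print(len(clean))  # 2
```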
3. Decision Tree
Decision trees are one of the tools for data mining and predictive analytics. They allow solving both classification and regression tasks.
To access data ports and other built-in objects in the context of code execution, the following are provided:
- Input data sets (InputTables, InputTable)
- Input variables (InputVariables)
- Output data set (OutputTable)
- Necessary enumerations (DataType, DataKind, UsageType)
The above objects are imported from the built-in builtin_data module. By default, the corresponding import line is added to the code executed by the node.
Pass the prepared data to the Python node and enable the formation of output columns from the code. Program code:
```python
import builtin_data
from builtin_data import InputTable, InputTables, InputVariables, OutputTable, DataType, DataKind, UsageType
import numpy as np
import pandas as pd
from builtin_pandas_utils import to_data_frame, prepare_compatible_table, fill_table
```
The input port is optional, so check that data has actually arrived:

```python
if InputTable:
    # Create a pd.DataFrame from input set No. 1
    input_frame = to_data_frame(InputTable)
```
Split attributes and labels. X contains all columns from the set except "TenYearCHD"; Y is "TenYearCHD":

```python
X = input_frame.drop('TenYearCHD', axis=1)
Y = input_frame['TenYearCHD']
```
Split the data into training and testing sets:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
```
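This split is random on every run. If reproducible results are needed, a fixed random_state (and, optionally, stratification by the target) can be passed; this is a suggested refinement shown here on synthetic data, not part of the original example:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared patient frame (illustrative values).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(30, 70, 100),
    "sys_bp": rng.normal(130, 15, 100),
})
Y = pd.Series(rng.integers(0, 2, 100), name="TenYearCHD")

# random_state fixes the shuffle; stratify keeps class proportions
# the same in the training and test parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=42, stratify=Y)
print(len(X_test))  # 20
```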
Train the model on the training data:
```python
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, Y_train)
```
Make predictions on the test data:
```python
Y_pred = classifier.predict(X_test)
```
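The example stops at prediction; quality can also be checked against the held-out labels with standard scikit-learn metrics. This is an optional, self-contained sketch on synthetic data, not part of the node's code:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the patient set.
X, Y = make_classification(n_samples=400, n_features=14, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)

# Share of correctly classified test samples.
print(accuracy_score(Y_test, Y_pred))
```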
Merge attributes and labels:
```python
X_test = X_test.reset_index()
Y_pred = pd.Series(Y_pred)
df = pd.DataFrame()
df = pd.concat([df, Y_pred], axis=1)
X_test = pd.concat([X_test, df], axis=1)
print(X_test)
```
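Note that pd.concat attaches the unnamed prediction series as a column named 0. Giving the series an explicit name keeps the output table readable; the column name below is an assumption, not part of the original code:

```python
import pandas as pd

# Tiny stand-in for the test attributes and predictions.
X_test = pd.DataFrame({"age": [52, 47]}, index=[10, 3])
Y_pred = [1, 0]

# reset_index(drop=True) aligns rows by position without keeping the
# old index as a column; naming the series labels the new column.
X_test = X_test.reset_index(drop=True)
pred = pd.Series(Y_pred, name="TenYearCHD_pred")
result = pd.concat([X_test, pred], axis=1)
print(list(result.columns))  # ['age', 'TenYearCHD_pred']
```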
Output the result to a table:
```python
prepare_compatible_table(OutputTable, X_test, with_index=False)
fill_table(OutputTable, X_test, with_index=False)
```
The prediction results are displayed in the last column of the output data set.
Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.