Approximation of Time Series
Time series can represent various phenomena, such as sales figures, order volumes, customer traffic, or environmental measurements like temperature. In this specific example, time series approximation was applied to ambient temperature values from 2017 to 2019.
Here, time series approximation means identifying and correcting outliers within the series. This demo showcases two ways to do this in Megaladata: the Eliminate Outliers component and the ARIMAX component. In statistics, ARIMA models are used to forecast future values of a time series from historical data. In this example's workflow, the supernode "Find Outliers with ARIMAX" uses an ARIMAX forecast to detect and eliminate outliers, producing a smoothed time series that incorporates the adjusted data points.
Algorithm Description
1. Data Import
In the Data Import supernode, we upload an Excel table containing temperature values:
Name | Caption |
---|---|
Date | Date |
Temperature | Temperature |
A Chart visualizer is used to display the input data, where sharp spikes and breaks indicate outliers or anomalies.
2. Data Processing
Next, we configure the Imputation supernode:
The Imputation component fills in the missing values in the Temperature field, using "Replace with most frequent" as the processing method in the settings.
To ensure that all the gaps are filled in properly, we employ Loop to run the Imputation node multiple times.
We set up the Loop node as follows:
- Type of group processing: Fixed group size
- Group row count: 10
Dividing the entire range into small intervals (groups) allows us to calculate the most likely value within each interval, rather than considering the entire range.
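Outside Megaladata, the idea of group-wise imputation can be illustrated with a minimal Python sketch. The function name `impute_most_frequent`, the use of `None` to mark gaps, and the choice of the exact-value mode as "most frequent" are assumptions made for illustration; the Imputation component may compute the most frequent value differently.

```python
from collections import Counter


def impute_most_frequent(values, group_size=10):
    """Fill None gaps with the most frequent known value of each fixed-size group."""
    result = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        known = [v for v in group if v is not None]
        if known:
            # most_common(1) returns the value with the highest count;
            # ties resolve to the value encountered first in the group.
            mode = Counter(known).most_common(1)[0][0]
        else:
            mode = None  # a group with no known values stays empty
        result.extend(mode if v is None else v for v in group)
    return result


# Hypothetical temperature readings with gaps (None)
temps = [5, 5, None, 6, 5, 7, None, 5, 6, 5,
         12, None, 12, 13, 12]
filled = impute_most_frequent(temps, group_size=10)
```

In this sketch the gap in the second group is filled with 12 rather than 5, which is the effect of computing the most likely value per interval instead of over the entire range.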
3. Correcting Anomalies
Let's now create a supernode Eliminate Outliers, which will contain the following nodes:
The Eliminate Outliers component will find and process outliers in the Temperature field. Set up the node as follows:
- Detection method: Standard deviation
- For outlier: 2
- For extreme: 3
- Outliers Elimination Method: Replace with most frequent
- Extreme Value Processing Method: Replace with most frequent
As with Imputation, a Loop component runs the Eliminate Outliers node repeatedly, this time over batches of 13 values. Evaluating each batch separately, rather than the entire range at once, improves the quality of the output.
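The settings above can be approximated in a short Python sketch. It assumes that the thresholds 2 and 3 are multiples of the batch standard deviation measured from the batch mean, that the population standard deviation is used, and that "most frequent" is the exact-value mode; the function name `eliminate_outliers` is hypothetical, and the actual component may differ in these details.

```python
from collections import Counter
from statistics import mean, pstdev


def eliminate_outliers(values, batch_size=13, outlier_k=2.0, extreme_k=3.0):
    """Replace values that deviate strongly from their batch mean.

    Returns the corrected series and a list of (index, value, kind) tuples.
    """
    corrected, found = [], []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        mu, sigma = mean(batch), pstdev(batch)
        mode = Counter(batch).most_common(1)[0][0]
        for offset, v in enumerate(batch):
            deviation = abs(v - mu)
            if sigma > 0 and deviation > outlier_k * sigma:
                kind = "extreme" if deviation > extreme_k * sigma else "outlier"
                found.append((start + offset, v, kind))
                corrected.append(mode)  # both kinds are replaced the same way
            else:
                corrected.append(v)
    return corrected, found


# Hypothetical batch of 13 readings with one spike
readings = [20, 21, 20, 22, 20, 21, 20, 55, 20, 21, 20, 22, 20]
corrected, found = eliminate_outliers(readings)
```

Here the spike of 55 lies more than three standard deviations from the batch mean, so it is classified as extreme and replaced with the batch's most frequent value, 20.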
Using the Union component, the outliers and extreme values are combined into a single table.
Subsequently, the Sort component arranges these values in ascending order by date.
The Eliminate Outliers supernode outputs two tables: "Dataset" (corrected data) and "Outliers" (the original abnormal values).
To review the results conveniently, we configure a Chart visualizer to display the "Dataset" table. This visualization allows us to double-check that all the sharp points are eliminated.
The Eliminate Outliers node can be used to correct anomalies while preserving the original data structure: Only a specific field of the dataset (in our case, "Outliers") gets corrected. For instance, this method may be useful when you need to correct unusual morbidity statistics for a short time period (e.g., a month).
Note: When configuring the Loops for the Imputation and Eliminate Outliers nodes, the number and size of batches may vary. They depend on the frequency and periodicity of the "Gaps" and "Outliers", respectively, and the size of the original dataset.
Let's now try a second method of handling outliers, in the supernode Find Outliers with ARIMAX:
The ARIMAX node automatically processes the entire range of values including outliers. In the "Model output" section of the table, we can see the following fields:
- Date: date values from the initial set
- Temperature: temperature values from the original set
- Temperature|Prediction: predicted temperature values
- Temperature|Approximation error: the difference between original and predicted temperature values
The Calculator component is used to search for outliers by flagging large values in the Temperature|Approximation error field.
Then, we clear the resulting Outliers field of empty values using the Row Filter.
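The Calculator and Row Filter steps can be sketched in Python as follows. The predictions are hypothetical stand-ins for the ARIMAX output, and both the threshold (a multiple `k` of the standard deviation of the errors) and the correction rule (replacing a flagged value with its prediction) are assumptions; the source does not state the exact expression used in the Calculator.

```python
from statistics import pstdev


def find_outliers(dates, actual, predicted, k=2.0):
    """Flag points whose approximation error is large; return (dataset, outliers)."""
    # Approximation error: original minus predicted value
    errors = [a - p for a, p in zip(actual, predicted)]
    limit = k * pstdev(errors)  # assumed threshold
    # Calculator step: keep the original value where the error is large, else empty
    flagged = [a if abs(e) > limit else None for a, e in zip(actual, errors)]
    # Row Filter step: drop rows where the Outliers field is empty
    outliers = [(d, v) for d, v in zip(dates, flagged) if v is not None]
    # Corrected dataset: flagged values are replaced with the prediction (assumption)
    dataset = [p if f is not None else a
               for a, p, f in zip(actual, predicted, flagged)]
    return dataset, outliers


# Hypothetical data: eight daily readings and model predictions
dates = [f"2017-01-0{day}" for day in range(1, 9)]
actual = [1.0, 1.5, 2.0, 15.0, 2.2, 1.8, 1.6, 1.9]
predicted = [1.2, 1.4, 1.9, 2.0, 2.1, 1.9, 1.7, 1.8]
dataset, outliers = find_outliers(dates, actual, predicted)
```

With these numbers only the reading of 15.0 on 2017-01-04 is flagged, and the corrected series carries the predicted value 2.0 in its place.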
At the output of the supernode, we get two tables: "Dataset" and "Outliers", with the corrected data and the original abnormal values, respectively.
To visualize these results, we configure a Chart for the "Dataset" table again. The blue chart shows the initial data with outliers, while the orange chart shows the processed values.
The ARIMAX node can be used to edit anomalies by adjusting the entire dataset, not just the "Outliers" column. An example of a time series where such an approach may be helpful is a customer flow over several years.
4. Smoothing
We will employ two Smoothing nodes to process the datasets generated in step 3. The Smoothing component refines the charts by smoothing out abrupt changes between data points, providing a more visually appealing and interpretable representation.
The settings we use for the Smoothing component:
- Processing Method: Wavelet Smoothing
- Wavelet type: Daubechies
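To give a sense of what wavelet smoothing does, here is a minimal pure-Python sketch of a single-level Daubechies-4 transform with periodic boundaries, in which the detail coefficients are discarded before reconstruction. The wavelet order, the decomposition depth, and the rule of zeroing all details are assumptions; the Smoothing component's internal algorithm and parameters are not described in this example.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Daubechies-4 low-pass (scaling) and high-pass (wavelet) filter coefficients
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
G = [H[3], -H[2], H[1], -H[0]]


def d4_forward(x):
    """One decomposition level: split x into approximation and detail halves."""
    n = len(x)
    if n < 4 or n % 2:
        raise ValueError("length must be even and at least 4")
    half = n // 2
    approx = [sum(H[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(half)]
    detail = [sum(G[k] * x[(2 * i + k) % n] for k in range(4)) for i in range(half)]
    return approx, detail


def d4_inverse(approx, detail):
    """Reconstruct the signal; the transform is orthogonal, so this is its transpose."""
    n = 2 * len(approx)
    x = [0.0] * n
    for i in range(len(approx)):
        for k in range(4):
            x[(2 * i + k) % n] += H[k] * approx[i] + G[k] * detail[i]
    return x


def wavelet_smooth(x):
    """Smooth by dropping all detail coefficients of a single level."""
    approx, detail = d4_forward(x)
    return d4_inverse(approx, [0.0] * len(detail))


# Hypothetical corrected temperatures
noisy = [3.0, 4.5, 2.0, 8.0, 5.5, 6.0, 4.0, 7.0]
smoothed = wavelet_smooth(noisy)
```

Keeping the detail coefficients reproduces the input exactly, while dropping them removes the finest-scale fluctuations; deeper decompositions would smooth more strongly.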
Let's configure the Chart visualizer for the "Dataset" table again. The blue chart shows the corrected values from step 3, while the orange chart shows the values after smoothing.
A Smoothing node can come in handy when you need to analyze seasonal ascending and descending trends, without focusing on specific values.
Download and open the file in Megaladata. If necessary, you can install the free Megaladata Community Edition.