
Case studies

On this page, we present some preliminary results in the form of three short case studies:

  1. No anomalies
  2. High contamination
  3. Parameter sensitivity

No anomalies

In this case study, we want to see how the algorithms behave on datasets that do not contain any known anomalies. For this, we included the following datasets (all without known/labelled anomalies) as examples in our experiments:

| # | collection | dataset | input dimensionality | learning type |
|---|---|---|---|---|
| 232 | KDD-TSAD | 079_UCR_Anomaly_DISTORTEDresperation2 | univariate | semi-supervised |
| 261 | KDD-TSAD | 108_UCR_Anomaly_NOISEresperation2 | univariate | semi-supervised |
| 340 | KDD-TSAD | 187_UCR_Anomaly_resperation2 | univariate | semi-supervised |
| 498 | NAB | art_daily_no_noise | univariate | unsupervised |
| 500 | NAB | art_daily_perfect_square_wave | univariate | unsupervised |
| 501 | NAB | art_daily_small_noise | univariate | unsupervised |
| 502 | NAB | art_flatline | univariate | unsupervised |
| 505 | NAB | art_noisy | univariate | unsupervised |
| 513 | NAB | ec2_cpu_utilization_c6585a | univariate | unsupervised |

Note that this list does not contain any supervised datasets without known anomalies. We therefore cannot assess supervised algorithms and exclude them from the further analysis. In addition, all of the above datasets are univariate. Since multivariate algorithms are also executed on univariate datasets, this does not reduce the number of considered algorithms any further.

The following table lists the number of successfully processed, erroneous, or timed-out datasets for each of the remaining 57 algorithms:

| training type | algorithm | # ERROR | # OK | # TIMEOUT |
|---|---|---|---|---|
| semi-supervised | Bagel | 0 | 0 | 3 |
| | DeepAnT | 3 | 0 | 0 |
| | EncDec-AD | 3 | 0 | 0 |
| | HealthESN | 0 | 0 | 3 |
| | Hybrid KNN | 3 | 0 | 0 |
| | ImageEmbeddingCAE | 3 | 0 | 0 |
| | LSTM-AD | 1 | 0 | 2 |
| | SR-CNN | 0 | 0 | 3 |
| | TARZAN | 3 | 0 | 0 |
| | TAnoGan | 0 | 0 | 3 |
| | OmniAnomaly | 2 | 1 | 0 |
| | Random Forest Regressor (RR) | 0 | 1 | 2 |
| | Donut | 0 | 2 | 1 |
| | LaserDBN | 0 | 3 | 0 |
| | OceanWNN | 0 | 3 | 0 |
| | Random Black Forest (RR) | 0 | 3 | 0 |
| | RobustPCA | 0 | 3 | 0 |
| | Telemanom | 0 | 3 | 0 |
| | XGBoosting (RR) | 0 | 3 | 0 |
| unsupervised | HOT SAX | 9 | 0 | 0 |
| | Left STAMPi | 9 | 0 | 0 |
| | NormA | 9 | 0 | 0 |
| | SAND | 9 | 0 | 0 |
| | k-Means | 9 | 0 | 0 |
| | DBStream | 8 | 1 | 0 |
| | VALMOD | 6 | 2 | 1 |
| | S-H-ESD (Twitter) | 6 | 3 | 0 |
| | Triple ES (Holt-Winter's) | 3 | 5 | 1 |
| | COF | 3 | 6 | 0 |
| | PST | 3 | 6 | 0 |
| | PhaseSpace-SVM | 0 | 6 | 3 |
| | Series2Graph | 3 | 6 | 0 |
| | ARIMA | 0 | 7 | 2 |
| | CBLOF | 2 | 7 | 0 |
| | STAMP | 0 | 7 | 2 |
| | Subsequence LOF | 0 | 7 | 2 |
| | IF-LOF | 1 | 8 | 0 |
| | NumentaHTM | 1 | 8 | 0 |
| | Torsk | 0 | 8 | 1 |
| | COPOD | 0 | 9 | 0 |
| | DSPOT | 0 | 9 | 0 |
| | DWT-MLEAD | 0 | 9 | 0 |
| | Extended Isolation Forest (EIF) | 0 | 9 | 0 |
| | FFT | 0 | 9 | 0 |
| | GrammarViz | 0 | 9 | 0 |
| | HBOS | 0 | 9 | 0 |
| | Isolation Forest (iForest) | 0 | 9 | 0 |
| | KNN | 0 | 9 | 0 |
| | LOF | 0 | 9 | 0 |
| | MedianMethod | 0 | 9 | 0 |
| | PCC | 0 | 9 | 0 |
| | PCI | 0 | 9 | 0 |
| | SSA | 0 | 9 | 0 |
| | STOMP | 0 | 9 | 0 |
| | Spectral Residual (SR) | 0 | 9 | 0 |
| | Subsequence IF | 0 | 9 | 0 |
| | TSBitmap | 0 | 9 | 0 |

As the table shows, the algorithms Bagel, DeepAnT, EncDec-AD, HealthESN, Hybrid KNN, ImageEmbeddingCAE, LSTM-AD, SR-CNN, TARZAN, TAnoGan, HOT SAX, Left STAMPi, NormA, SAND, and k-Means fail to process any of the datasets. Our preliminary finding is that these algorithms make assumptions about the dataset that are not met when no anomalies are present. In contrast, the algorithms LaserDBN, OceanWNN, Random Black Forest (RR), RobustPCA, Telemanom, XGBoosting (RR), COPOD, DSPOT, DWT-MLEAD, Extended Isolation Forest (EIF), FFT, GrammarViz, HBOS, Isolation Forest (iForest), KNN, LOF, MedianMethod, PCC, PCI, SSA, STOMP, Spectral Residual (SR), Subsequence IF, and TSBitmap process all datasets successfully. All other algorithms can process some of the datasets but fail on others. The reasons for those failures are a subject of further analysis.

Our evaluation metrics are not defined for time series with all-zero labels (not a single annotated anomaly), which means that we cannot present those metrics here. Instead, we show an example dataset and the scorings of selected algorithms. You can use the legend on the right side to enable/disable the display of individual algorithm scorings.
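This behavior is easy to reproduce with off-the-shelf metric implementations. The following minimal sketch uses scikit-learn's roc_auc_score for illustration (scikit-learn is an assumption here, not necessarily the implementation behind our metrics):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.zeros(1000, dtype=int)               # not a single annotated anomaly
scores = np.random.default_rng(42).random(1000)  # arbitrary anomaly scores

try:
    roc_auc_score(labels, scores)
except ValueError as err:
    # "Only one class present in y_true. ROC AUC score is not defined in that case."
    print(err)
```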

High contamination

In this case study, we want to see how the algorithms behave on datasets with a high contamination. The contamination of a time series of length \(n\) is the ratio of anomalous points to all points (\(\text{contamination} = \frac{\text{# anomalous points}}{n}\)); the sketch after the table below shows how to compute it from the ground-truth labels. Most algorithms assume that anomalies are rare and, thus, expect a low contamination. Datasets with a high contamination contain many or very large anomalous regions and pose a particular challenge for those algorithms. The following datasets have a \(\text{contamination} > 0.1\):

| # | collection | dataset | input dimensionality | learning type | contamination | length | # dimensions |
|---|---|---|---|---|---|---|---|
| 50 | Exathlon | 4_1_100000_61-29 | multivariate | semi-supervised | 0.124655 | 129197 | 19 |
| 51 | Exathlon | 4_1_100000_61-30 | multivariate | semi-supervised | 0.124655 | 129197 | 19 |
| 59 | Exathlon | 5_1_100000_63-64 | multivariate | supervised | 0.126248 | 43066 | 31 |
| 483 | NAB | TravelTime_451 | univariate | unsupervised | 0.100370 | 2162 | 1 |
| 486 | NAB | Twitter_volume_CRM | univariate | unsupervised | 0.100176 | 15902 | 1 |
| 490 | NAB | Twitter_volume_IBM | univariate | unsupervised | 0.100044 | 15893 | 1 |
| 491 | NAB | Twitter_volume_KO | univariate | unsupervised | 0.100120 | 15851 | 1 |
| 492 | NAB | Twitter_volume_PFE | univariate | unsupervised | 0.100139 | 15858 | 1 |
| 514 | NAB | ec2_cpu_utilization_fe7f93 | univariate | unsupervised | 0.100446 | 4032 | 1 |
| 516 | NAB | ec2_disk_write_bytes_c0d644 | univariate | unsupervised | 0.100446 | 4032 | 1 |
| 518 | NAB | ec2_network_in_5abac7 | univariate | unsupervised | 0.100211 | 4730 | 1 |
| 521 | NAB | exchange-2_cpc_results | univariate | unsupervised | 0.100369 | 1624 | 1 |
| 525 | NAB | exchange-4_cpc_results | univariate | unsupervised | 0.100426 | 1643 | 1 |
| 527 | NAB | grok_asg_anomaly | univariate | unsupervised | 0.100628 | 4621 | 1 |
| 528 | NAB | iio_us-east-1_i-a2eb1cd9_NetworkIn | univariate | unsupervised | 0.101368 | 1243 | 1 |
| 530 | NAB | nyc_taxi | univariate | unsupervised | 0.100291 | 10320 | 1 |
| 531 | NAB | occupancy_6005 | univariate | unsupervised | 0.100420 | 2380 | 1 |
| 535 | NAB | rogue_agent_key_hold | univariate | unsupervised | 0.100956 | 1882 | 1 |
| 538 | NAB | speed_7578 | univariate | unsupervised | 0.102928 | 1127 | 1 |
| 539 | NAB | speed_t4013 | univariate | unsupervised | 0.100200 | 2495 | 1 |

Please note that the Exathlon datasets are quite large and might not be processable by all algorithms within our time and memory limits.
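For reference, the contamination values in the table can be reproduced directly from the binary ground-truth labels. The following minimal sketch illustrates the definition; the label array is hypothetical but chosen to match TravelTime_451 from the table above:

```python
import numpy as np

def contamination(labels: np.ndarray) -> float:
    """Ratio of anomalous points (non-zero labels) to all points."""
    return float(np.count_nonzero(labels)) / len(labels)

# Hypothetical label array: length 2162 with 217 anomalous points, which
# reproduces the contamination of the NAB dataset TravelTime_451 above.
labels = np.zeros(2162, dtype=int)
labels[500:717] = 1
print(f"{contamination(labels):.6f}")  # 0.100370
```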

Failures and timeouts

The next table lists all algorithms that could not process at least half of their matching datasets (matching with respect to learning type and input dimensionality):

| input dimensionality | training type | algorithm | # ERROR | # OK | # TIMEOUT | # ALL |
|---|---|---|---|---|---|---|
| univariate | unsupervised | S-H-ESD (Twitter) | 17 | 0 | 0 | 17 |
| | | SAND | 17 | 0 | 0 | 17 |
| | | NormA | 0 | 3 | 14 | 17 |
| multivariate | unsupervised | DBStream | 17 | 3 | 0 | 20 |
| | supervised | MultiHMM | 1 | 0 | 0 | 1 |
| | | Normalizing Flows | 1 | 0 | 0 | 1 |
| | semi-supervised | DeepAnT | 2 | 0 | 0 | 2 |
| | | Hybrid KNN | 2 | 0 | 0 | 2 |
| | | LSTM-AD | 2 | 0 | 0 | 2 |
| | | Random Black Forest (RR) | 2 | 0 | 0 | 2 |
| | | Telemanom | 2 | 0 | 0 | 2 |
| | | EncDec-AD | 0 | 0 | 2 | 2 |
| | | HealthESN | 0 | 0 | 2 | 2 |
| | | LaserDBN | 0 | 0 | 2 | 2 |
| | | TAnoGan | 0 | 0 | 2 | 2 |

The algorithms NormA, EncDec-AD, HealthESN, LaserDBN, and TAnoGan cannot process the selected datasets because they hit the time limit. This is no surprise for the semi-supervised approaches because only the large Exathlon datasets are available to them. However, the timeouts of the NormA algorithm are surprising, especially because its runtime on similarly large datasets is usually in the range of seconds. We assume that its runtime depends on the structure of the dataset and increases when the dataset contains multiple frequent patterns.

The algorithms S-H-ESD (Twitter), SAND, DBStream, MultiHMM, Normalizing Flows, DeepAnT, Hybrid KNN, LSTM-AD, Random Black Forest (RR), and Telemanom encounter bugs when processing high contamination datasets. The reasons for those failures are a subject of further analysis.

We exclude all of the above algorithms from our further analysis.

Algorithm performance

In the next table, we show the performance of the remaining algorithms on the selected high contamination datasets according to our evaluation metrics:

| algorithm | ROC_AUC | PR_AUC | RANGE_PR_AUC | AVERAGE_PRECISION |
|---|---|---|---|---|
| Hybrid Isolation Forest (HIF) | 0.999918 | 0.999469 | 0.971839 | 0.999469 |
| OmniAnomaly | 0.957648 | 0.699779 | 0.628670 | 0.699828 |
| RobustPCA | 0.773257 | 0.268411 | 0.771254 | 0.268467 |
| DWT-MLEAD | 0.747859 | 0.451350 | 0.406806 | 0.449522 |
| k-Means | 0.746775 | 0.419962 | 0.432895 | 0.420930 |
| GrammarViz | 0.702291 | 0.355281 | 0.394412 | 0.363837 |
| Subsequence IF | 0.630121 | 0.238262 | 0.229408 | 0.239008 |
| ARIMA | 0.621403 | 0.231054 | 0.215995 | 0.254633 |
| Isolation Forest (iForest) | 0.618316 | 0.276577 | 0.244185 | 0.275727 |
| Torsk | 0.615219 | 0.275777 | 0.256936 | 0.255155 |
| Subsequence LOF | 0.614126 | 0.246476 | 0.237810 | 0.247044 |
| COPOD | 0.610549 | 0.280198 | 0.247957 | 0.279020 |
| SSA | 0.604314 | 0.347935 | 0.320848 | 0.181269 |
| Extended Isolation Forest (EIF) | 0.593870 | 0.200386 | 0.233842 | 0.199645 |
| HBOS | 0.592026 | 0.251733 | 0.212789 | 0.223852 |
| PCC | 0.578449 | 0.260759 | 0.242622 | 0.255479 |
| Series2Graph | 0.570307 | 0.215902 | 0.199628 | 0.216754 |
| PCI | 0.568946 | 0.160600 | 0.128982 | 0.161539 |
| FFT | 0.551464 | 0.398682 | 0.411709 | 0.162788 |
| MedianMethod | 0.546102 | 0.129978 | 0.115209 | 0.130054 |
| KNN | 0.544524 | 0.191582 | 0.204380 | 0.167162 |
| Left STAMPi | 0.543362 | 0.201372 | 0.231741 | 0.202124 |
| CBLOF | 0.541039 | 0.182654 | 0.205906 | 0.169130 |
| PhaseSpace-SVM | 0.538457 | 0.205488 | 0.211348 | 0.131119 |
| IF-LOF | 0.534835 | 0.157076 | 0.158743 | 0.133459 |
| NumentaHTM | 0.531800 | 0.130608 | 0.134919 | 0.121894 |
| LOF | 0.525029 | 0.150761 | 0.169535 | 0.123988 |
| COF | 0.517095 | 0.126022 | 0.109826 | 0.121233 |
| TSBitmap | 0.511606 | 0.141991 | 0.184770 | 0.149793 |
| DSPOT | 0.504938 | 0.269787 | 0.250136 | 0.105735 |
| VALMOD | 0.501475 | 0.168835 | 0.179224 | 0.158639 |
| Spectral Residual (SR) | 0.497764 | 0.106476 | 0.104096 | 0.106886 |
| Triple ES (Holt-Winter's) | 0.488773 | 0.131470 | 0.126784 | 0.138829 |
| HOT SAX | 0.484899 | 0.172027 | 0.176918 | 0.180242 |
| STAMP | 0.423169 | 0.138710 | 0.207461 | 0.139272 |
| PST | 0.420949 | 0.143024 | 0.164422 | 0.116969 |
| STOMP | 0.410927 | 0.144260 | 0.215710 | 0.132876 |

The Hybrid Isolation Forest and OmniAnomaly methods seem to cope very well with the high number of anomalies, while most of the other methods struggle to produce good results. To contrast the algorithms' performance on the high contamination datasets with their overall performance, we plot the change in their metric scores (\( \text{mean}(\text{metric on high contamination}) - \text{mean}(\text{metric on all}) \)) in the following figure. You can change the displayed metric by clicking on the metric name in the legend.

In the above figure, we can see that only HIF, OmniAnomaly, RobustPCA, k-Means, FFT, and SSA achieve a higher ROC_AUC score on the high contamination datasets than on all datasets. All other methods perform worse on the high contamination datasets. The discord-based methods STAMP, VALMOD, and STOMP, which are very good methods overall, struggle the most with the high contamination datasets: their performance degrades to random guessing (ROC_AUC \( \simeq 0.5 \); cf. the previous table).
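For reference, the plotted difference can be computed with a few lines of pandas. The file name and column layout in this sketch are assumptions for illustration, not our actual result format:

```python
import pandas as pd

# Hypothetical layout: one row per (algorithm, dataset) pair with one column
# per metric and a boolean flag marking the high contamination datasets.
results = pd.read_csv("results.csv")

mean_all = results.groupby("algorithm")["ROC_AUC"].mean()
mean_high = (
    results[results["high_contamination"]]
    .groupby("algorithm")["ROC_AUC"]
    .mean()
)
delta = (mean_high - mean_all).sort_values()  # negative: worse on high contamination
print(delta)
```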

We must note here that the high contamination datasets of the NAB collection seem to be labeled poorly. The following example shows one of the time series. We cannot find a reason why the third anomalous region should be labeled anomalous. If this region actually is anomalous, then either there is no indication of it in the time series or more temporal context is needed to find this anomaly.

The first three algorithms, Hybrid Isolation Forest (HIF), OmniAnomaly, and RobustPCA, were not executed on the NAB datasets because they require training data.

Parameter sensitivity

In this case study, we take a detailed look at the parameter sensitivity of the SSA algorithm. SSA is very susceptible to small changes of its window size parameter. We demonstrate this sensitivity with a small experiment on a single dataset, varying only the window size parameter.

Experiment setup: We use one of our synthetically generated GutenTAG datasets, called “ecg-diff-count-5”, which has a dominant frequency of 20 (the average peak-to-peak distance of the ECG signal). We vary the window_size parameter of SSA and fix all other parameters. We start by setting window_size to \(2 \cdot \text{period} = 40\) and then look one step to the left (\(39\)) and to the right (\(41\)).

Parameters: The following snippet shows the parameter values used for the experiment:

```json
{
  "rf_method": "alpha",
  "alpha": 0.2,
  "e": 3,
  "random_state": 42,
  "window_size": [39, 40, 41]
}
```

Dataset: The following plot shows the “ecg-diff-count-5” dataset with the ground truth labels:

Results

Window size 39: Setting the window_size parameter to \(39\) leads SSA to produce very high anomaly scores in general. In the anomalous regions, however, the scores drop below their previous level. This means that SSA does not correctly detect those regions as anomalous!

SSA scoring with window size 39

Window size 40: Setting the window_size parameter to a multiple of the dataset's period size allows SSA to detect all anomalies very well. The anomaly score is relatively small for the normal parts of the time series and shows noticeable peaks in the anomalous regions. The anomalous regions can clearly be separated from the normal behavior.

Note that SSA primarily detects the beginning and the end of an anomalous region (i.e., the short-term pattern change).

SSA scoring with window size 40

Window size 41: Increasing the window_size parameter by \(1\) (the smallest possible increment) to \(41\) renders SSA’s results useless again. The picture is very similar to the window size 39 case.

SSA scoring with window size 41

We can further show that only whole multiples of the true period size lead to good results. The following two figures show the SSA anomaly scores when setting the window size to \( 1 \cdot \text{period} = 20 \) and \( 1.5 \cdot \text{period} = 30 \):

Window size 20 (\( 1 \cdot \text{period} = 20 \)): Good anomaly scores with a clear separation between anomalies and normal behavior.

SSA scoring with window size 20

Window size 30 (\( 1.5 \cdot \text{period} = 30 \)): Useless anomaly scores.

SSA scoring with window size 30

For this example dataset, we know the correct period size because we supplied it as an input to GutenTAG's generation process for the synthetic dataset. Other ways to retrieve the period size are careful manual inspection or estimation via statistical methods, both of which are error-prone and costly. Setting the window_size parameter of SSA is thus no easy task, and a higher robustness to this parameter value would be desirable.
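To illustrate one such statistical method, the following rough sketch estimates the dominant period from the Fourier spectrum. It is an illustration only, not the procedure used by GutenTAG or our experiments, and it fails for signals without a clear dominant frequency:

```python
import numpy as np

def estimate_period(series: np.ndarray) -> int:
    """Estimate the dominant period as the inverse of the strongest frequency."""
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.fft.rfftfreq(len(series))
    return round(1.0 / freqs[np.argmax(spectrum)])

# Synthetic sanity check: a sine wave with period 20 (the dominant frequency
# of "ecg-diff-count-5" as stated above).
t = np.arange(2000)
print(estimate_period(np.sin(2 * np.pi * t / 20)))  # 20
```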