
Experiment result analysis on the GutenTAG datasets

On this website, we present detailed results of the experiments on our synthetically generated datasets (created with GutenTAG). We show errors, qualitative results, and the runtimes of the different algorithms.

Result Overview

In this analysis, we consider only the results of all 60 relevant algorithms with their best parameter configuration on the GutenTAG datasets. These datasets are generated synthetically and were also used to find the best parameter configuration for each algorithm.

The number of experiments is smaller than \(\text{# Algos} \times \text{# Datasets}\) because univariate algorithms cannot process multivariate datasets, so those combinations are excluded. Multivariate algorithms, in contrast, can process both univariate and multivariate datasets.

The next table shows an excerpt of the result table with 9 of 26 columns. The complete table with the (quality and runtime) results of all algorithms on all datasets can be downloaded here.

algorithm dataset status ROC_AUC AVERAGE_PRECISION PR_AUC RANGE_PR_AUC execute_main_time hyper_params
0 ARIMA cbf-combined-diff-1 Status.OK 0.815319 0.454742 0.465248 0.453215 71.414111 {"differencing_degree": 1, "distance_metric": ...
1 ARIMA cbf-combined-diff-3 Status.OK 0.955978 0.241877 0.127965 0.136431 129.666755 {"differencing_degree": 1, "distance_metric": ...
2 ARIMA cbf-diff-count-1 Status.OK 0.439091 0.014368 0.008516 0.016521 72.992341 {"differencing_degree": 1, "distance_metric": ...
3 ARIMA cbf-diff-count-3 Status.OK 0.868527 0.129214 0.090548 0.053913 75.303179 {"differencing_degree": 1, "distance_metric": ...
4 ARIMA cbf-diff-count-4 Status.OK 0.626002 0.082363 0.054644 0.034841 183.925331 {"differencing_degree": 1, "distance_metric": ...
... ... ... ... ... ... ... ... ... ...
10423 k-Means sinus-type-pattern Status.OK 0.999999 0.999901 0.999900 0.577762 67.510581 {"anomaly_window_size": 100, "n_clusters": 50,...
10424 k-Means sinus-type-pattern-shift Status.OK 0.999738 0.957231 0.956725 0.544578 53.177865 {"anomaly_window_size": 100, "n_clusters": 50,...
10425 k-Means sinus-type-platform Status.OK 0.998038 0.738244 0.735666 0.555714 59.745593 {"anomaly_window_size": 100, "n_clusters": 50,...
10426 k-Means sinus-type-trend Status.OK 0.999994 0.999410 0.999407 0.560816 48.915376 {"anomaly_window_size": 100, "n_clusters": 50,...
10427 k-Means sinus-type-variance Status.OK 0.999990 0.999019 0.999014 0.579041 82.035531 {"anomaly_window_size": 100, "n_clusters": 50,...

10428 rows × 9 columns
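The downloaded table can be inspected with standard tooling. A minimal sketch, assuming the download is a CSV file (the name `results.csv` is hypothetical) with the columns shown in the excerpt and `hyper_params` stored as a JSON string:

```python
import json

import pandas as pd

# Load the full result table (file name is an assumption; adjust to the
# actual download).
df = pd.read_csv("results.csv")

# The hyper_params column stores each algorithm's parameter configuration
# as a JSON string; parse it into a dict per row.
df["hyper_params"] = df["hyper_params"].apply(json.loads)

# Reproduce the excerpt shown above: quality metrics and runtime only.
excerpt = df[["algorithm", "dataset", "status", "ROC_AUC",
              "AVERAGE_PRECISION", "PR_AUC", "RANGE_PR_AUC",
              "execute_main_time", "hyper_params"]]
print(excerpt.head())
```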

Error analysis

We first want to look at the ability of the algorithms to process the different datasets. Some algorithms are limited by our time and memory constraints, while others produce errors when specific invariants are violated or implementation deficits are encountered.
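To illustrate how such constraints can be enforced, the following sketch runs an algorithm in a subprocess with a wall-clock timeout and an address-space limit. The 2-hour and 3 GB values are taken from the surrounding text; the command line and script name are hypothetical, and the actual benchmark uses its own orchestration:

```python
import resource
import subprocess

TIME_LIMIT_S = 2 * 60 * 60      # 2 h per training/test phase
MEMORY_LIMIT_B = 3 * 1024 ** 3  # 3 GB address-space limit

def limit_memory() -> None:
    # Applied in the child process before exec (POSIX only).
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_B, MEMORY_LIMIT_B))

try:
    subprocess.run(
        ["python", "run_algorithm.py", "--dataset", "cbf-combined-diff-1"],
        preexec_fn=limit_memory,
        timeout=TIME_LIMIT_S,
        check=True,
    )
    status = "OK"
except subprocess.TimeoutExpired:
    status = "TIMEOUT"
except subprocess.CalledProcessError:
    status = "ERROR"  # includes OOM kills, bugs, and failed assumptions
```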

Algorithm problems grouped by algorithm training type

Unsupervised:

algo_input_dimensionality algorithm Status.ERROR Status.OK Status.TIMEOUT ALL
UNIVARIATE SAND 26 137 0 163
VALMOD 6 157 0 163
Series2Graph 3 160 0 163
Left STAMPi 1 162 0 163
ARIMA 0 163 0 163
DSPOT 0 160 3 163
DWT-MLEAD 0 163 0 163
FFT 0 163 0 163
GrammarViz 0 163 0 163
HOT SAX 0 114 49 163
MedianMethod 0 163 0 163
NormA 0 153 10 163
NumentaHTM 0 163 0 163
PCI 0 163 0 163
PST 0 163 0 163
PhaseSpace-SVM 0 163 0 163
S-H-ESD (Twitter) 0 163 0 163
SSA 0 163 0 163
STAMP 0 163 0 163
STOMP 0 163 0 163
Spectral Residual (SR) 0 163 0 163
Subsequence IF 0 163 0 163
Subsequence LOF 0 163 0 163
TSBitmap 0 163 0 163
Triple ES (Holt-Winter's) 0 163 0 163
MULTIVARIATE DBStream 155 32 0 187
CBLOF 0 187 0 187
COF 0 187 0 187
COPOD 0 187 0 187
Extended Isolation Forest (EIF) 0 187 0 187
HBOS 0 187 0 187
IF-LOF 0 187 0 187
Isolation Forest (iForest) 0 187 0 187
KNN 0 187 0 187
LOF 0 187 0 187
PCC 0 187 0 187
Torsk 0 180 7 187
k-Means 0 187 0 187

Semi-supervised:

algo_input_dimensionality algorithm Status.ERROR Status.OK Status.TIMEOUT ALL
UNIVARIATE TARZAN 32 131 0 163
Bagel 0 163 0 163
Donut 0 163 0 163
ImageEmbeddingCAE 0 163 0 163
OceanWNN 0 163 0 163
Random Forest Regressor (RR) 0 163 0 163
SR-CNN 0 163 0 163
XGBoosting (RR) 0 163 0 163
MULTIVARIATE LSTM-AD 98 81 8 187
EncDec-AD 39 17 131 187
LaserDBN 23 164 0 187
DeepAnT 10 177 0 187
OmniAnomaly 4 183 0 187
HealthESN 0 150 37 187
Hybrid KNN 0 187 0 187
Random Black Forest (RR) 0 174 13 187
RobustPCA 0 187 0 187
TAnoGan 0 73 114 187
Telemanom 0 187 0 187

Supervised:

algo_input_dimensionality algorithm Status.ERROR Status.OK Status.TIMEOUT ALL
MULTIVARIATE MultiHMM 95 92 0 187
Normalizing Flows 9 66 112 187
Hybrid Isolation Forest (HIF) 0 187 0 187
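Status tables like the ones above can be derived from the result table with a simple cross-tabulation. A minimal sketch, assuming the full result table also contains `algo_training_type` and `algo_input_dimensionality` columns (both names appear in the table headers above):

```python
import pandas as pd

# File name is an assumption (cf. the loading sketch above).
df = pd.read_csv("results.csv")

# One status table per training type; margins add the ALL column and row.
for training_type, group in df.groupby("algo_training_type"):
    table = pd.crosstab(
        index=[group["algo_input_dimensionality"], group["algorithm"]],
        columns=group["status"],
        margins=True,
        margins_name="ALL",
    )
    print(training_type)
    print(table)
```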

As we can see in the above tables, most algorithms can process almost all of the datasets. In the next subsections, we highlight some outlying algorithms.

Very slow algorithms

Algorithms for which more than 50% of all executions ran into the timeout:

algo_training_type algo_input_dimensionality algorithm Status.ERROR Status.OK Status.TIMEOUT ALL
SEMI_SUPERVISED MULTIVARIATE EncDec-AD 39 17 131 187
TAnoGan 0 73 114 187
SUPERVISED MULTIVARIATE Normalizing Flows 9 66 112 187
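The 50% threshold can be expressed as a filter on the timeout rate. A minimal sketch, reusing `df` from the snippets above and assuming the status values are stored as the strings shown in the result table excerpt:

```python
# Count executions per algorithm and status.
status_table = pd.crosstab(df["algorithm"], df["status"])

# Fraction of executions per algorithm that ran into the timeout.
timeout_rate = status_table["Status.TIMEOUT"] / status_table.sum(axis=1)

# Algorithms for which more than 50% of all executions timed out.
print(status_table[timeout_rate > 0.5])
```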

All time series in the GutenTAG collection have the same length of \(10000\) points. The algorithms EncDec-AD, TAnoGan, and Normalizing Flows are large deep learning models that take a long time to train and execute. This pushes them over either the 2-hour training time limit or the 2-hour test time limit.

Almost all unsupervised algorithms are fast enough to finish within our time limit on all datasets; only HOT SAX (49 timeouts), NormA (10), Torsk (7), and DSPOT (3) occasionally ran into the timeout.

Broken algorithms

Algorithms that failed for at least 50% of their executions:

algo_training_type algo_input_dimensionality algorithm Status.ERROR Status.OK Status.TIMEOUT ALL
SEMI_SUPERVISED MULTIVARIATE LSTM-AD 98 81 8 187
SUPERVISED MULTIVARIATE MultiHMM 95 92 0 187
UNSUPERVISED MULTIVARIATE DBStream 155 32 0 187

Errors occur independently of the algorithms' learning type. The prominent algorithms in this category, LSTM-AD, MultiHMM, and DBStream, failed for more than 50% of their executions. To better understand the reasons for these failures, we distinguish between different error categories in the next section.

Categorization of errors

We categorize all observed errors into specific categories and count the number of executions affected by each category. The next table shows how often each error category was observed and which algorithms were affected.

error_category count affected algorithms (count)
- OK - 9443 all algorithms (the per-algorithm counts equal the Status.OK column in the tables above)
- OOM - 146 EncDec-AD (39), LSTM-AD (98), Normalizing Flows (9)
- TIMEOUT - 484 DSPOT (3), EncDec-AD (131), HOT SAX (49), HealthESN (37), LSTM-AD (8), NormA (10), Normalizing Flows (112), Random Black Forest (RR) (13), TAnoGan (114), Torsk (7)
Bug 177 DBStream (98), DeepAnT (10), LaserDBN (23), SAND (25), Series2Graph (3), TARZAN (12), VALMOD (6)
Incompatible parameters 55 DBStream (55)
Invariance/assumption not met 1 Left STAMPi (1)
Max recursion depth exceeded 20 TARZAN (20)
Model loading error 4 OmniAnomaly (4)
Not converged 95 MultiHMM (95)
Wrong shape error 1 SAND (1)
other 2 DBStream (2)
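The categories are assigned by matching the recorded error messages against characteristic patterns. A minimal sketch of such a rule-based mapping; the patterns below are illustrative assumptions, not the exact rules we used:

```python
import re

# Illustrative patterns -> category; the real mapping contains more rules.
ERROR_RULES = [
    (re.compile(r"out of memory|oom", re.I), "- OOM -"),
    (re.compile(r"maximum recursion depth exceeded", re.I),
     "Max recursion depth exceeded"),
    (re.compile(r"did not converge|not converged", re.I), "Not converged"),
    (re.compile(r"incompatible parameter", re.I), "Incompatible parameters"),
    (re.compile(r"shape", re.I), "Wrong shape error"),
]

def categorize(message: str) -> str:
    """Map a captured error message to its error category."""
    for pattern, category in ERROR_RULES:
        if pattern.search(message):
            return category
    return "other"
```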

We can see, for example, that the high error rate of LSTM-AD is mostly due to it hitting the memory limit of 3 GB. The errors of MultiHMM, however, are due to its model not reaching a converged state during training. We assume that some assumptions of the MultiHMM approach are not met by the datasets on which it fails.

In general, our GutenTAG datasets are well defined and easy to process: 91% of all experiments were successful, and another 6% of all experiments failed only because they exceeded our time or memory limits (timeouts and OOMs).

Algorithm quality assessment based on ROC_AUC

The next table shows the min, mean, median, and max ROC_AUC metric score computed over all datasets for each algorithm:

algorithm min mean median max
LSTM-AD 0.123730 0.965738 0.996443 1.000000
Subsequence LOF 0.341819 0.941804 0.995904 1.000000
PhaseSpace-SVM 0.307866 0.920328 0.980000 0.999928
DWT-MLEAD 0.125859 0.907602 0.972041 0.999992
SAND 0.167172 0.898257 0.984132 1.000000
Donut 0.151962 0.894965 0.973340 1.000000
GrammarViz 0.207808 0.894852 0.991579 1.000000
Torsk 0.077172 0.885825 0.979313 0.999990
Left STAMPi 0.126879 0.880459 0.981922 1.000000
EncDec-AD 0.344264 0.877664 0.999900 1.000000
STOMP 0.009910 0.874267 0.988399 1.000000
STAMP 0.009910 0.874142 0.988399 1.000000
k-Means 0.000000 0.872913 0.997220 1.000000
Normalizing Flows 0.004679 0.869716 0.994933 1.000000
Telemanom 0.092208 0.863892 0.977484 1.000000
Series2Graph 0.069038 0.861379 0.942775 1.000000
Random Forest Regressor (RR) 0.405782 0.860457 0.883773 1.000000
VALMOD 0.055046 0.858050 0.971650 1.000000
XGBoosting (RR) 0.373735 0.856619 0.886839 1.000000
HealthESN 0.107862 0.853132 0.915416 1.000000
ImageEmbeddingCAE 0.106465 0.851142 0.944112 0.998586
Random Black Forest (RR) 0.130278 0.818027 0.843654 1.000000
ARIMA 0.050505 0.816814 0.895639 1.000000
PST 0.016049 0.803791 0.871631 1.000000
NormA 0.013301 0.786847 0.954595 1.000000
SSA 0.114735 0.771233 0.845423 0.999800
Subsequence IF 0.000020 0.765155 0.841325 1.000000
OceanWNN 0.156114 0.734238 0.752219 1.000000
HOT SAX 0.147374 0.731207 0.760240 1.000000
DeepAnT 0.000095 0.726896 0.853177 1.000000
DBStream 0.102123 0.719925 0.783729 1.000000
PCI 0.022453 0.696556 0.662587 1.000000
Triple ES (Holt-Winter's) 0.239234 0.673339 0.668647 1.000000
NumentaHTM 0.377848 0.670671 0.645183 0.999600
LaserDBN 0.119141 0.655523 0.659910 0.999650
MedianMethod 0.004040 0.648901 0.567188 1.000000
FFT 0.014141 0.644080 0.593000 1.000000
OmniAnomaly 0.077511 0.644023 0.658707 0.998544
TSBitmap 0.132703 0.637278 0.624381 0.998600
KNN 0.000000 0.614195 0.623641 1.000000
Extended Isolation Forest (EIF) 0.000000 0.609879 0.594593 1.000000
CBLOF 0.039293 0.606390 0.558942 1.000000
Isolation Forest (iForest) 0.000051 0.603377 0.589781 1.000000
HBOS 0.144394 0.599450 0.585596 1.000000
Hybrid Isolation Forest (HIF) 0.054343 0.599160 0.584366 1.000000
IF-LOF 0.000101 0.587309 0.559124 1.000000
LOF 0.164697 0.577457 0.534933 1.000000
Spectral Residual (SR) 0.002450 0.568847 0.544846 1.000000
S-H-ESD (Twitter) 0.473684 0.559200 0.500000 1.000000
COF 0.000000 0.555700 0.521308 1.000000
DSPOT 0.273283 0.554605 0.501351 1.000000
COPOD 0.000051 0.543097 0.526410 1.000000
PCC 0.055375 0.532033 0.508369 1.000000
Bagel 0.058306 0.525684 0.550798 0.934293
RobustPCA 0.000000 0.514437 0.500000 1.000000
SR-CNN 0.500000 0.502331 0.500000 0.880000
TAnoGan 0.000960 0.481889 0.481301 0.999596
MultiHMM 0.047605 0.478073 0.488418 1.000000
TARZAN 0.000571 0.474698 0.486515 0.999784
Hybrid KNN 0.000003 0.449687 0.444118 1.000000
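The aggregation behind this table is a plain groupby over the result table. A minimal sketch, reusing `df` from the snippets above and assuming that only successful executions enter the aggregation:

```python
# Aggregate ROC_AUC over all datasets per algorithm (successful runs only).
ok = df[df["status"] == "Status.OK"]
roc_stats = (
    ok.groupby("algorithm")["ROC_AUC"]
      .agg(["min", "mean", "median", "max"])
      .sort_values("mean", ascending=False)
)
print(roc_stats.head(5))  # best algorithms by mean ROC_AUC
print(roc_stats.tail(5))  # worst algorithms by mean ROC_AUC
```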

The following boxplots give a more visual picture of the score distributions. The algorithms are ordered by their mean ROC_AUC score (the mean itself is not shown in the visualization), and the first and last 10 algorithms are displayed by default. Use the legend on the right to display additional algorithms.

Best algorithms (based on mean ROC_AUC)

algorithm min mean median max
LSTM-AD 0.123730 0.965738 0.996443 1.000000
Subsequence LOF 0.341819 0.941804 0.995904 1.000000
PhaseSpace-SVM 0.307866 0.920328 0.980000 0.999928
DWT-MLEAD 0.125859 0.907602 0.972041 0.999992
SAND 0.167172 0.898257 0.984132 1.000000

Worst algorithms (based on mean ROC_AUC)

algorithm min mean median max
SR-CNN 0.500000 0.502331 0.500000 0.880000
TAnoGan 0.000960 0.481889 0.481301 0.999596
MultiHMM 0.047605 0.478073 0.488418 1.000000
TARZAN 0.000571 0.474698 0.486515 0.999784
Hybrid KNN 0.000003 0.449687 0.444118 1.000000

Scores of best algorithms

In the next figure, we show the scorings of the 4 best algorithms on the dataset “sinus-diff-count-2”:

Runtime-weighted ROC_AUC scores

In the next figure, we try to combine the runtime and result quality of the algorithms into one metric by weighting each algorithm's ROC_AUC score with its inversely scaled overall runtime. Algorithms that take exceptionally long to process the datasets are penalized and receive a smaller weighted ROC_AUC score, while very fast algorithms keep their original ROC_AUC score.
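One way to realize such a weighting is to min-max scale each algorithm's mean overall runtime and multiply the ROC_AUC score by the complement. This sketch only illustrates the idea and reuses `ok` from the snippet above; the column name `train_main_time` and the exact scaling are assumptions:

```python
# Hypothetical column "train_main_time" (NaN for unsupervised algorithms);
# "execute_main_time" appears in the result table excerpt.
ok = ok.assign(
    overall_time=ok["train_main_time"].fillna(0.0) + ok["execute_main_time"]
)

# Mean quality and mean runtime per algorithm.
agg = ok.groupby("algorithm").agg(
    roc_auc=("ROC_AUC", "mean"),
    runtime=("overall_time", "mean"),
)

# Min-max scale the runtime to [0, 1]; its complement is the weight:
# fast algorithms keep (almost) their original score, exceptionally slow
# algorithms are pushed towards 0.
scaled = (agg["runtime"] - agg["runtime"].min()) / (
    agg["runtime"].max() - agg["runtime"].min()
)
agg["weighted_roc_auc"] = agg["roc_auc"] * (1.0 - scaled)
print(agg.sort_values("weighted_roc_auc", ascending=False).head())
```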

Algorithm runtime assessment

This section looks at the runtime of the algorithms. In our paper, we distinguish between training and execution runtime; the following figures consider only the combined (overall) runtime.

The next table shows the min, mean, median, and max overall runtime (in seconds), aggregated over all GutenTAG datasets, for each algorithm.

Keep in mind that all GutenTAG datasets have the same length of \(10000\) points and most datasets contain only a single channel; only 25 datasets are multivariate.

algorithm min mean median max
DBStream 0.000000 1.485884 0.000000 13.955889
MedianMethod 2.167620 3.646354 2.796953 7.528286
TSBitmap 2.092893 4.204380 4.882090 7.265769
Spectral Residual (SR) 2.978843 4.448068 3.889712 7.908754
FFT 2.768183 4.683874 3.599423 21.075458
PCI 4.014369 6.203468 5.278512 9.260761
Extended Isolation Forest (EIF) 5.236932 6.599313 6.065740 10.825573
DWT-MLEAD 5.203950 6.958002 6.412318 11.201316
PCC 5.205858 7.298360 6.820794 12.648174
KNN 5.194909 7.352563 6.541939 32.120757
LOF 5.189140 7.476626 6.669530 12.498324
COPOD 6.203962 8.305900 7.604242 14.220524
NormA 0.000000 8.640119 6.737077 34.183225
IF-LOF 6.299420 8.946259 8.490258 18.329226
TARZAN 0.000000 9.336935 10.090874 19.855369
HBOS 7.425389 9.983120 9.324363 15.828537
LaserDBN 0.000000 10.513870 10.716289 19.874952
Isolation Forest (iForest) 8.048582 11.019563 10.572865 14.398077
STOMP 10.053667 11.900743 11.528972 16.301006
Subsequence IF 9.187731 13.441904 13.585095 22.905463
S-H-ESD (Twitter) 10.025252 13.829732 13.678194 22.759653
CBLOF 8.141591 14.455307 9.720897 105.374497
Subsequence LOF 6.378969 18.856173 21.301228 74.296122
SSA 9.810762 18.941656 19.145623 41.621639
GrammarViz 2.366554 22.910533 21.936855 79.390667
Series2Graph 0.000000 23.010118 21.989442 53.055090
PST 7.068686 24.472725 29.936323 40.106281
MultiHMM 0.000000 33.024754 0.000000 392.575913
RobustPCA 11.206035 35.213852 14.051364 519.506858
XGBoosting (RR) 29.423896 36.827480 35.782439 43.991729
STAMP 2.546820 38.341096 27.600829 377.011789
COF 27.155001 39.020412 39.622869 65.967019
Left STAMPi 0.000000 53.121713 53.934283 60.658616
VALMOD 0.000000 53.299238 59.732887 101.013242
PhaseSpace-SVM 24.383494 82.067981 61.484677 310.056020
SAND 0.000000 85.901954 47.611511 595.575047
NumentaHTM 72.370925 91.581986 90.902945 135.339259
k-Means 5.902524 98.254350 74.116743 605.901906
OceanWNN 147.090236 250.583056 219.784961 618.353864
Donut 238.261296 339.671690 355.775571 459.234046
Hybrid Isolation Forest (HIF) 298.579914 485.455303 483.189570 725.922349
EncDec-AD 0.000000 504.099002 0.000000 7732.446555
ImageEmbeddingCAE 19.853276 654.139068 574.251242 1803.073701
Normalizing Flows 0.000000 680.576232 0.000000 7228.146324
HOT SAX 0.000000 927.950220 597.805815 4702.380252
Torsk 0.000000 1244.610911 1103.504400 6335.579315
ARIMA 71.414111 1253.151379 698.592820 6603.765796
DSPOT 0.000000 1295.773654 125.862472 6931.930536
Random Black Forest (RR) 0.000000 1477.619365 1424.658453 7014.062132
SR-CNN 734.100854 1478.505805 1580.434546 3366.463148
Telemanom 214.989800 1836.957278 1572.787677 7286.619047
Hybrid KNN 332.475525 1929.474488 1480.946807 7235.540268
Random Forest Regressor (RR) 1081.153213 2118.757325 1934.163334 3809.431023
Triple ES (Holt-Winter's) 1662.362775 2487.212241 2361.207052 5115.088311
Bagel 1730.130522 2771.290942 2378.648069 9596.165659
DeepAnT 0.000000 3128.522926 2938.454040 7397.980260
HealthESN 0.000000 3180.770797 3259.633157 7226.635755
LSTM-AD 0.000000 3269.571687 0.000000 8320.575837
TAnoGan 0.000000 3839.777410 0.000000 13536.386278
OmniAnomaly 0.000000 7113.762881 7276.879391 7304.766755
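The overall runtime is the combination of training and execution time. A minimal sketch, reusing `df` from the snippets above; the column name `train_main_time` is hypothetical (only `execute_main_time` appears in the excerpt), and unsupervised algorithms without a training phase are treated as zero training time:

```python
# Combined runtime per experiment: training time plus execution time.
df["overall_time"] = (
    df["train_main_time"].fillna(0.0) + df["execute_main_time"].fillna(0.0)
)

runtime_stats = (
    df.groupby("algorithm")["overall_time"]
      .agg(["min", "mean", "median", "max"])
      .sort_values("mean")
)
print(runtime_stats)
```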

The following boxplots give a more visual picture of the runtime distributions. The algorithms are ordered by their mean overall runtime and the first and last 10 algorithms are shown by default. Use the legend on the right to display additional algorithms.

In the next figure, we show the algorithm mean runtime in relation to the achieved mean ROC_AUC score. We distinguish between the different learning types because the runtime of an algorithm depends on its learning procedure.
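Such a figure can be produced with a simple scatter plot over per-algorithm aggregates. A minimal matplotlib sketch, reusing `df` (including the `overall_time` column from the previous sketch) and the assumed `algo_training_type` column:

```python
import matplotlib.pyplot as plt

ok = df[df["status"] == "Status.OK"]
means = ok.groupby(["algo_training_type", "algorithm"]).agg(
    roc_auc=("ROC_AUC", "mean"),
    runtime=("overall_time", "mean"),
).reset_index()

fig, ax = plt.subplots()
for training_type, group in means.groupby("algo_training_type"):
    ax.scatter(group["runtime"], group["roc_auc"], label=training_type)
ax.set_xscale("log")  # runtimes span several orders of magnitude
ax.set_xlabel("mean overall runtime [s]")
ax.set_ylabel("mean ROC_AUC")
ax.legend()
plt.show()
```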

Attention

The following figure does not account for OOM or TIMEOUT errors. This is especially visible for Normalizing Flows (supervised), which ran into the time limit for most of the datasets but nevertheless shows a relatively small runtime in the figure below! For algorithms with many errors (cf. Section Error analysis), the aggregated runtimes and metric scores are not meaningful.

Detailed analysis of certain algorithm or dataset aspects

Best algorithms for base oscillations


Sine

ECG

Random Walk

CBF

Poly

Best algorithms for anomaly type

Extremum

Frequency

Mean Shift

Pattern

Pattern Shift

Platform

Variance

Amplitude

Trend

Most fluctuating algorithms based on anomaly type

Best algorithms for single/multiple-same/multiple-different anomalies

Single anomaly datasets

Multiple same anomalies datasets

Multiple different anomaly datasets

Best algorithm per algorithm family (based on mean ROC_AUC)

algo_family algorithm ROC_AUC (mean)
trees PST 0.803791
reconstruction Donut 0.894965
forecasting LSTM-AD 0.965738
encoding GrammarViz 0.894852
distribution DWT-MLEAD 0.907602
distance Subsequence LOF 0.941804
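This table corresponds to selecting, within each family, the algorithm with the highest mean ROC_AUC. A minimal sketch, reusing `ok` from the snippets above and assuming an `algo_family` column that maps each algorithm to its family:

```python
# Mean ROC_AUC per (family, algorithm), then the best algorithm per family.
family_means = (
    ok.groupby(["algo_family", "algorithm"])["ROC_AUC"]
      .mean()
      .reset_index()
)
best_per_family = family_means.loc[
    family_means.groupby("algo_family")["ROC_AUC"].idxmax()
]
print(best_per_family)
```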