Experiment result analysis on the Benchmark datasets

On this website, we present detailed results of the experiments on the Benchmark datasets that we collected from various sources (see the Datasets Overview). We show the errors, qualitative results, and runtimes of the different algorithms.

Result Overview

In this analysis, we look at the results of all 60 relevant algorithms with their best parameter configuration on the Benchmark datasets (excluding our synthetically generated GutenTAG datasets).

The number of experiments is smaller than \(\text{# Algos} \times \text{# Datasets}\) because univariate algorithms cannot process multivariate datasets, and (semi-)supervised algorithms can process only datasets that provide the required training time series. Non-compliant combinations are excluded.

The next table shows an excerpt of the result table with 10 of 26 columns. The complete table with the (quality and runtime) results of all algorithms on all datasets can be downloaded here.

| | algorithm | collection | dataset | status | ROC_AUC | AVERAGE_PRECISION | PR_AUC | RANGE_PR_AUC | execute_main_time | hyper_params |
|---|---|---|---|---|---|---|---|---|---|---|
| 10428 | ARIMA | IOPS | 4d2af31a-9916-3d9f-8a8e-8a268a48c095 | Status.TIMEOUT | NaN | NaN | NaN | NaN | NaN | {"differencing_degree": 1, "distance_metric": ... |
| 10429 | ARIMA | IOPS | 42d6616d-c9c5-370a-a8ba-17ead74f3114 | Status.TIMEOUT | NaN | NaN | NaN | NaN | NaN | {"differencing_degree": 1, "distance_metric": ... |
| 10430 | ARIMA | IOPS | c02607e8-7399-3dde-9d28-8a8da5e5d251 | Status.OK | 0.838021 | 0.320436 | 0.522335 | 0.166973 | 1553.363329 | {"differencing_degree": 1, "distance_metric": ... |
| 10431 | ARIMA | IOPS | 301c70d8-1630-35ac-8f96-bc1b6f4359ea | Status.OK | 0.644397 | 0.086475 | 0.076127 | 0.072827 | 578.490987 | {"differencing_degree": 1, "distance_metric": ... |
| 10432 | ARIMA | KDD-TSAD | 001_UCR_Anomaly_DISTORTED1sddb40 | Status.OK | 0.162272 | 0.006566 | 0.004535 | 0.004952 | 2914.872097 | {"differencing_degree": 1, "distance_metric": ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45126 | VALMOD | NormA | Discords_marotta_valve_tek_17 | Status.OK | 0.536991 | 0.025921 | 0.025703 | 0.040605 | 26.353962 | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45127 | VALMOD | NormA | Discords_patient_respiration1 | Status.OK | 0.819914 | 0.223165 | 0.222676 | 0.159304 | 24.373522 | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45128 | VALMOD | NormA | SinusRW_Length_104000_AnomalyL_200_AnomalyN_20... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45129 | VALMOD | NormA | SinusRW_Length_106000_AnomalyL_100_AnomalyN_60... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45130 | VALMOD | NormA | SinusRW_Length_108000_AnomalyL_200_AnomalyN_40... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |

34703 rows × 10 columns
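
The result table can be explored with pandas. A minimal sketch, assuming the download was saved as `results.csv` (a hypothetical file name):

```python
import pandas as pd

# Load the complete result table ("results.csv" is a placeholder for
# the downloaded file).
df = pd.read_csv("results.csv")

# Restrict the view to the 10 excerpt columns shown above.
excerpt_columns = [
    "algorithm", "collection", "dataset", "status", "ROC_AUC",
    "AVERAGE_PRECISION", "PR_AUC", "RANGE_PR_AUC",
    "execute_main_time", "hyper_params",
]
print(df[excerpt_columns])
```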

Error analysis

We first want to look at the ability of the algorithms to process the different datasets. Some algorithms are limited by our time and memory constraints, while others produce errors when their assumptions (invariants) are violated or when they hit implementation deficits.

Algorithm problems grouped by algorithm training type

Unsupervised:

| algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | S-H-ESD (Twitter) | 441 | 298 | 0 | 739 |
| | SAND | 186 | 505 | 48 | 739 |
| | VALMOD | 168 | 559 | 12 | 739 |
| | Triple ES (Holt-Winter's) | 79 | 524 | 136 | 739 |
| | Series2Graph | 45 | 694 | 0 | 739 |
| | NormA | 39 | 622 | 78 | 739 |
| | PST | 36 | 703 | 0 | 739 |
| | HOT SAX | 18 | 554 | 167 | 739 |
| | Left STAMPi | 12 | 712 | 15 | 739 |
| | ARIMA | 4 | 676 | 59 | 739 |
| | PhaseSpace-SVM | 2 | 626 | 111 | 739 |
| | MedianMethod | 1 | 738 | 0 | 739 |
| | NumentaHTM | 1 | 735 | 3 | 739 |
| | DSPOT | 0 | 686 | 53 | 739 |
| | DWT-MLEAD | 0 | 739 | 0 | 739 |
| | FFT | 0 | 739 | 0 | 739 |
| | GrammarViz | 0 | 713 | 26 | 739 |
| | PCI | 0 | 739 | 0 | 739 |
| | SSA | 0 | 734 | 5 | 739 |
| | STAMP | 0 | 701 | 38 | 739 |
| | STOMP | 0 | 725 | 14 | 739 |
| | Spectral Residual (SR) | 0 | 739 | 0 | 739 |
| | Subsequence IF | 0 | 739 | 0 | 739 |
| | Subsequence LOF | 0 | 721 | 18 | 739 |
| | TSBitmap | 0 | 739 | 0 | 739 |
| MULTIVARIATE | DBStream | 627 | 161 | 1 | 789 |
| | COF | 239 | 550 | 0 | 789 |
| | k-Means | 47 | 742 | 0 | 789 |
| | IF-LOF | 25 | 762 | 2 | 789 |
| | CBLOF | 3 | 786 | 0 | 789 |
| | Torsk | 2 | 721 | 66 | 789 |
| | COPOD | 0 | 789 | 0 | 789 |
| | Extended Isolation Forest (EIF) | 0 | 789 | 0 | 789 |
| | HBOS | 0 | 789 | 0 | 789 |
| | Isolation Forest (iForest) | 0 | 789 | 0 | 789 |
| | KNN | 0 | 789 | 0 | 789 |
| | LOF | 0 | 789 | 0 | 789 |
| | PCC | 0 | 789 | 0 | 789 |

Semi-supervised:

| algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | TARZAN | 50 | 250 | 0 | 300 |
| | OceanWNN | 45 | 255 | 0 | 300 |
| | Donut | 13 | 283 | 4 | 300 |
| | Bagel | 9 | 205 | 86 | 300 |
| | ImageEmbeddingCAE | 6 | 294 | 0 | 300 |
| | SR-CNN | 6 | 191 | 103 | 300 |
| | XGBoosting (RR) | 2 | 298 | 0 | 300 |
| | Random Forest Regressor (RR) | 0 | 246 | 54 | 300 |
| MULTIVARIATE | LSTM-AD | 173 | 85 | 65 | 323 |
| | EncDec-AD | 110 | 66 | 147 | 323 |
| | OmniAnomaly | 27 | 296 | 0 | 323 |
| | Random Black Forest (RR) | 23 | 281 | 19 | 323 |
| | LaserDBN | 16 | 285 | 22 | 323 |
| | DeepAnT | 14 | 309 | 0 | 323 |
| | Hybrid KNN | 9 | 314 | 0 | 323 |
| | TAnoGan | 4 | 103 | 216 | 323 |
| | HealthESN | 2 | 23 | 298 | 323 |
| | RobustPCA | 0 | 323 | 0 | 323 |
| | Telemanom | 0 | 322 | 1 | 323 |

Supervised:

| algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| MULTIVARIATE | MultiHMM | 5 | 1 | 0 | 6 |
| | Normalizing Flows | 2 | 1 | 3 | 6 |
| | Hybrid Isolation Forest (HIF) | 0 | 5 | 1 | 6 |

As we can see in the tables above, some algorithms are severely impacted by our time limit of 2 hours: they hit the time limit for a majority of the datasets. This applies not only to multivariate but also to univariate algorithms. In addition, many algorithms run into errors. The error count in the tables includes memory errors caused by algorithms hitting our memory limit of 3 GB. We highlight some outlying algorithms in the next sections.

In general, 87.43% of all experiments were successful, 5.39% were timeouts, and 7.18% were errors.
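
These percentages follow directly from the status column of the result table. A minimal sketch, reusing the hypothetical `df` from above and assuming the statuses are stored as the strings shown in the tables:

```python
# Relative share of experiments per status; ERROR includes the OOM cases.
shares = df["status"].value_counts(normalize=True).mul(100).round(2)
print(shares)  # roughly: Status.OK 87.43, Status.ERROR 7.18, Status.TIMEOUT 5.39
```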

Very slow algorithms

Algorithms for which at least 50% of all executions ran into the timeout:

| algo_training_type | algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|---|
| SEMI_SUPERVISED | MULTIVARIATE | HealthESN | 2 | 23 | 298 | 323 |
| | | TAnoGan | 4 | 103 | 216 | 323 |
| SUPERVISED | MULTIVARIATE | Normalizing Flows | 2 | 1 | 3 | 6 |

HealthESN and TAnoGan are the two algorithms that hit the time limit the most. Both algorithms are semi-supervised and require a training step. We used time limits of 2 hours for the training step and 2 hours for the testing step.

Normalizing Flows is a supervised algorithm that also hit the time limit for half of the datasets.

HealthESN, TAnoGan, and Normalizing Flows are also the algorithms with the most timeouts for our GutenTAG datasets (cf. GutenTAG result analysis).

Broken algorithms

Algorithms that failed for at least 50% of their executions:

| algo_training_type | algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|---|
| SEMI_SUPERVISED | MULTIVARIATE | LSTM-AD | 173 | 85 | 65 | 323 |
| SUPERVISED | MULTIVARIATE | MultiHMM | 5 | 1 | 0 | 6 |
| UNSUPERVISED | MULTIVARIATE | DBStream | 627 | 161 | 1 | 789 |
| | UNIVARIATE | S-H-ESD (Twitter) | 441 | 298 | 0 | 739 |

Similar to the GutenTAG datasets, LSTM-AD, MultiHMM, and DBStream are the algorithms with the most errors (cf. GutenTAG result analysis). In addition to those three, S-H-ESD has massive problems with the Benchmark datasets and produces errors for about 60% of its experiments (compared to 0% on the GutenTAG datasets).

To better understand why algorithms fail, we distinguish between different error types in the next section.

Categorization of errors

We categorize all observed errors into specific categories and then count the number of executions whose errors belong to each category. The next table shows how often each error category was observed, summed over all algorithms:

| error_category | count |
|---|---|
| - OK - | 30341 |
| - OOM - | 754 |
| - TIMEOUT - | 1871 |
| Bug | 847 |
| Incompatible parameters | 648 |
| Invariance/assumption not met | 122 |
| LinAlgError | 21 |
| Max recursion depth exceeded | 30 |
| Model loading error | 5 |
| Not converged | 8 |
| TimeEval:IndexError | 3 |
| Wrong shape error | 13 |
| other | 16 |
| unexpected Inf or NaN | 24 |

LSTM-AD's errors are dominated by OOMs, MultiHMM does not converge, DBStream has implementation errors that we could not fix, and S-H-ESD has assumptions that are not met by the datasets. S-H-ESD's errors are categorized as Incompatible parameters, which is not entirely accurate: S-H-ESD expects a timestamp index for the time series, and the data should span multiple days or weeks so that seasonality can be removed. We use a heuristic to convert incremental indices to timestamps and to set a suitable time span. If the heuristic cannot adapt a dataset to S-H-ESD's assumptions, it raises an exception, which we record as an Incompatible parameters error.
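
To illustrate the idea behind this heuristic (not our exact implementation), the following sketch replaces an incremental integer index with an artificial timestamp index spanning several weeks, so that seasonality removal becomes applicable; the function name and the chosen span are made up for this example:

```python
import pandas as pd

def add_timestamp_index(ts: pd.DataFrame, span: str = "4W") -> pd.DataFrame:
    """Replace an incremental integer index with an artificial datetime
    index spanning `span`, as required by S-H-ESD (illustrative only)."""
    n = len(ts)
    step = pd.Timedelta(span) / max(n - 1, 1)  # spacing between points
    ts = ts.copy()
    ts.index = pd.date_range(start="2020-01-01", periods=n, freq=step)
    return ts
```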

Algorithm quality assessment based on ROC_AUC

The next table shows the minimum, mean, median, and maximum ROC_AUC score computed over all datasets for each algorithm (sorted by mean score):

| algorithm | min | mean | median | max |
|---|---|---|---|---|
| Normalizing Flows | 0.992443 | 0.992443 | 0.992443 | 0.992443 |
| Subsequence LOF | 0.025229 | 0.873463 | 0.971229 | 1.000000 |
| Hybrid Isolation Forest (HIF) | 0.592981 | 0.858942 | 0.856722 | 0.999918 |
| Donut | 0.071872 | 0.814245 | 0.912582 | 0.999993 |
| GrammarViz | 0.001746 | 0.814148 | 0.895212 | 1.000000 |
| DWT-MLEAD | 0.016998 | 0.811523 | 0.887902 | 1.000000 |
| VALMOD | 0.000000 | 0.801048 | 0.955706 | 1.000000 |
| LSTM-AD | 0.013924 | 0.787004 | 0.905740 | 0.999996 |
| PCI | 0.028019 | 0.747642 | 0.797039 | 1.000000 |
| Left STAMPi | 0.092740 | 0.742096 | 0.776048 | 1.000000 |
| Telemanom | 0.000014 | 0.741286 | 0.861103 | 1.000000 |
| Triple ES (Holt-Winter's) | 0.106113 | 0.735233 | 0.738627 | 1.000000 |
| SAND | 0.037324 | 0.734560 | 0.723227 | 1.000000 |
| ARIMA | 0.007541 | 0.731612 | 0.783483 | 0.999905 |
| Random Forest Regressor (RR) | 0.074326 | 0.729402 | 0.737885 | 1.000000 |
| NumentaHTM | 0.290517 | 0.723747 | 0.721098 | 0.999969 |
| Series2Graph | 0.000106 | 0.723652 | 0.770457 | 1.000000 |
| STOMP | 0.000000 | 0.704349 | 0.841000 | 1.000000 |
| KNN | 0.086367 | 0.699461 | 0.661559 | 1.000000 |
| STAMP | 0.000000 | 0.697303 | 0.823992 | 1.000000 |
| Extended Isolation Forest (EIF) | 0.078512 | 0.694858 | 0.664053 | 1.000000 |
| HealthESN | 0.443499 | 0.694434 | 0.636674 | 0.992204 |
| Isolation Forest (iForest) | 0.063202 | 0.693879 | 0.659908 | 1.000000 |
| Subsequence IF | 0.000704 | 0.691789 | 0.720508 | 1.000000 |
| k-Means | 0.000000 | 0.690503 | 0.749782 | 1.000000 |
| HBOS | 0.067572 | 0.690325 | 0.663484 | 1.000000 |
| Spectral Residual (SR) | 0.002497 | 0.689837 | 0.665752 | 1.000000 |
| NormA | 0.000000 | 0.685234 | 0.673790 | 1.000000 |
| MedianMethod | 0.025507 | 0.682457 | 0.652758 | 1.000000 |
| ImageEmbeddingCAE | 0.003642 | 0.682375 | 0.753238 | 0.999921 |
| COPOD | 0.002113 | 0.680206 | 0.647625 | 1.000000 |
| XGBoosting (RR) | 0.078633 | 0.678807 | 0.664087 | 0.999995 |
| EncDec-AD | 0.000937 | 0.674921 | 0.725808 | 0.999701 |
| COF | 0.023289 | 0.674018 | 0.618044 | 1.000000 |
| Random Black Forest (RR) | 0.216832 | 0.670852 | 0.656053 | 0.999992 |
| CBLOF | 0.071530 | 0.668766 | 0.622689 | 1.000000 |
| IF-LOF | 0.018925 | 0.666316 | 0.607835 | 1.000000 |
| LOF | 0.019324 | 0.663431 | 0.586261 | 1.000000 |
| DBStream | 0.045533 | 0.654112 | 0.561812 | 0.999997 |
| DeepAnT | 0.000241 | 0.647727 | 0.726475 | 0.999988 |
| OceanWNN | 0.090600 | 0.640535 | 0.630037 | 0.999524 |
| Torsk | 0.117461 | 0.630693 | 0.576963 | 1.000000 |
| OmniAnomaly | 0.002294 | 0.620901 | 0.673658 | 0.999499 |
| PCC | 0.031566 | 0.600068 | 0.567187 | 1.000000 |
| PST | 0.000000 | 0.597836 | 0.623624 | 1.000000 |
| PhaseSpace-SVM | 0.002430 | 0.586604 | 0.581505 | 0.996924 |
| TSBitmap | 0.055675 | 0.577123 | 0.572539 | 0.999296 |
| HOT SAX | 0.156909 | 0.567419 | 0.500000 | 1.000000 |
| LaserDBN | 0.091204 | 0.560308 | 0.544781 | 0.999941 |
| RobustPCA | 0.000263 | 0.559578 | 0.535274 | 0.999983 |
| SSA | 0.005590 | 0.557453 | 0.514397 | 0.998834 |
| Bagel | 0.000664 | 0.551537 | 0.538822 | 0.981582 |
| DSPOT | 0.248658 | 0.550059 | 0.500000 | 1.000000 |
| TAnoGan | 0.001404 | 0.536777 | 0.511289 | 0.999298 |
| Hybrid KNN | 0.000004 | 0.524415 | 0.505695 | 0.999999 |
| FFT | 0.007053 | 0.523552 | 0.500000 | 1.000000 |
| TARZAN | 0.000162 | 0.516395 | 0.554709 | 0.999991 |
| SR-CNN | 0.134518 | 0.515006 | 0.500000 | 0.955550 |
| S-H-ESD (Twitter) | 0.479870 | 0.512374 | 0.500000 | 0.890020 |
| MultiHMM | 0.374354 | 0.374354 | 0.374354 | 0.374354 |
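
These statistics are a plain group-by aggregation over the result table. A sketch, reusing the hypothetical `df` from above (failed experiments carry NaN scores, which the aggregation ignores):

```python
# min/mean/median/max ROC_AUC per algorithm, sorted by mean score.
stats = (
    df.groupby("algorithm")["ROC_AUC"]
      .agg(["min", "mean", "median", "max"])
      .sort_values("mean", ascending=False)
)
print(stats.head(10))  # best algorithms by mean ROC_AUC
print(stats.tail(10))  # worst algorithms by mean ROC_AUC
```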

The following boxplots give a more visual picture of the score distributions. The algorithms are ordered by their mean ROC_AUC score (the mean itself is not included in the visualization), and only the first and last 10 algorithms are shown by default. Use the legend on the right to display additional algorithms.

Best algorithms (based on mean ROC_AUC)

| algorithm | min | mean | median | max |
|---|---|---|---|---|
| Normalizing Flows | 0.992443 | 0.992443 | 0.992443 | 0.992443 |
| Subsequence LOF | 0.025229 | 0.873463 | 0.971229 | 1.000000 |
| Hybrid Isolation Forest (HIF) | 0.592981 | 0.858942 | 0.856722 | 0.999918 |
| Donut | 0.071872 | 0.814245 | 0.912582 | 0.999993 |
| GrammarViz | 0.001746 | 0.814148 | 0.895212 | 1.000000 |

Worst algorithms (based on mean ROC_AUC)

| algorithm | min | mean | median | max |
|---|---|---|---|---|
| FFT | 0.007053 | 0.523552 | 0.500000 | 1.000000 |
| TARZAN | 0.000162 | 0.516395 | 0.554709 | 0.999991 |
| SR-CNN | 0.134518 | 0.515006 | 0.500000 | 0.955550 |
| S-H-ESD (Twitter) | 0.479870 | 0.512374 | 0.500000 | 0.890020 |
| MultiHMM | 0.374354 | 0.374354 | 0.374354 | 0.374354 |

Scores of best algorithms

In the next figure, we show the anomaly scorings of the 4 best algorithms (excluding HIF and Normalizing Flows, because they are supervised and the selected dataset provides no training data) on the dataset “004_UCR_Anomaly_DISTORTEDBIDMC1”:


ROC_AUC over the number of successfully processed datasets

Similar to the reliability plot in our paper, the next figure shows each algorithm's ROC_AUC score in relation to the relative number of successfully processed datasets:
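
A sketch of how the two quantities behind such a reliability plot can be derived from the result table (hypothetical `df` and string statuses as above):

```python
# Mean ROC_AUC vs. fraction of successfully processed datasets per algorithm.
reliability = df.groupby("algorithm").agg(
    mean_roc_auc=("ROC_AUC", "mean"),  # NaNs of failed runs are skipped
    processed=("status", lambda s: (s == "Status.OK").mean()),
)
print(reliability.sort_values("processed", ascending=False))
```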

Runtime-weighted ROC_AUC scores

In the next figure, we try to combine the runtime and the result quality of the algorithms into one metric by weighting the ROC_AUC score with the inversely scaled overall runtime. Algorithms that take exceptionally long to process the datasets are penalized and receive a smaller weighted ROC_AUC score, while very fast algorithms keep their original ROC_AUC score.
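
One plausible formulation of such a weighting is sketched below; the exact scaling used for the figure may differ (hypothetical `df` as above):

```python
# Combine quality and runtime: weight mean ROC_AUC by inversely scaled runtime.
agg = df.groupby("algorithm").agg(
    roc_auc=("ROC_AUC", "mean"),
    runtime=("execute_main_time", "sum"),  # assumption: execution time only
)
# Min-max-scale the overall runtime to [0, 1]: fast algorithms get a weight
# close to 1 and keep their score; exceptionally slow ones are punished.
scaled = (agg["runtime"] - agg["runtime"].min()) / (
    agg["runtime"].max() - agg["runtime"].min()
)
agg["weighted_roc_auc"] = agg["roc_auc"] * (1 - scaled)
print(agg.sort_values("weighted_roc_auc", ascending=False))
```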

Best algorithm of algorithm family (based on ROC_AUC)

| algo_family | algorithm | ROC_AUC |
|---|---|---|
| trees | Hybrid Isolation Forest (HIF) | 0.858942 |
| reconstruction | Donut | 0.814245 |
| forecasting | LSTM-AD | 0.787004 |
| encoding | GrammarViz | 0.814148 |
| distribution | Normalizing Flows | 0.992443 |
| distance | Subsequence LOF | 0.873463 |

Compared to the GutenTAG datasets, the best algorithm changed for the families trees and distribution. For the other families, the best algorithm on the GutenTAG datasets is also the best algorithm on the Benchmark datasets.

Please note that the results for HIF and Normalizing Flows are not reliable: because they are supervised, they were executed on at most 6 datasets. In addition, Normalizing Flows could process only half of these datasets!
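
Selecting the best algorithm per family is again a small group-by exercise; a sketch assuming the result table contains an `algo_family` column (hypothetical `df` as above):

```python
# Mean ROC_AUC per (family, algorithm), then the best algorithm per family.
mean_scores = df.groupby(["algo_family", "algorithm"])["ROC_AUC"].mean()
best_per_family = mean_scores.loc[mean_scores.groupby("algo_family").idxmax()]
print(best_per_family)
```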

Algorithm quality assessment based on PR_AUC

In the next figure, we show the box plots of the different algorithms computed over the PR_AUC metric instead of ROC_AUC. The algorithms are sorted by their mean PR_AUC over all datasets.

Dataset assessment

In this section, we aggregate the results of our evaluation at the dataset collection level. This gives preliminary insights into the complexity and quality of the datasets.

Dataset error overview

We first want to have a look at the error distribution over the dataset collections: for each collection, we count the number of experiments that failed, were successful, or ran into a timeout. The results are grouped by the dataset training type (supervised if the datasets provide a training time series with labeled anomalies, semi-supervised if they provide a training time series without anomalies, and unsupervised if they provide no training data) and by input dimensionality.
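
The following sketch produces such an overview from the result table; the grouping column names mirror the table headers below and are assumptions (hypothetical `df` as above):

```python
# Status counts per dataset collection, split by input dimensionality.
overview = pd.crosstab(
    index=[df["dataset_input_dimensionality"], df["collection"]],
    columns=df["status"],
    margins=True, margins_name="ALL",
)
print(overview)
```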

Unsupervised:

| dataset_input_dimensionality | collection | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | WebscopeS5 | 789 | 12885 | 6 | 13680 |
| | NAB | 225 | 1860 | 43 | 2128 |
| | MGAB | 29 | 311 | 40 | 380 |
| | NormA | 28 | 299 | 15 | 342 |
| MULTIVARIATE | SVDB | 19 | 184 | 5 | 208 |
| | MITDB | 12 | 36 | 4 | 52 |
| | Genesis | 2 | 10 | 1 | 13 |
| | Daphnet | 1 | 37 | 1 | 39 |
| | CalIt2 | 0 | 13 | 0 | 13 |

Semi-supervised:

| dataset_input_dimensionality | collection | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | KDD-TSAD | 987 | 11644 | 1562 | 14193 |
| | NASA-SMAP | 147 | 1795 | 53 | 1995 |
| | NASA-MSL | 90 | 802 | 20 | 912 |
| MULTIVARIATE | SMD | 136 | 314 | 102 | 552 |

Supervised:

| dataset_input_dimensionality | collection | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | IOPS | 16 | 131 | 17 | 164 |
| MULTIVARIATE | Exathlon | 10 | 20 | 2 | 32 |

In the next figure, we show the relative number of experiments that either failed (ERROR), ran into the time limit (TIMEOUT), ran into the memory limit (OOM), or were successful (OK). The dataset collections are sorted by their percentage of failed (ERROR) experiments.

We can see that the dataset collections MITDB, Exathlon, SMD, and Genesis have a large percentage of experiments that hit the memory or time limit. These collections contain time series that are either very long, very wide, or exhibit complicated patterns.

The dataset collections NAB, NASA-MSL, MITDB, and IOPS have more than 7% failing experiments. The generally high percentage of failing experiments concerns us; it might be due to bad dataset quality.

The next figure shows the percentage of successful experiments aggregated per dataset. We highlight the datasets with a success rate below 80% (on the left side of the plot).
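
The per-dataset success rate behind this figure can be computed as follows (hypothetical `df` and string statuses as above):

```python
# Fraction of successful experiments per dataset; list those below 80%.
success_rate = (
    (df["status"] == "Status.OK")
    .groupby([df["collection"], df["dataset"]])
    .mean()
)
print(success_rate[success_rate < 0.8].sort_values())
```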

Datasets that all algorithms could process

| collection | dataset | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| KDD-TSAD | 162_UCR_Anomaly_WalkingAceleration5 | 0 | 57 | 0 | 57 |
| Daphnet | S09R01E4 | 0 | 13 | 0 | 13 |
| CalIt2 | CalIt2-traffic | 0 | 13 | 0 | 13 |

Most broken datasets

Datasets for which more than 40% of the experiments were unsuccessful (errors or timeouts):

| collection | dataset | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| SMD | machine-1-1 | 7 | 12 | 5 | 24 |
| | machine-1-2 | 6 | 14 | 4 | 24 |
| | machine-1-3 | 6 | 13 | 5 | 24 |
| | machine-1-8 | 6 | 14 | 4 | 24 |
| | machine-2-1 | 6 | 13 | 5 | 24 |
| | machine-2-4 | 6 | 14 | 4 | 24 |
| | machine-2-5 | 6 | 14 | 4 | 24 |
| | machine-2-8 | 6 | 14 | 4 | 24 |
| | machine-2-9 | 6 | 13 | 5 | 24 |
| | machine-3-1 | 6 | 14 | 4 | 24 |
| | machine-3-10 | 6 | 13 | 5 | 24 |
| | machine-3-11 | 6 | 14 | 4 | 24 |
| | machine-3-3 | 6 | 13 | 5 | 24 |
| | machine-3-4 | 6 | 14 | 4 | 24 |
| | machine-3-5 | 6 | 13 | 5 | 24 |
| | machine-3-6 | 6 | 14 | 4 | 24 |
| | machine-3-7 | 6 | 13 | 5 | 24 |
| | machine-3-8 | 6 | 13 | 5 | 24 |
| | machine-3-9 | 6 | 14 | 4 | 24 |
| | machine-2-6 | 5 | 14 | 5 | 24 |
| | machine-2-7 | 5 | 14 | 5 | 24 |
| KDD-TSAD | 108_UCR_Anomaly_NOISEresperation2 | 16 | 31 | 10 | 57 |
| | 187_UCR_Anomaly_resperation2 | 16 | 33 | 8 | 57 |
| | 079_UCR_Anomaly_DISTORTEDresperation2 | 13 | 33 | 11 | 57 |
| | 239_UCR_Anomaly_taichidbS0715Master | 11 | 32 | 14 | 57 |
| | 240_UCR_Anomaly_taichidbS0715Master | 11 | 32 | 14 | 57 |
| | 218_UCR_Anomaly_STAFFIIIDatabase | 10 | 30 | 17 | 57 |
| | 220_UCR_Anomaly_STAFFIIIDatabase | 8 | 34 | 15 | 57 |
| | 246_UCR_Anomaly_tilt12755mtable | 8 | 32 | 17 | 57 |
| | 078_UCR_Anomaly_DISTORTEDresperation1 | 7 | 33 | 17 | 57 |
| | 213_UCR_Anomaly_STAFFIIIDatabase | 7 | 33 | 17 | 57 |
| | 216_UCR_Anomaly_STAFFIIIDatabase | 7 | 34 | 16 | 57 |
| | 244_UCR_Anomaly_tilt12754table | 7 | 30 | 20 | 57 |
| | 245_UCR_Anomaly_tilt12754table | 7 | 30 | 20 | 57 |
| | 219_UCR_Anomaly_STAFFIIIDatabase | 6 | 33 | 18 | 57 |
| | 242_UCR_Anomaly_tilt12744mtable | 6 | 33 | 18 | 57 |

Dataset quality assessment based on ROC_AUC

The next figure shows the ROC_AUC score box plots per dataset. The datasets are sorted by their median ROC_AUC score.

Note that the number of experiments differs for each dataset based on its training type and input dimensionality!

In the next figure, you can see the dataset with the worst median ROC_AUC and a selection of algorithm scores (DWT-MLEAD, STOMP, Series2Graph, and Subsequence LOF):