Experiment result analysis on the Benchmark datasets
On this website, we present detailed results of the experiments on the Benchmark datasets that we collected from various sources (see Datasets Overview). We show errors, qualitative results, and the runtimes of the different algorithms.
Result Overview
In this analysis, we look only at the results of all 60 relevant algorithms, each with its best parameter configuration, on the Benchmark datasets (excluding our synthetically generated GutenTAG datasets):
- Experiments: 34703
- Algorithms: 60
- Datasets: 789 in 15 collections
The number of experiments is smaller than \(\text{# Algos} \times \text{# Datasets}\) because univariate algorithms cannot process multivariate datasets and (semi-)supervised algorithms can process only (semi-)supervised datasets. Non-compliant combinations are excluded.
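The exclusion rule can be sketched in pandas. This is a minimal illustration; the column names, the toy metadata, and the exact compatibility predicate below are assumptions for this sketch, not TimeEval's actual schema:

```python
import pandas as pd

# Hypothetical algorithm and dataset metadata (illustrative values only).
algos = pd.DataFrame({
    "algorithm": ["ARIMA", "LSTM-AD", "MultiHMM"],
    "input_dimensionality": ["UNIVARIATE", "MULTIVARIATE", "MULTIVARIATE"],
    "training_type": ["UNSUPERVISED", "SEMI_SUPERVISED", "SUPERVISED"],
})
datasets = pd.DataFrame({
    "dataset": ["d1", "d2"],
    "input_dimensionality": ["UNIVARIATE", "MULTIVARIATE"],
    "training_type": ["UNSUPERVISED", "SEMI_SUPERVISED"],
})

pairs = algos.merge(datasets, how="cross", suffixes=("_algo", "_data"))
# Univariate algorithms cannot process multivariate datasets ...
dim_ok = ~((pairs["input_dimensionality_algo"] == "UNIVARIATE")
           & (pairs["input_dimensionality_data"] == "MULTIVARIATE"))
# ... and (semi-)supervised algorithms need a dataset of matching training type,
# while unsupervised algorithms can run on any dataset.
train_ok = ((pairs["training_type_algo"] == "UNSUPERVISED")
            | (pairs["training_type_algo"] == pairs["training_type_data"]))
compatible = pairs[dim_ok & train_ok]
```

On this toy metadata, only the compliant pairs (ARIMA on d1, LSTM-AD on d2) survive the filter.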
The next table shows an excerpt of the result table with 10 of 26 columns. The complete table with the (quality and runtime) results of all algorithms on all datasets can be downloaded here.
 | algorithm | collection | dataset | status | ROC_AUC | AVERAGE_PRECISION | PR_AUC | RANGE_PR_AUC | execute_main_time | hyper_params
---|---|---|---|---|---|---|---|---|---|---
10428 | ARIMA | IOPS | 4d2af31a-9916-3d9f-8a8e-8a268a48c095 | Status.TIMEOUT | NaN | NaN | NaN | NaN | NaN | {"differencing_degree": 1, "distance_metric": ... |
10429 | ARIMA | IOPS | 42d6616d-c9c5-370a-a8ba-17ead74f3114 | Status.TIMEOUT | NaN | NaN | NaN | NaN | NaN | {"differencing_degree": 1, "distance_metric": ... |
10430 | ARIMA | IOPS | c02607e8-7399-3dde-9d28-8a8da5e5d251 | Status.OK | 0.838021 | 0.320436 | 0.522335 | 0.166973 | 1553.363329 | {"differencing_degree": 1, "distance_metric": ... |
10431 | ARIMA | IOPS | 301c70d8-1630-35ac-8f96-bc1b6f4359ea | Status.OK | 0.644397 | 0.086475 | 0.076127 | 0.072827 | 578.490987 | {"differencing_degree": 1, "distance_metric": ... |
10432 | ARIMA | KDD-TSAD | 001_UCR_Anomaly_DISTORTED1sddb40 | Status.OK | 0.162272 | 0.006566 | 0.004535 | 0.004952 | 2914.872097 | {"differencing_degree": 1, "distance_metric": ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45126 | VALMOD | NormA | Discords_marotta_valve_tek_17 | Status.OK | 0.536991 | 0.025921 | 0.025703 | 0.040605 | 26.353962 | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
45127 | VALMOD | NormA | Discords_patient_respiration1 | Status.OK | 0.819914 | 0.223165 | 0.222676 | 0.159304 | 24.373522 | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
45128 | VALMOD | NormA | SinusRW_Length_104000_AnomalyL_200_AnomalyN_20... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
45129 | VALMOD | NormA | SinusRW_Length_106000_AnomalyL_100_AnomalyN_60... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
45130 | VALMOD | NormA | SinusRW_Length_108000_AnomalyL_200_AnomalyN_40... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
34703 rows × 10 columns
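The overview counts above can be reproduced from the downloadable result table. Here is a minimal sketch on a toy stand-in DataFrame; the column names follow the excerpt above, but the data is invented for illustration:

```python
import pandas as pd

# Toy stand-in for the downloadable result table (real table: 34703 rows, 26 columns).
df = pd.DataFrame({
    "algorithm":  ["ARIMA", "ARIMA", "VALMOD", "VALMOD"],
    "collection": ["IOPS", "KDD-TSAD", "NormA", "NormA"],
    "dataset":    ["a", "b", "c", "d"],
    "status":     ["Status.TIMEOUT", "Status.OK", "Status.OK", "Status.ERROR"],
})
n_experiments = len(df)
n_algorithms = df["algorithm"].nunique()
# Dataset names are only unique within a collection, so count (collection, dataset) pairs.
n_datasets = df.groupby(["collection", "dataset"]).ngroups
n_collections = df["collection"].nunique()
```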
Error analysis
We first want to look at the ability of the algorithms to process the different datasets. Some algorithms are limited by our time and memory constraints, while others produce errors when their assumptions are violated or implementation deficits surface.
Algorithm problems grouped by algorithm training type
Unsupervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
algo_input_dimensionality | algorithm | ||||
UNIVARIATE | S-H-ESD (Twitter) | 441 | 298 | 0 | 739 |
SAND | 186 | 505 | 48 | 739 | |
VALMOD | 168 | 559 | 12 | 739 | |
Triple ES (Holt-Winter's) | 79 | 524 | 136 | 739 | |
Series2Graph | 45 | 694 | 0 | 739 | |
NormA | 39 | 622 | 78 | 739 | |
PST | 36 | 703 | 0 | 739 | |
HOT SAX | 18 | 554 | 167 | 739 | |
Left STAMPi | 12 | 712 | 15 | 739 | |
ARIMA | 4 | 676 | 59 | 739 | |
PhaseSpace-SVM | 2 | 626 | 111 | 739 | |
MedianMethod | 1 | 738 | 0 | 739 | |
NumentaHTM | 1 | 735 | 3 | 739 | |
DSPOT | 0 | 686 | 53 | 739 | |
DWT-MLEAD | 0 | 739 | 0 | 739 | |
FFT | 0 | 739 | 0 | 739 | |
GrammarViz | 0 | 713 | 26 | 739 | |
PCI | 0 | 739 | 0 | 739 | |
SSA | 0 | 734 | 5 | 739 | |
STAMP | 0 | 701 | 38 | 739 | |
STOMP | 0 | 725 | 14 | 739 | |
Spectral Residual (SR) | 0 | 739 | 0 | 739 | |
Subsequence IF | 0 | 739 | 0 | 739 | |
Subsequence LOF | 0 | 721 | 18 | 739 | |
TSBitmap | 0 | 739 | 0 | 739 | |
MULTIVARIATE | DBStream | 627 | 161 | 1 | 789 |
COF | 239 | 550 | 0 | 789 | |
k-Means | 47 | 742 | 0 | 789 | |
IF-LOF | 25 | 762 | 2 | 789 | |
CBLOF | 3 | 786 | 0 | 789 | |
Torsk | 2 | 721 | 66 | 789 | |
COPOD | 0 | 789 | 0 | 789 | |
Extended Isolation Forest (EIF) | 0 | 789 | 0 | 789 | |
HBOS | 0 | 789 | 0 | 789 | |
Isolation Forest (iForest) | 0 | 789 | 0 | 789 | |
KNN | 0 | 789 | 0 | 789 | |
LOF | 0 | 789 | 0 | 789 | |
PCC | 0 | 789 | 0 | 789 |
Semi-supervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
algo_input_dimensionality | algorithm | ||||
UNIVARIATE | TARZAN | 50 | 250 | 0 | 300 |
OceanWNN | 45 | 255 | 0 | 300 | |
Donut | 13 | 283 | 4 | 300 | |
Bagel | 9 | 205 | 86 | 300 | |
ImageEmbeddingCAE | 6 | 294 | 0 | 300 | |
SR-CNN | 6 | 191 | 103 | 300 | |
XGBoosting (RR) | 2 | 298 | 0 | 300 | |
Random Forest Regressor (RR) | 0 | 246 | 54 | 300 | |
MULTIVARIATE | LSTM-AD | 173 | 85 | 65 | 323 |
EncDec-AD | 110 | 66 | 147 | 323 | |
OmniAnomaly | 27 | 296 | 0 | 323 | |
Random Black Forest (RR) | 23 | 281 | 19 | 323 | |
LaserDBN | 16 | 285 | 22 | 323 | |
DeepAnT | 14 | 309 | 0 | 323 | |
Hybrid KNN | 9 | 314 | 0 | 323 | |
TAnoGan | 4 | 103 | 216 | 323 | |
HealthESN | 2 | 23 | 298 | 323 | |
RobustPCA | 0 | 323 | 0 | 323 | |
Telemanom | 0 | 322 | 1 | 323 |
Supervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
algo_input_dimensionality | algorithm | ||||
MULTIVARIATE | MultiHMM | 5 | 1 | 0 | 6 |
Normalizing Flows | 2 | 1 | 3 | 6 | |
Hybrid Isolation Forest (HIF) | 0 | 5 | 1 | 6 |
As the above table shows, some algorithms are severely impacted by our time limit of 2 hours: they hit the time limit for a majority of the datasets. This applies not only to multivariate algorithms but also to univariate ones. In addition, many algorithms run into errors. The error counts in the table include memory errors caused by algorithms exceeding our memory limit of 3 GB. We highlight some outlier algorithms in the next sections.
In general, 87.43% of all experiments were successful, 5.39% were timeouts, and 7.18% were errors.
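These shares follow directly from the status column of the result table. A minimal sketch on a scaled-down toy Series (the real table has 34703 rows; the proportions below are rounded stand-ins):

```python
import pandas as pd

# Toy status column roughly mirroring the reported shares.
status = pd.Series(["Status.OK"] * 87 + ["Status.TIMEOUT"] * 5 + ["Status.ERROR"] * 8)
shares = status.value_counts(normalize=True).mul(100).round(2)  # percentages per status
```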
Very slow algorithms
Algorithms for which at least 50% of all executions ran into the time limit:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | ||
---|---|---|---|---|---|---|
algo_training_type | algo_input_dimensionality | algorithm | ||||
SEMI_SUPERVISED | MULTIVARIATE | HealthESN | 2 | 23 | 298 | 323 |
TAnoGan | 4 | 103 | 216 | 323 | ||
SUPERVISED | MULTIVARIATE | Normalizing Flows | 2 | 1 | 3 | 6 |
HealthESN and TAnoGan are the two algorithms that hit the time limit the most. Both algorithms are semi-supervised and require a training step. We used time limits of 2 hours for the training step and 2 hours for the testing step.
Normalizing Flows is a supervised algorithm that also hit the time limit for half of the datasets.
HealthESN, TAnoGan, and Normalizing Flows are also the algorithms with the most timeouts for our GutenTAG datasets (cf. GutenTAG result analysis).
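The "at least 50% timeouts" selection can be expressed as a per-algorithm share of TIMEOUT statuses. A sketch with invented data (the real analysis additionally groups by training type and input dimensionality):

```python
import pandas as pd

# Illustrative per-experiment status log (toy values).
df = pd.DataFrame({
    "algorithm": ["HealthESN"] * 4 + ["ARIMA"] * 4,
    "status": ["Status.TIMEOUT"] * 3 + ["Status.OK"]
              + ["Status.OK"] * 3 + ["Status.TIMEOUT"],
})
# Fraction of executions per algorithm that hit the time limit.
timeout_share = df["status"].eq("Status.TIMEOUT").groupby(df["algorithm"]).mean()
slow = sorted(timeout_share[timeout_share >= 0.5].index)
```

The same pattern with `Status.ERROR` instead of `Status.TIMEOUT` yields the "broken algorithms" table in the next section.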
Broken algorithms
Algorithms that failed for at least 50% of their executions:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | ||
---|---|---|---|---|---|---|
algo_training_type | algo_input_dimensionality | algorithm | ||||
SEMI_SUPERVISED | MULTIVARIATE | LSTM-AD | 173 | 85 | 65 | 323 |
SUPERVISED | MULTIVARIATE | MultiHMM | 5 | 1 | 0 | 6 |
UNSUPERVISED | MULTIVARIATE | DBStream | 627 | 161 | 1 | 789 |
UNIVARIATE | S-H-ESD (Twitter) | 441 | 298 | 0 | 739 |
Similar to the GutenTAG datasets, LSTM-AD, MultiHMM, and DBStream are the algorithms with the most errors (cf. GutenTAG result analysis). In addition to those three algorithms, S-H-ESD struggles considerably with the benchmark datasets and produces errors for about 60% of its experiments (compared to 0% on the GutenTAG datasets).
To better understand why algorithms fail, we distinguish between different error categories in the next section.
Categorization of errors
We categorize all observed errors into specific categories and then count the number of executions whose errors belong to each category. The next table shows how often each error category was observed for each algorithm.
algorithm | ALL (sum) | ARIMA | Bagel | CBLOF | COF | COPOD | DBStream | DSPOT | DWT-MLEAD | DeepAnT | Donut | EncDec-AD | Extended Isolation Forest (EIF) | FFT | GrammarViz | HBOS | HOT SAX | HealthESN | Hybrid Isolation Forest (HIF) | Hybrid KNN | IF-LOF | ImageEmbeddingCAE | Isolation Forest (iForest) | KNN | LOF | LSTM-AD | LaserDBN | Left STAMPi | MedianMethod | MultiHMM | NormA | Normalizing Flows | NumentaHTM | OceanWNN | OmniAnomaly | PCC | PCI | PST | PhaseSpace-SVM | Random Black Forest (RR) | Random Forest Regressor (RR) | RobustPCA | S-H-ESD (Twitter) | SAND | SR-CNN | SSA | STAMP | STOMP | Series2Graph | Spectral Residual (SR) | Subsequence IF | Subsequence LOF | TARZAN | TAnoGan | TSBitmap | Telemanom | Torsk | Triple ES (Holt-Winter's) | VALMOD | XGBoosting (RR) | k-Means |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
error_category | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
- OK - | 30341 | 676 | 205 | 786 | 550 | 789 | 161 | 686 | 739 | 309 | 283 | 66 | 789 | 739 | 713 | 789 | 554 | 23 | 5 | 314 | 762 | 294 | 789 | 789 | 789 | 85 | 285 | 712 | 738 | 1 | 622 | 1 | 735 | 255 | 296 | 789 | 739 | 703 | 626 | 281 | 246 | 323 | 298 | 505 | 191 | 734 | 701 | 725 | 694 | 739 | 739 | 721 | 250 | 103 | 739 | 322 | 721 | 524 | 559 | 298 | 742 |
- OOM - | 754 | 239 | 24 | 3 | 93 | 9 | 2 | 3 | 157 | 1 | 13 | 2 | 20 | 35 | 23 | 11 | 79 | 2 | 38 | ||||||||||||||||||||||||||||||||||||||||||
- TIMEOUT - | 1871 | 59 | 86 | 1 | 53 | 4 | 147 | 26 | 167 | 298 | 1 | 2 | 65 | 22 | 15 | 78 | 3 | 3 | 111 | 19 | 54 | 48 | 103 | 5 | 38 | 14 | 18 | 216 | 1 | 66 | 136 | 12 | |||||||||||||||||||||||||||||
Bug | 847 | 9 | 377 | 11 | 10 | 17 | 9 | 6 | 1 | 6 | 16 | 15 | 9 | 1 | 43 | 1 | 165 | 31 | 20 | 2 | 89 | 9 | |||||||||||||||||||||||||||||||||||||||
Incompatible parameters | 648 | 216 | 2 | 430 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Invariance/assumption not met | 122 | 10 | 11 | 6 | 12 | 4 | 79 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
LinAlgError | 21 | 2 | 17 | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Max recursion depth exceeded | 30 | 30 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Model loading error | 5 | 5 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Not converged | 8 | 3 | 5 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TimeEval:IndexError | 3 | 3 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Wrong shape error | 13 | 1 | 2 | 10 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
other | 16 | 1 | 10 | 1 | 2 | 2 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
unexpected Inf or NaN | 24 | 1 | 23 |
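A table of this shape is a cross-tabulation of error category against algorithm. A minimal sketch with invented categorized errors (how raw error messages are mapped to categories is not shown here):

```python
import pandas as pd

# Hypothetical per-experiment error categories (toy values).
df = pd.DataFrame({
    "algorithm":      ["MultiHMM", "MultiHMM", "DBStream", "LSTM-AD", "LSTM-AD"],
    "error_category": ["Not converged", "- OK -", "Bug", "- OOM -", "- OK -"],
})
table = pd.crosstab(df["error_category"], df["algorithm"])
table["ALL (sum)"] = table.sum(axis=1)  # row totals, as in the first table column
```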
LSTM-AD's errors are dominated by out-of-memory failures (OOM), MultiHMM does not converge, DBStream has implementation errors that we could not fix, and S-H-ESD makes assumptions that the datasets do not meet. The errors of S-H-ESD are categorized as Incompatible parameters, which is not entirely accurate: S-H-ESD expects a timestamp index for the time series, and the data should span multiple days or weeks so that seasonality can be removed. We use a heuristic to convert incremental indices to timestamps and to set the correct time span. If the heuristic cannot adapt a dataset to the S-H-ESD assumptions, it raises an exception, which we record as an Incompatible parameters error.
Algorithm quality assessment based on ROC_AUC
The next table shows the min, mean, median, and max ROC_AUC metric score computed over all datasets for each algorithm:
algorithm | Normalizing Flows | Subsequence LOF | Hybrid Isolation Forest (HIF) | Donut | GrammarViz | DWT-MLEAD | VALMOD | LSTM-AD | PCI | Left STAMPi | Telemanom | Triple ES (Holt-Winter's) | SAND | ARIMA | Random Forest Regressor (RR) | NumentaHTM | Series2Graph | STOMP | KNN | STAMP | Extended Isolation Forest (EIF) | HealthESN | Isolation Forest (iForest) | Subsequence IF | k-Means | HBOS | Spectral Residual (SR) | NormA | MedianMethod | ImageEmbeddingCAE | COPOD | XGBoosting (RR) | EncDec-AD | COF | Random Black Forest (RR) | CBLOF | IF-LOF | LOF | DBStream | DeepAnT | OceanWNN | Torsk | OmniAnomaly | PCC | PST | PhaseSpace-SVM | TSBitmap | HOT SAX | LaserDBN | RobustPCA | SSA | Bagel | DSPOT | TAnoGan | Hybrid KNN | FFT | TARZAN | SR-CNN | S-H-ESD (Twitter) | MultiHMM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | 0.992443 | 0.025229 | 0.592981 | 0.071872 | 0.001746 | 0.016998 | 0.000000 | 0.013924 | 0.028019 | 0.092740 | 0.000014 | 0.106113 | 0.037324 | 0.007541 | 0.074326 | 0.290517 | 0.000106 | 0.000000 | 0.086367 | 0.000000 | 0.078512 | 0.443499 | 0.063202 | 0.000704 | 0.000000 | 0.067572 | 0.002497 | 0.000000 | 0.025507 | 0.003642 | 0.002113 | 0.078633 | 0.000937 | 0.023289 | 0.216832 | 0.071530 | 0.018925 | 0.019324 | 0.045533 | 0.000241 | 0.090600 | 0.117461 | 0.002294 | 0.031566 | 0.000000 | 0.002430 | 0.055675 | 0.156909 | 0.091204 | 0.000263 | 0.005590 | 0.000664 | 0.248658 | 0.001404 | 0.000004 | 0.007053 | 0.000162 | 0.134518 | 0.479870 | 0.374354 |
mean | 0.992443 | 0.873463 | 0.858942 | 0.814245 | 0.814148 | 0.811523 | 0.801048 | 0.787004 | 0.747642 | 0.742096 | 0.741286 | 0.735233 | 0.734560 | 0.731612 | 0.729402 | 0.723747 | 0.723652 | 0.704349 | 0.699461 | 0.697303 | 0.694858 | 0.694434 | 0.693879 | 0.691789 | 0.690503 | 0.690325 | 0.689837 | 0.685234 | 0.682457 | 0.682375 | 0.680206 | 0.678807 | 0.674921 | 0.674018 | 0.670852 | 0.668766 | 0.666316 | 0.663431 | 0.654112 | 0.647727 | 0.640535 | 0.630693 | 0.620901 | 0.600068 | 0.597836 | 0.586604 | 0.577123 | 0.567419 | 0.560308 | 0.559578 | 0.557453 | 0.551537 | 0.550059 | 0.536777 | 0.524415 | 0.523552 | 0.516395 | 0.515006 | 0.512374 | 0.374354 |
median | 0.992443 | 0.971229 | 0.856722 | 0.912582 | 0.895212 | 0.887902 | 0.955706 | 0.905740 | 0.797039 | 0.776048 | 0.861103 | 0.738627 | 0.723227 | 0.783483 | 0.737885 | 0.721098 | 0.770457 | 0.841000 | 0.661559 | 0.823992 | 0.664053 | 0.636674 | 0.659908 | 0.720508 | 0.749782 | 0.663484 | 0.665752 | 0.673790 | 0.652758 | 0.753238 | 0.647625 | 0.664087 | 0.725808 | 0.618044 | 0.656053 | 0.622689 | 0.607835 | 0.586261 | 0.561812 | 0.726475 | 0.630037 | 0.576963 | 0.673658 | 0.567187 | 0.623624 | 0.581505 | 0.572539 | 0.500000 | 0.544781 | 0.535274 | 0.514397 | 0.538822 | 0.500000 | 0.511289 | 0.505695 | 0.500000 | 0.554709 | 0.500000 | 0.500000 | 0.374354 |
max | 0.992443 | 1.000000 | 0.999918 | 0.999993 | 1.000000 | 1.000000 | 1.000000 | 0.999996 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999905 | 1.000000 | 0.999969 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.992204 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999921 | 1.000000 | 0.999995 | 0.999701 | 1.000000 | 0.999992 | 1.000000 | 1.000000 | 1.000000 | 0.999997 | 0.999988 | 0.999524 | 1.000000 | 0.999499 | 1.000000 | 1.000000 | 0.996924 | 0.999296 | 1.000000 | 0.999941 | 0.999983 | 0.998834 | 0.981582 | 1.000000 | 0.999298 | 0.999999 | 1.000000 | 0.999991 | 0.955550 | 0.890020 | 0.374354 |
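The summary statistics above are a per-algorithm aggregation of the ROC_AUC column over all successful experiments. A sketch with toy scores:

```python
import pandas as pd

# Toy ROC_AUC scores for two algorithms (illustrative values).
df = pd.DataFrame({
    "algorithm": ["A"] * 3 + ["B"] * 3,
    "ROC_AUC":   [0.2, 0.6, 1.0, 0.4, 0.5, 0.9],
})
# Transpose so statistics become rows and algorithms columns, as in the table above.
summary = df.groupby("algorithm")["ROC_AUC"].agg(["min", "mean", "median", "max"]).T
```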
The following boxplots give a more visual picture of the score distributions. The algorithms are ordered by their mean ROC_AUC score (not included in the visualization) and the first and last 10 algorithms are shown by default. Use the legend on the right to display additional algorithms.
Best algorithms (based on mean ROC_AUC)
min | mean | median | max | |
---|---|---|---|---|
algorithm | ||||
Normalizing Flows | 0.992443 | 0.992443 | 0.992443 | 0.992443 |
Subsequence LOF | 0.025229 | 0.873463 | 0.971229 | 1.000000 |
Hybrid Isolation Forest (HIF) | 0.592981 | 0.858942 | 0.856722 | 0.999918 |
Donut | 0.071872 | 0.814245 | 0.912582 | 0.999993 |
GrammarViz | 0.001746 | 0.814148 | 0.895212 | 1.000000 |
Worst algorithms (based on mean ROC_AUC)
min | mean | median | max | |
---|---|---|---|---|
algorithm | ||||
FFT | 0.007053 | 0.523552 | 0.500000 | 1.000000 |
TARZAN | 0.000162 | 0.516395 | 0.554709 | 0.999991 |
SR-CNN | 0.134518 | 0.515006 | 0.500000 | 0.955550 |
S-H-ESD (Twitter) | 0.479870 | 0.512374 | 0.500000 | 0.890020 |
MultiHMM | 0.374354 | 0.374354 | 0.374354 | 0.374354 |
Scores of best algorithms
In the next figure, we show the scorings of the 4 best algorithms (excluding HIF and Normalizing Flows, because they are supervised and our selected dataset is not) on the dataset “004_UCR_Anomaly_DISTORTEDBIDMC1”:
ROC_AUC over the number of successfully processed datasets
Similar to the reliability plot in our paper, the next figure shows each algorithm's ROC_AUC score in relation to the relative number of successfully processed datasets:
Runtime-weighted ROC_AUC scores
In the next figure, we combine the runtime and result quality of the algorithms into a single metric by weighting the ROC_AUC score with the inversely scaled overall runtime. Algorithms that take exceptionally long to process the datasets are penalized and receive a smaller weighted ROC_AUC score, while very fast algorithms keep their original ROC_AUC score.
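One plausible implementation of such a weighting is to min-max-scale the overall runtime and multiply each ROC_AUC score by one minus the scaled value. The exact scaling function used for the figure, the column names, and the toy values are assumptions for this sketch:

```python
import pandas as pd

# Assumed schema; "overall_runtime" stands in for total train + execute time in seconds.
df = pd.DataFrame({
    "algorithm":       ["fast", "medium", "slow"],
    "ROC_AUC":         [0.8, 0.8, 0.8],
    "overall_runtime": [10.0, 505.0, 1000.0],
})
r = df["overall_runtime"]
scaled = (r - r.min()) / (r.max() - r.min())           # min-max scale to [0, 1]
df["weighted_ROC_AUC"] = df["ROC_AUC"] * (1 - scaled)  # fastest algorithm keeps its score
```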
Best algorithm of algorithm family (based on ROC_AUC)
algorithm | ROC_AUC | |
---|---|---|
algo_family | ||
trees | Hybrid Isolation Forest (HIF) | 0.858942 |
reconstruction | Donut | 0.814245 |
forecasting | LSTM-AD | 0.787004 |
encoding | GrammarViz | 0.814148 |
distribution | Normalizing Flows | 0.992443 |
distance | Subsequence LOF | 0.873463 |
Compared to the GutenTAG datasets, the best algorithm changed for the families trees and distribution. For all other families, the best algorithm on the GutenTAG datasets is also the best algorithm on the benchmark datasets.
Please note that the results for HIF and Normalizing Flows are not reliable, because they are supervised and were executed on at most 6 datasets. In addition, Normalizing Flows could process only half of those datasets.
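Selecting the best algorithm per family amounts to a groupby with an argmax over the mean ROC_AUC scores. A sketch with toy rows (the family labels and scores below are taken from the table above for illustration):

```python
import pandas as pd

# One row per algorithm with its mean ROC_AUC (illustrative subset).
df = pd.DataFrame({
    "algo_family": ["trees", "trees", "distance", "distance"],
    "algorithm":   ["HIF", "iForest", "Subsequence LOF", "KNN"],
    "ROC_AUC":     [0.858942, 0.693879, 0.873463, 0.699461],
})
# idxmax returns the row label of the maximum score within each family.
best = df.loc[df.groupby("algo_family")["ROC_AUC"].idxmax()]
```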
Algorithm quality assessment based on PR_AUC
In the next figure, we show the box plots of the different algorithms computed over the PR_AUC metric instead of ROC_AUC. The algorithms are sorted by their mean PR_AUC over all datasets.
Dataset assessment
In this section, we quickly aggregate the results of our evaluation on dataset collection level. This gives some preliminary insights into the dataset complexity and quality.
Dataset error overview
We first want to have a look at the error distribution over the dataset collections. For each collection, we count the number of experiments that failed, were successful, or ran into a timeout. The results are grouped by the dataset training type (whether a training time series with labeled anomalies (supervised), a training time series without labels (semi-supervised), or no training time series at all (unsupervised) is provided) and by dimensionality.
Unsupervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
dataset_input_dimensionality | collection | ||||
UNIVARIATE | WebscopeS5 | 789 | 12885 | 6 | 13680 |
NAB | 225 | 1860 | 43 | 2128 | |
MGAB | 29 | 311 | 40 | 380 | |
NormA | 28 | 299 | 15 | 342 | |
MULTIVARIATE | SVDB | 19 | 184 | 5 | 208 |
MITDB | 12 | 36 | 4 | 52 | |
Genesis | 2 | 10 | 1 | 13 | |
Daphnet | 1 | 37 | 1 | 39 | |
CalIt2 | 0 | 13 | 0 | 13 |
Semi-supervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
dataset_input_dimensionality | collection | ||||
UNIVARIATE | KDD-TSAD | 987 | 11644 | 1562 | 14193 |
NASA-SMAP | 147 | 1795 | 53 | 1995 | |
NASA-MSL | 90 | 802 | 20 | 912 | |
MULTIVARIATE | SMD | 136 | 314 | 102 | 552 |
Supervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
dataset_input_dimensionality | collection | ||||
UNIVARIATE | IOPS | 16 | 131 | 17 | 164 |
MULTIVARIATE | Exathlon | 10 | 20 | 2 | 32 |
In the next figure, we show the relative number of experiments that either failed (ERROR), ran into the time limit (TIMEOUT), ran into the memory limit (OOM), or were successful (OK). The dataset collections are sorted by their percentage of failed (ERROR) experiments.
We can see that the dataset collections MITDB, Exathlon, SMD, and Genesis have a large percentage of experiments that hit the memory or time limits. Those datasets are either very long, very wide, or contain complicated time series patterns.
The dataset collections NAB, NASA-MSL, MITDB, and IOPS have more than 7% failing experiments. The generally high percentage of failing experiments concerns us; it might be due to bad dataset quality.
The next figure shows the percentage of successful experiments aggregated per dataset. We highlight the datasets with a success rate below 80% (on the left side of the plot).
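The per-dataset success rate behind this figure can be computed as the share of OK statuses per dataset. A sketch with invented statuses:

```python
import pandas as pd

# Toy per-experiment statuses; the real data aggregates all algorithms run on each dataset.
df = pd.DataFrame({
    "dataset": ["d1"] * 4 + ["d2"] * 4,
    "status":  ["Status.OK"] * 4
               + ["Status.OK", "Status.ERROR", "Status.ERROR", "Status.TIMEOUT"],
})
success_rate = df["status"].eq("Status.OK").groupby(df["dataset"]).mean()
flagged = sorted(success_rate[success_rate < 0.8].index)  # datasets below the 80% threshold
```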
Datasets that all algorithms could process
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
collection | dataset | ||||
KDD-TSAD | 162_UCR_Anomaly_WalkingAceleration5 | 0 | 57 | 0 | 57 |
Daphnet | S09R01E4 | 0 | 13 | 0 | 13 |
CalIt2 | CalIt2-traffic | 0 | 13 | 0 | 13 |
Most broken datasets
Datasets for which more than 40% of the experiments failed:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
collection | dataset | ||||
SMD | machine-1-1 | 7 | 12 | 5 | 24 |
machine-1-2 | 6 | 14 | 4 | 24 | |
machine-1-3 | 6 | 13 | 5 | 24 | |
machine-1-8 | 6 | 14 | 4 | 24 | |
machine-2-1 | 6 | 13 | 5 | 24 | |
machine-2-4 | 6 | 14 | 4 | 24 | |
machine-2-5 | 6 | 14 | 4 | 24 | |
machine-2-8 | 6 | 14 | 4 | 24 | |
machine-2-9 | 6 | 13 | 5 | 24 | |
machine-3-1 | 6 | 14 | 4 | 24 | |
machine-3-10 | 6 | 13 | 5 | 24 | |
machine-3-11 | 6 | 14 | 4 | 24 | |
machine-3-3 | 6 | 13 | 5 | 24 | |
machine-3-4 | 6 | 14 | 4 | 24 | |
machine-3-5 | 6 | 13 | 5 | 24 | |
machine-3-6 | 6 | 14 | 4 | 24 | |
machine-3-7 | 6 | 13 | 5 | 24 | |
machine-3-8 | 6 | 13 | 5 | 24 | |
machine-3-9 | 6 | 14 | 4 | 24 | |
machine-2-6 | 5 | 14 | 5 | 24 | |
machine-2-7 | 5 | 14 | 5 | 24 | |
KDD-TSAD | 108_UCR_Anomaly_NOISEresperation2 | 16 | 31 | 10 | 57 |
187_UCR_Anomaly_resperation2 | 16 | 33 | 8 | 57 | |
079_UCR_Anomaly_DISTORTEDresperation2 | 13 | 33 | 11 | 57 | |
239_UCR_Anomaly_taichidbS0715Master | 11 | 32 | 14 | 57 | |
240_UCR_Anomaly_taichidbS0715Master | 11 | 32 | 14 | 57 | |
218_UCR_Anomaly_STAFFIIIDatabase | 10 | 30 | 17 | 57 | |
220_UCR_Anomaly_STAFFIIIDatabase | 8 | 34 | 15 | 57 | |
246_UCR_Anomaly_tilt12755mtable | 8 | 32 | 17 | 57 | |
078_UCR_Anomaly_DISTORTEDresperation1 | 7 | 33 | 17 | 57 | |
213_UCR_Anomaly_STAFFIIIDatabase | 7 | 33 | 17 | 57 | |
216_UCR_Anomaly_STAFFIIIDatabase | 7 | 34 | 16 | 57 | |
244_UCR_Anomaly_tilt12754table | 7 | 30 | 20 | 57 | |
245_UCR_Anomaly_tilt12754table | 7 | 30 | 20 | 57 | |
219_UCR_Anomaly_STAFFIIIDatabase | 6 | 33 | 18 | 57 | |
242_UCR_Anomaly_tilt12744mtable | 6 | 33 | 18 | 57 |
Dataset quality assessment based on ROC_AUC
The next figure shows the ROC_AUC score box plots per dataset. The datasets are sorted by their median ROC_AUC score.
Note that the number of experiments differs for each dataset based on its training type and input dimensionality!
In the next figure, you can see the dataset with the worst median ROC_AUC and a selection of algorithm scores (DWT-MLEAD, STOMP, Series2Graph, and Subsequence LOF):