Experiment result analysis on the Benchmark datasets

On this website, we present detailed results of the experiments on the Benchmark datasets that we collected from various sources (see the Datasets Overview). We show the errors, qualitative results, and runtimes of the different algorithms.

Result Overview

In this analysis, we look at the results of all 60 relevant algorithms with their best parameter configuration on the Benchmark datasets (excluding our synthetically generated GutenTAG datasets).

The number of experiments is smaller than \(\text{# Algos} \times \text{# Datasets}\) because univariate algorithms cannot process multivariate datasets, and (semi-)supervised algorithms can process only datasets that provide the required training time series. Non-compliant combinations are excluded.

The next table shows an excerpt of the result table with 10 of 26 columns. The complete table with the (quality and runtime) results of all algorithms on all datasets can be downloaded here.

| | algorithm | collection | dataset | status | ROC_AUC | AVERAGE_PRECISION | PR_AUC | RANGE_PR_AUC | execute_main_time | hyper_params |
|---|---|---|---|---|---|---|---|---|---|---|
| 10428 | ARIMA | IOPS | 4d2af31a-9916-3d9f-8a8e-8a268a48c095 | Status.TIMEOUT | NaN | NaN | NaN | NaN | NaN | {"differencing_degree": 1, "distance_metric": ... |
| 10429 | ARIMA | IOPS | 42d6616d-c9c5-370a-a8ba-17ead74f3114 | Status.TIMEOUT | NaN | NaN | NaN | NaN | NaN | {"differencing_degree": 1, "distance_metric": ... |
| 10430 | ARIMA | IOPS | c02607e8-7399-3dde-9d28-8a8da5e5d251 | Status.OK | 0.838021 | 0.320436 | 0.522335 | 0.166973 | 1553.363329 | {"differencing_degree": 1, "distance_metric": ... |
| 10431 | ARIMA | IOPS | 301c70d8-1630-35ac-8f96-bc1b6f4359ea | Status.OK | 0.644397 | 0.086475 | 0.076127 | 0.072827 | 578.490987 | {"differencing_degree": 1, "distance_metric": ... |
| 10432 | ARIMA | KDD-TSAD | 001_UCR_Anomaly_DISTORTED1sddb40 | Status.OK | 0.162272 | 0.006566 | 0.004535 | 0.004952 | 2914.872097 | {"differencing_degree": 1, "distance_metric": ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45126 | VALMOD | NormA | Discords_marotta_valve_tek_17 | Status.OK | 0.536991 | 0.025921 | 0.025703 | 0.040605 | 26.353962 | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45127 | VALMOD | NormA | Discords_patient_respiration1 | Status.OK | 0.819914 | 0.223165 | 0.222676 | 0.159304 | 24.373522 | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45128 | VALMOD | NormA | SinusRW_Length_104000_AnomalyL_200_AnomalyN_20... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45129 | VALMOD | NormA | SinusRW_Length_106000_AnomalyL_100_AnomalyN_60... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |
| 45130 | VALMOD | NormA | SinusRW_Length_108000_AnomalyL_200_AnomalyN_40... | Status.ERROR | NaN | NaN | NaN | NaN | NaN | {"exclusion_zone": 0.5, "heap_size": 50, "max_... |

34703 rows × 10 columns
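
The result table can be explored with pandas. A minimal sketch, assuming the download was saved as `results.csv` (a hypothetical file name):

```python
import pandas as pd

# Load the complete result table ("results.csv" is a placeholder for
# the downloaded file).
df = pd.read_csv("results.csv")

# Restrict the view to the 10 excerpt columns shown above.
excerpt_columns = [
    "algorithm", "collection", "dataset", "status", "ROC_AUC",
    "AVERAGE_PRECISION", "PR_AUC", "RANGE_PR_AUC",
    "execute_main_time", "hyper_params",
]
print(df[excerpt_columns])
```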

Error analysis

We first want to look at the ability of the algorithms to process the different datasets. Some algorithms are limited by our time and memory constraints, while others produce errors when their assumptions (invariants) are violated or when they hit implementation deficits.

Algorithm problems grouped by algorithm training type

Unsupervised:

| algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | S-H-ESD (Twitter) | 441 | 298 | 0 | 739 |
| | SAND | 186 | 505 | 48 | 739 |
| | VALMOD | 168 | 559 | 12 | 739 |
| | Triple ES (Holt-Winter's) | 79 | 524 | 136 | 739 |
| | Series2Graph | 45 | 694 | 0 | 739 |
| | NormA | 39 | 622 | 78 | 739 |
| | PST | 36 | 703 | 0 | 739 |
| | HOT SAX | 18 | 554 | 167 | 739 |
| | Left STAMPi | 12 | 712 | 15 | 739 |
| | ARIMA | 4 | 676 | 59 | 739 |
| | PhaseSpace-SVM | 2 | 626 | 111 | 739 |
| | MedianMethod | 1 | 738 | 0 | 739 |
| | NumentaHTM | 1 | 735 | 3 | 739 |
| | DSPOT | 0 | 686 | 53 | 739 |
| | DWT-MLEAD | 0 | 739 | 0 | 739 |
| | FFT | 0 | 739 | 0 | 739 |
| | GrammarViz | 0 | 713 | 26 | 739 |
| | PCI | 0 | 739 | 0 | 739 |
| | SSA | 0 | 734 | 5 | 739 |
| | STAMP | 0 | 701 | 38 | 739 |
| | STOMP | 0 | 725 | 14 | 739 |
| | Spectral Residual (SR) | 0 | 739 | 0 | 739 |
| | Subsequence IF | 0 | 739 | 0 | 739 |
| | Subsequence LOF | 0 | 721 | 18 | 739 |
| | TSBitmap | 0 | 739 | 0 | 739 |
| MULTIVARIATE | DBStream | 627 | 161 | 1 | 789 |
| | COF | 239 | 550 | 0 | 789 |
| | k-Means | 47 | 742 | 0 | 789 |
| | IF-LOF | 25 | 762 | 2 | 789 |
| | CBLOF | 3 | 786 | 0 | 789 |
| | Torsk | 2 | 721 | 66 | 789 |
| | COPOD | 0 | 789 | 0 | 789 |
| | Extended Isolation Forest (EIF) | 0 | 789 | 0 | 789 |
| | HBOS | 0 | 789 | 0 | 789 |
| | Isolation Forest (iForest) | 0 | 789 | 0 | 789 |
| | KNN | 0 | 789 | 0 | 789 |
| | LOF | 0 | 789 | 0 | 789 |
| | PCC | 0 | 789 | 0 | 789 |

Semi-supervised:

| algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | TARZAN | 50 | 250 | 0 | 300 |
| | OceanWNN | 45 | 255 | 0 | 300 |
| | Donut | 13 | 283 | 4 | 300 |
| | Bagel | 9 | 205 | 86 | 300 |
| | ImageEmbeddingCAE | 6 | 294 | 0 | 300 |
| | SR-CNN | 6 | 191 | 103 | 300 |
| | XGBoosting (RR) | 2 | 298 | 0 | 300 |
| | Random Forest Regressor (RR) | 0 | 246 | 54 | 300 |
| MULTIVARIATE | LSTM-AD | 173 | 85 | 65 | 323 |
| | EncDec-AD | 110 | 66 | 147 | 323 |
| | OmniAnomaly | 27 | 296 | 0 | 323 |
| | Random Black Forest (RR) | 23 | 281 | 19 | 323 |
| | LaserDBN | 16 | 285 | 22 | 323 |
| | DeepAnT | 14 | 309 | 0 | 323 |
| | Hybrid KNN | 9 | 314 | 0 | 323 |
| | TAnoGan | 4 | 103 | 216 | 323 |
| | HealthESN | 2 | 23 | 298 | 323 |
| | RobustPCA | 0 | 323 | 0 | 323 |
| | Telemanom | 0 | 322 | 1 | 323 |

Supervised:

| algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| MULTIVARIATE | MultiHMM | 5 | 1 | 0 | 6 |
| | Normalizing Flows | 2 | 1 | 3 | 6 |
| | Hybrid Isolation Forest (HIF) | 0 | 5 | 1 | 6 |

As we can see in the tables above, some algorithms are severely impacted by our time limit of 2 hours: they hit the time limit for a majority of the datasets. This applies not only to multivariate but also to univariate algorithms. In addition, many algorithms run into errors. The error count in the tables includes memory errors caused by algorithms hitting our memory limit of 3 GB. We highlight some outlying algorithms in the next sections.

In general, 87.43% of all experiments were successful, 5.39% were timeouts, and 7.18% were errors.
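
These percentages follow directly from the status column of the result table. A minimal sketch, reusing the hypothetical `df` from above and assuming the statuses are stored as the strings shown in the tables:

```python
# Relative share of experiments per status; ERROR includes the OOM cases.
shares = df["status"].value_counts(normalize=True).mul(100).round(2)
print(shares)  # roughly: Status.OK 87.43, Status.ERROR 7.18, Status.TIMEOUT 5.39
```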

Very slow algorithms

Algorithms for which at least 50% of all executions ran into the timeout:

| algo_training_type | algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|---|
| SEMI_SUPERVISED | MULTIVARIATE | HealthESN | 2 | 23 | 298 | 323 |
| | | TAnoGan | 4 | 103 | 216 | 323 |
| SUPERVISED | MULTIVARIATE | Normalizing Flows | 2 | 1 | 3 | 6 |

HealthESN and TAnoGan are the two algorithms that hit the time limit the most. Both algorithms are semi-supervised and require a training step. We used time limits of 2 hours for the training step and 2 hours for the testing step.

Normalizing Flows is a supervised algorithm that also hit the time limit for half of the datasets.

HealthESN, TAnoGan, and Normalizing Flows are also the algorithms with the most timeouts for our GutenTAG datasets (cf. GutenTAG result analysis).

Broken algorithms

Algorithms that failed for at least 50% of their executions:

| algo_training_type | algo_input_dimensionality | algorithm | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|---|
| SEMI_SUPERVISED | MULTIVARIATE | LSTM-AD | 173 | 85 | 65 | 323 |
| SUPERVISED | MULTIVARIATE | MultiHMM | 5 | 1 | 0 | 6 |
| UNSUPERVISED | MULTIVARIATE | DBStream | 627 | 161 | 1 | 789 |
| | UNIVARIATE | S-H-ESD (Twitter) | 441 | 298 | 0 | 739 |

Similar to the GutenTAG datasets, LSTM-AD, MultiHMM, and DBStream are the algorithms with the most errors (cf. GutenTAG result analysis). In addition to those three, S-H-ESD has massive problems with the Benchmark datasets and produces errors for about 60% of its experiments (compared to 0% on the GutenTAG datasets).

To better understand why algorithms fail, we distinguish between different error types in the next section.

Categorization of errors

We categorize all observed errors into specific categories and then count the number of executions whose errors belong to each category. The next table shows how often each error category was observed, summed over all algorithms:

| error_category | count |
|---|---|
| - OK - | 30341 |
| - OOM - | 754 |
| - TIMEOUT - | 1871 |
| Bug | 847 |
| Incompatible parameters | 648 |
| Invariance/assumption not met | 122 |
| LinAlgError | 21 |
| Max recursion depth exceeded | 30 |
| Model loading error | 5 |
| Not converged | 8 |
| TimeEval:IndexError | 3 |
| Wrong shape error | 13 |
| other | 16 |
| unexpected Inf or NaN | 24 |

LSTM-AD's errors are dominated by OOMs, MultiHMM does not converge, DBStream has implementation errors that we could not fix, and S-H-ESD has assumptions that are not met by the datasets. S-H-ESD's errors are categorized as Incompatible parameters, which is not entirely accurate: S-H-ESD expects a timestamp index for the time series, and the data should span multiple days or weeks so that seasonality can be removed. We use a heuristic to convert incremental indices to timestamps and to set a suitable time span. If the heuristic cannot adapt a dataset to S-H-ESD's assumptions, it raises an exception, which we record as an Incompatible parameters error.
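
To illustrate the idea behind this heuristic (not our exact implementation), the following sketch replaces an incremental integer index with an artificial timestamp index spanning several weeks, so that seasonality removal becomes applicable; the function name and the chosen span are made up for this example:

```python
import pandas as pd

def add_timestamp_index(ts: pd.DataFrame, span: str = "4W") -> pd.DataFrame:
    """Replace an incremental integer index with an artificial datetime
    index spanning `span`, as required by S-H-ESD (illustrative only)."""
    n = len(ts)
    step = pd.Timedelta(span) / max(n - 1, 1)  # spacing between points
    ts = ts.copy()
    ts.index = pd.date_range(start="2020-01-01", periods=n, freq=step)
    return ts
```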

Algorithm quality assessment based on ROC_AUC

The next table shows the minimum, mean, median, and maximum ROC_AUC score computed over all datasets for each algorithm (sorted by mean score):

| algorithm | min | mean | median | max |
|---|---|---|---|---|
| Normalizing Flows | 0.992443 | 0.992443 | 0.992443 | 0.992443 |
| Subsequence LOF | 0.025229 | 0.873463 | 0.971229 | 1.000000 |
| Hybrid Isolation Forest (HIF) | 0.592981 | 0.858942 | 0.856722 | 0.999918 |
| Donut | 0.071872 | 0.814245 | 0.912582 | 0.999993 |
| GrammarViz | 0.001746 | 0.814148 | 0.895212 | 1.000000 |
| DWT-MLEAD | 0.016998 | 0.811523 | 0.887902 | 1.000000 |
| VALMOD | 0.000000 | 0.801048 | 0.955706 | 1.000000 |
| LSTM-AD | 0.013924 | 0.787004 | 0.905740 | 0.999996 |
| PCI | 0.028019 | 0.747642 | 0.797039 | 1.000000 |
| Left STAMPi | 0.092740 | 0.742096 | 0.776048 | 1.000000 |
| Telemanom | 0.000014 | 0.741286 | 0.861103 | 1.000000 |
| Triple ES (Holt-Winter's) | 0.106113 | 0.735233 | 0.738627 | 1.000000 |
| SAND | 0.037324 | 0.734560 | 0.723227 | 1.000000 |
| ARIMA | 0.007541 | 0.731612 | 0.783483 | 0.999905 |
| Random Forest Regressor (RR) | 0.074326 | 0.729402 | 0.737885 | 1.000000 |
| NumentaHTM | 0.290517 | 0.723747 | 0.721098 | 0.999969 |
| Series2Graph | 0.000106 | 0.723652 | 0.770457 | 1.000000 |
| STOMP | 0.000000 | 0.704349 | 0.841000 | 1.000000 |
| KNN | 0.086367 | 0.699461 | 0.661559 | 1.000000 |
| STAMP | 0.000000 | 0.697303 | 0.823992 | 1.000000 |
| Extended Isolation Forest (EIF) | 0.078512 | 0.694858 | 0.664053 | 1.000000 |
| HealthESN | 0.443499 | 0.694434 | 0.636674 | 0.992204 |
| Isolation Forest (iForest) | 0.063202 | 0.693879 | 0.659908 | 1.000000 |
| Subsequence IF | 0.000704 | 0.691789 | 0.720508 | 1.000000 |
| k-Means | 0.000000 | 0.690503 | 0.749782 | 1.000000 |
| HBOS | 0.067572 | 0.690325 | 0.663484 | 1.000000 |
| Spectral Residual (SR) | 0.002497 | 0.689837 | 0.665752 | 1.000000 |
| NormA | 0.000000 | 0.685234 | 0.673790 | 1.000000 |
| MedianMethod | 0.025507 | 0.682457 | 0.652758 | 1.000000 |
| ImageEmbeddingCAE | 0.003642 | 0.682375 | 0.753238 | 0.999921 |
| COPOD | 0.002113 | 0.680206 | 0.647625 | 1.000000 |
| XGBoosting (RR) | 0.078633 | 0.678807 | 0.664087 | 0.999995 |
| EncDec-AD | 0.000937 | 0.674921 | 0.725808 | 0.999701 |
| COF | 0.023289 | 0.674018 | 0.618044 | 1.000000 |
| Random Black Forest (RR) | 0.216832 | 0.670852 | 0.656053 | 0.999992 |
| CBLOF | 0.071530 | 0.668766 | 0.622689 | 1.000000 |
| IF-LOF | 0.018925 | 0.666316 | 0.607835 | 1.000000 |
| LOF | 0.019324 | 0.663431 | 0.586261 | 1.000000 |
| DBStream | 0.045533 | 0.654112 | 0.561812 | 0.999997 |
| DeepAnT | 0.000241 | 0.647727 | 0.726475 | 0.999988 |
| OceanWNN | 0.090600 | 0.640535 | 0.630037 | 0.999524 |
| Torsk | 0.117461 | 0.630693 | 0.576963 | 1.000000 |
| OmniAnomaly | 0.002294 | 0.620901 | 0.673658 | 0.999499 |
| PCC | 0.031566 | 0.600068 | 0.567187 | 1.000000 |
| PST | 0.000000 | 0.597836 | 0.623624 | 1.000000 |
| PhaseSpace-SVM | 0.002430 | 0.586604 | 0.581505 | 0.996924 |
| TSBitmap | 0.055675 | 0.577123 | 0.572539 | 0.999296 |
| HOT SAX | 0.156909 | 0.567419 | 0.500000 | 1.000000 |
| LaserDBN | 0.091204 | 0.560308 | 0.544781 | 0.999941 |
| RobustPCA | 0.000263 | 0.559578 | 0.535274 | 0.999983 |
| SSA | 0.005590 | 0.557453 | 0.514397 | 0.998834 |
| Bagel | 0.000664 | 0.551537 | 0.538822 | 0.981582 |
| DSPOT | 0.248658 | 0.550059 | 0.500000 | 1.000000 |
| TAnoGan | 0.001404 | 0.536777 | 0.511289 | 0.999298 |
| Hybrid KNN | 0.000004 | 0.524415 | 0.505695 | 0.999999 |
| FFT | 0.007053 | 0.523552 | 0.500000 | 1.000000 |
| TARZAN | 0.000162 | 0.516395 | 0.554709 | 0.999991 |
| SR-CNN | 0.134518 | 0.515006 | 0.500000 | 0.955550 |
| S-H-ESD (Twitter) | 0.479870 | 0.512374 | 0.500000 | 0.890020 |
| MultiHMM | 0.374354 | 0.374354 | 0.374354 | 0.374354 |
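
These statistics are a plain group-by aggregation over the result table. A sketch, reusing the hypothetical `df` from above (failed experiments carry NaN scores, which the aggregation ignores):

```python
# min/mean/median/max ROC_AUC per algorithm, sorted by mean score.
stats = (
    df.groupby("algorithm")["ROC_AUC"]
      .agg(["min", "mean", "median", "max"])
      .sort_values("mean", ascending=False)
)
print(stats.head(10))  # best algorithms by mean ROC_AUC
print(stats.tail(10))  # worst algorithms by mean ROC_AUC
```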

The following boxplots give a more visual picture of the score distributions. The algorithms are ordered by their mean ROC_AUC score (the mean itself is not included in the visualization), and only the first and last 10 algorithms are shown by default. Use the legend on the right to display additional algorithms.

Best algorithms (based on mean ROC_AUC)

| algorithm | min | mean | median | max |
|---|---|---|---|---|
| Normalizing Flows | 0.992443 | 0.992443 | 0.992443 | 0.992443 |
| Subsequence LOF | 0.025229 | 0.873463 | 0.971229 | 1.000000 |
| Hybrid Isolation Forest (HIF) | 0.592981 | 0.858942 | 0.856722 | 0.999918 |
| Donut | 0.071872 | 0.814245 | 0.912582 | 0.999993 |
| GrammarViz | 0.001746 | 0.814148 | 0.895212 | 1.000000 |

Worst algorithms (based on mean ROC_AUC)

| algorithm | min | mean | median | max |
|---|---|---|---|---|
| FFT | 0.007053 | 0.523552 | 0.500000 | 1.000000 |
| TARZAN | 0.000162 | 0.516395 | 0.554709 | 0.999991 |
| SR-CNN | 0.134518 | 0.515006 | 0.500000 | 0.955550 |
| S-H-ESD (Twitter) | 0.479870 | 0.512374 | 0.500000 | 0.890020 |
| MultiHMM | 0.374354 | 0.374354 | 0.374354 | 0.374354 |

Scores of best algorithms

In the next figure, we show the anomaly scorings of the 4 best algorithms (excluding HIF and Normalizing Flows, because they are supervised and the selected dataset provides no training data) on the dataset “004_UCR_Anomaly_DISTORTEDBIDMC1”:


ROC_AUC over the number of successfully processed datasets

Similar to the reliability plot in our paper, the next figure shows each algorithm's ROC_AUC score in relation to the relative number of successfully processed datasets:
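
A sketch of how the two quantities behind such a reliability plot can be derived from the result table (hypothetical `df` and string statuses as above):

```python
# Mean ROC_AUC vs. fraction of successfully processed datasets per algorithm.
reliability = df.groupby("algorithm").agg(
    mean_roc_auc=("ROC_AUC", "mean"),  # NaNs of failed runs are skipped
    processed=("status", lambda s: (s == "Status.OK").mean()),
)
print(reliability.sort_values("processed", ascending=False))
```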

Runtime-weighted ROC_AUC scores

In the next figure, we try to combine the runtime and the result quality of the algorithms into one metric by weighting the ROC_AUC score with the inversely scaled overall runtime. Algorithms that take exceptionally long to process the datasets are penalized and receive a smaller weighted ROC_AUC score, while very fast algorithms keep their original ROC_AUC score.
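
One plausible formulation of such a weighting is sketched below; the exact scaling used for the figure may differ (hypothetical `df` as above):

```python
# Combine quality and runtime: weight mean ROC_AUC by inversely scaled runtime.
agg = df.groupby("algorithm").agg(
    roc_auc=("ROC_AUC", "mean"),
    runtime=("execute_main_time", "sum"),  # assumption: execution time only
)
# Min-max-scale the overall runtime to [0, 1]: fast algorithms get a weight
# close to 1 and keep their score; exceptionally slow ones are punished.
scaled = (agg["runtime"] - agg["runtime"].min()) / (
    agg["runtime"].max() - agg["runtime"].min()
)
agg["weighted_roc_auc"] = agg["roc_auc"] * (1 - scaled)
print(agg.sort_values("weighted_roc_auc", ascending=False))
```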

Best algorithm of algorithm family (based on ROC_AUC)

| algo_family | algorithm | ROC_AUC |
|---|---|---|
| trees | Hybrid Isolation Forest (HIF) | 0.858942 |
| reconstruction | Donut | 0.814245 |
| forecasting | LSTM-AD | 0.787004 |
| encoding | GrammarViz | 0.814148 |
| distribution | Normalizing Flows | 0.992443 |
| distance | Subsequence LOF | 0.873463 |

Compared to the GutenTAG datasets, the best algorithm changed for the families trees and distribution. For the other families, the best algorithm on the GutenTAG datasets is also the best algorithm on the Benchmark datasets.

Please note that the results for HIF and Normalizing Flows are not reliable: because they are supervised, they were executed on at most 6 datasets. In addition, Normalizing Flows could process only half of these datasets!
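
Selecting the best algorithm per family is again a small group-by exercise; a sketch assuming the result table contains an `algo_family` column (hypothetical `df` as above):

```python
# Mean ROC_AUC per (family, algorithm), then the best algorithm per family.
mean_scores = df.groupby(["algo_family", "algorithm"])["ROC_AUC"].mean()
best_per_family = mean_scores.loc[mean_scores.groupby("algo_family").idxmax()]
print(best_per_family)
```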

Algorithm quality assessment based on PR_AUC

In the next figure, we show the box plots of the different algorithms computed over the PR_AUC metric instead of ROC_AUC. The algorithms are sorted by their mean PR_AUC over all datasets.

Dataset assessment

In this section, we aggregate the results of our evaluation at the dataset collection level. This gives preliminary insights into the complexity and quality of the datasets.

Dataset error overview

We first want to have a look at the error distribution over the dataset collections: for each collection, we count the number of experiments that failed, were successful, or ran into a timeout. The results are grouped by the dataset training type (supervised if the datasets provide a training time series with labeled anomalies, semi-supervised if they provide a training time series without anomalies, and unsupervised if they provide no training data) and by input dimensionality.
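
The following sketch produces such an overview from the result table; the grouping column names mirror the table headers below and are assumptions (hypothetical `df` as above):

```python
# Status counts per dataset collection, split by input dimensionality.
overview = pd.crosstab(
    index=[df["dataset_input_dimensionality"], df["collection"]],
    columns=df["status"],
    margins=True, margins_name="ALL",
)
print(overview)
```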

Unsupervised:

| dataset_input_dimensionality | collection | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | WebscopeS5 | 789 | 12885 | 6 | 13680 |
| | NAB | 225 | 1860 | 43 | 2128 |
| | MGAB | 29 | 311 | 40 | 380 |
| | NormA | 28 | 299 | 15 | 342 |
| MULTIVARIATE | SVDB | 19 | 184 | 5 | 208 |
| | MITDB | 12 | 36 | 4 | 52 |
| | Genesis | 2 | 10 | 1 | 13 |
| | Daphnet | 1 | 37 | 1 | 39 |
| | CalIt2 | 0 | 13 | 0 | 13 |

Semi-supervised:

| dataset_input_dimensionality | collection | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | KDD-TSAD | 987 | 11644 | 1562 | 14193 |
| | NASA-SMAP | 147 | 1795 | 53 | 1995 |
| | NASA-MSL | 90 | 802 | 20 | 912 |
| MULTIVARIATE | SMD | 136 | 314 | 102 | 552 |

Supervised:

| dataset_input_dimensionality | collection | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| UNIVARIATE | IOPS | 16 | 131 | 17 | 164 |
| MULTIVARIATE | Exathlon | 10 | 20 | 2 | 32 |

In the next figure, we show the relative number of experiments that either failed (ERROR), ran into the time limit (TIMEOUT), ran into the memory limit (OOM), or were successful (OK). The dataset collections are sorted by their percentage of failed (ERROR) experiments.

We can see that the dataset collections MITDB, Exathlon, SMD, and Genesis have a large percentage of experiments that hit the memory or time limit. These collections contain time series that are either very long, very wide, or exhibit complicated patterns.

The dataset collections NAB, NASA-MSL, MITDB, and IOPS have more than 7% failing experiments. The generally high percentage of failing experiments concerns us; it might be due to bad dataset quality.

The next figure shows the percentage of successful experiments aggregated per dataset. We highlight the datasets with a success rate below 80% (on the left side of the plot).
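
The per-dataset success rate behind this figure can be computed as follows (hypothetical `df` and string statuses as above):

```python
# Fraction of successful experiments per dataset; list those below 80%.
success_rate = (
    (df["status"] == "Status.OK")
    .groupby([df["collection"], df["dataset"]])
    .mean()
)
print(success_rate[success_rate < 0.8].sort_values())
```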

Datasets that all algorithms could process

| collection | dataset | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| KDD-TSAD | 162_UCR_Anomaly_WalkingAceleration5 | 0 | 57 | 0 | 57 |
| Daphnet | S09R01E4 | 0 | 13 | 0 | 13 |
| CalIt2 | CalIt2-traffic | 0 | 13 | 0 | 13 |

Most broken datasets

Datasets for which more than 40% of the experiments were unsuccessful (errors or timeouts):

| collection | dataset | Status.ERROR | Status.OK | Status.TIMEOUT | ALL |
|---|---|---|---|---|---|
| SMD | machine-1-1 | 7 | 12 | 5 | 24 |
| | machine-1-2 | 6 | 14 | 4 | 24 |
| | machine-1-3 | 6 | 13 | 5 | 24 |
| | machine-1-8 | 6 | 14 | 4 | 24 |
| | machine-2-1 | 6 | 13 | 5 | 24 |
| | machine-2-4 | 6 | 14 | 4 | 24 |
| | machine-2-5 | 6 | 14 | 4 | 24 |
| | machine-2-8 | 6 | 14 | 4 | 24 |
| | machine-2-9 | 6 | 13 | 5 | 24 |
| | machine-3-1 | 6 | 14 | 4 | 24 |
| | machine-3-10 | 6 | 13 | 5 | 24 |
| | machine-3-11 | 6 | 14 | 4 | 24 |
| | machine-3-3 | 6 | 13 | 5 | 24 |
| | machine-3-4 | 6 | 14 | 4 | 24 |
| | machine-3-5 | 6 | 13 | 5 | 24 |
| | machine-3-6 | 6 | 14 | 4 | 24 |
| | machine-3-7 | 6 | 13 | 5 | 24 |
| | machine-3-8 | 6 | 13 | 5 | 24 |
| | machine-3-9 | 6 | 14 | 4 | 24 |
| | machine-2-6 | 5 | 14 | 5 | 24 |
| | machine-2-7 | 5 | 14 | 5 | 24 |
| KDD-TSAD | 108_UCR_Anomaly_NOISEresperation2 | 16 | 31 | 10 | 57 |
| | 187_UCR_Anomaly_resperation2 | 16 | 33 | 8 | 57 |
| | 079_UCR_Anomaly_DISTORTEDresperation2 | 13 | 33 | 11 | 57 |
| | 239_UCR_Anomaly_taichidbS0715Master | 11 | 32 | 14 | 57 |
| | 240_UCR_Anomaly_taichidbS0715Master | 11 | 32 | 14 | 57 |
| | 218_UCR_Anomaly_STAFFIIIDatabase | 10 | 30 | 17 | 57 |
| | 220_UCR_Anomaly_STAFFIIIDatabase | 8 | 34 | 15 | 57 |
| | 246_UCR_Anomaly_tilt12755mtable | 8 | 32 | 17 | 57 |
| | 078_UCR_Anomaly_DISTORTEDresperation1 | 7 | 33 | 17 | 57 |
| | 213_UCR_Anomaly_STAFFIIIDatabase | 7 | 33 | 17 | 57 |
| | 216_UCR_Anomaly_STAFFIIIDatabase | 7 | 34 | 16 | 57 |
| | 244_UCR_Anomaly_tilt12754table | 7 | 30 | 20 | 57 |
| | 245_UCR_Anomaly_tilt12754table | 7 | 30 | 20 | 57 |
| | 219_UCR_Anomaly_STAFFIIIDatabase | 6 | 33 | 18 | 57 |
| | 242_UCR_Anomaly_tilt12744mtable | 6 | 33 | 18 | 57 |

Dataset quality assessment based on ROC_AUC

The next figure shows the ROC_AUC score box plots per dataset. The datasets are sorted by their median ROC_AUC score.

Note that the number of experiments differs for each dataset based on its training type and input dimensionality!

In the next figure, you can see the dataset with the worst median ROC_AUC and a selection of algorithm scores (DWT-MLEAD, STOMP, Series2Graph, and Subsequence LOF):