Experiment result analysis on the GutenTAG datasets
On this website, we present detailed results of the experiments on our synthetically generated datasets (with GutenTAG). We show errors, qualitative results, and the runtime of the different algorithms.
Result Overview
In this analysis, we consider only the results of all 60 relevant algorithms, each with its best parameter configuration, on the GutenTAG datasets. These datasets were generated synthetically and were also used to find the best parameter configurations for the algorithms:
- Experiments: 10428
- Algorithms: 60
- Datasets: 187
The number of experiments is smaller than \(\text{# Algos} \times \text{# Datasets}\) because univariate algorithms cannot process multivariate datasets and those combinations are excluded.
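The experiment count can be verified from the status tables below, which list 33 univariate algorithms (each run on the 163 univariate datasets) and 27 multivariate algorithms (run on all 187 datasets):

```python
# Sanity check for the experiment count. The algorithm counts (33
# univariate, 27 multivariate) and dataset counts are taken from the
# status tables in the error analysis below.
univariate_algos, multivariate_algos = 33, 27
univariate_datasets, all_datasets = 163, 187

experiments = (univariate_algos * univariate_datasets
               + multivariate_algos * all_datasets)
print(experiments)  # 5379 + 5049 = 10428
```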
The next table shows an excerpt of the result table with 9 of its 26 columns. The complete table, with the quality and runtime results of all algorithms on all datasets, can be downloaded here.
 | algorithm | dataset | status | ROC_AUC | AVERAGE_PRECISION | PR_AUC | RANGE_PR_AUC | execute_main_time | hyper_params
---|---|---|---|---|---|---|---|---|---
0 | ARIMA | cbf-combined-diff-1 | Status.OK | 0.815319 | 0.454742 | 0.465248 | 0.453215 | 71.414111 | {"differencing_degree": 1, "distance_metric": ... |
1 | ARIMA | cbf-combined-diff-3 | Status.OK | 0.955978 | 0.241877 | 0.127965 | 0.136431 | 129.666755 | {"differencing_degree": 1, "distance_metric": ... |
2 | ARIMA | cbf-diff-count-1 | Status.OK | 0.439091 | 0.014368 | 0.008516 | 0.016521 | 72.992341 | {"differencing_degree": 1, "distance_metric": ... |
3 | ARIMA | cbf-diff-count-3 | Status.OK | 0.868527 | 0.129214 | 0.090548 | 0.053913 | 75.303179 | {"differencing_degree": 1, "distance_metric": ... |
4 | ARIMA | cbf-diff-count-4 | Status.OK | 0.626002 | 0.082363 | 0.054644 | 0.034841 | 183.925331 | {"differencing_degree": 1, "distance_metric": ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10423 | k-Means | sinus-type-pattern | Status.OK | 0.999999 | 0.999901 | 0.999900 | 0.577762 | 67.510581 | {"anomaly_window_size": 100, "n_clusters": 50,... |
10424 | k-Means | sinus-type-pattern-shift | Status.OK | 0.999738 | 0.957231 | 0.956725 | 0.544578 | 53.177865 | {"anomaly_window_size": 100, "n_clusters": 50,... |
10425 | k-Means | sinus-type-platform | Status.OK | 0.998038 | 0.738244 | 0.735666 | 0.555714 | 59.745593 | {"anomaly_window_size": 100, "n_clusters": 50,... |
10426 | k-Means | sinus-type-trend | Status.OK | 0.999994 | 0.999410 | 0.999407 | 0.560816 | 48.915376 | {"anomaly_window_size": 100, "n_clusters": 50,... |
10427 | k-Means | sinus-type-variance | Status.OK | 0.999990 | 0.999019 | 0.999014 | 0.579041 | 82.035531 | {"anomaly_window_size": 100, "n_clusters": 50,... |
10428 rows × 9 columns
Error analysis
We first look at the algorithms' ability to process the different datasets. Some algorithms are restricted by our time and memory constraints; others produce errors when specific invariants are violated or implementation deficits surface.
Algorithm problems grouped by algorithm training type
Unsupervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
algo_input_dimensionality | algorithm | ||||
UNIVARIATE | SAND | 26 | 137 | 0 | 163 |
VALMOD | 6 | 157 | 0 | 163 | |
Series2Graph | 3 | 160 | 0 | 163 | |
Left STAMPi | 1 | 162 | 0 | 163 | |
ARIMA | 0 | 163 | 0 | 163 | |
DSPOT | 0 | 160 | 3 | 163 | |
DWT-MLEAD | 0 | 163 | 0 | 163 | |
FFT | 0 | 163 | 0 | 163 | |
GrammarViz | 0 | 163 | 0 | 163 | |
HOT SAX | 0 | 114 | 49 | 163 | |
MedianMethod | 0 | 163 | 0 | 163 | |
NormA | 0 | 153 | 10 | 163 | |
NumentaHTM | 0 | 163 | 0 | 163 | |
PCI | 0 | 163 | 0 | 163 | |
PST | 0 | 163 | 0 | 163 | |
PhaseSpace-SVM | 0 | 163 | 0 | 163 | |
S-H-ESD (Twitter) | 0 | 163 | 0 | 163 | |
SSA | 0 | 163 | 0 | 163 | |
STAMP | 0 | 163 | 0 | 163 | |
STOMP | 0 | 163 | 0 | 163 | |
Spectral Residual (SR) | 0 | 163 | 0 | 163 | |
Subsequence IF | 0 | 163 | 0 | 163 | |
Subsequence LOF | 0 | 163 | 0 | 163 | |
TSBitmap | 0 | 163 | 0 | 163 | |
Triple ES (Holt-Winter's) | 0 | 163 | 0 | 163 | |
MULTIVARIATE | DBStream | 155 | 32 | 0 | 187 |
CBLOF | 0 | 187 | 0 | 187 | |
COF | 0 | 187 | 0 | 187 | |
COPOD | 0 | 187 | 0 | 187 | |
Extended Isolation Forest (EIF) | 0 | 187 | 0 | 187 | |
HBOS | 0 | 187 | 0 | 187 | |
IF-LOF | 0 | 187 | 0 | 187 | |
Isolation Forest (iForest) | 0 | 187 | 0 | 187 | |
KNN | 0 | 187 | 0 | 187 | |
LOF | 0 | 187 | 0 | 187 | |
PCC | 0 | 187 | 0 | 187 | |
Torsk | 0 | 180 | 7 | 187 | |
k-Means | 0 | 187 | 0 | 187 |
Semi-supervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
algo_input_dimensionality | algorithm | ||||
UNIVARIATE | TARZAN | 32 | 131 | 0 | 163 |
Bagel | 0 | 163 | 0 | 163 | |
Donut | 0 | 163 | 0 | 163 | |
ImageEmbeddingCAE | 0 | 163 | 0 | 163 | |
OceanWNN | 0 | 163 | 0 | 163 | |
Random Forest Regressor (RR) | 0 | 163 | 0 | 163 | |
SR-CNN | 0 | 163 | 0 | 163 | |
XGBoosting (RR) | 0 | 163 | 0 | 163 | |
MULTIVARIATE | LSTM-AD | 98 | 81 | 8 | 187 |
EncDec-AD | 39 | 17 | 131 | 187 | |
LaserDBN | 23 | 164 | 0 | 187 | |
DeepAnT | 10 | 177 | 0 | 187 | |
OmniAnomaly | 4 | 183 | 0 | 187 | |
HealthESN | 0 | 150 | 37 | 187 | |
Hybrid KNN | 0 | 187 | 0 | 187 | |
Random Black Forest (RR) | 0 | 174 | 13 | 187 | |
RobustPCA | 0 | 187 | 0 | 187 | |
TAnoGan | 0 | 73 | 114 | 187 | |
Telemanom | 0 | 187 | 0 | 187 |
Supervised:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | |
---|---|---|---|---|---|
algo_input_dimensionality | algorithm | ||||
MULTIVARIATE | MultiHMM | 95 | 92 | 0 | 187 |
Normalizing Flows | 9 | 66 | 112 | 187 | |
Hybrid Isolation Forest (HIF) | 0 | 187 | 0 | 187 |
As we can see in the above tables, most algorithms can process almost all of the datasets. In the next subsections, we highlight some outlying algorithms.
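Tables like the ones above can be reproduced from the downloadable result table with a pandas cross tabulation; a minimal sketch on toy data, assuming the column names shown earlier:

```python
import pandas as pd

# Toy stand-in for the full result table (the real table has 10428 rows;
# column names as in the excerpt above).
df = pd.DataFrame({
    "algorithm": ["SAND", "SAND", "ARIMA", "DBStream"],
    "algo_input_dimensionality": ["UNIVARIATE", "UNIVARIATE",
                                  "UNIVARIATE", "MULTIVARIATE"],
    "status": ["Status.ERROR", "Status.OK", "Status.OK", "Status.ERROR"],
})

# Count executions per status for each algorithm, grouped by input
# dimensionality, with an ALL margin as in the tables above.
table = pd.crosstab(
    [df["algo_input_dimensionality"], df["algorithm"]],
    df["status"],
    margins=True, margins_name="ALL",
)
print(table)
```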
Very slow algorithms
Algorithms for which more than 50% of all executions ran into the timeout:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | ||
---|---|---|---|---|---|---|
algo_training_type | algo_input_dimensionality | algorithm | ||||
SEMI_SUPERVISED | MULTIVARIATE | EncDec-AD | 39 | 17 | 131 | 187 |
TAnoGan | 0 | 73 | 114 | 187 | ||
SUPERVISED | MULTIVARIATE | Normalizing Flows | 9 | 66 | 112 | 187 |
All time series in the GutenTAG collection have the same length (of \( 10000\) points). The algorithms EncDec-AD, TAnoGan, and Normalizing Flows are large deep learning models that take a long time to train and execute. This forces them either into the 2h training time limit or the 2h test time limit.
Almost all unsupervised algorithms are fast enough to finish within our time limit on all datasets; only HOT SAX (49 timeouts), NormA (10), Torsk (7), and DSPOT (3) exceeded it on some datasets.
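The selection criterion for the table above (more than 50% timeouts) can be computed directly from the status counts; a sketch using values from the tables:

```python
import pandas as pd

# Status counts per algorithm; the values for EncDec-AD, TAnoGan, and
# Normalizing Flows are taken from the tables above, ARIMA serves as a
# counter-example without timeouts.
counts = pd.DataFrame(
    {"Status.TIMEOUT": [131, 114, 112, 0],
     "ALL": [187, 187, 187, 163]},
    index=["EncDec-AD", "TAnoGan", "Normalizing Flows", "ARIMA"],
)

# Keep algorithms whose executions timed out in more than 50% of cases.
slow = counts[counts["Status.TIMEOUT"] / counts["ALL"] > 0.5]
print(slow.index.tolist())  # ['EncDec-AD', 'TAnoGan', 'Normalizing Flows']
```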
Broken algorithms
Algorithms that failed for at least 50% of their executions:
status | Status.ERROR | Status.OK | Status.TIMEOUT | ALL | ||
---|---|---|---|---|---|---|
algo_training_type | algo_input_dimensionality | algorithm | ||||
SEMI_SUPERVISED | MULTIVARIATE | LSTM-AD | 98 | 81 | 8 | 187 |
SUPERVISED | MULTIVARIATE | MultiHMM | 95 | 92 | 0 | 187 |
UNSUPERVISED | MULTIVARIATE | DBStream | 155 | 32 | 0 | 187 |
Errors occur independently of the algorithm's learning type. Prominent algorithms in this category are LSTM-AD, MultiHMM, and DBStream, which failed for more than 50% of their executions. To better understand why algorithms fail, we distinguish between different error categories in the next section.
Categorization of errors
We categorize all observed errors into specific categories and then count the number of executions with errors belonging to each category. The next table shows how often each error category was observed for each algorithm.
algorithm | ALL (sum) | ARIMA | Bagel | CBLOF | COF | COPOD | DBStream | DSPOT | DWT-MLEAD | DeepAnT | Donut | EncDec-AD | Extended Isolation Forest (EIF) | FFT | GrammarViz | HBOS | HOT SAX | HealthESN | Hybrid Isolation Forest (HIF) | Hybrid KNN | IF-LOF | ImageEmbeddingCAE | Isolation Forest (iForest) | KNN | LOF | LSTM-AD | LaserDBN | Left STAMPi | MedianMethod | MultiHMM | NormA | Normalizing Flows | NumentaHTM | OceanWNN | OmniAnomaly | PCC | PCI | PST | PhaseSpace-SVM | Random Black Forest (RR) | Random Forest Regressor (RR) | RobustPCA | S-H-ESD (Twitter) | SAND | SR-CNN | SSA | STAMP | STOMP | Series2Graph | Spectral Residual (SR) | Subsequence IF | Subsequence LOF | TARZAN | TAnoGan | TSBitmap | Telemanom | Torsk | Triple ES (Holt-Winter's) | VALMOD | XGBoosting (RR) | k-Means |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
error_category | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
- OK - | 9443 | 163 | 163 | 187 | 187 | 187 | 32 | 160 | 163 | 177 | 163 | 17 | 187 | 163 | 163 | 187 | 114 | 150 | 187 | 187 | 187 | 163 | 187 | 187 | 187 | 81 | 164 | 162 | 163 | 92 | 153 | 66 | 163 | 163 | 183 | 187 | 163 | 163 | 163 | 174 | 163 | 187 | 163 | 137 | 163 | 163 | 163 | 163 | 160 | 163 | 163 | 163 | 131 | 73 | 163 | 187 | 180 | 163 | 157 | 163 | 187 |
- OOM - | 146 | 39 | 98 | 9 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
- TIMEOUT - | 484 | 3 | 131 | 49 | 37 | 8 | 10 | 112 | 13 | 114 | 7 | ||||||||||||||||||||||||||||||||||||||||||||||||||
Bug | 177 | 98 | 10 | 23 | 25 | 3 | 12 | 6 | |||||||||||||||||||||||||||||||||||||||||||||||||||||
Incompatible parameters | 55 | 55 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Invariance/assumption not met | 1 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Max recursion depth exceeded | 20 | 20 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Model loading error | 4 | 4 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Not converged | 95 | 95 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Wrong shape error | 1 | 1 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
other | 2 | 2 |
We can, for example, see that the high error rate of LSTM-AD is mostly due to hitting the memory limit of 3 GB. The errors of MultiHMM, however, are due to its model not reaching a converged state during training. We suspect that some assumptions of the MultiHMM approach are not met by the failing datasets.
In general, our GutenTAG datasets are well defined and easy to process: 91% of all experiments were successful, and another 6% of all experiments failed only because of timeouts or OOMs.
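The percentages follow from the totals in the error-category table above:

```python
# Success and failure shares, computed from the error-category table:
# 9443 OK runs, 484 timeouts, and 146 OOMs out of 10428 experiments.
total = 10428
ok, timeout, oom = 9443, 484, 146

pct_ok = round(100 * ok / total)                # -> 91
pct_resource = round(100 * (timeout + oom) / total)  # -> 6
print(pct_ok, pct_resource)
```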
Algorithm quality assessment based on ROC_AUC
The next table shows the min, mean, median, and max ROC_AUC metric score computed over all datasets for each algorithm:
algorithm | LSTM-AD | Subsequence LOF | PhaseSpace-SVM | DWT-MLEAD | SAND | Donut | GrammarViz | Torsk | Left STAMPi | EncDec-AD | STOMP | STAMP | k-Means | Normalizing Flows | Telemanom | Series2Graph | Random Forest Regressor (RR) | VALMOD | XGBoosting (RR) | HealthESN | ImageEmbeddingCAE | Random Black Forest (RR) | ARIMA | PST | NormA | SSA | Subsequence IF | OceanWNN | HOT SAX | DeepAnT | DBStream | PCI | Triple ES (Holt-Winter's) | NumentaHTM | LaserDBN | MedianMethod | FFT | OmniAnomaly | TSBitmap | KNN | Extended Isolation Forest (EIF) | CBLOF | Isolation Forest (iForest) | HBOS | Hybrid Isolation Forest (HIF) | IF-LOF | LOF | Spectral Residual (SR) | S-H-ESD (Twitter) | COF | DSPOT | COPOD | PCC | Bagel | RobustPCA | SR-CNN | TAnoGan | MultiHMM | TARZAN | Hybrid KNN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | 0.123730 | 0.341819 | 0.307866 | 0.125859 | 0.167172 | 0.151962 | 0.207808 | 0.077172 | 0.126879 | 0.344264 | 0.009910 | 0.009910 | 0.000000 | 0.004679 | 0.092208 | 0.069038 | 0.405782 | 0.055046 | 0.373735 | 0.107862 | 0.106465 | 0.130278 | 0.050505 | 0.016049 | 0.013301 | 0.114735 | 0.000020 | 0.156114 | 0.147374 | 0.000095 | 0.102123 | 0.022453 | 0.239234 | 0.377848 | 0.119141 | 0.004040 | 0.014141 | 0.077511 | 0.132703 | 0.000000 | 0.000000 | 0.039293 | 0.000051 | 0.144394 | 0.054343 | 0.000101 | 0.164697 | 0.002450 | 0.473684 | 0.000000 | 0.273283 | 0.000051 | 0.055375 | 0.058306 | 0.000000 | 0.500000 | 0.000960 | 0.047605 | 0.000571 | 0.000003 |
mean | 0.965738 | 0.941804 | 0.920328 | 0.907602 | 0.898257 | 0.894965 | 0.894852 | 0.885825 | 0.880459 | 0.877664 | 0.874267 | 0.874142 | 0.872913 | 0.869716 | 0.863892 | 0.861379 | 0.860457 | 0.858050 | 0.856619 | 0.853132 | 0.851142 | 0.818027 | 0.816814 | 0.803791 | 0.786847 | 0.771233 | 0.765155 | 0.734238 | 0.731207 | 0.726896 | 0.719925 | 0.696556 | 0.673339 | 0.670671 | 0.655523 | 0.648901 | 0.644080 | 0.644023 | 0.637278 | 0.614195 | 0.609879 | 0.606390 | 0.603377 | 0.599450 | 0.599160 | 0.587309 | 0.577457 | 0.568847 | 0.559200 | 0.555700 | 0.554605 | 0.543097 | 0.532033 | 0.525684 | 0.514437 | 0.502331 | 0.481889 | 0.478073 | 0.474698 | 0.449687 |
median | 0.996443 | 0.995904 | 0.980000 | 0.972041 | 0.984132 | 0.973340 | 0.991579 | 0.979313 | 0.981922 | 0.999900 | 0.988399 | 0.988399 | 0.997220 | 0.994933 | 0.977484 | 0.942775 | 0.883773 | 0.971650 | 0.886839 | 0.915416 | 0.944112 | 0.843654 | 0.895639 | 0.871631 | 0.954595 | 0.845423 | 0.841325 | 0.752219 | 0.760240 | 0.853177 | 0.783729 | 0.662587 | 0.668647 | 0.645183 | 0.659910 | 0.567188 | 0.593000 | 0.658707 | 0.624381 | 0.623641 | 0.594593 | 0.558942 | 0.589781 | 0.585596 | 0.584366 | 0.559124 | 0.534933 | 0.544846 | 0.500000 | 0.521308 | 0.501351 | 0.526410 | 0.508369 | 0.550798 | 0.500000 | 0.500000 | 0.481301 | 0.488418 | 0.486515 | 0.444118 |
max | 1.000000 | 1.000000 | 0.999928 | 0.999992 | 1.000000 | 1.000000 | 1.000000 | 0.999990 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.998586 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999800 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.999600 | 0.999650 | 1.000000 | 1.000000 | 0.998544 | 0.998600 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.934293 | 1.000000 | 0.880000 | 0.999596 | 1.000000 | 0.999784 | 1.000000 |
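Aggregates like the ones above can be computed with a single groupby over the result table; a sketch on toy data, assuming the column names from the excerpt at the top:

```python
import pandas as pd

# Toy stand-in for the result table; the real table aggregates the
# ROC_AUC scores of each algorithm over all of its datasets.
df = pd.DataFrame({
    "algorithm": ["A", "A", "A", "B", "B"],
    "ROC_AUC": [0.2, 0.9, 1.0, 0.5, 0.7],
})

# min, mean, median, and max ROC_AUC per algorithm.
stats = df.groupby("algorithm")["ROC_AUC"].agg(["min", "mean", "median", "max"])
print(stats)
```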
The following boxplots give a more visual picture of the score distributions. The algorithms are ordered by their mean ROC_AUC score (the mean itself is not included in the visualization), and the first and last 10 algorithms are shown by default. Use the legend on the right to display additional algorithms.
Best algorithms (based on mean ROC_AUC)
min | mean | median | max | |
---|---|---|---|---|
algorithm | ||||
LSTM-AD | 0.123730 | 0.965738 | 0.996443 | 1.000000 |
Subsequence LOF | 0.341819 | 0.941804 | 0.995904 | 1.000000 |
PhaseSpace-SVM | 0.307866 | 0.920328 | 0.980000 | 0.999928 |
DWT-MLEAD | 0.125859 | 0.907602 | 0.972041 | 0.999992 |
SAND | 0.167172 | 0.898257 | 0.984132 | 1.000000 |
Worst algorithms (based on mean ROC_AUC)
min | mean | median | max | |
---|---|---|---|---|
algorithm | ||||
SR-CNN | 0.500000 | 0.502331 | 0.500000 | 0.880000 |
TAnoGan | 0.000960 | 0.481889 | 0.481301 | 0.999596 |
MultiHMM | 0.047605 | 0.478073 | 0.488418 | 1.000000 |
TARZAN | 0.000571 | 0.474698 | 0.486515 | 0.999784 |
Hybrid KNN | 0.000003 | 0.449687 | 0.444118 | 1.000000 |
Scores of best algorithms
In the next figure, we show the anomaly scores of the 4 best algorithms on the dataset “sinus-diff-count-2”:
Runtime-weighted ROC_AUC scores
In the next figure, we combine the runtime and result quality of the algorithms into one metric by weighting the ROC_AUC score with the inverse of the scaled overall runtime. Algorithms that take exceptionally long to process the datasets are penalized and receive a smaller weighted ROC_AUC score, while very fast algorithms keep their original ROC_AUC score.
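One plausible formulation of this weighting, sketched below on made-up numbers (the exact scaling used for the figure may differ): min-max-scale the mean overall runtime, then multiply each ROC_AUC score by one minus that scale, so the fastest algorithm keeps its score and the slowest is penalized most.

```python
import pandas as pd

# Made-up mean scores and runtimes for three hypothetical algorithms.
df = pd.DataFrame({
    "mean_roc_auc": [0.97, 0.87, 0.45],
    "mean_runtime": [10.0, 500.0, 7000.0],
}, index=["fast", "medium", "slow"])

# Min-max-scale the runtime to [0, 1] and use its inverse as a weight.
scaled = (df["mean_runtime"] - df["mean_runtime"].min()) / \
         (df["mean_runtime"].max() - df["mean_runtime"].min())
df["weighted_roc_auc"] = df["mean_roc_auc"] * (1.0 - scaled)
print(df["weighted_roc_auc"])
```

With this formulation, the fastest algorithm keeps its original score and the slowest is weighted down to zero.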
Algorithm runtime assessment
This section examines the runtime of the algorithms. In our paper, we distinguish between training and execution runtime; the following figures consider only the combined (overall) runtime.
The next table shows the min, mean, median, and max overall runtime aggregated over all GutenTAG datasets for each algorithm.
Keep in mind that all GutenTAG datasets have the same length of \(10000\) points and most contain only a single channel: only 24 of the 187 datasets are multivariate.
algorithm | DBStream | MedianMethod | TSBitmap | Spectral Residual (SR) | FFT | PCI | Extended Isolation Forest (EIF) | DWT-MLEAD | PCC | KNN | LOF | COPOD | NormA | IF-LOF | TARZAN | HBOS | LaserDBN | Isolation Forest (iForest) | STOMP | Subsequence IF | S-H-ESD (Twitter) | CBLOF | Subsequence LOF | SSA | GrammarViz | Series2Graph | PST | MultiHMM | RobustPCA | XGBoosting (RR) | STAMP | COF | Left STAMPi | VALMOD | PhaseSpace-SVM | SAND | NumentaHTM | k-Means | OceanWNN | Donut | Hybrid Isolation Forest (HIF) | EncDec-AD | ImageEmbeddingCAE | Normalizing Flows | HOT SAX | Torsk | ARIMA | DSPOT | Random Black Forest (RR) | SR-CNN | Telemanom | Hybrid KNN | Random Forest Regressor (RR) | Triple ES (Holt-Winter's) | Bagel | DeepAnT | HealthESN | LSTM-AD | TAnoGan | OmniAnomaly |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
min | 0.000000 | 2.167620 | 2.092893 | 2.978843 | 2.768183 | 4.014369 | 5.236932 | 5.203950 | 5.205858 | 5.194909 | 5.189140 | 6.203962 | 0.000000 | 6.299420 | 0.000000 | 7.425389 | 0.000000 | 8.048582 | 10.053667 | 9.187731 | 10.025252 | 8.141591 | 6.378969 | 9.810762 | 2.366554 | 0.000000 | 7.068686 | 0.000000 | 11.206035 | 29.423896 | 2.546820 | 27.155001 | 0.000000 | 0.000000 | 24.383494 | 0.000000 | 72.370925 | 5.902524 | 147.090236 | 238.261296 | 298.579914 | 0.000000 | 19.853276 | 0.000000 | 0.000000 | 0.000000 | 71.414111 | 0.000000 | 0.000000 | 734.100854 | 214.989800 | 332.475525 | 1081.153213 | 1662.362775 | 1730.130522 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
mean | 1.485884 | 3.646354 | 4.204380 | 4.448068 | 4.683874 | 6.203468 | 6.599313 | 6.958002 | 7.298360 | 7.352563 | 7.476626 | 8.305900 | 8.640119 | 8.946259 | 9.336935 | 9.983120 | 10.513870 | 11.019563 | 11.900743 | 13.441904 | 13.829732 | 14.455307 | 18.856173 | 18.941656 | 22.910533 | 23.010118 | 24.472725 | 33.024754 | 35.213852 | 36.827480 | 38.341096 | 39.020412 | 53.121713 | 53.299238 | 82.067981 | 85.901954 | 91.581986 | 98.254350 | 250.583056 | 339.671690 | 485.455303 | 504.099002 | 654.139068 | 680.576232 | 927.950220 | 1244.610911 | 1253.151379 | 1295.773654 | 1477.619365 | 1478.505805 | 1836.957278 | 1929.474488 | 2118.757325 | 2487.212241 | 2771.290942 | 3128.522926 | 3180.770797 | 3269.571687 | 3839.777410 | 7113.762881 |
median | 0.000000 | 2.796953 | 4.882090 | 3.889712 | 3.599423 | 5.278512 | 6.065740 | 6.412318 | 6.820794 | 6.541939 | 6.669530 | 7.604242 | 6.737077 | 8.490258 | 10.090874 | 9.324363 | 10.716289 | 10.572865 | 11.528972 | 13.585095 | 13.678194 | 9.720897 | 21.301228 | 19.145623 | 21.936855 | 21.989442 | 29.936323 | 0.000000 | 14.051364 | 35.782439 | 27.600829 | 39.622869 | 53.934283 | 59.732887 | 61.484677 | 47.611511 | 90.902945 | 74.116743 | 219.784961 | 355.775571 | 483.189570 | 0.000000 | 574.251242 | 0.000000 | 597.805815 | 1103.504400 | 698.592820 | 125.862472 | 1424.658453 | 1580.434546 | 1572.787677 | 1480.946807 | 1934.163334 | 2361.207052 | 2378.648069 | 2938.454040 | 3259.633157 | 0.000000 | 0.000000 | 7276.879391 |
max | 13.955889 | 7.528286 | 7.265769 | 7.908754 | 21.075458 | 9.260761 | 10.825573 | 11.201316 | 12.648174 | 32.120757 | 12.498324 | 14.220524 | 34.183225 | 18.329226 | 19.855369 | 15.828537 | 19.874952 | 14.398077 | 16.301006 | 22.905463 | 22.759653 | 105.374497 | 74.296122 | 41.621639 | 79.390667 | 53.055090 | 40.106281 | 392.575913 | 519.506858 | 43.991729 | 377.011789 | 65.967019 | 60.658616 | 101.013242 | 310.056020 | 595.575047 | 135.339259 | 605.901906 | 618.353864 | 459.234046 | 725.922349 | 7732.446555 | 1803.073701 | 7228.146324 | 4702.380252 | 6335.579315 | 6603.765796 | 6931.930536 | 7014.062132 | 3366.463148 | 7286.619047 | 7235.540268 | 3809.431023 | 5115.088311 | 9596.165659 | 7397.980260 | 7226.635755 | 8320.575837 | 13536.386278 | 7304.766755 |
The following boxplots give a more visual picture of the runtime distributions. The algorithms are ordered by their mean overall runtime and the first and last 10 algorithms are shown by default. Use the legend on the right to display additional algorithms.
In the next figure, we show the algorithm mean runtime in relation to the achieved mean ROC_AUC score. We distinguish between the different learning types because the runtime of an algorithm depends on its learning procedure.
Attention
The following figure does not account for OOM or TIMEOUT errors. This is especially visible for Normalizing Flows (supervised), which ran into the time limit for most of the datasets but shows a relatively small runtime in the figure below! For algorithms with many errors (cf. Section Error analysis), the aggregated runtimes and metric scores are not meaningful.
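A simple way to mitigate this bias when recomputing the aggregates yourself is to restrict them to successful runs before grouping; a sketch on toy data, assuming the column names from the result table:

```python
import pandas as pd

# Toy stand-in: one algorithm with a timed-out run and one without.
# Errored or timed-out executions contribute neither a meaningful
# runtime nor a meaningful score, so we drop them before aggregating.
df = pd.DataFrame({
    "algorithm": ["NF", "NF", "ARIMA"],
    "status": ["Status.TIMEOUT", "Status.OK", "Status.OK"],
    "execute_main_time": [7200.0, 50.0, 70.0],
})

ok_only = df[df["status"] == "Status.OK"]
mean_ok = ok_only.groupby("algorithm")["execute_main_time"].mean()
print(mean_ok)
```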
Detailed analysis of certain algorithm or dataset aspects
Best algorithms for base oscillations
Sine
ECG
Random Walk
CBF
Poly
Best algorithms for anomaly type
Extremum
Frequency
Mean Shift
Pattern
Pattern Shift
Platform
Variance
Amplitude
Trend
Most fluctuating algorithms based on anomaly type
Best algorithms for single/multiple-same/multiple-different anomalies
Single anomaly datasets
Multiple same anomalies datasets
Multiple different anomaly datasets
Best algorithm of algorithm family
algorithm | ROC_AUC | |
---|---|---|
algo_family | ||
trees | PST | 0.803791 |
reconstruction | Donut | 0.894965 |
forecasting | LSTM-AD | 0.965738 |
encoding | GrammarViz | 0.894852 |
distribution | DWT-MLEAD | 0.907602 |
distance | Subsequence LOF | 0.941804 |
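The per-family winners above can be computed with a groupby and idxmax over the mean ROC_AUC scores; a sketch on a subset of the values (the family assignments of KNN and ARIMA are illustrative, not taken from the table):

```python
import pandas as pd

# Mean ROC_AUC per algorithm; scores taken from the quality table above.
# The "algo_family" labels for KNN and ARIMA are assumptions for the
# sake of the example.
means = pd.DataFrame({
    "algo_family": ["distance", "distance", "forecasting", "forecasting"],
    "algorithm": ["Subsequence LOF", "KNN", "LSTM-AD", "ARIMA"],
    "ROC_AUC": [0.941804, 0.614195, 0.965738, 0.816814],
})

# Pick the row with the highest mean ROC_AUC within each family.
best = means.loc[means.groupby("algo_family")["ROC_AUC"].idxmax()]
print(best)
```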