
Case studies

On this page, we present some preliminary results in the form of three short case studies:

  1. No anomalies
  2. High contamination
  3. Parameter sensitivity

No anomalies

In this case study, we want to see how the algorithms behave on datasets that do not contain any known anomalies. For this, we included the following datasets (all without known/labelled anomalies) as examples in our experiments:

| # | collection | dataset | input dimensionality | learning type |
|---|---|---|---|---|
| 232 | KDD-TSAD | 079_UCR_Anomaly_DISTORTEDresperation2 | univariate | semi-supervised |
| 261 | KDD-TSAD | 108_UCR_Anomaly_NOISEresperation2 | univariate | semi-supervised |
| 340 | KDD-TSAD | 187_UCR_Anomaly_resperation2 | univariate | semi-supervised |
| 498 | NAB | art_daily_no_noise | univariate | unsupervised |
| 500 | NAB | art_daily_perfect_square_wave | univariate | unsupervised |
| 501 | NAB | art_daily_small_noise | univariate | unsupervised |
| 502 | NAB | art_flatline | univariate | unsupervised |
| 505 | NAB | art_noisy | univariate | unsupervised |
| 513 | NAB | ec2_cpu_utilization_c6585a | univariate | unsupervised |

Note that this list does not contain any supervised datasets without known anomalies. We therefore cannot assess supervised algorithms and exclude them from the further analysis. In addition, all of the above datasets are univariate. Since multivariate algorithms are also executed on univariate datasets, this does not reduce the number of considered algorithms any further.

The following table lists the number of successfully processed, erroneous, or timed-out datasets for each of the remaining 57 algorithms:

| training type | algorithm | # ERROR | # OK | # TIMEOUT |
|---|---|---|---|---|
| semi-supervised | Bagel | 0 | 0 | 3 |
| | DeepAnT | 3 | 0 | 0 |
| | EncDec-AD | 3 | 0 | 0 |
| | HealthESN | 0 | 0 | 3 |
| | Hybrid KNN | 3 | 0 | 0 |
| | ImageEmbeddingCAE | 3 | 0 | 0 |
| | LSTM-AD | 1 | 0 | 2 |
| | SR-CNN | 0 | 0 | 3 |
| | TARZAN | 3 | 0 | 0 |
| | TAnoGan | 0 | 0 | 3 |
| | OmniAnomaly | 2 | 1 | 0 |
| | Random Forest Regressor (RR) | 0 | 1 | 2 |
| | Donut | 0 | 2 | 1 |
| | LaserDBN | 0 | 3 | 0 |
| | OceanWNN | 0 | 3 | 0 |
| | Random Black Forest (RR) | 0 | 3 | 0 |
| | RobustPCA | 0 | 3 | 0 |
| | Telemanom | 0 | 3 | 0 |
| | XGBoosting (RR) | 0 | 3 | 0 |
| unsupervised | HOT SAX | 9 | 0 | 0 |
| | Left STAMPi | 9 | 0 | 0 |
| | NormA | 9 | 0 | 0 |
| | SAND | 9 | 0 | 0 |
| | k-Means | 9 | 0 | 0 |
| | DBStream | 8 | 1 | 0 |
| | VALMOD | 6 | 2 | 1 |
| | S-H-ESD (Twitter) | 6 | 3 | 0 |
| | Triple ES (Holt-Winter's) | 3 | 5 | 1 |
| | COF | 3 | 6 | 0 |
| | PST | 3 | 6 | 0 |
| | PhaseSpace-SVM | 0 | 6 | 3 |
| | Series2Graph | 3 | 6 | 0 |
| | ARIMA | 0 | 7 | 2 |
| | CBLOF | 2 | 7 | 0 |
| | STAMP | 0 | 7 | 2 |
| | Subsequence LOF | 0 | 7 | 2 |
| | IF-LOF | 1 | 8 | 0 |
| | NumentaHTM | 1 | 8 | 0 |
| | Torsk | 0 | 8 | 1 |
| | COPOD | 0 | 9 | 0 |
| | DSPOT | 0 | 9 | 0 |
| | DWT-MLEAD | 0 | 9 | 0 |
| | Extended Isolation Forest (EIF) | 0 | 9 | 0 |
| | FFT | 0 | 9 | 0 |
| | GrammarViz | 0 | 9 | 0 |
| | HBOS | 0 | 9 | 0 |
| | Isolation Forest (iForest) | 0 | 9 | 0 |
| | KNN | 0 | 9 | 0 |
| | LOF | 0 | 9 | 0 |
| | MedianMethod | 0 | 9 | 0 |
| | PCC | 0 | 9 | 0 |
| | PCI | 0 | 9 | 0 |
| | SSA | 0 | 9 | 0 |
| | STOMP | 0 | 9 | 0 |
| | Spectral Residual (SR) | 0 | 9 | 0 |
| | Subsequence IF | 0 | 9 | 0 |
| | TSBitmap | 0 | 9 | 0 |

As the table shows, the algorithms Bagel, DeepAnT, EncDec-AD, HealthESN, Hybrid KNN, ImageEmbeddingCAE, LSTM-AD, SR-CNN, TARZAN, TAnoGan, HOT SAX, Left STAMPi, NormA, SAND, and k-Means fail to process any of the datasets. Our preliminary finding is that these algorithms make assumptions about the dataset that are not met when no anomalies are present. In contrast, the algorithms LaserDBN, OceanWNN, Random Black Forest (RR), RobustPCA, Telemanom, XGBoosting (RR), COPOD, DSPOT, DWT-MLEAD, Extended Isolation Forest (EIF), FFT, GrammarViz, HBOS, Isolation Forest (iForest), KNN, LOF, MedianMethod, PCC, PCI, SSA, STOMP, Spectral Residual (SR), Subsequence IF, and TSBitmap process all datasets successfully. All other algorithms can process some of the datasets but fail on others. The reasons for those failures are a subject of further analysis.

Our evaluation metrics are not defined for time series with all-zero labels (not a single annotated anomaly), which means that we cannot present those metrics here. Instead, we show an example dataset and the scorings of selected algorithms. You can use the legend on the right side to enable/disable the display of individual algorithm scorings.
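This behavior is easy to reproduce with off-the-shelf metric implementations. The following minimal sketch uses scikit-learn's roc_auc_score for illustration (scikit-learn is an assumption here, not necessarily the implementation behind our metrics):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.zeros(1000, dtype=int)               # not a single annotated anomaly
scores = np.random.default_rng(42).random(1000)  # arbitrary anomaly scores

try:
    roc_auc_score(labels, scores)
except ValueError as err:
    # "Only one class present in y_true. ROC AUC score is not defined in that case."
    print(err)
```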

High contamination

In this case study, we want to see how the algorithms behave on datasets with a high contamination. The contamination of a time series of length \(n\) is the ratio of anomalous points to all points (\(\text{contamination} = \frac{\text{# anomalous points}}{n}\)); the sketch after the table below shows how to compute it from the ground-truth labels. Most algorithms assume that anomalies are rare and, thus, expect a low contamination. Datasets with a high contamination contain many or very large anomalous regions and pose a particular challenge for those algorithms. The following datasets have a \(\text{contamination} > 0.1\):

| # | collection | dataset | input dimensionality | learning type | contamination | length | # dimensions |
|---|---|---|---|---|---|---|---|
| 50 | Exathlon | 4_1_100000_61-29 | multivariate | semi-supervised | 0.124655 | 129197 | 19 |
| 51 | Exathlon | 4_1_100000_61-30 | multivariate | semi-supervised | 0.124655 | 129197 | 19 |
| 59 | Exathlon | 5_1_100000_63-64 | multivariate | supervised | 0.126248 | 43066 | 31 |
| 483 | NAB | TravelTime_451 | univariate | unsupervised | 0.100370 | 2162 | 1 |
| 486 | NAB | Twitter_volume_CRM | univariate | unsupervised | 0.100176 | 15902 | 1 |
| 490 | NAB | Twitter_volume_IBM | univariate | unsupervised | 0.100044 | 15893 | 1 |
| 491 | NAB | Twitter_volume_KO | univariate | unsupervised | 0.100120 | 15851 | 1 |
| 492 | NAB | Twitter_volume_PFE | univariate | unsupervised | 0.100139 | 15858 | 1 |
| 514 | NAB | ec2_cpu_utilization_fe7f93 | univariate | unsupervised | 0.100446 | 4032 | 1 |
| 516 | NAB | ec2_disk_write_bytes_c0d644 | univariate | unsupervised | 0.100446 | 4032 | 1 |
| 518 | NAB | ec2_network_in_5abac7 | univariate | unsupervised | 0.100211 | 4730 | 1 |
| 521 | NAB | exchange-2_cpc_results | univariate | unsupervised | 0.100369 | 1624 | 1 |
| 525 | NAB | exchange-4_cpc_results | univariate | unsupervised | 0.100426 | 1643 | 1 |
| 527 | NAB | grok_asg_anomaly | univariate | unsupervised | 0.100628 | 4621 | 1 |
| 528 | NAB | iio_us-east-1_i-a2eb1cd9_NetworkIn | univariate | unsupervised | 0.101368 | 1243 | 1 |
| 530 | NAB | nyc_taxi | univariate | unsupervised | 0.100291 | 10320 | 1 |
| 531 | NAB | occupancy_6005 | univariate | unsupervised | 0.100420 | 2380 | 1 |
| 535 | NAB | rogue_agent_key_hold | univariate | unsupervised | 0.100956 | 1882 | 1 |
| 538 | NAB | speed_7578 | univariate | unsupervised | 0.102928 | 1127 | 1 |
| 539 | NAB | speed_t4013 | univariate | unsupervised | 0.100200 | 2495 | 1 |

Please note that the Exathlon datasets are quite large and might not be processable by all algorithms within our time and memory limits.
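For reference, the contamination values in the table can be reproduced directly from the binary ground-truth labels. The following minimal sketch illustrates the definition; the label array is hypothetical but chosen to match TravelTime_451 from the table above:

```python
import numpy as np

def contamination(labels: np.ndarray) -> float:
    """Ratio of anomalous points (non-zero labels) to all points."""
    return float(np.count_nonzero(labels)) / len(labels)

# Hypothetical label array: length 2162 with 217 anomalous points, which
# reproduces the contamination of the NAB dataset TravelTime_451 above.
labels = np.zeros(2162, dtype=int)
labels[500:717] = 1
print(f"{contamination(labels):.6f}")  # 0.100370
```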

Failures and timeouts

The next table lists all algorithms that could not process at least half of their matching datasets (matching with respect to learning type and input dimensionality):

| input dimensionality | training type | algorithm | # ERROR | # OK | # TIMEOUT | # ALL |
|---|---|---|---|---|---|---|
| univariate | unsupervised | S-H-ESD (Twitter) | 17 | 0 | 0 | 17 |
| | | SAND | 17 | 0 | 0 | 17 |
| | | NormA | 0 | 3 | 14 | 17 |
| multivariate | unsupervised | DBStream | 17 | 3 | 0 | 20 |
| | supervised | MultiHMM | 1 | 0 | 0 | 1 |
| | | Normalizing Flows | 1 | 0 | 0 | 1 |
| | semi-supervised | DeepAnT | 2 | 0 | 0 | 2 |
| | | Hybrid KNN | 2 | 0 | 0 | 2 |
| | | LSTM-AD | 2 | 0 | 0 | 2 |
| | | Random Black Forest (RR) | 2 | 0 | 0 | 2 |
| | | Telemanom | 2 | 0 | 0 | 2 |
| | | EncDec-AD | 0 | 0 | 2 | 2 |
| | | HealthESN | 0 | 0 | 2 | 2 |
| | | LaserDBN | 0 | 0 | 2 | 2 |
| | | TAnoGan | 0 | 0 | 2 | 2 |

The algorithms NormA, EncDec-AD, HealthESN, LaserDBN, and TAnoGan cannot process the selected datasets because they hit the time limit. This is no surprise for the semi-supervised approaches because only the large Exathlon datasets are available to them. However, the timeouts of the NormA algorithm are surprising, especially because its runtime on similarly large datasets is usually in the range of seconds. We assume that its runtime depends on the structure of the dataset and increases when the dataset contains multiple frequent patterns.

The algorithms S-H-ESD (Twitter), SAND, DBStream, MultiHMM, Normalizing Flows, DeepAnT, Hybrid KNN, LSTM-AD, Random Black Forest (RR), and Telemanom encounter bugs when processing high contamination datasets. The reasons for those failures are a subject of further analysis.

We exclude all of the above algorithms from our further analysis.

Algorithm performance

In the next table, we show the performance of the remaining algorithms on the selected high contamination datasets according to our evaluation metrics:

| algorithm | ROC_AUC | PR_AUC | RANGE_PR_AUC | AVERAGE_PRECISION |
|---|---|---|---|---|
| Hybrid Isolation Forest (HIF) | 0.999918 | 0.999469 | 0.971839 | 0.999469 |
| OmniAnomaly | 0.957648 | 0.699779 | 0.628670 | 0.699828 |
| RobustPCA | 0.773257 | 0.268411 | 0.771254 | 0.268467 |
| DWT-MLEAD | 0.747859 | 0.451350 | 0.406806 | 0.449522 |
| k-Means | 0.746775 | 0.419962 | 0.432895 | 0.420930 |
| GrammarViz | 0.702291 | 0.355281 | 0.394412 | 0.363837 |
| Subsequence IF | 0.630121 | 0.238262 | 0.229408 | 0.239008 |
| ARIMA | 0.621403 | 0.231054 | 0.215995 | 0.254633 |
| Isolation Forest (iForest) | 0.618316 | 0.276577 | 0.244185 | 0.275727 |
| Torsk | 0.615219 | 0.275777 | 0.256936 | 0.255155 |
| Subsequence LOF | 0.614126 | 0.246476 | 0.237810 | 0.247044 |
| COPOD | 0.610549 | 0.280198 | 0.247957 | 0.279020 |
| SSA | 0.604314 | 0.347935 | 0.320848 | 0.181269 |
| Extended Isolation Forest (EIF) | 0.593870 | 0.200386 | 0.233842 | 0.199645 |
| HBOS | 0.592026 | 0.251733 | 0.212789 | 0.223852 |
| PCC | 0.578449 | 0.260759 | 0.242622 | 0.255479 |
| Series2Graph | 0.570307 | 0.215902 | 0.199628 | 0.216754 |
| PCI | 0.568946 | 0.160600 | 0.128982 | 0.161539 |
| FFT | 0.551464 | 0.398682 | 0.411709 | 0.162788 |
| MedianMethod | 0.546102 | 0.129978 | 0.115209 | 0.130054 |
| KNN | 0.544524 | 0.191582 | 0.204380 | 0.167162 |
| Left STAMPi | 0.543362 | 0.201372 | 0.231741 | 0.202124 |
| CBLOF | 0.541039 | 0.182654 | 0.205906 | 0.169130 |
| PhaseSpace-SVM | 0.538457 | 0.205488 | 0.211348 | 0.131119 |
| IF-LOF | 0.534835 | 0.157076 | 0.158743 | 0.133459 |
| NumentaHTM | 0.531800 | 0.130608 | 0.134919 | 0.121894 |
| LOF | 0.525029 | 0.150761 | 0.169535 | 0.123988 |
| COF | 0.517095 | 0.126022 | 0.109826 | 0.121233 |
| TSBitmap | 0.511606 | 0.141991 | 0.184770 | 0.149793 |
| DSPOT | 0.504938 | 0.269787 | 0.250136 | 0.105735 |
| VALMOD | 0.501475 | 0.168835 | 0.179224 | 0.158639 |
| Spectral Residual (SR) | 0.497764 | 0.106476 | 0.104096 | 0.106886 |
| Triple ES (Holt-Winter's) | 0.488773 | 0.131470 | 0.126784 | 0.138829 |
| HOT SAX | 0.484899 | 0.172027 | 0.176918 | 0.180242 |
| STAMP | 0.423169 | 0.138710 | 0.207461 | 0.139272 |
| PST | 0.420949 | 0.143024 | 0.164422 | 0.116969 |
| STOMP | 0.410927 | 0.144260 | 0.215710 | 0.132876 |

The Hybrid Isolation Forest and OmniAnomaly methods seem to cope very well with the high number of anomalies, while most of the other methods struggle to produce good results. To contrast the algorithms' performance on the high contamination datasets with their overall performance, we plot the change in their metric scores (\( \text{mean}(\text{metric on high contamination}) - \text{mean}(\text{metric on all}) \)) in the following figure. You can change the displayed metric by clicking on the metric name in the legend.

In the above figure, we can see that only HIF, OmniAnomaly, RobustPCA, k-Means, FFT, and SSA achieve a higher ROC_AUC score on the high contamination datasets than on all datasets. All other methods perform worse on the high contamination datasets. The discord-based methods STAMP, VALMOD, and STOMP, which are very good methods overall, struggle the most with the high contamination datasets: their performance degrades to random guessing (ROC_AUC \( \simeq 0.5 \); cf. the previous table).
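For reference, the plotted difference can be computed with a few lines of pandas. The file name and column layout in this sketch are assumptions for illustration, not our actual result format:

```python
import pandas as pd

# Hypothetical layout: one row per (algorithm, dataset) pair with one column
# per metric and a boolean flag marking the high contamination datasets.
results = pd.read_csv("results.csv")

mean_all = results.groupby("algorithm")["ROC_AUC"].mean()
mean_high = (
    results[results["high_contamination"]]
    .groupby("algorithm")["ROC_AUC"]
    .mean()
)
delta = (mean_high - mean_all).sort_values()  # negative: worse on high contamination
print(delta)
```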

We must note here that the high contamination datasets of the NAB collection seem to be labeled poorly. The following example shows one of the time series. We cannot find a reason why the third anomalous region should be labeled anomalous. If this region actually is anomalous, then either there is no indication of it in the time series or more temporal context is needed to find this anomaly.

The first three algorithms, Hybrid Isolation Forest (HIF), OmniAnomaly, and RobustPCA, were not executed on the NAB datasets because they require training data.

Parameter sensitivity

In this case study, we take a detailed look at the parameter sensitivity of the SSA algorithm. SSA is very susceptible to small changes of its window size parameter. We demonstrate this sensitivity with a small experiment on a single dataset, varying only the window size parameter.

Experiment setup: We use one of our synthetically generated GutenTAG datasets, called “ecg-diff-count-5”, which has a dominant frequency of 20 (the average peak-to-peak distance of the ECG signal). We vary the window_size parameter of SSA and fix all other parameters. We start by setting window_size to \(2 \cdot \text{period} = 40\) and then look one step to the left (\(39\)) and to the right (\(41\)).

Parameters: The following snippet shows the parameter values used for the experiment:

```json
{
  "rf_method": "alpha",
  "alpha": 0.2,
  "e": 3,
  "random_state": 42,
  "window_size": [39, 40, 41]
}
```

Dataset: The following plot shows the “ecg-diff-count-5” dataset with the ground truth labels:

Results

Window size 39: Setting the window_size parameter to \(39\) leads SSA to produce very high anomaly scores in general. In the anomalous regions, however, the scores drop below their previous level. This means that SSA does not correctly detect those regions as anomalous!

SSA scoring with window size 39

Window size 40: Setting the window_size parameter to a multiple of the dataset's period size allows SSA to detect all anomalies very well. The anomaly score is relatively small for the normal parts of the time series and shows noticeable peaks in the anomalous regions. The anomalous regions can clearly be separated from the normal behavior.

Note that SSA primarily detects the beginning and the end of an anomalous region (i.e., the short-term pattern change).

SSA scoring with window size 40

Window size 41: Increasing the window_size parameter by \(1\) (the smallest possible increment) to \(41\) renders SSA’s results useless again. The picture is very similar to the window size 39 case.

SSA scoring with window size 41

We can further show that only whole multiples of the true period size lead to good results. The following two figures show the SSA anomaly scores when setting the window size to \( 1 \cdot \text{period} = 20 \) and \( 1.5 \cdot \text{period} = 30 \):

Window size 20 (\( 1 \cdot \text{period} = 20 \)): Good anomaly scores with a clear separation between anomalies and normal behavior.

SSA scoring with window size 20

Window size 30 (\( 1.5 \cdot \text{period} = 30 \)): Useless anomaly scores.

SSA scoring with window size 30

For this example dataset, we know the correct period size because we supplied it as an input to GutenTAG's generation process for the synthetic dataset. Other ways to retrieve the period size are careful manual inspection or estimation via statistical methods, both of which are error-prone and costly. Setting the window_size parameter of SSA is thus no easy task, and a higher robustness to this parameter value would be desirable.
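To illustrate one such statistical method, the following rough sketch estimates the dominant period from the Fourier spectrum. It is an illustration only, not the procedure used by GutenTAG or our experiments, and it fails for signals without a clear dominant frequency:

```python
import numpy as np

def estimate_period(series: np.ndarray) -> int:
    """Estimate the dominant period as the inverse of the strongest frequency."""
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.fft.rfftfreq(len(series))
    return round(1.0 / freqs[np.argmax(spectrum)])

# Synthetic sanity check: a sine wave with period 20 (the dominant frequency
# of "ecg-diff-count-5" as stated above).
t = np.arange(2000)
print(estimate_period(np.sin(2 * np.pi * t / 20)))  # 20
```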