Statistical, Subspace based, and Classifier Unsupervised Anomaly Detection Algorithms

4A) Histogram Based Outlier Score (HBOS)

This is a statistical algorithm

create a histogram for each feature in the data
multiply the inverse height of the feature bin that each bin resides in to get a feeling of the density a similar idea to the Naive Bayes algorithm

often used for fast semi-supervised anomaly detection

the speed comes from ignoring the interdependencies of each feature with each other. This assumption is particularly safe when there’s a lot of one-off features that many other data points don’t have (a sparse and large feature space).

ex: HBOS can take 1 min where k-nn can take >23 hours

bins can be created with

static bin sizes w/ fixed bin width
dynamic bins so that the number of bins per feature is about the same (but different bin widths). This method is more robust when there are large outlying values.

5A) robust Principal Component Analysis (rPCA)

this is a modification of Principal Component Analysis

The robustness comes from the covariance matrix being computed twice, a similar optimization method to CMGOS from my second post.

6A) One-Class Support Vector Machine (SVM)

often used for semi-supervised learning

train on anomaly free training data
SVM later detects normal v anomaly

but is inherently an unsupervised algorithm when using a soft marigin.

Since SVMs are better explained visually, check out this video for a more intuitive understanding.

in the paper, η, a value that adjusts the normality of an instance is added to optimize the algorithm.

Comparisons

Below are the comparative tables of this research we’ve been summarizing for reference. Hope you enjoyed the read!

Table 1. The 10 datasets used for comparative evaluation of the unsupervised anomaly detection algorithms from different application domains.A broad spectrum of size, dimensionality and anomaly percentage is covered. They also differ in difficulty and cover local and global anomaly detection tasks.https://doi.org/10.1371/journal.pone.0152173.t001

Fig 8. The AUC values for the nearest-neighbor based algorithms on the breast-cancer dataset.It can be seen that k values smaller than 10 tend to result in poor estimates, especially when considering local anomaly detection algorithms. Please note that the AUC axis is cut off at 0.5.https://doi.org/10.1371/journal.pone.0152173.g008

Table 2. The results of the nearest-neighbor based algorithms showing the AUC and the standard deviation for 10 ≤ k ≤ 50 for all of the 10 datasets.Due to the computational complexity, LOCI could not be computed for larger datasets.https://doi.org/10.1371/journal.pone.0152173.t002

Fig 9. The AUC values for the large kdd99 dataset for 0 < k < 100.It can be easily seen that the performance of local anomaly detection algorithms is poor for this global anomaly detection challenge.https://doi.org/10.1371/journal.pone.0152173.g009

Table 4. The AUC results of the remaining unsupervised anomaly detection algorithms.Four different strategies for keeping the components have been used for rPCA, while for HBOS the number of different bins was altered.https://doi.org/10.1371/journal.pone.0152173.t004

Table 5. Comparing the computation time of the different algorithm show huge differences, especially for the larger datasets.The unit of the table is seconds for the first nine columns and minutes for the last dataset (kdd99).https://doi.org/10.1371/journal.pone.0152173.t005

Table 6. Recommendations for algorithm selection.Qualitatively judgments are given from very bad (− −) over average (o) to very good (++).https://doi.org/10.1371/journal.pone.0152173.t006