Statistical, Subspace based, and Classifier Unsupervised Anomaly Detection Algorithms
4A) Histogram Based Outlier Score (HBOS)
This is a statistical algorithm
create a histogram for each feature in the data
multiply the inverse height of the feature bin that each bin resides in to get a feeling of the density a similar idea to the Naive Bayes algorithm
often used for fast semi-supervised anomaly detection
the speed comes from ignoring the interdependencies of each feature with each other. This assumption is particularly safe when there’s a lot of one-off features that many other data points don’t have (a sparse and large feature space).
ex: HBOS can take 1 min where k-nn can take >23 hours
bins can be created with
static bin sizes w/ fixed bin width
dynamic bins so that the number of bins per feature is about the same (but different bin widths). This method is more robust when there are large outlying values.
The robustness comes from the covariance matrix being computed twice, a similar optimization method to CMGOS from my second post.
6A) One-Class Support Vector Machine (SVM)
often used for semi-supervised learning
train on anomaly free training data
SVM later detects normal v anomaly
but is inherently an unsupervised algorithm when using a soft marigin.
Since SVMs are better explained visually, check out this video for a more intuitive understanding.
in the paper, η, a value that adjusts the normality of an instance is added to optimize the algorithm.
Comparisons
Below are the comparative tables of this research we’ve been summarizing for reference. Hope you enjoyed the read!
Table 1. The 10 datasets used for comparative evaluation of the unsupervised anomaly detection algorithms from different application domains.A broad spectrum of size, dimensionality and anomaly percentage is covered. They also differ in difficulty and cover local and global anomaly detection tasks.https://doi.org/10.1371/journal.pone.0152173.t001Fig 8. The AUC values for the nearest-neighbor based algorithms on the breast-cancer dataset.It can be seen that k values smaller than 10 tend to result in poor estimates, especially when considering local anomaly detection algorithms. Please note that the AUC axis is cut off at 0.5.https://doi.org/10.1371/journal.pone.0152173.g008Table 2. The results of the nearest-neighbor based algorithms showing the AUC and the standard deviation for 10 ≤ k ≤ 50 for all of the 10 datasets.Due to the computational complexity, LOCI could not be computed for larger datasets.https://doi.org/10.1371/journal.pone.0152173.t002Fig 9. The AUC values for the large kdd99 dataset for 0 < k < 100.It can be easily seen that the performance of local anomaly detection algorithms is poor for this global anomaly detection challenge.https://doi.org/10.1371/journal.pone.0152173.g009Table 4. The AUC results of the remaining unsupervised anomaly detection algorithms.Four different strategies for keeping the components have been used for rPCA, while for HBOS the number of different bins was altered.https://doi.org/10.1371/journal.pone.0152173.t004Table 5. Comparing the computation time of the different algorithm show huge differences, especially for the larger datasets.The unit of the table is seconds for the first nine columns and minutes for the last dataset (kdd99).https://doi.org/10.1371/journal.pone.0152173.t005Table 6. Recommendations for algorithm selection.Qualitatively judgments are given from very bad (− −) over average (o) to very good (++).https://doi.org/10.1371/journal.pone.0152173.t006