# Statistical, Subspace based, and Classifier Unsupervised Anomaly Detection Algorithms

By [0xAbaki](https://paragraph.com/@0xabaki) · 2024-02-22

---

4A) Histogram-Based Outlier Score (HBOS)
----------------------------------------

HBOS is a statistical algorithm:

1.  Create a histogram for each feature in the data.
    
2.  For each data point, multiply the inverse heights of the bins it falls into to get an estimate of its density, an independence assumption similar to the [Naive Bayes algorithm](https://www.geeksforgeeks.org/naive-bayes-classifiers/).
    

It is often used for fast semi-supervised anomaly detection.

The speed comes from ignoring the interdependencies between features. This assumption is particularly safe when many data points have one-off features that most others lack (a large, sparse feature space).

For example, HBOS can take one minute on a dataset where k-NN takes more than 23 hours.

Bins can be created with:

1.  Static bins with a fixed width.
    
2.  Dynamic bins, sized so that each bin holds about the same number of data points (but bin widths differ). This method is more robust when there are large outlying values.
    
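The scoring idea above can be sketched in a few lines of NumPy. This is a minimal, hypothetical implementation (the function name and details are illustrative); it sums log inverse bin heights, which is equivalent to multiplying the inverse heights:

```python
import numpy as np

def hbos_scores(X, n_bins=10):
    """Score each row of X; higher = more anomalous."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        counts, edges = np.histogram(col, bins=n_bins)
        # Normalize bin heights so the tallest bin has height 1
        heights = counts / counts.max()
        # Find which bin each value falls into (interior edges only)
        idx = np.digitize(col, edges[1:-1])
        # Sum log inverse heights (equivalent to multiplying inverse heights);
        # near-empty bins are floored to avoid log(0)
        scores += np.log(1.0 / np.maximum(heights[idx], 1e-12))
    return scores
```

Because each feature is handled independently, the whole computation is a single pass per feature, which is where the speed advantage over k-NN comes from.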

5A) Robust Principal Component Analysis (rPCA)
----------------------------------------------

This is a modification of [Principal Component Analysis](https://www.youtube.com/watch?v=FD4DeN81ODY&pp=ygUccHJpbmNpcGFsIGNvbXBvbmVudCBhbmFseXNpcw%3D%3D).

The robustness comes from computing the covariance matrix twice, an optimization similar to CMGOS from my second post.
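
As a reference point, plain (non-robust) PCA can already score anomalies by reconstruction error: points that are poorly represented by the top principal components get high scores. This is a minimal sketch with hypothetical names, not the paper's method; the robust variant would additionally re-estimate the covariance after down-weighting likely outliers:

```python
import numpy as np

def pca_anomaly_scores(X, k=1):
    """Score points by distance to the subspace of the top-k principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # covariance (computed once here;
                                               # rPCA would recompute it after trimming)
    vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    V = vecs[:, -k:]                           # top-k principal directions
    recon = Xc @ V @ V.T                       # project onto the subspace and back
    return np.linalg.norm(Xc - recon, axis=1)  # residual = anomaly score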

6A) One-Class Support Vector Machine (SVM)
------------------------------------------

One-class SVMs are often used for semi-supervised anomaly detection:

1.  Train on anomaly-free training data.
    
2.  The SVM then classifies new instances as normal vs. anomalous.
    

However, the algorithm is inherently unsupervised when using a soft margin.

Since SVMs are better explained visually, check out [this video](https://www.youtube.com/watch?v=_YPScrckx28&t=3s&pp=ygUWc3VwcG9ydCB2ZWN0b3IgbWFjaGluZQ%3D%3D) for a more intuitive understanding.

In the paper, _η_, a value that adjusts the normality of each instance, is added to optimize the algorithm.
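
The two-step semi-supervised recipe above can be sketched with scikit-learn's `OneClassSVM` (assuming scikit-learn is available; note it exposes the standard soft-margin parameter `nu`, not the _η_ variant from the paper, and the data here is synthetic):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Step 1: train on anomaly-free data
train = rng.normal(0.0, 1.0, size=(200, 2))
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1).fit(train)

# Step 2: the SVM later separates normal vs. anomalous points;
# nu > 0 gives the soft margin that also permits unsupervised use
test = np.vstack([rng.normal(0.0, 1.0, size=(5, 2)), [[6.0, 6.0]]])
labels = clf.predict(test)  # +1 = normal, -1 = anomaly
```

With these settings the far-away point at (6, 6) falls well outside the learned boundary and is labeled -1.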

Comparisons
-----------

Below are the comparative tables from [this research](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152173#pone.0152173.ref057) that we’ve been summarizing, for reference. Hope you enjoyed the read!

![Table 1](https://storage.googleapis.com/papyrus_images/7b15c3518c3027ee44b159432b26b1560b76ec791e4db817d6477b95bb2123c5.png)

Table 1. The 10 datasets used for comparative evaluation of the unsupervised anomaly detection algorithms from different application domains. A broad spectrum of size, dimensionality and anomaly percentage is covered. They also differ in difficulty and cover local and global anomaly detection tasks. https://doi.org/10.1371/journal.pone.0152173.t001

![Fig 8](https://storage.googleapis.com/papyrus_images/28be391e8d23e9fc2a8594bf62b52a21bb751942fb7748f9bcf0f321e0b7c0bc.png)

Fig 8. The AUC values for the nearest-neighbor based algorithms on the breast-cancer dataset. It can be seen that k values smaller than 10 tend to result in poor estimates, especially when considering local anomaly detection algorithms. Please note that the AUC axis is cut off at 0.5. https://doi.org/10.1371/journal.pone.0152173.g008

![Table 2](https://storage.googleapis.com/papyrus_images/e18e8c3862e6d73696ec8b642db8143998624fe96744b03cd08f430d51d55bd2.png)

Table 2. The results of the nearest-neighbor based algorithms showing the AUC and the standard deviation for 10 ≤ k ≤ 50 for all of the 10 datasets. Due to the computational complexity, LOCI could not be computed for larger datasets. https://doi.org/10.1371/journal.pone.0152173.t002

![Fig 9](https://storage.googleapis.com/papyrus_images/dae8e3cff18137f92a7bd17ace503fd265f3683940080572939d672121f3f172.png)

Fig 9. The AUC values for the large kdd99 dataset for 0 < k < 100. It can be easily seen that the performance of local anomaly detection algorithms is poor for this global anomaly detection challenge. https://doi.org/10.1371/journal.pone.0152173.g009

![Table 4](https://storage.googleapis.com/papyrus_images/ac100d52c8f2e30cb7fde42675c504b1a971aedfe5d4ec3b4831175f7eadf637.png)

Table 4. The AUC results of the remaining unsupervised anomaly detection algorithms. Four different strategies for keeping the components have been used for rPCA, while for HBOS the number of different bins was altered. https://doi.org/10.1371/journal.pone.0152173.t004

![Table 5](https://storage.googleapis.com/papyrus_images/5005e3c7fa572425520e703723d6b5643801b83159c4eba41acf4ede980d2d53.png)

Table 5. Comparing the computation times of the different algorithms shows huge differences, especially for the larger datasets. The unit of the table is seconds for the first nine columns and minutes for the last dataset (kdd99). https://doi.org/10.1371/journal.pone.0152173.t005

![Table 6](https://storage.googleapis.com/papyrus_images/566d354da7993c2ea72c748c4cbf85f14c4e1ca438ba5f0d3a71a4d1f702c145.png)

Table 6. Recommendations for algorithm selection. Qualitative judgments are given from very bad (− −) through average (o) to very good (++). https://doi.org/10.1371/journal.pone.0152173.t006

---

*Originally published on [0xAbaki](https://paragraph.com/@0xabaki/statistical-subspace-based-and-classifier-unsupervised-anomaly-detection-algorithms)*
