# Clustering Based Unsupervised Anomaly Detection Algorithms

By [0xAbaki](https://paragraph.com/@0xabaki) · 2024-02-17

---

![algorithm tree](https://storage.googleapis.com/papyrus_images/574a38dd287cdd2b11ff4b499b1cbb35d1281bf74e9a0cdbd4cd5a8a53eb612a.png)

algorithm tree

Part 2 of my fun, casual summary of unsupervised anomaly detection algorithms.

read part 1 and the [intro here!](https://www.linkedin.com/feed/update/urn:li:activity:7162818738071752704?utm_source=share&utm_medium=member_desktop)

now we want to move away from the sensitive k-nearest-neighbor parameter by trying clustering based approaches instead of nearest-neighbor based ones.

in clustering, the general idea is to group our data points into clusters and then see which points (or whole clusters) are outliers.

again, we can focus on either global or local outliers depending on the use case

2A) Cluster Based Local Outlier Factor (CBLOF)
----------------------------------------------

here the idea is to cluster first, then calculate the density of points.

we use k-means clustering for its O(n) runtime (per iteration) as opposed to the O(n²) nearest-neighbor search we saw before.

then we use a heuristic to split the clusters into small and large ones

an anomaly score is given where

large cluster score = distance of each point to its cluster's center \* number of points in the cluster

small cluster score = distance to the center of the closest large cluster

doing “\* number of points in the cluster“ was supposed to account for the local density and scale the value, but in practice it does not work well.

So the authors propose an unweighted CBLOF (uCBLOF) where we say

large cluster score = distance of each point to its cluster's center

i.e. we just take away the weight
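the scoring logic can be sketched in a few lines of numpy. this is my own minimal sketch, not the paper's exact procedure: the cluster labels and centers are assumed to already come from a k-means run, and the small/large split heuristic is reduced to a simple size cutoff (`large_min_size` is an illustrative parameter of mine):

```python
import numpy as np

def ucblof_scores(X, labels, centers, large_min_size=5):
    """uCBLOF sketch: points in large clusters score their distance to
    their own center; points in small clusters score their distance to
    the nearest *large* cluster's center."""
    sizes = np.bincount(labels, minlength=len(centers))
    large = [j for j in range(len(centers)) if sizes[j] >= large_min_size]
    scores = np.empty(len(X))
    for i, (x, c) in enumerate(zip(X, labels)):
        if c in large:
            scores[i] = np.linalg.norm(x - centers[c])
        else:
            scores[i] = min(np.linalg.norm(x - centers[j]) for j in large)
    return scores

# toy data: a dense blob near the origin plus one faraway singleton cluster
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.0],
              [5.0, 5.0]])
labels = np.array([0, 0, 0, 0, 0, 1])
centers = np.array([[0.02, 0.04], [5.0, 5.0]])
scores = ucblof_scores(X, labels, centers)
print(scores.round(2))  # the lone faraway point gets by far the largest score
```

multiplying by `sizes[c]` in the large-cluster branch would give the original weighted CBLOF score.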

![Fig 7. uCBLOF results visualization](https://storage.googleapis.com/papyrus_images/98059b088a7d7650ef5d6c293033e524f167934dda10094650b8185469a327ea.png)

Fig 7. A visualization of the results for the uCBLOF algorithm. The anomaly score is represented by the bubble size, whereas the color corresponds to the clustering result of the preceding k-means clustering algorithm. Local anomalies are obviously not detected using uCBLOF. https://doi.org/10.1371/journal.pone.0152173.g007

since k-means is very sensitive to the choice of k (and to initialization), many k-values are tested and the most stable result is chosen.

unfortunately, since we took out the density weighting, uCBLOF is no longer a local anomaly detection method ;(

2B) Local Density Cluster-based Outlier Factor (LDCOF)
------------------------------------------------------

Here we estimate the density of each cluster, assuming a spherical distribution, to bring the locality back to uCBLOF.

the procedure is very similar to uCBLOF:

*   k-means clustering to find clusters
    
*   separate small and large clusters
    
*   calculate the average distance to the centroid for each cluster
    
*   calculate LDCOF score = distance(instance to its cluster's center) / that cluster's average distance
    

here a score around 1 is normal and a score well above 1 is an anomaly, much like LOF again :)
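the steps above can be sketched like this (again a rough sketch of mine, assuming labels/centers from k-means and the same naive size cutoff for "large"; `large_min_size` is illustrative):

```python
import numpy as np

def ldcof_scores(X, labels, centers, large_min_size=5):
    """LDCOF sketch: divide each point's distance to its nearest large
    cluster's center by that cluster's average point-to-center distance,
    so the score is relative to the cluster's local density."""
    sizes = np.bincount(labels, minlength=len(centers))
    large = [j for j in range(len(centers)) if sizes[j] >= large_min_size]
    # average distance to the centroid, per large cluster
    avg = {j: np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
           for j in large}
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        # score every point against its nearest large cluster
        j = min(large, key=lambda k: np.linalg.norm(x - centers[k]))
        scores[i] = np.linalg.norm(x - centers[j]) / avg[j]
    return scores

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.0],
              [5.0, 5.0]])
labels = np.array([0, 0, 0, 0, 0, 1])
centers = np.array([[0.02, 0.04], [5.0, 5.0]])
scores = ldcof_scores(X, labels, centers)
print(scores.round(1))  # blob points score around 1, the outlier far above 1
```

dividing by the cluster's average distance is exactly what makes the score "local": a point 0.5 away from a tight cluster can score higher than a point 2.0 away from a spread-out one.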

2C) Clustering-based Multivariate Gaussian Outlier Score (CMGOS)
----------------------------------------------------------------

we assume a bell-curve-like (multivariate Gaussian) distribution underlying the data in each cluster, and use the generalized distance formula ([Mahalanobis distance](https://www.wallstreetmojo.com/mahalanobis-distance/)) for the anomaly score.

the steps are as follows:

*   k-means cluster and separate into small and large clusters
    
*   calculate covariance matrix per cluster
    
*   CMGOS score = generalized Euclidean (Mahalanobis) distance(instance to its nearest cluster) / a [chi-squared distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution) quantile at a certain confidence level
    
    *   this last step normalizes the multivariate distance by the dimensionality so that, once again, a score above 1 flags an anomaly ;)
        

![Mahalanobis equation (a generalized Euclidean distance)](https://storage.googleapis.com/papyrus_images/c2b6a4c0fbbe7e9abfb31af768229a25188278d9d2fe8bc36198469b2f07e6a3.png)

Mahalanobis equation (a generalized Euclidean distance, i.e. a covariance-aware Pythagorean theorem)
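the scoring step can be sketched with numpy and scipy. this is a simplified sketch of mine: labels/centers are assumed to come from k-means, the robust covariance estimation variants below are skipped, and small clusters are simply scored against their nearest large cluster (`large_min_size` and `confidence` are illustrative parameters):

```python
import numpy as np
from scipy.stats import chi2

def cmgos_scores(X, labels, centers, large_min_size=5, confidence=0.975):
    """CMGOS sketch: squared Mahalanobis distance to the nearest large
    cluster's center, normalized by a chi-squared quantile so that
    scores above ~1 suggest an anomaly."""
    d = X.shape[1]
    norm = chi2.ppf(confidence, df=d)  # expected spread for d dimensions
    sizes = np.bincount(labels, minlength=len(centers))
    large = [j for j in range(len(centers)) if sizes[j] >= large_min_size]
    # per-cluster inverse covariance (a small ridge keeps it invertible)
    inv_cov = {j: np.linalg.inv(np.cov(X[labels == j], rowvar=False)
                                + 1e-6 * np.eye(d))
               for j in large}
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        j = min(large, key=lambda k: np.linalg.norm(x - centers[k]))
        diff = x - centers[j]
        # (x - mu)^T Sigma^{-1} (x - mu), then normalize by the quantile
        scores[i] = diff @ inv_cov[j] @ diff / norm
    return scores

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [-0.1, 0.0],
              [5.0, 5.0]])
labels = np.array([0, 0, 0, 0, 0, 1])
centers = np.array([[0.02, 0.04], [5.0, 5.0]])
scores = cmgos_scores(X, labels, centers)
print(scores.round(1))  # only the faraway point lands far above 1
```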

there are various ways to robustly compute the covariance matrix, each with its own algorithm name:

CMGOS-Red - Reduction to remove outliers (in a second pass), similar to [Grubbs's test](https://en.wikipedia.org/wiki/Grubbs%27s_test)

CMGOS-Reg - Regularization by weighting the covariance matrix

CMGOS-MCD - a computationally expensive brute-force method to find the Minimum Covariance Determinant
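the "Red" idea, for instance, can be sketched as a simple two-pass estimate (a sketch of mine; the `drop_frac` cutoff is an illustrative stand-in for the actual outlier test):

```python
import numpy as np

def reduced_covariance(pts, drop_frac=0.1):
    """Two-pass reduction: estimate the covariance once, discard the
    points with the largest Mahalanobis distances, then re-estimate
    on the remaining 'clean' points."""
    d = pts.shape[1]
    cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(d)
    inv = np.linalg.inv(cov)
    diff = pts - pts.mean(axis=0)
    m2 = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared Mahalanobis
    keep = m2 <= np.quantile(m2, 1.0 - drop_frac)    # drop the top fraction
    return np.cov(pts[keep], rowvar=False)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(size=(50, 2)), [[50.0, 50.0]]])  # one gross outlier
naive = np.cov(pts, rowvar=False)
robust = reduced_covariance(pts)
print(np.trace(robust) < np.trace(naive))  # the outlier no longer inflates it
```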

next we will dive into the rest for part 3!

note that this algorithm family is also considered subspace based, since it uses distance within a (sub)space as a core concept

---

*Originally published on [0xAbaki](https://paragraph.com/@0xabaki/clustering-based-unsupervised-anomaly-detection-algorithms)*
