Writings on Medium, music on Spotify, other projects on the Linktree in the URL. ML enthusiast, dancer, looking to do some web3. Born in Owariasahi, Aichi.
Part 2 of my fun, casual summary of unsupervised anomaly detection algorithms.
read part 1 and the intro here!
now we want to move away from the sensitive k-value by trying clustering-based approaches instead of nearest-neighbor-based ones.
in clustering, the general idea is to group our data points into clusters and see which groups (or points) are outliers.
again, we can focus on global or local outliers depending on the use case.
here the idea (this is CBLOF, the Cluster-Based Local Outlier Factor) is to cluster first, then calculate the density of points.
we use k-means clustering for its roughly O(n) runtime, as opposed to the O(n^2) nearest-neighbor search we saw before.
then we use a heuristic to classify clusters as small or large.
an anomaly score is given where
large cluster score = distance of each point to its cluster's center * number of points in the cluster
small cluster score = distance to the closest large cluster
multiplying by "number of points in the cluster" was supposed to account for local density and scale the value, but it turns out this does not work well.
so the authors propose an unweighted CBLOF (uCBLOF) where we say
large cluster score = distance of each point to its cluster's center
to take away the weight.

since k-means is very sensitive to the choice of k (and initialization), several k-values are tested many times and the most stable result is chosen.
unfortunately, since we took out the density weighting, uCBLOF is no longer a local anomaly detection method ;(
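the uCBLOF scoring above can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: the tiny k-means, the 10% small-cluster threshold, and all the names here are my own assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means (minimal, for illustration only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def ucblof_scores(X, k=3, small_frac=0.1):
    """uCBLOF sketch: points in large clusters score their distance to
    their own centroid; points in small clusters score their distance
    to the nearest *large* centroid."""
    centers, labels = kmeans(X, k)
    sizes = np.bincount(labels, minlength=k)
    large = sizes >= small_frac * len(X)   # heuristic small/large split
    scores = np.empty(len(X))
    for i in range(len(X)):
        if large[labels[i]]:
            scores[i] = np.linalg.norm(X[i] - centers[labels[i]])
        else:
            scores[i] = np.linalg.norm(centers[large] - X[i], axis=1).min()
    return scores
```

on two tight blobs plus one far-away point, the far point gets by far the largest score, whichever cluster it lands in.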
here we estimate the density assuming a spherical distribution per cluster, to bring locality back to uCBLOF (this is LDCOF, the Local Density Cluster-based Outlier Factor).
the procedure is very similar to uCBLOF:
k-means clustering to find clusters
separate small and large clusters
calculate the average distance to the centroid, per cluster
calculate LDCOF score = distance(instance to its cluster center) / average distance
here a score well above 1 indicates an anomaly, much like LOF again :)
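a rough sketch of the LDCOF score, assuming the cluster centers and labels come from a prior k-means run; the 10% small-cluster threshold and the helper names are my own illustrative choices:

```python
import numpy as np

def ldcof_scores(X, centers, labels, small_frac=0.1):
    """LDCOF sketch: distance to the (large-)cluster center divided by
    that cluster's average point-to-center distance; points in small
    clusters are measured against the nearest large cluster instead."""
    k = len(centers)
    sizes = np.bincount(labels, minlength=k)
    large = np.where(sizes >= small_frac * len(X))[0]
    # average distance to the centroid, per large cluster
    avg = {j: np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
           for j in large}
    scores = np.empty(len(X))
    for i in range(len(X)):
        c = labels[i]
        if c in avg:                         # member of a large cluster
            scores[i] = np.linalg.norm(X[i] - centers[c]) / avg[c]
        else:                                # small cluster: nearest large one
            d = np.linalg.norm(centers[large] - X[i], axis=1)
            j = large[np.argmin(d)]
            scores[i] = d.min() / avg[j]
    return scores
```

normal points in a blob come out near 1 (they sit at roughly the average distance from their centroid), while an isolated point scores orders of magnitude higher.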
we assume a bell-curve-like (multivariate Gaussian) underlying distribution of the data, and use the generalized distance formula (Mahalanobis distance) for the anomaly score (this is CMGOS, the Clustering-based Multivariate Gaussian Outlier Score).
the steps are as follows:
k-means cluster and separate into small and large clusters
calculate covariance matrix per cluster
CMGOS score = generalized (Mahalanobis) distance(instance to its nearest cluster center) / chi-squared quantile at a certain confidence level
this last step normalizes the multivariate distance by the cluster's expected spread, so that a score above 1 again flags an anomaly ;)
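the CMGOS score formula can be sketched like this, again assuming k-means assignments are given. I use the plain sample covariance here for simplicity, and `confidence=0.975` is an illustrative choice:

```python
import numpy as np
from scipy.stats import chi2

def cmgos_scores(X, centers, labels, confidence=0.975):
    """CMGOS sketch: squared Mahalanobis distance of each point to its
    cluster center, normalized by the chi-squared quantile at the given
    confidence (d degrees of freedom). Scores well above 1 flag anomalies."""
    d = X.shape[1]
    thresh = chi2.ppf(confidence, df=d)
    scores = np.empty(len(X))
    for j in range(len(centers)):
        mask = labels == j
        pts = X[mask]
        cov = np.cov(pts, rowvar=False)
        inv = np.linalg.pinv(cov)            # pseudo-inverse for stability
        diff = pts - centers[j]
        m2 = np.einsum('ij,jk,ik->i', diff, inv, diff)  # squared Mahalanobis
        scores[mask] = m2 / thresh
    return scores
```

for a Gaussian blob, roughly `confidence` of the points fall below 1, while a point far outside the fitted ellipsoid scores well above it.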

there are various ways to robustly compute the covariance matrix, each with its own algorithm name:
CMGOS-Red - reduction: remove outliers in a second pass, similar to Grubbs' test
CMGOS-Reg - regularization: shrink/weight the covariance matrix
CMGOS-MCD - a computationally heavy brute-force search for the Minimum Covariance Determinant
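as one concrete example, a two-pass reduction in the spirit of CMGOS-Red might look like this; the 5% drop fraction and the fixed center are my own simplifications:

```python
import numpy as np

def reduced_covariance(pts, center, drop_frac=0.05):
    """CMGOS-Red-style sketch: estimate the covariance once, drop the most
    Mahalanobis-distant points, then re-estimate on the remainder so that
    outliers inflate the covariance less."""
    diff = pts - center
    cov = np.cov(pts, rowvar=False)
    inv = np.linalg.pinv(cov)
    m2 = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared Mahalanobis
    keep = m2 <= np.quantile(m2, 1.0 - drop_frac)    # second pass drops outliers
    return np.cov(pts[keep], rowvar=False)
```

with an extreme outlier in the cluster, the re-estimated covariance is noticeably tighter than the single-pass one.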
next we will dive into the rest in part 3!
note that this algorithm is also counted among the subspace-based methods, as it uses distance within a subspace as a core concept.
