Writings on Medium, music on Spotify, other projects on the Linktree in the URL. ML enthusiast, dancer, looking to do some web3. Born in Owariasahi, Aichi.
Part 2 of my fun, casual summary of unsupervised anomaly detection algorithms.
read part 1 and the intro here!
now we want to move away from the sensitive k-value by trying clustering-based approaches instead of nearest-neighbor-based ones.
in clustering, the general idea is to group our data points into clusters and see which groups (or points) are outliers.
again, we can focus on global or local outliers depending on the use case.
here the idea (this is CBLOF, the Cluster-Based Local Outlier Factor) is to cluster first, then calculate the density of points.
we use k-means clustering for its roughly O(n) runtime, as opposed to the O(n^2) nearest-neighbor search we saw before.
then we use a heuristic to classify clusters as small or large.
an anomaly score is given where
large cluster score = distance of each point to its cluster's center * number of points in the cluster
small cluster score = distance to the closest large cluster
multiplying by "number of points in the cluster" was supposed to account for local density and scale the value, but it turns out this does not work well.
so the authors propose an unweighted CBLOF (uCBLOF) where we say
large cluster score = distance of each point to its cluster's center
to take away the weight.

since k-means is very sensitive to the choice of k (and initialization), several k-values are tested many times and the most stable result is chosen.
unfortunately, since we took out the density weighting, uCBLOF is no longer a local anomaly detection method ;(
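the uCBLOF scoring above can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: the tiny k-means, the 10% small-cluster threshold, and all the names here are my own assumptions.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means (minimal, for illustration only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def ucblof_scores(X, k=3, small_frac=0.1):
    """uCBLOF sketch: points in large clusters score their distance to
    their own centroid; points in small clusters score their distance
    to the nearest *large* centroid."""
    centers, labels = kmeans(X, k)
    sizes = np.bincount(labels, minlength=k)
    large = sizes >= small_frac * len(X)   # heuristic small/large split
    scores = np.empty(len(X))
    for i in range(len(X)):
        if large[labels[i]]:
            scores[i] = np.linalg.norm(X[i] - centers[labels[i]])
        else:
            scores[i] = np.linalg.norm(centers[large] - X[i], axis=1).min()
    return scores
```

on two tight blobs plus one far-away point, the far point gets by far the largest score, whichever cluster it lands in.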
here we estimate the density assuming a spherical distribution per cluster, to bring locality back to uCBLOF (this is LDCOF, the Local Density Cluster-based Outlier Factor).
the procedure is very similar to uCBLOF:
k-means clustering to find clusters
separate small and large clusters
calculate the average distance to the centroid, per cluster
calculate LDCOF score = distance(instance to its cluster center) / average distance
here a score well above 1 indicates an anomaly, much like LOF again :)
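a rough sketch of the LDCOF score, assuming the cluster centers and labels come from a prior k-means run; the 10% small-cluster threshold and the helper names are my own illustrative choices:

```python
import numpy as np

def ldcof_scores(X, centers, labels, small_frac=0.1):
    """LDCOF sketch: distance to the (large-)cluster center divided by
    that cluster's average point-to-center distance; points in small
    clusters are measured against the nearest large cluster instead."""
    k = len(centers)
    sizes = np.bincount(labels, minlength=k)
    large = np.where(sizes >= small_frac * len(X))[0]
    # average distance to the centroid, per large cluster
    avg = {j: np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
           for j in large}
    scores = np.empty(len(X))
    for i in range(len(X)):
        c = labels[i]
        if c in avg:                         # member of a large cluster
            scores[i] = np.linalg.norm(X[i] - centers[c]) / avg[c]
        else:                                # small cluster: nearest large one
            d = np.linalg.norm(centers[large] - X[i], axis=1)
            j = large[np.argmin(d)]
            scores[i] = d.min() / avg[j]
    return scores
```

normal points in a blob come out near 1 (they sit at roughly the average distance from their centroid), while an isolated point scores orders of magnitude higher.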
we assume a bell-curve-like (multivariate Gaussian) underlying distribution of the data, and use the generalized distance formula (Mahalanobis distance) for the anomaly score (this is CMGOS, the Clustering-based Multivariate Gaussian Outlier Score).
the steps are as follows:
k-means cluster and separate into small and large clusters
calculate covariance matrix per cluster
CMGOS score = generalized (Mahalanobis) distance(instance to its nearest cluster center) / chi-squared quantile at a certain confidence level
this last step normalizes the multivariate distance by the cluster's expected spread, so that a score above 1 again flags an anomaly ;)
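the CMGOS score formula can be sketched like this, again assuming k-means assignments are given. I use the plain sample covariance here for simplicity, and `confidence=0.975` is an illustrative choice:

```python
import numpy as np
from scipy.stats import chi2

def cmgos_scores(X, centers, labels, confidence=0.975):
    """CMGOS sketch: squared Mahalanobis distance of each point to its
    cluster center, normalized by the chi-squared quantile at the given
    confidence (d degrees of freedom). Scores well above 1 flag anomalies."""
    d = X.shape[1]
    thresh = chi2.ppf(confidence, df=d)
    scores = np.empty(len(X))
    for j in range(len(centers)):
        mask = labels == j
        pts = X[mask]
        cov = np.cov(pts, rowvar=False)
        inv = np.linalg.pinv(cov)            # pseudo-inverse for stability
        diff = pts - centers[j]
        m2 = np.einsum('ij,jk,ik->i', diff, inv, diff)  # squared Mahalanobis
        scores[mask] = m2 / thresh
    return scores
```

for a Gaussian blob, roughly `confidence` of the points fall below 1, while a point far outside the fitted ellipsoid scores well above it.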

there are various ways to robustly compute the covariance matrix, each with its own algorithm name:
CMGOS-Red - reduction: remove outliers in a second pass, similar to Grubbs' test
CMGOS-Reg - regularization: shrink/weight the covariance matrix
CMGOS-MCD - a computationally heavy brute-force search for the Minimum Covariance Determinant
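as one concrete example, a two-pass reduction in the spirit of CMGOS-Red might look like this; the 5% drop fraction and the fixed center are my own simplifications:

```python
import numpy as np

def reduced_covariance(pts, center, drop_frac=0.05):
    """CMGOS-Red-style sketch: estimate the covariance once, drop the most
    Mahalanobis-distant points, then re-estimate on the remainder so that
    outliers inflate the covariance less."""
    diff = pts - center
    cov = np.cov(pts, rowvar=False)
    inv = np.linalg.pinv(cov)
    m2 = np.einsum('ij,jk,ik->i', diff, inv, diff)   # squared Mahalanobis
    keep = m2 <= np.quantile(m2, 1.0 - drop_frac)    # second pass drops outliers
    return np.cov(pts[keep], rowvar=False)
```

with an extreme outlier in the cluster, the re-estimated covariance is noticeably tighter than the single-pass one.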
next we will dive into the rest in part 3!
note that this algorithm is also counted among the subspace-based methods, as it uses distance within a subspace as a core concept.
