I want to compare the ROCK clustering algorithm to a distance based algorithm. Let say we have (m) training examples and (n) features
ROCK:
From what I understand ROCK does is that
1. It calculates a similarity matrix (m*m) using Jaccard cooficients.
2. Then a threshold value is provided by the user.
3. Based on the threshold value it then links the data points and the data-points that have more neighbors in common are said to be in same cluster.
For Example lets see the below png file,
The above picture shows the similarity matrix and let threshold_value =0.2.
4. The algorithm then computes links between the points, which flows in as below.
for A- A (Only A exceeds the threshold value)
B- BCD (because bb, bc and bd exceeds the threshold value)
C- BCD
D- BCD
Now since B, C and D each have common neighbor of 3 they are grouped into the same cluster
Therefore we get two clusters {A}, {BCD}
A DISTANCE BASED APPROACH:
1. I take a different approach, but like ROCK even I create the similarity matrix.
2. Even I compute the initial links like,
for A- A (Only A exceeds the threshold value)
B- BCD (because bb, bc and bd exceeds the threshold value)
C- BCD
D- BCD
3. Now Instead of finding neighbors, I perform some mojo and find the best centroids.
4. After finding the centroid, I run the k-means clustering algorithm over the similarity matrix(m*m)
5. Since I find the centroids before hand, the time taken by the algorithm reduces by not running the k-means algorithm multiple times for randomly chosen centroids.
Problem Statement:
The problem I see is the space complexity, since the similarity matrix is a (m*m) matrix, if the value of m is too large say 1 million, storing such large matrix would be difficult, also due to the matrix size the euclidean distance computation takes time.
In ROCK I believe however, there is absolutely no need to store the matrix because the links can be build on the fly when the Jaccard cooficient is calculated between the dataset.
I ran the distance based algorithm approach over the similarity matrix for the mushroom dataset available in (uci.org) and the output results were very similar to ROCK and even better for some other datasets.
Qestions:
1. Is my understanding of ROCK correct.
2. Is it even worth considering to create such large similarity matrix and store is in disk and use it later to calculate distances.
3. I would really appreciate if someone could provide the big O complexity for the distance based approach.
Thanks:)