
I want to compare the ROCK clustering algorithm to a distance-based algorithm. Let's say we have m training examples and n features.

ROCK:

From what I understand, ROCK works as follows:

1. It calculates a similarity matrix (m*m) using Jaccard coefficients.
2. Then a threshold value is provided by the user.
3. Based on the threshold value it then links the data points; data points that have more neighbors in common are said to be in the same cluster. 
  For example, see the image below.

[Image: similarity matrix for the points A, B, C and D]

   The picture above shows the similarity matrix; let threshold_value = 0.2. 
4. The algorithm then computes links between the points, as follows:
for A- A   (only aa exceeds the threshold value) 
    B- BCD (because bb, bc and bd exceed the threshold value)
    C- BCD
    D- BCD
Now, since B, C and D each share 3 common neighbors, they are grouped into the same cluster.

Therefore we get two clusters {A}, {BCD}
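Here is a minimal sketch of steps 1-4 as I understand them, assuming the records are plain Python sets of categorical values; the function names and the NumPy trick for counting common neighbors are my own choices, not the original ROCK code:

    import numpy as np

    def jaccard(a, b):
        """Jaccard coefficient between two categorical records held as sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def neighbors_and_links(records, threshold):
        """Mark pairs whose similarity reaches the threshold (neighbors),
        then count the common neighbors of every pair (links)."""
        m = len(records)
        neighbor = np.zeros((m, m), dtype=int)
        for i in range(m):
            for j in range(m):
                if jaccard(records[i], records[j]) >= threshold:
                    neighbor[i, j] = 1
        # link(i, j) = number of points that are neighbors of both i and j
        links = neighbor @ neighbor
        return neighbor, links

With the toy matrix above, row A is a neighbor only of itself while B, C and D are neighbors of each other, so links[B][C] = links[B][D] = links[C][D] = 3 and B, C, D end up in the same cluster.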

A DISTANCE-BASED APPROACH:

1. I take a different approach, but like ROCK I also create the similarity matrix.
2. I also compute the initial links, e.g.
   for A- A   (only aa exceeds the threshold value) 
       B- BCD (because bb, bc and bd exceed the threshold value)
       C- BCD
       D- BCD
3. Now, instead of counting common neighbors, I perform some mojo and find the best centroids.
4. After finding the centroids, I run the k-means clustering algorithm over the similarity matrix (m*m).
5. Since I find the centroids beforehand, the running time drops because k-means does not have to be re-run multiple times with randomly chosen centroids (see the sketch after this list).
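Here is a minimal sketch of steps 4-5, assuming scikit-learn's KMeans; the centroid-finding "mojo" is treated as a black box, and the point is only that seeding k-means with precomputed centroids lets it run once (n_init=1) instead of restarting from random centroids:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_with_seed_centroids(similarity, centroids):
        """similarity : (m, m) similarity matrix built as above
        centroids  : (k, m) array of precomputed centroids, one row per cluster,
                     living in the same m-dimensional space as the matrix rows
        Runs k-means exactly once, seeded with the given centroids."""
        km = KMeans(n_clusters=len(centroids), init=np.asarray(centroids), n_init=1)
        return km.fit_predict(similarity)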

Problem Statement:

The problem I see is the space complexity: since the similarity matrix is an m*m matrix, storing it becomes difficult when m is very large, say 1 million, and the Euclidean distance computations also take longer because of the matrix size.

In ROCK, however, I believe there is no need to store the matrix at all, because the links can be built on the fly as the Jaccard coefficients are calculated over the dataset.

I ran the distance-based approach over the similarity matrix for the mushroom dataset available at uci.org, and the output was very similar to ROCK's, and even better on some other datasets.

Questions:

1. Is my understanding of ROCK correct?
2. Is it even worth creating such a large similarity matrix, storing it on disk, and using it later to calculate distances?
3. I would really appreciate it if someone could provide the big-O complexity of the distance-based approach.

Thanks:)


1 Answer


As far as I know, clustering becomes very memory-intensive as the data size increases; you will have to figure out a way to reduce the dimensionality of your data.

I am not familiar with ROCK, but I have worked on clustering problems before where I had to cluster millions of documents.

Distance metric: Levenshtein distance
Clustering algorithm: DBSCAN
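For illustration, here is a minimal sketch of that setup, assuming scikit-learn's DBSCAN with a precomputed distance matrix; the words and the eps/min_samples values are made up:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def levenshtein(a, b):
        """Plain dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    words = ["cluster", "clusters", "distance", "instance"]
    # Pairwise distance matrix -- only feasible for a modest number of items.
    dist = np.array([[levenshtein(a, b) for b in words] for a in words])
    labels = DBSCAN(eps=2, min_samples=2, metric="precomputed").fit_predict(dist)
    print(labels)  # e.g. [0 0 1 1]: the two near-identical pairs group together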

Coming back to your questions

Question 2:
Is it even worth creating such a large similarity matrix, storing it on disk, and using it later to calculate distances?

I would never recommend constructing a large matrix like that. For example, a distance matrix over 1 million words would require about 4 terabytes of space (at 4 bytes per entry). You will have to use some kind of blocking technique to group roughly similar documents first and then apply a clustering algorithm on top.
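A quick back-of-the-envelope check of that figure, assuming one 4-byte float per matrix entry:

    m = 1_000_000                    # number of items
    bytes_per_entry = 4              # float32
    total = m * m * bytes_per_entry  # full m x m distance matrix
    print(total / 1e12, "TB")        # -> 4.0 TB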

Question 3:
I would really appreciate it if someone could provide the big-O complexity of the distance-based approach.

The time taken to calculate the distance between two words is usually negligible, since words are not very long. The dominant cost is the number of pairwise comparisons: with n words there are about n*(n-1)/2 pairs, so the time complexity is O(n*n).