I am trying to build a clustering algorithm for categorical data.
I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to theirs.
I have m training examples and n = 22 features.
Approach
My approach is simple:
- Step 1: I calculate the Jaccard similarity between each pair of rows in my training data, forming an (m*m) similarity matrix.
- Step 2: I then find the best centroids and assign each row to a cluster using a simple k-means-style approach.
The similarity matrix created in Step 1 is used throughout the k-means step; a rough sketch of what I mean by that is given below.
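To make Step 2 concrete, here is a minimal sketch of the assignment step I have in mind (a k-medoids-style assignment; the 4 x 4 similarity matrix and the centroid indices are made-up placeholders, and the centroid-update logic is elided):

import numpy as np

def assign_clusters(fnl_matrix, centroid_indices):
    # fnl_matrix[i][c] holds the similarity between row i and centroid row c,
    # so each row is assigned to the centroid it is most similar to
    sims_to_centroids = fnl_matrix[:, centroid_indices]  # shape (m, k)
    return sims_to_centroids.argmax(axis=1)              # cluster label per row

# Hypothetical 4 x 4 similarity matrix and k = 2 centroid rows (rows 0 and 2)
fnl_matrix = np.array([
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
])
print(assign_clusters(fnl_matrix, centroid_indices=[0, 2]))  # [0 0 1 1]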
Matrix creation:
import numpy as np

m = len(data_set)           # number of training rows
total_columns = 22          # number of features
fnl_matrix = np.zeros((m, m))

for i in range(m):
    for j in range(i, m):   # the matrix is symmetric, so only j >= i is needed
        # Calculating the Jaccard similarity between data rows i and j
        # as the fraction of the 22 columns on which the two rows agree
        common_count = 0
        for column in data_set.columns:
            if data_set[column][j] == data_set[column][i]:
                common_count += 1
        probability = common_count / float(total_columns)
        fnl_matrix[i][j] = probability
        fnl_matrix[j][i] = probability
A partial snapshot of my fnl_matrix (for 6 rows) is given below:
Problem Statement:
The problem I face is that creating the (m*m) matrix kills my performance on larger data sets. Even for a smaller dataset with 8,000 rows, building the similarity matrix takes an unbearable amount of time, since the nested loops above perform O(m^2 * n) comparisons in pure Python. Is there any way I could tune my code, or do something to the matrix construction, that would be cost-efficient?
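One direction I have considered is pushing the pairwise comparison out of Python loops entirely. The sketch below is an assumption on my part, not my current code: it first encodes each categorical column as integer codes with pandas.factorize, then uses SciPy's pdist with the "hamming" metric (the fraction of disagreeing columns, so 1 - hamming is exactly the match fraction computed by my loops); the small DataFrame is a made-up stand-in for my real 22-column data:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up stand-in for the real 22-column categorical DataFrame
data_set = pd.DataFrame({
    "color": ["red", "red", "blue", "red"],
    "size":  ["small", "large", "small", "small"],
})

# Encode every categorical column as integer codes so the rows become an
# (m, n) integer array; two cells match exactly when their codes are equal
codes = np.column_stack(
    [pd.factorize(data_set[col])[0] for col in data_set.columns]
)

# pdist computes, for every pair of rows, the fraction of columns that
# disagree (the normalized Hamming distance) in compiled code; 1 - hamming
# is the same match fraction as the nested loops, and squareform expands
# the condensed result into the full symmetric (m, m) matrix
fnl_matrix = 1.0 - squareform(pdist(codes, metric="hamming"))
print(fnl_matrix)

Since pdist only computes each pair once and runs in compiled code, this should cut both time and memory compared to the pure-Python double loop, but I have not verified it against my real data.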