I am trying to build a clustering algorithm for categorical data.
I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to theirs.
I have m training examples and n = 22 features.
Approach
My approach is simple:
- Step 1: I calculate the Jaccard similarity between each pair of rows in my training data, forming an (m*m) similarity matrix.
- Step 2: I then find the best centroids and assign each row to a cluster using a simple k-means-style approach.
The similarity matrix created in Step 1 is used throughout the k-means step; a rough sketch of what I mean by that is given below.
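To make Step 2 concrete, here is a minimal sketch of the assignment step I have in mind (a k-medoids-style assignment; the 4 x 4 similarity matrix and the centroid indices are made-up placeholders, and the centroid-update logic is elided):

import numpy as np

def assign_clusters(fnl_matrix, centroid_indices):
    # fnl_matrix[i][c] holds the similarity between row i and centroid row c,
    # so each row is assigned to the centroid it is most similar to
    sims_to_centroids = fnl_matrix[:, centroid_indices]  # shape (m, k)
    return sims_to_centroids.argmax(axis=1)              # cluster label per row

# Hypothetical 4 x 4 similarity matrix and k = 2 centroid rows (rows 0 and 2)
fnl_matrix = np.array([
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.2, 0.1, 0.9, 1.0],
])
print(assign_clusters(fnl_matrix, centroid_indices=[0, 2]))  # [0 0 1 1]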
Matrix creation:
import numpy as np

m = len(data_set)           # number of training rows
total_columns = 22          # number of features
fnl_matrix = np.zeros((m, m))

for i in range(m):
    for j in range(i, m):   # the matrix is symmetric, so only j >= i is needed
        # Calculating the Jaccard similarity between data rows i and j
        # as the fraction of the 22 columns on which the two rows agree
        common_count = 0
        for column in data_set.columns:
            if data_set[column][j] == data_set[column][i]:
                common_count += 1
        probability = common_count / float(total_columns)
        fnl_matrix[i][j] = probability
        fnl_matrix[j][i] = probability
A partial snapshot of my fnl_matrix (for 6 rows) is given below:
Problem Statement:
The problem I face is that creating the (m*m) matrix kills my performance on larger data sets. Even for a smaller dataset with 8,000 rows, building the similarity matrix takes an unbearable amount of time, since the nested loops above perform O(m^2 * n) comparisons in pure Python. Is there any way I could tune my code, or do something to the matrix construction, that would be cost-efficient?
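One direction I have considered is pushing the pairwise comparison out of Python loops entirely. The sketch below is an assumption on my part, not my current code: it first encodes each categorical column as integer codes with pandas.factorize, then uses SciPy's pdist with the "hamming" metric (the fraction of disagreeing columns, so 1 - hamming is exactly the match fraction computed by my loops); the small DataFrame is a made-up stand-in for my real 22-column data:

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up stand-in for the real 22-column categorical DataFrame
data_set = pd.DataFrame({
    "color": ["red", "red", "blue", "red"],
    "size":  ["small", "large", "small", "small"],
})

# Encode every categorical column as integer codes so the rows become an
# (m, n) integer array; two cells match exactly when their codes are equal
codes = np.column_stack(
    [pd.factorize(data_set[col])[0] for col in data_set.columns]
)

# pdist computes, for every pair of rows, the fraction of columns that
# disagree (the normalized Hamming distance) in compiled code; 1 - hamming
# is the same match fraction as the nested loops, and squareform expands
# the condensed result into the full symmetric (m, m) matrix
fnl_matrix = 1.0 - squareform(pdist(codes, metric="hamming"))
print(fnl_matrix)

Since pdist only computes each pair once and runs in compiled code, this should cut both time and memory compared to the pure-Python double loop, but I have not verified it against my real data.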