
scikit-learn==0.21.2

The response time of the Hierarchical Agglomerative Clustering algorithm grows dramatically as the dataset grows.

My data set is textual; each document is 7-10 words long.

I am using the following code to perform the clustering:

from sklearn.cluster import AgglomerativeClustering

# Cut the full tree at cosine distance 0.7 instead of fixing n_clusters
hac_model = AgglomerativeClustering(affinity='cosine',
                                    linkage='complete',
                                    compute_full_tree=True,
                                    connectivity=None, memory=None,
                                    n_clusters=None,
                                    distance_threshold=0.7)
cluster_matrix = hac_model.fit_predict(matrix)

where the matrix sizes and run times are:

  • 5000x1500: 17 seconds
  • 10000x2000: 113 seconds
  • 13000x2418: 228 seconds

I can't control the number of documents (5000, 10000, 13000), since that is the size of the input, nor the feature set size (i.e. 1500, 2000, 2418), since I am using a bag-of-words model (TF-IDF).

I end up using all the unique words (after removing stopwords) as my feature list, and this list grows as the input size increases.
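To illustrate the growth, here is a minimal sketch (toy documents, default TfidfVectorizer settings, not my real corpus): every new non-stopword term widens the matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

small = ["red apple pie", "green apple tart"]
large = small + ["blue berry jam", "sweet lemon cake"]

# Every unique non-stopword term becomes a feature, so the
# vocabulary (and therefore the matrix width) grows with the corpus.
for docs in (small, large):
    m = TfidfVectorizer(stop_words='english').fit_transform(docs)
    print(m.shape)
```

Capping the vocabulary (for example with the max_features parameter of TfidfVectorizer) is one common way to stop this growth, at some cost in representation quality.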

So, two questions:

  1. How do I keep the feature set size from growing as the input data set grows?
  2. Is there a way to improve the performance of the algorithm without compromising quality?

1 Answer


Standard AGNES hierarchical clustering has O(n³ + n²d) complexity, so the number of instances n is a much bigger problem than the number of features d.
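As a quick sanity check, extrapolating your reported timings cubically from the first measurement lands in the right ballpark (the numbers below are taken from the question):

```python
# Do the reported run times scale roughly like n^3?
base_n, base_t = 5000, 17.0
for n, t in [(10000, 113.0), (13000, 228.0)]:
    predicted = base_t * (n / base_n) ** 3
    print(f"n={n}: observed {t:.0f}s, cubic extrapolation {predicted:.0f}s")
```

The observed times grow slightly slower than a pure cubic extrapolation, but the trend is clearly super-linear in n, which is what matters here.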

There are approaches (for example, the nearest-neighbor-chain algorithm) that typically run in O(n²d), although the worst case remains the same, so they will be much faster than this. With these you'll usually run into memory limits first, since they store the O(n²) distance matrix. Unfortunately, as far as I know this isn't implemented in sklearn, so you'll have to use other clustering tools, or implement the algorithm yourself.
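For instance, SciPy's hierarchical clustering works from a precomputed condensed distance matrix, and for complete linkage it uses the nearest-neighbor-chain algorithm. A minimal sketch, with random data standing in for the TF-IDF matrix and the question's threshold of 0.7:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster, linkage

np.random.seed(0)
X = np.random.rand(100, 50)        # stand-in for the TF-IDF matrix

d = pdist(X, metric='cosine')      # condensed distance matrix: O(n^2) memory
Z = linkage(d, method='complete')  # complete linkage via nearest-neighbor chain
labels = fcluster(Z, t=0.7, criterion='distance')  # cut the tree at distance 0.7
```

The O(n²) distance matrix is the price: at n = 13000 it is still manageable, but it becomes the bottleneck long before the clustering step does.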