scikit-learn==0.21.2
Hierarchal Agglomerative Clustering algorithm response time is increasing exponentially when increasing the dataset.
My Data set is textual. Each Document is 7-10 words long.
Using the following code to perform the Clustering.
hac_model = AgglomerativeClustering(affinity=consine,
linkage=complete,
compute_full_tree=True,
connectivity=None, memory=None,
n_clusters=None,
distance_threshold=0.7)
cluster_matrix = hac_model.fit_predict(matrix)
where the matrix of size are:
- 5000x1500 taking 17 seconds
- 10000*2000 taking 113 seconds
- 13000*2418 taking 228 seconds
I can't control 5000, 10000, 15000 as that is the size of input, or the feature set size(i.e 1500,2000,2418) since I am using BOW model(TFIDF).
I end up using all the unique words(after removing stopwords) as my feature list. this list grows as the input size increases.
So two questions.
- How do I avoid increase in feature set size irrespective of increase in the size of input data set
- Is there a way I can improve on the performance of the Algorithm without compromising on the quality?