I have a set of documents, each represented as a vector of topic weights, as in the example below.
doc1 = {'Science': 0, 'History': 0, 'Politics': 0.15, ..., 'Sports': 0}
doc2 = {'Science': 0.3, 'History': 0.5, 'Politics': 0.1, ..., 'Sports': 0}
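For completeness, here is a minimal sketch of how dicts like these can be turned into the sparse feature_matrix used in the snippets below (I am using sklearn's DictVectorizer purely for illustration; the real documents contain more topics than the keys shown above):

from sklearn.feature_extraction import DictVectorizer

docs = [
    {'Science': 0, 'History': 0, 'Politics': 0.15, 'Sports': 0},
    {'Science': 0.3, 'History': 0.5, 'Politics': 0.1, 'Sports': 0},
]

# One row per document, one column per topic key; the result is a scipy sparse matrix
vectorizer = DictVectorizer()
feature_matrix = vectorizer.fit_transform(docs)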
I clustered these documents with DBSCAN using the vectors above (my vectors are mostly sparse). I have read that cosine similarity is very efficient to compute for sparse vectors. However, according to the documentation of sklearn's DBSCAN.fit, when metric='precomputed' the input should be a distance matrix. Hence, I want to know whether it is wrong to use cosine similarity instead of cosine distance.
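To be explicit about the relationship I am assuming between the two quantities: cosine_distance(d1, d2) = 1 - cosine_similarity(d1, d2), which is why the second snippet below is just 1 minus the matrix used in the first.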
Please let me know which approach is most suitable for my problem: DBSCAN using cosine distance, or DBSCAN using cosine similarity?
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Fit DBSCAN using cosine distance (precomputed distance matrix)
db = DBSCAN(min_samples=1, metric='precomputed').fit(pairwise_distances(feature_matrix, metric='cosine'))
OR
# Fit DBSCAN using cosine similarity (1 minus the cosine distance matrix)
db = DBSCAN(min_samples=1, metric='precomputed').fit(1 - pairwise_distances(feature_matrix, metric='cosine'))