
I have a set of documents as given in the example below.

doc1 = {'Science': 0, 'History': 0, 'Politics': 0.15,... 'Sports': 0}
doc2 = {'Science': 0.3, 'History': 0.5, 'Politics': 0.1,... 'Sports': 0}

I clustered these documents with DBSCAN, using the vectors above as features (my vectors are mostly sparse). I have read that cosine similarity is very efficient to compute for sparse vectors. However, according to the documentation for sklearn's DBSCAN.fit, a precomputed input must be a distance matrix. Hence, I want to know whether it is wrong to use cosine similarity instead of cosine distance.

Please let me know which is the more suitable approach for my problem: DBSCAN using cosine distance, or DBSCAN using cosine similarity?

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# Fit DBSCAN using cosine distance
db = DBSCAN(min_samples=1, metric='precomputed').fit(
    pairwise_distances(feature_matrix, metric='cosine'))

OR

# Fit DBSCAN using cosine similarity
db = DBSCAN(min_samples=1, metric='precomputed').fit(
    1 - pairwise_distances(feature_matrix, metric='cosine'))
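
For reference, the two matrices are exact complements of each other; a quick sanity check (the random array here is just a hypothetical stand-in for my feature_matrix):

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(5, 4)  # hypothetical stand-in for feature_matrix

# pairwise_distances with metric='cosine' returns 1 - cosine similarity,
# so subtracting it from 1 recovers the similarity matrix.
assert np.allclose(1 - pairwise_distances(X, metric='cosine'),
                   cosine_similarity(X))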

1 Answer


If you pass a precomputed distance matrix, it will be O(n²) in both time and memory, since the entire matrix must be computed and stored.

If you pass the actual data instead, the implementation may be able to use an index to make it faster than that. So I'd rather try metric="cosine".
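
Here is a minimal sketch of that direct approach, assuming toy sparse topic vectors in place of the real data (eps is a cosine distance threshold you would tune):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Hypothetical sparse topic vectors shaped like the question's documents.
feature_matrix = csr_matrix(np.array([
    [0.0, 0.0, 0.15, 0.0],
    [0.3, 0.5, 0.10, 0.0],
    [0.0, 0.0, 0.20, 0.0],
]))

# Let DBSCAN compute cosine distances itself instead of receiving a
# precomputed n x n matrix; eps is a cosine *distance* threshold.
db = DBSCAN(eps=0.3, min_samples=1, metric='cosine').fit(feature_matrix)
print(db.labels_)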

DBSCAN can trivially be implemented with a similarity rather than a distance (cf. Generalized DBSCAN). I believe I have seen this supported in ELKI, but not in sklearn. In sklearn, you can use cosine distance, with the aforementioned drawbacks.
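
If you do stick with a precomputed matrix, convert the similarity to a distance first, so that small values mean "close", which is what DBSCAN's eps expects. A minimal sketch, again assuming toy vectors in place of the real data:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the question's topic vectors.
X = np.array([[0.0, 0.0, 0.15, 0.0],
              [0.3, 0.5, 0.10, 0.0],
              [0.0, 0.0, 0.20, 0.0]])

sim = cosine_similarity(X)          # large value = similar
dist = 1.0 - sim                    # small value = similar, as DBSCAN expects
np.clip(dist, 0.0, None, out=dist)  # guard against negative round-off

db = DBSCAN(eps=0.3, min_samples=1, metric='precomputed').fit(dist)
print(db.labels_)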