My goal is to find outliers in a dataset that contains many near-duplicate points, and I want to use the ELKI implementation of DBSCAN for this task.
I don't care about the clusters themselves, only the outliers (which I assume are relatively far from the clusters). To reduce the runtime, I want to aggregate/bin the points on a grid and then use the concept that scikit-learn implements as sample_weight.
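To make the aggregation step concrete, here is a rough sketch of the binning I have in mind, in plain Python. The function name bin_points and the cell parameter are made up for illustration; they are not part of ELKI or scikit-learn.

```python
from collections import Counter


def bin_points(points, cell):
    """Snap 2-D points to a square grid and count the points per cell.

    Hypothetical helper: `cell` is the grid cell width (an assumption).
    Returns (features, sample_weight): the cell-centre coordinates and
    the number of original points aggregated into each cell.
    """
    # Floor division maps every coordinate to an integer cell index.
    counts = Counter((int(x // cell), int(y // cell)) for x, y in points)
    cells = sorted(counts)
    # Represent each occupied cell by its centre.
    features = [((i + 0.5) * cell, (j + 0.5) * cell) for i, j in cells]
    sample_weight = [counts[c] for c in cells]
    return features, sample_weight
```

The cell width should be small compared with eps, otherwise the binning itself changes which points count as neighbours.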
Can you please show minimal code to do a similar analysis in ELKI?
Let's assume my dataset has three columns: two feature columns (the x and y coordinates of the aggregated/binned points on the grid) and a third column, sample_weight_feature, holding the sample weights (the number of original dataset points in the neighbourhood of each aggregated/binned point). In scikit-learn, the answer I would expect is to call fit in the following way: fit(features, y=None, sample_weight=sample_weight_feature)
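To spell out what I expect the weights to do, here is a minimal pure-Python sketch of the behaviour as I understand it from scikit-learn: a point is a core point when the summed weight of its eps-neighbourhood (itself included) reaches min_samples, and a point is noise when it is neither a core point nor within eps of one. The function name weighted_dbscan_noise is hypothetical, and the brute-force O(n^2) neighbour search is only for illustration.

```python
from math import dist


def weighted_dbscan_noise(features, sample_weight, eps, min_samples):
    """Return the indices of the points that weighted DBSCAN marks as noise."""
    n = len(features)
    # eps-neighbourhood of every point, including the point itself.
    neighbours = [
        [j for j in range(n) if dist(features[i], features[j]) <= eps]
        for i in range(n)
    ]
    # Core point: total weight inside the neighbourhood >= min_samples.
    core = [
        sum(sample_weight[j] for j in neighbours[i]) >= min_samples
        for i in range(n)
    ]
    # Noise: not a core point and no core point within eps.
    return [
        i
        for i in range(n)
        if not core[i] and not any(core[j] for j in neighbours[i])
    ]
```

With the binned data, a single cell holding at least min_samples original points becomes a core point by itself, which is the effect I would like to reproduce in ELKI.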
So far I only have km = new DBSCAN(dist, eps*eps, minpts), but I haven't tried to implement the sample_weight functionality yet. – user1541776