I am working with GPS data (latitude, longitude). For density-based clustering I have used DBSCAN in R.
Advantages of DBSCAN in my case:
- I don't have to predefine the number of clusters
- I can calculate a distance matrix (using the Haversine distance formula) and use that as input to dbscan
    library(fossil)
    dist <- earth.dist(df, dist = TRUE)  # df is the dataset containing lat/long values
    library(fpc)
    dens <- dbscan(dist, MinPts = 25, eps = 0.43, method = "dist")
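For reference, the Haversine distance matrix can also be sketched in base R, which makes the units explicit. This is only an illustration, not what `earth.dist` does internally: the function name `haversine_matrix`, the data frame `pts` with its `lat`/`lon` columns, and the mean Earth radius of 6371 km are all assumptions made for the example.

```r
# Haversine distance matrix in kilometres, base R only.
# Assumes coordinates in decimal degrees and a spherical Earth of radius 6371 km.
haversine_matrix <- function(lat, lon, radius_km = 6371) {
  to_rad <- pi / 180
  phi    <- lat * to_rad
  lambda <- lon * to_rad
  dphi    <- outer(phi, phi, "-")        # pairwise latitude differences
  dlambda <- outer(lambda, lambda, "-")  # pairwise longitude differences
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlambda / 2)^2
  a <- pmin(pmax(a, 0), 1)               # guard against rounding outside [0, 1]
  2 * radius_km * asin(sqrt(a))
}

# Hypothetical points: two on the equator one degree apart, one further north
pts <- data.frame(lat = c(0, 0, 10), lon = c(0, 1, 0))
d_km <- haversine_matrix(pts$lat, pts$lon)
print(round(d_km, 2))
```

With distances in kilometres, an `eps` of 0.43 would mean 430 m. The result can be wrapped in `as.dist()` before being passed on as a distance object.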
Now, when I look at the clusters, they are not meaningful. Some clusters contain points that are more than 1 km apart. I want dense clusters, but not ones that large.
I have tried different values of MinPts and eps, and I have also used a k-nearest-neighbor distance graph to find an optimal value of eps for MinPts = 25.
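The k-nearest-neighbor distance graph mentioned above can be sketched from a full distance matrix as follows. The helper name `knn_distance` and the toy data are assumptions for illustration; conventions also differ on whether k should be MinPts or MinPts - 1, so treat the choice of k as a judgment call.

```r
# k-th nearest-neighbour distance for every point, from a full distance matrix.
knn_distance <- function(dmat, k) {
  dmat <- as.matrix(dmat)
  stopifnot(k >= 1, k < nrow(dmat))
  # After sorting a row, position 1 is the point itself (distance 0),
  # so the k-th neighbour sits at position k + 1.
  apply(dmat, 1, function(row) sort(row)[k + 1])
}

# Toy one-dimensional data standing in for the real distance matrix
x    <- c(0, 1, 3, 7)
dmat <- as.matrix(dist(x))
kd   <- knn_distance(dmat, k = 2)
print(sort(kd))

# The usual graph is the sorted k-distances; a "knee" suggests a value for eps:
# plot(sort(kd), type = "l", ylab = "2-NN distance")
```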
What dbscan does is visit every point in my dataset, and if a point p has at least MinPts points in its eps neighborhood it forms a cluster. At the same time it also joins clusters that are density-reachable from one another, which I guess is what is causing the problem for me.
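The core-point test described above can be written down directly from the distance matrix. This is a minimal sketch under stated assumptions: the function name `classify_points` is made up, the point itself is counted in its own eps neighborhood (implementations differ on this), label 0 is taken to mean noise, and the labels in the demo are written by hand rather than produced by dbscan.

```r
# Label each point as "core", "border" or "noise" given cluster labels.
classify_points <- function(dmat, cluster, eps, min_pts) {
  dmat <- as.matrix(dmat)
  # Neighbourhood size, counting the point itself (an assumption)
  n_neighbours <- rowSums(dmat <= eps)
  is_core <- n_neighbours >= min_pts
  # Clustered points that fail the core test are border points
  ifelse(cluster == 0, "noise", ifelse(is_core, "core", "border"))
}

# Toy data: four points chained together and one isolated point
x      <- c(0, 1, 2, 4, 20)
dmat   <- as.matrix(dist(x))
labels <- c(1, 1, 1, 1, 0)   # hand-written stand-in for dens$cluster
kinds  <- classify_points(dmat, labels, eps = 2, min_pts = 3)
print(data.frame(x = x, cluster = labels, kind = kinds))
```

In the toy data the point at 4 has only two points within eps, so it is not a core point, but it lies within eps of the core point at 2 and is therefore attached to the cluster as a border point.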
It really is a big question, particularly "how to reduce size of a cluster without affecting its information too much", but I will write it down as the following points:
- How to remove border points in a cluster? I know which points are in which cluster using dens$cluster, but how would I know if a particular point is core or border?
- Is cluster 0 always noise?
- I was under the impression that the size of a cluster would be comparable to eps. But that's not the case because density-reachable clusters are combined together.
- Is there any other clustering method which has the advantages of dbscan but can give me more meaningful clusters? OPTICS is another alternative, but will it solve my issue?
Note:
By meaningful I mean that points close to each other should be in the same cluster, but points which are 1 km or more apart should not be.
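The 1 km criterion can be checked per cluster by computing each cluster's diameter, that is, the largest pairwise distance between its members. The sketch below is only a diagnostic, not a fix; the function name `cluster_diameters` and the toy labels are assumptions, distances are assumed to be in kilometres, and label 0 is assumed to mark noise.

```r
# Largest pairwise distance within each cluster (noise label 0 is skipped).
cluster_diameters <- function(dmat, cluster) {
  dmat <- as.matrix(dmat)
  ids <- sort(unique(cluster[cluster != 0]))
  diam <- sapply(ids, function(id) {
    members <- which(cluster == id)
    max(dmat[members, members])
  })
  setNames(diam, ids)
}

# Toy data in km: a chained cluster that grows beyond 1 km, and a compact one
x      <- c(0, 0.5, 1.5, 10, 10.25)
dmat   <- as.matrix(dist(x))
labels <- c(1, 1, 1, 2, 2)   # hand-written stand-in for dens$cluster
diam   <- cluster_diameters(dmat, labels)
too_big <- names(diam)[diam > 1]
print(diam)
print(too_big)
```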
R code for half a dozen methods for estimating suitable numbers of clusters for k-means over here. You do not really need to 'know' how many clusters your data have. – Ben