Which clustering algorithm to use here?

Question

I am completely new to Machine learning algorithms and want some direction to move towards.

General Scenario

I have two parameters Location and Time. I need to group the posts that were uploaded from a particular location at a particular time. There is no specific preset location or time. A random interesting event can happen at any time at any location in the location plane. A group of users can start uploading as soon as they find something interesting at a particular location at a particular time. The algorithm needs to detect these posts that were uploaded at that time in that place.

Accepted scene

An event happens at point A(1,3) at 4:00. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (1,2),(1.5,3),(1,2.5)) around the time 4:00 like 4:01, 4:10, 4:22 etc

Unaccepted scene

An event happens at point A(1,3) at 4:00. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (9,2),(1.5,15),(21,62.5)) around the time 4:00 like 4:01, 4:10, 4:22 etc
An event happens at point A(1,3) at 4:00. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (1,2),(1.5,3),(1,2.5)) around the time 4:00 like 14:01, 08:10, 13:22 etc

Understanding so far

From my understand, I see that if this was only based on location, it could have been done using K-Means clustering algorithm. But, since we have Time dimension as well, I need another algorithm that clusters in 3D. I think DBSCAN might do it, but I'm not sure, as I have a really vague understanding.

So, which algorithm might help me here? If not direct answer, I would like some direction towards which I can research, because it's a very vast field to go through every single algo.

EDIT

I tried the following test implementation

from sklearn.cluster import KMeans, MeanShift, DBSCAN
import numpy as np

# Scene one, similar timestamp (turn timestamp into decimal so that the difference is not too large)
# First three are from event 1,
# next 2 are from event 2,
# 3rd one is a random post.
X = np.array([[12.975466, 77.639363, 149794.3292], [12.975273, 77.639358, 149794.3311], [12.975317, 77.639562, 149794.3314],
              [12.973567, 77.635589, 149794.3328], [12.973525, 77.635685, 149794.3336], [12.969739, 77.620912, 149794.3349]])

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print('K-Means cluster:', kmeans.labels_)

meanshift = MeanShift().fit(X)
print('MeanShift cluster:', meanshift.labels_)

dbscan = DBSCAN(eps=1).fit(X)
print('DBSCAN cluster:', dbscan.labels_)

Output

K-Means cluster: [2 2 2 0 0 1]
MeanShift cluster: [2 0 0 1 1 1]
DBSCAN cluster: [0 0 0 0 0 0]

Here, K-means clustering works really well at clustering the right points. But, as mentioned in the answers, the disadvantage is that I need to mention the number of clusters in the input, which is not possible because there is practically no idea of knowing it in my application logic.

I tried a MeanShift and DBSCAN as well, but, since I don't know what the proper input to them should be, probably, that's why I am not getting the desired result.

So, how do I get the same result as K-Means clustering without passing the number of output clusters, using the other algorithms?

P.S. If you think this question need to be improved/closed, just comment the reason, and I would love to improve it with more details. Just don't go on down voting. I would never know how to improve, which completely defies the existence of Stack Overflow.

k-means works in n dimensions, you just need a way of converting time to distance (e.g. 1 minute = 1 metre) — c2huc2hu
@user3080953 I see, can i get a reference link where I can further dig into your point? — Parthapratim Neog
I don't have one in mind, sorry. I can try to explain fully in an answer though — c2huc2hu
there's nothing you can do to get around the fact that k means needs you to know the number of clusters a priori. you could try several different values of k and see what gives the best result, but if it will vary, you'd be better off with a different algorithm — c2huc2hu

c2huc2hu c2huc2hu · Accepted Answer · 2017-06-19T13:42:33

The k-means algorithm is general and works in any number of dimensions. Thus we need to convert the (time, space) data to three dimensional space.

Say your data is in the format:

data = [
    location: (1, 1), time: "4:00", 
    location: (1, 2), time: "4:01", 
    ... 
]

We need to convert the time axis to a space axis:

def get3DCoordinate(point):
    "tau is a hyper parameter"
    return (location[0], location[1], tau * time.convertToDist())
map(get3DCoordinate, data)

This lets you convert your data to:

data = [
    (1, 1, 960),
    (1, 2, 961),
    ... 
]

These points can be used directly by k means.

Which clustering algorithm to use here?

So, how do I get the same result as K-Means clustering without passing the number of output clusters, using the other algorithms?

3 Answers