I am completely new to Machine learning algorithms and want some direction to move towards.
General Scenario
I have two parameters Location
and Time
. I need to group the posts that were uploaded from a particular location at a particular time. There is no specific preset location or time. A random interesting event can happen at any time at any location in the location plane. A group of users can start uploading as soon as they find something interesting at a particular location at a particular time. The algorithm needs to detect these posts that were uploaded at that time in that place.
Accepted scene
An event happens at point A(1,3)
at 4:00
. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (1,2)
,(1.5,3)
,(1,2.5)
) around the time 4:00
like 4:01
, 4:10
, 4:22
etc
Unaccepted scene
- An event happens at point
A(1,3)
at4:00
. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are(9,2)
,(1.5,15)
,(21,62.5)
) around the time4:00
like4:01
,4:10
,4:22
etc - An event happens at point
A(1,3)
at4:00
. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are(1,2)
,(1.5,3)
,(1,2.5)
) around the time4:00
like14:01
,08:10
,13:22
etc
Understanding so far
From my understand, I see that if this was only based on location, it could have been done using K-Means clustering algorithm. But, since we have Time dimension as well, I need another algorithm that clusters in 3D. I think DBSCAN might do it, but I'm not sure, as I have a really vague understanding.
So, which algorithm might help me here? If not direct answer, I would like some direction towards which I can research, because it's a very vast field to go through every single algo.
EDIT
I tried the following test implementation
from sklearn.cluster import KMeans, MeanShift, DBSCAN
import numpy as np
# Scene one, similar timestamp (turn timestamp into decimal so that the difference is not too large)
# First three are from event 1,
# next 2 are from event 2,
# 3rd one is a random post.
X = np.array([[12.975466, 77.639363, 149794.3292], [12.975273, 77.639358, 149794.3311], [12.975317, 77.639562, 149794.3314],
[12.973567, 77.635589, 149794.3328], [12.973525, 77.635685, 149794.3336], [12.969739, 77.620912, 149794.3349]])
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print('K-Means cluster:', kmeans.labels_)
meanshift = MeanShift().fit(X)
print('MeanShift cluster:', meanshift.labels_)
dbscan = DBSCAN(eps=1).fit(X)
print('DBSCAN cluster:', dbscan.labels_)
Output
K-Means cluster: [2 2 2 0 0 1]
MeanShift cluster: [2 0 0 1 1 1]
DBSCAN cluster: [0 0 0 0 0 0]
Here, K-means clustering works really well at clustering the right points. But, as mentioned in the answers, the disadvantage is that I need to mention the number of clusters in the input, which is not possible because there is practically no idea of knowing it in my application logic.
I tried a MeanShift and DBSCAN as well, but, since I don't know what the proper input to them should be, probably, that's why I am not getting the desired result.
So, how do I get the same result as K-Means clustering without passing the number of output clusters, using the other algorithms?
P.S. If you think this question need to be improved/closed, just comment the reason, and I would love to improve it with more details. Just don't go on down voting. I would never know how to improve, which completely defies the existence of Stack Overflow.