0
votes

I am completely new to Machine learning algorithms and want some direction to move towards.

General Scenario

I have two parameters Location and Time. I need to group the posts that were uploaded from a particular location at a particular time. There is no specific preset location or time. A random interesting event can happen at any time at any location in the location plane. A group of users can start uploading as soon as they find something interesting at a particular location at a particular time. The algorithm needs to detect these posts that were uploaded at that time in that place.

Accepted scene

An event happens at point A(1,3) at 4:00. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (1,2),(1.5,3),(1,2.5)) around the time 4:00 like 4:01, 4:10, 4:22 etc

Unaccepted scene

  1. An event happens at point A(1,3) at 4:00. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (9,2),(1.5,15),(21,62.5)) around the time 4:00 like 4:01, 4:10, 4:22 etc
  2. An event happens at point A(1,3) at 4:00. People start taking photos of the event and start uploading from around the point A,say, some sample upload locations are (1,2),(1.5,3),(1,2.5)) around the time 4:00 like 14:01, 08:10, 13:22 etc

Understanding so far

From my understand, I see that if this was only based on location, it could have been done using K-Means clustering algorithm. But, since we have Time dimension as well, I need another algorithm that clusters in 3D. I think DBSCAN might do it, but I'm not sure, as I have a really vague understanding.

So, which algorithm might help me here? If not direct answer, I would like some direction towards which I can research, because it's a very vast field to go through every single algo.

EDIT

I tried the following test implementation

from sklearn.cluster import KMeans, MeanShift, DBSCAN
import numpy as np

# Scene one, similar timestamp (turn timestamp into decimal so that the difference is not too large)
# First three are from event 1,
# next 2 are from event 2,
# 3rd one is a random post.
X = np.array([[12.975466, 77.639363, 149794.3292], [12.975273, 77.639358, 149794.3311], [12.975317, 77.639562, 149794.3314],
              [12.973567, 77.635589, 149794.3328], [12.973525, 77.635685, 149794.3336], [12.969739, 77.620912, 149794.3349]])

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print('K-Means cluster:', kmeans.labels_)

meanshift = MeanShift().fit(X)
print('MeanShift cluster:', meanshift.labels_)

dbscan = DBSCAN(eps=1).fit(X)
print('DBSCAN cluster:', dbscan.labels_)

Output

K-Means cluster: [2 2 2 0 0 1]
MeanShift cluster: [2 0 0 1 1 1]
DBSCAN cluster: [0 0 0 0 0 0]

Here, K-means clustering works really well at clustering the right points. But, as mentioned in the answers, the disadvantage is that I need to mention the number of clusters in the input, which is not possible because there is practically no idea of knowing it in my application logic.

I tried a MeanShift and DBSCAN as well, but, since I don't know what the proper input to them should be, probably, that's why I am not getting the desired result.

So, how do I get the same result as K-Means clustering without passing the number of output clusters, using the other algorithms?

P.S. If you think this question need to be improved/closed, just comment the reason, and I would love to improve it with more details. Just don't go on down voting. I would never know how to improve, which completely defies the existence of Stack Overflow.

3
k-means works in n dimensions, you just need a way of converting time to distance (e.g. 1 minute = 1 metre)c2huc2hu
@user3080953 I see, can i get a reference link where I can further dig into your point?Parthapratim Neog
I don't have one in mind, sorry. I can try to explain fully in an answer thoughc2huc2hu
@user3080953 Please do, I would love to know more.Parthapratim Neog
there's nothing you can do to get around the fact that k means needs you to know the number of clusters a priori. you could try several different values of k and see what gives the best result, but if it will vary, you'd be better off with a different algorithmc2huc2hu

3 Answers

1
votes

The k-means algorithm is general and works in any number of dimensions. Thus we need to convert the (time, space) data to three dimensional space.

Say your data is in the format:

data = [
    location: (1, 1), time: "4:00", 
    location: (1, 2), time: "4:01", 
    ... 
]

We need to convert the time axis to a space axis:

def get3DCoordinate(point):
    "tau is a hyper parameter"
    return (location[0], location[1], tau * time.convertToDist())
map(get3DCoordinate, data)

This lets you convert your data to:

data = [
    (1, 1, 960),
    (1, 2, 961),
    ... 
]

These points can be used directly by k means.

1
votes

There are two aspects to your question.

The first is how to incorporate time as another dimension. To answer this, most clustering algorithms (including k-means) work on multi-dimensional datasets. You can convert your time to a number, and then include it as a third dimension in your data. While doing this, you need to consider what units to use, and how the units of time relate to the units of space. Eg: If your location data units are kilometers, then what should be the time equivalent? Say you arrive at 15 minutes. Then you should scale your time dimension such that 1 unit = 15 minutes. (This can be handled later on in some algorithms as well, but you should think of it nonetheless).

The second is what would be a suitable clustering algorithm to use in this scenario. While k-means is the default algorithm to use, it has the drawback that you need to specify the number of clusters. As the number of data points grow/shrink on a day to day basis in your system, it is not intuitive to think of a fixed number of clusters, and not easy to figure out the relationship between the number of clusters and the number of data points either.

You could try the mean shift algorithm for this use-case. Here, you don't have to specify the number of clusters, and the algorithm will discover this as it goes along. However, you do need to specify the "bandwidth" parameter, which roughly decides whether two points at a certain distance from each other are merged to a single cluster, or stay in their own cluster. You may need some iterations to figure out the right bandwidth as well, but this is likely to stay stable for a given application, unlike the number of clusters.

In general, you will need to try out some runs of the clustering and see what you get, and further tune the parameters.

0
votes

it is possible to cluster time series of multimedia data using k-means. this paper explains how