Most clustering algorithms require a distance matrix. Building one is easy when the data is low-dimensional, but what about time series with around 8,000 points each?
# Fill the pairwise distance matrix: O(n^2) distance calls for n series
for i in range(total_series):
    for j in range(total_series):
        dis[i][j] = distance(series[i], series[j])
Clearly, the minimum time required to build this matrix is of order O(n^2) in the number of series. On top of that, each distance computation compares all ~8,000 points of the two series, so the total cost becomes very high. I am only talking about an aligned distance (Euclidean) here, not some edit distance.
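For concreteness, here is a minimal sketch of the aligned, point-by-point Euclidean distance I have in mind (assuming numpy and equal-length series; the function name distance just matches the loop above). It is still O(length of the series) per pair:

import numpy as np

def distance(a, b):
    # Aligned, point-by-point Euclidean distance between two equal-length series.
    # Illustrative helper for the loop above; still O(length) per call.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sqrt(np.sum((a - b) ** 2))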
Since we have around 50,000 time series to cluster, O(n^2) means roughly 1.25 billion pairs just for those for loops. I need to compute the distance function in minimal time through some indexing or pre-processing technique. Note that the distance function is a point-by-point comparison.
Can somebody suggest techniques to compute the distance between two time series in less than O(length of the time series) through some pre-processing? Or suggest methods to cluster without building the full distance matrix with O(n^2) time complexity?