4
votes

I have time series of different lengths. I want to cluster them based on DTW distance but could not find any library that supports this. sklearn raises an error outright, while tslearn's k-means gave a wrong answer.

The problem goes away if I pad the series with zeros, but I am not sure whether padding time-series data is valid when clustering.

Suggestions for other clustering techniques for time series data are welcome.

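# Assumption: `sequence` is Keras's preprocessing module and TimeSeriesKMeans
# comes from tslearn; the original post omitted the imports.
from keras.preprocessing import sequence
from tslearn.clustering import TimeSeriesKMeans

# Length of the longest series in train_1
max_length = max((len(i) for i in train_1), default=0)
print(max_length)

# Zero-pad every series to max_length. Note: pad_sequences pads at the front
# and casts to int32 unless padding= and dtype= are set explicitly.
train_1 = sequence.pad_sequences(train_1, maxlen=max_length)
km3 = TimeSeriesKMeans(n_clusters=4, metric="dtw", verbose=False, random_state=0).fit(train_1)

print(km3.labels_)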
2
I am the one who asked the question. After further analysis I reached the conclusion that padding is not the solution, as it gives different answers on data with more than 2 classes. - Yash Gupta
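A small sketch of why zero-padding can change the result: DTW can already compare series of different lengths, and padding adds artificial samples that the alignment has to pay for. The `dtw_distance` helper is a hypothetical plain-Python implementation that uses absolute differences as the local cost (not necessarily the formulation any particular library uses), and the toy series are invented for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Invented toy data: the same shape sampled at two different rates.
short = [1, 2, 3]
long = [1, 1, 2, 2, 3, 3]

# Front-padding with zeros, which is what Keras pad_sequences does by default.
padded = [0] * (len(long) - len(short)) + short

print(dtw_distance(short, long))   # 0.0: DTW warps one onto the other
print(dtw_distance(padded, long))  # 3.0: each padded zero costs at least 1
```

Under these assumptions the unpadded series are identical as far as DTW is concerned, while the padded version is not, so padding can move a series relative to the others.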

2 Answers

3
votes

You can try a custom-made k-means or another clustering algorithm; the k-means source code is readily available in the sklearn library. Padding is not a great option, as it changes the problem itself. You can also use tslearn or pyclustering (for optimal clusters) as alternatives, but remember to use DTW distance rather than Euclidean distance.
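One possible shape for such a custom-made approach, as a minimal sketch rather than a definitive implementation: it swaps k-means for k-medoids, because a medoid is always one of the input series and so no averaging of series with different lengths is needed. `dtw_distance` and `k_medoids_dtw` are hypothetical helper names, the local cost is the absolute difference, the farthest-point initialization is just one simple deterministic choice, and the data is invented.

```python
def dtw_distance(a, b):
    """Plain DTW with |x - y| as the local cost; lengths may differ."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def k_medoids_dtw(series, k, n_iter=20):
    """Cluster variable-length series with k-medoids on a DTW distance matrix."""
    n = len(series)
    # Precompute the symmetric pairwise DTW distance matrix.
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw_distance(series[i], series[j])

    # Farthest-point initialization: start at series 0, then repeatedly add
    # the series that is farthest from all medoids chosen so far.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    def assign(current):
        # Label each series with the index of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(n_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


# Invented toy data: three low-valued and three high-valued series, no padding.
series = [
    [1, 2, 3],
    [1, 1, 2, 2, 3, 3],
    [1, 2, 2, 3, 4],
    [10, 11, 12, 13],
    [10, 10, 11, 12, 12, 13],
    [10, 11, 13],
]
labels, medoids = k_medoids_dtw(series, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The full distance matrix costs O(n²) DTW evaluations, so this only suits small datasets; for larger ones a library implementation is likely the better route.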

-1
votes

I had the same issue because my series do not all have the same length. I padded each series with trailing zeros up to the maximum length. I tested a few clustering types on the data, and "partitional" worked surprisingly well compared with the others. I'm not an expert, but this worked well enough for my needs.

Let me know if you find a better way.

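library(dtwclust)  # provides tsclust()

# data_ts_ (the zero-padded series) and max_clusters are assumed to be defined earlier
data_clusters_results <-
  tsclust(
    series = data_ts_,
    type = "partitional", # options: "partitional", "hierarchical", "fuzzy"
    k = 2:max_clusters,   # one clustering per value of k
    preproc = NULL,
    distance = "gak",     # options: "dtw", "dtw2", "dtw_basic", "gak"
    trace = TRUE
  )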