
I am trying out different variations of K in K-means clustering on a set with time series data. For each experiment I want to sum up the time series for each cluster label and perform predictions on them.

For example, if I cluster the time series into 3 clusters, I want to sum (column-wise) all the time series belonging to cluster 1, then do the same for cluster 2 and cluster 3. After that I will make predictions on each aggregated cluster time series, but I do not need help with the prediction part.

I was thinking of adding the cluster labels to the original dataframe and then using .loc in a loop to extract the time series belonging to each cluster. Is there a more efficient way?

import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.cluster import KMeans

#create dataframe with time series
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0,100,size=(len(date_rng)))
df_pivot = df.pivot_table(columns = 'date', values = df.columns)

#cluster
K = range(1,10,2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)

    print(km.labels_)

    #sum/aggregate all ts in each cluster column-wise


    #forecast next step for each cluster (don't need help with this part)


1 Answer


You can access data points for every cluster and then sum their values. Something like this:

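labels = km.labels_
centroids = km.cluster_centers_
cluster_sums_dict = {}  # cluster number -> column-wise sum of its time series
for i in range(k):
    # select the rows (time series) assigned to cluster i
    temp_cluster = df_pivot[labels == i]
    # sum those series column-wise, i.e. one total per timestamp
    cluster_sums_dict[i] = temp_cluster.sum(axis=0)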
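
If you would rather avoid the explicit loop, the per-cluster sum can probably be done in one step with a groupby on the label array. Below is a minimal sketch, assuming df_pivot has one row per time series and one column per timestamp, as in the question. The tiny frame and the labels array are made up for illustration; in your code the labels would come from km.labels_.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_pivot: one row per time series, one column per timestamp.
dates = pd.date_range("2018-01-01", periods=4, freq="D")
df_pivot = pd.DataFrame(
    [[1, 2, 3, 4],
     [10, 20, 30, 40],
     [100, 200, 300, 400]],
    index=["ts0", "ts1", "ts2"],
    columns=dates,
)

# Hypothetical cluster labels, one per row; in the question this is km.labels_.
labels = np.array([0, 1, 0])

# Group the rows by cluster label and sum column-wise within each group.
# The result has one row per cluster and one column per timestamp.
cluster_sums = df_pivot.groupby(labels).sum()
print(cluster_sums)
```

Each row of cluster_sums is then the aggregated time series for one cluster, so cluster_sums.loc[i] can be passed to the forecasting step for cluster i.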

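As a side note: instead of aggregating the values in each cluster, could you use the centroid of each cluster for prediction? Once K-means has converged, a centroid is the mean of the series in its cluster, so it differs from the cluster sum only by a factor of the cluster size.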