I am trying out different values of K in K-means clustering on a set of time series. For each experiment I want to sum the time series within each cluster label and run predictions on the sums.
For example: if I cluster the time series into 3 clusters, I want to sum all the time series (column-wise) belonging to cluster 1, all the time series belonging to cluster 2, and the same for cluster 3. After that I will make predictions on each aggregated cluster series, but I do not need help with the prediction part.
I was thinking of adding the cluster labels to the original dataframe and then using .loc in a loop to extract the time series belonging to each cluster. Is there a more efficient way?
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.cluster import KMeans
#create dataframe with time series
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0, 100, size=len(date_rng))
df_pivot = df.pivot_table(columns = 'date', values = df.columns)
#cluster
K = range(1,10,2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)
    print(km.labels_)
    #sum/aggregate all ts in each cluster column-wise
    #forecast next step for each cluster (don't need help with this part)
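For reference, here is a minimal sketch of the column-wise aggregation I am after, written with `groupby` on the label array instead of a `.loc` loop. It assumes a wide frame shaped like `df_pivot` (one row per series, one column per timestamp). The names `wide` and `sum_by_cluster` are hypothetical, the toy data uses a daily frequency to stay small, and `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data shaped like df_pivot: one row per series, one column per timestamp.
# (Daily frequency is an assumption made to keep the example small.)
date_rng = pd.date_range(start="2018-01-01", end="2018-01-08", freq="D")
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.integers(0, 100, size=(20, len(date_rng))),
    index=[f"ts{i}" for i in range(20)],
    columns=date_rng,
)


def sum_by_cluster(wide, k):
    """Cluster the rows of `wide` into k groups and sum each group column-wise."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wide.to_numpy())
    # Grouping by the label array assigns row i to cluster km.labels_[i];
    # .sum() then adds the member series timestamp by timestamp.
    return wide.groupby(km.labels_).sum()


# One aggregated frame per K: rows are cluster labels, columns are timestamps.
cluster_sums = {k: sum_by_cluster(wide, k) for k in range(1, 10, 2)}
```

Each entry of `cluster_sums` has one row per cluster, so `cluster_sums[3].loc[0]` would be the summed series for cluster 0 when K is 3, ready to pass to the forecasting step.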