I couldn't find the information I was looking for, so I'm posting my question here. I'm just getting started with machine learning. I ran my first multiple regression on a time series using the scikit-learn library. My code is shown below:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# df has a datetime index; feature_cols lists the predictor columns
X = df[feature_cols]
y = df[['scheduled_amount']]

linreg = LinearRegression()
tscv = TimeSeriesSplit(max_train_size=None, n_splits=11)
li = []

for train_index, test_index in tscv.split(X):
    # Positional slicing keeps the datetime index on every subset
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    # Copy the test target so a column can be added to it safely below
    y_train, y_test = y.iloc[train_index], y.iloc[test_index].copy()

    linreg.fit(X_train, y_train)
    y_predict = linreg.predict(X_test)

    # score() returns R^2 on the test fold, not the residual sum of squares
    print('R^2: ' + str(linreg.score(X_test, y_test)))

    # predict() returns an (n, 1) array here because y is a one-column DataFrame
    y_test['predicted_amount'] = y_predict.ravel()
    y_test.plot()
Note that my data is a time series, and I want to keep the datetime index in my DataFrame when fitting the model. I am using TimeSeriesSplit for cross-validation, but I still don't really understand how cross-validation works here. First, is cross-validation needed at all for a time series dataset? Second, should I use the coefficients (coef_) from the last fold, or average the coefficients from all folds, for my future predictions?
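
To make the second question concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration only: the random features f1 and f2, the 120 daily rows, and the use of 5 splits instead of my 11. It collects coef_ from each fold and prints the fold average, the last fold's values, and, as a third option I have not tried in my own code, the coefficients from a refit on all rows. It also checks that each TimeSeriesSplit training window ends before its test window begins.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for my data (assumption): 120 daily rows, two features
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=120, freq="D")
X = pd.DataFrame(rng.normal(size=(120, 2)), index=dates, columns=["f1", "f2"])
y = 3.0 * X["f1"] - 2.0 * X["f2"] + rng.normal(scale=0.1, size=120)

tscv = TimeSeriesSplit(n_splits=5)
fold_coefs, fold_scores = [], []
for train_idx, test_idx in tscv.split(X):
    # Each training window ends strictly before its test window starts
    assert train_idx.max() < test_idx.min()
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    fold_coefs.append(model.coef_)
    fold_scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))

mean_coef = np.mean(fold_coefs, axis=0)          # option A: average over folds
last_coef = fold_coefs[-1]                       # option B: last (largest) training window
full_coef = LinearRegression().fit(X, y).coef_   # option C: refit on all rows

print("mean R^2 across folds:", np.mean(fold_scores))
print("mean of fold coefs:", mean_coef)
print("last fold coefs:   ", last_coef)
print("refit on all data: ", full_coef)
```

On this toy data the three sets of coefficients should come out close to each other because the underlying relationship never changes; I don't know whether that would hold for my real series, which is really what I am asking.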