So I'm trying to do power consumption forecasting on time series data with Apache Spark. A sample of the data is:
03.01.15;22:30;236,25
03.01.15;22:15;240
04.01.15;16:00;243,775
and so on for two years, with one observation every 15 minutes.
What is the best way to predict the power consumption?
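For reference, this is roughly how I load and parse the file (a simplified sketch, not my exact code; the path power.csv, the Reading case class, the field name kw and the spark-shell's sc are just placeholders):

```scala
import java.time.LocalDateTime

case class Reading(timestamp: LocalDateTime, kw: Double)

val readings = sc.textFile("power.csv").map { line =>
  // e.g. "03.01.15;22:30;236,25" = day.month.year;hour:minute;consumption, comma as decimal separator
  val Array(date, time, value) = line.split(";")
  val Array(day, month, year) = date.split("\\.").map(_.toInt)
  val Array(hour, minute) = time.split(":").map(_.toInt)
  Reading(
    LocalDateTime.of(2000 + year, month, day, hour, minute), // two-digit year, assumed 20xx
    value.replace(",", ".").toDouble                         // "236,25" -> 236.25
  )
}
```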
I tried LinearRegression, decision trees, etc., and I always get huge MSEs (around 788).
I tried passing month, day, hour and minute through a OneHotEncoder (see the sketch below), and I also tried doing the forecast only for weeks, etc.
Means of the data (year, month, day):
[2014.3996710526321, 5.726973684210525, 15.713815789473673]
Variance of the data (year, month, day):
[0.2403293809070049, 10.218579294199253, 77.46326844706495]
Test Mean Squared Error: 788.2397552290726
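Roughly, the feature pipeline I tried looks like this (a simplified sketch, not my exact code; I'm assuming the Spark 2.x ML API, and the DataFrame and the column names month, day, hour, minute, label are placeholders):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

// training has numeric columns month, day, hour, minute and the consumption in "label"
def fitCalendarModel(training: DataFrame) = {
  // one dummy-encoded vector per calendar field
  val encoders: Seq[PipelineStage] = Seq("month", "day", "hour", "minute").map { c =>
    new OneHotEncoder().setInputCol(c).setOutputCol(s"${c}_vec")
  }
  val assembler = new VectorAssembler()
    .setInputCols(Array("month_vec", "day_vec", "hour_vec", "minute_vec"))
    .setOutputCol("features")
  val lr = new LinearRegression() // uses "features" / "label" by default
  new Pipeline().setStages((encoders :+ assembler :+ lr).toArray).fit(training)
}
```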
If I pass the values directly, like LabeledPoint(236.25, Vectors.dense(2015.0, 1.0, 3.0, 22.0, 30.0)), the MSE goes up to 1280.8. Passing the model only one observation per day (the daily maximum) does not help either.
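The "raw values" variant looks roughly like this (again just a sketch; readings is the parsed RDD from the first snippet):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// label = consumption, features = the raw calendar fields of the timestamp
val points = readings.map { r =>
  val t = r.timestamp
  LabeledPoint(r.kw, Vectors.dense(
    t.getYear.toDouble, t.getMonthValue.toDouble, t.getDayOfMonth.toDouble,
    t.getHour.toDouble, t.getMinute.toDouble))
}
```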
But if I use KNIME and try its time series example with my data, it does not take the dates and times into consideration at all; instead it lags the power consumption for each observation.
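If I understand the KNIME example correctly, the equivalent on a Spark DataFrame would be something like this (a sketch with my own names; df is assumed to have a timestamp column and the consumption in kw):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// order the readings by time and add previous consumption values as extra columns
val w = Window.orderBy("timestamp")

val lagged = df
  .withColumn("kw_lag1", lag("kw", 1).over(w))    // 15 minutes earlier
  .withColumn("kw_lag2", lag("kw", 2).over(w))    // 30 minutes earlier
  .withColumn("kw_lag96", lag("kw", 96).over(w))  // same time the previous day
  .na.drop()                                      // drop rows that have no lag yet
```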
I see that Cloudera has a library for time series, but I do not understand why I would need it.
Can someone describe the process of making a good forecast on time series data? In the end I want to input a date and time and get a prediction back.