Time Series forecast on Spark

Question

I'm trying to forecast power consumption from time series data with Apache Spark. A sample of the data:

03.01.15;22:30;236,25
03.01.15;22:15;240
04.01.15;16:00;243,775

and so on for two years. I have observations every 15 minutes.

What is the best way to predict the power consumption?

I tried LinearRegression, decision trees, etc. I always get huge MSEs (788). I tried passing months, days, hours, and minutes through OneHotEncoder. I also tried forecasting only for weeks, etc.

Means of the data (year, month, day):

[2014.3996710526321,5.726973684210525,15.713815789473673]

Variances of the data (year, month, day):

[0.2403293809070049,10.218579294199253,77.46326844706495]

Test mean squared error:

788.2397552290726

If I pass the values directly, like LabeledPoint(236.25, 2015.0,1.0,3.0,22.0,30.0), the MSE goes up to 1280.8. Passing the model only one observation per day (the daily maximum) does not help either.

However, when I run the KNIME time series example on my data, it does not take the dates and times into consideration; instead it uses lagged power consumption values for each observation.
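To make "lagged" concrete, here is a minimal sketch in plain Python (not KNIME or Spark code). The function name `make_lagged_rows` and the series values are hypothetical and only illustrate the idea: each row's features are the previous readings, and the target is the current reading.

```python
def make_lagged_rows(values, n_lags):
    """Build (features, target) pairs from a time-ordered list of readings.

    features: the n_lags readings immediately before the target, oldest first.
    target:   the reading to be predicted.
    """
    if n_lags < 1:
        raise ValueError("n_lags must be at least 1")
    rows = []
    for i in range(n_lags, len(values)):
        rows.append((values[i - n_lags:i], values[i]))
    return rows


# Hypothetical, time-ordered consumption readings (made up for illustration).
series = [240.0, 236.25, 238.5, 243.775, 241.0]

for features, target in make_lagged_rows(series, n_lags=2):
    print(features, "->", target)
```

With this layout the model never sees the calendar fields at all; it only sees recent consumption, which seems to be what the KNIME example does.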

I see that Cloudera has a library for time series, but I do not understand why I would need it.

Can someone describe the process of producing a good forecast on time series data? In the end I want to input a date and time and get a prediction.

@JorgeMachado Jorge, I also want to use the ARIMA model for forecasting. Could you help me with it if your problem is solved? - Anshul Kalra
Hi, my update so far: you need to pass the last seven days of usage as features for the actual usage. To predict the next days, just pass the last day as a feature. I'm still trying, but it does not work as I expected. - Jorge Machado

lohithbb · Accepted Answer · 2017-03-31T14:01:29

I have multiple questions about your issue, but let me try to work with what you've given me.

First, let's generalise the problem.

You have data in the form <timestamp>, <value>. Given that the data was collected every 15 minutes for 2 years, you have a sample size of 4 x 24 x 365 x 2 = 70,080 observations (rows).
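As a minimal sketch of getting the raw rows into that <timestamp>, <value> form, the snippet below parses the three sample lines from the question in plain Python. It assumes the date is day.month.year (dd.mm.yy) and that the value uses a decimal comma; `parse_reading` is a hypothetical helper name, not part of any library.

```python
from datetime import datetime


def parse_reading(line):
    """Parse 'dd.mm.yy;HH:MM;value' into (timestamp, value).

    Assumption: day comes first in the date, and the value uses a decimal comma.
    """
    date_part, time_part, value_part = line.strip().split(";")
    timestamp = datetime.strptime(date_part + " " + time_part, "%d.%m.%y %H:%M")
    value = float(value_part.replace(",", "."))
    return timestamp, value


sample = ["03.01.15;22:30;236,25", "03.01.15;22:15;240", "04.01.15;16:00;243,775"]

# Sort by timestamp, since the sample rows are not in chronological order.
readings = sorted(parse_reading(line) for line in sample)

# Row count for two non-leap years at four readings per hour.
expected_rows = 4 * 24 * 365 * 2
print(readings[0], expected_rows)
```

Sorting matters here because the sample lists 22:30 before 22:15, and anything built on the order of the readings needs them in time order.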

And let's say you want to develop a regression model to predict its behaviour.

First things first, you need to partition your dataset into training and test sets. You develop a model using the training set and you test the model on the test set. I would suggest a 9:1 split.
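A minimal sketch of such a 9:1 split is below. It assumes the rows are already sorted by time and cuts chronologically, so the test set is the most recent tenth of the data; that choice is my assumption for time series (a random split would also give 9:1 but would mix future rows into training). `chronological_split` is a hypothetical helper name.

```python
def chronological_split(rows, train_parts=9, test_parts=1):
    """Split time-ordered rows into (train, test) by position, not at random.

    The first train_parts/(train_parts + test_parts) of the rows go to
    training; the remaining, most recent rows go to testing.
    """
    if train_parts < 1 or test_parts < 1:
        raise ValueError("train_parts and test_parts must be positive")
    cut = len(rows) * train_parts // (train_parts + test_parts)
    return rows[:cut], rows[cut:]


# Stand-in for the 70,080 time-ordered observations.
rows = list(range(70080))
train, test = chronological_split(rows)
print(len(train), len(test))
```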

Q1. Is this the kind of approach you have taken?

MSE is calculated using the predicted values against the actual values in the test set.
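For reference, a minimal sketch of that calculation in plain Python. The predicted values are made up for illustration, and `mean_squared_error` is defined here rather than taken from a library.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [236.25, 240.0, 243.775]   # sample values from the question
predicted = [238.0, 239.0, 245.0]   # hypothetical model output

mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)  # same units as the readings, easier to interpret
print(mse, rmse)
```

Because MSE is in squared units, its square root is usually easier to judge: the reported test MSE of 788.24 corresponds to an RMSE of about 28, to be compared against readings of around 240 in the sample.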

Q2. You mention an MSE, but I have no idea how it was calculated (only 3 data points are provided) or what range of values you are working with. Can you please confirm how your error is calculated and what the minimization criterion is?

If linear regression fails (it cannot predict the movements successfully), it may be that your model is too simple. Try using an MLP, or combine the regression model with a Bayesian model (as power consumption is a continuous function).
