Train machine learning model with scikit learn for time-series prediction

Question

I need to train a model with scikit-learn to predict the times when there are likely to be fewer people in a room.

Here is what my dataset looks like:

Time                          PeopleCount
---------------------------------------------
2019-12-29 12:40:10               50
2019-12-29 12:42:10               30
2019-12-29 12:44:10               10
2019-12-29 12:46:10               10
2019-12-29 12:48:10               80
and so on...

This data will be available for 30 days.

Once the model is trained, I will query it to get the likely time when there will be fewer people in the room between 10 AM and 8 PM. I expect the model to respond with 30-minute accuracy, e.g. "3:00 PM to 3:30 PM".

What algorithm can I use for this problem, and how can I achieve the goal? Or are there Python libraries other than scikit-learn that can be used for this purpose?

I am new to machine learning, so sorry for the naive question.

mac13k · Accepted Answer · 2020-01-21T15:21:24

scikit-learn models do not handle timestamps directly, so you will have to break your timestamp column down into a number of features, e.g. day of week, hour, etc. If you need 30-minute accuracy, you will also have to aggregate the PeopleCount column somehow, e.g. record the average, minimum or maximum number of people within each 30-minute interval. It may be a good idea to create lagged features as well, i.e. the people count in the previous time slot, or even 2, 3 or n time slots ago.

Once you have your time features and labels (the corresponding people counts) ready, you can start training your models in the standard way:
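A minimal sketch of that preparation step with pandas, assuming the column names from the question (`Time`, `PeopleCount`). The data here is synthetic, the mean is just one of the possible aggregates, and the names `people_mean` and `lag_1` to `lag_3` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: one reading every 2 minutes for 3 days.
rng = np.random.default_rng(0)
times = pd.date_range("2019-12-29 00:00:10", periods=3 * 24 * 30, freq="2min")
raw = pd.DataFrame({"Time": times, "PeopleCount": rng.integers(0, 100, len(times))})

# Aggregate the readings into 30-minute slots (mean here; min or max work the same way).
slots = (
    raw.set_index("Time")["PeopleCount"]
    .resample("30min")
    .mean()
    .to_frame("people_mean")
)

# Break the timestamp down into plain numeric features.
slots["day_of_week"] = slots.index.dayofweek
slots["hour"] = slots.index.hour
slots["minute"] = slots.index.minute

# Lagged features: the aggregated count 1, 2 and 3 slots earlier.
for lag in (1, 2, 3):
    slots[f"lag_{lag}"] = slots["people_mean"].shift(lag)

# The first rows have no history for the lags, so drop them.
features = slots.dropna()
```

Note that lagged features are only available at prediction time if the recent counts are known, so they suit short-horizon forecasts better than "which slot tomorrow" style queries.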

split your data into training and validation sets,
train each model that you want to try and compare the results.
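The two steps above could look roughly like this. It is only a sketch on synthetic data: the feature layout (day of week, slot index), the daily sine pattern, the chronological split (`shuffle=False`) and the choice of mean absolute error as the metric are all assumptions, not something prescribed by the answer:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 30 days of 30-minute slots with a daily pattern plus noise.
rng = np.random.default_rng(0)
n_days, slots_per_day = 30, 48
slot = np.tile(np.arange(slots_per_day), n_days)
day_of_week = np.repeat(np.arange(n_days) % 7, slots_per_day)
X = np.column_stack([day_of_week, slot])
y = 50 + 30 * np.sin(2 * np.pi * slot / slots_per_day) + rng.normal(0, 5, slot.size)

# Keep the rows in time order so the validation set is the most recent data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)

# Train each candidate model and record its validation error.
models = {
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_val, model.predict(X_val))

print(scores)
```

The model with the lowest validation error is the natural candidate to keep, although with only 30 days of data the comparison will be noisy.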

Any regressor should be suitable for this task, e.g. Ridge, Lasso, DecisionTreeRegressor, SVR etc. Note, however, that if you need the best time slot from a given range, you will have to make predictions for every slot in that range and pick the one that fits your criteria. There may also be cases where the smallest predicted value is still not smaller than the value you compare it with.

If you do not get satisfying results with regressors, e.g. the mean or median squared error is consistently too high, you could recast the problem as classification: instead of training a regressor to predict the number of people, train a classifier to predict whether the count is greater than 50 or not.

There are many ways to approach this problem. Once you try different models and examine the results, you will come up with ways to optimize the parameters, engineer features, pre-process the input etc.
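Picking the best slot from a range can be sketched as follows. The helper `quietest_slot` is hypothetical, and the training data is synthetic and noise-free, built so that the room is quietest at 15:00. The function also returns the predicted count so the caller can check it against whatever threshold counts as "few people":

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data: 4 weeks of 30-minute slots, quietest at 15:00 by construction.
grid = np.array([(d % 7, h, m) for d in range(28) for h in range(24) for m in (0, 30)])
y = 20 + 8 * np.abs(grid[:, 1] + grid[:, 2] / 60 - 15)

model = DecisionTreeRegressor(random_state=0).fit(grid, y)


def quietest_slot(model, day_of_week, start_hour=10, end_hour=20):
    """Predict every 30-minute slot in [start_hour, end_hour) and return the lowest one."""
    candidates = np.array(
        [(day_of_week, h, m) for h in range(start_hour, end_hour) for m in (0, 30)]
    )
    predicted = model.predict(candidates)
    best = int(np.argmin(predicted))
    h, m = int(candidates[best, 1]), int(candidates[best, 2])
    end_h, end_m = (h, 30) if m == 0 else (h + 1, 0)
    label = f"{h:02d}:{m:02d} to {end_h:02d}:{end_m:02d}"
    return label, float(predicted[best])


label, predicted_count = quietest_slot(model, day_of_week=2)
print(label, predicted_count)
```

With real data the minimum may still be a busy slot, which is why the predicted count is returned alongside the label.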
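The classification variant might look like this. Again a sketch on synthetic, noise-free data: the threshold of 50 comes from the example above, while the feature layout, the 3-week/1-week split and the choice of DecisionTreeClassifier are assumptions:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

THRESHOLD = 50  # "busy" cut-off from the example in the text

# Synthetic data: 4 weeks of 30-minute slots as (day_of_week, hour, minute).
rows = np.array([(d % 7, h, m) for d in range(28) for h in range(24) for m in (0, 30)])
counts = 20 + 8 * np.abs(rows[:, 1] + rows[:, 2] / 60 - 15)

# Turn the counts into a binary label: 1 if busy, 0 otherwise.
is_busy = (counts > THRESHOLD).astype(int)

# Train on the first 3 weeks, validate on the last week.
split = 21 * 48
clf = DecisionTreeClassifier(random_state=0).fit(rows[:split], is_busy[:split])
val_accuracy = accuracy_score(is_busy[split:], clf.predict(rows[split:]))

# Query individual slots: Wednesday 15:00 and Wednesday 10:00.
afternoon = clf.predict(np.array([[2, 15, 0]]))[0]
morning = clf.predict(np.array([[2, 10, 0]]))[0]
print(val_accuracy, afternoon, morning)
```

The slots predicted as not busy within the 10 AM to 8 PM window are then the candidates to report.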
