2
votes

I want to use time series with Pandas. I read multiple time series, one by one, from a CSV file whose "Date" column holds the dates as YYYY-MM-DD:

Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
2005-05-01,7762,11092,33789
2005-06-01,7652,10898,34245
2005-07-01,7403,12787,42020
2005-08-01,7968,13235,36190
2005-09-01,8345,12141,36038
2005-10-01,8553,12067,41089
2005-11-01,8880,11603,59415
2005-12-01,8331,9175,70736


df = pd.read_csv(csv_file, index_col = 'Date',header=0)
Series_list = df.keys()

The time series can have different frequencies (day, week, month, quarter, year), and I want to index each time series according to a frequency I decide on before I generate the ARIMA model. Could someone please explain how I can define the frequency of the series?

stepwise_fit = auto_arima(df[Series_name]....
1
It sounds like you want resample (pandas.pydata.org/pandas-docs/stable/generated/…). Hard to say without an example of your data, though. - johnchase
No, I just want to make sure that the time series 'understands' the frequency in the 'Date' column, so that auto.arima can identify the correct seasonal difference factor. - Florian C
I am having a hard time understanding exactly what you need to do. Are you getting an error? Or is the result you are getting unexpected? I would take a look at how to write an MCVE, as it will likely get you better answers to your problem. - johnchase
What I want is to define the data frequency right from the beginning, when I import from CSV and define the time series. The reason: for auto.arima to compute the best ARIMA model, it needs to know the frequency of the input data. Otherwise, it will often fail to consider an ARIMA model with a seasonal difference. Compare with R: if I do not specify the frequency when I define the time series, then the auto.arima result will not be the best model. I hope this is clear now. - Florian C
How do you formally define the frequency value for an individual row? How do you plan to compute it based on the Date values in the previous and the following rows? - Dmitry Duplyakin

1 Answer

3
votes

pandas has a built-in function for this, pandas.infer_freq():

import pandas as pd
df = pd.DataFrame({'Date': ['2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01'],
                  'Date1': ['2005-01-01', '2005-01-02', '2005-01-03', '2005-01-04'],
                  'Date2': ['2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01'],
                  'Date3': ['2006-01-01', '2006-02-06', '2006-03-11', '2006-04-01']})
df['Date'] = pd.to_datetime(df['Date'])
df['Date1'] = pd.to_datetime(df['Date1'])
df['Date2'] = pd.to_datetime(df['Date2'])
df['Date3'] = pd.to_datetime(df['Date3'])

pd.infer_freq(df.Date)
#'MS'
pd.infer_freq(df.Date1)
#'D'
pd.infer_freq(df.Date2)
#'AS-JAN'  (newer pandas releases report this as 'YS-JAN')

Alternatively, you can use the .dt accessor of a datetime column:

df.Date.dt.freq
#'MS'

Of course, if your data doesn't have a regular frequency, then nothing can be inferred and you get None back:

pd.infer_freq(df.Date3)
# None
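
If you already know which frequency you want, one option for irregular dates like Date3 is to resample them onto that frequency. The following is only a sketch: the values are made up for illustration, and summing within each month is an assumption that may not suit your data (a mean or the last value might be more appropriate).

```python
import pandas as pd

# Irregular dates taken from the Date3 column above; the values are invented.
s = pd.Series(
    [10, 20, 30, 40],
    index=pd.to_datetime(["2006-01-01", "2006-02-06", "2006-03-11", "2006-04-01"]),
)

# No regular spacing, so nothing can be inferred.
inferred = pd.infer_freq(s.index)
print(inferred)

# Aggregate into month-start ("MS") bins. Summing is an assumption here;
# choose the aggregation that matches what the numbers mean.
monthly = s.resample("MS").sum()
print(monthly.index.freqstr)
```

The resampled series has a regular DatetimeIndex that carries the chosen frequency.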

The frequency strings are documented under offset-aliases in the pandas documentation.
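
To tie this back to the question, here is a minimal sketch of attaching the inferred frequency to the index right after reading the CSV. It assumes the file looks like the sample in the question (inlined here through io.StringIO so the snippet is self-contained) and that the dates are complete and evenly spaced. Using asfreq to set the frequency is one possible approach, not the only one.

```python
import io
import pandas as pd

# First rows of the CSV from the question, inlined for a self-contained example.
csv_text = """Date,Business,Education,Holiday
2005-01-01,6665,8511,86397
2005-02-01,8910,12043,92453
2005-03-01,8834,12720,78846
2005-04-01,8127,11667,52644
"""

# parse_dates=True makes the 'Date' index a DatetimeIndex instead of plain strings.
df = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

# A freshly parsed index does not carry a frequency yet.
freq_before = df.index.freq

# Infer the frequency from the dates and attach it to the index.
inferred = pd.infer_freq(df.index)
if inferred is not None:
    # asfreq conforms the frame to the given frequency; rows for any
    # missing dates would be added and filled with NaN.
    df = df.asfreq(inferred)

print(freq_before, inferred, df.index.freqstr)
```

If you would rather decide the frequency yourself, pass it directly, for example df.asfreq('MS'), instead of relying on the inferred value.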