1
votes

I have a data frame (df) with columns named date, Year, Month, Day, Hour and Energy. It is multiyear Time Series which I want to convert into a averaged single year time series with (8760 points i.e 365 * 24 points) where the column Energy_Mean is averaged out value.

df is

date           Year  Month  Day  Hour       Energy
1/1/1999 0:00  1999   Jan    1     1        45.0     
1/1/1999 1:00  1999   Jan    1     2        73.5     
1/1/1999 2:00  1999   Jan    1     3        82.4     
1/1/1999 3:00  1999   Jan    1     4        90.0     
1/1/1999 4:00  1999   Jan    1     5        72.2
.
.
.
12/31/1999 23:00  1999 Dec  12    24       77.0
.
.
.
12/31/2019 23:00  2019 Dec  12    24       84.3

Task is to convert it into an averaged form as shown below:

Month  Day  Hour      Energy_Mean
Jan    1     1        22.45     
Jan    1     2        73.5     
Jan    1     3        57.4     
Jan    1     4        88.0     
Jan    1     5        33.2
.
.
.
Dec    31    24       77.0

Trying to figure out whether pivot_table or groupby is a better method of pandas to use to convert time series into 8760 count data frame. Additionally, I want the output to be sorted by months NOT alphabetically. Like Jan, Feb, March, April NOT April, August..

My Code is:

p50_8760 = df.groupby(['Month', 'Day', 'Hour'])['Energy'].mean()
df_p50_8760 = p50_8760.to_frame()

The output file does not have column names or the data points count of 8760 data points.

1

1 Answers

0
votes

According to the response in this SO question Pandas: group by and Pivot table difference pivot_table and groupby might be both equally well suited as they only differ in the shape of the result. So pick that one that you find easier to work with.

For my example I will use a pivot_table.

In order to sort by a Months' index instead of alphabetically by name, I add an additional column 'Month_ind'. You could do the mapping by hand, of course. Because we already have a datetime column present, I chose to let Pandas do this step.

The numerical column 'Month_ind' can then be used to sort in the end:

df = pd.read_csv('data/multi_year_ts.csv')

df['date'] = pd.to_datetime(df['date']) # convert column to datetime

df['Month_ind'] = df['date'].map(lambda e: e.month)

pivot = pd.pivot_table(df, index=['Month_ind', 'Day', 'Hour'], columns=['Year'], values=['Energy'])

print(pivot.sort_values('Month_ind'))

Result:

                   Energy                  
Year                 1999  2005  2007  2019
Month_ind Day Hour                         
1         1   1      45.0  60.4  55.2   NaN
              2      73.5   NaN   NaN   NaN
              3      82.4   NaN   NaN   NaN
              4      90.0   NaN   NaN   NaN
              5      72.2   NaN   NaN   NaN
12        12  24     77.0   NaN   NaN  84.3

Note that the values are not correct (and mostly NaN) as I only had a very small test sample.

To get the mean value for a specific Hour on a given date over all years, transpose the pivot first:

print(pivot.T.mean())

Final Result:

Month_ind  Day  Hour
1          1    1       53.533333
                2       73.500000
                3       82.400000
                4       90.000000
                5       72.200000
12         12   24      80.650000
dtype: float64