Using Pandas to average data across excel sheets with matching column data

Question

See data snippet. I have similar data across multiple sheets in Excel with each sheet being a different year. For each month, day, and time (12 and 00), I would like to take the average of column K across all sheets. However, there is not identical # of rows of data for each sheet. I would like to take the average of K across all sheets but only if data in columns A,B,E,G all match across the sheets. So from the data example, I would get the average value of K2 for all sheets where in every sheet month=1, date=1, time=am,and mb=100.

I think it would make more sense to organize your data into a MultiIndex dataframe, where the primary key is the sheet name. Unfortunately columns in pandas are not "ordered" as they are in Excel, so you'll need some sort of identifying key applied in Excel (perhaps the index from =COLUMN()) — rgk

Erfan Erfan · Accepted Answer · 2019-03-22T15:43:58

Without having your data I made two example DataFrames

First of all you need to import the data from excel into pandas with:

df1 = pd.read_excel('name_file.xlsx', sheet_name='year2018')

And do this for each year.

After that you can use my example to get your expected output:

# Example dataframe 1
print(df1)

   Month  Date  Year Time   mb  dir
0      1     1  2018   AM  100  265
1      1     1  2018   AM  150  265
2      1     1  2018   AM  200  270
3      1     1  2018   AM  250  270
4      1     1  2018   PM  100  265
5      1     1  2018   PM  150  265
6      1     1  2018   PM  200  265
7      1     1  2018   PM  250  265

#Example dataframe2
print(df2)

   Month  Date  Year Time   mb  dir
0      1     1  2019   AM  100  275
1      1     1  2019   AM  150  275
2      1     1  2019   AM  200  280
3      1     1  2019   AM  250  280
4      1     1  2019   PM  100  275
5      1     1  2019   PM  150  275
6      1     1  2019   PM  200  275
7      1     1  2019   PM  250  280

We can use pandas.concat to append the dataframes together (in your case these can be more than two).

df_all = pd.concat([df1, df2], ignore_index=True)
print(df_all)

    Month  Date  Year Time   mb  dir
0       1     1  2018   AM  100  265
1       1     1  2018   AM  150  265
2       1     1  2018   AM  200  270
3       1     1  2018   AM  250  270
4       1     1  2018   PM  100  265
5       1     1  2018   PM  150  265
6       1     1  2018   PM  200  265
7       1     1  2018   PM  250  265
8       1     1  2019   AM  100  275
9       1     1  2019   AM  150  275
10      1     1  2019   AM  200  280
11      1     1  2019   AM  250  280
12      1     1  2019   PM  100  275
13      1     1  2019   PM  150  275
14      1     1  2019   PM  200  275
15      1     1  2019   PM  250  280

Now we can use pandas.Groupby.Series.mean to get your expected output:

print(df_all.groupby(['Month', 'Date', 'Time', 'mb']).dir.mean().reset_index())

   Month  Date Time   mb    dir
0      1     1   AM  100  270.0
1      1     1   AM  150  270.0
2      1     1   AM  200  275.0
3      1     1   AM  250  275.0
4      1     1   PM  100  270.0
5      1     1   PM  150  270.0
6      1     1   PM  200  270.0
7      1     1   PM  250  272.5

Using Pandas to average data across excel sheets with matching column data

1 Answers