1
votes

I have a directory with excel files that I'm looping through and reading a sheet from each into a Pandas dataframe. Each file contains a month of data (example name = "Savings January 2019.xlsx"). There is not a column in the excel sheet for a date, so I'd like to add a column for "Date" to the dataframe and read each file by month and year in the workbook name (ex. "January 2019") and add 'MM-DD-YYYY' (ex. '01-01-2019' as the date value for each row read in.

Below is my working loop to read 12 excel workbooks in without dates, producing only all 12 month totals. I need the dates so I can visualize the data by month.

df_total = pd.DataFrame()

for file in files:        # loop through Excel files (each file adds date value based on file name)

    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)
        sheets = excel_file.sheet_names

        for sheet in sheets:               # loop through sheets inside an Excel file
            df = excel_file.parse(sheet_name = "Group Savings")
            df_total = df_total.append(df)

Current df:

     State        Group      Value
0   Illinois    000000130   470.93
1   Illinois    000000130   948.33
2   Illinois    000000784   3498.42
3   Illinois    000000784   16808.16
4   Illinois    000002077   7.00

Need df:

     State        Group        Date           Value
0   Illinois    000000130   01-01-2019        470.93
1   Illinois    000000130   01-01-2019        948.33
2   Illinois    000000784   01-01-2019       3498.42
3   Illinois    000000784   02-01-2019       6808.16
4   Illinois    000002077   02-01-2019          7.00

I've done a little research and think it's something like creating the column then adding the date value but can't figure out how to parse the filename to do so and am clearly a beginner here.

for sheet in sheets:               # loop through sheets inside an Excel file
   df = excel_file.parse(sheet_name = "Group Savings")
   df_total = df_total.append(df)
   df_total['Date'] = #if excel_file contains 'January 2019', then df_total['Date'] == '01-01-2019
2

2 Answers

1
votes

You had the concept right, and your code is almost there. All you need to add is the date parsing now.

You can use Python's strptime() to parse the date in the filename.

https://docs.python.org/3/library/datetime.html

For example, if you have a filename like "Savings January 2019.xlsx", then you can parse it like below. Note that this isn't the only way to parse the string, there are several other variations that could work with this method.

from datetime import datetime
string = 'Savings January 2019.xlsx'
month_str = string.split(' ')[1]
year_str = string.split(' ')[2].split('.')[0]
date_object = datetime.strptime(month_str + year_str, "%B%Y")

Here is a good overview of python date string formatting: https://strftime.org/

Once you have your date object, you just need to add it to your dataframe now.

df['Date'] = date_object
0
votes

Thanks for the help Robert! Here was the final code. Please note the filename was actually longer and had some company info I left-out, hence the changes in the .split

from datetime import datetime

#create empty dataframe
df_total = pd.DataFrame()

# loop through Excel files
for file in files:                         
    if file.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file)

        # parse excel filename to take month and year and save as date object for Date column
        month_str = file.split(' ')[4]      
        year_str = file.split(' ')[5].split('.')[0]
        date_object = datetime.strptime(month_str + year_str, "%B%Y")   
        
        # loop excel sheets and add "Date" column, populating with date from parsed filename
        sheets = excel_file.sheet_names
        for sheet in sheets:          # loop through sheets inside an Excel file         
            df = excel_file.parse(sheet_name = "Group Savings")
            df_total = df_total.append(df)
            df_total['Date'] = date_object