0
votes

Goal: Create a dataframe from all Excel files in a folder that ignores duplicating column headers from previously read Excel files. Look at tables for illustration and clarity.

Problem: Let's say File1.xlsx has column1, and column2 as column headers. And File2.xlsx has column2, and column3 as column headers. The code I have currently "combines" column2 from File1.xlsx, and File2.xlsx. I was wondering if it's possible to create a dataframe that separates each column commensurate to each separate file. I will create tables below to better demonstrate what I am looking for.

File1.xlsx: column1, column2

File2.xlsx: column2, column3

File3.xlsx: column1, column3

Note: All the Excel files are in the folder in alphabetical/numerical order.

### Reading only the Excel files in the folder
FileList_xlsx = [f for f in files if f[-4:] == "xlsx"]
print(FileList_xlsx) # Prints list of Excel files in the folder

### Initializing Data Frame 
df = pd.DataFrame()

### Read Excel files into Python
for f in FileList_xlsx:
    test_df = pd.read_excel(f)
    df = df.append(test_df, ignore_index=True, sort = False)

### Code above only gives 3 columnns in the dataframe (as shown in the example below), 
### when I want 6 columns, even if they're duplicated in another Excel file in the folder
### I'm reading from.

What I'm getting:

column1 column2 column3
File 1 File 1
File 1 File 1
File 2 File 2
File 2 File 2
File 3 File 3
File 3 File 3

What I want:

column1 column2 column2 column3 column1 column3
File 1 File 1
File 1 File 1
File 2 File 2
File 2 File 2
File 3 File 3
File 3 File 3

Tip: To keep things simple, I made the data cells the appropriate file is supposed to be in.

Bonus: If you can also help with adding the option of populating the rest of the empty cells with NAs, that would be useful too!

Let me know if you're having trouble understanding my question, and I'll do my best to clarify.

1
You cannot have two columns in the same df with the same name. Do you want to replace the empty cells with 'NA' or with np.NaN?Jeff Mitchell
Thank you for that tidbit, Jeff! Would it be possible to maybe have like File1.column1, File1.column2, File2.column2, File2.column3, File3.column1, and File3.column3 for duplicate columns within multiple Excel files? The np.NaN, because I want to drop them in future scripting.llew88

1 Answers

1
votes

You could use df.add_prefix() to add a prefix to each column in the test_df as you load it and then append this to the df:

n = 1
for f in FileList:
    test_df = pd.read_excel(f)
    test_df = test_df.add_prefix(f'File{n}.')
    df = df.append(test_df, ignore_index=True, sort = False)
    n+=1

This will give you a unique column for each column you load. Empty cells will be np.NaN.

File1.column1 | File1.column2 | File2.column1 | File2.column2 ...