Goal: Create a dataframe from all Excel files in a folder that ignores duplicating column headers from previously read Excel files. Look at tables for illustration and clarity.
Problem: Let's say File1.xlsx has column1, and column2 as column headers. And File2.xlsx has column2, and column3 as column headers. The code I have currently "combines" column2 from File1.xlsx, and File2.xlsx. I was wondering if it's possible to create a dataframe that separates each column commensurate to each separate file. I will create tables below to better demonstrate what I am looking for.
File1.xlsx: column1, column2
File2.xlsx: column2, column3
File3.xlsx: column1, column3
Note: All the Excel files are in the folder in alphabetical/numerical order.
### Reading only the Excel files in the folder
FileList_xlsx = [f for f in files if f[-4:] == "xlsx"]
print(FileList_xlsx) # Prints list of Excel files in the folder
### Initializing Data Frame
df = pd.DataFrame()
### Read Excel files into Python
for f in FileList_xlsx:
test_df = pd.read_excel(f)
df = df.append(test_df, ignore_index=True, sort = False)
### Code above only gives 3 columnns in the dataframe (as shown in the example below),
### when I want 6 columns, even if they're duplicated in another Excel file in the folder
### I'm reading from.
What I'm getting:
column1 | column2 | column3 |
---|---|---|
File 1 | File 1 | |
File 1 | File 1 | |
File 2 | File 2 | |
File 2 | File 2 | |
File 3 | File 3 | |
File 3 | File 3 |
What I want:
column1 | column2 | column2 | column3 | column1 | column3 |
---|---|---|---|---|---|
File 1 | File 1 | ||||
File 1 | File 1 | ||||
File 2 | File 2 | ||||
File 2 | File 2 | ||||
File 3 | File 3 | ||||
File 3 | File 3 |
Tip: To keep things simple, I made the data cells the appropriate file is supposed to be in.
Bonus: If you can also help with adding the option of populating the rest of the empty cells with NAs, that would be useful too!
Let me know if you're having trouble understanding my question, and I'll do my best to clarify.