I am extracting data from many CSV files and saving it into one file. This process is repetitive and every so often, I have new versions of the files and so the underlying data needs to be updated. I am having a hard time getting the data that is desired. Here is an example of what I am trying to do:
old file (/old/Test1.csv)
tests col1 col2 col3
test1 11 8 5
test3 9 5 7
test5 12 9 -1
new file (/new/Test1.csv)
tests col2 col3 col4
test1 8 7 15
test3 5 9 10
test7 1 4 9
test9 11 10 12
Note that in the new file, col1 is absent, there is a new column col4, row test5 is absent and there are two new rows test7 and test9. The desired output should:
- Contain all columns and rows from test1.csv and test2.csv
- Include updated data in test2.csv for (row,col) pairs from test1.csv
- If there are no updates to (row,col) pairs from test1.csv, then the data in test1.csv should be used.
- Any empty cells in the updated file should be filled with 0.
For the data shown above, this should be the updated file (Test1_update.csv)
tests col1 col2 col3 col4
test1 11 8 7 15
test3 9 5 9 10
test5 12 9 -1 0
test7 0 1 4 9
test9 0 11 10 12
I am able to get to this point using the code below:
tests col1 col2 col3 col4
test1 Nan 8 7 15
test3 Nan 5 9 10
test5 Nan Nan Nan Nan
test7 Nan 1 4 9
test9 Nan 11 10 12
import pandas as pd
import numpy as np
df1 = pd.read_csv('\\dir\\test1.csv', index_col=0)
df2 = pd.read_csv('\\dir\\test2.csv', index_col=0)
new_index = list(set(list(df1.index.values)+list(df2.index.values)))
new_cols = list(set(list(df1.columns.values)+list(df2.columns.values)))
df3 = pd.DataFrame(index=new_index, columns=new_cols)
df4 = df2.reindex(df3.index)
df4 = df4.join(df3, rsuffix='_P')
df4 = df4.loc[:,~df4.columns.str.endswith('_P')]
print df4