2
votes

I have huge data split into 4 csv files. They are supposed to have the same columns and each file is the continuation of the previous. I import the 4 CSV files in Pandas and before merging them I want to compare the columns for all 4 to identify any difference.

  1. How do I do that with Pandas/Python?
  2. Do I use merge or append? csv1 covers 2001 to 2004, csv2 covers 2005 to 2007 and so on, it's just split by time.
2
You should put what you tried.What do you mean by different.... Update your question please.Tzomas
Do your csv files contain column names? If yes, do you know the correct column names which should be present?pansen
Yes csv files contain column names, I want all files (csv2 onwards) to have the same columns as the first file (csv1)MCG Code
The idea is to check if all column names in all files are the sameMCG Code

2 Answers

3
votes

When knowing your column names beforehand, you can explicitly pass them to the pd.read_csv via the usecols parameter. In case of a column name mismatch between your csv file and the predefined column names, an ValueError will be raised automatically.

To merge your csv-files you can use pd.concat:

# define your column names
column_names = ["Col A", "Col B", "Col C", "Col D"]

# setup file paths
base_path = os.path.join("E:/","Datasets","Dataset01") # adopted your example here
file_names = ["file1.csv", "file2.csv", "file3.csv", "file4.csv"]
abs_paths = [os.path.join(base_path, file_name)
             for file_name in file_names]

dfs = pd.concat([pd.read_csv(abs_path, usecols=columns_names) 
                 for abs_path in abs_paths])

In case you want to check if all columns are identical across your csv files, you can simply load only the header of the csv files while using nrows=0:

cols = [pd.read_csv(abs_path, nrows=0).columns
        for abs_path in abs_paths]

cols_identical = [all(cols[0] == colx) for colx in cols[1:]]
all_cols_same = all(cols_identical) 
2
votes

Suppose you have df1,df2 related with csv1,csv2

1. all(df1.columns == df2.columns)
2. pd.concat([df1, df2]) 

Please check concat for detail.