0
votes

Requirement: I have a folder with multiple csv files. I need to perform following:

  1. Scan the input folder for all csv files (file1.csv, file2.csv ..... filen.csv etc) & perform below steps
  2. Open the first csv file (file1.csv) & store the file header in a list & then copy the entire content (including header) of file1.csv & put in combined.csv => Then move file1.csv to a /done folder under the parent folder
  3. Move to next file file2.csv => Compare the header of file2.csv with the header of the first file to ensure they match exactly. If the headers match then copy contents (excluding header) of file2.csv & put in combined.csv => Then move file2.csv to a /done folder under the parent folder. If the header of file2.csv do not match then exclude combining this file, leave it in the same parent folder & move to next file for combining

I have placed sample files on the link gdrive folder with sample CSV files I am open to any solution either with CSV or pandas as long as it does what I want

As a starting point, I was initially working on comparing the header as below which works. However, I am unsure on how to move further

Code:

    import csv        
    def compare_two_csv_headers(csv_file1, csv_file2):
        with open(csv_file1, newline='') as f:
            reader = csv.reader(f)
            frow1 = next(reader)  # gets the first line
            print(frow1)
        with open(csv_file2, newline='') as f:
            reader = csv.reader(f)
            frow2 = next(reader)  # gets the first line
            print(frow2)
        if frow1==frow2:
            print('Same header')
        else:
            print('Different header')

    csv_file1 = 'D:/2009/cm01JAN2009bhav.csv'
    csv_file2 = 'D:/2009/cm01DEC2009bhav.csv'
    compare_two_csv_headers(csv_file1, csv_file2)

Here are the top 10 lines from the first csv file

SYMBOL,SERIES,OPEN,HIGH,LOW,CLOSE,LAST,PREVCLOSE,TOTTRDQTY,TOTTRDVAL,TIMESTAMP,
20MICRONS,EQ,46.5,47,45.7,46.05,46,46.55,7092,328975.25,31-DEC-2009,
3IINFOTECH,EQ,85.8,86.7,84.5,85.15,85.35,85.05,2423812,207760480.3,31-DEC-2009,
3MINDIA,EQ,1855.05,1879.9,1855.05,1865.75,1874.95,1850.45,85,158679.1,31-DEC-2009,
AARTIDRUGS,EQ,107.4,108.75,103.65,104.45,104.9,106.05,84012,8929759.4,31-DEC-2009,
AARTIIND,EQ,51,51.9,48.9,49.2,49.1,50.45,149365,7517110.3,31-DEC-2009,
AARVEEDEN,EQ,64,64.5,63.05,63.85,63.1,62.7,2172,138651.5,31-DEC-2009,
ABAN,EQ,1265,1297,1265,1283.65,1283.2,1260.05,1381290,1773221519.75,31-DEC-2009,
ABB,EQ,756.2,770.85,756.2,767.1,769.55,756.3,292376,223660807.4,31-DEC-2009,
ABCIL,EQ,85.4,89,84.9,86.85,86.95,84.7,59183,5170993.2,31-DEC-2009,
1

1 Answers

0
votes

Consider using pandas methods to iteratively check columns and run import instead of scanning first lines with csv. Also, use os to manage the file names extract and locations with shutil for moving done files. Below builds a list of dataframes for final concatenation outside loop.

import os, shutil
import pandas as pd

def import_csvs(csv_file):

    path = r'/path/to/csv/files'
    csv_files = sorted([f for f in os.listdir(path) if f[-3:] == 'csv'])

    # INITIALIZE DATAFRAME LIST
    df_list = []
    # READ FIRST DF (ASSUMED FIRST IN ALPHABETICAL ORDER)
    first_df = pd.read_csv(os.path.join(path, csv_files[0]))
    # APPEND FIRST DF   
    df_list.append(first_df)

    # MOVE FIRST CSV
    shutil.move(os.path.join(path, csv_files[0]), os.path.join(path,'done',csv_files[0]))

    # LOOP ALL OTHER CSVs SKIPPING FIRST
    for f in csv_files[1:]:                 
       # IMPORT CSV
       tmp = pd.read_csv(os.path.join(path, f))

       # CHECK DF COLUMNS EXACTLY MATCH
       if list(tmp.columns) == list(first_df.columns):  
          # APPEND DF TO LIST
          df_list.append(tmp)

          # MOVE COMPLETED FILE
          shutil.move(os.path.join(path, f), os.path.join(path, 'done', f))

    final_df = pd.concat(df_list)

    return final_df