2
votes

I have a directory of multiple (500+) .csv files, each with multiple columns. For each .csv file, I want to calculate the number of observations, as well as both the sum and mean for one specific column. If I was doing this for one .csv file, I would import the .csv as a dataframe and use something really like:

len(df)
df["column_name"].sum()
df["column_name"].mean()

However, I want to perform these three calculations on each individual .csv file and save them all out to a new dataframe, along with the name of the original .csv file. For example, for 3 files called 'file1.csv', 'file2.csv' and 'file3.csv', I want to create a new dataframe that looks like:

Column    No. Observations    Sum    Mean
file1     10                  120    12
file2     20                  100    5
file3     5                   50     10

I've created a new dataframe:

new = pd.DataFrame(columns=['Column', 'No. Observations', 'Sum', 'Mean'])

I've used os.walk to list all the files in the directory:

import os
path = "/Users/user/directory_name"

for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        print(os.path.join(root, name))

So, at the moment I can generate the list of files in the directory, I can perform the necessary calculations on each file and I can create a new dataframe that contains the columns I want.

What I can't do is iterate over each individual file in the directory, perform the calculations, and save the results to the new dataframe. How do I put it all together?

1

1 Answers

0
votes

For each file create tuple, append to list out and last pass to DataFrame constructor:

import os
path = "/Users/user/directory_name"

out = []
for root, directories, files in os.walk(path, topdown=False):
    for name in files:
        file = os.path.join(root, name)
        df = pd.read_csv(file)
        size = len(df)
        s = df["column_name"].sum()
        m = df["column_name"].mean()
        out.append((name, size, s, m)) 
        
new = pd.DataFrame(out, columns=['Column', 'No. Observations', 'Sum', 'Mean'])