I have a directory of multiple (500+) .csv files, each with multiple columns. For each .csv file, I want to calculate the number of observations, as well as both the sum and mean for one specific column. If I was doing this for one .csv file, I would import the .csv as a dataframe and use something really like:
len(df)
df["column_name"].sum()
df["column_name"].mean()
However, I want to perform these three calculations on each individual .csv file and save them all out to a new dataframe, along with the name of the original .csv file. For example, for 3 files called 'file1.csv', 'file2.csv' and 'file3.csv', I want to create a new dataframe that looks like:
Column No. Observations Sum Mean
file1 10 120 12
file2 20 100 5
file3 5 50 10
I've created a new dataframe:
new = pd.DataFrame(columns=['Column', 'No. Observations', 'Sum', 'Mean'])
I've used os.walk to list all the files in the directory:
import os
path = "/Users/user/directory_name"
for root, directories, files in os.walk(path, topdown=False):
for name in files:
print(os.path.join(root, name))
So, at the moment I can generate the list of files in the directory, I can perform the necessary calculations on each file and I can create a new dataframe that contains the columns I want.
What I can't do is iterate over each individual file in the directory, perform the calculations, and save the results to the new dataframe. How do I put it all together?