R - Apply function on multiple data frames


I loaded several data sheets as data frames in R with:

temp = list.files(pattern="*.csv")
for (i in 1:length(temp)) assign(temp[i], read.csv(temp[i]))

Now I would like to apply a function on all data frames. I thought about something like:

kappa1_mean_h_stem <- lapply(df.list, mean_h_stem)

Where df.list contains a list of all data frames.

mean_h_stem <- function(x) {
  mean(x[1,3])
}

I want the function to return the mean of a specific column, but R tells me I have the wrong number of dimensions.

r function apply dimensions

Do you expect kappa1_mean_h_stem to contain a single mean across all data frames, or one mean per data frame? – Len Greski

2 Answers

votes

I think the error arises because you passed x[1,3], which returns only the value in the first row of the third column. I assume you want the mean of the same column in each of the data frames, so I modified your function slightly so that you can pass the data and the name or position of the column:

mean_h_stem <- function(dat, col) { mean(dat[, col], na.rm = TRUE) }
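To make the difference between the two kinds of indexing concrete, here is a small sketch on a made-up data frame (`toy` and its column names are illustrative, not taken from the question). It also shows one plausible way to get the "incorrect number of dimensions" message: two-index subsetting applied to something that is not a data frame, such as a file name. Whether the question's df.list actually held file names is only an assumption.

```r
# Hypothetical data frame, for illustration only
toy <- data.frame(id = 1:3, grp = c("a", "b", "c"), h_stem = c(10, 20, 60))

cell <- toy[1, 3]          # a single cell: first row, third column
column <- toy[, 3]         # the whole third column as a numeric vector
col_mean <- mean(toy[, 3]) # mean of that column

# Two-index subsetting fails on a plain character vector such as a file name
# (assumption: df.list may have contained names rather than data frames)
msg <- tryCatch("file1.csv"[1, 3], error = function(e) conditionMessage(e))
msg
```

Here `cell` is one number while `column` holds all three values, which is why the mean has to be taken over `dat[, col]`.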

The column can be selected using an integer:

lapply(df.list, mean_h_stem, 2)

Or a column name, expressed as a string:

lapply(df.list, mean_h_stem, 'col_name')

Passing the second argument this way can feel a little unintuitive, so you can make the call more explicit with an anonymous function:

lapply(df.list, function(x) mean_h_stem(dat = x, col ='col_name'))

This works for one column at a time, per your question, but you could easily modify it to handle multiple columns.
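One possible way to handle several columns at once is to swap mean() for colMeans(). This is only a sketch: `mean_cols`, the column names, and the two small data frames standing in for df.list are all made up for illustration.

```r
# Hypothetical helper: means of several columns of one data frame
mean_cols <- function(dat, cols) {
  # drop = FALSE keeps a data frame even when only one column is requested
  colMeans(dat[, cols, drop = FALSE], na.rm = TRUE)
}

# Made-up data frames standing in for df.list
df.list <- list(
  a = data.frame(h_stem = c(1, 2, 3), d_stem = c(10, 20, NA)),
  b = data.frame(h_stem = c(4, 6), d_stem = c(30, 50))
)

# One named numeric vector of column means per data frame
multi_means <- lapply(df.list, mean_cols, cols = c("h_stem", "d_stem"))
multi_means$a
```

Each list element is a named vector, so `multi_means$a["h_stem"]` picks out a single mean.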

As an aside, to read in the csv files, you could also use lapply() with read.csv:

# list.files() expects a regular expression, not a shell glob
temp <- list.files(pattern = '\\.csv$')
df.list <- lapply(temp, read.csv)
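Putting the pieces together, the whole workflow can be sketched end to end. The directory, file names, and the `h_stem` column below are invented so that the example runs anywhere; substitute your own files and column.

```r
mean_h_stem <- function(dat, col) mean(dat[, col], na.rm = TRUE)

# Write two small made-up CSV files into a temporary directory
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(h_stem = c(1, 2, 3)),
          file.path(csv_dir, "plot_a.csv"), row.names = FALSE)
write.csv(data.frame(h_stem = c(4, 6)),
          file.path(csv_dir, "plot_b.csv"), row.names = FALSE)

# Read every CSV into a list of data frames, then apply the function to each
csv_files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
df.list <- lapply(csv_files, read.csv)
file_means <- lapply(df.list, mean_h_stem, "h_stem")
file_means
```

Because the data frames live in a list from the start, there is no need for assign() or for collecting the objects afterwards.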

votes

An unclear part of the question is whether the mean of a given column should be calculated after extracting the column across all data frames, or once per data frame.

Techniques described in the other answer will return a list of means calculated once per file. If the intent of the question is to extract a column across all data frames, then calculate a mean, we'd need to extract first, unlist() the result, and use it as input to mean().

We illustrate this with Pokémon stats data that I maintain on GitHub.

download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
              "pokemonData.zip",
              method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")

thePokemonFiles <- list.files("./pokemonData",pattern="\\.csv$",
                              full.names=TRUE)
dataList <- lapply(thePokemonFiles,read.csv)

At this point dataList is a list of 8 data frames, each containing the statistics for one generation of Pokémon, which we can inspect with the RStudio Environment Viewer.

If we want to find the mean() of the Attack stat by generation, we can use lapply() as follows.

lapply(dataList,function(x,y) mean(x[[y]]),"Attack")

Here we use an anonymous function with two arguments, x and y, where x represents each data frame in dataList in turn, and y represents the column name Attack.

This returns a list of 8 means.

> lapply(dataList,function(x,y) mean(x[[y]]),"Attack")
[[1]]
[1] 72.91391

[[2]]
[1] 68.26

[[3]]
[1] 73.93617

[[4]]
[1] 79.99138

[[5]]
[1] 82.44242

[[6]]
[1] 94.92366

[[7]]
[1] 86.5082

[[8]]
[1] 83.66387
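If a plain numeric vector is more convenient than a list, vapply() can be used in place of lapply(). The sketch below uses two made-up data frames (`demoList`) rather than the Pokémon files, so the numbers are not the ones shown above.

```r
# Made-up stand-in for dataList
demoList <- list(
  data.frame(Attack = c(50, 70)),
  data.frame(Attack = c(80, 100, 120))
)

# numeric(1) declares that each call must return exactly one number
attack_means <- vapply(demoList, function(x, y) mean(x[[y]]), numeric(1), "Attack")
attack_means
```

vapply() stops with an error if the function ever returns something other than a single numeric value, which makes it a safer choice than sapply() in scripts.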

What if the desired result is a single mean for Attack across all 8 data frames?

In that case, we can use lapply() to extract the desired column, unlist() the result to create a numeric vector, and pass that as an argument to mean().

Given how R can nest functions, we can accomplish this in a single line of code.

mean(unlist(lapply(dataList,function(x,y) x[,y],"Attack")))

...and the output:

> mean(unlist(lapply(dataList,function(x,y) x[,y],"Attack")))
[1] 80.46699
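The two approaches generally give different answers when the data frames have different numbers of rows, because the pooled mean weights every row equally while the mean of the per-frame means weights every data frame equally. A small sketch with made-up data (`demoList` is illustrative, not the Pokémon data):

```r
# Made-up stand-in for dataList with unequal row counts
demoList <- list(
  data.frame(Attack = c(50, 70)),       # 2 rows, mean 60
  data.frame(Attack = c(80, 100, 120))  # 3 rows, mean 100
)

# One mean per data frame, then the average of those means
per_frame <- sapply(demoList, function(x) mean(x[["Attack"]]))
mean_of_means <- mean(per_frame)

# Pool all rows first, then take a single mean
pooled <- mean(unlist(lapply(demoList, function(x) x[["Attack"]])))

mean_of_means  # 80
pooled         # 84
```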

Augmenting the results

Given some of the comments on the question and answers, I thought I'd illustrate how we might generate a result that's a bit easier to consume than a list of numbers.

If we assign a name to each data frame when we create the initial list, we can then use the names to drive subsequent processing, including name information in an output data frame.

# assign names to dataList
df_names <- paste0("gen_",sprintf("%02d",1:8))
names(dataList) <- df_names

Once the names have been assigned, we can use the df_names vector to drive lapply() with an anonymous function.

resultList <- lapply(df_names,function(x,y) {
     df <- dataList[[x]] # use [[ to return a data frame vs. list()
     theMean <- mean(df[,y],na.rm=TRUE) # use [,] extract for variable substitution
     result_df <- data.frame(x,theMean)
     names(result_df) <- c("dataFrame",paste0("mean_",y))
     result_df
},"Attack")

At this point, resultList is a list of 8 data frames, each containing one observation. We use a combination of do.call() and rbind() to combine the data frames into a single output data frame.

# combine into single dataframe
do.call(rbind,resultList)

The resulting data frame allows us to see which mean belongs to each input data frame.

> do.call(rbind,resultList)
  dataFrame mean_Attack
1    gen_01    72.91391
2    gen_02    68.26000
3    gen_03    73.93617
4    gen_04    79.99138
5    gen_05    82.44242
6    gen_06    94.92366
7    gen_07    86.50820
8    gen_08    83.66387
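Once the list is named, a similar table can also be built in one step, without the intermediate list of one-row data frames. This is an alternative sketch rather than the approach used above, and `demoList` is a made-up stand-in for dataList.

```r
# Made-up named list standing in for dataList
demoList <- list(
  gen_01 = data.frame(Attack = c(50, 70)),
  gen_02 = data.frame(Attack = c(80, 100, 120))
)

result_df <- data.frame(
  dataFrame = names(demoList),
  # USE.NAMES = FALSE keeps the list names out of the row names
  mean_Attack = vapply(demoList,
                       function(x) mean(x[["Attack"]], na.rm = TRUE),
                       numeric(1),
                       USE.NAMES = FALSE)
)
result_df
```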

NOTE: I generated the list of data frame names based on my knowledge of the contents. One could also use list.files() with full.names = FALSE to generate the names.
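A sketch of deriving the names from the file names instead: the paths below are hypothetical placeholders for whatever list.files() returns, and the extension is stripped with tools::file_path_sans_ext() from the tools package that ships with R.

```r
# Hypothetical paths, standing in for the output of list.files(full.names = TRUE)
theFiles <- c("./pokemonData/gen01.csv", "./pokemonData/gen02.csv")

# Drop the directory, then the extension, to get one name per file
file_names <- tools::file_path_sans_ext(basename(theFiles))
file_names

# These could then be assigned with names(dataList) <- file_names
```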