One unclear part of the question is whether the mean of a given column should be calculated once across all data frames, after extracting the column from each, or once per data frame.
The techniques described in the other answer return a list of means calculated once per file. If the intent of the question is to calculate a single mean for a column across all data frames, we need to extract the column from each data frame first, unlist() the result, and use it as input to mean().
We illustrate this with Pokémon stats data that I maintain on GitHub.
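# download and unzip the data (method="curl" requires curl to be installed)
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
              "pokemonData.zip",
              method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")
# pattern is a regular expression, not a glob, so match a literal ".csv" suffix
thePokemonFiles <- list.files("./pokemonData",pattern="\\.csv$",
                              full.names=TRUE)
# read each file into its own data frame
dataList <- lapply(thePokemonFiles,read.csv)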
At this point dataList is a list of 8 data frames, each containing the statistics for one generation of Pokémon, which we can inspect with the RStudio Environment Viewer.
If we want to find the mean() of the Attack stat by generation, we can use lapply() as follows.
lapply(dataList,function(x,y) mean(x[[y]]),"Attack")
Here we use an anonymous function with two arguments, x and y, where x represents each data frame in dataList, and y represents the column name Attack.
This returns a list of 8 means.
> lapply(dataList,function(x,y) mean(x[[y]]),"Attack")
[[1]]
[1] 72.91391
[[2]]
[1] 68.26
[[3]]
[1] 73.93617
[[4]]
[1] 79.99138
[[5]]
[1] 82.44242
[[6]]
[1] 94.92366
[[7]]
[1] 86.5082
[[8]]
[1] 83.66387
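If a numeric vector is more convenient than a list, vapply() is one alternative to lapply(). The following is a minimal sketch on a toy list with made-up values (toyList is a hypothetical stand-in for dataList, not the Pokémon data), so it runs without downloading anything.

```r
# Toy stand-in for dataList; the values are made up for illustration
toyList <- list(
  data.frame(Attack = c(10, 20, 30)),
  data.frame(Attack = c(40, 60))
)

# vapply() works like lapply(), but returns an atomic vector and checks
# that each result matches FUN.VALUE (here, a single numeric value)
meansByFrame <- vapply(toyList,
                       function(x, y) mean(x[[y]]),
                       FUN.VALUE = numeric(1),
                       "Attack")
meansByFrame
```

As with lapply(), the extra argument "Attack" is passed through to the anonymous function as y.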
What if the desired result is a single mean for Attack across all 8 data frames?
In that case, we can use lapply() to extract the desired column, unlist() the result to create a numeric vector, and pass that as an argument to mean().
Given how R can nest functions, we can accomplish this in a single line of code.
mean(unlist(lapply(dataList,function(x,y) x[,y],"Attack")))
...and the output:
> mean(unlist(lapply(dataList,function(x,y) x[,y],"Attack")))
[1] 80.46699
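Note that the pooled mean is generally not the same as the mean of the per-data-frame means when the data frames have different numbers of rows. The sketch below shows this on a toy list with made-up values (toyList is a hypothetical stand-in for dataList), and also shows that weighting each per-frame mean by its row count recovers the pooled mean.

```r
# Toy stand-in for dataList; the values are made up for illustration
toyList <- list(
  data.frame(Attack = c(10, 20, 30)),
  data.frame(Attack = c(40, 60))
)

# one mean per data frame
perFrame <- sapply(toyList, function(x) mean(x[["Attack"]]))

# a single mean across all rows of all data frames
pooled <- mean(unlist(lapply(toyList, function(x) x[["Attack"]])))

# the unweighted mean of the per-frame means ignores the frame sizes
meanOfMeans <- mean(perFrame)

# weighting each per-frame mean by its row count matches the pooled mean
weighted <- weighted.mean(perFrame, sapply(toyList, nrow))

c(pooled = pooled, meanOfMeans = meanOfMeans, weighted = weighted)
```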
Augmenting the results
Given some of the comments on the question and answers, I thought I'd illustrate how we might generate a result that's a bit easier to consume than a list of numbers.
If we assign a name to each data frame when we create the initial list, we can then use the names to drive subsequent processing, including name information in an output data frame.
# assign names to dataList
df_names <- paste0("gen_",sprintf("%02d",1:8))
names(dataList) <- df_names
Once the names have been assigned, we can use the df_names vector to drive lapply() with an anonymous function.
resultList <- lapply(df_names,function(x,y) {
     df <- dataList[[x]] # use [[ to return a data frame vs. list()
     theMean <- mean(df[,y],na.rm=TRUE) # use [,] extract for variable substitution
     result_df <- data.frame(x,theMean)
     names(result_df) <- c("dataFrame",paste0("mean_",y))
     result_df
},"Attack")
At this point, resultList is a list of 8 data frames, each containing one observation. We use a combination of do.call() and rbind() to combine the data frames into a single output data frame.
# combine into single dataframe
do.call(rbind,resultList)
The resulting data frame allows us to see which mean belongs to each input data frame.
> do.call(rbind,resultList)
  dataFrame mean_Attack
1    gen_01    72.91391
2    gen_02    68.26000
3    gen_03    73.93617
4    gen_04    79.99138
5    gen_05    82.44242
6    gen_06    94.92366
7    gen_07    86.50820
8    gen_08    83.66387
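When the list itself is named, a more compact route to a similar data frame is to let sapply() carry the names through. This is a sketch of an alternative, not a replacement for the approach above, using a toy named list with made-up values (toyList is hypothetical).

```r
# Toy named list standing in for dataList; the values are made up
toyList <- list(
  gen_01 = data.frame(Attack = c(10, 20, 30)),
  gen_02 = data.frame(Attack = c(40, 60))
)

# sapply() on a named list returns a named numeric vector
means <- sapply(toyList, function(x) mean(x[["Attack"]], na.rm = TRUE))

# build the output data frame from the names and the values
result_df <- data.frame(dataFrame = names(means),
                        mean_Attack = unname(means))
result_df
```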
NOTE: I generated the list of data frame names based on my knowledge of the contents. One could also use list.files() with full.names = FALSE to generate the names.
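As a sketch of that idea, the example below derives list names from file names. It writes two small CSV files with made-up values to a temporary directory, so the file names gen01.csv and gen02.csv are assumptions for illustration and not the names of the files in the Pokémon archive. It uses basename() on the full paths, which yields the same names as calling list.files() with full.names = FALSE.

```r
# Create two toy CSV files in a temporary directory (hypothetical names)
tmpDir <- file.path(tempdir(), "toyData")
dir.create(tmpDir, showWarnings = FALSE)
write.csv(data.frame(Attack = c(10, 20, 30)),
          file.path(tmpDir, "gen01.csv"), row.names = FALSE)
write.csv(data.frame(Attack = c(40, 60)),
          file.path(tmpDir, "gen02.csv"), row.names = FALSE)

# Read the files, then name each list element after its file
theFiles <- list.files(tmpDir, pattern = "\\.csv$", full.names = TRUE)
toyList <- lapply(theFiles, read.csv)
names(toyList) <- tools::file_path_sans_ext(basename(theFiles))
names(toyList)
```

Because the names come from the same vector of paths that drives read.csv(), each name stays matched to the data frame read from that file.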