0
votes

I'm having difficulty figuring out how to subset some specific data from dataframes stored in a list. I've read numerous articles on this site as well as UCLA and Adv-R and I'm just not making any progress.

Advanced-R for Subsetting UCLA Advanced R for Subsetting

My function takes arguments that identify which data I'm interested in pulling out across a range of files: for example, dat1, dat2 and dat3 in files 1:15, which are stored in a directory of files (1:999).

Using lapply and read.csv, I have read all of my files (1:15) into a list of data frames.

 x <- lapply(directory[id], function(i) {
     read.csv(i, header = TRUE)
 })
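
A minimal, self-contained sketch of this reading step, assuming the files are plain CSVs. The temporary directory, the file names, and the `id` value are made up for illustration; they stand in for the real `directory` vector, which isn't shown in the question.

```r
# Hypothetical setup: write three tiny CSV files to a temporary directory
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
for (k in 1:3) {
  df <- data.frame(DateObv = c("2003-01-01", "2003-01-02"),
                   dat1 = c(NA, k), dat2 = c(NA, k / 10), ID = k)
  write.csv(df, file.path(tmp, sprintf("%03d.csv", k)), row.names = FALSE)
}

# Full paths to every CSV in the directory, sorted by name
directory <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
id <- 1:2  # which files to read (assumed value)

# One data frame per file; stringsAsFactors = FALSE keeps dates as character
x <- lapply(directory[id], function(i) {
  read.csv(i, header = TRUE, stringsAsFactors = FALSE)
})

length(x)       # number of data frames read
names(x[[1]])   # column names of the first one
```

Checking `names(x[[1]])` right after reading is a cheap way to confirm that the column names are exactly what later code expects.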

An example looks like this via str(x) [of just the first element]:

List of 15
 $ :'data.frame':   1461 obs. of  4 variables:
  ..$ DateObv   : Factor w/ 1461 levels "2003-01-01","2003-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ dat1: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ dat2: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ ID     : int [1:1461] 1 1 1 1 1 1 1 1 1 1 ...

So in the arguments to my function, I want to tell it to give me dat1 from files 1:15, and then I'll take the mean of the results.

I thought maybe I could use another lapply to extract dat1 specifically into a vector, but it keeps returning NULL or "list()", or it fails with errors saying that the object cannot be subset or that subset is missing an argument. I've tried both subset and bracket notation.

How do you recommend that I take a subset of the list of dataframes so that I get back all dat1's or dat2's into a single vector that I can run a mean against?

Thank you for your time and consideration.

2
I guess you can make use of something like lapply(x, `[[`, 'dat1'), which will return a list of vectors corresponding to the 'dat1' columns from each data frame - Marat Talipov
What exactly is the code you were trying that gave you the error? I would think unlist(lapply(x, "[[", "dat1")) might work. An actual reproducible example would be more useful here than just a description of the structure. - MrFlick
Hello @mrflick here's a sample of observation 1.

 Date        dat1  dat2   ID
 10/11/2003  NA    NA     1
 10/12/2003  5.99  0.428  1
 10/13/2003  NA    NA     1
 10/14/2003  NA    NA     1
 10/15/2003  NA    NA     1
 10/16/2003  NA    NA     1
 10/17/2003  NA    NA     1
 10/18/2003  4.68  1.04   1
 10/19/2003  NA    NA     1
 10/20/2003  NA    NA     1
 10/21/2003  NA    NA     1
 10/22/2003  NA    NA     1
 10/23/2003  NA    NA     1
 10/24/2003  3.47  0.363  1
 10/25/2003  NA    NA     1
 10/26/2003  NA    NA     1
 10/27/2003  NA    NA     1
 10/28/2003  NA    NA     1
 10/29/2003  NA    NA     1
 10/30/2003  2.42  0.507  1

- Zach
@MrFlick I tried your method and it just returns "NULL", which leads me to believe that my dat1 argument isn't actually being used. I know that if I do a simple print(dat1) that it will give me my argument within the function. y <- unlist(lapply(x, "[[", "dat1")) - Zach
What you posted above in the comments is not a reproducible example. Try dput()-ing a sample object or build a sample list in your original question. Read the link I provided for other examples. There must be something different going on than what you described. - MrFlick
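
One possible cause of the NULL result, offered as a guess since the failing code isn't shown: `[[` returns NULL for a column name that doesn't exist in a data frame, and `unlist()` of a list of NULLs is itself NULL. The sample list and the misspelled name `"Dat1"` below are invented for illustration.

```r
# Made-up sample list standing in for the real data
x <- list(
  data.frame(dat1 = c(NA, 5.99, NA), dat2 = c(NA, 0.428, NA), ID = 1),
  data.frame(dat1 = c(4.68, NA, 3.47), dat2 = c(1.04, NA, 0.363), ID = 2)
)

# Existing column: one vector per data frame, flattened by unlist()
good <- unlist(lapply(x, `[[`, "dat1"))

# Non-existent column (note the capital D): every element is NULL,
# so unlist() returns NULL as well
bad <- unlist(lapply(x, `[[`, "Dat1"))

is.null(bad)    # TRUE
names(x[[1]])   # inspect the actual column names

# dput() prints a copy-pasteable definition of an object,
# which is what makes a question reproducible
dput(head(x[[1]], 3))
```

If the real column names differ from the function argument in case, spacing, or spelling, this is the behaviour you would see.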

2 Answers

1
votes

I love plyr for this sort of thing. I would do something like this if you want the mean for each data.frame:

 library(plyr)
 # na.rm = TRUE skips the NA values present in the question's data
 ldply(x, summarize, Mean = mean(dat1, na.rm = TRUE))

Or, if you want one long vector of all the dat1 columns and the mean of all of them together, I'd still use plyr but do this:

 # Stack all data frames into one, then take a single overall mean
 x <- rbind.fill(x)
 mean(x$dat1, na.rm = TRUE)
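
For comparison, a sketch of the same two calculations in base R, without plyr. The sample list is invented, and `na.rm = TRUE` is an assumption based on the NA values in the question's data.

```r
# Made-up sample list with some missing values
x <- list(
  data.frame(dat1 = c(1, NA, 3), dat2 = 10),
  data.frame(dat1 = c(2, 4, NA), dat2 = 10)
)

# Mean of dat1 within each data frame (one number per list element)
per_df <- sapply(x, function(d) mean(d$dat1, na.rm = TRUE))

# Stack the data frames row-wise, then take one overall mean
stacked <- do.call(rbind, x)
overall <- mean(stacked$dat1, na.rm = TRUE)

per_df    # 2 3
overall   # 2.5
```

Note that `do.call(rbind, x)` expects every data frame to have the same columns, whereas `rbind.fill` fills missing columns with NA.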
0
votes

Create a similar data set:

> x = list(data.frame(dat1 = 1:3,dat2=10), data.frame(dat1 = 2:4,dat2=10))
> str(x)
List of 2
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 1 2 3
  ..$ dat2: num [1:3] 10 10 10
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 2 3 4
  ..$ dat2: num [1:3] 10 10 10

Use lapply to select the variable dat1 from each data frame:

> lapply(x, function(X) X$dat1)
[[1]]
[1] 1 2 3

[[2]]
[1] 2 3 4

Combine the resulting list into a single vector with do.call(c, ...), call mean on that vector, and add na.rm=TRUE to ignore any NA values:

> mean(do.call(c, lapply(x, function(X) X$dat1)),na.rm=TRUE)
[1] 2.5
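
Since the question asks for the column to be chosen by a function argument, here is one way the steps above could be wrapped up. The function name `pooled_mean` and its parameters are hypothetical. It uses `[[` rather than `$` because `$` does not evaluate a column name stored in a variable.

```r
# Hypothetical helper: pooled mean of one column across a list of data frames
pooled_mean <- function(dfs, column) {
  # Pull the named column out of each data frame
  cols <- lapply(dfs, function(d) d[[column]])
  # Flatten to a single vector and average, ignoring NA values
  mean(unlist(cols), na.rm = TRUE)
}

x <- list(data.frame(dat1 = 1:3, dat2 = 10), data.frame(dat1 = 2:4, dat2 = 10))
pooled_mean(x, "dat1")  # 2.5
pooled_mean(x, "dat2")  # 10
```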