I've run into a problem while working with specdata, which I downloaded from https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip. The dataset itself is not the issue, but I'm including the link in case anyone wants to look at it.

First, I'm trying to calculate the mean of either nitrate or sulfate by hand. For this manual calculation I'm using nitrate. There are 332 files inside the "specdata" folder, and in the case below I'm only using ID 23.

#manual coding first
dat <- data.frame()
files <- list.files("specdata", full.names = T)
id <- 23
for (i in 23){
  dat <- rbind(dat, read.csv(files[i]))
}

datsubset <- subset(dat, dat$ID == id)
#start here
pollutant <- subset(dat, dat$nitrate > 0, select = c("nitrate"))#problem is here
mean(datsubset[, "pollutant"], na.rm = T)

This is what head(dat) looks like:

head(dat)
        Date sulfate nitrate ID
1 2002-01-01      NA      NA 23
2 2002-01-02      NA      NA 23
3 2002-01-03      NA      NA 23
4 2002-01-04      NA      NA 23
5 2002-01-05      NA      NA 23
6 2002-01-06      NA      NA 23

As far as I understand, I have successfully subsetted the data.frame so that it contains only the rows with ID = 23.

The problem I'm facing is getting the pollutant column out of the data.frame as a numeric vector:

mean(datsubset[, "pollutant"], na.rm = T)
Error in [.data.frame(datsubset, , "pollutant") : undefined columns selected

If I replace the last two lines of code with the following, it works:

#manual coding first
dat <- data.frame()
files <- list.files("specdata", full.names = T)
id <- 23
for (i in 23){
  dat <- rbind(dat, read.csv(files[i]))
}

datsubset <- subset(dat, dat$ID == id)
#start here
##
y <- datsubset[, "nitrate"]# numeric
mean(y, na.rm = T) #works!

So my question is: how do I rewrite the first version so that it works?

I'm asking because if I can't get the manual version to work, I won't be able to get the function to work either. In case anyone is interested, this is the function I wrote:

#mean for pollutant with id

pollutantmean <- function(directory, pollutant, id = 1:332){
  dat <- data.frame()
  files <- list.files(directory, full.names = T)

  for (i in id){
    dat <- rbind(dat, read.csv(files[i]))
  }

  datsubset <- subset(dat, dat$ID == id)
  mean(datsubset[, "pollutant"], na.rm = T) # error here!
}

I get the same error when I call the function with pollutantmean("specdata", "nitrate", 23):

Error in [.data.frame(datsubset, , "pollutant") : undefined columns selected

I would appreciate it if someone could point me in the right direction or to some relevant reading.

Update 2: Rather than keep fighting with the subsetting, I fell back on what I do understand about [] and used an if/else statement:

pollutantmean <- function(directory, pollutant, id = 1:332){
  dat <- data.frame()
  files <- list.files(directory, full.names = T)

  for (i in id){
    dat <- rbind(dat, read.csv(files[i]))
  }
  datsubset <- subset(dat, dat$ID == id)

  if (pollutant == "nitrate"){
    mean(datsubset[, "nitrate"], na.rm = T) 
  }else {
    mean(datsubset[, "sulfate"], na.rm = T)
  }
}

If anyone understands what I'm struggling with, please feel free to share your thoughts here. Thanks!

pollutant is not a column in dat or datsubset, just as the message says. I think you need to read a basic guide to understand what Data["this_row", "that_column"] means. - Stephen Henderson

1 Answer

In your function, pollutant is already a variable (the function argument) that holds the column name, so use that variable itself inside the [] subscript:

mean(datsubset[, pollutant], na.rm = T)

That is, without the quotes. With the quotes, "pollutant" is just a literal character string, and R looks for a column with that exact name.
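
To make the difference concrete, here is a small sketch on a made-up data frame. The name toy and its values are invented for illustration and are not taken from specdata; only the column names match the question.

```r
# Toy data frame standing in for datsubset (values are made up).
toy <- data.frame(sulfate = c(1, 2, NA), nitrate = c(4, NA, 6))
pollutant <- "nitrate"   # plays the role of the function argument

# Unquoted: R evaluates the variable and selects the "nitrate" column.
mean(toy[, pollutant], na.rm = TRUE)   # 5

# Quoted: R looks for a column literally named "pollutant" and fails.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(toy[, "pollutant"], error = function(e) conditionMessage(e))
res
```

The quoted version reproduces the "undefined columns selected" error from the question, because toy has no column called pollutant.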

Changing that in your function pollutantmean() should be sufficient.
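
For reference, a possible version of the whole function is sketched below. Two things in it go beyond the one-line fix and are my own assumptions rather than part of the original code: %in% replaces == so that a vector of ids matches row by row instead of being recycled, and [[pollutant]] is used as an equivalent way to pull out a single column. Like the original, it assumes that files[i] is the file for monitor i. The demo directory and its two CSV files are made up so the example runs without the real dataset.

```r
# Sketch of pollutantmean() with the column looked up through the argument.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files <- list.files(directory, full.names = TRUE)
  # Read only the requested files and stack them into one data frame.
  dat <- do.call(rbind, lapply(files[id], read.csv))
  # %in% keeps rows whose ID is any of the requested ids (assumption: this
  # is the intent of the original dat$ID == id).
  datsubset <- dat[dat$ID %in% id, ]
  # pollutant is unquoted, so its value ("nitrate" or "sulfate") is used.
  mean(datsubset[[pollutant]], na.rm = TRUE)
}

# Made-up demo data: two tiny CSV files in a temporary directory.
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = "2002-01-01", sulfate = c(1, NA),
                     nitrate = c(2, 4), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = "2002-01-01", sulfate = c(3, 5),
                     nitrate = c(NA, 8), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

pollutantmean(demo_dir, "nitrate", 1)     # mean of 2 and 4
pollutantmean(demo_dir, "sulfate", 1:2)   # mean of 1, 3 and 5
```

With the real data the call would stay as in the question, pollutantmean("specdata", "nitrate", 23), and the if/else on the pollutant name is no longer needed.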