0
votes

I'm doing Assignment Part 2 at the following address:

https://www.coursera.org/learn/r-programming/supplement/amLgW/programming-assignment-1-instructions-air-pollution

Question: The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:

Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)

For this programming assignment you will need to unzip this file and create the directory 'specdata'. Once you have unzipped the zip file, do not make any modifications to the files in the 'specdata' directory. In each file you'll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

My code is as follows:

complete <- function(directory="d:/dev/r/documents/specdata", id)   {
    df <- data.frame(no=integer(), nobs=integer())
    for (i in id)   {
        sum=0
        myfilename = paste(directory,"/",formatC(i, width=3, flag="0"),".csv",
                           sep="")
        masterfile = read.table(myfilename, header=TRUE, sep=",")
        for (j in 1:nrow(masterfile)){
            if (!is.na(masterfile[j, 2]) && !is.na(masterfile[j, 3])){
                sum = sum + 1
            }
        }
        df[i,]<-c(i, sum)
    }
    df
}

Note that I put all the files (001.csv, 002.csv, ...) in the directory d:/dev/r/documents/specdata, which is why that string is the default value of the directory parameter. As you can see, I use nested for loops to make this work, and I realize that I should be able to replace at least one of the loops with lapply. I'm struggling with this because my background is in C++, so I have no real idea how to use lapply. I read several code examples on Stack Overflow and understood most of them, but when it came to writing my own code I could not make it work.

Thanks in advance! In the meantime I will try again.


2 Answers

1
votes

You can start by replacing the inner loop with something like this:

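# TRUE for rows where both sulfate (column 2) and nitrate (column 3) are present
rows_to_sum <- !is.na(masterfile[, 2]) & !is.na(masterfile[, 3])
# Summing a logical vector counts its TRUE values
df[i,] <- c(i, sum(rows_to_sum))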
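
To see why a vectorized test can stand in for the inner loop, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical and only mimic the Date/sulfate/nitrate layout described in the question.

```r
# Hypothetical toy data standing in for one monitor file
toy <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04"),
  sulfate = c(1.2, NA, 3.4, NA),
  nitrate = c(0.5, 0.7, NA, NA)
)

# TRUE where both measurements are present; & works element-wise on vectors
rows_to_sum <- !is.na(toy[, 2]) & !is.na(toy[, 3])

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of complete rows
n_complete <- sum(rows_to_sum)
print(n_complete)
```

Note that the element-wise & is used here rather than &&, which is meant for single values as in the original if statement.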
1
votes

This assignment gives you a hint by using the phrase "complete cases" multiple times. You should check out the R function complete.cases(). It would replace the need for your inner for loop.

For each file, read it into a data frame and run complete.cases() on that data frame. Count the number of TRUE elements in the returned logical vector, then output the name of the file along with that count.
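
Putting those steps together, one possible sketch replaces the outer loop with lapply and the inner loop with complete.cases(). The helper name count_one, the column names id and nobs, and the default id = 1:332 are assumptions made for illustration, not something the assignment prescribes.

```r
# Count the complete cases in a single monitor file.
# count_one is a hypothetical helper name chosen for this sketch.
count_one <- function(i, directory) {
  # Build a zero-padded file name such as "001.csv"
  path <- file.path(directory, sprintf("%03d.csv", i))
  monitor <- read.csv(path)
  # complete.cases() returns TRUE for rows with no NA in any column
  data.frame(id = i, nobs = sum(complete.cases(monitor)))
}

complete <- function(directory, id = 1:332) {
  # lapply calls count_one once per monitor and returns a list
  # of one-row data frames, in the same order as id
  rows <- lapply(id, count_one, directory = directory)
  # Stack the one-row data frames into a single data frame
  do.call(rbind, rows)
}
```

A call might look like complete("specdata", c(2, 4, 8)). Because each result is built as its own row and then bound together, the output has exactly one row per requested monitor, which avoids the empty rows that assigning to df[i,] can create when id does not start at 1.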