
I am new to R and am practicing writing functions. I have 100 separate csv data files stored in a directory, each labeled by its id, "1" to "100". I would like to write a function that reads selected files into R, calculates the number of complete cases in each file, and arranges the results in a data frame. Below is the function I wrote. First I list all the files in "dat". Then, using rbind, I read the selected files into a data frame. Lastly, I compute the number of complete cases using sum(complete.cases()). This seems straightforward, but the function does not work. I suspect there is something wrong with the index but have not figured out what. I searched through various topics but could not find a useful answer. Many thanks!

 `complete = function(directory,id) {
  dat = list.files(directory, full.name=T)
  dat.em = data.frame()
  for (i in id) {
    dat.ful= rbind(dat.em, read.csv(dat[i]))
    obs = numeric()
    obs[i] = sum(complete.cases(dat.ful[dat.ful$ID == i,]))
  }
  data.frame(ID = id, count = obs)
} 
complete("envi",c(1,3,5)) `

I get this error message: Error in data.frame(ID = id, count = obs) : arguments imply differing number of rows: 3, 5

1 Answer


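One problem with your code is that you reset obs to numeric() each time you go through the loop, so obs ends up holding only one count (the one for the last id you passed in). Because obs[i] then assigns at position i, the vector is padded with NA up to that position: with id = c(1,3,5), obs has length 5 while id has length 3, which is exactly what the error message is telling you.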

Another issue is that dat.em is always an empty data frame, so the line dat.ful = rbind(dat.em, read.csv(dat[i])) resets dat.ful to contain just the data frame read in that iteration of the loop. This won't cause an error, and you don't actually need to keep the previous data frames anyway, since you're only counting the complete cases in each file as you read it.
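If you'd rather keep the loop, here is a minimal sketch of how those two points could be addressed while staying close to your original code. The name complete_loop and the throwaway demo files are made up for illustration, and the sketch still indexes list.files() by position, so it assumes the files happen to sort in id order.

```r
# Sketch only: complete_loop and the demo data below are hypothetical.
complete_loop <- function(directory, id) {
  dat <- list.files(directory, full.names = TRUE)
  # Allocate the result vector once, before the loop
  obs <- numeric(length(id))
  for (k in seq_along(id)) {
    # Read one file at a time; earlier files do not need to be kept
    one_file <- read.csv(dat[id[k]])
    # Index by position k (1, 2, 3, ...), not by the id value itself
    obs[k] <- sum(complete.cases(one_file))
  }
  data.frame(ID = id, count = obs)
}

# Throwaway demo data: five small files named 1.csv to 5.csv
demo_dir <- tempfile("envi_demo")
dir.create(demo_dir)
for (i in 1:5) {
  df <- data.frame(ID = i, x = c(1, NA, 3), y = c(1, 2, if (i == 3) NA else 3))
  write.csv(df, file.path(demo_dir, paste0(i, ".csv")), row.names = FALSE)
}

res <- complete_loop(demo_dir, c(1, 3, 5))
print(res)
```

Because obs is created once with length(id) entries and filled by position, the ID and count columns always have the same number of rows.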

Here's a different approach using lapply instead of a loop. Note that instead of giving the function a vector of indices, this function takes a vector of file names. In your example, you use the index instead of the file name as the file "id". It's better to use the file names directly, because even if the file names are numbers, using the index will give an incorrect result if, for some reason, your vector of file names is not sorted in ascending numeric order, or if the file names don't use consecutive numbers.
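To make the ordering problem concrete, here is a small sketch with three made-up file names (your real names may differ). list.files() returns names sorted as character strings, so the position in that vector does not have to match the numeric id.

```r
# Hypothetical file names, created empty in a fresh temporary directory
demo_dir <- tempfile("order_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("1.csv", "2.csv", "10.csv")))

# Names are sorted as text, so "10.csv" is listed before "2.csv"
files <- list.files(demo_dir)
print(files)

# Position 3 holds "2.csv", even though 2 is not the third id numerically
print(files[3])

# Building the name from the id does not depend on the sort order
# (assumes the "<id>.csv" naming used in this demo)
id <- 10
path_from_id <- file.path(demo_dir, paste0(id, ".csv"))
print(file.exists(path_from_id))
```

So dat[i] only lines up with file i when the alphabetical order and the numeric order happen to agree.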

# Read files and return data frame with the number of complete cases in each csv file
complete = function(directory, files) {

  # Read each csv file in turn and store its name and number of complete cases 
  # in a list
  obs.list = lapply(files, function(x) {
    dat = read.csv(paste0(directory,"/", x))
    data.frame(fileName=x, count=sum(complete.cases(dat)))
  })

  # Return a data frame with the number of complete cases for each file
  return(do.call(rbind, obs.list)) 
} 

Then, to run the function, you need to give it a directory and a vector of file names. For example, to read all csv files in the current working directory, you can do this:

  filesToRead = list.files(pattern="\\.csv$")

  complete(getwd(), filesToRead)
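
One caveat about the pattern argument: list.files() treats it as a regular expression, not a shell-style wildcard, so an unescaped dot matches any character and the match can occur anywhere in the name. A quick sketch with made-up file names, using grepl() to show what each pattern would match:

```r
# Made-up file names for illustration
f <- c("1.csv", "2.csv", "notes_csv.txt", "1.csv.bak")

# Unescaped and unanchored: "." matches any character, anywhere in the name
loose <- grepl(".csv", f)
print(loose)

# Escaped dot plus end anchor: only names that actually end in ".csv"
strict <- grepl("\\.csv$", f)
print(strict)
```

Escaping the dot and anchoring the pattern with $ keeps files such as backups or text notes out of the vector you pass to complete().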