Read multiple xlsx files with multiple sheets into one R data frame

Question

I have been reading up on how to read and combine multiple xlsx files into one R data frame and have come across some very good suggestions, such as "How to read multiple xlsx file in R using loop with specific rows and columns", but none fits my data set so far.

I would like R to read in multiple xlsx files that have multiple sheets. All sheets and files have the same columns but not the same length, and NAs should be excluded. I want to skip the first 3 rows and only take in columns 1:6, 8:10, 12:17, 19.

So far I tried:

file.list <- list.files(recursive=T,pattern='*.xlsx')

dat = lapply(file.list, function(i){
    x = read.xlsx(i, sheetIndex=1, sheetName=NULL, startRow=4,
              endRow=NULL, as.data.frame=TRUE, header=F)
# Column select 
    x = x[, c(1:6,8:10,12:17,19)]
# Create column with file name  
    x$file = i
# Return data
    x
  })

  dat = do.call("rbind.data.frame", dat)

But this only reads the first sheet of every file.

Does anyone know how to get all the sheets and files together in one R data frame?

Also, what packages would you recommend for large sets of data? So far I tried readxl and XLConnect.

You have explicitly asked for only the first sheet in your function: x = read.xlsx(i, sheetIndex=1,....) — mkt
Also, if you're looking to optimize speed for large datasets, it's worth looking up the data.table package. Among other things, its fread function allows you to only read in the columns that you need, instead of reading all columns and then subsetting. But I'm not sure that it will work with .xlsx files. — mkt
Your lapply loops over files; you need to add a second loop over sheets to get what you want. — Choubi
Thanks for the suggestion. Do you know if there is a way to ask for all sheets with read.xlsx? — Elisah

GPierre · Accepted Answer · 2016-07-05T07:50:49

I would use a nested loop like this to go through each sheet of each file. It might not be the fastest solution but it is the simplest.

library(xlsx)
file.list <- list.files(recursive=TRUE, pattern='\\.xlsx$')  # get file list from folder

for (i in seq_along(file.list)){
  wb <- loadWorkbook(file.list[i])            # select a file & load workbook
  sheet <- getSheets(wb)                      # get sheet list

  for (j in seq_along(sheet)){
    # read sheet j, skipping the first 3 rows and keeping only the wanted columns
    tmp <- read.xlsx(file.list[i], sheetIndex=j, colIndex=c(1:6,8:10,12:17,19),
                     sheetName=NULL, startRow=4, endRow=NULL,
                     as.data.frame=TRUE, header=FALSE)
    if (i==1 && j==1) dataset <- tmp else dataset <- rbind(dataset, tmp)   # append to previous

  }
}
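
A possible variation on the loop above, offered as a sketch rather than as part of the original answer: calling rbind() on every iteration copies the accumulated data each time, so with many sheets it may be quicker to collect the pieces in a list and bind them once at the end. The example uses small in-memory data frames as hypothetical stand-ins for the sheets. The names workbooks and pieces, and the added file and sheet columns, are illustrative assumptions. In real use, tmp would come from read.xlsx().

```r
# Hypothetical stand-in for two workbooks: a named list of files, each
# holding a list of data frames (one per sheet). Not real spreadsheet data.
workbooks <- list(
  "a.xlsx" = list(data.frame(x = 1:2, y = c("p", "q"), stringsAsFactors = FALSE),
                  data.frame(x = 3L,  y = "r",         stringsAsFactors = FALSE)),
  "b.xlsx" = list(data.frame(x = 4:5, y = c("s", "t"), stringsAsFactors = FALSE))
)

# Collect every sheet in a flat list instead of growing a data frame.
pieces <- list()
for (f in names(workbooks)) {
  for (j in seq_along(workbooks[[f]])) {
    tmp <- workbooks[[f]][[j]]   # in real use: tmp <- read.xlsx(f, sheetIndex = j, ...)
    tmp$file  <- f               # remember which file the rows came from
    tmp$sheet <- j               # remember which sheet the rows came from
    pieces[[length(pieces) + 1]] <- tmp
  }
}

# Bind all pieces in a single call.
dataset <- do.call(rbind, pieces)
```

Keeping the file and sheet columns also makes it easier to trace a suspicious row back to its source.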

You can clean NA values after the loading phase.
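
A minimal sketch of that cleanup step in base R. It assumes the combined data frame is called dataset, as in the loop above. The sample values are made up for illustration. Whether to drop only fully blank rows or every row containing an NA depends on the data, so both options are shown.

```r
# Made-up example data standing in for the combined result.
dataset <- data.frame(X1 = c("a", NA, "c", NA),
                      X2 = c(1, NA, NA, NA),
                      stringsAsFactors = FALSE)

# Option 1: drop rows in which every column is NA (e.g. blank spreadsheet rows).
all_na <- rowSums(is.na(dataset)) == ncol(dataset)
no_blank_rows <- dataset[!all_na, ]

# Option 2: drop rows that contain at least one NA.
complete_only <- na.omit(dataset)
```

Option 1 keeps partially filled rows, while option 2 is stricter and keeps only rows with a value in every column.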
