I'll create some sample xlsx files using openxlsx
:
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 1101:1103, b = 1111:1113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 1201:1203, b = 1211:1213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 1301:1303, b = 1311:1313))
openxlsx::saveWorkbook(wb, "book1.xlsx")
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 2101:2103, b = 2111:2113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 2201:2203, b = 2211:2213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 2301:2303, b = 2311:2313))
openxlsx::saveWorkbook(wb, "book2.xlsx")
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 3101:3103, b = 3111:3113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 3201:3203, b = 3211:3213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 3301:3303, b = 3311:3313))
openxlsx::saveWorkbook(wb, "book3.xlsx")
General flow
I'm not sure why you prefer to keep one frame per sheet; if you're doing the same thing to different groups of data, it can still make a lot of sense to have a single frame, keeping as much of the context as possible so that grouping comes naturally.
While base R does do grouping operations, I find them to be slightly less intuitive/flexible than when using the data.table
or dplyr
packages, so I'll stick with those two for the processing here (and leave you to determine which if either you wish to use, and then adapt your processing to do it group-wise).
Either way, here's my flow:
- We need a function that reads in all sheets within a workbook, then iterate this over the vector of file names;
- I'll demonstrate putting all data into one frame (my recommendation); and then
- I'll demonstrate grouping them by worksheet.
I'll start with data.table
, but I'll provide the equivalents in dplyr
later.
basic read-all-sheets function
readOneBook <- function(fn) {
shtnms <- openxlsx::getSheetNames(fn)
sheets <- lapply(setNames(nm = shtnms), openxlsx::readWorkbook, xlsxFile = fn)
sheets
}
readOneBook("book1.xlsx")
# $tab1
# a b
# 1 1101 1111
# 2 1102 1112
# 3 1103 1113
# $tab2
# a b
# 1 1201 1211
# 2 1202 1212
# 3 1203 1213
# $tab3
# a b
# 1 1301 1311
# 2 1302 1312
# 3 1303 1313
So we'll create a list for workbooks (which are lists of sheets) with
workbooks <- lapply(setNames(nm = list.files(pattern = "\\.xlsx$")), readOneBook)
data.table
Here's a list where each element is a workbook:
library(data.table)
lapply(workbooks, rbindlist, idcol = "sheet")
# $book1.xlsx
# sheet a b
# 1: tab1 1101 1111
# 2: tab1 1102 1112
# 3: tab1 1103 1113
# 4: tab2 1201 1211
# 5: tab2 1202 1212
# 6: tab2 1203 1213
# 7: tab3 1301 1311
# 8: tab3 1302 1312
# 9: tab3 1303 1313
# $book2.xlsx
# sheet a b
# 1: tab1 2101 2111
# 2: tab1 2102 2112
# 3: tab1 2103 2113
# 4: tab2 2201 2211
# 5: tab2 2202 2212
# 6: tab2 2203 2213
# 7: tab3 2301 2311
# 8: tab3 2302 2312
# 9: tab3 2303 2313
# $book3.xlsx
# sheet a b
# 1: tab1 3101 3111
# 2: tab1 3102 3112
# 3: tab1 3103 3113
# 4: tab2 3201 3211
# 5: tab2 3202 3212
# 6: tab2 3203 3213
# 7: tab3 3301 3311
# 8: tab3 3302 3312
# 9: tab3 3303 3313
And then combining this into one big frame:
rbindlist(
lapply(workbooks, rbindlist, idcol = "sheet"),
idcol = "workbook"
)
# workbook sheet a b
# 1: book1.xlsx tab1 1101 1111
# 2: book1.xlsx tab1 1102 1112
# 3: book1.xlsx tab1 1103 1113
# 4: book1.xlsx tab2 1201 1211
# 5: book1.xlsx tab2 1202 1212
# ---
# 23: book3.xlsx tab2 3202 3212
# 24: book3.xlsx tab2 3203 3213
# 25: book3.xlsx tab3 3301 3311
# 26: book3.xlsx tab3 3302 3312
# 27: book3.xlsx tab3 3303 3313
The list of sheets is slightly different, requiring a bit of "transpose" functionality. This safeguards against (1) sheets that are not present in all workbooks; and (2) different order of sheets.
commonsheets <- Reduce(intersect, lapply(workbooks, names))
commonsheets
# [1] "tab1" "tab2" "tab3"
lapply(setNames(nm = commonsheets),
function(sht) rbindlist(lapply(workbooks, `[[`, sht), idcol = "workbook"))
# $tab1
# workbook a b
# 1: book1.xlsx 1101 1111
# 2: book1.xlsx 1102 1112
# 3: book1.xlsx 1103 1113
# 4: book2.xlsx 2101 2111
# 5: book2.xlsx 2102 2112
# 6: book2.xlsx 2103 2113
# 7: book3.xlsx 3101 3111
# 8: book3.xlsx 3102 3112
# 9: book3.xlsx 3103 3113
# $tab2
# workbook a b
# 1: book1.xlsx 1201 1211
# 2: book1.xlsx 1202 1212
# 3: book1.xlsx 1203 1213
# 4: book2.xlsx 2201 2211
# 5: book2.xlsx 2202 2212
# 6: book2.xlsx 2203 2213
# 7: book3.xlsx 3201 3211
# 8: book3.xlsx 3202 3212
# 9: book3.xlsx 3203 3213
# $tab3
# workbook a b
# 1: book1.xlsx 1301 1311
# 2: book1.xlsx 1302 1312
# 3: book1.xlsx 1303 1313
# 4: book2.xlsx 2301 2311
# 5: book2.xlsx 2302 2312
# 6: book2.xlsx 2303 2313
# 7: book3.xlsx 3301 3311
# 8: book3.xlsx 3302 3312
# 9: book3.xlsx 3303 3313
dplyr
Same functionality, same effective data, so I'll just show the commands (which are really just replacing rbindlist
with bind_cols
and an argument-name change).
library(dplyr)
# list, one workbook per element
lapply(workbooks, rbindlist, idcol = "sheet")
# one big frame
bind_rows(
lapply(workbooks, bind_rows, .id = "sheet"),
.id = "workbook"
)
# list, one common sheet per element
lapply(setNames(nm = commonsheets),
function(sht) bind_rows(lapply(workbooks, `[[`, sht), .id = "workbook"))
lapply(list_of_filenames, my_import_function)
. (3) The method to rbind them all by tab would be better informed with sample data. Can you provide three samples of data with a few rows of sample data each? We don't need all columns, all tabs, all files, just a few. Thanks! – r2evansdata.frame(a=1:3,b=11:13,d=21:23)
may suffice, but if you're talking about names-in-common then there needs to be something like that in your three frames of sample data. – r2evans