1
votes

I have 4 Excel datasets with 15 sheets each.

First, I want to import all datasets to R as a list, so that the list contains each dataset (df1, df2, df3, df4) and each dataset contains all the 15 sheets (sheets1, sheets2, sheets3,...,sheets15). The sheets have identical names in each dataset. And the datasets all start with the same word, let's say "coffee". So have the datasets "coffee_1.xlsx", "coffee_2.xlsx", "coffee_3.xlsx" and "coffee_4.xlsx". Is there a way, how I can import all datasets at once?

Second I want to rbind all datasets by sheets. So, e.g., sheet1 of df1 should be combined with sheet1 of df2, df3, and df4.

I do not want to do it manually as I have to repeat the procedure for 100 datasets with 15 sheets each.

So far I have tried to import all datasets separately and combined them to a bigger list as follows:

df.list<-list(df.list1,df.list2,df.list3,df.list4,df.list5)

Each of the lists contains the 15 sheets. Then I tried to rbind them using do.call:

df.list.big<-do.call(rbind,df.list)

But I could not rbind the data sheet by sheet. This, I would really appreciate your help. Thanks!

4
(1) By "tab", do you mean "worksheet within a workbook" (in xlsx-speak) or columns within a frame? (2) Import all at once? Use a function that imports 1, then lapply(list_of_filenames, my_import_function). (3) The method to rbind them all by tab would be better informed with sample data. Can you provide three samples of data with a few rows of sample data each? We don't need all columns, all tabs, all files, just a few. Thanks!r2evans
(1) Yes, I mean sheets in a workbook. Sorry for the misunderstanding. I have edited my question. (2) I would like to provide some data but I do not know how to express the sheets in the list of dataframes. So each df.list(1,2,3 and 4) consists of sheet1, sheet2, etc.ZayzayR
Sample data does not have to be real. data.frame(a=1:3,b=11:13,d=21:23) may suffice, but if you're talking about names-in-common then there needs to be something like that in your three frames of sample data.r2evans

4 Answers

3
votes

I'll create some sample xlsx files using openxlsx:

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 1101:1103, b = 1111:1113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 1201:1203, b = 1211:1213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 1301:1303, b = 1311:1313))
openxlsx::saveWorkbook(wb, "book1.xlsx")

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 2101:2103, b = 2111:2113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 2201:2203, b = 2211:2213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 2301:2303, b = 2311:2313))
openxlsx::saveWorkbook(wb, "book2.xlsx")

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 3101:3103, b = 3111:3113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 3201:3203, b = 3211:3213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 3301:3303, b = 3311:3313))
openxlsx::saveWorkbook(wb, "book3.xlsx")

General flow

I'm not sure why you prefer to keep one frame per sheet; if you're doing the same thing to different groups of data, it can still make a lot of sense to have a single frame, keeping as much of the context as possible so that grouping comes naturally.

While base R does do grouping operations, I find them to be slightly less intuitive/flexible than when using the data.table or dplyr packages, so I'll stick with those two for the processing here (and leave you to determine which if either you wish to use, and then adapt your processing to do it group-wise).

Either way, here's my flow:

  1. We need a function that reads in all sheets within a workbook, then iterate this over the vector of file names;
  2. I'll demonstrate putting all data into one frame (my recommendation); and then
  3. I'll demonstrate grouping them by worksheet.

I'll start with data.table, but I'll provide the equivalents in dplyr later.

basic read-all-sheets function

readOneBook <- function(fn) {
  shtnms <- openxlsx::getSheetNames(fn)
  sheets <- lapply(setNames(nm = shtnms), openxlsx::readWorkbook, xlsxFile = fn)
  sheets
}
readOneBook("book1.xlsx")
# $tab1
#      a    b
# 1 1101 1111
# 2 1102 1112
# 3 1103 1113
# $tab2
#      a    b
# 1 1201 1211
# 2 1202 1212
# 3 1203 1213
# $tab3
#      a    b
# 1 1301 1311
# 2 1302 1312
# 3 1303 1313

So we'll create a list for workbooks (which are lists of sheets) with

workbooks <- lapply(setNames(nm = list.files(pattern = "\\.xlsx$")), readOneBook)

data.table

Here's a list where each element is a workbook:

library(data.table)
lapply(workbooks, rbindlist, idcol = "sheet")
# $book1.xlsx
#    sheet    a    b
# 1:  tab1 1101 1111
# 2:  tab1 1102 1112
# 3:  tab1 1103 1113
# 4:  tab2 1201 1211
# 5:  tab2 1202 1212
# 6:  tab2 1203 1213
# 7:  tab3 1301 1311
# 8:  tab3 1302 1312
# 9:  tab3 1303 1313
# $book2.xlsx
#    sheet    a    b
# 1:  tab1 2101 2111
# 2:  tab1 2102 2112
# 3:  tab1 2103 2113
# 4:  tab2 2201 2211
# 5:  tab2 2202 2212
# 6:  tab2 2203 2213
# 7:  tab3 2301 2311
# 8:  tab3 2302 2312
# 9:  tab3 2303 2313
# $book3.xlsx
#    sheet    a    b
# 1:  tab1 3101 3111
# 2:  tab1 3102 3112
# 3:  tab1 3103 3113
# 4:  tab2 3201 3211
# 5:  tab2 3202 3212
# 6:  tab2 3203 3213
# 7:  tab3 3301 3311
# 8:  tab3 3302 3312
# 9:  tab3 3303 3313

And then combining this into one big frame:

rbindlist(
  lapply(workbooks, rbindlist, idcol = "sheet"),
  idcol = "workbook"
)
#       workbook sheet    a    b
#  1: book1.xlsx  tab1 1101 1111
#  2: book1.xlsx  tab1 1102 1112
#  3: book1.xlsx  tab1 1103 1113
#  4: book1.xlsx  tab2 1201 1211
#  5: book1.xlsx  tab2 1202 1212
# ---                           
# 23: book3.xlsx  tab2 3202 3212
# 24: book3.xlsx  tab2 3203 3213
# 25: book3.xlsx  tab3 3301 3311
# 26: book3.xlsx  tab3 3302 3312
# 27: book3.xlsx  tab3 3303 3313

The list of sheets is slightly different, requiring a bit of "transpose" functionality. This safeguards against (1) sheets that are not present in all workbooks; and (2) different order of sheets.

commonsheets <- Reduce(intersect, lapply(workbooks, names))
commonsheets
# [1] "tab1" "tab2" "tab3"
lapply(setNames(nm = commonsheets),
       function(sht) rbindlist(lapply(workbooks, `[[`, sht), idcol = "workbook"))
# $tab1
#      workbook    a    b
# 1: book1.xlsx 1101 1111
# 2: book1.xlsx 1102 1112
# 3: book1.xlsx 1103 1113
# 4: book2.xlsx 2101 2111
# 5: book2.xlsx 2102 2112
# 6: book2.xlsx 2103 2113
# 7: book3.xlsx 3101 3111
# 8: book3.xlsx 3102 3112
# 9: book3.xlsx 3103 3113
# $tab2
#      workbook    a    b
# 1: book1.xlsx 1201 1211
# 2: book1.xlsx 1202 1212
# 3: book1.xlsx 1203 1213
# 4: book2.xlsx 2201 2211
# 5: book2.xlsx 2202 2212
# 6: book2.xlsx 2203 2213
# 7: book3.xlsx 3201 3211
# 8: book3.xlsx 3202 3212
# 9: book3.xlsx 3203 3213
# $tab3
#      workbook    a    b
# 1: book1.xlsx 1301 1311
# 2: book1.xlsx 1302 1312
# 3: book1.xlsx 1303 1313
# 4: book2.xlsx 2301 2311
# 5: book2.xlsx 2302 2312
# 6: book2.xlsx 2303 2313
# 7: book3.xlsx 3301 3311
# 8: book3.xlsx 3302 3312
# 9: book3.xlsx 3303 3313

dplyr

Same functionality, same effective data, so I'll just show the commands (which are really just replacing rbindlist with bind_cols and an argument-name change).

library(dplyr)

# list, one workbook per element
lapply(workbooks, rbindlist, idcol = "sheet")

# one big frame
bind_rows(
  lapply(workbooks, bind_rows, .id = "sheet"),
  .id = "workbook"
)

# list, one common sheet per element
lapply(setNames(nm = commonsheets),
       function(sht) bind_rows(lapply(workbooks, `[[`, sht), .id = "workbook"))
1
votes

I have an approach that can get the job done. You will use three packages:

library(readxl)
library(dplyr)
library(purrr)

In this case, I will make the assumption that all your data is in your working directory and that all your workbooks have the same number of sheets.

Step one: list all your files

# list all your workbooks
files <- list.files()

step two: create a function that, using the path to your files and a sheet index, return a data frame with the sheets binded by rows.

read_workbook_sheets <- function(files, sheet_index = 1) {
  
  # import from all workbooks the sheet in the sheet_index position
  data <- purrr::map(files, ~readxl::read_excel(path = .x, sheet = sheet_index))
  
  # bind the all sheets together
  data <- dplyr::bind_rows(data)
  
  # return the dataframe  
  return(data)
}

step three: Use that function in a loop itarating on the sequence of sheets, for example 10

my_list_of_df <- purrr::map(1:10, read_workbooks_sheets(files, .x)) 

PS: Sorry for my english grammar, I am not a native speaker.

0
votes

You could try something like this:

library(tidyverse)
library(readxl)

df <- tibble(filename = list.files(path = ".", pattern = "coffee", full.names = TRUE)) %>% 
    mutate(sheet = map(filename, excel_sheets)) %>% 
    unnest(sheet) %>% 
    mutate(data_from_excel = map2(filename, sheet, read_excel)) %>% 
    group_by(sheet) 

df2 <- df %>% group_split
names(df2) <- group_keys(df) %>% pull

df2 %>% map(~summarize(., bind_rows(data_from_excel)))
0
votes

Editing: You did not produce a reproducible example, so I will make assumptions. Your sheets are named identically, with identical columns. I have made a small dataset that I think fits your description.

I recommend using data.table, if you can:

library(data.table)
df.list1 <- list(sheet1 = data.table(a = 1, b = 1), sheet2 = data.table(a = 2, b = 2))
df.list2 <- list(sheet1 = data.table(a = 3, b = 3), sheet2 = data.table(a = 4, b = 4))
df.list <- list(dataset1 = df.list1, dataset2 = df.list2)
# Now I have a dataset like yours 
# First -- transpose them so "sheets" are on the outside
# Then data.table::rbindlist them, keeping the dataset names, if you like
lapply(purrr::transpose(df.list), data.table::rbindlist, idcol = "dataset")