Skip rows while reading multiple excel worksheets in R

Question

I am using the readxl library to read many Excel worksheets from the same workbook (called data.xlsx). Each worksheet has the following format:

Data starts in row 3.

  row1
  row2
 companyName   1980    1981    1982 ... 2016
 company1       5       6       7        8
 company2       10      20      30       40
 company3       20      40      60       80
 ....

The data range differs in length (rows and columns) from sheet to sheet. However, all sheets share companyName as a common key. The year range starts at either 1980 or 1990 and runs until 2016. The worksheet name is the data name.

I want to create a single Excel file that combines the data extracted from all worksheets:

 companyName   Year   dataname     values
 company1      1980   sheetname1     5
 company1      1981   sheetname1     6
 company1      1982   sheetname1     7
 company1      ...    sheetname1     ...
 company1      2016   sheetname1     8
 company2      1980   sheetname1     10
 company2      1981   sheetname1     20
 company2      1982   sheetname1     30
 company2      ...    sheetname1     ...
 company2      2016   sheetname1     40
 ....          ....     ...           ...
 company1      2000    sheetname2     xxx
 company1      2001    sheetname2     yyy
  etc
  etc
  etc

This is how far I have managed to get:

  library(tidyverse)
  library(readxl)
  library(data.table)

  # Excel file to read (approach adapted from the source cited below)
  file.list <- "data.xlsx"

  # read all sheets (and skip the first two rows)
  df.list <- lapply(file.list, function(x) {
    sheets <- excel_sheets(x)
    dfs <- lapply(sheets, function(y) {
      read_excel(x, sheet = y, skip = 2)
    })
    names(dfs) <- sheets
    dfs
  })

I have the following issues:

The first two rows are not being skipped.
How do I create one data frame from selected sheets only (e.g. sheet 5, sheet 10 and sheet 15)?

Thank you for your help.

Source: R: reading multiple excel files, extract first sheet names, and create new column

What version is your readxl package? I have no issues with skipping rows. Unless not all sheets in your file start with the same number of rows before the headers. — hpesoj626
Hello - using package 1.0.0 of readxl. Yes some sheets need to be excluded, how do I do that please? — Beginner

hpesoj626 · Accepted Answer · 2018-03-06T10:39:01

I just removed one level of nesting from df.list.

df.list <- lapply(file.list, function(x) {
  sheets <- excel_sheets(x)
  dfs <- lapply(sheets, function(y) {
    read_excel(x, sheet = y, skip = 2)
  })
  names(dfs) <- sheets
  dfs
})[[1]]

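This works for me. I can't replicate your problem with skip. Also, if the first rows are just blank, read_excel() drops them automatically: leading empty rows are skipped by default, and skip only sets a lower bound on the number of rows skipped. (trim_ws = TRUE is unrelated to rows; it trims leading and trailing whitespace inside cells.)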
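For the second part of the question (reading only selected sheets), one option is to subset the vector returned by excel_sheets() before reading. The sketch below is only an illustration: since data.xlsx is not available, it uses the deaths.xlsx example workbook bundled with readxl, and the position in `wanted` and the value skip = 4 are assumptions that fit that example file, not the asker's data.

```r
library(readxl)

# Example workbook shipped with readxl; it stands in for data.xlsx here.
path <- readxl_example("deaths.xlsx")

# All sheet names, in workbook order.
sheets <- excel_sheets(path)

# Pick sheets by position. For the workbook in the question this
# might be sheets[c(5, 10, 15)] instead (hypothetical positions).
wanted <- sheets[2]

# Read only the selected sheets. skip = 4 suits this example file,
# where the header sits on row 5; the question's layout would use skip = 2.
dfs <- lapply(wanted, function(s) read_excel(path, sheet = s, skip = 4))
names(dfs) <- wanted
```

Selecting at read time avoids importing sheets that would be filtered out later anyway.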

I used the following list just to demonstrate what to do after the import.

df.list <- structure(list(sheetname1 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(5, 10, 40), `1981` = c(6, 
20, 50), `1982` = c(7, 30, 60)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname2 = structure(list(companyName = c("company1", 
"company2", "company3"), `1980` = c(6, 11, 42), `1981` = c(7, 
21, 52), `1982` = c(8, 31, 62)), .Names = c("companyName", "1980", 
"1981", "1982"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame")), sheetname3 = structure(list(companyName = c("company1", 
"company2", "company3"), `1990` = c(8, 12, 43), `1991` = c(9, 
22, 53), `1992` = c(10, 32, 63)), .Names = c("companyName", "1990", 
"1991", "1992"), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))), .Names = c("sheetname1", "sheetname2", 
"sheetname3"))

The following works even if the years start at 1980 or 1990.

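dat <- lapply(df.list, function(x) {
  # reshape each sheet from wide (one column per year) to long
  x %>% gather(year, values, -companyName)
}) %>%
  enframe() %>%   # sheet names go into the `name` column
  unnest(value)   # expand the nested data frames into rows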

dat

# # A tibble: 27 x 4
#    name       companyName year  values
#    <chr>      <chr>       <chr>  <dbl>
#  1 sheetname1 company1    1980      5.
#  2 sheetname1 company2    1980     10.
#  3 sheetname1 company3    1980     40.
#  4 sheetname1 company1    1981      6.
#  5 sheetname1 company2    1981     20.
#  6 sheetname1 company3    1981     50.
#  7 sheetname1 company1    1982      7.
#  8 sheetname1 company2    1982     30.
#  9 sheetname1 company3    1982     60.
# 10 sheetname2 company1    1980      6.
# # ... with 17 more rows
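With tidyr 1.0.0 or later, the same reshaping can probably be written with pivot_longer() and dplyr::bind_rows(.id = ...), which also makes it easy to produce the column names and order asked for in the question. This is a sketch under those assumptions, using a small made-up list in place of the imported sheets.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the imported list: one tibble per sheet (values are made up).
df.list <- list(
  sheetname1 = tibble(companyName = c("company1", "company2"),
                      `1980` = c(5, 10), `1981` = c(6, 20)),
  sheetname3 = tibble(companyName = c("company1", "company2"),
                      `1990` = c(8, 12), `1991` = c(9, 22))
)

long <- df.list %>%
  # Reshape each sheet on its own, so differing year columns
  # never have to be aligned across sheets.
  lapply(function(x) {
    pivot_longer(x, cols = -companyName,
                 names_to = "Year", values_to = "values")
  }) %>%
  # Stack the sheets; the list names become the dataname column.
  bind_rows(.id = "dataname") %>%
  mutate(Year = as.integer(Year)) %>%
  select(companyName, Year, dataname, values)
```

Converting Year to integer is optional; it is done here so the column sorts and filters numerically.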

You can now select a specific sheet by name using dplyr::filter().

For example:

dat %>% filter(name == "sheetname1")

#   name       companyName year  values
#   <chr>      <chr>       <chr>  <dbl>
# 1 sheetname1 company1    1980      5.
# 2 sheetname1 company2    1980     10.
# 3 sheetname1 company3    1980     40.
# 4 sheetname1 company1    1981      6.
# 5 sheetname1 company2    1981     20.
# 6 sheetname1 company3    1981     50.
# 7 sheetname1 company1    1982      7.
# 8 sheetname1 company2    1982     30.
# 9 sheetname1 company3    1982     60.
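To keep several sheets at once, the same filter can use %in% with a vector of sheet names. A minimal sketch, with a made-up long-format table standing in for dat and a hypothetical selection in `wanted`:

```r
library(dplyr)

# Toy long-format data standing in for `dat` (values are made up).
dat <- tibble(
  name = rep(c("sheetname1", "sheetname2", "sheetname3"), each = 2),
  companyName = rep(c("company1", "company2"), times = 3),
  year = "1980",
  values = c(5, 10, 6, 11, 8, 12)
)

# Hypothetical selection of sheets to keep.
wanted <- c("sheetname1", "sheetname3")

# Keep only the rows that came from the selected sheets.
subset_dat <- dat %>% filter(name %in% wanted)
```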
