77
votes

I'm wondering if anyone knows of a way to import data from a "big" xlsx file (~20Mb). I tried to use xlsx and XLConnect libraries. Unfortunately, both use rJava and I always obtain the same error:

> library(XLConnect)
> wb <- loadWorkbook("MyBigFile.xlsx")
Error: OutOfMemoryError (Java): Java heap space

or

> library(xlsx)
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
   java.lang.OutOfMemoryError: Java heap space

I also tried to modify the java.parameters before loading rJava:

> options( java.parameters = "-Xmx2500m")
> library(xlsx) # load rJava
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
   java.lang.OutOfMemoryError: Java heap space

or after loading rJava (this is a bit stupid, I think):

> library(xlsx) # load rJava
> options( java.parameters = "-Xmx2500m")
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl,  : 
   java.lang.OutOfMemoryError: Java heap space

But nothing works. Does anyone have an idea?

8
Have you considered saving your data into a more universal format, e.g. csv?flodel
gdata is another option. I believe it is not java based, but I could be mistaken.Ricardo Saporta
Why is it that big? Lots of rows (do you need them all?), lots of columns (do you need them all?), lots of individual sheets (do you need them all?), one high-resolution embedded image (you don't need that...)? For spreadsheet and other binary files the size of the file in bytes is often not a useful measure of how big the data in it really is.Spacedman
gdata works... very slowly, about 7 min by sheet but it works.user2722443
I've been working on importing a colleague's monstrous, formula-laden Excel file (150 MB), and gdata was the only Excel package that could pull it off. As here, Java-based packages ran out of memory; openxlsx segfaulted. gdata took 30 minutes per sheet, but it got the job done.Matt Parker

8 Answers

148
votes

I stumbled on this question when someone sent me (yet another) Excel file to analyze. This one isn't even that big but for whatever reason I was running into a similar error:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Based on comment by @DirkEddelbuettel in a previous answer I installed the openxlsx package (http://cran.r-project.org/web/packages/openxlsx/). and then ran:

library("openxlsx")
mydf <- read.xlsx("BigExcelFile.xlsx", sheet = 1, startRow = 2, colNames = TRUE)

It was just what I was looking for. Easy to use and wicked fast. It's my new BFF. Thanks for the tip @DirkEddelbuettel!

BTW, I don't want to poach this answer from Dirk E, so if he posts an answer, please accept it rather than mine!

16
votes
options(java.parameters = "-Xmx2048m")  ## memory set to 2 GB
library(XLConnect)

allow for more memory using "options" before any java component is loaded. Then load XLConnect library (it uses java).

That's it. Start reading in data with readWorksheet .... and so on. :)

9
votes

I do agree with @orville jackson response & it really helped me too.

Inline to the answer provided by @orville jackson. here is the detailed description of how you can use openxlsx for reading and writing big files.

When data size is small, R has many packages and functions which can be utilized as per your requirement.

write.xlsx, write.xlsx2, XLconnect also do the work but these are sometimes slow as compare to openxlsx.

So, if you are dealing with the large data sets and came across java errors. I would suggest to have a look of "openxlsx" which is really awesome and reduce the time by 1/12th.

I've tested all and finally i was really impressed with the performance of openxlsx capabilities.

Here are the steps for writing multiple datasets into multiple sheets.

install.packages("openxlsx")
library("openxlsx")

start.time <- Sys.time()

# Creating large data frame
x <- as.data.frame(matrix(1:4000000,200000,20))
y <- as.data.frame(matrix(1:4000000,200000,20))
z <- as.data.frame(matrix(1:4000000,200000,20))

# Creating a workbook
wb <- createWorkbook("Example.xlsx")
Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") ## path to zip.exe

Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") has to be static as it takes reference of some utility from Rtools.

Note: Incase Rtools is not installed on your system, please install it first for smooth experience. here is the link for your reference: (choose appropriate version) https://cran.r-project.org/bin/windows/Rtools/

check the options as per link below (need to select all the check box while installation) https://cloud.githubusercontent.com/assets/7400673/12230758/99fb2202-b8a6-11e5-82e6-836159440831.png

# Adding a worksheets : parameters for addWorksheet are 1. Workbook Name 2. Sheet Name

addWorksheet(wb, "Sheet 1")
addWorksheet(wb, "Sheet 2")
addWorksheet(wb, "Sheet 3")

# Writing data in to respetive sheets: parameters for writeData are 1. Workbook Name 2. Sheet index/ sheet name 3. dataframe name

writeData(wb, 1, x)

# incase you would like to write sheet with filter available for ease of access you can pass the parameter withFilter = TRUE in writeData function.
writeData(wb, 2, x = y, withFilter = TRUE)

## Similarly writeDataTable is another way for representing your data with table formatting:

writeDataTable(wb, 3, z)

saveWorkbook(wb, file = "Example.xlsx", overwrite = TRUE)

end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken

openxlsx package is really good for reading and writing huge data from/ in excel files and has lots of options for custom formatting within excel.

The interesting fact is that we don’t have to bother about java heap memory here.

8
votes

I know this question is a bit old, but There is a good solution for this nowadays. This is a default package when you try to import excel in Rstudio with GUI and It works well in my situation.

library(readxl)

data <- read_excel(filename)
5
votes

As mentioned in the canonical Excel->R question, a recent alternative which has emerged comes from the readxl package, which I've found to be quite fast, compared with, e.g. openxlsx and xlsx.

That said, there's a definite limit of spreadsheet size past which you're probably better off just saving the thing as a .csv and using fread.

3
votes

I also had the same error in both xlsx::read.xlsx and XLConnect::readWorksheetFromFile. Maybe you can use RODBC::odbcDriverConnect and RODBC::sqlFetch, which uses Microsoft RODBC, which is much more efficient.

2
votes

@flodel's suggestion of converting to CSV seems the most straightforward. If for whatever reason, that's not an option, you can read in the file in chunks:

 require(XLConnect)
 chnksz <- 2e3
 s <- <sheet>
 wb <- loadWorkbook(<file>, s)
 tot.rows <- getLastRow(wb)
 last.row =0
 for (i in seq(ceiling( tot.rows / chnksz) )) {
    next.batch <- readWorksheet(wb, s, startRow=last.row+i, endRow=last.row+chnksz+i)
    # optionally save next.batch to disk or 
    # assign it to a list. See which works for you. 
 } 
0
votes

I found this thread looking for an answer to the exact same question. Rather than try to hack an xlsx file from within R what ended up working for me was to convert the file to .csv using python and then import the file into R using a standard scanning function.

Check out: https://github.com/dilshod/xlsx2csv