
I am trying to read a large (~700 MB) .csv file into R.

The file contains an array of integers less than 256, with a header row and 2 header columns.

I use:

trainSet <- read.csv(trainFileName)

This eventually barfs with:

Loading Data...
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
R(2760) malloc: *** mmap(size=151552) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Error: cannot allocate vector of size 145 Kb
Execution halted

Looking at the memory usage, it conks out at about 3 GB on a 6 GB machine with zero page-file usage at the time of the crash, so there may be another way to fix it.

If I use:

trainSet <- read.csv(trainFileName, header=TRUE, nrows=100)
classes <- sapply(trainSet, class)

I can see that all the columns are being loaded as "integer", which I think is 32 bits.

Clearly, using 3 GB to load part of a 700 MB .csv file is far from efficient. I wonder if there's a way to tell R to use 8-bit numbers for the columns? This is what I've done in the past in Matlab and it worked a treat; however, I can't find any mention of an 8-bit type in R.

Does it exist, and how would I tell read.csv to use it?

Thanks in advance for any help.

Possible solutions: a) get more RAM; b) put the data into a database and read chunks as needed using RMySQL; c) quickly spin up an EC2 instance with lots of memory and run it there. – Maiasaura
@Maiasaura: or just heed the warnings in the docs. ;-) – Joshua Ulrich

3 Answers

2 votes

The narrow answer is that the add-on package ff allows you to use a more compact representation.

The downside is that the different representation prevents you from passing the data to standard functions.

So you may need to rethink your approach: maybe sub-sampling the data, or getting more RAM.
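
For a sense of what ff's compact storage looks like, here is a minimal sketch (assuming the ff package is installed; "ubyte" is one of ff's one-byte storage modes, covering 0-255):

library(ff)

x <- ff(vmode = "ubyte", length = 1e6)  # 1 byte per value instead of 4
x[1:5] <- c(0L, 1L, 42L, 254L, 255L)
x[1:5]

Whether that pays off depends on how much of your downstream code can work with ff objects rather than ordinary R vectors.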

2 votes

Q: Can you tell R to use 8-bit numbers?

A: No. (Edit: See Dirk's comments below. He's smarter than I am.)

Q: Will more RAM help?

A: Maybe. Assuming a 64-bit OS and a 64-bit instance of R as the starting point, then "Yes"; otherwise "No".
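
A quick way to check what you are running (base R; the comment values are what the respective builds report):

.Machine$sizeof.pointer  # 8 on a 64-bit build of R, 4 on 32-bit
sessionInfo()            # also reports the platform, e.g. x86_64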

Implicit question A: Will a .csv dataset that is 700 MB on disk be 700 MB when read in by read.csv?

A: Maybe. If it really is all integers, it may be smaller or larger. R takes 4 bytes for each integer, and if most of your integers were in the range of -9 to 10, they might actually "expand" when stored as 4 bytes each. At the moment you are using only 1-3 bytes per value, so you would expect roughly a 50% increase in size. You would want to use colClasses="integer" in the read function; otherwise the values may get stored as factor or as 8-byte "numeric" if there are any data-input glitches.
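
To see why the storage class matters, a quick illustration (a made-up vector of the same kind of values; exact sizes vary by platform):

v <- sample(0:255, 1e6, replace = TRUE)  # stored as 4-byte integers
object.size(v)                           # about 4 MB
object.size(as.character(v))             # considerably larger as character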

Implicit question B: If you get the data into the workspace, will you be able to work with it?

A: Only maybe. You need, at a minimum, three times as much memory as your largest objects because of the way R copies on assignment, even when it is a copy to its own name.
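
A small illustration of that copy-on-modify behaviour (assuming an R build with memory profiling enabled, which the standard CRAN binaries have):

x <- integer(1e7)  # ~40 MB of integers
y <- x             # no copy yet; x and y share the same memory
tracemem(x)        # report when the shared memory is duplicated
y[1] <- 1L         # modifying y forces a full ~40 MB duplicate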

2 votes

Not trying to be snarky, but the way to fix this is documented in ?read.csv:

 These functions can use a surprising amount of memory when reading
 large files.  There is extensive discussion in the ‘R Data
 Import/Export’ manual, supplementing the notes here.

 Less memory will be used if ‘colClasses’ is specified as one of
 the six atomic vector classes.  This can be particularly so when
 reading a column that takes many distinct numeric values, as
 storing each distinct value as a character string can take up to
 14 times as much memory as storing it as an integer.

 Using ‘nrows’, even as a mild over-estimate, will help memory
 usage.

This example takes a while to run because of I/O, even with my SSD, but there are no memory issues:

R> # In one R session
R> x <- matrix(sample(256,2e8,TRUE),ncol=2)
R> write.csv(x,"700mb.csv",row.names=FALSE)

R> # In a new R session
R> x <- read.csv("700mb.csv", colClasses=c("integer","integer"),
+ header=TRUE, nrows=1e8)
R> gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    173632   9.3     350000   18.7    350000   18.7
Vcells 100276451 765.1  221142070 1687.2 200277306 1528.0
R> # Max memory used ~1.5 GB
R> print(object.size(x), units="Mb")
762.9 Mb