
I am on a server with 512 GB of RAM. I have an 84 GB CSV (hefty, I know). I am reading only 31 of its 79 columns; the excluded columns are all floats/decimals.

After comparing many methods, it seems the highest-performance way to do what I want is to fread the file. The file size is 84 GB, but watching top, the process uses 160 GB of memory (RES), even though the eventual data.table is only about 20 GB.

I know fread preallocates memory, which is why it's so fast. Just wondering: is this normal, and is there a way to curb the memory consumption?


Edit: it seems that even if I ask fread to read only 10,000 rows (of 300 million), it will still preallocate 84 GB of memory.

Maybe fread pieces of the file at a time and combine the results in R, but if your server has more than enough RAM I don't see what the issue is. Specifying colClasses might help if you aren't doing so already. - nrussell
Thanks, I'll try colClasses. The issue is just that I don't want to consume the server's shared resources more than necessary. Also, the files are not guaranteed to be this pleasant. It is market data, and on certain days I imagine the data size may explode. - grad student
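Beyond colClasses, fread's select= argument tells it to parse only the columns you name, which is exactly the 31-of-79 situation above. A minimal sketch; the file name and column names here are made up for illustration, with a tiny demo CSV standing in for the real 84 GB file:

```r
library(data.table)

# Tiny demo CSV standing in for the real 84 GB market-data file
# (file name and column names are hypothetical).
csv <- tempfile(fileext = ".csv")
writeLines(c("symbol,price,size,skip_me",
             "AAPL,189.5,100,0.1",
             "MSFT,402.2,250,0.2"), csv)

# select= makes fread parse only the named columns, and colClasses=
# skips type re-detection; both reduce memory use and parse time.
dt <- fread(csv,
            select     = c("symbol", "price", "size"),
            colClasses = list(character = "symbol",
                              numeric   = c("price", "size")))
print(dt)   # only 3 columns; skip_me is never materialized
```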

2 Answers


See R FAQ 7.42. If you want to minimize the resources you use on the server, read the CSV with fread once, then save the resulting object with save or saveRDS. After that, read the binary file whenever you need the data.
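The pattern looks like this; a small in-memory table stands in for the expensive fread() result:

```r
library(data.table)

# Demo table standing in for the one-time fread() result.
dt <- data.table(symbol = c("AAPL", "MSFT"), price = c(189.5, 402.2))

# One-time cost: cache the parsed object as a compressed binary file.
rds <- tempfile(fileext = ".rds")
saveRDS(dt, rds)

# Later sessions: readRDS skips CSV parsing entirely and allocates
# only what the object itself needs (~20 GB here, not 160 GB).
dt2 <- readRDS(rds)
```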

Or you can use a command-line tool like cut, awk, or sed to select only the columns you want and write the output to another file, then run fread on that smaller file.
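Recent versions of data.table can even run such a command for you via fread's cmd= argument, so the trimmed file never has to be written to disk. A sketch, again with a small demo CSV and illustrative field numbers:

```r
library(data.table)

# Demo CSV; in practice this would be the 84 GB market-data file.
csv <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), csv)

# fread(cmd = ...) reads the stdout of a shell command, so cut can
# drop unwanted columns before R ever sees them.
dt <- fread(cmd = paste("cut -d, -f1,3", csv))
```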