8
votes

I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files.

The file in question has 313 rows and ~6.6 million columns of numeric data, and is around 12 GB. The machine runs CentOS 6.4 with 512 GB of RAM.

When I attempt to read in the file:

g=fread('final.results',header=T,sep=' ')
'header' changed by user from 'auto' to TRUE
Error: protect(): protection stack overflow

I tried starting R with --max-ppsize 500000, which is the maximum, but got the same error.

I also tried setting the stack size to unlimited via

ulimit -s unlimited

Virtual memory was already set to unlimited.

Am I being unrealistic with a file of this size? Did I miss something fairly obvious?

1
Please try v1.8.9 on R-Forge (link on data.table homepage). There are 10 bug fixes to fread there, see NEWS. Large file support is one of them, but that fix was for Windows; it should already be OK on Linux. 6.6 million columns (!) is new and could well be a new bug. Please confirm with v1.8.9 and we'll go from there... - Matt Dowle
@MatthewDowle Yes, I'm not happy with 6 million rows either. Installed 1.8.9, same error. I made a much smaller file, 10 rows x 50K cols, same error. With 10 rows x 49,999 cols it works. - mpmorley
Did you mean columns in that comment (you wrote 6 million rows)? Very interesting and strange that it fails at exactly 50,000 columns. Well done for homing in on that so quickly. I don't recall any column limit like that. Will take a look. - Matt Dowle
Sorry, yes, columns. A question regarding the first comment: I have 512GB of RAM, so why wouldn't the file fit? Thanks and great work with data.table. - mpmorley
Apols, I misread. For some reason I interpreted 512 as MB, since 512GB RAM is quite big then! Even though you wrote 512GB. So, yes should of course read in fine. Does read.table / read.csv work with 50k columns and 6e6 columns? - Matt Dowle

1 Answer

6
votes

Now fixed in v1.8.9 on R-Forge.

  • An unintended 50,000 column limit has been removed in fread. Thanks to mpmorley for reporting. Test added.

The reason was that I got this part wrong in the fread.c source:

// *********************************************************************
// Allocate columns for known nrow
// *********************************************************************
ans=PROTECT(allocVector(VECSXP,ncol));
protecti++;
setAttrib(ans,R_NamesSymbol,names);
for (i=0; i<ncol; i++) {
    thistype  = TypeSxp[ type[i] ];
    thiscol = PROTECT(allocVector(thistype,nrow));   // ** HERE **
    protecti++;
    if (type[i]==SXP_INT64)
        setAttrib(thiscol, R_ClassSymbol, ScalarString(mkChar("integer64")));
    SET_TRUELENGTH(thiscol, nrow);
    SET_VECTOR_ELT(ans,i,thiscol);
}

According to R-exts section 5.9.1, that PROTECT inside the loop isn't needed:

In some cases it is necessary to keep better track of whether protection is really needed. Be particularly aware of situations where a large number of objects are generated. The pointer protection stack has a fixed size (default 10,000) and can become full. It is not a good idea then to just PROTECT everything in sight and UNPROTECT several thousand objects at the end. It will almost invariably be possible to either assign the objects as part of another object (which automatically protects them) or unprotect them immediately after use.

So that PROTECT is now removed and all is well. (It seems that the pointer protection stack limit has been increased from 10,000 to 50,000 since that text was written; Defn.h contains #define R_PPSSIZE 50000L.) I've checked all other PROTECTs in the data.table C source for anything similar, and found and fixed one in assign.c too (when adding more than 50,000 columns by reference); there were no others.
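
To make the arithmetic concrete, here is a toy model in plain C. It does not use the R API; it only counts slots on a fixed-size stack. The names pp_stack, toy_protect, toy_unprotect and simulate are hypothetical, and the overflow check (fail once the stack already holds PP_STACK_SIZE entries) is an assumption about how R behaves. The real fread may also hold other protected objects at that point, so the exact boundary in the model is illustrative only.

```c
#include <assert.h>

/* Toy model only. PP_STACK_SIZE mirrors the R_PPSSIZE value quoted above. */
#define PP_STACK_SIZE 50000L

typedef struct {
    long top;   /* slots currently in use */
    long peak;  /* highest value 'top' has reached */
} pp_stack;

/* Take one slot. Returns 1 on success, 0 on "protection stack overflow". */
static int toy_protect(pp_stack *s) {
    if (s->top >= PP_STACK_SIZE) return 0;
    s->top++;
    if (s->top > s->peak) s->peak = s->top;
    return 1;
}

/* Release the n most recently taken slots. */
static void toy_unprotect(pp_stack *s, long n) {
    assert(n >= 0 && n <= s->top);
    s->top -= n;
}

/* Model allocating a result list with ncol columns.
 * protect_each_column != 0: old pattern, one slot per column, all
 *                           released together at the end.
 * protect_each_column == 0: fixed pattern, only the list takes a slot;
 *                           each column is kept alive by the list itself.
 * Returns the peak stack depth, or -1 if the stack overflowed. */
long simulate(long ncol, int protect_each_column) {
    pp_stack s = {0, 0};
    long protecti = 0;

    if (!toy_protect(&s)) return -1;      /* the result list */
    protecti++;

    for (long i = 0; i < ncol; i++) {
        if (protect_each_column) {
            if (!toy_protect(&s)) return -1;
            protecti++;
        }
        /* Storing the column in the protected list costs no slot. */
    }

    toy_unprotect(&s, protecti);
    assert(s.top == 0);                   /* protects and unprotects balance */
    return s.peak;
}
```

In this model, simulate(49999, 1) peaks at exactly 50,000 slots and succeeds, simulate(50000, 1) overflows, and simulate(6600000, 0) never uses more than one slot, which lines up with the 49,999 versus 50,000 column behaviour reported in the comments.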

Thanks for reporting!