
I'm using sparklyr and H2O in R to develop some ML models, and I'm getting an error at the initial data-setup stage. I pull the data in with spark_read_csv, set up partitions with sdf_partition, and then define an H2O frame with as_h2o_frame:

library(sparklyr)
library(rsparkling)
library(h2o)

# sc is an existing spark_connection created with spark_connect()
df <- spark_read_csv(sc,
                     name = "frame_name",
                     path = "aPathToData.csv")

# Split into training and validation sets
partitions <- df %>% sdf_partition(training = 0.6,
                                   test_validate = 0.4,
                                   seed = 12)

# Convert the Spark training partition to an H2O frame
train_set <- as_h2o_frame(sc,
                          partitions$training,
                          name = "train_set")

This returns the error:

Error: C stack usage 38903392 is too close to the limit

I've successfully run this exact code on a much smaller dataset (145 MB), whereas my current CSV is 2.3 GB. Still, I have 32 GB of memory, and it doesn't seem to be the size of the dataset: I threw away most of the rows, got the file down to 32 MB, and still got the error. It must be something unique to this dataset other than its size.

UPDATE: the error is caused by the number of columns in the dataset. When the Spark data frame has more than 1689 columns, as_h2o_frame fails with this error; with 1689 or fewer columns it runs without error.
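
For reference, a minimal sketch of how that threshold can be checked, assuming the columns can be selected positionally with dplyr::select (the cutoff values below just restate the 1689/1690 observation):

# 1689 columns or fewer: conversion succeeds
narrow_sdf <- df %>% dplyr::select(1:1689)
narrow_hf  <- as_h2o_frame(sc, narrow_sdf, name = "narrow_frame")

# 1690 columns: conversion fails with the C stack error
wide_sdf <- df %>% dplyr::select(1:1690)
wide_hf  <- as_h2o_frame(sc, wide_sdf, name = "wide_frame")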

Can you list the version number for all the different packages you are using? Thanks! – Lauren
Thanks for responding Lauren. H2O version 3.16.0.2; sparklyr version 0.7.0; rsparkling 0.2.3 – JPErwin

1 Answer


Since the error message appears to come from R itself, this is more likely an R or sparklyr issue than a bug in H2O. However, if you post the issue to the Sparkling Water repo with a reproducible code example and logs (if possible), it can be reviewed there; that will make it easier to identify which package is causing the error and to direct the bug report to the correct project.
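
As a starting point for that reproducible example, here is a minimal sketch, assuming a local Spark connection and a synthetic data frame wide enough to cross the reported ~1689-column threshold (the connection settings, column count, and frame names below are illustrative):

library(sparklyr)
library(rsparkling)
library(h2o)

sc <- spark_connect(master = "local")

# Build a synthetic data frame with 1700 columns (above the reported threshold)
wide_df <- as.data.frame(matrix(rnorm(10 * 1700), nrow = 10))

# Copy it into Spark and attempt the H2O conversion
wide_sdf <- copy_to(sc, wide_df, "wide_frame", overwrite = TRUE)
wide_hf  <- as_h2o_frame(sc, wide_sdf, name = "wide_frame_h2o")  # the C stack error is reportedly raised here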