I'm using sparklyr and H2O in R to develop some ML models, and I'm getting an error on the initial data read. I pull the data in with spark_read_csv, set up partitions with sdf_partition, then define an H2O frame with as_h2o_frame:
df <- spark_read_csv(sc,
                     name = "frame_name",
                     path = "aPathToData.csv")

partitions <- df %>%
  sdf_partition(training = 0.6,
                test_validate = 0.4,
                seed = 12)

train_set <- as_h2o_frame(sc,
                          partitions$training,
                          name = "train_set")
This returns the error:
Error: C stack usage 38903392 is too close to the limit
I've successfully run this exact code on a much smaller dataset: 145 MB versus my current CSV, which is 2.3 GB. Still, I have 32 GB of memory, and the size of the dataset doesn't seem to be the problem: I threw away most of the rows, got the file down to 32 MB, and still got the error. It must be something unique to this dataset other than its size.
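For what it's worth, the C stack limit the error message refers to can be inspected from R itself with base R's Cstack_info():

```r
# Report R's C stack size (bytes), current usage, and eval depth.
# The error above fires when usage gets too close to 'size'.
Cstack_info()
```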
UPDATE: the error is caused by the number of columns in the dataset. When I run as_h2o_frame on a Spark data frame with more than 1689 columns, I get the error; with 1689 or fewer columns, there is no error.
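In case anyone wants to reproduce the diagnosis, this is roughly how I narrowed it down. It's only a sketch: the column counts come from my dataset, and I'm assuming dplyr's select() by position translates correctly for the Spark table:

```r
library(dplyr)

# Try converting progressively wider column subsets of the Spark frame
# to find the width at which as_h2o_frame() starts failing.
for (k in c(1689, 1690)) {
  subset_k <- partitions$training %>% select(1:k)
  result <- try(as_h2o_frame(sc, subset_k), silent = TRUE)
  cat(k, "columns:", if (inherits(result, "try-error")) "error" else "ok", "\n")
}
```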