After importing a fairly large table from MySQL into H2O on my machine, I tried to run a hashing algorithm (murmurhash from the R digest package) on one of its columns and save the result back to H2O. As I found out, using as.data.frame on an H2OFrame object is not always advisable: my original H2OFrame has ~43k rows, but the coerced data.frame usually contains only ~30k rows for some reason (the same goes for using base::apply/base::sapply/etc. on the H2OFrame).
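Roughly, what I see is something like this (a sketch, with data being the imported H2OFrame):

library(h2o)

nrow(data)                  # H2OFrame reports ~43k rows
local_df <- as.data.frame(data)
nrow(local_df)              # the coerced data.frame only has ~30k rows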
I found out there is an apply function for H2OFrames as well, but as far as I can see, it can only be used with built-in R functions.
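For instance, a call like this does work, because the function can be translated by the backend (the numeric column name here is just a placeholder):

# Works: mean is a primitive that H2O knows how to execute server-side
h2o::apply(data[, "some_numeric_column"], 2, mean)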
My own code, however, needs digest and looks like this:
data[, "subject"] <- h2o::apply(data[, "subject"], 2, function(x)
digest(x, algo = "murmur32"))
I get the following error:
Error in .process.stmnt(stmnt, formalz, envs) :
Don't know what to do with statement: digest
I understand that only the predefined functions from the Java backend can be used to manipulate H2O data, but is there perhaps another way to use the digest package from the client side without converting the data to a data.frame? I was thinking that, in the worst case, I will have to use the R MySQL driver to load the data first, manipulate it as a data.frame and then upload it to the H2O cloud. Thanks in advance for any help.
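For reference, the fallback I have in mind would look roughly like this (connection details, table and column names are just placeholders):

library(RMySQL)
library(digest)
library(h2o)

# Load the table locally via MySQL instead of importing it into H2O first
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")
df  <- dbGetQuery(con, "SELECT * FROM my_table")
dbDisconnect(con)

# Hash the column on the client side with digest, one value per row
df$subject <- sapply(df$subject, function(x) digest(x, algo = "murmur32"))

# Push the finished data.frame up to the H2O cluster
h2o.init()
data <- as.h2o(df)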
Your code uses margin = 2 (columns) instead of margin = 1 (rows). Since you are trying to replace the data[, "subject"] column with the results, my guess is that you are actually trying to apply the hash function to each row. I have an answer for you, but I want to make sure I understand what you are trying to do first. – Erin LeDell