I'm looking for an easy and efficient way to replace every occurrence of a certain value in an H2O Python data frame. In this case the value is NULL, and my dataset contains a very large number of NULLs.

My current way of doing it is extraordinarily slow when I have hundreds of columns in a very large dataset. I assume there can be substantial improvements by doing this in a better way...

I just can't figure out the syntax. Thanks, this will save me an enormous amount of time!

My current approach:

for each_col in table_names_list:
    h2o_df[h2o_df[each_col].isna(), each_col]=0

1 Answer

In the special case of NAs, you can use the impute() method to replace all of them with a single value (or alternatively, you can impute the mean, median or mode of a column). Here is an example:

import h2o

h2o.init()

df = h2o.H2OFrame([[1,2,3],[4,5,6]])
df.insert_missing_values(fraction=0.5, seed=1)

So the frame will look like this:

  C1    C2    C3
----  ----  ----
 nan   nan     3
 nan     5   nan

Now we can impute by value, but we need to pass a list of values whose length equals the number of columns (in your case, all zeros).

# column=-1 targets every column; impute() modifies df in place
df.impute(column=-1, values=[0] * df.ncol)

Now the frame looks like this:

  C1    C2    C3
----  ----  ----
   0     0     3
   0     5     0
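
If only some columns should be filled (as with table_names_list in the question), one option is to build the values list so that the untouched columns get None. This is a minimal sketch, not a tested recipe: build_impute_values and fill_selected_nas are hypothetical helper names, and it assumes that a None entry in values makes impute() skip that column, which is how I read the H2O documentation. Check this on your H2O version, particularly if the frame has categorical columns.

```python
def build_impute_values(all_columns, target_columns, fill=0):
    """Return one entry per column: `fill` for targeted columns, None otherwise."""
    targets = set(target_columns)
    unknown = targets - set(all_columns)
    if unknown:
        raise ValueError(f"columns not in frame: {sorted(unknown)}")
    # Preserve the frame's column order, since impute() matches values by position.
    return [fill if name in targets else None for name in all_columns]


def fill_selected_nas(frame, target_columns, fill=0):
    # `frame` is expected to be an H2OFrame; impute() modifies it in place.
    # Assumption: a None entry in `values` leaves that column untouched.
    values = build_impute_values(frame.names, target_columns, fill)
    frame.impute(column=-1, values=values)
    return frame
```

With the frame from the question this would be called as fill_selected_nas(h2o_df, table_names_list), which issues a single impute() call instead of one assignment per column.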