Replacing Spark dataset column values randomly from a set

Question

I have a dataset imputedcsv in which I want to randomly replace the null values in the Gender column with either Male or Female.

imputedcsv.groupBy("Gender").count.show()

+------+-----+
|Gender|count|
+------+-----+
|  null|   24|
|Female|  240|
|  Male|  242|
+------+-----+

One can fill the null values with a single value, but how can the null values of the column be filled randomly from a set of values, say {Male, Female}?

imputedcsv.na.fill("Male", Seq("Gender")).groupBy("Gender").count.show()

+------+-----+
|Gender|count|
+------+-----+
|Female|  240|
|  Male|  266|
+------+-----+

Instead of replacing the null values with just the one value Male, I need to randomly fill them with either Male or Female.

Something like sample(c('Male','Female')) in R.

For a single value there is already the question "How to replace null values with a specific value in Dataframe using spark in Java?"

Any help is appreciated.

philantrovert · Accepted Answer · 2017-11-21T07:16:55

If you assume that Female and Male are equally likely, you can do something like this:

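import org.apache.spark.sql.functions.{coalesce, lit, rand, round, when}
// $"..." needs import spark.implicits._ (already in scope in spark-shell)

df.withColumn("gender",
    // coalesce keeps existing values; the random pick applies only to nulls
    coalesce($"gender",
             when(round(rand()).cast("int") === lit(0), lit("Male"))
               .otherwise(lit("Female"))
    )).show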

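coalesce returns its first non-null argument, so existing values are kept and only the nulls are replaced. rand returns a uniform value in [0, 1), so round(rand).cast("int") yields 0 or 1 with equal probability for each row, and the when/otherwise construct then maps that result to Male or Female.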
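
The same idea can probably be extended to a set with more than two values. The following is only a sketch, not part of the original answer: the object RandomFill and the helpers randomFrom and fillRandomly are hypothetical names, and it assumes Spark 2.4 or later, where element_at is available. It picks a uniform 1-based index into an array of literals and, as above, uses coalesce so that only nulls are touched.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, coalesce, col, element_at, floor, lit, rand}

object RandomFill {
  // Hypothetical helper: one element of `values`, chosen uniformly at random per row.
  def randomFrom(values: Seq[String]): Column = {
    require(values.nonEmpty, "values must not be empty")
    // rand() is uniform on [0, 1), so floor(rand() * n) + 1 is a uniform index in 1..n.
    // element_at uses 1-based indexing for arrays.
    val index = (floor(rand() * values.size) + 1).cast("int")
    element_at(array(values.map(lit): _*), index)
  }

  // Hypothetical helper: replaces only the nulls in `column`;
  // non-null values pass through coalesce unchanged.
  def fillRandomly(df: DataFrame, column: String, values: Seq[String]): DataFrame =
    df.withColumn(column, coalesce(col(column), randomFrom(values)))
}
```

With the dataset from the question, this could be called as RandomFill.fillRandomly(imputedcsv, "Gender", Seq("Male", "Female")).groupBy("Gender").count.show(). Because rand() is unseeded here, the split of the 24 nulls between Male and Female will differ from run to run.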
