1
votes

Have a dataset imputedcsv where I want to randomly replace null values in Gender column with say Male or Female..

imputedcsv.groupBy("Gender").count.show()

+------+-----+
|Gender|count|
+------+-----+
|  null|   24|
|Female|  240|
|  Male|  242|
+------+-----+

One fill the null values by a single value, but how to fill the null value of the column randomly from a set of values say {Male,Female}

imputedcsv.na.fill("Male", Seq("Gender")).groupBy("Gender").count.show()

+------+-----+
|Gender|count|
+------+-----+
|Female|  240|
|  Male|  266|
+------+-----+

Instead of replacing the null values by just one value Male, I need to randomly fill it with either Male or Female.

Something like using sample(c('Male','Female'))

For single value we have How to replace null values with a specific value in Dataframe using spark in Java?

Any help is appreciated.

3

3 Answers

1
votes

If you consider equal probabilities of Gender being either Female or Male, you can do something like this :

df.withColumn( "gender", 
    coalesce($"gender", 
                 when(round(rand).cast("int") === lit(0) , lit("Male") )
                 .otherwise(lit("Female"))
             )).show

coalesce makes it work only for null values. round(rand).cast("int") will generate either 0 or 1 each time and then Male or Female will be decide by the when - otherwise construct.

0
votes

you can achieve it with the use of when & otherwise along with withColumn, like this:

scala> df.groupBy("Gender").count.show

+------+-----+
|Gender|count|
+------+-----+
|  null|    2|
|female|    4|
|  male|    4|
+------+-----+

scala> df.withColumn("gender", when(($"gender".isNull), "male").otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+

I missed randomly, you can achive it like below:

scala> val gender_set = Set("male","female")
gender_set: scala.collection.immutable.Set[String] = Set(male, female)

scala> import scala.util.Random
import scala.util.Random

scala>  val rnd=new Random
rnd: scala.util.Random = scala.util.Random@668b5a55

scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    4|
|  male|    6|
+------+-----+


scala> df.withColumn("gender", when(($"gender".isNull), gender_set.toVector(rnd.nextInt(gender_set.size))).otherwise($"gender")).groupBy("gender").count.show
+------+-----+
|gender|count|
+------+-----+
|female|    6|
|  male|    4|
+------+-----+

Thanks.

0
votes

I needed to put @Learner's code in a UDF to work otherwise it would error.

df.groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|  null|    3|
|Female|    3|
|  Male|    2|
+------+-----+
val gender_set = Set("Male","Female")

val randGenderUDF = udf(() => 
   gender_set.toVector(rnd.nextInt(gender_set.size))
)

df.withColumn("Gender", when($"Gender".isNull, randGenderUDF()).otherwise($"Gender")).groupBy($"Gender").count.show()
+------+-----+
|Gender|count|
+------+-----+
|Female|    5|
|  Male|    3|
+------+-----+