How to replace null values with a specific value in a DataFrame using Spark in Java?

Question

I am trying to improve the accuracy of a logistic regression algorithm implemented in Spark using Java. To do this, I'm trying to replace the null or invalid values in a column with the most frequent value of that column. For example:

Name|Place
a   |a1
a   |a2
a   |a2
    |d1
b   |a2
c   |a2
c   |
    |
d   |c1

In this case I'll replace all the null values in column "Name" with 'a' and in column "Place" with 'a2'. So far I have only been able to extract the most frequent value in a particular column. Can you please help me with the second step: how do I replace the null or invalid values with the most frequent value of that column?

Rami Rami · Accepted Answer · 2017-06-21T09:21:03

You can use the .na.fill function (a method of org.apache.spark.sql.DataFrameNaFunctions).

Basically the function you need is: def fill(value: String, cols: Seq[String]): DataFrame

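You choose the columns to fill and the value that should replace null or NaN. The signature above is the Scala one; from Java, the equivalent overload takes a String[] of column names, e.g. df.na().fill("a", new String[]{"Name"}).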

In your case it will be something like:

val df2 = df.na.fill("a", Seq("Name"))
            .na.fill("a2", Seq("Place"))
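
The two steps from the question (find the most frequent value, then fill) can be combined roughly as follows. This is only a sketch under some assumptions: the object name FillWithMode and the helper mostFrequent are made up for illustration, the sample data is the table from the question with blank cells modelled as null, and a local SparkSession is created just to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

// Hypothetical example object; the names here are not part of the Spark API.
object FillWithMode {

  // Most frequent non-null, non-empty value of a string column, if there is one.
  def mostFrequent(df: DataFrame, column: String): Option[String] =
    df.filter(col(column).isNotNull && col(column) =!= "")
      .groupBy(column)
      .count()
      .orderBy(desc("count"))
      .head(1)          // Array[Row] with at most one element
      .headOption
      .map(_.getString(0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-with-mode")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question; missing cells are represented as null.
    val df = Seq[(String, String)](
      ("a", "a1"), ("a", "a2"), ("a", "a2"), (null, "d1"), ("b", "a2"),
      ("c", "a2"), ("c", null), (null, null), ("d", "c1")
    ).toDF("Name", "Place")

    // Column name -> replacement value; columns with no usable value are skipped.
    val replacements: Map[String, String] =
      Seq("Name", "Place").flatMap(c => mostFrequent(df, c).map(c -> _)).toMap

    // na.fill with a Map replaces nulls in each listed column with its own value.
    val filled = df.na.fill(replacements)
    filled.show()

    spark.stop()
  }
}
```

For the sample table this selects 'a' for "Name" and 'a2' for "Place", matching the question. Two caveats: if several values tie for most frequent, which one is picked is not deterministic, and na.fill only touches null (and NaN in numeric columns), so empty strings or other "invalid" markers would need to be converted to null first.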
