
Please do not mark this question as a duplicate. I have checked the question below, but it only gives a solution for Python and Scala, and the method for Java is different: How to replace null values with a specific value in Dataframe using spark in Java?

I have a Dataset<Row> ds which I created by reading a parquet file, so all column values are strings. Some of the values are null. I am using .na().fill("") to replace the null values with an empty string:

Dataset<Row>  ds1 = ds.na().fill("");

But it is not replacing the null values, and I cannot understand why. Here is the relevant part of the schema:

|-- stopPrice: double (nullable = true)
|-- tradingCurrency: string (nullable = true)

Could you provide a sample of your CSV (the rows for which "it is not working"), the output you get, and the one you expect? It would really help us understand what's wrong. – Oli
Just to let you know, I tried your code and it works. Most likely, your values are not really null, but I need more info to be certain. – Oli
Just updated my question. I am reading two files, one parquet and one CSV; both contain similar data. I am storing both in datasets (Dataset<Row>), then using the above code to replace null values with an empty string. For the dataset created from the CSV it works, but for the dataset created from the parquet it does not. – user812142
I tried Dataset<Row> ds1 = ds.na().fill(0); and it works. Does this fill method depend on the datatype of the column? If yes, then I would need to convert the column to String first, which would be messy. Is there a cleaner method? – user812142
I had started to write an answer with an example, so I posted it anyway; it could help others. Regarding your next question, you can apply a schema when you read a dataframe from a CSV file with spark.read.schema(...).csv("xxx.csv"). If the dataframe is already created, however, you need to cast the corresponding columns. In your case, you could probably read the parquet file, extract the schema with df.schema(), and use it when parsing the CSV. – Oli
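
For reference, a minimal Java sketch of the schema approach Oli describes (the file paths "data.parquet" and "data.csv" are placeholders):

// Read the parquet file and reuse its schema when parsing the CSV,
// so both datasets end up with the same column types.
Dataset<Row> parquetDs = spark.read().parquet("data.parquet");
Dataset<Row> csvDs = spark.read()
    .schema(parquetDs.schema())
    .csv("data.csv");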

1 Answer


From what I see, your column has a numeric type, and in Spark you cannot replace a null value with a value of an incompatible type. Therefore, in your case, you cannot use a string ("" here). Here is an example that illustrates this:

// assumes: import static org.apache.spark.sql.functions.*;
Dataset<Row> df = spark.range(10)
    .select(col("id"),
            // even ids become null, odd ids are cast to string
            when(col("id").mod(2).equalTo(lit(0)), null)
                .otherwise(col("id").cast("string")).as("string_col"),
            // even ids become null again, but this column keeps its numeric type
            when(col("id").mod(2).equalTo(lit(0)), null)
                .otherwise(col("id")).as("int_col"));

df.na().fill("").show();

And here is the result:

+---+----------+-------+
| id|string_col|int_col|
+---+----------+-------+
|  0|          |   null|
|  1|         1|      1|
|  2|          |   null|
|  3|         3|      3|
|  4|          |   null|
|  5|         5|      5|
|  6|          |   null|
|  7|         7|      7|
|  8|          |   null|
|  9|         9|      9|
+---+----------+-------+

It works for the string column but not for the integer column. Note that I used the cast function to turn the int into a string so that fill("") would apply to it. That could be a nice workaround in your situation.
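
Applied to the schema from the question, that workaround might look like this (a sketch, assuming stopPrice is the numeric column causing trouble):

// Cast the numeric column to string first, then fill the nulls.
Dataset<Row> ds1 = ds
    .withColumn("stopPrice", col("stopPrice").cast("string"))
    .na().fill("");

Alternatively, na().fill also accepts a java.util.Map from column name to replacement value, so a numeric column can be filled with a numeric default (e.g. 0.0) without any casting.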