18
votes

I have the following dataset, which contains some null values. I need to replace the null values using fillna in Spark.

DataFrame:

df = spark.read.format("com.databricks.spark.csv").option("header","true").load("/sample.csv")

>>> df.printSchema();
root
 |-- Age: string (nullable = true)
 |-- Height: string (nullable = true)
 |-- Name: string (nullable = true)

>>> df.show()
+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|  null|  Bob|
| 50|  null|  Tom|
| 50|  null| null|
+---+------+-----+

>>> df.na.fill(10).show()

When I supply the fill value, nothing changes; the same DataFrame appears again.

+---+------+-----+
|Age|Height| Name|
+---+------+-----+
| 10|    80|Alice|
|  5|  null|  Bob|
| 50|  null|  Tom|
| 50|  null| null|
+---+------+-----+

I also tried storing the filled values in a new DataFrame, but the result is still unchanged.

>>> df2 = df.na.fill(10)

How can I replace the null values? Please show me the possible ways to do this using fillna. Thanks in advance.

2
Are there any rules for the replacement? E.g., is replacing nulls in the Height column different from replacing them in the Name column? - eliasah
In my case the null values are not replaced whether or not I specify a rule. Even the basic fill operation is not working properly; I checked with different datasets. - Churchill vins

2 Answers

29
votes

It seems that your Height column is not numeric. When you call df.na.fill(10), Spark replaces nulls only in columns whose type matches the type of 10, that is, numeric columns.

If the Height column needs to stay a string, you can try df.na.fill('10').show(); otherwise, casting it to IntegerType() is necessary.

10
votes

You can also provide a specific default value for each column if you prefer.

df.na.fill({'Height': '10', 'Name': 'Bob'})