0
votes

I have a spark dataframe with few columns as null. I need to create a new dataframe , adding a new column "error_desc" which will mention all the columns with null values for every row. I need to do this dynamically without mentioning each column name.

eg: if my dataframe is below

+-----+------+------+
|Rowid|Record|Value |
+-----+------+------+
|    1|     a|     b|
|    2|  null|     d|
|    3|     m|  null|
+-----+------+------+

my final dataframe should be

+-----+------+-----+--------------+
|Rowid|Record|Value|    error_desc|
+-----+------+-----+--------------+
|    1|     a|    b|          null|
|    2|  null|    d|record is null|
|    3|     m| null| value is null|
+-----+------+-----+--------------+
1
can you provide sample input dataframe and also format as a code it so that we can understand your issue. - Nikk
Rowid Record Value 1 a b 2 null d 3 m null - Spark-dev
I have edit your question format for better understanding, Please commit it - Nikk
@Nikk done.. can you help? - Spark-dev
these values are from files which I am reading dynamically.. so I don't want to write column names in the code.. - Spark-dev

1 Answers

0
votes

I have added few more rows in Input DataFrame to cover more cases. You do not required to hard code any column. Use below UDF, it will give your desire output.

scala> import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.expressions.UserDefinedFunction

scala> df.show()
+-----+------+-----+
|Rowid|Record|Value|
+-----+------+-----+
|    1|     a|    b|
|    2|  null|    d|
|    3|     m| null|
|    4|  null|    d|
|    5|  null| null|
| null|     e| null|
|    7|     e|    r|
+-----+------+-----+


scala> def CheckNull:UserDefinedFunction = udf((Column:String,r:Row) => {
 |     var check:String = ""
 |     val ColList = Column.split(",").toList
 |     ColList.foreach{ x =>
 |       if (r.getAs(x) == null)
 |       {
 |         check = check + x.toString + " is null. "
 |       }}
 |   check
 |   })


scala> df.withColumn("error_desc",CheckNull(lit(df.columns.mkString(",")),struct(df.columns map col: _*))).show(false)
+-----+------+-----+-------------------------------+
|Rowid|Record|Value|error_desc                     |
+-----+------+-----+-------------------------------+
|1    |a     |b    |                               |
|2    |null  |d    |Record is null.                |
|3    |m     |null |Value is null.                 |
|4    |null  |d    |Record is null.                |
|5    |null  |null |Record is null. Value is null. |
|null |e     |null |Rowid is null. Value is null.  |
|7    |e     |r    |                               |
+-----+------+-----+-------------------------------+