
I have a Spark DataFrame and I want to select a few rows/records from it based on a matching value in a particular column. I guess I can do this with a filter operation, or with a select inside a map transformation.

But I also want to update a status column for the rows/records that were not selected by the filter.

Applying a filter operation returns a new DataFrame consisting of only the matching records.

So, how do I identify and update the column value of the rows that were not selected?


1 Answer


Applying the filter operation gives you a new DataFrame consisting of the matching records.

Then you can use the except function (in the Scala API) to get the non-matching records from the input DataFrame.

scala> val inputDF = Seq(("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)).toDF("id", "count")
inputDF: org.apache.spark.sql.DataFrame = [id: string, count: int]

scala> val filterDF = inputDF.filter($"count" > 3)
filterDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, count: int]

scala> filterDF.show()
+---+-----+
| id|count|
+---+-----+
|  d|    4|
|  e|    5|
+---+-----+

scala> val unmatchDF = inputDF.except(filterDF)
unmatchDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, count: int]

scala> unmatchDF.show()
+---+-----+
| id|count|
+---+-----+
|  b|    2|
|  a|    1|
|  c|    3|
+---+-----+

In PySpark you can achieve the same with the subtract function, e.g. inputDF.subtract(filterDF).
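
If the goal is only to mark the non-matching rows rather than split them into a separate DataFrame, you can skip except entirely and set the status in one pass with when/otherwise. Here is a minimal sketch in the same Scala shell; the status column name and its values are illustrative assumptions, not something from the question:

scala> import org.apache.spark.sql.functions.when
import org.apache.spark.sql.functions.when

scala> // "status", "selected" and "not_selected" are made-up labels for illustration
scala> val statusDF = inputDF.withColumn("status", when($"count" > 3, "selected").otherwise("not_selected"))
statusDF: org.apache.spark.sql.DataFrame = [id: string, count: int ... 1 more field]

scala> statusDF.show()
+---+-----+------------+
| id|count|      status|
+---+-----+------------+
|  a|    1|not_selected|
|  b|    2|not_selected|
|  c|    3|not_selected|
|  d|    4|    selected|
|  e|    5|    selected|
+---+-----+------------+

This keeps every row in a single DataFrame, so there is no need to stitch the matching and non-matching records back together afterwards.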