I have the below PySpark DataFrame, which can be recreated with the following code:
df = spark.createDataFrame([(1, "John Doe", "2020-11-30"),(2, "John Doe", "2020-11-27"),(3, "John Doe", "2020-11-29")],
("id", "name", "date"))
+---+--------+----------+
| id| name| date|
+---+--------+----------+
| 1|John Doe|2020-11-30|
| 2|John Doe|2020-11-27|
| 3|John Doe|2020-11-29|
+---+--------+----------+
I am looking to create a UDF to calculate the difference between the dates in 2 rows (using the lag function), excluding weekends, as PySpark 2.2.0 does not have an in-built function to do so. E.g. the difference between 2020-11-30 and 2020-11-27 should give 1, as they are a Monday and a Friday respectively.
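For reference, numpy's busday_count already gives the weekend-excluding count I want for this example, which is what I'd like to wrap in the UDF:

import numpy as np

# Friday 2020-11-27 -> Monday 2020-11-30: only the Friday falls in the
# half-open interval and is a business day, so the count is 1
np.busday_count('2020-11-27', '2020-11-30')  # 1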
I have tried to create the below with help from calculate difference between two dates excluding weekends in python:
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import numpy as np

def workdays(z):
    date1 = df.select(F.col('date')).collect()[1][0]
    date2 = df.select(F.col('date')).collect()[0][0]
    date_diff = np.busday_count(date1, date2)
    return int(date_diff)

workdaysUDF = udf(workdays, IntegerType())

df.withColumn("date_dif", workdaysUDF(F.col("date"))).show(truncate=False)
But I get the below error:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Any help on how I can make this work on each row of my dataframe would be really helpful.
PS: My date1 and date2 variables need to be dynamic, depending on the date of the row the function is being applied to. Also, due to the size of the dataframe, I cannot use pandas, for which I did find multiple solutions.
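Roughly, the shape I imagine is a lag over a window followed by a two-column UDF wrapping np.busday_count, something like the sketch below (untested; the window spec and column names are just my guesses), but I haven't been able to get anything along these lines to work:

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import numpy as np

# previous date per name, ordered by date
w = Window.partitionBy("name").orderBy("date")

def workdays(start, end):
    # first row of each partition has no previous date
    if start is None or end is None:
        return None
    return int(np.busday_count(start, end))

workdays_udf = udf(workdays, IntegerType())

result = (df
          .withColumn("prev_date", F.lag("date").over(w))
          .withColumn("date_dif", workdays_udf(F.col("prev_date"), F.col("date"))))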
Thank you in advance.