2
votes

I am trying to use spark.read to get a file's row count inside my UDF, but when I execute it the program hangs at that point.

I am calling a UDF in withColumn of a DataFrame. The UDF has to read a file and return its row count, but it is not working. I am passing a variable value to the UDF function. When I remove the spark.read code and simply return a number, it works, but spark.read does not work inside the UDF.

def prepareRowCountfromParquet(jobmaster_pa: String)(implicit spark: SparkSession): Int = {
  print("The variable value is " + jobmaster_pa)
  print("the count is " + spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt)
  spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt
}

val SRCROWCNT = udf(prepareRowCountfromParquet _)

df.withColumn("SRC_COUNT", SRCROWCNT(lit(keyPrefix)))

The SRC_COUNT column should contain the line count of the file.

1
You cannot create or use a DataFrame inside a UDF; additionally, the spark object exists only on the driver, and on executors it will be null. For example, take a look at this: stackoverflow.com/questions/48893002/… - Luis Miguel Mejía Suárez
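A minimal sketch of the driver-side workaround this implies: compute the count once on the driver, outside any UDF, and attach it as a literal column. The names df and keyPrefix are taken from the question; the CSV options mirror the original code.

    import org.apache.spark.sql.functions.lit

    // Runs on the driver, where the SparkSession is available.
    val srcCount: Long = spark.read.format("csv")
      .option("header", "true")
      .load(keyPrefix)   // keyPrefix: the file path from the question
      .count()

    // lit() turns the driver-side value into a constant column;
    // nothing here executes Spark actions on the executors.
    val result = df.withColumn("SRC_COUNT", lit(srcCount))

This only works when every row should get the same count; per-file counts need the groupBy/join approach described in the answer below.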

1 Answer

4
votes

UDFs cannot use the Spark context because it exists only on the driver and is not serializable.

Generally speaking, you need to read all the CSVs, compute the counts using a groupBy, and then left join the result back to the df.
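A sketch of that approach, assuming df has a column (here hypothetically called "file_path") identifying each row's source file, and that the glob path is adjusted to your data:

    import org.apache.spark.sql.functions.{input_file_name, count}

    // Read all the CSVs in one pass; input_file_name() tags each row
    // with the file it came from, so we can count per file.
    val counts = spark.read.format("csv")
      .option("header", "true")
      .load("path/to/csvs/*.csv")              // assumption: your CSV location
      .withColumn("file_path", input_file_name())
      .groupBy("file_path")
      .agg(count("*").as("SRC_COUNT"))

    // Left join the per-file counts back onto df on the file-path column.
    // "file_path" in df is a hypothetical column name for this sketch.
    val result = df.join(counts, Seq("file_path"), "left")

This keeps all reads on the driver-coordinated DataFrame API, so no SparkSession is ever needed inside a UDF.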