2
votes

I am trying to use spark.read to get a file's row count inside my UDF, but when I execute it the program hangs at that point.

I am calling a UDF in withColumn of a DataFrame. The UDF has to read a file and return its row count, but it is not working. I am passing a variable value to the UDF function. When I remove the spark.read code and simply return a number, it works, but spark.read does not work inside the UDF.

def prepareRowCountfromParquet(jobmaster_pa: String)(implicit spark: SparkSession): Int = {
  print("The variable value is " + jobmaster_pa)
  print("the count is " + spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt)
  spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt
}

val SRCROWCNT = udf(prepareRowCountfromParquet _)

df.withColumn("SRC_COUNT", SRCROWCNT(lit(keyPrefix)))

The SRC_COUNT column should contain the line count of the file.

1
You cannot create or use a DataFrame inside a UDF; additionally, the spark object exists only on the driver, and on executors it will be null. For example, take a look at this: stackoverflow.com/questions/48893002/… - Luis Miguel Mejía Suárez
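A minimal sketch of the driver-side workaround this implies: compute the count once on the driver, outside any UDF, and attach it as a literal column. The names df and keyPrefix are taken from the question; the CSV options mirror the original code.

    import org.apache.spark.sql.functions.lit

    // Runs on the driver, where the SparkSession is available.
    val srcCount: Long = spark.read.format("csv")
      .option("header", "true")
      .load(keyPrefix)   // keyPrefix: the file path from the question
      .count()

    // lit() turns the driver-side value into a constant column;
    // nothing here executes Spark actions on the executors.
    val result = df.withColumn("SRC_COUNT", lit(srcCount))

This only works when every row should get the same count; per-file counts need the groupBy/join approach described in the answer below.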

1 Answer

4
votes

UDFs cannot use the Spark context because it exists only on the driver and is not serializable.

Generally speaking, you need to read all the CSVs, compute the counts using a groupBy, and then left join the result back to the df.
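A sketch of that approach, assuming df has a column (here hypothetically called "file_path") identifying each row's source file, and that the glob path is adjusted to your data:

    import org.apache.spark.sql.functions.{input_file_name, count}

    // Read all the CSVs in one pass; input_file_name() tags each row
    // with the file it came from, so we can count per file.
    val counts = spark.read.format("csv")
      .option("header", "true")
      .load("path/to/csvs/*.csv")              // assumption: your CSV location
      .withColumn("file_path", input_file_name())
      .groupBy("file_path")
      .agg(count("*").as("SRC_COUNT"))

    // Left join the per-file counts back onto df on the file-path column.
    // "file_path" in df is a hypothetical column name for this sketch.
    val result = df.join(counts, Seq("file_path"), "left")

This keeps all reads on the driver-coordinated DataFrame API, so no SparkSession is ever needed inside a UDF.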