
I am adding a column to a Spark DataFrame using some business logic that returns true/false in Scala. The implementation uses a UDF, and the UDF has more than 10 arguments, so we need to register the UDF first before using it. The following has been done:

// writing the UDF
val new_col: (String, String, ..., Timestamp) => Boolean = (col1: String, col2: String, ..., col12: Timestamp) => {
    if ( ... ) true
    else false
}

spark.udf.register("new_col", new_col)

Now when I try to write the following Spark/Scala job, it does not work:

val result = df.withColumn("new_col", new_col(col1, col2, ..., col12))

I get the following error:

<console>:56: error: overloaded method value udf with alternatives:
  (f: AnyRef,dataType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF10[_, _, _, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF9[_, _, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF8[_, _, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF7[_, _, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF6[_, _, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF5[_, _, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF4[_, _, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF3[_, _, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF2[_, _, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF1[_, _],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and>
  (f: org.apache.spark.sql.api.java.UDF0[_],returnType: org.apache.spark.sql.types.DataType)org.apache.spark.sql.expressions.UserDefinedFunction <and> ...

On the other hand, if I create a temporary view and use spark.sql, it works perfectly fine, like the following:

df.createOrReplaceTempView("data")
val result = spark.sql(
    s"""
    SELECT *, new_col(col1, col2, ..., col12) AS new_col FROM data
    """
    )

Am I missing something? What would be the way to make such a query work in Spark/Scala?

1 Answer


There are different ways of registering UDFs, depending on whether they are to be used with DataFrames or with Spark SQL.

To use a UDF in Spark SQL, it should be registered with:

spark.sqlContext.udf.register("function_name", function)

To use it with the DataFrame API, it should be wrapped with:

val my_udf = org.apache.spark.sql.functions.udf(function)

As you are using spark.udf.register, the function is only available in Spark SQL, which is why the spark.sql version works while the withColumn version fails.
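As a side note, the overloaded-method error happens because the Scala `udf` helper only provides overloads for functions of up to 10 arguments, so a 12-argument function matches none of them. One way to call a registered UDF with more arguments from the DataFrame API is `functions.callUDF`, which takes the registered name plus a varargs list of columns. A minimal sketch, assuming column names `col1` through `col12` (the `/* ... */` stands for the elided columns):

```scala
import org.apache.spark.sql.functions.{callUDF, col}

// Register the 12-argument function by name; register supports up to 22 arguments.
spark.udf.register("new_col", new_col)

// callUDF takes the registered name plus any number of columns,
// so it sidesteps the 10-argument limit of functions.udf.
val result = df.withColumn(
  "new_col",
  callUDF("new_col", col("col1"), col("col2"), /* ... */ col("col12"))
)
```

This keeps a single registration and lets both the SQL and DataFrame paths call the same function.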

EDIT: The following code should work. I have used only 2 columns, but register supports up to 22 columns (the udf helper, on the other hand, only goes up to 10):

import org.apache.spark.sql.functions.udf
import spark.implicits._  // needed for toDF() and the $"colName" syntax

// plain Scala function returning Boolean
val new_col: (String, String) => Boolean = (col1: String, col2: String) => {
  true
}

// wrap it for the DataFrame API...
val new_col_udf = udf(new_col)
// ...and register it by name for Spark SQL
spark.sqlContext.udf.register("new_col", new_col)

var df = Seq((1,2,3,4,5,6,7,8,9,10,11)).toDF()
df.createOrReplaceTempView("data")

// Spark SQL path: call the registered name
val result = spark.sql(
  s"""SELECT *, new_col(_1, _2) AS new_col FROM data"""
)
result.show()

// DataFrame path: use the udf-wrapped value
df = df.withColumn("test", new_col_udf($"_1", $"_2"))
df.show()
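As an aside, a registered UDF can also be reached from the DataFrame API without the udf-wrapped value, by embedding the SQL call in `expr`. A sketch under the same setup as above:

```scala
import org.apache.spark.sql.functions.expr

// expr parses a SQL fragment, so the name registered with
// spark.sqlContext.udf.register is visible here; this route also
// works for UDFs with more than 10 arguments.
val viaExpr = df.withColumn("test2", expr("new_col(_1, _2)"))
viaExpr.show()
```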