0
votes

I have a UDF as below -

  val myUdf = udf((col_abc: String, col_xyz: String) => {
    array(
      struct(
        lit("x").alias("col1"),
        col(col_abc).alias("col2"),
        col(col_xyz).alias("col3")
      )
    )
  }

Now, I want to use this use this in a function as below -

def myfunc(): Column = {
    val myvariable = myUdf($"col_abc", $"col_xyz")
    myvariable
}

And then use this function to create a new column in my DataFrame

val df = df..withColumn("new_col", myfunc())

In summary, I want my column "new_col" to be an type array with values as [[x, x, x]]

I am getting the below error. What am I doing wrong here?

Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported

1

1 Answers

0
votes

Two ways.

  1. Don't use a UDF because you're using pure Spark functions:
val myUdf = ((col_abc: String, col_xyz: String) => {
    array(
      struct(
        lit("x").alias("col1"),
        col(col_abc).alias("col2"),
        col(col_xyz).alias("col3")
      )
    )
  }
)

def myfunc(): Column = {
    val myvariable = myUdf("col_abc", "col_xyz")
    myvariable
}

df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz|        new_col|
+-------+-------+---------------+
|    abc|    xyz|[[x, abc, xyz]]|
+-------+-------+---------------+
  1. Use a UDF which takes in strings and returns a Seq of case class:
case class cols (col1: String, col2: String, col3: String)

val myUdf = udf((col_abc: String, col_xyz: String) => Seq(cols("x", col_abc, col_xyz)))

def myfunc(): Column = {
    val myvariable = myUdf($"col_abc", $"col_xyz")
    myvariable
}

df.withColumn("new_col", myfunc()).show
+-------+-------+---------------+
|col_abc|col_xyz|        new_col|
+-------+-------+---------------+
|    abc|    xyz|[[x, abc, xyz]]|
+-------+-------+---------------+

If you want to pass in Columns to the function, here's an example:

val myUdf = ((col_abc: Column, col_xyz: Column) => {
    array(
      struct(
        lit("x").alias("col1"),
        col_abc.alias("col2"),
        col_xyz.alias("col3")
      )
    )
  }
)