I have a PySpark DataFrame with two columns (`A` and `B`, whose type is `double`) whose values are either `0.0` or `1.0`.
I am trying to add a new column, which is the sum of those two.
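For concreteness, a toy DataFrame like the following should reproduce the setup (the sample rows and the `spark` session variable are my assumptions, not from the original code):

```python
# Hypothetical minimal DataFrame matching the description: two double
# columns, A and B, holding only 0.0 or 1.0 (assumes a SparkSession `spark`).
df = spark.createDataFrame(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)],
    ['A', 'B'],
)
df.printSchema()  # both A and B are inferred as double
```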
I followed the examples in *Pyspark: Pass multiple columns in UDF*:
```python
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType

sum_cols = F.udf(lambda x: x[0] + x[1], IntegerType())
df_with_sum = df.withColumn('SUM_COL', sum_cols(F.array('A', 'B')))
df_with_sum.select(['SUM_COL']).toPandas()
```
This shows a Series of `NULL`s instead of the results I expect.
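For reference, one way to narrow this down is to inspect the result in Spark itself rather than through pandas; the `NULL`s show up there as well:

```python
# Check what type Spark assigned to the new column, and look at raw rows;
# if show() also prints null, the problem is in the UDF, not in toPandas().
df_with_sum.printSchema()  # SUM_COL is int, as declared by the UDF
df_with_sum.show(5)
```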
I tried each of the following to see if there was an issue with the data types:
```python
sum_cols = F.udf(lambda x: x[0], IntegerType())
sum_cols = F.udf(lambda x: int(x[0]), IntegerType())
```
I am still getting `NULL`s.
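As a sanity check outside Spark, the same logic on plain Python floats behaves as expected, which suggests the problem lies in how Spark handles the declared return type rather than in the Python logic; a sketch (`f` is a hypothetical stand-in for the lambda above):

```python
# The lambda's logic on plain Python floats: fine, but note that it
# returns a float, not an int, which matters once the UDF declares IntegerType.
f = lambda x: x[0] + x[1]
print(f([0.0, 1.0]))       # 1.0  (a float)
print(int(f([0.0, 1.0])))  # 1    (an int)
```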
I tried removing the array:
```python
sum_cols = F.udf(lambda x: x, IntegerType())
df_with_sum = df.withColumn('SUM_COL', sum_cols(df.A))
```
This works fine and shows `0`/`1`.
I tried removing the UDF, but leaving the array:
```python
df_with_sum = df.withColumn('SUM_COL', F.array('A', 'B'))
```
This works fine and shows a series of arrays of the form `[0.0/1.0, 0.0/1.0]`, i.e. each element is either `0.0` or `1.0`.
So `array` works fine and the UDF works fine; it is only when I try to pass an array to the UDF that things break down. What am I doing wrong?
Regarding `0.0/1.0`: it should be `0.0` if the datatype is `double`; if the value is `0.0/1.0`, then the datatype should be `StringType`, isn't it so? – Ramesh Maharjan
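For what it's worth, one common cause of `NULL`s from a Python UDF is a mismatch between the value the function returns and the UDF's declared return type: Spark silently turns a non-convertible value into null. Assuming that is what is happening here, a sketch of two ways around it:

```python
# Sketch, assuming the NULLs come from returning a Python float
# out of a UDF declared as IntegerType.
from pyspark.sql.types import DoubleType

# Option 1: declare the type the lambda actually returns.
sum_cols = F.udf(lambda x: x[0] + x[1], DoubleType())
df_with_sum = df.withColumn('SUM_COL', sum_cols(F.array('A', 'B')))

# Option 2: skip the UDF entirely; a native column expression
# handles the numeric types itself and avoids Python serialization overhead.
df_with_sum = df.withColumn('SUM_COL', F.col('A') + F.col('B'))
```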