
First I called the sha2 function from pyspark.sql.functions incorrectly, passing it a column of DoubleType, and got the following error:

cannot resolve 'sha2(`metric`, 256)' due to data type mismatch: argument 1 requires binary type, however, '`metric`' is of double type
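
Here is a minimal snippet that reproduces the error (a sketch; the metric column name comes from the error message above):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.5,)], ["metric"])

# Fails with the data type mismatch above: sha2 expects a string/binary column
df.select(F.sha2(F.col("metric"), 256)).show()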

Then I tried to first cast the columns to StringType, but I still get the same error. I'm probably missing something about how column transformations are processed by Spark.

I've noticed that when I just call df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) without calling .withColumn(col_name, F.sha2(df[col_name], 256)), the column's type is changed to StringType.

How should I apply a transformation correctly in this case?

import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType

def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        df = df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) \
               .withColumn(col_name, F.sha2(df[col_name], 256))
    return df

2 Answers


You don't need lit here.

Try:

.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
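
Applied to the function from the question, that approach would look roughly like this (a sketch, keeping the original function name):

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def parse_to_sha2(df: DataFrame, cols: list) -> DataFrame:
    # Cast and hash in one withColumn call; no lit, no intermediate column
    for col_name in cols:
        df = df.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
    return df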

I believe the issue here is the call to F.lit, which creates a literal.

import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType

def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        # withColumn ignores .alias(), so write the casted value to an
        # explicit intermediate column and drop it after hashing
        df = df.withColumn(
                f"{col_name}_casted",
                F.col(col_name).cast(StringType())
             ).withColumn(
                col_name,
                F.sha2(F.col(f"{col_name}_casted"), 256)
             ).drop(f"{col_name}_casted")
    return df

This should generate a SHA value per column.
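
For example (hypothetical column names, just to show the call):

hashed = parse_to_sha2(df, ["metric", "other_metric"])
hashed.select("metric").show(truncate=False)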

In case you need all of them in one hash, you would need to pass all columns to sha, since it takes col* as arguments.

Edit: The last bit is not correct. Only F.hash takes multiple columns as arguments; md5, crc32, sha1, and sha2 each take a single column, so sorry for that confusion.
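
To illustrate the corrected point, here is a sketch of both options (assuming cols is a list of column names in df):

import pyspark.sql.functions as F

# F.hash accepts multiple columns directly (non-cryptographic integer hash)
df = df.withColumn("row_hash", F.hash(*cols))

# For a single SHA-256 over several columns, concatenate them first
df = df.withColumn(
    "row_sha2",
    F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256),
)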