
First I called the sha2 function from pyspark.sql.functions incorrectly, passing it a column of DoubleType, and got the following error:

cannot resolve 'sha2(`metric`, 256)' due to data type mismatch: argument 1 requires binary type, however, '`metric`' is of double type
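
Here is a minimal snippet that reproduces the error (a sketch; the metric column name comes from the error message above):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (2.5,)], ["metric"])

# Fails with the data type mismatch above: sha2 expects a string/binary column
df.select(F.sha2(F.col("metric"), 256)).show()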

Then I tried to first cast the columns to StringType, but I still get the same error. I'm probably missing something about how column transformations are processed by Spark.

I've noticed that when I just call df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) without calling .withColumn(col_name, F.sha2(df[col_name], 256)), the column's type is changed to StringType.

How should I apply a transformation correctly in this case?

import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType

def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        df = df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) \
               .withColumn(col_name, F.sha2(df[col_name], 256))
    return df

2 Answers


You don't need lit here.

Try:

.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
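
Applied to the function from the question, that approach would look roughly like this (a sketch, keeping the original function name):

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def parse_to_sha2(df: DataFrame, cols: list) -> DataFrame:
    # Cast and hash in one withColumn call; no lit, no intermediate column
    for col_name in cols:
        df = df.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
    return df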

I believe the issue here is the call to F.lit, which creates a literal.

import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import StringType

def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        # withColumn ignores .alias(), so write the casted value to an
        # explicit intermediate column and drop it after hashing
        df = df.withColumn(
                f"{col_name}_casted",
                F.col(col_name).cast(StringType())
             ).withColumn(
                col_name,
                F.sha2(F.col(f"{col_name}_casted"), 256)
             ).drop(f"{col_name}_casted")
    return df

This should generate a SHA value per column.
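
For example (hypothetical column names, just to show the call):

hashed = parse_to_sha2(df, ["metric", "other_metric"])
hashed.select("metric").show(truncate=False)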

In case you need all of them in one hash, you would need to pass all columns to sha, since it takes col* as arguments.

Edit: The last bit is not correct. Only F.hash takes multiple columns as arguments; md5, crc32, sha1, and sha2 each take a single column, so sorry for that confusion.
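
To illustrate the corrected point, here is a sketch of both options (assuming cols is a list of column names in df):

import pyspark.sql.functions as F

# F.hash accepts multiple columns directly (non-cryptographic integer hash)
df = df.withColumn("row_hash", F.hash(*cols))

# For a single SHA-256 over several columns, concatenate them first
df = df.withColumn(
    "row_sha2",
    F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in cols]), 256),
)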