
I'm trying to get the max value from a list of columns, and the name of the column with the max value, as described in these posts: PySpark: compute row maximum of the subset of columns and add to an exisiting dataframe and how to get the name of column with maximum value in pyspark dataframe. I've reviewed a number of posts and tried a number of options but have not been successful yet.

My attempts fail with TypeError: 'Column' object is not callable when using withColumn, even after trying to pass multiple columns to the UDF as in Pyspark: Pass multiple columns in UDF.

The relevant columns in the table loaded into the dataframe are Rule_Total_Score: double and Rule_No_Identifier_Score: double.

from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

rules = ['Rule_Total_Score', 'Rule_No_Identifier_Score']
df = spark.sql('select * from table')

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    return float(max(x for x in cols if x is not None))

# Failing attempt: f.struct wraps the rule columns into a single struct column
sdf = df.withColumn("max_rule", get_max_row_with_None(f.struct([df[col] for col in df.columns if col in rules])))

1 Answer


The UDF accepts a list of columns rather than a single struct column, so if you pass the columns in directly and remove f.struct, it should hopefully work:

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    # Return null when every rule column is null for this row,
    # otherwise the max of the non-null values
    if all(x is None for x in cols):
        return None
    else:
        return float(max(x for x in cols if x is not None))

# Unpack the list with * so the UDF receives each column as a separate argument
sdf = df.withColumn(
    "max_rule",
    get_max_row_with_None(*[df[col] for col in df.columns if col in rules])
)
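
This covers the max value, but the question also asks for the name of the winning column. As a sketch of one way to get both without a UDF (assuming the rule columns are numeric; max_rule_name is just an illustrative column name), the built-in greatest, when, and coalesce functions can do it natively:

from pyspark.sql import functions as f

# Row-wise max across the rule columns; greatest() skips nulls and
# returns null only when every input is null
sdf = df.withColumn("max_rule", f.greatest(*[f.col(c) for c in rules]))

# Name of the first rule column whose value matches the row max
# (null when all rule columns are null)
sdf = sdf.withColumn(
    "max_rule_name",
    f.coalesce(*[f.when(f.col(c) == f.col("max_rule"), f.lit(c)) for c in rules])
)

On ties this picks whichever column appears first in rules, so reorder the list if you need a different tie-break.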