
I have a pyspark dataframe with a few columns

col1    col2    col3
---------------------
1.0     2.1     3.2
3.2     4.2     5.1

and I would like to apply three functions f1(x), f2(x), f3(x), each to the corresponding column of the dataframe, so that I get

col1        col2        col3
-------------------------------
f1(1.0)     f2(2.1)     f3(3.2)
f1(3.2)     f2(4.2)     f3(5.1)

I am trying to avoid defining a udf for each column, so my idea would be to build an RDD from each column, apply the function to it (maybe zipping with an index, which I could also define in the original dataframe), and then join back to the original dataframe.
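Roughly what I have in mind, as a sketch (with placeholder functions, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.1, 3.2), (3.2, 4.2, 5.1)], ["col1", "col2", "col3"])

# placeholder functions, just for illustration
def f1(x): return x + 1.0
def f2(x): return x * 2.0
def f3(x): return x - 1.0

# one RDD per column: apply the function, then key each value by its row index
rdd1 = df.select("col1").rdd.map(lambda r: f1(r[0])).zipWithIndex().map(lambda t: (t[1], t[0]))
rdd2 = df.select("col2").rdd.map(lambda r: f2(r[0])).zipWithIndex().map(lambda t: (t[1], t[0]))
rdd3 = df.select("col3").rdd.map(lambda r: f3(r[0])).zipWithIndex().map(lambda t: (t[1], t[0]))

# join the transformed columns back together on the row index
joined = rdd1.join(rdd2).join(rdd3).map(lambda t: (t[1][0][0], t[1][0][1], t[1][1]))
result = joined.toDF(["col1", "col2", "col3"])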

Is it a viable solution, or is there a way to do it better?

UPDATE: following @Andre' Perez's suggestion, I could define a UDF per column and use Spark SQL to apply it, or alternatively:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

f1_udf = F.udf(lambda x: float(np.sin(x)), FloatType())
f2_udf = F.udf(lambda x: float(np.cos(x)), FloatType())
f3_udf = F.udf(lambda x: float(np.tan(x)), FloatType())

df = df.withColumn("col1", f1_udf("col1"))
df = df.withColumn("col2", f2_udf("col2"))
df = df.withColumn("col3", f3_udf("col3"))
There is no need to use RDDs and joins. You can pass a whole row to a UDF and return it: Filter Pyspark Dataframe with udf on entire row – cronoik
Are these functions user defined or standard pyspark functions? – Raghu
These are user defined functions, not standard built-in functions – Galuoises
Thanks @cronoik for the suggestion, but I think I would need to pass the whole row to multiple UDFs – Galuoises
Do you not want to use a UDF at all, or just one UDF instead of 3? Currently it is not clear to me what you want to achieve. If you don't want to use a UDF, you should explain what your functions are currently doing (better yet, post them directly). If you just want to use one UDF instead of 3, you can follow the instructions from the link I shared. At the moment your question cannot be answered. – cronoik
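For reference, a minimal sketch of the single whole-row UDF approach suggested in the comments (using the same placeholder sin/cos/tan functions as in the update above, and assuming the columns are col1, col2, col3):

import numpy as np
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StructType, StructField, FloatType

# schema of the struct returned by the whole-row UDF
out_schema = StructType([
    StructField("col1", FloatType()),
    StructField("col2", FloatType()),
    StructField("col3", FloatType()),
])

def transform_row(row):
    # apply a different (placeholder) function to each field of the row
    return (float(np.sin(row.col1)), float(np.cos(row.col2)), float(np.tan(row.col3)))

row_udf = udf(transform_row, out_schema)

# pass the whole row as a struct to a single UDF, then expand the returned struct
df = df.withColumn("out", row_udf(struct(*df.columns))).select("out.*")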

1 Answer


Maybe it is better to register those functions as UDFs (even though you said you don't want to follow this approach).

spark.udf.register("func1", f1)
spark.udf.register("func2", f2)
spark.udf.register("func3", f3)

I would then register the DataFrame as a temporary view and run a Spark SQL query on it with the registered functions.

df.createOrReplaceTempView("dataframe")
df2 = spark.sql("select func1(col1), func2(col2), func3(col3) from dataframe")
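To keep the original column names, you can alias the results in the query:

df2 = spark.sql("select func1(col1) as col1, func2(col2) as col2, func3(col3) as col3 from dataframe")

Note that spark.udf.register defaults to a string return type for Python functions, so you may also want to pass an explicit return type, e.g. spark.udf.register("func1", f1, FloatType()).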