Use a UDF inside a for loop to create multiple columns in PySpark

Question

I have a Spark DataFrame with 32 columns (col1, col2, col3, col4, col5, ... up to col32). I have created a function (a UDF) that takes two input parameters and returns a float value.

Now I want to create new columns (numbered in increasing order, like col33, col34, col35, ...) using this function, with one parameter changing from column to column and the other held constant.

def fun(col1, col2):
   if True:  # placeholder condition
      ...  # do something
   else:
      ...  # do something

I have converted this function to a UDF:

udf_func = udf(fun, FloatType())

Now I want to use this function to create new columns in the DataFrame. How do I do that?

I tried

for i in range(1,5):
   # the new name should carry an increasing number: abc_1, abc_2, ...
   # the first argument should be col1, col2, ... up to col4; col6 is fixed
   BS.withColumn("abc_<i>", udf_func(<col i>, col6))

How to achieve this in PySpark?

Can you give an example of the DataFrame you're starting out with and the intended result? — kfkhalili
@kfkhalili I have added a sample DataFrame, along with a second DataFrame showing the new columns I want to create with the function. Inside the function, the first parameter will be one of the columns col1 to col5, and the second parameter will always be col5. — rakesh
I'm not sure I understand your use case, but perhaps the answer can help you. — kfkhalili

kfkhalili · Accepted Answer · 2020-09-02T13:15:41

You can only create one column at a time using withColumn, so we'll have to call it several times.

# We set up the problem
columns = ["col1", "col2", "col3"]
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

df.show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   1|   2|   3|
#|   4|   5|   6|
#|   7|   8|   9|
#+----+----+----+

Since your function is built around an if-else condition, you can express the logic in each iteration with when and otherwise. Since I don't know your use case, I use a trivial condition: if colX is even, we add col3 to it; if it is odd, we subtract col3 from it.

We create a new column each iteration based on the number at the end of the column name, plus the number of columns (in our case 3), to generate 4, 5, 6.

# You'll need a function to extract the number at the end of the column name
import re
def get_trailing_number(s):
  m = re.search(r'\d+$', s)
  return int(m.group()) if m else None

from pyspark.sql.functions import col, when
from pyspark.sql.types import FloatType
rich_df = df
for i in df.columns:
  rich_df = rich_df.withColumn(f'col{get_trailing_number(i) + 3}', \
   when(col(i) % 2 == 0, col(i) + col("col3"))\
   .otherwise(col(i) - col("col3")).cast(FloatType()))

rich_df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#|   1|   2|   3|-2.0| 5.0| 0.0|
#|   4|   5|   6|10.0|-1.0|12.0|
#|   7|   8|   9|-2.0|17.0| 0.0|
#+----+----+----+----+----+----+
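
As a sanity check on the loop above, here is a minimal plain-Python sketch (no Spark needed) that reproduces the new column names and values from the same sample data. The helper name even_odd_rule is made up for this illustration; it mirrors the when/otherwise expression and is not part of PySpark.

```python
import re

def get_trailing_number(s):
    # Extract the number at the end of a column name, e.g. "col2" -> 2
    m = re.search(r"\d+$", s)
    return int(m.group()) if m else None

def even_odd_rule(value, constant):
    # Hypothetical helper: same rule as the when/otherwise expression,
    # add the constant if the value is even, subtract it if odd
    return float(value + constant) if value % 2 == 0 else float(value - constant)

columns = ["col1", "col2", "col3"]
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# New names: trailing number plus the number of existing columns
new_names = [f"col{get_trailing_number(c) + len(columns)}" for c in columns]

# New values: each column combined with col3 (index 2) of the same row
new_rows = [tuple(even_odd_rule(v, row[2]) for v in row) for row in data]

print(new_names)  # ['col4', 'col5', 'col6']
for row in new_rows:
    print(row)
# (-2.0, 5.0, 0.0)
# (10.0, -1.0, 12.0)
# (-2.0, 17.0, 0.0)
```

The printed rows match the col4, col5 and col6 values in the table above.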

Here's a UDF version of the function

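from pyspark.sql.functions import udf

# The parameter is named value so it does not shadow the imported col function
def func(value, constant):
  if value % 2 == 0:
    return float(value + constant)
  else:
    return float(value - constant)

# func can be passed to udf directly; no lambda wrapper is needed
func_udf = udf(func, FloatType())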

rich_df = df
for i in df.columns:
  rich_df = rich_df.withColumn(f'col{get_trailing_number(i) + 3}', \
                               func_udf(col(i), col("col3")))

rich_df.show()
#+----+----+----+----+----+----+
#|col1|col2|col3|col4|col5|col6|
#+----+----+----+----+----+----+
#|   1|   2|   3|-2.0| 5.0| 0.0|
#|   4|   5|   6|10.0|-1.0|12.0|
#|   7|   8|   9|-2.0|17.0| 0.0|
#+----+----+----+----+----+----+
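
To map this onto the naming in the question (abc_1, abc_2, ... with one fixed column), one possible sketch is below. It assumes the new columns are numbered by position rather than by the trailing digit of the source name, and it adds them all in a single select instead of repeated withColumn calls. plan_new_columns and add_columns are hypothetical helper names, not PySpark functions; BS, udf_func and col6 are taken from the question.

```python
def plan_new_columns(source_cols, prefix="abc"):
    """Pair each source column with a numbered output name: abc_1, abc_2, ..."""
    return [(f"{prefix}_{i}", src) for i, src in enumerate(source_cols, start=1)]


def add_columns(df, func_udf, source_cols, fixed_col, prefix="abc"):
    """Apply func_udf(source, fixed_col) once per source column in one select."""
    # Imported here so that the planning helper above also works without Spark
    from pyspark.sql.functions import col

    new_cols = [
        func_udf(col(src), col(fixed_col)).alias(name)
        for name, src in plan_new_columns(source_cols, prefix)
    ]
    # Keep every existing column and append the new ones
    return df.select("*", *new_cols)


plan = plan_new_columns(["col1", "col2", "col3", "col4"])
print(plan)
# [('abc_1', 'col1'), ('abc_2', 'col2'), ('abc_3', 'col3'), ('abc_4', 'col4')]

# Assumed usage with the names from the question:
# BS = add_columns(BS, udf_func, ["col1", "col2", "col3", "col4"], "col6")
```

Building the column expressions first and passing them to one select keeps the query plan flatter than a long chain of withColumn calls, which may matter when there are many source columns.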

It's hard to say more without understanding what you're trying to do.
