
I want to replace values in existing columns based on a condition: if the value of another column, col_ff, is "abc", the column should keep its value; otherwise it should become null (or blank). The query below gives the correct result per that logic, but only for the last column the loop encounters.

The query I'm using:

import pyspark.sql.functions as F

for i in df.columns:
    if i[4:] != 'ff':
        new_df = df.withColumn(i, F.when(df.col_ff == "abc", df[i])
                                   .otherwise(None))

df:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
|   a  | a  | d   | abc   |
|   a  | b  | c   | def   |
|   b  | c  | b   | abc   |
|   c  | d  | a   | def   |
+------+----+-----+-------+

required output:

+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
|   a  | a  | d   | abc   |
| null |null|null | def   |
|   b  | c  | b   | abc   |
| null |null|null | def   |
+------+----+-----+-------+
Please fix your indentation. – njzk2
You're overwriting new_df inside your loop each time. – pault

1 Answer


The problem in your code is that every iteration of the loop calls withColumn on the original DataFrame df, so new_df only ever holds the change from the last iteration. Fix it by setting new_df = df once, outside the loop, and then performing the withColumn operations on new_df inside the loop.
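To see why only the last column is affected, here is a minimal pure-Python analogy (no Spark needed; a dict stands in for the DataFrame, and nulling a key stands in for the withColumn call — the variable names are just illustrative):

```python
data = {"col1": 1, "col2": 2, "col3": 3}

# Buggy pattern: the result is rebuilt from the ORIGINAL data on every
# iteration, so only the last iteration's change survives.
for key in data:
    buggy = {**data, key: None}
print(buggy)   # {'col1': 1, 'col2': 2, 'col3': None}

# Fixed pattern: copy data once, then accumulate changes on the copy.
fixed = dict(data)
for key in data:
    fixed = {**fixed, key: None}
print(fixed)   # {'col1': None, 'col2': None, 'col3': None}
```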

For example, if df were the following:

df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#|   a|   a|   d|   abc|
#|   a|   b|   c|   def|
#|   b|   c|   b|   abc|
#|   c|   d|   a|   def|
#+----+----+----+------+

Change your code to:

import pyspark.sql.functions as F

new_df = df
for i in df.columns:
    if i[4:] != 'ff':
        new_df = new_df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i)))

Notice here that I removed the .otherwise(None) part because when will return null by default if the condition is not met.

You could also do the same using functools.reduce:

from functools import reduce  # for python3
new_df = reduce(
    lambda df, i: df.withColumn(i, F.when(F.col("col_ff")=="abc", F.col(i))),
    [i for i in df.columns if i[4:] != "ff"], 
    df
)
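If reduce is unfamiliar: reduce(f, items, init) threads an accumulator through f once per item, which is exactly the explicit loop above. A tiny pure-Python sketch (the list and lambda here are only illustrative):

```python
from functools import reduce

# Equivalent to:
#   acc = []
#   for x in ["col1", "col2", "col3"]:
#       acc = acc + [x.upper()]
# In the Spark version, the accumulator is the DataFrame and each step
# applies one withColumn.
acc = reduce(lambda acc, x: acc + [x.upper()], ["col1", "col2", "col3"], [])
print(acc)  # ['COL1', 'COL2', 'COL3']
```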

In both cases the result is the same:

new_df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#|   a|   a|   d|   abc|
#|null|null|null|   def|
#|   b|   c|   b|   abc|
#|null|null|null|   def|
#+----+----+----+------+