I have a pyspark dataframe
a = [
(0.31, .3, .4, .6, 0.4),
(.01, .2, .92, .4, .47),
(.3, .1, .05, .2, .82),
(.4, .4, .3, .6, .15),
]
b = ["column1", "column2", "column3", "column4", "column5"]
df = spark.createDataFrame(a, b)
Now I want to create a new column based on the condition below:
df.withColumn('new_column' ,(norm.ppf(F.col('column1')) - norm.ppf(F.col('column1') * F.col('column1'))) / (1 - F.col('column2')) ** 0.5)
but it's giving an error. Please help!
Update: I have corrected the column names.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-38-8dfe7d50be84> in <module>
----> 1 df.withColumn('new_column' ,(norm.ppf(F.col('PD')) - norm.ppf(F.col('PD') * F.col('PD'))) / (1 - F.col('rho_start')) ** 0.5)
~/anaconda3/envs/python3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py in ppf(self, q, *args, **kwds)
1995 args = tuple(map(asarray, args))
1996 cond0 = self._argcheck(*args) & (scale > 0) & (loc == loc)
-> 1997 cond1 = (0 < q) & (q < 1)
1998 cond2 = cond0 & (q == 0)
1999 cond3 = cond0 & (q == 1)
~/anaconda3/envs/python3/lib/python3.6/site-packages/pyspark/sql/column.py in __nonzero__(self)
633
634 def __nonzero__(self):
--> 635 raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
636 "'~' for 'not' when building DataFrame boolean expressions.")
637 __bool__ = __nonzero__
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Is `column1` actually the column `PD`, and `column2` the column `rho_start`? – Michael Szczesny