0
votes

Is it possible to pass a Spark DataFrame as an argument to a pandas UDF and get a pandas DataFrame in return? Below is an example of the code I am working with; I am getting an error when calling the function:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf



spark = SparkSession \
    .builder \
    .appName("PrimeBatch") \
    .master("local[*]") \
    .getOrCreate()

srcFile = <Some CSV file>
df = spark.read.option("header",True)\
    .csv(srcFile)

# Declare the function and create the UDF
@pandas_udf("Count int")
def count_udf(v: pd.DataFrame) -> pd.DataFrame:
    return v.count()

p_df = count_udf(df)
p_df

The error I get when running the code is as follows:

TypeError: Invalid argument, not a string or column: DataFrame[] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

Thanks in advance!!!


1 Answer

0
votes

In general, a pandas UDF takes pandas.Series: Spark passes the values of a column into the function as batches of Series, so the UDF has to be called with columns, not with a whole Spark DataFrame. Calling count_udf(df) hands the DataFrame itself to a function that expects column expressions, which is what the TypeError is complaining about. The count_udf you have defined, taking a pandas DataFrame and returning a pandas DataFrame, also does not match any of the supported pandas UDF type hints.
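For illustration, here is a minimal sketch of a Series-to-scalar pandas UDF that counts the non-null values in a single column; the column name "some_col" is a placeholder, not a column from your CSV:

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Series-to-scalar pandas UDF: Spark feeds the column values in as a
# pandas.Series and expects a single scalar back.
@pandas_udf("long")
def count_non_null(v: pd.Series) -> int:
    return v.count()

# Apply the UDF to a column, not to the DataFrame itself.
df.select(count_non_null(df["some_col"]).alias("Count")).show()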

If you want to convert a Spark DataFrame into a pandas DataFrame, you can do the following:

pandas_df = df.toPandas()
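Note that toPandas() collects the entire dataset onto the driver, so it is only practical when the data fits in driver memory.

If you specifically want to run a function that consumes and produces pandas DataFrames while staying in Spark, mapInPandas (available in Spark 3.0+) is the closest match to what the question describes: the function receives an iterator of pandas DataFrame batches and yields pandas DataFrames matching a declared schema. A minimal sketch, assuming df is the Spark DataFrame from the question:

from typing import Iterator
import pandas as pd

# The function receives an iterator of pandas DataFrames (one per
# Arrow batch) and must yield pandas DataFrames matching the schema.
def count_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for batch in batches:
        yield pd.DataFrame({"Count": [len(batch)]})

counts = df.mapInPandas(count_batches, schema="Count long")
counts.show()

Each batch contributes one row here, so the per-batch counts would still need to be summed on the Spark side (e.g. with counts.groupBy().sum()).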

You can refer to the following links to better understand how to apply a pandas UDF:

  1. Introducing vectorized UDFs for PySpark
  2. Spark Pandas UDF