Is it possible to pass a Spark DataFrame as an argument to a pandas UDF and get a pandas DataFrame in return? Below is the example code I am working with; I am getting an error when calling the function:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
spark = SparkSession \
    .builder \
    .appName("PrimeBatch") \
    .master("local[*]") \
    .getOrCreate()
srcFile = <Some CSV file>
df = spark.read.option("header", True) \
    .csv(srcFile)
# Declare the function and create the UDF
@pandas_udf("Count int")
def count_udf(v: pd.DataFrame) -> pd.DataFrame:
return v.count()
p_df = count_udf(df)
p_df
This is the error I get when running the code:
TypeError: Invalid argument, not a string or column: DataFrame[] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
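For comparison, here is a minimal column-based sketch of the same idea, based on my reading of the pandas UDF docs, where the UDF receives a column batch as a pandas Series rather than a whole DataFrame (the column name "value" is just a placeholder, not from my actual data):

import pandas as pd
from pyspark.sql.functions import pandas_udf, col

# Series-to-scalar pandas UDF: each call receives one column as a pandas Series
@pandas_udf("int")
def count_col_udf(v: pd.Series) -> int:
    return v.count()  # number of non-null values in the Series

# Apply the UDF to a single column, then collect the result locally
# as a pandas DataFrame
p_df = df.select(count_col_udf(col("value"))).toPandas()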
Thanks in advance!!!