I'm trying to convert some pandas code to Spark so it scales. myfunc is a wrapper around a complex API that takes a string and returns a new string (meaning I can't use vectorized functions).
def myfunc(ds):
    # ds is one row: a Series indexed by column name
    for attribute, value in ds.items():
        value = api_function(attribute, value)
        ds[attribute] = value
    return ds
df = df.apply(myfunc, axis='columns')
myfunc takes a Series (one row), breaks it up into individual cells, calls the API for each cell, and builds a new Series with the same index (the column names). This effectively transforms every cell in the DataFrame.
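For concreteness, here's how the pandas version behaves on a toy DataFrame, with a placeholder standing in for the real API:

import pandas as pd

def api_function(attribute, value):
    # Placeholder for the real API call, for illustration only
    return f"{attribute}:{value}"

toy = pd.DataFrame({'a': ['x', 'y'], 'b': ['1', '2']})
toy = toy.apply(myfunc, axis='columns')
# every cell is now "<column name>:<original value>", e.g. toy['a'][0] == 'a:x'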
I'm new to Spark and want to translate this logic into PySpark. I've converted my pandas DataFrame to a Spark DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName('My app').getOrCreate()
spark_schema = StructType([StructField(c, StringType(), True) for c in df.columns])
spark_df = spark.createDataFrame(df, schema=spark_schema)
This is where I get lost. Do I need a UDF or a pandas_udf? How do I apply myfunc to every cell and get a new string back for each one? spark_df.foreach() doesn't return anything, and a DataFrame has no map() method (only the underlying RDD does).
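The closest thing I've found is dropping down to the RDD. Something like this seems plausible to me, but I don't know if it's correct or idiomatic (myfunc_row and the dict round-trip are my own guesses):

def myfunc_row(row):
    # Call the API per cell, then rebuild the row as a tuple in schema order
    d = row.asDict()
    return tuple(api_function(c, d[c]) for c in spark_schema.fieldNames())

new_spark_df = spark_df.rdd.map(myfunc_row).toDF(spark_schema)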
I can change myfunc from Series -> Series to string -> string if necessary.
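If I do switch to string -> string, my best guess is a plain UDF applied column by column, binding each column name in a closure (make_cell_udf is a name I made up):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def make_cell_udf(attribute):
    # Bind the column name so api_function still receives both arguments
    return udf(lambda value: api_function(attribute, value), StringType())

for c in spark_df.columns:
    spark_df = spark_df.withColumn(c, make_cell_udf(c)(col(c)))

Is that the right direction, or would a pandas_udf be preferable here?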