1
votes
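from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType, TimestampType)

schema = StructType([
    StructField("title", StringType(), False),
    StructField("stringdataA", StringType(), False),
    # Uncommenting this array-of-struct field triggers the error below.
#     StructField("list", ArrayType( StructType([
#         StructField("A", IntegerType()  , False),
#         StructField("B", StringType()   , False),
#         StructField("C", TimestampType(), False)
#     ]))),
    StructField("stringdataB",  StringType(), False)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def make_data(x):
    # ~~ build and return a pandas DataFrame that fits the schema
    ...

groupedList = df.groupby("groupkey").apply(make_data)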

The 'make_data' function builds data that fits the schema I defined. It works as shown, but when I add the list(map()) field to the schema (an ArrayType of StructType, commented out above), it gives me the error below. Is this schema structure really not supported?

Is there any way to return list(map()) structured data in a form that I can handle?

NotImplementedError: Invalid returnType with grouped map Pandas UDFs: StructType(List(StructField(title,StringType,false),StructField(stringdataA,StringType,false),StructField(list,ArrayType(StructType(List(StructField(A,IntegerType,false),StructField(B,StringType,false),StructField(C,TimestampType,false))),true),true),StructField(stringdataB,StringType,false))) is not supported

2 Answers

1
votes

I think the problem is that your list elements are StructType, which the Arrow conversion used by Pandas UDFs does not support as an array element type:

https://github.com/apache/spark/blob/4a4e7aeca79738d5788628d67d97d704f067e8d7/python/pyspark/sql/types.py#L1581

If you want to confirm, try calling pyspark.sql.types.to_arrow_schema(schema) and see whether it raises an error.

0
votes

Since StructType is not supported here, one workaround is to serialize the nested data with json.dumps(data) before returning it from the UDF. The schema then declares that field as StringType() instead of an ArrayType.

Later, you can use json.loads() to convert the string back into a list.
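A minimal sketch of that round trip, using only the standard library. The helper names encode_items and decode_items are made up for illustration, and the sample rows are invented; the keys A, B and C mirror the fields in the question. Note that json.dumps cannot serialize timestamps on its own, so this sketch assumes they can be stored as ISO 8601 strings.

```python
import json
from datetime import datetime


def encode_items(items):
    """Serialize a list of dicts to one JSON string (hypothetical helper).

    datetime values are not JSON serializable by default, so they are
    written out as ISO 8601 text.
    """
    def _default(obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        raise TypeError(f"Cannot serialize {type(obj).__name__}")

    return json.dumps(items, default=_default)


def decode_items(text):
    """Parse the JSON string back into a list of dicts (hypothetical helper).

    Assumes every item has a "C" key holding an ISO 8601 timestamp.
    """
    items = json.loads(text)
    for item in items:
        item["C"] = datetime.fromisoformat(item["C"])
    return items


# Invented sample data shaped like the "list" field in the question.
items = [
    {"A": 1, "B": "first", "C": datetime(2019, 1, 1, 12, 0, 0)},
    {"A": 2, "B": "second", "C": datetime(2019, 1, 2, 8, 30, 0)},
]

encoded = encode_items(items)    # this string is what the UDF would return
decoded = decode_items(encoded)  # done later, outside the UDF
```

Inside make_data, the string returned by encode_items would go into the column declared as StringType(). If you would rather stay in Spark afterwards, pyspark.sql.functions.from_json with an ArrayType(StructType(...)) schema should be able to turn that string column back into an array of structs.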