Grouped dataframe data with Apache Arrow

Question

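from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType, TimestampType)

schema = StructType([
    StructField("title", StringType(), False),
    StructField("stringdataA", StringType(), False),
    # Uncommenting this field triggers the error below:
    # StructField("list", ArrayType(StructType([
    #     StructField("A", IntegerType(), False),
    #     StructField("B", StringType(), False),
    #     StructField("C", TimestampType(), False)
    # ]))),
    StructField("stringdataB", StringType(), False)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def make_data(x):
    ...  # build a pandas DataFrame that fits the schema

# df is an existing Spark DataFrame with a "groupkey" column
groupedList = df.groupby("groupkey").apply(make_data)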

The make_data function builds data that fits the schema I defined, but when I added a list(map()) structure field to the schema, it gave me the error below. Is that schema structure really not supported?

Is there any way to get list(map())-structured data that I can work with?

NotImplementedError: Invalid returnType with grouped map Pandas UDFs: StructType(List(StructField(title,StringType,false),StructField(stringdataA,StringType,false),StructField(list,ArrayType(StructType(List(StructField(A,IntegerType,false),StructField(B,StringType,false),StructField(C,TimestampType,false))),true),true),StructField(stringdataB,StringType,false))) is not supported

0x26res · Accepted Answer · 2019-02-25T12:19:10

I think the elements of your list are of StructType, which is not supported inside an ArrayType:

https://github.com/apache/spark/blob/4a4e7aeca79738d5788628d67d97d704f067e8d7/python/pyspark/sql/types.py#L1581

To confirm, call pyspark.sql.types.to_arrow_schema(schema) and see what happens.
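A minimal sketch of that check, so the conversion error can be inspected without running the whole UDF. The helper name arrow_conversion_error and its converter parameter are hypothetical (the parameter only exists so the helper can be exercised without Spark). The fallback import path is an assumption about newer Spark releases, where the helper is no longer in pyspark.sql.types.

```python
def arrow_conversion_error(schema, converter=None):
    """Return the exception raised while converting `schema` to Arrow, or None."""
    if converter is None:
        try:
            # Location named in the answer (Spark 2.3/2.4).
            from pyspark.sql.types import to_arrow_schema as converter
        except ImportError:
            # Assumption: newer Spark releases keep the helper here instead.
            from pyspark.sql.pandas.types import to_arrow_schema as converter
    try:
        converter(schema)
    except Exception as exc:  # the exact exception type depends on the Spark version
        return exc
    return None
```

Calling arrow_conversion_error(schema) with the "list" field uncommented should return the conversion error if the nested type is the culprit. A result of None would suggest the schema converts cleanly on your Spark version and the problem lies elsewhere.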
