I want to remove one element from an array of structs (displayed as an array of arrays) stored in an array column of a PySpark DataFrame.

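import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Outside the pyspark shell, `spark` has to be created explicitly
spark = SparkSession.builder.getOrCreate()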

data = [("1", "A", 2), ("1", None, 0), ("1", "B", 3), ("2", None, 0), ("2", "C", 4), ("2", "D", 1), ("2", None, 0)]
dfschema = StructType([StructField("id", StringType()), StructField("value", StringType()), StructField("amount", IntegerType())])
df = spark.createDataFrame(data, schema=dfschema)

grouped = (
    df.
    groupby("id").
    agg(
        F.collect_list(
            F.struct(
                F.col("value"), 
                F.col("amount")
            )
        ).alias("collected")
    )
)
grouped.show(truncate=False)

+---+------------------------------+
|id |collected                     |
+---+------------------------------+
|1  |[[A, 2], [, 0], [B, 3]]       |
|2  |[[, 0], [C, 4], [D, 1], [, 0]]|
+---+------------------------------+

Here's the result I want:

+---+-----------------------+
|id |collected              |
+---+-----------------------+
|1  |[[A, 2], [B, 3]]       |
|2  |[[C, 4], [D, 1]]       |
+---+-----------------------+

I tried using F.array_remove(..., [, 0]) but that gives an error. Not quite sure how I can define the element I want to remove. Thanks!

Edited example to encourage solutions that do not use positional logic. (CPak)

1 Answer

For Spark 2.4+, you can use array_except and pass it a one-element array containing the struct you want to remove:

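from pyspark.sql.functions import array, array_except, col, lit, struct

# The element to remove: a struct with a NULL string `value` and an `amount` of 0
to_remove = struct(lit(None).cast("string").alias("value"), lit(0).alias("amount"))

grouped.withColumn(
    "collected",
    array_except(col("collected"), array(to_remove)),
).show()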

Gives:

+---+----------------+
|id |collected       |
+---+----------------+
|1  |[[A, 2], [B, 3]]|
|2  |[[C, 4], [D, 1]]|
+---+----------------+