0
votes

I have a pyspark dataframe, where I want to group by some index, and combine all the values in each column into one list per column.

Example input:

id_1| id_2| id_3|timestamp|thing1|thing2|thing3
A   | b   | c   |time_0   |1.2   |1.3    |2.5
A   | b   | c   |time_1   |1.1   |1.5    |3.4
A   | b   | c   |time_2   |2.2   |2.6    |2.9
A   | b   | d   |time_0   |5.1   |5.5    |5.7
A   | b   | d   |time_1   |6.1   |6.2    |6.3
A   | b   | e   |time_0   |0.1   |0.5    |0.9
A   | b   | e   |time_1   |0.2   |0.3    |0.6

Example output:

id_1|id_2|id_3|        timestamp     |thing1       |thing2       |thing3
A   |b  |  c |[time_0,time_1,time_2]|[1.2,1.1,2.2]|[1.3,1.5,2.6|[2.5,3.4,2.9]
A   |b  |  d |[time_0,time_1]       |[5.1,6.1]    |[5.5,6.2]   |[5.7,6.3]
A   |b  |  e |[time_0,time_1]       |[0.1,0.2]    |[0.5,0.3]   |[0.9,0.6]

How can I do this efficiently?

2
What have you tried? Take a look at collect_list function and ‘agg’ functionPushkr

2 Answers

1
votes

Use collect_list() as people have suggested above as well.

# Creating the DataFrame
df =sqlContext.createDataFrame([('A','b','c','time_0',1.2,1.3,2.5),('A','b','c','time_1',1.1,1.5,3.4),
                               ('A','b','c','time_2',2.2,2.6,2.9),('A','b','d','time_0',5.1,5.5,5.7),
                               ('A','b', 'd','time_1',6.1,6.2,6.3),('A','b','e','time_0',0.1,0.5,0.9),
                               ('A','b', 'e','time_1',0.2,0.3,0.6)],
                               ['id_1','id_2','id_3','timestamp','thing1','thing2','thing3'])
df.show()
+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
|   A|   b|   c|   time_0|   1.2|   1.3|   2.5|
|   A|   b|   c|   time_1|   1.1|   1.5|   3.4|
|   A|   b|   c|   time_2|   2.2|   2.6|   2.9|
|   A|   b|   d|   time_0|   5.1|   5.5|   5.7|
|   A|   b|   d|   time_1|   6.1|   6.2|   6.3|
|   A|   b|   e|   time_0|   0.1|   0.5|   0.9|
|   A|   b|   e|   time_1|   0.2|   0.3|   0.6|
+----+----+----+---------+------+------+------+

In addition to using agg(), you can write familiar SQL syntax to operate on it, but first we have to register our DataFrame as temporary SQL view -

df.createOrReplaceTempView("df_view")
df = spark.sql("""select id_1, id_2, id_3,
                  collect_list(timestamp) as timestamp,
                  collect_list(thing1) as thing1,
                  collect_list(thing2) as thing2,
                  collect_list(thing3) as thing3 
                  from df_view 
                  group by id_1, id_2, id_3""")
df.show(truncate=False)
+----+----+----+------------------------+---------------+---------------+---------------+
|id_1|id_2|id_3|timestamp               |thing1         |thing2         |thing3         |
+----+----+----+------------------------+---------------+---------------+---------------+
|A   |b   |d   |[time_0, time_1]        |[5.1, 6.1]     |[5.5, 6.2]     |[5.7, 6.3]     |
|A   |b   |e   |[time_0, time_1]        |[0.1, 0.2]     |[0.5, 0.3]     |[0.9, 0.6]     |
|A   |b   |c   |[time_0, time_1, time_2]|[1.2, 1.1, 2.2]|[1.3, 1.5, 2.6]|[2.5, 3.4, 2.9]|
+----+----+----+------------------------+---------------+---------------+---------------+

Note: The """ has been used to have multiline statements for the sake of visibility and neatness. With simple 'select id_1 ....' that wouldn't work if you try to spread your statement over multiple lines. Needless to say, the final result will be the same.

0
votes

Here is the example github TestExample1

 exampleDf = self.spark.createDataFrame(
            [('A', 'b', 'c', 'time_0', 1.2, 1.3, 2.5),
             ('A', 'b', 'c', 'time_1', 1.1, 1.5, 3.4),
             ],
            ("id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"))

        exampleDf.show()

        ans = exampleDf.groupBy(col("id_1"), col("id_2"), col("id_3")) \
            .agg(collect_list(col("timestamp")),
                 collect_list(col("thing1")),
                 collect_list(col("thing2")))

        ans.show()

+----+----+----+---------+------+------+------+
|id_1|id_2|id_3|timestamp|thing1|thing2|thing3|
+----+----+----+---------+------+------+------+
|   A|   b|   c|   time_0|   1.2|   1.3|   2.5|
|   A|   b|   c|   time_1|   1.1|   1.5|   3.4|
+----+----+----+---------+------+------+------+

+----+----+----+-----------------------+--------------------+--------------------+
|id_1|id_2|id_3|collect_list(timestamp)|collect_list(thing1)|collect_list(thing2)|
+----+----+----+-----------------------+--------------------+--------------------+
|   A|   b|   c|       [time_0, time_1]|          [1.2, 1.1]|          [1.3, 1.5]|
+----+----+----+-----------------------+--------------------+--------------------+